<a href="https://colab.research.google.com/github/AVI18794/Machine-Learning/blob/main/Anomaly_Detection_Using_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### The presence of outliers in a dataset can result in a poor fit and lower predictive power of the model.
### Identifying and removing outliers with simple statistical methods is tedious and challenging for most machine learning datasets, given the large number of input variables.


### Outliers: Outliers are observations in the dataset that do not fit in some way. The most familiar type of outlier is an observation that lies far from the rest of the observations or from the center of mass of the observations.

### It can be important to identify and remove outliers from data when training machine learning algorithms for predictive modelling.
### Outliers can skew statistical measures and data distributions, providing a misleading representation of the underlying data and relationships.
### Removing outliers from training data before modeling can result in a better fit of the data and in turn more skillful predictions.
### There are various types of automatic model-based methods for identifying outliers in the input data.
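To make the skewing effect concrete, here is a minimal sketch using only the Python standard library. The numbers are made up for illustration (they are not taken from the housing data): a single extreme value moves the mean a long way while the median barely changes.

```python
import statistics

# Hypothetical measurements; the value 100 is an injected outlier.
clean = [10, 11, 12, 13, 14]
with_outlier = clean + [100]

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

# The mean is pulled towards the outlier; the median is far more stable.
print("mean:  ", mean_clean, "->", round(mean_outlier, 2))
print("median:", median_clean, "->", median_outlier)
```

A least-squares fit such as linear regression is sensitive to extreme values in a similar way to the mean, which is one reason for filtering outliers before training.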


In [8]:
#Load and summarize the dataset
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
#load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = pd.read_csv(url, header=None)

In [9]:
data = df.values

In [10]:
#Split the data into inputs and outputs
X,y = data[:,:-1],data[:,-1]


In [11]:
#Print the shape of the data
X.shape,y.shape

((506, 13), (506,))

In [21]:
#Split the dataset into train and test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=1)
#Check the shape of the train and test sets
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)


(379, 13) (127, 13) (379,) (127,)


In [22]:
#Evaluate the models on the raw dataset
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
model = LinearRegression()
model.fit(X_train,y_train)
#Evaluate the performance of the model
y_pred = model.predict(X_test)

In [23]:
#Evaluate the predictions
mae = mean_absolute_error(y_test,y_pred)
print("MAE : %.3f" % mae)


MAE : 3.575
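For reference, MAE is the average of the absolute differences between the true targets and the predictions. The sketch below computes it by hand on a few made-up values; the function name `mean_absolute_error_manual` and the numbers are illustrative assumptions, not part of the notebook.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions.
y_true_demo = [24.0, 21.5, 30.0, 18.0]
y_pred_demo = [22.0, 23.0, 27.5, 18.0]

# Absolute errors are 2.0, 1.5, 2.5 and 0.0, so the average is 1.5.
mae_demo = mean_absolute_error_manual(y_true_demo, y_pred_demo)
print("MAE : %.3f" % mae_demo)
```

Because MAE is expressed in the units of the target variable, a lower value means the predictions are closer to the true values on average.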


In [None]:
#Now try removing the outliers from the datasets


The scikit-learn library provides a number of built-in automatic methods for identifying outliers in a dataset.
In this notebook we will try a few of them and compare their effect on model performance.


### Isolation Forest: Isolation Forest, or iForest, is a tree-based anomaly detection algorithm. It models the data in such a way as to isolate anomalies, which are both few in number and different from the rest of the data in the feature space.
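The isolation idea can be sketched in one dimension with the standard library alone. This is a simplified illustration, not scikit-learn's implementation (which also picks random features and subsamples the data); the helper names and the data are hypothetical. Points that sit far from the rest tend to be separated after fewer random splits.

```python
import random


def isolation_depth(values, target, rng):
    """Count random splits until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # remaining points are identical and cannot be separated
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def average_depth(values, target, n_trees=200, seed=1):
    """Average isolation depth of `target` over many random 'trees'."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(n_trees)) / n_trees


# Hypothetical 1-D data: a tight cluster plus one distant value.
values = [9.8, 9.9, 10.0, 10.1, 10.2, 10.3, 10.4, 25.0]
depth_inlier = average_depth(values, 10.1)
depth_outlier = average_depth(values, 25.0)
print("typical point:", depth_inlier)
print("outlier:      ", depth_outlier)
```

A short average path is what the algorithm treats as evidence of an anomaly.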

### Perhaps the most important hyperparameter in the model is the “contamination” argument, which sets the expected proportion of outliers in the dataset. It accepts a value in the range (0, 0.5]. Older scikit-learn releases defaulted to 0.1; since version 0.22 the default is 'auto', so this notebook sets the value explicitly.
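One way to think about contamination is as the fraction of lowest-scoring rows that end up labelled as outliers. The following is a simplified sketch under that assumption; the function `labels_from_scores` and the scores are hypothetical, and the exact thresholding inside scikit-learn may differ in detail.

```python
def labels_from_scores(scores, contamination):
    """Label the lowest-scoring fraction of rows -1 (outlier) and the rest 1 (inlier)."""
    if not 0.0 < contamination <= 0.5:
        raise ValueError("contamination must be in the range (0, 0.5]")
    n_flag = round(len(scores) * contamination)
    # Indices ordered from most anomalous (lowest score) to least.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    flagged = set(order[:n_flag])
    return [-1 if i in flagged else 1 for i in range(len(scores))]


# Hypothetical anomaly scores for ten rows; lower means more anomalous.
demo_scores = [-0.40, -0.45, -0.42, -0.75, -0.41, -0.43, -0.44, -0.46, -0.39, -0.47]
demo_labels = labels_from_scores(demo_scores, contamination=0.1)
print(demo_labels)

# Keep only the inliers, mirroring the mask step used in the notebook.
kept = [s for s, label in zip(demo_scores, demo_labels) if label != -1]
print(len(kept), "of", len(demo_scores), "rows kept")
```

Raising the contamination value removes more rows, so it trades a cleaner training set against a smaller one.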

In [24]:
#Identifying the outliers in the training dataset
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.1)
y_pred = iso.fit_predict(X_train)
#fit_predict labels each training row (-1 = outlier, 1 = inlier); the outliers are then removed from the training set
print("Predicted labels (-1 = outlier) : ",y_pred)
#Selecting the rows that are not outliers
mask = y_pred!=-1
X_train,y_train = X_train[mask,:],y_train[mask]

Predicted labels (-1 = outlier) :  [ 1  1  1  1  1  1  1  1  1 -1  1  1 -1 -1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1 -1  1  1  1 -1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1 -1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1
 -1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1 -1  1  1  1
  1  1  1  1 -1  1  1  1  1  1 -1  1  1  1  1 -1 -1  1  1  1  1  1  1  1
  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 -1  1  1  1  1  1 -1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1 -1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1 -1  1  1  1  1  1  1  1 -1 -1  1  1  1 -1  1  1  1  1  1  1 -1
  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1
  1  1  1 -1 -1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1 -1 -1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1 

In [25]:
#Now build the model after removing the outliers
model  = LinearRegression()
model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [27]:
#Evaluate the model
y_pred1 = model.predict(X_test)
#Evaluate the predictions
mae = mean_absolute_error(y_test,y_pred1)
print("MAE : %.3f"%mae)

MAE : 3.423


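With the outliers left in the training data, the linear regression model achieved an MAE of 3.575 on the test set. After removing the outliers flagged by Isolation Forest, the MAE dropped to 3.423. Because IsolationForest was created without a fixed random_state, the exact figure can vary between runs.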

## Minimum Covariance Determinant
### If the input variables have a Gaussian distribution, then simple statistical methods can be used to detect outliers.

### For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multi-dimensional Gaussian and knowledge of this distribution can be used to identify values far from the distribution.

### This approach can be generalized by defining a hypersphere (ellipsoid) that covers the normal data, and data that falls outside this shape is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant, or MCD for short.

* scikit-learn provides access to this method via the EllipticEnvelope class.
* It takes a **contamination** argument that defines the expected ratio of outliers to be observed in practice.
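The ellipsoid idea rests on the Mahalanobis distance, which measures how far a point lies from the centre of the data while accounting for the correlation between features. The sketch below works in two dimensions with the standard library. Note the assumption: it uses the plain sample mean and covariance, whereas MCD estimates them robustly from a subset of the data. The helper names and data are hypothetical.

```python
def mean_and_covariance(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_squared(point, mean, cov):
    """Squared Mahalanobis distance of `point` from `mean` under `cov`."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


# Hypothetical, strongly correlated 2-D data.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1), (6.0, 5.9)]
mean, cov = mean_and_covariance(points)

on_trend = (4.0, 4.0)   # follows the correlation between the two features
off_trend = (2.0, 5.0)  # each coordinate is in range, but the combination is unusual
d_on = mahalanobis_squared(on_trend, mean, cov)
d_off = mahalanobis_squared(off_trend, mean, cov)
print("on-trend point: ", round(d_on, 3))
print("off-trend point:", round(d_off, 3))
```

The off-trend point would pass a per-feature range check, yet its distance is far larger, which is the kind of case an ellipsoid-based detector is designed to catch.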


In [28]:
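from sklearn.covariance import EllipticEnvelope
# re-split the data so that the rows removed earlier are restored
# note: test_size is 0.33 here (0.25 above), so this MAE is not directly comparable with the earlier results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)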
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
y_pred2 = ee.fit_predict(X_train)
# select all rows that are not outliers
mask = y_pred2 != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
#Fit the model
model3 = LinearRegression()
model3.fit(X_train,y_train)
y_pred2 = model3.predict(X_test)

#Evaluate the predictions
mae = mean_absolute_error(y_test,y_pred2)
print('MAE %.3f'%mae)

(339, 13) (339,)
(335, 13) (335,)
MAE 3.388


## Local Outlier Factor (LOF)
### A simple approach to identifying outliers is to locate those data points that are far from the other data points in the feature space.
### This works well for feature spaces with low dimensionality (few features), but it can become less reliable as the number of features increases, a problem referred to as the curse of dimensionality.

### The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a score of how isolated it is, or how likely it is to be an outlier, based on the size of its local neighborhood. Examples with the largest scores are more likely to be outliers.

* <b>The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class.</b>

* <b>The model provides the “contamination” argument, which sets the expected proportion of outliers in the dataset. Older scikit-learn releases defaulted to 0.1; since version 0.22 the default is 'auto'.</b>
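To show how the score is built from neighborhoods, here is a compact sketch of the LOF computation on one-dimensional data using only the standard library. It is an illustration under simplifying assumptions (distinct points, ties between neighbors ignored), not scikit-learn's implementation; the function names and data are hypothetical.

```python
def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: abs(points[j] - points[i]))
    return others[:k]


def local_outlier_factors(points, k=2):
    """LOF score per point; values well above 1 suggest an outlier."""
    n = len(points)
    neighbours = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [abs(points[nb[-1]] - points[i]) for i, nb in enumerate(neighbours)]

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbour j.
        return max(k_dist[j], abs(points[i] - points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbours[i]) for i in range(n)]
    # LOF: average ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical 1-D data: five evenly spaced values and one distant value.
points = [10, 11, 12, 13, 14, 50]
lof_scores = local_outlier_factors(points, k=2)
for value, score in zip(points, lof_scores):
    print(value, round(score, 2))
```

Points inside the cluster score close to 1 because their density matches that of their neighbors, while the distant point scores far higher.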

In [29]:
from sklearn.neighbors import LocalOutlierFactor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

(339, 13) (339,)
(305, 13) (305,)
MAE: 3.356
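The cells above use different train/test splits (0.25 for the baseline and Isolation Forest, 0.33 for the other two), so their MAE values are not strictly comparable. One possible way to compare the detectors on a single split is sketched below. The helper `mae_after_filtering` is a hypothetical name, and the data is a synthetic stand-in generated with `make_regression`, so the printed numbers say nothing about the housing dataset.

```python
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor


def mae_after_filtering(detector, X_tr, y_tr, X_te, y_te):
    """Drop training rows the detector labels -1, fit LinearRegression,
    and return (test MAE, number of training rows kept)."""
    if detector is not None:
        keep = detector.fit_predict(X_tr) != -1
        X_tr, y_tr = X_tr[keep], y_tr[keep]
    reg = LinearRegression().fit(X_tr, y_tr)
    return mean_absolute_error(y_te, reg.predict(X_te)), X_tr.shape[0]


# Synthetic stand-in data (an assumption); swap in the notebook's X, y to use the housing data.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1)

detectors = {
    "baseline (no filtering)": None,
    "IsolationForest": IsolationForest(contamination=0.1, random_state=1),
    "EllipticEnvelope": EllipticEnvelope(contamination=0.01, random_state=1),
    "LocalOutlierFactor": LocalOutlierFactor(contamination=0.1),
}
results = {
    name: mae_after_filtering(det, X_tr, y_tr, X_te, y_te)
    for name, det in detectors.items()
}
for name, (mae_value, rows_kept) in results.items():
    print("%-24s MAE: %.3f  rows kept: %d" % (name, mae_value, rows_kept))
```

Each detector is fit on the training rows only and the test set is left untouched, so the comparison reflects how filtering affects what the model learns rather than what it is scored on.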
