## Goals:

* An outlier is an unlikely observation in a dataset and may have one of many causes.
* How to use simple univariate statistics like standard deviation and interquartile range to
identify and remove outliers from a data sample.
* How to use an outlier detection model to identify and remove rows from a training dataset
in order to lift predictive modeling performance.

### TOC:
1. Standard Deviation Method
2. Interquartile Range Method
3. Automatic Outlier Detection

> NOTE: Great care should be taken not to hastily remove or change values, especially if the
sample size is small.

## 1. Standard Deviation Method

If we know that the distribution of values in the sample is Gaussian or Gaussian-like, we can
use the standard deviation of the sample as a cut-off for identifying outliers.

Three standard deviations from the mean is a common cut-off in practice for identifying outliers in a Gaussian or Gaussian-like
distribution. 

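For reference, three standard deviations cover about 99.7 percent of a Gaussian distribution.
For smaller samples of data, perhaps a value of 2 standard deviations (about 95 percent)
can be used, and for larger samples, perhaps a value of 4 standard deviations (about 99.99 percent) can
be used.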
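
The percentages quoted for each cut-off can be checked directly. The sketch below assumes an exactly Gaussian population and uses the standard error function from Python's `math` module; the helper name `gaussian_coverage` is made up for this illustration.

```python
from math import erf, sqrt


def gaussian_coverage(n_std):
    """Fraction of a Gaussian population lying within n_std standard deviations of the mean."""
    return erf(n_std / sqrt(2))


# print the share of values that fall inside each candidate cut-off
for k in (2, 3, 4):
    print("within %d std.dev: %.4f%%" % (k, 100 * gaussian_coverage(k)))
```

Whatever falls outside the chosen cut-off is what the method will flag, so a tighter cut-off flags more ordinary values as outliers.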

In [1]:
# generate test data for identifying outliers
# libraries
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
# seed the random number generator so the output is reproducible
seed(1)
# generate univariate observations:
# randn draws from a Gaussian with mean 0 and std.dev 1, so
# multiplying by 5 sets the std.dev to 5 and adding 50 moves the mean to 50
data = 5 * randn(1000) + 50
# summarize
print("Mean : %0.2f, std.dev : %0.2f" %(mean(data), std(data)))

Mean : 50.19, std.dev : 4.91


In [11]:
# Standard deviation method
# summary statistics
data_mean, data_std = mean(data), std(data)
# define the outlier cut-off: 3 standard deviations from the mean
cut_off = data_std * 3
lower, upper = (data_mean - cut_off), (data_mean + cut_off)
# identify and store the outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers using std.dev method: %d' % len(outliers))
# remove the outliers (keep the values that lie within 3 std.dev of the mean)
outliers_removed = [x for x in data if x >= lower and x <= upper]
print('Non-outlier observations: %d' % len(outliers_removed))

Identified outliers using std.dev method: 29
Non-outlier observations: 9971
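
The list comprehensions in the cell above can also be written as a NumPy boolean mask, which is convenient when the same filter has to be applied to other arrays. This is a minimal sketch, not the notebook's own code: the function name `std_outlier_mask`, the synthetic sample, and the planted value of 120 are all assumptions made for illustration.

```python
import numpy as np


def std_outlier_mask(values, n_std=3.0):
    """Return True for every value more than n_std standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    center, spread = values.mean(), values.std()
    return np.abs(values - center) > n_std * spread


# synthetic Gaussian sample (mean 50, std.dev 5) with one planted extreme value
rng = np.random.default_rng(1)
sample = 5 * rng.standard_normal(1000) + 50
sample_with_extreme = np.append(sample, 120.0)

mask = std_outlier_mask(sample_with_extreme)
cleaned = sample_with_extreme[~mask]   # keep only the non-flagged values
print("flagged: %d, kept: %d" % (mask.sum(), cleaned.size))
```

Note that the mean and standard deviation are computed on the data that still contains the outliers, exactly as in the cell above, so very extreme values widen the cut-off themselves.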


## 2. Interquartile Range Method

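A good statistic for summarizing a non-Gaussian distribution sample of data is the Interquartile
Range, or IQR for short. The IQR is the difference between the 75th and 25th percentiles of the data;
values more than 1.5 times the IQR below the 25th percentile or above the 75th percentile are treated as outliers.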

In [15]:
# IQR method: IQR = q75 - q25
from numpy import percentile
# calculate the 25th and 75th percentiles
q25, q75 = percentile(data, 25), percentile(data, 75)
IQR = q75 - q25
# define the outlier cut-off: 1.5 times the IQR beyond each quartile
cut_off = IQR * 1.5
lower, upper = (q25 - cut_off), (q75 + cut_off)
print('Percentiles: 25th= %.3f, 75th= %.3f, IQR= %.3f' % (q25, q75, IQR))
# identify and store the outliers
outliers = [x for x in data if x < lower or x > upper]
print("Outliers identified using IQR method: %d" %(len(outliers)))
# remove the outliers (keep the values that lie inside the cut-off)
outliers_removed = [x for x in data if x >= lower and x <= upper]
print("Non-Outlier observations : %d" %(len(outliers_removed)))

Percentiles: 25th= 46.685, 75th= 53.359, IQR= 6.674
Outliers identified using IQR method: 81
Non-Outlier observations : 9919
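
To make the IQR rule easy to check by hand, here is the same calculation on a tiny sample. The ten values are invented for illustration, and `np.percentile` is used with its default linear interpolation.

```python
import numpy as np

# ten hand-picked values; 100 is the obvious extreme
small = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# quartiles and interquartile range
q25, q75 = np.percentile(small, [25, 75])
iqr = q75 - q25

# fences sit 1.5 * IQR beyond each quartile
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
flagged = small[(small < lower) | (small > upper)]

print("q25=%.2f q75=%.2f IQR=%.2f" % (q25, q75, iqr))
print("fences: [%.2f, %.2f]" % (lower, upper))
print("flagged:", flagged)
```

For this sample the quartiles are 3.25 and 7.75, so the fences sit at -3.5 and 14.5 and only the value 100 is flagged.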


## 3. Automatic Outlier Detection

In machine learning, one approach to the problem of outlier detection is one-class classification. The following scikit-learn algorithms are covered in this section:

* Local Outlier Factor
* Isolation Forest
* Minimum Covariance Determinant
* One-Class SVM
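
All four scikit-learn estimators listed above expose `fit_predict`, which returns `-1` for rows flagged as outliers and `1` for inliers, so the row-filtering step can be written once and reused. The sketch below is only an illustration: the helper name `drop_flagged_rows` and the synthetic regression data are assumptions, and `IsolationForest` is used merely as one example detector.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def drop_flagged_rows(detector, X_train, y_train):
    """Fit an outlier detector on X_train and keep only the rows it labels as inliers."""
    labels = detector.fit_predict(X_train)   # -1 = outlier, 1 = inlier
    keep = labels != -1
    return X_train[keep], y_train[keep]


# synthetic regression data standing in for a real training set
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

detector = IsolationForest(contamination=0.1, random_state=0)
X_kept, y_kept = drop_flagged_rows(detector, X_demo, y_demo)
print(X_demo.shape, "->", X_kept.shape)
```

Only the training rows are filtered; the test set is left untouched so that the evaluation still reflects the data the model will see in practice.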

In [5]:
# Baseline model: linear regression fit on the full training set (no outlier removal)
# Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# load the data
df = pd.read_csv('housing.csv', header=None)
# store the data values as a NumPy array
data = df.values
# split the data into input and output elements
X, y = data[:, :-1], data[:, -1]
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# predict on the test set
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print("MAE without outlier removal: %0.3f " %(mae))

# score() returns the coefficient of determination (R^2) for a regressor,
# printed here as a percentage under the label "Accuracy"
accuracy = model.score(X_test, y_test)
print("Accuracy of the baseline_model: %0.3f" %(accuracy*100))

MAE without outlier removal: 3.417 
Accuracy of the baseline_model: 76.494
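
The two numbers printed by the baseline cell are the mean absolute error and the value returned by `LinearRegression.score`, which for a regressor is the coefficient of determination (R^2) rather than a classification accuracy. The sketch below computes both by hand on four invented target/prediction pairs and compares them with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# invented targets and predictions, small enough to verify by hand
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

# MAE: average absolute difference between target and prediction
mae_manual = np.mean(np.abs(y_true - y_pred))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MAE: %.3f (sklearn: %.3f)" % (mae_manual, mean_absolute_error(y_true, y_pred)))
print("R^2: %.3f (sklearn: %.3f)" % (r2_manual, r2_score(y_true, y_pred)))
```

Here the MAE is 0.5 and R^2 is 0.925. A lower MAE and a higher R^2 both indicate a better fit, which is how the results of the outlier-removal experiments should be read.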


### Local Outlier Factor : LOF  

> The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier
detection.

Each example is assigned a score of how isolated it is, or how likely it is to be an outlier, based on the size of its local neighborhood. Examples with the largest scores are more likely to be outliers.

This can work well for feature spaces with low dimensionality (few features), but it can become less reliable as the number of features grows.
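
A toy example may help show what the LOF score looks like. The data here is synthetic (a tight 2-D cluster plus one planted far-away point), and `n_neighbors=5` is an arbitrary choice for such a small sample rather than a recommended setting.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# 50 points in a tight cluster around the origin, plus one planted distant point
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
X_toy = np.vstack([cluster, [[10.0, 10.0]]])

lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X_toy)            # -1 = outlier, 1 = inlier
scores = lof.negative_outlier_factor_      # more negative = more anomalous

print("label of planted point:", labels[-1])
print("most anomalous index:", int(np.argmin(scores)))
```

scikit-learn stores the negated LOF, so the most isolated example has the smallest (most negative) value of `negative_outlier_factor_`.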

In [16]:
#libraries
from sklearn.neighbors import LocalOutlierFactor
#load the data
df = pd.read_csv('housing.csv', header=None)
#store the data for np operations
data = df.values
#split the data into input and output elements
X = data[:,:-1]
y = data[:,-1]
#summarizing the shape
print(X.shape, y.shape)
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=1)
#summarise the shapes: the first printed pair is X_train, X_test and the second is y_train, y_test
print("Training ds:\n",X_train.shape, X_test.shape,"\nTest ds:\n",y_train.shape, y_test.shape)

#Identify the outliers
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
#select all rows that are not outliers
mask = yhat != -1
X_train , y_train = X_train[mask, :] , y_train[mask]
#summary of updated training dataset
print("Updated training dataset:")
print(X_train.shape, y_train.shape)
#fit the model
model= LinearRegression()
model.fit(X_train,y_train)
#evaluate the model
yhat = model.predict(X_test)
#evaluate the predictions
mae = mean_absolute_error(y_test, yhat)
print("MAE model after=> LOF : %0.3f" %mae)

accuracy = model.score(X_test, y_test)
print("Accuracy of the model: %0.3f" %(accuracy*100))

(506, 13) (506,)
Training ds:
 (339, 13) (167, 13) 
Test ds:
 (339,) (167,)
Updated training dataset:
(305, 13) (305,)
MAE model after=> LOF : 3.356
Accuracy of the model: 77.195


### Isolation Forest : iForest

Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm.

It is based on modeling the normal data in such a way as to isolate anomalies that are both few in number and different in the feature space.

> … our proposed method takes advantage of two anomalies’ quantitative properties: i) they are the minority consisting of fewer instances and ii) they have attribute-values that are very different from those of normal instances.
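
The "few and different" idea can be seen on a toy dataset. This is a sketch on synthetic data with one planted distant point; the `contamination=0.05` value and the fixed `random_state` are assumptions chosen only to make the illustration repeatable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 100 Gaussian points plus one planted point far from the rest
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])

iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = iso.fit_predict(X_toy)        # -1 = outlier, 1 = inlier
scores = iso.score_samples(X_toy)      # lower = easier to isolate = more anomalous

print("label of planted point:", labels[-1])
print("lowest-scoring index:", int(np.argmin(scores)))
```

Because the trees are built from random splits, results vary from run to run unless `random_state` is fixed.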

In [17]:
#libraries
from sklearn.ensemble import IsolationForest
#load the data
df = pd.read_csv('housing.csv', header=None)
#store the data for np operations
data = df.values
#split the data into input and output elements
X = data[:,:-1]
y = data[:,-1]
#summarizing the shape
print(X.shape, y.shape)
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=1)
#summarise the shapes: the first printed pair is X_train, X_test and the second is y_train, y_test
print("Training ds:\n",X_train.shape, X_test.shape,"\nTest ds:\n",y_train.shape, y_test.shape)

#Identify the outliers
iso = IsolationForest(contamination=0.1)  # "contamination" is the expected proportion of outliers in the dataset
# It is a value in (0.0, 0.5]; recent scikit-learn versions default to 'auto', so it is set to 0.1 explicitly here.
yhat = iso.fit_predict(X_train)
#select all rows that are not outliers
mask = yhat != -1
X_train , y_train = X_train[mask, :] , y_train[mask]
#summary of updated training dataset
print("Updated training dataset:")
print(X_train.shape, y_train.shape)
#fit the model
model= LinearRegression()
model.fit(X_train,y_train)
#evaluate the model
yhat = model.predict(X_test)
#evaluate the predictions
mae = mean_absolute_error(y_test, yhat)
print("MAE model after=> Iso : %0.3f" %mae)

accuracy = model.score(X_test, y_test)
print("Accuracy of the model: %0.3f" %(accuracy*100))

(506, 13) (506,)
Training ds:
 (339, 13) (167, 13) 
Test ds:
 (339,) (167,)
Updated training dataset:
(305, 13) (305,)
MAE model after=> Iso : 3.199
Accuracy of the model: 78.128


### Minimum Covariance Determinant : MCD

If the input variables have a Gaussian distribution, then simple statistical methods can be used to detect outliers.

For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multi-dimensional Gaussian and knowledge of this distribution can be used to identify values far from the distribution.

This approach can be generalized by defining a hypersphere (ellipsoid) that covers the normal data, and data that falls outside this shape is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant, or MCD for short.

> The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. […] It also serves as a convenient and efficient tool for outlier detection.
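
The ellipsoid idea can be illustrated with two correlated Gaussian features. The sketch below uses synthetic data and plants one point that is not extreme in either feature alone but lies against the direction of the correlation; the covariance matrix, the planted point, and `contamination=0.05` are all assumptions for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# 200 points from a correlated 2-D Gaussian
rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]
inliers = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=200)

# planted point lying across the main axis of the ellipse
X_toy = np.vstack([inliers, [[4.0, -4.0]]])

ee = EllipticEnvelope(contamination=0.05, random_state=0)
labels = ee.fit_predict(X_toy)      # -1 = outlier, 1 = inlier
dist = ee.mahalanobis(X_toy)        # squared Mahalanobis distance to the robust center

print("label of planted point:", labels[-1])
print("largest-distance index:", int(np.argmax(dist)))
```

The Mahalanobis distance accounts for the shape of the fitted ellipse, which is why a point that runs against the correlation is flagged even though each of its coordinates on its own would look less extreme.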

In [18]:
#libraries
from sklearn.covariance import EllipticEnvelope
#load the data
df = pd.read_csv('housing.csv', header=None)
#store the data for np operations
data = df.values
#split the data into input and output elements
X = data[:,:-1]
y = data[:,-1]
#summarizing the shape
print(X.shape, y.shape)
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=1)
#summarise the shapes: the first printed pair is X_train, X_test and the second is y_train, y_test
print("Training ds:\n",X_train.shape, X_test.shape,"\nTest ds:\n",y_train.shape, y_test.shape)

#Identify the outliers
ee = EllipticEnvelope(contamination=0.1)  # "contamination" defines the expected proportion of outliers in the data
# it is set to 0.1 here, so about 10 percent of the training rows are flagged
yhat = ee.fit_predict(X_train)
#select all rows that are not outliers
mask = yhat != -1
X_train , y_train = X_train[mask, :] , y_train[mask]
#summary of updated training dataset
print("Updated training dataset:")
print(X_train.shape, y_train.shape)
#fit the model
model= LinearRegression()
model.fit(X_train,y_train)
#evaluate the model
yhat = model.predict(X_test)
#evaluate the predictions
mae = mean_absolute_error(y_test, yhat)
print("MAE model after=> MCD : %0.3f" %mae)

accuracy = model.score(X_test, y_test)
print("Accuracy of the model: %0.3f" %(accuracy*100))

(506, 13) (506,)
Training ds:
 (339, 13) (167, 13) 
Test ds:
 (339,) (167,)
Updated training dataset:
(305, 13) (305,)
MAE model after=> MCD : 3.686
Accuracy of the model: 73.093


### One-Class SVM : 

The support vector machine, or SVM, algorithm developed initially for binary classification can be used for one-class classification.

When modeling one class, the algorithm captures the density of the majority class and classifies examples on the extremes of the density function as outliers. This modification of SVM is referred to as One-Class SVM. Although One-Class SVM is itself a classification algorithm, it can be used to discover outliers in the input data of both regression and classification datasets.
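
The `nu` argument of `OneClassSVM` controls roughly how many training examples end up outside the learned boundary. The sketch below, on synthetic Gaussian data, fits the model with three assumed values of `nu` and reports the share of rows flagged in each case; the exact fractions depend on the data and the kernel settings.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# synthetic 2-D Gaussian data standing in for a training set
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))

# fraction of rows flagged as outliers for each value of nu
fractions = {}
for nu in (0.01, 0.05, 0.2):
    labels = OneClassSVM(nu=nu, gamma="scale").fit_predict(X_toy)   # -1 = outlier
    fractions[nu] = float(np.mean(labels == -1))
    print("nu=%.2f -> flagged fraction: %.3f" % (nu, fractions[nu]))
```

A larger `nu` removes more training rows, so in practice it is a value to tune rather than a fixed default.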

In [19]:
#libraries
from sklearn.svm import OneClassSVM
#load the data
df = pd.read_csv('housing.csv', header=None)
#store the data for np operations
data = df.values
#split the data into input and output elements
X = data[:,:-1]
y = data[:,-1]
#summarizing the shape
print(X.shape, y.shape)
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=1)
#summarise the shapes: the first printed pair is X_train, X_test and the second is y_train, y_test
print("Training ds:\n",X_train.shape, X_test.shape,"\nTest ds:\n",y_train.shape, y_test.shape)

#Identify the outliers
svm = OneClassSVM(nu=0.01)  # "nu" specifies the approximate ratio of outliers in the dataset
# it is set to 0.01 here, a value found with a little trial and error
yhat = svm.fit_predict(X_train)
#select all rows that are not outliers
mask = yhat != -1
X_train , y_train = X_train[mask, :] , y_train[mask]
#summary of updated training dataset
print("Updated training dataset:")
print(X_train.shape, y_train.shape)
#fit the model
model= LinearRegression()
model.fit(X_train,y_train)
#evaluate the model
yhat = model.predict(X_test)
#evaluate the predictions
mae = mean_absolute_error(y_test, yhat)
print("MAE model after=> One-Class SVM : %0.3f" %mae)

accuracy = model.score(X_test, y_test)
print("Accuracy of the model: %0.3f" %(accuracy*100))

(506, 13) (506,)
Training ds:
 (339, 13) (167, 13) 
Test ds:
 (339,) (167,)
Updated training dataset:
(336, 13) (336,)
MAE model after=> One-Class SVM : 3.431
Accuracy of the model: 76.438


## API :

* Isolation Forest : 
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html

* Local Outlier Factor :
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html

* Min Cov Determinant :
https://scikit-learn.org/stable/modules/generated/sklearn.covariance.EllipticEnvelope.html

* One class SVM : 
https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html

* Novelty and outlier detection (user guide) :
https://scikit-learn.org/stable/modules/outlier_detection.html
