## Outlier Detection 

Source: https://machinelearningmastery.com/model-based-outlier-detection-and-removal-in-python/

What is an outlier?
- observations that don't fit in with the rest of the data

Outlier detection becomes difficult in a high-dimensional feature space.

Problems caused by outliers:
- they skew statistical measures (such as the mean and standard deviation) and the data distribution
- they give a misleading picture of the underlying data and relationships
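As a small illustration of the first point, the sketch below uses made-up numbers (not the housing data) to compare the mean and the median with and without a single extreme value. The array contents and variable names are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical sample: nine typical values, then the same sample plus one extreme value.
values = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 11.5, 8.5, 10.0])
with_outlier = np.append(values, 100.0)

mean_clean, mean_outlier = values.mean(), with_outlier.mean()
median_clean, median_outlier = np.median(values), np.median(with_outlier)

# The mean moves a lot, the median does not move at all for this sample.
print(f"mean:   {mean_clean:.2f} -> {mean_outlier:.2f}")
print(f"median: {median_clean:.2f} -> {median_outlier:.2f}")
```

A single observation is enough to shift the mean noticeably, which is why a model fit by minimising squared error can be pulled towards such points.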

Positive impact of removing outliers:
- can give a better fit to the data and more accurate predictions


##### IMPORTANT:
- Re-run the train-test split cell before each outlier detection method, so that `X_train` holds the full, unfiltered training set (each method overwrites `X_train` and `y_train` with the filtered data).
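One way to avoid depending on cell order is to keep the filtering inside a function that never reassigns the training arrays. The following is only a sketch: the helper name `mae_after_filtering` and the synthetic data from `make_regression` are assumptions for illustration, not part of this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def mae_after_filtering(detector, X_train, y_train, X_test, y_test):
    """Fit LinearRegression on the inliers only and return the test MAE.

    The input arrays are only indexed, never reassigned, so the caller's
    training set stays intact between detectors.
    """
    labels = detector.fit_predict(X_train)  # 1 = inlier, -1 = outlier
    mask = labels != -1
    model = LinearRegression().fit(X_train[mask], y_train[mask])
    return mean_absolute_error(y_test, model.predict(X_test))


# Synthetic stand-in data (an assumption; the notebook itself uses housing.csv).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1)

n_before = Xtr.shape[0]
mae_iso = mae_after_filtering(IsolationForest(random_state=1), Xtr, ytr, Xte, yte)
print(mae_iso, Xtr.shape[0] == n_before)
```

With a helper like this, each detector can be evaluated on the same unfiltered split without re-running the split cell.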


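### Results:
Mean absolute error (MAE) on the test set, lower is better:
- Baseline: 3.417
- Isolation Forest: 3.224
- Minimum Covariance Determinant: 3.388
- Local Outlier Factor: 3.356
- One-Class SVM: 3.224 (identical to the Isolation Forest value: this number was produced by predicting with `model`, the regressor fit after Isolation Forest filtering, not with `model_svm`, so it does not measure the One-Class SVM filter)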

In [1]:
#imports 
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

In [2]:
df = pd.read_csv('housing.csv', header=None)
df

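| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |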


In [3]:
data = df.values

In [4]:
X, y = data[:, :-1], data[:, -1]

In [61]:
# run this cell each time before doing another outlier detection
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [6]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

In [7]:
y_pred = model.predict(X_test)

In [8]:
mean_absolute_error(y_test, y_pred)

3.4174722788016623

### Automatic outlier detection

#### Isolation Forest
- tree-based anomaly detection algorithm

> based on modeling the normal data in such a way as to isolate anomalies that are both few in number and different in the feature space.

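Available in scikit-learn as `IsolationForest`:
- important hyperparameter: `contamination`, the expected proportion of outliers in the data

Returns:
- `fit_predict` returns -1 for outliers and 1 for inliers (negative `decision_function` scores correspond to outliers, positive scores to inliers)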
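To make the relationship between `contamination`, the labels and the scores concrete, here is a minimal sketch on made-up 2-D data. The toy data, the `contamination=0.05` setting and the variable names are assumptions for illustration and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Hypothetical data: 200 points around the origin plus 5 points far away.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_points = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X_toy = np.vstack([inliers, far_points])

iso_toy = IsolationForest(contamination=0.05, random_state=0)
labels = iso_toy.fit_predict(X_toy)        # array of 1 (inlier) and -1 (outlier)
scores = iso_toy.decision_function(X_toy)  # negative score <=> label -1

n_flagged = int((labels == -1).sum())
print("flagged:", n_flagged, "of", len(X_toy))
print("labels of the far points:", labels[-5:])
```

Raising `contamination` moves the score threshold so that a larger share of the training points is labelled -1.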

In [9]:
from sklearn.ensemble import IsolationForest

In [10]:
iso = IsolationForest()
y_hat = iso.fit_predict(X_train)

In [15]:
y_hat

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,
        1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1, -1,  1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1,  1, -1,
        1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,
        1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1, -1, -1,
        1,  1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,
       -1, -1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1, -1, -1, -1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,
       -1,  1,  1,  1,  1

In [11]:
# create a mask to get the data without the outliers
mask = y_hat != -1  # creates a boolean value for each instance (True = keep)
X_train, y_train = X_train[mask, :], y_train[mask]

In [12]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

In [13]:
y_pred_iso = model.predict(X_test)

In [14]:
mean_absolute_error(y_test, y_pred_iso)

3.223821977818619

#### Minimum Covariance Determinant 
> The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter. It also serves as a convenient and effective tool for outlier detection 

- Simple statistical methods can be used to detect outliers when the input variables have a Gaussian distribution

This method is implemented in scikit-learn by the `EllipticEnvelope` class

Returns:
- 1 for inliers, -1 for outliers.
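The sketch below shows the idea on made-up Gaussian data: a robust covariance is fit and the points with the largest Mahalanobis distance are labelled -1. The toy data and the three planted points are assumptions for illustration only.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(42)
# Hypothetical correlated Gaussian cloud plus three points placed off its main axis.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
gaussian = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=300)
planted = np.array([[4.0, -4.0], [-5.0, 5.0], [6.0, -6.0]])
X_toy = np.vstack([gaussian, planted])

ee_toy = EllipticEnvelope(contamination=0.01, random_state=42)
labels = ee_toy.fit_predict(X_toy)  # 1 = inlier, -1 = outlier
dist = ee_toy.mahalanobis(X_toy)    # squared Mahalanobis distance under the robust fit

print("flagged indices:", np.where(labels == -1)[0])
print("indices with the largest distances:", np.argsort(dist)[-3:])
```

The planted points are not extreme in either coordinate alone; they stand out because they contradict the correlation between the two features, which is what the covariance-based distance captures.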

In [17]:
from sklearn.covariance import EllipticEnvelope

In [27]:
ee = EllipticEnvelope(contamination=0.01)
y_hat = ee.fit_predict(X_train)

In [28]:
y_hat

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1

In [29]:
mask = y_hat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

In [30]:
model_ee = LinearRegression()
model_ee.fit(X_train, y_train)

LinearRegression()

In [31]:
yhat_ee = model_ee.predict(X_test)

In [32]:
mae_ee = mean_absolute_error(y_test, yhat_ee)
mae_ee

3.38756842102794

#### Local Outlier Factor
> harness the idea of nearest neighbors to detect outliers. Each example is assigned a scoring of how isolated or how likely it is to be outliers based on the size of its local neighborhood. Those examples with the largest score are more likely to be outliers.


Limitation:
- works well for a feature space with low dimensionality, but becomes less reliable as the number of features increases

This method is implemented in scikit-learn by the <a href='https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html'>LocalOutlierFactor</a> class
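As a rough illustration of the neighborhood-based score, the sketch below runs `LocalOutlierFactor` on a made-up dense cluster with two isolated points. The data and variable names are assumptions; `n_neighbors=20` is the scikit-learn default, written out for clarity.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(1)
# Hypothetical data: one dense cluster plus two isolated points.
cluster = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
isolated = np.array([[5.0, 5.0], [-6.0, 4.0]])
X_toy = np.vstack([cluster, isolated])

lof_toy = LocalOutlierFactor(n_neighbors=20)
labels = lof_toy.fit_predict(X_toy)             # 1 = inlier, -1 = outlier
lof_scores = -lof_toy.negative_outlier_factor_  # larger = more isolated

print("labels of the isolated points:", labels[-2:])
print("their LOF scores:", lof_scores[-2:])
print("median LOF score in the data:", np.median(lof_scores))
```

Points inside the cluster have a score close to 1 because their local density matches that of their neighbors; the isolated points score far higher.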

In [34]:
from sklearn.neighbors import LocalOutlierFactor

In [41]:
lof = LocalOutlierFactor()
y_hat = lof.fit_predict(X_train)

In [42]:
y_hat

array([ 1,  1,  1,  1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1, -1,  1,
        1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,
        1,  1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1, -1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,
        1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1, -1,
        1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1, -1,  1

In [43]:
mask = y_hat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

In [47]:
model_lof = LinearRegression()
model_lof.fit(X_train, y_train)

LinearRegression()

In [48]:
yhat_lof = model_lof.predict(X_test)

In [50]:
mae_lof = mean_absolute_error(y_test, yhat_lof)
mae_lof

3.355992329285227

#### One-Class SVM
> Support Vector Machines can be used for outlier detection. When modeling one class, the algorithm captures the density of the majority class and classifies examples on the extremes of the density function as outliers.

This method is implemented in scikit-learn by the <a href='https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html'>OneClassSVM</a> class
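The `nu` parameter of `OneClassSVM` roughly controls the share of training points that end up outside the learned boundary. The sketch below compares two values on made-up standard-normal data; the data, the two `nu` values and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(3)
X_toy = rng.normal(size=(400, 2))  # hypothetical standard-normal data

fractions = {}
for nu in (0.01, 0.2):
    # fit_predict returns 1 for inliers and -1 for outliers
    labels = OneClassSVM(nu=nu, gamma="scale").fit_predict(X_toy)
    fractions[nu] = float(np.mean(labels == -1))

# Share of training points labelled as outliers for each nu.
print(fractions)
```

A small `nu`, as in the cell that follows, removes only a handful of training rows, so the effect on the downstream regression is expected to be modest.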

In [52]:
from sklearn.svm import OneClassSVM

In [62]:
ocs = OneClassSVM(nu=0.01)
y_hat = ocs.fit_predict(X_train)

In [63]:
y_hat

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1

In [64]:
mask = y_hat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

In [65]:
model_svm = LinearRegression()
model_svm.fit(X_train, y_train)

LinearRegression()

In [66]:
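# predict with model_svm (fit on the One-Class SVM-filtered data);
# the MAE recorded below came from `model`, the regressor from the Isolation
# Forest section, which is why it equals the Isolation Forest result
yhat_svm = model_svm.predict(X_test)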

In [67]:
mae_svm = mean_absolute_error(y_test, yhat_svm)
mae_svm

3.223821977818619