# 1. Data Processing

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

In [None]:
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = pd.read_csv(url, header=None)
df.head()

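| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |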


In [None]:
data = df.values
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, y_train.shape)

(339, 13) (339,)


# 2. Novelty and Outlier Detection

[SkLearn Overview of outlier detection methods](https://scikit-learn.org/stable/modules/outlier_detection.html) | [SkLearn Anomaly Detection Algorithm](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_anomaly_comparison.html)

## 2.1 Isolation Forest

Because anomalies are few and different from the rest of the data, random splits isolate them quickly, so they have shorter average path lengths in the trees than normal instances.

[Isolation Forest Explained](https://towardsdatascience.com/isolation-forest-the-anomaly-detection-algorithm-any-data-scientist-should-know-1a99622eec2d)
| [Isolation Forest from Scratch](https://towardsdatascience.com/isolation-forest-from-scratch-e7e5978e6f4c)
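
The short-path idea can be seen on a toy example. The sketch below is illustrative only: the synthetic data, the variable names, and `random_state=0` are assumptions and not part of the housing workflow. It plants one far-away point in a Gaussian cloud and checks that `score_samples` (lower means more abnormal) ranks it as the most anomalous sample.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (assumption): 200 Gaussian inliers plus one distant point.
rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted_outlier = np.array([[8.0, 8.0]])
X_demo = np.vstack([inliers, planted_outlier])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_demo)

# score_samples is derived from the average path length:
# the lower the score, the faster the point was isolated.
scores = forest.score_samples(X_demo)
most_anomalous = int(np.argmin(scores))

print("index of most anomalous sample:", most_anomalous)
print("its score:", scores[most_anomalous])
print("mean inlier score:", scores[:200].mean())
```

Fixing `random_state` makes the flagged rows reproducible between runs, which the unseeded cell below does not guarantee.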


In [None]:
ifo = IsolationForest(n_estimators=100, contamination=0.1)
mask = ifo.fit_predict(X_train)
# 34 rows are flagged as outliers (label -1)
X_train[(mask != 1), :].shape

(34, 13)

## 2.2 Minimum Covariance Determinant

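If the input variables follow a Gaussian distribution, simple statistical methods can be used to detect outliers. `EllipticEnvelope` fits a robust covariance estimate (the Minimum Covariance Determinant) to the data and flags the points that lie farthest from the fitted ellipse in Mahalanobis distance.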
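
As a rough illustration of the Gaussian assumption, the sketch below fits `EllipticEnvelope` to synthetic correlated Gaussian data with one planted point that runs against the correlation. The data, the `contamination=0.01` setting, and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic data (assumption): positively correlated 2-D Gaussian
# plus one point placed against the direction of correlation.
rng = np.random.RandomState(0)
gaussian = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 0.6], [0.6, 1.0]], size=200
)
X_demo = np.vstack([gaussian, [[6.0, -6.0]]])

envelope = EllipticEnvelope(contamination=0.01, random_state=0).fit(X_demo)

# mahalanobis() returns squared Mahalanobis distances to the robust location.
squared_distances = envelope.mahalanobis(X_demo)
labels = envelope.predict(X_demo)  # 1 = inlier, -1 = outlier

print("largest squared distance at index:", int(np.argmax(squared_distances)))
print("label of the planted point:", labels[-1])
```

When the features are far from Gaussian, as several of the housing columns may be, the fitted ellipse is a weaker description of the data and the flagged points should be read with more caution.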

In [None]:
ee = EllipticEnvelope(contamination=0.1)
ee_mask = ee.fit_predict(X_train)
# 34 rows are flagged as outliers (label -1)
X_train[(ee_mask != 1), :].shape

(34, 13)

In [None]:
ee.covariance_.shape

(13, 13)

In [None]:
ee.location_

array([1.80820711e+00, 8.51605505e+00, 9.05880734e+00, 4.12844037e-02,
       5.21805046e-01, 6.29650000e+00, 6.32330275e+01, 4.14108807e+00,
       8.54587156e+00, 3.73894495e+02, 1.84642202e+01, 3.87783165e+02,
       1.14200459e+01])

## 2.3 Local Outlier Factor
The anomaly score of each sample is called the Local Outlier Factor. It measures the local deviation of the density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood. More precisely, locality is given by k-nearest neighbors, whose distance is used to estimate the local density. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers.
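
A minimal sketch of the score described above, on synthetic data (the cluster, the planted point, and the variable names are assumptions for illustration). scikit-learn stores the negated score in `negative_outlier_factor_`, so values near -1 indicate inliers and strongly negative values indicate outliers.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (assumption): one tight cluster plus one isolated point.
rng = np.random.RandomState(0)
cluster = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
X_demo = np.vstack([cluster, [[5.0, 5.0]]])

lof_demo = LocalOutlierFactor(n_neighbors=20)
labels = lof_demo.fit_predict(X_demo)  # 1 = inlier, -1 = outlier

# Flip the sign so that larger values mean "more anomalous".
lof_scores = -lof_demo.negative_outlier_factor_

print("highest LOF at index:", int(np.argmax(lof_scores)))
print("LOF of the isolated point:", lof_scores[-1])
print("median LOF inside the cluster:", np.median(lof_scores[:100]))
```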

Brute force considers every data point, so no neighbour is missed. For small data sets brute force is justifiable; as the data grows, the KD tree or ball tree are better alternatives because of their speed and efficiency. The KD-tree and its variants can be termed "projective trees," meaning that they categorize points based on their projection into some lower-dimensional space (Kumar, Zhang & Nayar, 2008). For low-dimensional data, the KD tree algorithm might be the best solution. The node divisions of a KD tree are axis-aligned and cannot take a different shape, so the distribution of the data might not be mapped correctly, leading to poor performance. For a high-dimensional space, the ball tree algorithm might be the best solution. Its performance depends on the amount of training data, the dimensionality, and the structure of the data. Many noisy data points can also lead to poor performance because the data has no clear structure.

Note: LOF measures the distances to the n nearest neighbours, so it only works well for feature spaces with low dimensionality.

[Medium Ball Tree vs. KD Tree vs. Brute Force](https://towardsdatascience.com/tree-algorithms-explained-ball-tree-algorithm-vs-kd-tree-vs-brute-force-9746debcd940) | [Nearest Neighbors](https://scikit-learn.org/stable/modules/neighbors.html)
 | [SkLearn LocalOutlierFactor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html#sklearn.neighbors.LocalOutlierFactor)
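
The neighbour search strategy is chosen with the `algorithm` parameter, which `LocalOutlierFactor` shares with `NearestNeighbors`. The sketch below, on assumed synthetic data, runs the three strategies on the same points and compares the returned distances. It illustrates the API only and is not a benchmark.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic low-dimensional data (assumption for illustration).
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(500, 3))

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X_demo)
    # Calling kneighbors() without arguments queries the training points
    # and excludes each point from its own neighbour list.
    distances, indices = nn.kneighbors()
    results[algorithm] = (distances, indices)

for algorithm in ("kd_tree", "ball_tree"):
    same = np.allclose(results["brute"][0], results[algorithm][0], atol=1e-6)
    print(algorithm, "matches brute-force distances:", same)
```

In scikit-learn the choice mainly affects fitting and query time; leaving `algorithm='auto'` lets the library pick a strategy based on the data.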



In [None]:
lof = LocalOutlierFactor()
lof_mask = lof.fit_predict(X_train)
X_train[(lof_mask != 1), :].shape

(34, 13)

In [None]:
# find the nearest neighbours of the first training sample
distances, indices = lof.kneighbors()
print('Distance to neighbours', distances[0])
print('Nearest Neighbours', indices[0])

Distance to neighbours [ 4.23092261  5.10651982  7.05271079  7.84372942  8.70051763 10.69672104
 10.80775442 11.186697   14.49075635 14.97747497 15.58867539 15.81042453
 16.15736732 16.70677642 17.22434599 17.23959374 17.63102864 18.14385094
 18.23231502 18.29698129]
Nearest Neighbours [ 87 113 123  60 255 168 284  26 170 176 329 128  17 311 169 315 333 323
 163 336]


## 2.4 One-Class SVM

A one-class SVM is trained on a single class only; it learns a boundary that separates the data from the origin in feature space.

The OneClassSVM is known to be sensitive to outliers and thus does not perform very well for outlier detection. This estimator is best suited for novelty detection when the training set is not contaminated by outliers. That said, outlier detection in high-dimension, or without any assumptions on the distribution of the inlying data is very challenging, and a One-class SVM might give useful results in these situations depending on the value of its hyperparameters.
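
The novelty-detection use case can be sketched as follows: fit on data assumed to be clean, then score unseen points. The synthetic data, `nu=0.05`, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Training set assumed to be free of outliers (synthetic, for illustration).
rng = np.random.RandomState(0)
X_clean = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Two unseen points: one in the middle of the cloud, one far away.
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])

novelty = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_clean)
predictions = novelty.predict(X_new)      # 1 = inlier, -1 = novelty
margins = novelty.decision_function(X_new)  # negative means outside the boundary

print("predictions:", predictions)
print("signed distances to the boundary:", margins)
```

Here `nu` is an upper bound on the fraction of training points treated as errors, so it plays a role similar to `contamination` in the other estimators.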

`sklearn.linear_model.SGDOneClassSVM` has linear complexity in the number of training samples and is thus better suited than the `sklearn.svm.OneClassSVM` implementation for datasets with a large number of training samples (say > 10,000).

[One Class SVM for Anomaly Detection](https://machinelearninginterview.com/topics/machine-learning/what-is-one-class-svm-how-to-use-it-for-anomaly-detection/)

In [None]:
ocs = OneClassSVM(nu=0.01)
# fit the one-class SVM and label each training sample (-1 = outlier)
ocs_mask = ocs.fit_predict(X_train)
X_train[(ocs_mask != 1), :].shape