# Isolation Forest

* Builds an ensemble of extremely randomized trees (isolation trees)
* Suitable for processing large datasets
  * Linear time complexity
  * Low memory usage
* Used for detecting outliers

In [1]:
import numpy as np
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt

### Create dataset

* Dataset has 4 values in a single feature
* 3 are small, between -1.1 and 0.5
* 1 is an outlier (100)

In [2]:
X = np.array([-1.1, 0.3, 0.5, 100]).reshape(-1, 1)
X

array([[ -1.1],
       [  0.3],
       [  0.5],
       [100. ]])

### Fit model

* Use Isolation Forest to learn what an anomaly looks like
  * Each tree is built on a random sub-sample of the data
  * Each split uses a randomly selected feature and a random split value
  * Samples that end up deep in a tree are unlikely to be anomalies (they required more cuts to isolate)
  * Samples in short branches are more likely to be anomalies (easier for the model to separate them from other observations)
* Test the model on new data: 0.1, 0, 90
* The model correctly flags 90 as the anomaly
* Anomalies are labelled -1, normal samples are labelled 1
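
The link between branch depth and anomalies can be checked with a rough simulation on the same four values. The sketch below is a simplified one-dimensional illustration, not scikit-learn's implementation: the helper `isolation_depth` is a hypothetical name, and it simply counts how many uniformly random cuts are needed before a chosen value is alone in its partition.

```python
import numpy as np

def isolation_depth(values, target, rng):
    """Count random cuts until `target` is alone in its partition (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    depth = 0
    while len(values) > 1:
        lo, hi = values.min(), values.max()
        if lo == hi:
            # Identical values cannot be separated any further
            break
        # Pick a random cut point between the current minimum and maximum
        split = rng.uniform(lo, hi)
        # Keep only the side of the cut that contains the target
        if target < split:
            values = values[values < split]
        else:
            values = values[values >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
data = [-1.1, 0.3, 0.5, 100.0]

# Average the depth over many random "trees" for each value
mean_depth = {
    x: np.mean([isolation_depth(data, x, rng) for _ in range(2000)])
    for x in data
}
for x, d in mean_depth.items():
    print(f"{x:6.1f} -> mean depth {d:.2f}")
```

On this data the value 100 should need noticeably fewer cuts on average than the three small values, which is the property Isolation Forest turns into an anomaly score.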

In [3]:
clf = IsolationForest(random_state=0).fit(X)
clf.predict([[0.1], [0], [90]])

array([ 1,  1, -1])
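
The -1/1 labels hide how strongly each sample is flagged. As a possible follow-up, the sketch below uses the estimator's `decision_function` method to show the underlying scores for the same test points; negative scores correspond to the -1 label. The variable names `X_new`, `scores` and `labels` are illustrative, and exact score values may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([-1.1, 0.3, 0.5, 100]).reshape(-1, 1)
X_new = np.array([[0.1], [0], [90]])

clf = IsolationForest(random_state=0).fit(X)

# decision_function: lower means more abnormal, negative means outlier
scores = clf.decision_function(X_new)
labels = clf.predict(X_new)

for value, score, label in zip(X_new.ravel(), scores, labels):
    print(f"{value:6.1f}  score={score:+.3f}  label={int(label):+d}")
```

Looking at the scores rather than only the labels helps when choosing a custom threshold or ranking samples by how unusual they are.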