## Generating sample data

`make_classification` lets us limit the number of informative features and make some features copies of the informative or redundant ones.
In our case, we make all features informative, since we limit ourselves to two features only.

In [1]:
from sklearn.datasets import make_classification

In [2]:
# generating samples
x, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, weights=[0.98], class_sep=0.5,
                           scale=1.0, shuffle=True, flip_y=0, random_state=0)
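
Before looking for anomalies, it can help to confirm how imbalanced the generated labels are. The following is a minimal sketch that repeats the call above and counts the samples per class; the names `class_counts` and `outlier_fraction` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same call as above: two informative features, about 98% majority class
x, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, weights=[0.98],
                           class_sep=0.5, scale=1.0, shuffle=True,
                           flip_y=0, random_state=0)

# Count the samples in each class (index 0 = majority, index 1 = minority)
class_counts = np.bincount(y)
outlier_fraction = class_counts[1] / len(y)
print(class_counts, f"{outlier_fraction:.1%}")
```

The minority class (label 1) plays the role of the outliers in the rest of this notebook.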

## Detecting anomalies using basic statistics

Let's start by thinking about ways to detect the anomalous samples.

Imagine measuring the traffic to your website every hour, which gives you the following numbers:

In [3]:
hourly_traffic = [
    120, 123, 124, 119, 196,
    121, 118, 117, 500, 132
]

In [4]:
import pandas as pd

In [5]:
pd.Series(hourly_traffic) > pd.Series(hourly_traffic).quantile(0.95)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8     True
9    False
dtype: bool
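
To see why only the reading of 500 is flagged, it helps to look at the threshold itself. This sketch prints the value that `quantile(0.95)` returns for the same ten readings; the names `threshold` and `flagged` are illustrative.

```python
import pandas as pd

hourly_traffic = [120, 123, 124, 119, 196, 121, 118, 117, 500, 132]

# quantile() interpolates linearly between the two largest readings here
threshold = pd.Series(hourly_traffic).quantile(0.95)
print(threshold)  # about 363.2

# Only readings above the threshold are flagged
flagged = [v for v in hourly_traffic if v > threshold]
print(flagged)  # [500]
```

Note that 196, although clearly higher than the typical reading, stays below the threshold and is not flagged.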

Let's put the preceding code in the form of an estimator with its own `fit` and `predict` methods.
The `fit` method calculates the threshold and saves it, and the `predict` method compares the new data to the saved threshold.
Here's the code for the estimator:

In [7]:
class PercentileDetection:
    def __init__(self, percentile=0.9):
        self.percentile = percentile
    def fit(self, x, y=None):
        self.threshold = pd.Series(x).quantile(self.percentile)
    def predict(self, x, y=None):
        return (pd.Series(x) > self.threshold).values
    def fit_predict(self, x, y=None):
        self.fit(x)
        return self.predict(x)
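
Separating `fit` from `predict` means a threshold learned on past data can be applied to new readings. The sketch below uses a hypothetical variant, `ChainablePercentileDetection`, whose `fit` returns `self` (the usual scikit-learn convention) so the calls can be chained; the new readings are made up for illustration.

```python
import pandas as pd


class ChainablePercentileDetection:
    """Hypothetical variant of the estimator above whose fit() returns self."""

    def __init__(self, percentile=0.9):
        self.percentile = percentile

    def fit(self, x, y=None):
        # Learn the threshold from the training data only
        self.threshold_ = pd.Series(x).quantile(self.percentile)
        return self

    def predict(self, x, y=None):
        # Compare new data against the stored threshold
        return (pd.Series(x) > self.threshold_).values


past_traffic = [120, 123, 124, 119, 196, 121, 118, 117, 500, 132]
new_traffic = [125, 400]  # made-up readings for illustration

detector = ChainablePercentileDetection(percentile=0.95).fit(past_traffic)
new_flags = detector.predict(new_traffic)
print(new_flags)  # [False  True]
```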

In the following code snippet, we use the 95th percentile for our estimator.
We then put the resulting predictions alongside the original data into a data frame.

In [8]:
outlierd = PercentileDetection(percentile=0.95)
pd.DataFrame({
    'hourly_traffic': hourly_traffic,
    'is_outlier': outlierd.fit_predict(hourly_traffic)
}).style.apply(lambda row: ['font-weight: bold'] * len(row)
              if row['is_outlier'] == True
              else ['font-weight: normal'] * len(row), axis=1)

| | hourly_traffic | is_outlier |
|---|---|---|
| 0 | 120 | False |
| 1 | 123 | False |
| 2 | 124 | False |
| 3 | 119 | False |
| 4 | 196 | False |
| 5 | 121 | False |
| 6 | 118 | False |
| 7 | 117 | False |
| 8 | 500 | True |
| 9 | 132 | False |


## Using percentiles for multidimensional data

The data we generated using the `make_classification` function is multidimensional. We have more than one feature to check.

We can check each feature separately, i.e.:

In [10]:
outlierd = PercentileDetection(percentile=0.98)
y_pred = outlierd.fit_predict(x[:,0])

In [11]:
# we can do the same for the other feature as well
y_pred = outlierd.fit_predict(x[:,1])

In [18]:
class PercentileDetection:
    def __init__(self, percentile=0.9):
        self.percentile = percentile
    def fit(self, x, y=None):
        self.thresholds = [
            pd.Series(x[:,i]).quantile(self.percentile) for i in range(x.shape[1])
        ]
    def predict(self, x, y=None):
        return (x > self.thresholds).max(axis=1)
    def fit_predict(self, x, y=None):
        self.fit(x)
        return self.predict(x)
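
The `predict` method above relies on NumPy broadcasting to compare each column with its own threshold. The following toy sketch, with made-up values and thresholds, shows what that comparison and the row-wise `max` produce.

```python
import numpy as np

x_toy = np.array([[1, 10],
                  [5, 2],
                  [0, 0]])
thresholds = [3, 8]  # one threshold per column, made up for illustration

# Broadcasting compares column 0 with 3 and column 1 with 8
exceeds = x_toy > thresholds
print(exceeds)

# A row is flagged if any of its features exceeds its threshold;
# max(axis=1) on booleans is equivalent to any(axis=1)
row_flags = exceeds.max(axis=1)
print(row_flags)  # [ True  True False]
```

In other words, a sample is reported as an outlier as soon as it is extreme on at least one feature.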

In [19]:
# we can use the tweaked estimator as follows
outlierd = PercentileDetection(percentile=0.98)
y_pred = outlierd.fit_predict(x)

We can also use the labels we ignored earlier to calculate the precision and recall of our new estimator.
Since we care about the minority class, whose label is 1, we set `pos_label` to `1` in the following code snippet.

In [20]:
from sklearn.metrics import precision_score, recall_score

In [21]:
print('Precision: {:.02%}, Recall: {:.02%} [Percentile Detection]'.format(
    precision_score(y, y_pred, pos_label=1),
    recall_score(y, y_pred, pos_label=1),
))

Precision: 4.00%, Recall: 5.00% [Percentile Detection]
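
To make the two metrics concrete, here is a small worked example on made-up labels, where two samples are real outliers and two samples are flagged, but only one flag is correct.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 1, 1]  # two real outliers
y_flag = [0, 1, 0, 1, 0]  # two samples flagged, only one of them correctly

# Precision: of the flagged samples, what fraction are real outliers? 1 / 2
toy_precision = precision_score(y_true, y_flag, pos_label=1)
# Recall: of the real outliers, what fraction was flagged? 1 / 2
toy_recall = recall_score(y_true, y_flag, pos_label=1)
print(toy_precision, toy_recall)  # 0.5 0.5
```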


Our method checks each point and sees whether it is extreme on one of the two axes.
Despite the fact that the outliers are farther away from the inliers, there are still inliers that share the same horizontal or vertical position as each of the outliers.
In other words, if you project your points onto either of the two axes, you will no longer be able to separate the outliers from the inliers. So we need a way to consider the two axes at once.
What if we find the mean point of the two axes (that is, the center of our data) and then draw an ellipse around it? Then we can consider any point that falls outside this ellipse an outlier,
which is what `EllipticEnvelope` does.

The `EllipticEnvelope` algorithm finds the center of the data samples and then draws an ellipsoid around that center.
The radii of the ellipsoid along each axis are measured in Mahalanobis distance.

You can think of the Mahalanobis distance as a Euclidean distance whose units are the number of standard deviations in each direction.
After the ellipsoid is drawn, the points that fall outside it can be considered outliers.
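
The "units of standard deviations" idea can be sketched with plain NumPy. The center and covariance below are assumed values chosen for illustration, and `mahalanobis_distance` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

center = np.array([0.0, 0.0])
# Assumed covariance: standard deviation 2 along x and 1 along y
covariance = np.array([[4.0, 0.0],
                       [0.0, 1.0]])


def mahalanobis_distance(point, center, covariance):
    """Distance from the center, scaled by the spread of the data."""
    diff = np.asarray(point) - center
    return float(np.sqrt(diff @ np.linalg.inv(covariance) @ diff))


# Both points are 2 units from the center in Euclidean terms
d_x = mahalanobis_distance([2.0, 0.0], center, covariance)
d_y = mahalanobis_distance([0.0, 2.0], center, covariance)
print(d_x)  # 1.0: one standard deviation along x
print(d_y)  # 2.0: two standard deviations along y
```

The same Euclidean offset counts as more extreme along the axis where the data varies less, which is why the boundary is an ellipse rather than a circle.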

In [31]:
from sklearn.covariance import EllipticEnvelope

In [32]:
ee = EllipticEnvelope(random_state=0)
y_pred = ee.fit_predict(x) == -1

We can calculate the precision and the recall scores for the predictions using the exact same code from the previous section:

In [33]:
print('Precision: {:.02%}, Recall: {:.02%} [EllipticEnvelope]'.format(
    precision_score(y, y_pred, pos_label=1),
    recall_score(y, y_pred, pos_label=1),
))

Precision: 9.00%, Recall: 45.00% [EllipticEnvelope]
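
One thing to keep in mind when reading these scores is that `EllipticEnvelope` flags a fixed share of the training samples, controlled by its `contamination` parameter, which defaults to `0.1` in scikit-learn. The sketch below sets it to `0.02`; that value is an assumption that mirrors the minority share requested via `weights`, and in practice the true share is rarely known in advance.

```python
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_classification

# Regenerate the same two-feature dataset used in this notebook
x, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=2, weights=[0.98],
                           class_sep=0.5, flip_y=0, random_state=0)

# contamination=0.02 is an assumption: it mirrors the minority share
# requested via weights, which we would not normally know in advance
ee_tight = EllipticEnvelope(contamination=0.02, random_state=0)
flagged = ee_tight.fit_predict(x) == -1

# The number of flagged samples tracks contamination * n_samples
print(flagged.sum(), "of", len(x), "samples flagged")
```

Flagging fewer samples generally trades recall for precision, so the right setting depends on which error is more costly.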


## Outlier and Novelty detection using LOF

LOF (Local Outlier Factor) compares the density of a sample to the local densities of its neighbors. A sample lying in a low-density area compared to its neighbors is considered an outlier.

Here's how we use `LocalOutlierFactor` for outlier detection:

In [35]:
from sklearn.neighbors import LocalOutlierFactor

In [43]:
lof = LocalOutlierFactor(n_neighbors=50)
y_pred = lof.fit_predict(x) == -1

In [44]:
print('Precision: {:.02%}, Recall: {:.02%} [LOF]'.format(
    precision_score(y, y_pred, pos_label=1),
    recall_score(y, y_pred, pos_label=1),
))

Precision: 26.00%, Recall: 65.00% [LOF]
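
The density idea is easier to see on a tiny one-dimensional example. In this sketch, with made-up data, four points sit one unit apart and a fifth lies far away from them, so its local density is much lower than that of its nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four evenly spaced points and one isolated point, made up for illustration
x_line = np.array([[0], [1], [2], [3], [100]])

# With so few samples, a small neighborhood size is needed
toy_lof = LocalOutlierFactor(n_neighbors=2)
toy_labels = toy_lof.fit_predict(x_line)
print(toy_labels)  # 1 marks an inlier, -1 an outlier
```

Only the isolated point is labeled `-1`, since the four evenly spaced points all have densities similar to those of their neighbors.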


Outlier detection algorithms not only give us binary predictions, but can also tell us how confident they are that a sample is an outlier.

After fitting, LOF stores a score for each training sample in `negative_outlier_factor_`. Inliers have scores close to `-1`, while a sample is more likely to be an outlier the further its score falls below `-1`. So, we can use this score and set its bottom 1%, 2%, or 10% values as outliers, and consider the rest inliers.

Here is a comparison of the performance metrics at each of the aforementioned thresholds:

In [39]:
import numpy as np

In [40]:
lof.fit(x)

for quantile in [0.01, 0.02, 0.1]:
    y_pred = lof.negative_outlier_factor_ < np.quantile(
        lof.negative_outlier_factor_, quantile)

    print('LOF: Precision: {:.02%}, Recall: {:.02%} [Quantile={:.0%}]'.format(
        precision_score(y, y_pred, pos_label=1),
        recall_score(y, y_pred, pos_label=1),
        quantile))

LOF: Precision: 80.00%, Recall: 40.00% [Quantile=1%]
LOF: Precision: 50.00%, Recall: 50.00% [Quantile=2%]
LOF: Precision: 14.00%, Recall: 70.00% [Quantile=10%]
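
Instead of a quantile, the same scores can be used to pick a fixed number of the most suspicious samples, for example when only a handful can be reviewed by hand. This is a sketch on made-up data (a Gaussian blob plus three distant points); `k` and the variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# 200 clustered points plus three distant ones appended at indices 200-202
blob = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
distant = np.array([[6.0, 6.0], [-7.0, 5.0], [8.0, -6.0]])
data = np.vstack([blob, distant])

scores = LocalOutlierFactor(n_neighbors=20).fit(data).negative_outlier_factor_

k = 3  # hypothetical review budget
# argsort is ascending, so the first k indices have the lowest scores
most_anomalous = np.argsort(scores)[:k]
print(sorted(most_anomalous.tolist()))
print(scores[most_anomalous])
```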

## Novelty detection using LOF

Aside from its use for outlier detection, the LOF algorithm can also be used for novelty detection.

When used for outlier detection, the algorithm has to be fitted on the dataset with both its inliers and outliers.

In the case of novelty detection, we are expected to fit the algorithm on inliers only and then predict on the contaminated dataset later on.

In [45]:
x_inliers = x[y==0]

lof_ = LocalOutlierFactor(n_neighbors=50, novelty=True)

In [47]:
lof_.fit(x_inliers)
y_pred = lof_.predict(x) == -1

In [48]:
print('Precision: {:.02%}, Recall: {:.02%} [LOF Novelty Detector]'.format(
    precision_score(y, y_pred, pos_label=1),
    recall_score(y, y_pred, pos_label=1),
))

Precision: 26.53%, Recall: 65.00% [LOF Novelty Detector]
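
The practical difference in novelty mode is that the fitted model can score samples it has never seen. The following sketch fits on a made-up clean dataset and then labels two new points, one near the training data and one far from it.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# Clean training data: a single Gaussian blob, made up for illustration
x_clean = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# novelty=True enables predict() on samples not seen during fit
novelty_lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
novelty_lof.fit(x_clean)

# One new point near the blob's center and one far away from it
x_new = np.array([[0.0, 0.0], [8.0, 8.0]])
new_labels = novelty_lof.predict(x_new)
print(new_labels)  # 1 marks an inlier, -1 a novelty
```

With `novelty=True`, `predict` is meant for new data only, which is why the training set and the scored set are kept separate here.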


## Detecting outliers using isolation forests

In [49]:
from sklearn.ensemble import IsolationForest

In [51]:
iforest = IsolationForest(n_estimators=200, n_jobs=-1, random_state=10)
y_pred = iforest.fit_predict(x) == -1

In [52]:
print('Precision: {:.02%}, Recall: {:.02%} [Isolation Forest]'.format(
    precision_score(y, y_pred, pos_label=1),
    recall_score(y, y_pred, pos_label=1),
))

Precision: 6.45%, Recall: 60.00% [Isolation Forest]
