# Anomaly Detection

More often than ever, we need the ability to scan through data and find anomalies or novelties.

scikit-learn provides several novelty detection algorithms. We'll focus on Local Outlier Factor (LOF) and Isolation Forest.

References:<br>
Novelty detection with Local Outlier Factor: https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_novelty_detection.html <br>
IsolationForest example: https://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html

## Installation

In [None]:
%pip install numpy
%pip install -U matplotlib
%pip install scikit-learn
%pip install seaborn

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

## Novelty Detection with LOF

After being fitted on sample data, LOF can determine whether new samples are different enough to be considered novel.

LOF uses k-nearest neighbors (KNN) to find the neighbors of each point, then decides whether the point is novel by comparing its local density with the local densities of those neighbors. If the point's density is too low relative to its neighbors' densities, it is considered novel.
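
As a rough illustration of this density comparison, the sketch below fits LOF on a single dense cluster and scores two query points. The `demo_` names, the cluster parameters, and the seed are assumptions made up for this example; they are not part of the exercise data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical demo data: one dense cluster around the origin
demo_rng = np.random.RandomState(0)
demo_cluster = 0.3 * demo_rng.randn(100, 2)

# novelty=True lets the fitted model score points it has not seen
demo_lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
demo_lof.fit(demo_cluster)

# One query inside the cluster, one far away from it
demo_queries = np.array([[0.0, 0.0], [3.0, 3.0]])

# score_samples returns the opposite of the LOF: lower means more abnormal
demo_scores = demo_lof.score_samples(demo_queries)

# predict returns +1 for regular points and -1 for novel points
demo_labels = demo_lof.predict(demo_queries)

print(demo_scores)
print(demo_labels)
```

The far-away query sits in a region much sparser than the neighborhoods of its nearest training points, so it should receive the lower score of the two.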

### Generating the data

Let's create the data and see what it looks like.

In [None]:
# Generate training and testing data
# Note that the sets are not labelled, so this is not supervised!


# generate abnormal data


# plot the data
plt.scatter(X_train[:, 0], X_train[:, 1], color='b', label="train")
plt.scatter(X_test[:, 0], X_test[:, 1], color='g', label="test")
plt.scatter(X_outliers[:, 0], X_outliers[:, 1], color='m', label="outliers")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.title("Ground truth clusters")
plt.legend(loc="lower right")
plt.show()

<details><summary>Click to cheat</summary>

```python
# Generate training and testing data
# Note that the sets are not labelled, so this is not supervised!
X = 0.3 * np.random.randn(50, 2)
X_train = np.r_[X + 2, X - 2]
X = 0.3 * np.random.randn(50, 2)
X_test = np.r_[X + 2, X - 2]

# generate abnormal data
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))

# plot the data
plt.scatter(X_train[:, 0], X_train[:, 1], color='b', label="train")
plt.scatter(X_test[:, 0], X_test[:, 1], color='g', label="test")
plt.scatter(X_outliers[:, 0], X_outliers[:, 1], color='m', label="outliers")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.title("Ground truth clusters")
plt.legend(loc="lower right")
plt.show()
```
</details>

### Creating our Novelty Detector

In [None]:
from sklearn.neighbors import LocalOutlierFactor

# Create the LOF detector with k=20, novelty=True, and contamination=0.1


# train the LOF


# get the predictions


<details><summary>Click to cheat</summary>

```python
from sklearn.neighbors import LocalOutlierFactor

# Create the LOF detector with k=20, novelty=True, and contamination=0.1
lof = LocalOutlierFactor(n_neighbors=20, novelty=True, contamination=0.1)

# train the LOF
lof.fit(X_train)

# get the predictions
y_pred_test = lof.predict(X_test)
y_pred_outliers = lof.predict(X_outliers)
```
</details>

### Getting the number of Errors

In [None]:
# Get the number of test errors


# Get the number of outlier errors


<details><summary>Click to cheat</summary>

```python
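# Get the number of test errors:
# regular test points that were wrongly flagged as novel (-1)
n_errors_test = y_pred_test[y_pred_test == -1].size

# Get the number of outlier errors:
# abnormal points that were wrongly accepted as regular (+1)
n_errors_outliers = y_pred_outliers[y_pred_outliers == 1].size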
```
</details>
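
Raw counts are easier to compare once they are turned into rates. Here is a small sketch on hypothetical prediction arrays (the `demo_` arrays are made up and only stand in for `y_pred_test` and `y_pred_outliers`), assuming the +1/-1 labels that `predict` returns:

```python
import numpy as np

# Hypothetical predictions: +1 means regular, -1 means novel
demo_pred_regular = np.array([1, 1, -1, 1])   # all of these points are truly regular
demo_pred_abnormal = np.array([-1, 1, -1])    # all of these points are truly abnormal

# Regular points flagged as novel
demo_false_alarms = int(np.sum(demo_pred_regular == -1))

# Abnormal points accepted as regular
demo_missed = int(np.sum(demo_pred_abnormal == 1))

# Divide by the size of each set to get error rates
demo_false_alarm_rate = demo_false_alarms / demo_pred_regular.size
demo_miss_rate = demo_missed / demo_pred_abnormal.size

print(demo_false_alarm_rate, demo_miss_rate)
```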

### Plotting the Data
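
The frontier drawn in this section is the contour where `decision_function` equals 0. The sketch below, on made-up data with hypothetical `demo_` names, shows how the sign of that function relates to the labels from `predict`:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical demo data, separate from the exercise data
demo_rng = np.random.RandomState(42)
demo_train = 0.3 * demo_rng.randn(100, 2)
demo_new = demo_rng.uniform(low=-4, high=4, size=(10, 2))

demo_detector = LocalOutlierFactor(n_neighbors=20, novelty=True, contamination=0.1)
demo_detector.fit(demo_train)

demo_values = demo_detector.decision_function(demo_new)
demo_labels = demo_detector.predict(demo_new)

# Negative decision values map to -1 (novel), all others to +1 (regular)
demo_labels_from_sign = np.where(demo_values < 0, -1, 1)

print(np.array_equal(demo_labels, demo_labels_from_sign))
```

This is why the zero-level contour separates the region treated as regular from the region treated as novel.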

In [None]:
xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))

# evaluate the decision function on the grid so we can plot the learned frontier and the points
Z = lof.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

fig = plt.figure(figsize=(8, 8))

plt.title("Novelty Detection with LOF")
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap=plt.cm.PuBu)
a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors="darkred")
plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors="palevioletred")

s = 40
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c="white", s=s, edgecolors="k")
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c="blueviolet", s=s, edgecolors="k")
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c="gold", s=s, edgecolors="k")
plt.axis("tight")
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend(
    [a.legend_elements()[0][0], b1, b2, c],
    [
        "learned frontier",
        "training observations",
        "new regular observations",
        "new abnormal observations",
    ],
    loc="upper left",
    prop=matplotlib.font_manager.FontProperties(size=11),
)
plt.xlabel(
    f"errors novel regular: {n_errors_test}/{len(X_test)} ; "
    f"errors novel abnormal: {n_errors_outliers}/{len(X_outliers)}"
)
plt.show()

## Isolation Forests

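An alternative to LOF is the isolation forest. Each tree in the forest works by randomly selecting a feature, then randomly choosing a split value between the minimum and maximum values of that feature. Repeating these splits eventually isolates every point, and anomalies tend to be isolated after fewer splits than regular points, so a short average path length across the trees indicates an anomaly.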
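
To make the idea concrete, here is a minimal sketch that fits a forest on one dense cluster and scores two query points. The `demo_` names, the cluster parameters, and `random_state=0` are assumptions chosen for this illustration. In scikit-learn, `score_samples` returns lower values for points that are isolated after fewer random splits.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical demo data: one dense cluster around the origin
demo_rng = np.random.RandomState(0)
demo_cluster = 0.3 * demo_rng.randn(200, 2)

# random_state is fixed only to make the sketch repeatable
demo_forest = IsolationForest(n_estimators=100, random_state=0)
demo_forest.fit(demo_cluster)

# One query inside the cluster, one far away from it
demo_queries = np.array([[0.0, 0.0], [3.0, 3.0]])

# Lower score means the point is easier to isolate, i.e. more abnormal
demo_scores = demo_forest.score_samples(demo_queries)

print(demo_scores)
```

The far-away query should get the lower score because very few random splits are needed to separate it from the cluster.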

### Data

We're going to use the same synthetic data, so no need to generate new data.

### Create the Novelty Detector

In [None]:
from sklearn.ensemble import IsolationForest

# Create the IF that uses 100 maximum samples


# train the model


# get the predictions for all three datasets


<details><summary>Click to cheat</summary>

```python
from sklearn.ensemble import IsolationForest

# Create the IF that uses 100 maximum samples
isoForest = IsolationForest(max_samples=100)

# train the model
isoForest.fit(X_train)

# get the predictions for all three datasets
y_pred_train = isoForest.predict(X_train)
y_pred_test = isoForest.predict(X_test)
y_pred_outliers = isoForest.predict(X_outliers)
```
</details>

### Plot the results

In [None]:
fig = plt.figure(figsize=(7, 7))
# evaluate the decision function on a grid, then plot it together with the samples
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = isoForest.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("IsolationForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c="white", s=20, edgecolor="k")
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c="green", s=20, edgecolor="k")
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c="red", s=20, edgecolor="k")
plt.axis("tight")
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend(
    [b1, b2, c],
    ["training observations", "new regular observations", "new abnormal observations"],
    loc="upper left",
)
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.show()