# Local Outlier Factor script for the video

## General notes
- Works well for moderately high-dimensional datasets
- The returned LOF score represents the abnormality of each data point
- LOF is the local density deviation of a sample compared to its neighbors
- Samples are marked as outliers if they have substantially lower density than their neighbors
- The advantage of LOF is that it takes both local and global properties of the dataset into account
- The aim is not to find points that are isolated from the whole dataset, as in Isolation Forest, but points that are isolated with respect to their own neighborhood

## Outline

1. What is LOF?
2. How to calculate LOF score for each data point
3. Fitting LOF from PyOD using a sample dataset
4. Rules for choosing n_neighbors

## The script

### What is LOF?

In the previous lessons, you learned how to use a distance-based model, KNN. In this lesson, you will learn about a popular density-based algorithm called Local Outlier Factor.

LOF is a well-known algorithm that has existed since 2000. It works well with moderately high-dimensional datasets and is one of the fastest outlier classifiers. 

LOF classifies data points into inliers and outliers using a local outlier factor score, which is where the name is taken from. 

The LOF score is based on the concept of local density, where locality is defined by choosing the *k* nearest neighbors, like in KNN. The density itself is estimated from the distances between a data point and its chosen neighbors.

Data points with similar densities will form a cluster, while samples with substantially lower densities than their local neighborhood are classified as outliers.

Here, the word "local" is very important. The density of a sample is not compared to the rest of the samples in the dataset, but only to the densities of the samples in its local neighborhood.

In the plot, we can see two clusters of data points and a dozen clear outliers. The size of each red circle represents how anomalous the point is compared to its local neighborhood. The higher the LOF score, the bigger the circle.

I want you to pay attention to the two points I highlighted. Point A is an outlier, but its circle isn't very large. That's because it is much closer to its local neighborhood than point B is. Point B is far away and therefore has a much larger deviation and a lower density.
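The score described in this section can be sketched in a few lines of NumPy. This is only a minimal illustration of the standard LOF definition (k-distance, reachability distance, local reachability density), not the implementation used by PyOD. The function name `lof_scores` and the toy data are made up for this example, and it assumes Euclidean distance and no duplicate points.

```python
import numpy as np


def lof_scores(X, k=3):
    """Return one LOF score per row of X (higher means more anomalous)."""
    X = np.asarray(X, dtype=float)

    # Pairwise Euclidean distances; a point is never its own neighbor.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)

    # Indices of, and distances to, the k nearest neighbors of every point.
    idx = np.argsort(dists, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dists, idx, axis=1)

    # k-distance: distance from each point to its k-th nearest neighbor.
    k_distance = nn_dist[:, -1]

    # Reachability distance from p to neighbor o: max(k_distance(o), d(p, o)).
    reach_dist = np.maximum(nn_dist, k_distance[idx])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach_dist.mean(axis=1)

    # LOF: average density of the neighbors relative to the point's own density.
    return lrd[idx].mean(axis=1) / lrd


# Hypothetical toy data: a 3x3 grid of points plus one point far away.
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
X_toy = np.vstack([grid, [[10.0, 10.0]]])

scores_toy = lof_scores(X_toy, k=3)
print(scores_toy.round(2))
```

Points inside the grid get scores close to 1 because their density matches that of their neighbors, while the far-away point gets a much larger score.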

### Fitting LOF from PyOD using a sample dataset

Now, let's see LOF in action using the Statlog (Shuttle) dataset from the UCI Machine Learning Repository. I have preprocessed the data for this course and turned it into a simple multi-class classification problem.

In [32]:
import pandas as pd

shuttle = pd.read_csv("data/shuttle_preprocessed.csv")
shuttle.head()

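|   | time | attribute_1 | attribute_2 | attribute_3 | attribute_4 | attribute_5 | attribute_6 | attribute_7 | class |
|---|------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|-------|
| 0 | 50 | 21 | 77 | 0 | 28 | 0 | 27 | 48 | 2 |
| 1 | 53 | 0 | 82 | 0 | 52 | -5 | 29 | 30 | 1 |
| 2 | 37 | 0 | 76 | 0 | 28 | 18 | 40 | 48 | 4 |
| 3 | 37 | 0 | 79 | 0 | 34 | -26 | 43 | 46 | 1 |
| 4 | 85 | 0 | 88 | -4 | 6 | 1 | 3 | 83 | 2 |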


The dataset has 8 features and one target column, which is named class. 

In [33]:
shuttle.shape

(49097, 9)

Because LOF relies on distances, features measured on different scales must be normalized before fitting. Therefore, in lines 1-2, we import the LOF estimator from PyOD and then QuantileTransformer from scikit-learn.

Next, we define a `transform_detect` function with five parameters: `X` and `y` for the feature and target arrays, `k` for the number of neighbors, `contamination` for the expected share of outliers, and `metric` for the distance calculation.

Inside `transform_detect`, we initialize QuantileTransformer and use it to normalize the X array. Then, we fit LOF with the given hyperparameters and obtain the outlier labels.

Finally, we return the X and y arrays with the detected outliers dropped.

In [45]:
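from pyod.models.lof import LOF
from sklearn.preprocessing import QuantileTransformer


def transform_detect(X, y, k=20, contamination=0.1, metric="manhattan"):
    # Normalize into a float copy so the caller's DataFrame is not modified in place
    qt = QuantileTransformer(output_distribution="normal")
    X_scaled = X.astype("float64")
    X_scaled.loc[:, :] = qt.fit_transform(X)

    # PyOD's LOF stores training labels in labels_: 0 for inliers, 1 for outliers
    lof = LOF(n_neighbors=k, contamination=contamination, metric=metric)
    lof.fit(X_scaled)
    inliers = lof.labels_ == 0

    # Keep only the rows that were not flagged as outliers
    return X_scaled[inliers], y[inliers]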
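To see why scaling matters for a distance-based score, here is a small sketch on made-up numbers. Note that it uses plain z-scoring instead of QuantileTransformer to keep the arithmetic easy to follow; the data and the `manhattan` helper are hypothetical and only serve the illustration.

```python
import numpy as np

# Hypothetical toy data: column 0 is in the thousands, column 1 lies between 0 and 1.
X_demo = np.array([
    [1000.0, 0.1],
    [1010.0, 0.9],
    [2000.0, 0.1],
])


def manhattan(a, b):
    """Sum of absolute coordinate differences between two points."""
    return float(np.abs(a - b).sum())


# On the raw data, the large-scale column dominates every distance.
raw_d01 = manhattan(X_demo[0], X_demo[1])
raw_d02 = manhattan(X_demo[0], X_demo[2])

# After z-scoring each column, both features contribute on a comparable scale.
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
scaled_d01 = manhattan(Z[0], Z[1])
scaled_d02 = manhattan(Z[0], Z[2])

print(raw_d01, raw_d02)
print(round(scaled_d01, 3), round(scaled_d02, 3))
```

On the raw data, the difference in the second column barely registers, so neighborhoods (and therefore densities) are decided almost entirely by the first column. After scaling, the two distances are of similar size.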

Let's test the function on the Shuttle dataset. To do so, we first extract the feature and target arrays using Pandas and pass them to the `transform_detect` function with its default parameters:

In [113]:
X = shuttle.drop("class", axis=1)
y = shuttle[["class"]]

X_transformed, y_transformed = transform_detect(X, y)

By checking the shape of the transformed X array, we learn that LOF dropped 4,910 rows (49,097 - 44,187) that it flagged as outliers. That is about 10% of the data, which matches the default contamination of 0.1.

In [114]:
X_transformed.shape

(44187, 8)

Now, we create another function for evaluating a simple classifier on our outlier-free dataset. In the body of the `evaluate_classifier` function, we first partition the data into training and validation sets. 

Next, we fit the classifier to the training set and generate predictions. In the final line, we return a balanced accuracy score, which takes class imbalance into account.

In [51]:
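import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_classifier(clf, X, y):
    # Stratify so that both splits keep the original class proportions
    X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y)

    # Flatten the single-column target to 1-D, which scikit-learn expects
    clf.fit(X_train, np.ravel(y_train))
    preds = clf.predict(X_val)

    return balanced_accuracy_score(y_val, preds)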
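To make the difference between plain and balanced accuracy concrete, here is a small hand-rolled sketch. Balanced accuracy is the average of the per-class recalls; the helper `balanced_accuracy` and the toy labels are made up for this illustration and are not part of the lesson's code.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, computed by hand for illustration."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Hypothetical imbalanced labels: nine samples of class 0, one of class 1.
y_true_demo = np.array([0] * 9 + [1])
# A lazy classifier that always predicts the majority class.
y_pred_demo = np.zeros(10, dtype=int)

plain_acc = float(np.mean(y_true_demo == y_pred_demo))
balanced_acc = balanced_accuracy(y_true_demo, y_pred_demo)

print(plain_acc, balanced_acc)
```

The always-majority classifier looks good on plain accuracy but scores only 0.5 on balanced accuracy, because its recall on the minority class is zero.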

Let's put everything together.

In [115]:
X = shuttle.drop("class", axis=1)
y = shuttle[["class"]]

X_transformed, y_transformed = transform_detect(X, y)

After we generate the outlier-free dataset with the `transform_detect` function once again, we pass it to `evaluate_classifier` with Random Forest as our model:

In [116]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
b_accuracy = evaluate_classifier(clf, X_transformed, y_transformed)

In [117]:
b_accuracy

0.9780409356032493

The final balanced accuracy is about 98%, which is close to perfect!

### Simple rules for choosing n_neighbors

Like in KNN, the most important parameters of LOF are `contamination` and `n_neighbors`. As we can only tune `contamination` by trial and error, we will focus on tuning `n_neighbors`.

For `n_neighbors`, the scikit-learn documentation notes that a value of 20 appears to work well in general. When the proportion of outliers is high (greater than 10%), `n_neighbors` should be larger.

Now, it is your turn to fit an LOF estimator and tune its hyperparameters on another dataset!

### Exercises

Let's see how to do that by trying out six different values for `n_neighbors` in a loop. We first create a list of k values and an empty dictionary to store the balanced accuracy score for each k.

In [71]:
k_list = [25, 30, 40, 50, 80, 100]
scores = dict()

for k in k_list:
    X_transformed, y_transformed = transform_detect(X, y, k=k)

    clf = RandomForestClassifier()
    b_accuracy = evaluate_classifier(clf, X_transformed, y_transformed)

    scores[f"k={k}"] = b_accuracy

After the for loop finishes, the `scores` dictionary maps each `n_neighbors` value to its balanced accuracy score. Setting `n_neighbors` to 100 gives the best score (about 0.980), although the improvement over the earlier score of about 0.978 is small.

In [72]:
scores

{'k=25': 0.9785592822823691,
 'k=30': 0.9789278206214174,
 'k=40': 0.9772928732958232,
 'k=50': 0.9789704082619046,
 'k=80': 0.9794938801789923,
 'k=100': 0.9800616132195346}
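Rather than reading the best value off the printed dictionary, we can pick it programmatically. The snippet below is a small sketch that copies the scores shown above into a dictionary so it runs on its own; the names `scores_shown`, `best_key`, `best_k`, and `spread` are introduced only for this example.

```python
# Scores copied from the output above so that the snippet is self-contained.
scores_shown = {
    "k=25": 0.9785592822823691,
    "k=30": 0.9789278206214174,
    "k=40": 0.9772928732958232,
    "k=50": 0.9789704082619046,
    "k=80": 0.9794938801789923,
    "k=100": 0.9800616132195346,
}

# The key with the highest balanced accuracy, and the k value parsed from it.
best_key = max(scores_shown, key=scores_shown.get)
best_k = int(best_key.split("=")[1])

# Gap between the best and the worst setting, to judge how much k matters here.
spread = max(scores_shown.values()) - min(scores_shown.values())

print(best_k, round(spread, 4))
```

The gap between the best and worst settings is below 0.003. Since neither the train/validation split nor the Random Forest uses a fixed random seed, differences this small may partly reflect run-to-run variation rather than a real effect of `n_neighbors`.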