[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Machine Learning Methods

## UnSupervised Learning - Anomaly Detection - Local Outlier Factor (LOF)

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 26/02/2023 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/0044AnomalyDetectionLocalOutlierFactor.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import make_moons
from sklearn.neighbors import LocalOutlierFactor

# Miscellaneous
import os
import math
from platform import python_version
import random

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
from matplotlib.colors import LogNorm, Normalize, PowerNorm
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Fixel Algorithms Packages


## Anomaly Detection by Local Outlier Factor (LOF)

In this notebook we'll use the Local Outlier Factor (LOF) approach for anomaly detection.

This notebook introduces:

1. Working on synthetic data.
2. Working with the `LocalOutlierFactor` class.
3. Effect of the parameters on the detection.

* <font color='brown'>(**#**)</font> LOF is a density based anomaly detection method.
* <font color='brown'>(**#**)</font> The LOF score of a sample is the ratio between the average local density of its neighbors and its own local density.
* <font color='brown'>(**#**)</font> A score of about 1 means the sample is as dense as its neighbors. A score well above 1 suggests an outlier.

In [None]:
# Parameters

# Data
numSamples = 500
noiseLevel = 0.1

# Model
numNeighbors        = 30
contaminationRatio  = 0.05


In [None]:
# Auxiliary Functions

def PlotScatterData(mX: np.ndarray, vL: np.ndarray = None, hA: plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, markerSize: int = MARKER_SIZE_DEF, edgeColor: str = EDGE_COLOR, axisTitle: str = None) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()
    
    numSamples = mX.shape[0]

    if vL is None:
        vL = np.zeros(numSamples)
    
    vU = np.unique(vL)
    numClusters = len(vU)

    for ii in range(numClusters):
        vIdx = vL == vU[ii]
        hA.scatter(mX[vIdx, 0], mX[vIdx, 1], s = markerSize, edgecolor = edgeColor, label = ii)
    
    hA.set_xlabel('${{x}}_{{1}}$')
    hA.set_ylabel('${{x}}_{{2}}$')
    if axisTitle is not None:
        hA.set_title(axisTitle)
    hA.grid()
    hA.legend()

    return hA



## Generate / Load Data

In this notebook we'll use the [`make_moons()`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html) data generator.


In [None]:
# Loading / Generating Data

mX, vY = make_moons(n_samples = numSamples, noise = noiseLevel)

vX1 = np.linspace(-1.00, -0.50, 3)
vX2 = np.linspace(-0.75, -0.25, 3)

mX = np.concatenate((mX, np.column_stack((vX1, vX2))), axis = 0)

vX1 = np.linspace(1.50, 2.50, 3)
vX2 = np.ones(3)

mX = np.concatenate((mX, np.column_stack((vX1, vX2))), axis = 0)


print(f'The features data shape: {mX.shape}')
print(f'The features data type: {mX.dtype}')

### Plot the Data

In [None]:
# Plot the Data

hF, hA = plt.subplots(figsize = (8, 8))
hA = PlotScatterData(mX, markerSize = 50, hA = hA)
hA.set_aspect(1)
hA.set_title('Data')

plt.show()

## Applying Outlier Detection - Local Outlier Factor (LOF)

The LOF algorithm estimates the local density of each sample from the distances to its nearest neighbors and compares it to the local densities of those neighbors. A sample whose local density is much lower than that of its neighbors is marked as an outlier.
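
The density comparison described above can be sketched with NumPy only. This is a brute force illustration, not the Scikit-Learn implementation: the function name `LocalOutlierFactorScore` and the toy data are assumptions made for the example, and ties in the neighbor distances are ignored.

```python
import numpy as np

def LocalOutlierFactorScore(mX: np.ndarray, numNeighbors: int) -> np.ndarray:
    # Hypothetical helper: brute force LOF score per sample (larger means more anomalous)

    # Pairwise Euclidean distances, a sample is not allowed to be its own neighbor
    mD = np.linalg.norm(mX[:, None, :] - mX[None, :, :], axis = 2)
    np.fill_diagonal(mD, np.inf)

    mIdx   = np.argsort(mD, axis = 1)[:, :numNeighbors] # Indices of the nearest neighbors
    mDistK = np.take_along_axis(mD, mIdx, axis = 1)     # Distances to the nearest neighbors
    vKDist = mDistK[:, -1]                              # The k-distance of each sample

    # Reachability distance: the larger of the neighbor's k-distance and the actual distance
    mReach = np.maximum(vKDist[mIdx], mDistK)
    vLrd   = 1.0 / (np.mean(mReach, axis = 1) + 1e-10)  # Local reachability density

    # LOF: average density of the neighbors relative to the density of the sample
    return np.mean(vLrd[mIdx], axis = 1) / vLrd

# Toy data (assumption): a dense cluster and a single isolated sample
oRng = np.random.default_rng(0)
mC   = oRng.normal(size = (100, 2))
mP   = np.array([[20.0, 20.0]])
vLof = LocalOutlierFactorScore(np.concatenate((mC, mP), axis = 0), numNeighbors = 10)

print(f'LOF of the isolated sample: {vLof[-1]:0.2f}')
print(f'Median LOF of all samples : {np.median(vLof):0.2f}')
```

Samples inside the cluster are expected to score close to 1, while the isolated sample scores much higher since its neighbors are far denser than it is.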

In [None]:
# Applying the Model

oLofOutDet = LocalOutlierFactor(n_neighbors = numNeighbors, contamination = contaminationRatio)
vL         = oLofOutDet.fit_predict(mX) #<! Labels: 1 for inliers, -1 for outliers
vLofScore  = -oLofOutDet.negative_outlier_factor_ #<! Negated so larger values mean more anomalous
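
The sketch below illustrates how a numeric `contamination` value can turn scores into labels. To the best of my knowledge, Scikit-Learn sets the threshold to the matching percentile of `negative_outlier_factor_` and labels samples below it as `-1`. The scores and variable names here are made up for the illustration.

```python
import numpy as np

# Hypothetical LOF scores (positive convention): most are near 1, two are large
vScore = np.array([1.00, 1.10, 0.90, 1.00, 1.20, 1.00, 0.95, 1.05, 3.00, 1.00,
                   1.10, 0.98, 1.02, 1.00, 1.15, 0.97, 1.03, 1.00, 2.50, 1.00])
contaminationRatio = 0.1

vNegScore = -vScore                                             # Same sign convention as `negative_outlier_factor_`
thrVal    = np.percentile(vNegScore, 100 * contaminationRatio)  # Threshold at the contamination percentile
vLabel    = np.where(vNegScore < thrVal, -1, 1)                 # -1 for outliers, 1 for inliers

print(f'Threshold on the negated score: {thrVal:0.3f}')
print(f'Number of outliers            : {np.sum(vLabel == -1)}')
```

Under this rule, the parameter only sets where the cut is placed on the scores.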

### Plot the Model Results

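The fitted model exposes the LOF score of each sample (the negated `negative_outlier_factor_` attribute). We plot it alongside the predicted labels.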

In [None]:
# Plot the Model

hF, hA = plt.subplots(nrows = 1, ncols = 2, figsize = (14, 7))

hPathColl = hA[0].scatter(mX[:, 0], mX[:, 1], s = 50, c = vLofScore, norm = PowerNorm(0.5), edgecolors = EDGE_COLOR)
# hA[0].axis('equal')
hA[0].set_ylim((-1, 1.5))
hA[0].set_xlabel('${{x}}_{{1}}$')
hA[0].set_ylabel('${{x}}_{{2}}$')
hA[0].set_title('The LOF Score')

hA[1].scatter(mX[:, 0], mX[:, 1], s = 50, c = vL, edgecolors = EDGE_COLOR)
# hA[1].axis('equal')
hA[1].set_ylim((-1, 1.5))
hA[1].set_xlabel('${{x}}_{{1}}$')
hA[1].set_ylabel('${{x}}_{{2}}$')
hA[1].set_title(f'The LOF Outliers: Contamination = {contaminationRatio:0.2%}')

hF.colorbar(hPathColl, ax = hA[0])

plt.show()

### Analysis of the LOF Score Histogram

In [None]:
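# Plot the Histogram of the LOF Score
hF, hA = plt.subplots(figsize = (14, 7))

sns.histplot(x = vLofScore, ax = hA)
hA.set_xlabel('LOF Score')
hA.set_title('Histogram of the LOF Score')
plt.show()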

* <font color='red'>(**?**)</font> Will a change in the `contamination` parameter change the histogram above?
* <font color='green'>(**@**)</font> Think of a strategy for an adaptive outlier threshold based on the histogram.
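
One possible direction for the adaptive threshold, given as a minimal sketch and not as the intended solution: set the threshold from robust statistics of the scores (median plus a multiple of the scaled MAD) instead of a fixed contamination ratio. The helper `AdaptiveLofThreshold`, the factor `numMad = 3` and the synthetic scores are all assumptions made for the illustration.

```python
import numpy as np

def AdaptiveLofThreshold(vScore: np.ndarray, numMad: float = 3.0) -> float:
    # Hypothetical helper: median + numMad * scaled MAD of the LOF scores
    medVal = np.median(vScore)
    madVal = 1.4826 * np.median(np.abs(vScore - medVal)) # Scaled MAD, comparable to the standard deviation for Gaussian data
    return float(medVal + numMad * madVal)

# Synthetic scores (assumption): a bulk slightly above 1 and three large scores
oRng   = np.random.default_rng(0)
vScore = np.concatenate((1.0 + 0.05 * np.abs(oRng.normal(size = 200)), [2.5, 3.0, 4.0]))

thrVal     = AdaptiveLofThreshold(vScore)
vIsOutlier = vScore > thrVal

print(f'Adaptive threshold: {thrVal:0.3f}')
print(f'Number of samples above the threshold: {np.sum(vIsOutlier)}')
```

With the notebook's data, the same helper could be applied to `vLofScore`. The number of flagged samples then follows the shape of the histogram and is not fixed in advance.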