[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Machine Learning Methods

## UnSupervised Learning - Anomaly Detection - Local Outlier Factor (LOF) - Exercise Solution

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 27/02/2023 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/0045AnomalyDetectionLocalOutlierFactorExerciseSolution.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.neighbors import LocalOutlierFactor

# Miscellaneous
import os
import math
from platform import python_version
import random

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
#%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

DATA_FILE_URL = r'https://github.com/FixelAlgorithmsTeam/FixelCourses/raw/master/DataSets/NewYorkTaxiDrives.csv'

In [None]:
# Fixel Algorithms Packages


## Anomaly Detection by Local Outlier Factor (LOF)

In this exercise we'll use the LOF algorithm to identify outliers in time series data.  
The data is the number of taxi drives in New York City between 01/07/2014 and 01/02/2015 (over 6 months).

In this notebook:

 - Build time series features.
 - Fit the LOF model to the data.
 - Visualize outliers.
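
As a reminder of the API used in this notebook, here is a minimal sketch on synthetic 2D data. The data and the variable names are assumptions for illustration and are not part of the exercise. SciKit Learn stores the negated LOF in `negative_outlier_factor_`, so flipping its sign gives a score where values around 1 are inliers and larger values are more anomalous.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (assumption for illustration): a Gaussian cluster and a single far away point.
rng = np.random.default_rng(0)
mX  = np.vstack((rng.normal(size = (100, 2)), [[10.0, 10.0]]))

oLof      = LocalOutlierFactor(n_neighbors = 20)
vL        = oLof.fit_predict(mX)            # +1 for inliers, -1 for outliers
vLofScore = -oLof.negative_outlier_factor_  # Flip the sign: larger means more anomalous

print(f'LOF score of the far point: {vLofScore[-1]:0.2f}')
print(f'Median LOF score          : {np.median(vLofScore):0.2f}')
```

The far point should get a score well above the rest, while the bulk of the cluster stays close to 1.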

In [None]:
# Parameters

# Feature Generation
lWinLength      = [12, 24, 48, 12, 24, 48, 24, 48]
lWinOperators   = ['Mean', 'Mean', 'Mean', 'Standard Deviation', 'Standard Deviation', 'Standard Deviation', 'Median', 'Median'] #<! One operator per window length

# Model
#===========================Fill This===========================#
# 1. Set the parameters of the LOF Model.
# !! Tweak this after looking at the data.
numNeighbors        = 30
contaminationRatio  = 0.05
#===============================================================#

# Anomaly
#===========================Fill This===========================#
# 1. Set the threshold for the LOF score.
# !! Tweak this after looking at the data.
# !! Use the guidelines as studied.
lofScoreThr = 1.5
#===============================================================#


In [None]:
# Auxiliary Functions

def PlotScatterData(mX: np.ndarray, vL: np.ndarray, hA:plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, markerSize: int = MARKER_SIZE_DEF, lineWidth: int = LINE_WIDTH_DEF, axisTitle: str = None):

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()
    
    vU = np.unique(vL)
    numClusters = len(vU)

    for ii in range(numClusters):
        vIdx = vL == vU[ii]
        hA.scatter(mX[vIdx, 0], mX[vIdx, 1], s = ELM_SIZE_DEF, edgecolor = EDGE_COLOR, label = ii)
    
    hA.set_xlabel('${{x}}_{{1}}$')
    hA.set_ylabel('${{x}}_{{2}}$')
    if axisTitle is not None:
        hA.set_title(axisTitle)
    hA.grid()
    hA.legend()

    return hA


## Generate / Load Data

The data set is composed of a time stamp (at a resolution of 30 minutes) and the number of drives.

In [None]:
## Generate / Load Data

dfData = pd.read_csv(DATA_FILE_URL)


print(f'The features data shape: {dfData.shape}')

In [None]:
# Display the Data Frame

dfData.head(10)

### Pre Process

Convert the `Time Stamp` strings into the Pandas datetime format.
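
A minimal sketch of the conversion on a toy series. The string format below is an assumption for illustration; the actual format of the `Time Stamp` column in the CSV file may differ.

```python
import pandas as pd

# Toy time stamps (assumed format) at a 30 minutes resolution.
dsT    = pd.Series(['2014-07-01 00:00:00', '2014-07-01 00:30:00', '2014-07-01 01:00:00'])
dsTime = pd.to_datetime(dsT)

print(dsTime.dtype)              # A datetime dtype instead of `object`
print(dsTime.dt.hour.to_list())  # The `.dt` accessor exposes the date / time components
```

Once the column is a datetime, it can serve as the index for time based operations such as `resample()`.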

In [None]:
# Convert the `Time Stamp` column into valid Pandas time stamp

#===========================Fill This===========================#
# 1. Use Pandas' `to_datetime()` to convert the `Time Stamp` column.
dfData['Time Stamp'] = pd.to_datetime(dfData['Time Stamp'])
#===============================================================#

### Plot the Data 

In [None]:
# Plot the Data Using PlotLy
# This will create an interactive plot of the data (You may zoom in and out).
hF = px.line(data_frame = dfData, x = 'Time Stamp', y = ['Drives'], title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.show()

* <font color='red'>(**?**)</font> Do you see some patterns in the data?
* <font color='red'>(**?**)</font> Can you spot some outliers? Why?

## Feature Engineering

Feature engineering for time series is an art.  
Yet the basic approach is to slide a window over the data and extract statistical features from it: mean, standard deviation, median, etc.  

The `Pandas` package has a simple way to generate windows using the [`rolling()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html) method.
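
A minimal sketch of `rolling()` on a toy series (the values are an assumption for illustration). With an integer window, the result is defined only once the window is full, so the first `winLength - 1` entries are `NaN`.

```python
import numpy as np
import pandas as pd

# Toy series (assumption for illustration).
dsI       = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
winLength = 3

dsMean = dsI.rolling(winLength).mean()  # Mean of the current sample and the 2 samples before it
dsStd  = dsI.rolling(winLength).std()   # Sample standard deviation over the same window

print(pd.DataFrame({'Value': dsI, 'Mean': dsMean, 'Std': dsStd}))
```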

In [None]:
# Resample Data for Hour Resolution
dfData = dfData.set_index('Time Stamp', drop = True, inplace = False)

# Resample per hour by summing (Resampling works along the index by default)
dfData = dfData.resample('H').sum()

In [None]:
# Display Result

dfData.head(10)

In [None]:
# Plot the Data Using PlotLy
hF = px.line(data_frame = dfData, x = dfData.index, y = ['Drives'], title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.show()

In [None]:
# Rolling Window Operator

def ApplyRollingWindow( dsI: pd.Series, winLength: int, winOperator: str ) -> pd.Series:
    # dsI - Input data series.
    # winLength - The window length to calculate the feature.
    # winOperator - The operation to apply on the window.

#===========================Fill This===========================#
# 1. Apply window functions by the string in `winOperator`: 'Standard Deviation', 'Median', 'Mean'.
# 2. Look at `rolling()`, `std()`, `median()` and `mean()`.
# 3. The pattern should be chaining the operation to the rolling operation: `dsI.rolling(winLength).std()`.
    if winOperator == 'Standard Deviation':
        dsO = dsI.rolling(winLength).std()
    elif winOperator == 'Median':
        dsO = dsI.rolling(winLength).median()
    else:
        dsO = dsI.rolling(winLength).mean()
#===============================================================#
    
    return dsO


* <font color='green'>(**@**)</font> You may add more statistical features.
* <font color='red'>(**?**)</font> Are those features applicable for this method?

In [None]:
# Apply the Feature Extraction / Generation

lColNames = ['Drives']
for winLen, opName in zip(lWinLength, lWinOperators):
    colName = opName + f'{winLen:03d}'
    lColNames.append(colName)
    dfData[colName] = ApplyRollingWindow(dfData['Drives'], winLen, opName)

* <font color='green'>(**@**)</font> You may tweak the selection of window length and operation.

In [None]:
# Display Results on the Data Frame

dfData.head(20)

* <font color='red'>(**?**)</font> Why are there `NaN` values?

In [None]:
# Plot the Data Using PlotLy
hF = px.line(data_frame = dfData, x = dfData.index, y = lColNames, title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')
hF.show()

* <font color='green'>(**@**)</font> Replace the features with local features such as:
  - Ratio of the value to the mean value (scaled by the STD).
  - Ratio of the value to the median value (scaled by the median deviation).
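
One possible reading of these local features, as a sketch: the deviation of each value from the rolling mean scaled by the rolling STD (a z score), and the deviation from the rolling median scaled by the rolling median absolute deviation. The helper `CalcLocalFeatures()` and the toy series are hypothetical and not part of the original notebook. Windows with zero spread yield `inf` / `NaN` values which must be handled before fitting a model.

```python
import numpy as np
import pandas as pd

def CalcLocalFeatures( dsI: pd.Series, winLength: int ) -> pd.DataFrame:
    # Hypothetical helper (Not part of the original notebook).
    oRoll = dsI.rolling(winLength)

    dsMean   = oRoll.mean()
    dsStd    = oRoll.std()
    dsMedian = oRoll.median()
    # Median absolute deviation of each window around the window's median
    dsMad    = oRoll.apply(lambda vX: np.median(np.abs(vX - np.median(vX))), raw = True)

    dfO = pd.DataFrame(index = dsI.index)
    dfO['Z Score']        = (dsI - dsMean) / dsStd
    dfO['Robust Z Score'] = (dsI - dsMedian) / dsMad

    return dfO

# Toy series (assumption for illustration) with a spike at index 5.
dsT = pd.Series([10.0, 11.0, 9.0, 10.0, 12.0, 30.0, 11.0, 10.0])
dfO = CalcLocalFeatures(dsT, 4)
print(dfO)
```

In this toy case both features peak at the spike, yet the robust variant separates it far more clearly since the median and the median absolute deviation are less affected by the spike itself.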

### Handle Missing Values

Our model cannot handle missing values.  
Hence we must impute or remove them.
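
A minimal sketch of both options on a toy data frame (an assumption for illustration). Back filling with `bfill()` gives the leading `NaN` values the first valid value of their column, while `dropna()` removes any row with a missing value. Note that `bfill()` also fills `NaN` values in the middle of a column, which does not happen with rolling window features over complete data.

```python
import numpy as np
import pandas as pd

# Toy data frame (assumption for illustration) with leading NaN values, as produced by `rolling()`.
dfT = pd.DataFrame({'A': [np.nan, np.nan, 3.0, 4.0], 'B': [np.nan, 2.0, 3.0, 4.0]})

dfImputed = dfT.bfill()   # Impute: propagate the next valid value backwards
dfDropped = dfT.dropna()  # Remove: drop every row which contains a NaN

print(dfImputed)
print(dfDropped)
```

Imputing keeps the number of samples and the alignment with the time index, at the cost of a few slightly artificial feature values at the start of the series.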

In [None]:
# Set the leading NaN values of each column to the first non NaN value in the column

#===========================Fill This===========================#
# 1. Loop over each column of the data frame.
# 2. Find the first valid index in each column (Use `first_valid_index()`).
# 3. Fill the NaN's up to the first valid value with the valid value.
for colName in lColNames:
    dsT = dfData[colName]
    firstValIdx = dsT.first_valid_index()
    dfData.loc[:firstValIdx, colName] = dfData.loc[firstValIdx, colName]
#===============================================================#

In [None]:
# Display the Results
# Should be no NaN's.

dfData

## The LOF Model

In [None]:
# Build the Model

#===========================Fill This===========================#
# 1. Construct the model.
# 2. Use `fit_predict()` on the data.
# 3. Extract the LOF Score.
# !! Mind the default LOF score sign.
oLofOutDet = LocalOutlierFactor(n_neighbors = numNeighbors, contamination = contaminationRatio)
vL         = oLofOutDet.fit_predict(dfData)
vLofScore  = -oLofOutDet.negative_outlier_factor_
#===============================================================#

In [None]:
# Plot the Data Using PlotLy
hF = px.histogram(x = vLofScore, title = 'LOF', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400)

hF.show()

* <font color='red'>(**?**)</font> What threshold would you set?
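
When thinking about the threshold, it may help to see how the `contamination` parameter relates to it. The sketch below uses synthetic data and illustrative variable names (both are assumptions). In SciKit Learn, a numeric `contamination` sets `offset_` to the matching percentile of `negative_outlier_factor_`, and `fit_predict()` labels as outliers the samples below it. A manual threshold on the LOF score, as used in this notebook, is independent of that choice.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
mX  = rng.normal(size = (200, 3))  # Synthetic data (assumption for illustration)

oLof      = LocalOutlierFactor(n_neighbors = 30, contamination = 0.05)
vL        = oLof.fit_predict(mX)
vLofScore = -oLof.negative_outlier_factor_

# The threshold implied by `contamination`, expressed on the (positive) LOF score scale
impliedThr  = -oLof.offset_
# The same idea done manually: a high quantile of the LOF score
quantileThr = np.quantile(vLofScore, 0.95)

print(f'Implied threshold   : {impliedThr:0.3f}')
print(f'Quantile threshold  : {quantileThr:0.3f}')
print(f'Flagged by the model: {np.sum(vL == -1)}')
```

A quantile based threshold always flags a fixed fraction of the samples, even when the data is clean. A fixed score threshold flags only samples which are actually less dense than their neighborhood.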

In [None]:
# Set the LOF Score
dfData['LOF Score'] = vLofScore

In [None]:
# Set Anomaly

dfData['Anomaly'] = 0

dfData.loc[dfData['LOF Score'] > lofScoreThr,'Anomaly'] = 1

In [None]:
# Plot Anomalies 
hF = px.line(data_frame = dfData, x = dfData.index, y = ['Drives'], title = 'NYC Taxi Drives', template = 'plotly_dark')
hF.update_layout(autosize = False, width = 1200, height = 400, legend_title_text = 'Legend')

hF.add_scatter(x = dfData[dfData['Anomaly'] == 1].index, y = dfData.loc[dfData['Anomaly'] == 1, 'Drives'], name = 'Anomaly', mode = 'markers')

hF.show()