[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io/)

# Machine Learning Methods

## Exercise 008 - Anomaly Detection

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 05/03/2023 | Royi Avital | First version                                                      |
|         |            |             |                                                                    |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/Exercise0008AnomalyDetection.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from lightgbm import LGBMClassifier

from sklearn.ensemble import IsolationForest
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import Pipeline
from sklearn.utils.class_weight import compute_sample_weight

# Miscellaneous
import itertools
import os
from platform import python_version
import random
import re

# Typing
from typing import Callable, List, Tuple

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

DATA_FILE_URL = r'https://raw.githubusercontent.com/nsethi31/Kaggle-Data-Credit-Card-Fraud-Detection/master/creditcard.csv'

In [None]:
# Fixel Algorithms Packages


## Exercise

In this exercise we'll use _Dimensionality Reduction_ and _UnSupervised Anomaly Detection_ for _Supervised Anomaly Detection_ based on a classifier:

 - The _dimensionality reduction_ method used is based on feature selection and not feature mixing.
 - The _unsupervised anomaly detection_ is based on _Isolation Forest_ or _Local Outlier Factor_.

The idea is as follows:

 - Use _dimensionality reduction_ to reduce the number of features.
 - Use the score of the anomaly detector as a feature.

In this exercise:

1. We'll process real world credit card fraud data.
2. We'll use a classifier to identify the fraud transactions.
3. We'll use SciKit Learn's `RFE` for feature selection.
4. We'll build an unsupervised anomaly detection model to use its decision function score as a feature.
5. We'll optimize some of the hyper parameters using a labels vector predicted by cross validation.

* <font color='brown'>(**#**)</font> Some stages require the full implementation of the stage (Not only a single line completion).
* <font color='brown'>(**#**)</font> The idea of this exercise is to show that there are clever ways to generate features.  
Yet the process suggested here might not improve the results (For instance, when the base model already uses all the information the anomaly detector uses).  
Still, using models (Usually unsupervised) to generate features is a skill worth mastering.

**Objective**: Get an `F1` score above 80% for the predicted classification using cross validation.

In [None]:
# Parameters

# Data
numSamples = 6_000
lClass      = ['Legit', 'Fraud']

# Cross Validation
numFolds    = 5


In [None]:
# Auxiliary Functions

def PlotFeatureHistogram(dfData: pd.DataFrame, featColName: str, classColName: str, hA: plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    
    sns.histplot(data = dfData, x = featColName, hue = classColName, stat = 'density', common_norm = False, multiple = 'dodge', ax = hA)
    sns.kdeplot(data = dfData, x = featColName, hue = classColName, common_norm = False, ax = hA)

    return hA

def PlotLabelsHistogram(vY: np.ndarray, hA = None, lClass = None, xLabelRot: int = None) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = (8, 6))
    
    vLabels, vCounts = np.unique(vY, return_counts = True)

    hA.bar(vLabels, vCounts, width = 0.9, align = 'center')
    hA.set_title('Histogram of Classes / Labels')
    hA.set_xlabel('Class')
    hA.set_ylabel('Number of Samples')
    hA.set_xticks(vLabels)
    if lClass is not None:
        hA.set_xticklabels(lClass)
    
    if xLabelRot is not None:
        for xLabel in hA.get_xticklabels():
            xLabel.set_rotation(xLabelRot)

    return hA

def CrossValPredWeighted( modelEst, mX: np.ndarray, vY: np.ndarray, vW: np.ndarray = None, numFolds: int = 5, stratifyMode: bool = True, seedNum: int = None ) -> np.ndarray:
    """
    modelEst - A model with `fit()` and `predict()` methods.
    mX - A NumPy array of the data.
    vY - A NumPy array of the labels.
    vW - A NumPy array of the per sample weight.
    numFolds - An integer of the number of folds.
    stratifyMode - A boolean, if `True` use stratified split, if False use regular random split.
    seedNum - An integer to set the seed of the splitters.
    """

    numSamples  = mX.shape[0]
    vYPred      = np.zeros_like(vY)

    #===========================Fill This===========================#
    # 1. Construct the K-Fold split object using `StratifiedKFold` or `KFold` according to `stratifyMode`.
    # 2. Iterate over the splits. Per split, fit the model on the training folds and predict the labels of the held out fold.
    # !! Set `shuffle = True` for the splitters.
    ????
    
    #==============================================================#
    
    return vYPred



## Generate / Load Data

In this notebook we'll use the [`creditcard`](https://www.openml.org/search?type=data&id=1597) data set.

The dataset contains transactions made by credit cards in September 2013 by European cardholders.  
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions.  
The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a **PCA transformation** in order to preserve confidentiality.

* <font color='brown'>(**#**)</font> The features `V1`, `V2`, ..., `V28` are the PCA transformed data.
* <font color='brown'>(**#**)</font> The `Class` column is the labeling where `Class = 1` means a fraud transaction.
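With frauds at 0.172% of the transactions, accuracy says little about how well frauds are detected, which is presumably why the objectives of this exercise are stated in terms of `F1`. The sketch below builds a synthetic labels vector with the class counts quoted above (Not the actual data) and scores a trivial "classifier" which labels every transaction as legit. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts as quoted in the data set description
numTransactions = 284_807
numFrauds       = 492

# Synthetic labels vector with the same class balance (Not the real data)
vYTrue = np.zeros(numTransactions, dtype = int)
vYTrue[:numFrauds] = 1

# A trivial "classifier" which labels every transaction as legit
vYTrivial = np.zeros(numTransactions, dtype = int)

accTrivial = accuracy_score(vYTrue, vYTrivial)
f1Trivial  = f1_score(vYTrue, vYTrivial, zero_division = 0) #<! No positive predictions -> F1 is set to 0

print(f'Accuracy: {accTrivial:0.4%}, F1: {f1Trivial:0.2f}')
```

The trivial model is almost always "right", yet it finds no fraud at all, and only the `F1` score reflects that.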


The **tasks** in this section:

1. Normalize the data such that each feature has zero mean and unit variance.
2. Create a sample weight vector `vW`. The `ii`-th element of the vector is the training weight of the `ii`-th sample.  
   The weight should be according to the `balanced` policy: ${w}_{i} = \frac{ \sum_{ k = 1 }^{ N } \mathbb{I} \left( {y}_{k} \neq {y}_{i} \right ) }{N}$.  
   You may use SciKit Learn's [`compute_sample_weight()`](https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_sample_weight.html).
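A minimal sketch of the two tasks on toy data (Not the credit card data). It assumes `StandardScaler`, which the notebook does not import; subtracting the mean and dividing by the standard deviation per column is equivalent. It also compares the weights of the formula above with those of `compute_sample_weight('balanced', ...)`, which computes $\frac{N}{C \cdot {N}_{{y}_{i}}}$ for $C$ classes. The two differ by a scale factor, but for two classes they give the same ratio between the classes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_sample_weight

# Toy data (Hypothetical): 8 samples, 2 features, 2 of the samples are "fraud"
rng   = np.random.default_rng(0)
mXToy = rng.normal(loc = 5.0, scale = 3.0, size = (8, 2))
vYToy = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Task 1: Zero mean and unit variance per feature (Column)
mXToyNorm = StandardScaler().fit_transform(mXToy)

# Task 2: SciKit Learn's `balanced` weights: N / (numClasses * N_c)
vWToy = compute_sample_weight('balanced', vYToy)

# The weights by the formula in the text: The fraction of samples of the other class
vWFormula = np.array([np.mean(vYToy != yi) for yi in vYToy])

# Ratio of the fraud weight to the legit weight under each definition
ratioSk      = vWToy[-1] / vWToy[0]
ratioFormula = vWFormula[-1] / vWFormula[0]
print(ratioSk, ratioFormula)
```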

In [None]:
# Loading / Generating Data

dfData = pd.read_csv(DATA_FILE_URL)


print(f'The features data shape: {dfData.shape}')

### Plot the Data

In [None]:
# Plot the Data

dfData.head()

In [None]:
# Histogram of Labels

hA = PlotLabelsHistogram(dfData['Class'], lClass = lClass)

In [None]:
# Distribution of Data

featNameDropdown = Dropdown(options = dfData.columns, value = dfData.columns[0], description = 'Feature Name')

hPlotFeatHist = lambda featColName: PlotFeatureHistogram(dfData, featColName, 'Class')
interact(hPlotFeatHist, featColName = featNameDropdown)

plt.show()

* <font color='red'>(**?**)</font> Look at different features above. Which features are good? Why?

### Pre Process Data

In [None]:
# Data

mX = dfData.drop(columns = ['Time', 'Class']).to_numpy()
vY = dfData['Class'].to_numpy()

* <font color='brown'>(**#**)</font> If the `Time` column encoded the actual time of day, we could have used it as a feature.

In [None]:
# Sub Sample Data
# Data is large, hence we'll keep a sub sample of it to make things run fast.

# Identify Anomalies
vAnomalyIdx = np.flatnonzero(vY == 1)
numAnomalies = len(vAnomalyIdx)

# Sub Sample Indices
vIdx = np.random.choice(np.flatnonzero(vY != 1), numSamples - numAnomalies, replace = False) #<! Sample without replacement to avoid duplicated samples
vIdx = np.concatenate((vIdx, vAnomalyIdx), axis = 0)

mX = mX[vIdx]
vY = vY[vIdx]


In [None]:
# Normalize Data

#===========================Fill This===========================#
# 1. Normalize data to have zero mean and unit standard deviation per feature.
?????
#===============================================================#

In [None]:
# Samples Weights

#===========================Fill This===========================#
# 1. Create a vector `vW` of length `numSamples`.
# 2. Set `vW[ii]` to have a weight which balances the classes.
# !! You may use SciKit Learn `compute_sample_weight()`.

?????
#===============================================================#

In [None]:
# Histogram of Labels

hA = PlotLabelsHistogram(vY, lClass = lClass)

In [None]:
print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')

## Stage 001

In this stage we'll build a base classifier and reduce the number of features by feature selection.  
We'll also implement an equivalent of `cross_val_predict()` which supports weighted samples.
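One possible sketch of such a weighted out of fold prediction, for reference after attempting the task. The helper name `CrossValPredWeightedSketch`, the toy data and the choice of `LogisticRegression` are assumptions for illustration, not the notebook's solution. It assumes the estimator's `fit()` accepts a `sample_weight` argument.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.utils.class_weight import compute_sample_weight

def CrossValPredWeightedSketch(modelEst, mX, vY, vW = None, numFolds = 5, stratifyMode = True, seedNum = None):
    # Hypothetical helper: Out of fold predictions where `fit()` gets per sample weights
    splitCls = StratifiedKFold if stratifyMode else KFold
    oSplit   = splitCls(n_splits = numFolds, shuffle = True, random_state = seedNum)
    vYPred   = np.zeros_like(vY)

    for vTrainIdx, vTestIdx in oSplit.split(mX, vY):
        # Weights are used only for training, the held out fold is only predicted
        vWTrain = None if vW is None else vW[vTrainIdx]
        modelEst.fit(mX[vTrainIdx], vY[vTrainIdx], sample_weight = vWTrain)
        vYPred[vTestIdx] = modelEst.predict(mX[vTestIdx])

    return vYPred

# Toy imbalanced data (Synthetic, not the credit card data)
mXToy, vYToy = make_classification(n_samples = 600, n_features = 10, weights = [0.9, 0.1], random_state = 0)
vWToy = compute_sample_weight('balanced', vYToy)

vYPredToy = CrossValPredWeightedSketch(LogisticRegression(max_iter = 1000), mXToy, vYToy, vW = vWToy, seedNum = 0)
print(f'F1 (Out of fold): {f1_score(vYToy, vYPredToy):0.3f}')
```

Each sample is predicted exactly once, by a model which did not see it during training, so the returned vector can be scored directly with `f1_score()`.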

The **tasks** in this section:

1. Implement the function `CrossValPredWeighted()` in `Auxiliary Functions` section above.
2. Set a base model (You may choose any method of a supervised classifier which has the attribute `coef_` or `feature_importances_`).
3. Apply a feature selection using SciKit Learn's [`RFE`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) (Recursive Feature Elimination).
4. Optimize, using the cross validation loop of `CrossValPredWeighted()`, the number of features in `RFE` and at least one hyper parameter of the classifier model of your choice.
5. Generate `mXX`, which holds only the features selected by the optimized `RFE`. You may use the `support_` attribute of the `RFE` object.

**Objective**: Get an `F1` score above 65% for the predicted classification using cross validation.

* <font color='red'>(**?**)</font> What's the maximum number of features to set in `RFE`? What's the minimum?
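A minimal sketch of the `RFE` mechanics on synthetic data. The estimator (`LogisticRegression`, which exposes `coef_`), the number of selected features and the toy data are assumptions for illustration. In the exercise the number of features is a hyper parameter to optimize.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy data (Synthetic): 12 features, only some of them informative
mXToy, vYToy = make_classification(n_samples = 400, n_features = 12, n_informative = 4, random_state = 0)

numFeatSel = 5 #<! Illustrative value, to be optimized in the exercise

# RFE repeatedly fits the estimator and drops the least important feature
oRfe = RFE(LogisticRegression(max_iter = 1000), n_features_to_select = numFeatSel)
oRfe = oRfe.fit(mXToy, vYToy)

# `support_` is a boolean mask over the columns of the input
mXXToy = mXToy[:, oRfe.support_]
print(mXXToy.shape)
```

The mask `oRfe.support_` can be reused to slice any array with the same columns, which is how `mXX` can be built from `mX`.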

In [None]:
?????

## Stage 002

In this section we'll generate a feature for the classifier using an _unsupervised anomaly detector_.  
To generate the feature we'll use the models: `IsolationForest` and `LocalOutlierFactor`.  
The feature will be generated based on the model's decision function.

The **tasks** in this section:

1. Set the models and the hyper parameters of the models to optimize.
2. Optimize, using the cross validation loop of `CrossValPredWeighted()`, the hyper parameters of the models:
   - Set the Anomaly Detection model per hyper parameter combination.
   - Fit it to data (`mXX`).
   - Generate the score per sample using `decision_function()`.
   - Concatenate the score to the features (`mXX`).
   - Optimize the classifier on the enriched features using `CrossValPredWeighted()`.

* <font color='brown'>(**#**)</font> You must use `novelty = True` in the `LocalOutlierFactor` model in order to have the `decision_function()` method available.

**Objective**: Improve on the `F1` score of the previous stage and get at least 80%.
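The feature generation steps above can be sketched on toy data as follows. The hyper parameter values and the data are illustrative assumptions. In the exercise they are what the optimization loop iterates over, with `mXX` in place of the toy matrix.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Toy data (Synthetic): 200 inliers around the origin and 5 far away outliers
rng    = np.random.default_rng(0)
mXXToy = np.concatenate((rng.normal(size = (200, 3)), rng.uniform(low = 6.0, high = 10.0, size = (5, 3))), axis = 0)

# Isolation Forest: Lower `decision_function()` values mean more anomalous samples
oIsoForest = IsolationForest(n_estimators = 100, random_state = 0).fit(mXXToy)
vScoreIf   = oIsoForest.decision_function(mXXToy)

# Local Outlier Factor: `novelty = True` exposes `decision_function()`.
# SciKit Learn's documentation intends it for unseen data, here it scores the fitted data as the task describes.
oLof      = LocalOutlierFactor(n_neighbors = 20, novelty = True).fit(mXXToy)
vScoreLof = oLof.decision_function(mXXToy)

# Enrich the features with the scores as additional columns
mXXEnriched = np.column_stack((mXXToy, vScoreIf, vScoreLof))
print(mXXEnriched.shape)
```

Both detectors are fitted without labels, so the scores add information derived only from the structure of the features. Whether that helps the classifier is what the cross validation loop decides.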

In [None]:
?????

* <font color='brown'>(**#**)</font> While in this notebook we optimized each task on its own, in production we'll optimize all at once.