# Credit Card Fraud Detection using Anomaly Detection | Part 2b (Tuning)

**Problem**: Predict whether a credit card transaction is fraudulent based on its details. Model past transactions to extract the patterns that hint at fraud, so that all frauds are detected and false positives are minimised.

**Evaluation**: Recall, PR-AUC, F1, and Precision@t, computed against the available labels, will be used for fine-tuning and evaluation.
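
As a minimal sketch of how these metrics can be computed with scikit-learn, the snippet below uses made-up labels and anomaly scores. It assumes that `t` denotes a decision threshold on the anomaly score and that PR-AUC is approximated by average precision; the arrays and the value of `t` are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score)

# Toy labels (1 = fraud) and anomaly scores (higher = more suspicious); illustrative only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.9, 0.6, 0.8, 0.05, 0.25, 0.4])

t = 0.5  # hypothetical decision threshold
y_pred = (scores >= t).astype(int)

recall = recall_score(y_true, y_pred)             # share of frauds that were caught
precision_at_t = precision_score(y_true, y_pred)  # share of alerts that were real frauds
f1 = f1_score(y_true, y_pred)                     # harmonic mean of the two
pr_auc = average_precision_score(y_true, scores)  # threshold-free summary of the PR curve

print(recall, precision_at_t, f1, pr_auc)
```

Recall, precision and F1 depend on the chosen threshold, whereas average precision is computed from the raw scores and summarises all thresholds at once.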

**Potential Solution Framework**: Since we have enough labeled data, we use the fully supervised anomaly detection setting (learning the data structure from labels) with two approaches; this part covers approach b. Note: one thing is common to both, though: we try to learn the underlying "normal" distribution and draw a threshold boundary to weed out anomalies.

**-----b. "End-to-end fully-supervised"-----**
    - Pass whatever labeled training-fold data is available (normal, anomalous, or both) to the models
    - Then use those models to check and validate on whatever valid-fold data remains
    - Pros/Cons:
      i. We can use GridSearchCV directly
      ii. OCSVM would not perform well on such mixed data

**Existing intuitions on algorithms (based on performance on 2D datasets)** -
Source (https://scikit-learn.org/stable/auto_examples/plot_anomaly_comparison.html#sphx-glr-auto-examples-plot-anomaly-comparison-py):
- IF and LOF are good when we have multimodal data. LOF is better when the modes have different densities (the local aspect of LOF)
- OCSVM is sensitive to outliers and generally doesn't perform well for outlier detection (OD), although it is good for novelty detection (ND) when the training data is uncontaminated. Depending on the hyperparameter values it could still give useful results
- EllipticEnvelope assumes a Gaussian distribution and thus learns an ellipse. It is not good for multimodal data but is robust to outliers
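
To make these intuitions easier to probe, here is a rough sketch that fits the four detectors on a synthetic 2D dataset with two modes of different density plus uniform noise, then counts how many unseen points each one flags. The data generator, the `sample` helper and all hyperparameter values are assumptions chosen for illustration; they are not the settings used later in this notebook.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

def sample(n_in, n_out):
    """Two Gaussian modes with different densities plus uniform background noise."""
    inliers = np.vstack([
        0.3 * rng.randn(n_in // 2, 2) + [2, 2],    # tight mode
        1.0 * rng.randn(n_in // 2, 2) + [-2, -2],  # diffuse mode
    ])
    outliers = rng.uniform(low=-6, high=6, size=(n_out, 2))
    return np.vstack([inliers, outliers])

X_fit = sample(300, 30)   # contaminated training data (outlier-detection setting)
X_new = sample(300, 30)   # unseen data drawn the same way

contamination = 30 / 330  # known only because the data is synthetic
detectors = {
    "IF": IsolationForest(contamination=contamination, random_state=0),
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=contamination, novelty=True),
    "OCSVM": OneClassSVM(nu=contamination, gamma=0.1),
    "EllipticEnvelope": EllipticEnvelope(contamination=contamination, random_state=0),
}

flagged = {}
for name, detector in detectors.items():
    pred = detector.fit(X_fit).predict(X_new)  # +1 = inlier, -1 = outlier
    flagged[name] = int((pred == -1).sum())
    print(f"{name}: flagged {flagged[name]} of {len(X_new)} unseen points")
```

Plotting the decision boundaries, as the linked scikit-learn example does, gives a clearer picture than the counts alone.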

**Useful material and references**
- https://escholarship.org/uc/item/1f03f6hb#main
- https://www.hindawi.com/journals/complexity/2019/2686378/
- https://imada.sdu.dk/~zimek/InvitedTalks/TUVienna-2016-05-18-outlier-evaluation.pdf
- https://www.gta.ufrj.br/~alvarenga/files/CPE826/Ahmed2016-Survey.pdf
 

## 1. Importing Libraries

In [56]:
import numpy as np
import pandas as pd; pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings; warnings.filterwarnings("ignore")
import pickle

#Importing data processing and prep libraries
from sklearn.preprocessing import StandardScaler, RobustScaler   #RobustScaler robust to outliers
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV  #for hyperparameter tuning

#Importing machine learning algo libraries
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

#Importing evaluation focussed libraries
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve
from sklearn.metrics import f1_score, recall_score, average_precision_score

#Other useful libraries
#!pip install missingno   
import missingno as missviz   #Third-party library for missing value inspections
from sklearn.manifold import TSNE   #For visualising high dimensional data

## 2. Getting relevant data

In [21]:
#Importing
with open("ADS_b.pkl", "rb") as f:
    X_train, X_test, y_train, y_test, X_cols, y_cols = pickle.load(f)
    
#Basic details
print("Details of training data - shape of predictor matrix is {I} and # ones in target series is {J}".format(I=X_train.shape, J=sum(y_train)))
print("Details of testing data - shape of predictor matrix is {I} and # ones in target series is {J}".format(I=X_test.shape, J=sum(y_test)))

Details of training data - shape of predictor matrix is (227845, 30) and # ones in target series is 389
Details of testing data - shape of predictor matrix is (56962, 30) and # ones in target series is 103


## 3. Hyperparameter tuning and model selection

**3.1 Splitting data into "small" and "big" parts**
- Note: We'll train and tune on the whole data in Google Colab with GPUs

In [79]:
#Splitting train and test into "small" and "big" parts to expedite training and tuning
seed=123
small_frac=0.05
X_train_big, X_train_small, y_train_big, y_train_small = train_test_split(X_train, y_train, test_size=small_frac, stratify=np.array(y_train), random_state=seed)
X_test_big, X_test_small, y_test_big, y_test_small = train_test_split(X_test, y_test, test_size=small_frac, stratify=np.array(y_test), random_state=seed)
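
With 389 frauds in the training set, a 5% stratified sample holds only about 19 of them, so each of the 3 cross-validation folds used below sees a handful of positives on average. The sketch below checks that a stratified split preserves the positive rate; it uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names) because the pickled dataset is not assumed to be available.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(123)

# Synthetic stand-in: 20,000 rows with a 0.2% positive rate (illustrative only).
X_demo = rng.randn(20000, 3)
y_demo = np.zeros(20000, dtype=int)
y_demo[:40] = 1

# Same call pattern as above: the "small" part is the 5% test-side of the split.
X_big, X_small, y_big, y_small = train_test_split(
    X_demo, y_demo, test_size=0.05, stratify=y_demo, random_state=123)

print("positive rate, full :", y_demo.mean())
print("positive rate, small:", y_small.mean())
print("positives in small  :", int(y_small.sum()), "of", len(y_small))
```

Running the same three prints on `y_train` and `y_train_small` shows how many frauds the tuning step actually gets to learn from.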

**3.2 Creating dictionary of "initialized" models and their hyperparameters**

In [109]:
models={
    "IF": IsolationForest(random_state=seed),
    "LOF": LocalOutlierFactor(novelty=True),
    "OCSVM": OneClassSVM()}   #random_state dropped: OneClassSVM ignored it and recent scikit-learn versions do not accept it

model_hp={
    "IF": {"contamination": [0.0001, 0.001, 0.0025, 0.005, 0.01], "max_samples": list(range(10,300,60)), "n_estimators": [10,50,100,200,500]},
    "LOF": {"contamination": [0.0001, 0.001, 0.0025, 0.005, 0.01], "n_neighbors": [5,10,20,50,100]},
    "OCSVM": {"nu": [0.0001, 0.001, 0.0025, 0.005, 0.01], "kernel": ["linear", "rbf", "poly"], "gamma": np.power(10.0, range(-3,2))}}

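def my_f1_score(model, X, y):
    #Custom scorer for GridSearchCV: the detectors predict +1 (inlier) / -1 (outlier),
    #so map predictions to 0 (normal) / 1 (fraud) before computing F1 on the fraud class
    y_pred=model.predict(X)
    return f1_score(y, np.where(y_pred==1,0,1))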
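
The label mapping used by the scorer can be sketched in isolation as follows. The snippet also shows a possible threshold-free alternative based on `score_samples`, which is not what this notebook uses for tuning; `to_fraud_labels`, `f1_scorer`, `pr_auc_scorer` and the toy data are hypothetical names and values introduced for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, f1_score

def to_fraud_labels(y_pred):
    """Map detector output (+1 inlier, -1 outlier) to 0 = normal, 1 = fraud."""
    return np.where(y_pred == 1, 0, 1)

def f1_scorer(model, X, y):
    """Callable scorer with the (estimator, X, y) signature GridSearchCV expects."""
    return f1_score(y, to_fraud_labels(model.predict(X)))

def pr_auc_scorer(model, X, y):
    """Alternative scorer: score_samples is lower for anomalies, so negate it."""
    return average_precision_score(y, -model.score_samples(X))

# Toy data: a Gaussian cloud of normal points and a few far-away anomalies.
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(500, 2), rng.uniform(6, 8, size=(10, 2))])
y_demo = np.r_[np.zeros(500, dtype=int), np.ones(10, dtype=int)]

model = IsolationForest(contamination=0.02, random_state=0).fit(X_demo)
f1_value = f1_scorer(model, X_demo, y_demo)
pr_auc_value = pr_auc_scorer(model, X_demo, y_demo)
print("F1:", f1_value)
print("PR-AUC:", pr_auc_value)
```

An F1-based scorer depends on the `contamination` (or `nu`) threshold, so tuning it partly tunes the threshold; a score-based metric such as average precision separates ranking quality from the threshold choice.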

**3.3 Training and tuning models**

In [190]:
results=[]

for model_name in models.keys():
    print("-----Running GridsearchCV for {M}-----".format(M=model_name))
    model=models[model_name]
    hp=model_hp[model_name]
    GSCV=GridSearchCV(model, hp, cv=3, scoring=my_f1_score, n_jobs=-1, verbose=10)
    %time GSCV.fit(X_train_small, y_train_small)
    results.append([model_name, GSCV.best_params_, GSCV.best_score_, pd.DataFrame(GSCV.cv_results_)])
    
results_df=pd.DataFrame(results, columns=["model_name", "best_hp", "best_score", "CV_results"])

-----Running GridsearchCV for IF-----
Fitting 3 folds for each of 125 candidates, totalling 375 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    8.9s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   11.6s
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:   16.1s
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:   21.3s
[Parallel(n_jobs=-1)]: Done  69 tasks      | elapsed:   27.2s
[Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed:   31.7s
[Parallel(n_jobs=-1)]: Done  97 tasks      | elapsed:   37.3s
[Parallel(n_jobs=-1)]: Done 112 tasks      | elapsed:   44.0s
[Parallel(n_jobs=-1)]: Done 129 tasks      | elapsed:   50.2s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   59.2s
[Parallel(n_jobs=-1)]: Done 165 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  1

CPU times: user 4.03 s, sys: 1.6 s, total: 5.62 s
Wall time: 2min 38s
-----Running GridsearchCV for LOF-----
Fitting 3 folds for each of 25 candidates, totalling 75 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   18.1s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   36.5s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:   47.1s
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done  68 out of  75 | elapsed:  2.5min remaining:   15.5s
[Parallel(n_jobs=-1)]: Done  75 out of  75 | elapsed:  2.7min finished


CPU times: user 7.38 s, sys: 255 ms, total: 7.63 s
Wall time: 2min 45s
-----Running GridsearchCV for OCSVM-----
Fitting 3 folds for each of 75 candidates, totalling 225 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:    2.7s
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    4.4s
[Parallel(n_jobs=-1)]: Done  69 tasks      | elapsed:    5.2s
[Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-1)]: Done  97 tasks      | elapsed:    7.0s
[Parallel(n_jobs=-1)]: Done 112 tasks      | elapsed:   11.5s
[Parallel(n_jobs=-1)]: Done 129 tasks      | elapsed:   13.9s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   16.4s
[Parallel(n_jobs=-1)]: Done 165 tasks      | elapsed:   34.6s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   

CPU times: user 2.15 s, sys: 302 ms, total: 2.45 s
Wall time: 1min 16s


[Parallel(n_jobs=-1)]: Done 225 out of 225 | elapsed:  1.3min finished


In [189]:
results_df["CV_results"].apply(lambda x: x["mean_test_score"].max())

0    0.396827
1    0.599412
2    0.245169
Name: CV_results, dtype: float64
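
One detail worth checking when reading these scores: for estimators that are not classifiers, an integer `cv` in `GridSearchCV` resolves to plain `KFold` without shuffling, so the folds are not stratified by the fraud label. The sketch below, on synthetic data with hypothetical names (`f1_on_anomalies`, `X_demo`, `y_demo`) and an illustrative grid, passes an explicit `StratifiedKFold` instead and then ranks the candidates from `cv_results_`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def f1_on_anomalies(model, X, y):
    # Detectors return +1 / -1; map to 0 (normal) / 1 (anomaly) before scoring.
    return f1_score(y, np.where(model.predict(X) == 1, 0, 1))

# Synthetic stand-in data: 600 normal points and 30 shifted anomalies.
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(600, 4), 0.5 * rng.randn(30, 4) + 6])
y_demo = np.r_[np.zeros(600, dtype=int), np.ones(30, dtype=int)]

# Explicit stratified, shuffled folds so every fold contains some anomalies.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    IsolationForest(random_state=0),
    {"contamination": [0.01, 0.05, 0.1], "n_estimators": [50, 100]},
    cv=folds, scoring=f1_on_anomalies)
search.fit(X_demo, y_demo)

# Rank all candidates rather than looking only at the single best score.
cv_table = (pd.DataFrame(search.cv_results_)
            [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
            .sort_values("rank_test_score"))
print(cv_table.to_string(index=False))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between candidates is larger than the fold-to-fold noise, which matters when each fold holds only a few frauds.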