__NAME:__ __FULLNAME__  
__SECTION:__ __NUMBER__  
__CS 5703: Machine Learning Practice__

# Homework 8: Support Vector Machines

## Assignment Overview
Follow the TODOs and read through and understand any provided code.  
Post any questions regarding the assignment to Slack.  
For all plots, make sure all necessary axes and curves are clearly and 
accurately labeled. Include figure/plot titles appropriately as well.

### Task
For this assignment you will be exploring support vector machines (SVMs)
using GridSearchCV and working with a highly unbalanced dataset.


### [Data set](https://www.kaggle.com/kerneler/starter-credit-card-fraud-detection-e6d0de2d-9)
European Cardholder Credit Card Transactions, September 2013  
This dataset presents transactions that occurred over two days. There were 377 incidents of 
fraud out of 191,828 transactions. The dataset is highly unbalanced: the positive class 
(frauds) accounts for 0.197% of all transactions.

__Features__  
* V1, V2, ..., V28: principal components obtained with PCA from a large feature vector
* Time: the seconds elapsed between each transaction and the first transaction  
* Amount: the transaction amount  
* Class: the target variable; 1 in case of fraud and 0 otherwise.  

Given the class imbalance, it is recommended to use precision, recall, and the 
Area Under the Precision-Recall Curve (AUPRC) to evaluate skill. Traditional accuracy 
and the area under the ROC curve (AUC) can be misleading for highly unbalanced 
classification, because both are dominated by the large number of negative cases, 
which are easy to identify. 
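To make the point about accuracy concrete, the sketch below scores a hypothetical "classifier" that labels every transaction as legitimate. It reuses the class counts quoted in the dataset description; the all-negative predictor is an illustrative assumption and not part of the assignment.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Class counts quoted in the dataset description
n_total, n_fraud = 191_828, 377
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1  # mark the fraud cases as the positive class

# Hypothetical predictor that never flags fraud
y_pred = np.zeros(n_total, dtype=int)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positives are predicted
prec = precision_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.4f}  recall={rec:.4f}  precision={prec:.4f}")
```

The accuracy comes out at about 99.8% even though not a single fraud case is detected, which is why recall and precision are the more useful numbers here.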

Examining precision and recall is more informative, as these disregard the number of 
correctly identified negative cases (i.e. TN) and focus on the number of correctly 
identified positive cases (TP) and on the two kinds of errors: negative cases flagged 
as fraud (FP) and fraud cases that were missed (FN). Another useful metric is the 
F1 score, which is the harmonic mean of the precision and recall; 1 is the best F1 score.

Confusion Matrix  
[TN  FP]  
[FN  TP]

Accuracy = $\frac{TN + TP}{TN + TP + FN + FP}$  
TPR = $\frac{TP}{TP + FN}$  
FPR = $\frac{FP}{FP + TN}$  

Recall = TPR = $\frac{TP}{TP + FN}$  
Precision = $\frac{TP}{TP + FP}$  
F1 Score = $2 \times \frac{precision \; \times \; recall}{precision \; + \; recall}$  

See the references below for more details on precision, recall, and the F1 score.
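As a quick check of the formulas above, the following sketch computes each quantity by hand from a confusion matrix and compares it with the corresponding scikit-learn function. The ten labels and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels and predictions (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0])

# scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tn + tp) / (tn + tp + fn + fp)
recall = tp / (tp + fn)        # identical to the TPR
fpr = fp / (fp + tn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values should agree with the library functions
print(np.isclose(recall, recall_score(y_true, y_pred)))
print(np.isclose(precision, precision_score(y_true, y_pred)))
print(np.isclose(f1, f1_score(y_true, y_pred)))
```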


The dataset was collected and analysed during a research collaboration of 
Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université 
Libre de Bruxelles) on big data mining and fraud detection [1].

[1] Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi.
Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium
on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
http://mlg.ulb.ac.be/BruFence . http://mlg.ulb.ac.be/ARTML


### Objectives
* Understanding Support Vector Machines
* GridSearch with Classification
* Creating Scoring functions
* Stratification

### Notes
* Save your work in your own Google Drive or on your own computer
* Note that there are three supporting Python files that must be placed in the same folder as your notebook

### General References
* [Guide to Jupyter](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)
* [Python Built-in Functions](https://docs.python.org/3/library/functions.html)
* [Python Data Structures](https://docs.python.org/3/tutorial/datastructures.html)
* [Numpy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)
* [Numpy Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)
* [Summary of matplotlib](https://matplotlib.org/3.1.1/api/pyplot_summary.html)
* [DataCamp: Matplotlib](https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python)
* [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
* [Sci-kit Learn Linear Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
* [Sci-kit Learn Ensemble Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
* [Sci-kit Learn Metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
* [Sci-kit Learn Model Selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)
* [Scoring Parameter](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)
* [Scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring)
* [Plot ROC](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html)
* [Precision, Recall, F1 Score](https://en.wikipedia.org/wiki/Precision_and_recall)
* [Precision-Recall Curve](https://acutecaretesting.org/en/articles/precision-recall-curves-what-are-they-and-how-are-they-used)
* [Probability Plot](https://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm)

### Hand-In Procedure
* Execute all cells so they are showing correct results
* Notebook (from Jupyter or Colab):
  + Submit this file (.ipynb) to the Gradescope Notebook HW8 dropbox
* Note: there is no need to submit a PDF file or to submit directly to Canvas

In [None]:
%reload_ext autoreload
%autoreload 2
#%matplotlib inline

import pandas as pd
import numpy as np
#import seaborn
import scipy.stats as stats
import os, re, fnmatch
import pathlib, itertools
import time as timelib
import matplotlib.pyplot as plt

from math import floor, ceil
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import learning_curve, StratifiedKFold
from sklearn.metrics import make_scorer, precision_recall_curve
from sklearn.metrics import confusion_matrix, precision_score
from sklearn.metrics import roc_curve, auc, f1_score, recall_score
from sklearn.svm import SVC
import joblib
import pdb
#pdb.set_trace()

#HOME_DIR = pathlib.Path.home()
#CW_DIR = pathlib.Path.cwd()

# Default figure parameters
plt.rcParams['figure.figsize'] = (6,5)
plt.rcParams['font.size'] = 10
plt.rcParams['legend.fontsize'] = 10
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['figure.constrained_layout.use'] = True
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12


In [None]:
# COLAB ONLY
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# TODO: COLAB ONLY
# THIS IMPORTS 3 CUSTOM .py FILES 
# You must separately download these files and store them in the 
# Colab Notebooks folder
# If you are running this on a local machine, then do not execute
# this cell (execute the one below)

# This is a little awkward: Colab doesn't play _super_ nice with local 
# Python files
# Note that using exec() this way is not programming best practice
exec(open(
    '/content/drive/My Drive/Colab Notebooks/visualize.py', 'r'
).read())
exec(open(
    '/content/drive/My Drive/Colab Notebooks/metrics_plots.py', 'r'
).read())
exec(open(
    '/content/drive/My Drive/Colab Notebooks/pipeline_components.py', 'r'
).read())

In [None]:
# for local runtimes only: DO NOT EXECUTE IN COLAB
from visualize import *
from metrics_plots import *
from pipeline_components import DataSampleDropper, DataFrameSelector

# LOAD DATA

In [None]:
# 'None' to read whole file
nRowsRead = None 

# TODO: set appropriately
filename = '/content/drive/MyDrive/MLP_2022/datasets/creditcard.csv'
#filename = 'creditcard.csv'

# Read the CSV file and extract the table
crime_stats_full = pd.read_csv(filename, delimiter=',', nrows=nRowsRead)
crime_stats_full.dataframeName = 'creditcard.csv'
nRows, nCols = crime_stats_full.shape
print(f'There are {nRows} rows and {nCols} columns')

In [None]:
""" PROVIDED
good (negative case = 0)
fraud (positive case = 1)
"""
targetnames = ['good', 'fraud']

neg_full = crime_stats_full.loc[crime_stats_full['Class'] == 0] 
pos_full = crime_stats_full.loc[crime_stats_full['Class'] == 1] 

pos_full.shape, neg_full.shape

In [None]:
""" PROVIDED
Compute the positive and negative fractions
"""
pos_fraction = pos_full.shape[0] / nRows
neg_fraction = 1 - pos_fraction

pos_fraction, neg_fraction

In [None]:
""" PROVIDED
Select Random Subset of data
"""
np.random.seed(1138)
subset_size = 50000
selected_indices = np.random.choice(range(nRows), size=subset_size, replace=False)
selected_indices

In [None]:
""" PROVIDED
List the features and shape of the data
"""
crime_stats = crime_stats_full.loc[selected_indices,:]
crime_stats.columns, crime_stats.shape

In [None]:
""" TODO
Display summary statistics for each feature of the dataframe
"""


In [None]:
""" PROVIDED
Display whether there are any NaNs
"""
crime_stats.isna().any()

# VISUALIZE DATA

In [None]:
""" TODO
Display the distributions of the data
use visualize.featureplots(...)
to generate trace plots, histograms, boxplots, and probability plots for
each feature.

A probability plot is used to evaluate the normality of a distribution.
The data are plotted against a theoretical distribution, such that if the data 
are normal, they'll follow the diagonal line. See the reference above for 
more information.
"""

crime_stats_clean = crime_stats.dropna()

# TODO: visualize the features
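As a small illustration of how a probability plot separates normal from non-normal data, the sketch below uses `scipy.stats.probplot` directly on synthetic samples. It does not use `visualize.featureplots`, and the two samples are made up for the example; the correlation coefficient `r` of the fitted line is close to 1 when the points follow the diagonal.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=1000)       # should hug the diagonal
skewed_sample = rng.lognormal(size=1000)    # heavy right tail, should bend away

# With the default fit=True, probplot returns
# (theoretical quantiles, ordered data), (slope, intercept, r)
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"r for normal sample: {r_normal:.3f}")
print(f"r for skewed sample: {r_skewed:.3f}")
```

Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) to `probplot` draws the plot itself instead of only returning the fit.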


In [None]:
""" PROVIDED
Display the Pearson correlation between all pairs of the features
"""
scatter_corrplots(crime_stats_clean.values, crime_stats_clean.columns, corrfmt="%.1f", FIGW=15)
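When reading a large correlation plot, it can help to rank the coefficients for a single column numerically. The sketch below does this with `DataFrame.corr` on a made-up three-column frame; the column names `A`, `B`, and `C` are hypothetical stand-ins for the real features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
toy = pd.DataFrame({
    "A": a,
    "B": 2.0 * a + rng.normal(scale=0.1, size=n),  # strongly tied to A
    "C": rng.normal(size=n),                        # unrelated to A
})

# Pearson correlation between all pairs of columns
corr = toy.corr(method="pearson")

# Rank the other columns by the magnitude of their correlation with "A"
ranked = corr["A"].drop("A").abs().sort_values(ascending=False)
print(ranked)
```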

## TODO Reflection #1

1. Which features correlate the most with the Amount feature?

**TODO**

2. Which features correlate the most with the class label?

**TODO**


In [None]:
""" PROVIDED
Separate the positive and negative examples
"""
neg = crime_stats.loc[crime_stats['Class'] == 0] 
pos = crime_stats.loc[crime_stats['Class'] == 1] 

pos.shape, neg.shape

In [None]:
""" PROVIDED
Compute the positive and negative fractions
"""
pos_fraction = pos.shape[0] / (pos.shape[0] + neg.shape[0])
neg_fraction = 1 - pos_fraction

pos_fraction, neg_fraction

In [None]:
""" PROVIDED
Compare the features for the positive and negative examples
"""
features_displayed = pos.columns
ndisplayed = len(features_displayed)
ncols = 5
nrows = ceil(ndisplayed/ncols)

fig, axs = plt.subplots(nrows, ncols, figsize=(15, 15))
axs = axs.ravel()

for ax, feat_name in zip(axs, features_displayed):
    # A plain list of arrays avoids building a ragged object array
    bp = [neg[feat_name].values, pos[feat_name].values]
    boxplot = ax.boxplot(bp, patch_artist=True, sym='.')
    boxplot['boxes'][0].set_facecolor('pink')
    boxplot['boxes'][1].set_facecolor('lightblue')
    ax.set_xticklabels(['-', '+'])
    ax.set(title=feat_name)


## TODO Reflection #2

1. Which features show a different mean value across the positive and negative classes?

**TODO**


# PRE-PROCESS DATA

## Data Clean Up and Feature Selection

In [None]:
""" PROVIDED
Construct Pipeline to pre-process data
"""
feature_names = crime_stats.columns.drop(['Class'])
pipe_X = Pipeline([
    ("NaNrowDropper", DataSampleDropper()),
    ("selectAttribs", DataFrameSelector(feature_names)),
    ("scaler", RobustScaler())
])

pipe_y = Pipeline([
    ("NaNrowDropper", DataSampleDropper()),
    ("selectAttribs", DataFrameSelector(['Class']))
])

In [None]:
""" TODO
Pre-process the data using the pipeline

NOTE: generally, we should only fit these pipelines to the training/validation data and NOT
the test data.  However, we will take this shortcut here.
"""
X = #TODO
y = #TODO
print(X.shape)
print(y.shape)
np.any(np.isnan(X))
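The note above about fitting only on training data can be sketched as follows. This is a minimal illustration on synthetic arrays (the names `X_train_demo` and `X_test_demo` are hypothetical), not the shortcut used in this notebook: the scaler learns its median and interquartile range from the training portion and then reuses them on the held-out portion.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

scaler = RobustScaler()
# Statistics (median, IQR) are learned from the training data only
X_train_scaled = scaler.fit_transform(X_train_demo)
# The held-out data is transformed with the training statistics
X_test_scaled = scaler.transform(X_test_demo)

print(np.median(X_train_scaled, axis=0))  # approximately zero by construction
print(np.median(X_test_scaled, axis=0))   # near zero, but not exactly
```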

In [None]:
""" TODO
Re-visualize the pre-processed data
use visualize.featureplots()
"""


# SVMs: EXPLORATION

In [None]:
""" TODO
Hold out a subset of the data, before training and cross validation
using train_test_split, with stratify NOT equal to None, and a test_size 
fraction of .2.

For this exploratory section, the held out set of data is a validation set.
For the GridSearch section, the held out set of data is a test set.  Again, this is for
convenience here.  But, generally, a test set should always be treated as a test set.

Note that train_test_split() from scikit-learn always calls its two outputs "train" and
"test"; in this section, the held-out ("test") portion serves as our validation set.

"""
Xtrain, Xval, ytrain, yval = #TODO
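To see what stratification does, the sketch below splits a synthetic, unbalanced dataset and checks the positive fraction on each side. The arrays and the 2% positive rate are made up for illustration; the notebook's own `X` and `y` are not used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1  # 2% positives

# stratify=y_demo keeps the class proportions the same in both parts
Xa, Xb, ya, yb = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print("positive fraction, larger part: ", ya.mean())
print("positive fraction, held-out part:", yb.mean())
```

Without `stratify`, a random split of such rare positives could leave the held-out part with very few or even no positive cases.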

In [None]:
yval.shape

In [None]:
""" TODO
Create and train SVC models. 
Explore various configurations of the hyper-parameters. 
Train the models on the training set and evaluate them for the training and
validation sets.

Try different choices for C, gamma, and class_weight. Feel free to play with other hyper-
parameters as well. See the API for more details.
C is the regularization parameter (smaller C means stronger regularization), gamma is 
the inverse of the radius of influence of the support vectors (i.e. lower gamma means 
a higher radius of influence of the support vectors), and class_weight determines 
whether to weight each class inversely to its fraction of the training data.
"""
model = #TODO
model.fit(Xtrain, ytrain)
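The roles of gamma and class_weight described above can be illustrated numerically. The sketch below evaluates the RBF kernel between two points one unit apart for several gamma values, and computes the weights that the `'balanced'` setting corresponds to, using scikit-learn's `compute_class_weight`. The points and the 980/20 class split are made up for the example.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.utils.class_weight import compute_class_weight

# Two points at distance 1 from each other
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 0.0]])

# RBF kernel: exp(-gamma * ||a - b||^2); larger gamma means faster decay,
# i.e. a smaller radius of influence
for gamma in (0.01, 1.0, 100.0):
    k = rbf_kernel(a, b, gamma=gamma)[0, 0]
    print(f"gamma={gamma:>6}: kernel value = {k:.4g}")

# 'balanced' weights are n_samples / (n_classes * count_of_each_class)
y_demo = np.array([0] * 980 + [1] * 20)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_demo)
print("balanced class weights:", weights)
```

With these counts the rare class receives a weight of 25, so each misclassified positive costs far more than a misclassified negative.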

### Train Set Performance

In [None]:
""" TODO
Evaluate training set performance. 
Display the confusion matrix, KS plot with
the cumulative distributions of the TPR and FPR, the ROC curve and the 
precision-recall curve (PRC). 

The PRC, unlike the ROC curve, does not use the true negative (i.e. TN) counts,
making the PRC more informative for unbalanced datasets.
"""
# TODO: Compute the predictions for the training set
preds = #TODO

# TODO: Compute the confusion matrix
confusion_mtx = #TODO

# TODO: Plot the confusion matrix in graphical form (see metrics_plots)
#TODO

# TODO: Use the model's predict_proba function to compute the probabilities
#  We will use only the fraud case probabilities for our analysis.  Select these
probas = #TODO

# TODO: display the KS plot, ROC, and PRC (see metrics_plots)
roc_prc_results = #TODO

# Compute performance scores
pss_train = skillScore(ytrain, preds)
f1_train = f1_score(ytrain.ravel(), preds)
print("PSS: %.4f" % pss_train[0])
print("F1 Score %.4f" % f1_train)
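For reference, the quantities behind the ROC curve, the precision-recall curve, and a KS-style statistic can be computed directly with scikit-learn. The sketch below uses synthetic scores rather than the helpers in `metrics_plots`, and it assumes the KS statistic is taken as the largest gap between the TPR and FPR curves; the provided plotting helper may define or display it differently.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

rng = np.random.default_rng(0)
# Synthetic scores: positives tend to score higher than negatives
y_demo = np.array([0] * 950 + [1] * 50)
scores = np.concatenate([rng.normal(0.0, 1.0, 950),
                         rng.normal(2.0, 1.0, 50)])

# ROC curve and its area
fpr, tpr, thresholds = roc_curve(y_demo, scores)
roc_auc = auc(fpr, tpr)

# Assumed KS statistic: maximum separation between TPR and FPR
ks = np.max(tpr - fpr)

# Precision-recall curve and its area
precision, recall, _ = precision_recall_curve(y_demo, scores)
pr_auc = auc(recall, precision)

print(f"ROC AUC = {roc_auc:.3f}, KS = {ks:.3f}, PR AUC = {pr_auc:.3f}")
```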

### Validation Set Performance

In [None]:
""" TODO
Evaluate validation performance. 
Display the confusion matrix, KS plot with the cumulative distributions of the TPR 
and FPR, the ROC curve and the precision-recall curve (PRC).
"""
# TODO: Confusion matrix

# TODO: Curves


# Report scores
pss_val = skillScore(yval, preds_val)
f1_val = f1_score(yval, preds_val)
print("PSS: %.4f" % pss_val[0])
print("F1 Score %.4f" % f1_val)

## TODO Reflection #3
1. Compare / contrast the training and validation set results

**TODO**

2. Which metric is most sensitive to the overfitting?

**TODO**


# SVMs: STRATIFIED GRID SEARCH

## Scorers

In [None]:
""" PROVIDED
List of available scoring functions from the sklearn module
"""
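# sklearn.metrics.SCORERS was removed in scikit-learn 1.3; use get_scorer_names()
from sklearn.metrics import get_scorer_names
sorted(get_scorer_names())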
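Beyond the built-in scorers, `make_scorer` turns any metric function into a scorer. The sketch below wraps a Peirce skill score, taken here as TPR minus FPR; this is an assumption about the definition, and the `skillScore` helper supplied with the assignment may compute or return it differently. The function name `pss_metric` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def pss_metric(y_true, y_pred):
    """Peirce skill score, assumed here to be TPR - FPR."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn) - fp / (fp + tn)

# Wrap the metric so it can be passed as a `scoring` argument
pss_scorer = make_scorer(pss_metric, greater_is_better=True)

# Synthetic unbalanced problem (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=400, weights=[0.9, 0.1], random_state=0
)

# Stratified folds keep positives in every fold, so TPR is always defined
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    SVC(class_weight="balanced"), X_demo, y_demo, scoring=pss_scorer, cv=cv
)
print(scores)
```

A scorer built this way can also be used as a value in the `scoring` dictionary given to `GridSearchCV`.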

## Execute Grid Search

In [None]:
""" TODO
Estimated time: ~2 hrs on Colab
Set up and run a grid search using GridSearchCV and the following 
settings:
* SVC for the model,
* The given scoring dictionary for scoring,
* refit set to opt_metric
* Five for the number of cv folds, 
* n_jobs=-1,
* verbose=2, 
* return_train_score=True
"""
# Optimized metric
opt_metric = 'f1'
scoring = {opt_metric:opt_metric}

# Flag to force re-execution of the learning process
force = False

# File name containing results from previous run 

#srchfname = "/content/drive/My Drive/Colab Notebooks/hw8_search_" + opt_metric + ".pkl"
srchfname = "hw8_search_" + opt_metric + ".pkl"


# SETUP EXPERIMENT HYPERPARAMETERS
Cs = np.logspace(-1, 2, num=5, endpoint=True, base=10)
gammas = np.logspace(-5, 0, num=5, endpoint=True, base=5)

# Number of each parameter type
nCs = len(Cs)
ngammas = len(gammas)

# Create the hyperparameter specification
hyperparams = {'C':Cs, 'gamma':gammas, 'tol':[1e-4],
               'class_weight':[None, 'balanced'], 
               'probability':[True]}

# RUN EXPERIMENT
time0 = timelib.time()
search = None
if force or (not os.path.exists(srchfname)):
    # TODO: Create the GridSearchCV object
    search = #TODO
    
    # TODO: Execute the grid search by calling fit using the training data
       
    
    # Save the grid search object
    joblib.dump(search, srchfname)
    print("Saved %s" % srchfname)
else:
    search = joblib.load(srchfname)
    print("Loaded %s" % srchfname)

time1 = timelib.time()
duration = time1 - time0
print("Elapsed Time: %.2f min" % (duration / 60))

search
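Before committing to a multi-hour run, it can be useful to rehearse the same workflow on a problem that finishes in seconds. The sketch below is such a rehearsal under stated assumptions: a small synthetic dataset, a deliberately tiny grid, and three folds. All names prefixed with `toy_` are hypothetical and the settings are not those required by the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

# Small synthetic unbalanced problem (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=300, weights=[0.9, 0.1], random_state=0
)

toy_params = {
    "C": [0.1, 10.0],
    "gamma": [0.01, 0.1],
    "class_weight": [None, "balanced"],
}
# Number of candidate settings: 2 * 2 * 2
n_candidates = len(ParameterGrid(toy_params))

toy_search = GridSearchCV(
    SVC(),
    toy_params,
    scoring={"f1": "f1"},   # dictionary form allows several metrics
    refit="f1",             # metric used to pick and refit the best model
    cv=3,                   # an integer cv is stratified for classifiers
    return_train_score=True,
)
toy_search.fit(X_demo, y_demo)

print(n_candidates, "candidates")
print("best parameters:", toy_search.best_params_)
```

Multiplying the number of candidates by the number of folds gives the number of model fits, which is a quick way to estimate how long the full search will take.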

# RESULTS

In [None]:
""" PROVIDED
Display the head of the results for the grid search
See the cv_results_ attribute
"""
all_results = search.cv_results_
df_res = pd.DataFrame(all_results)
df_res.head()

In [None]:
""" PROVIDED
Plot the mean training and validation results from the grid search as a
colormap, for C (y-axis) vs the gamma (x-axis), for class_weight=None
"""
results_grid_train = df_res['mean_train_'+opt_metric].values.reshape(nCs, 2, ngammas)
results_grid_val = df_res['mean_test_'+opt_metric].values.reshape(nCs, 2, ngammas)

fig, axs = plt.subplots(1, 2, figsize=(6,6))
axs = axs.ravel()
means = [("Training", results_grid_train),
         ("Validation", results_grid_val)]
for i, (name, result) in enumerate(means):
    img = axs[i].imshow(result[:,0,:], cmap="jet", vmin=0, vmax=1)
    axs[i].set_title(name)
    axs[i].set_xticks(range(ngammas))
    axs[i].set_yticks(range(nCs))
    axs[i].set_xticklabels(np.around(gammas, 3))
    axs[i].set_yticklabels(np.around(Cs, 3))
    axs[i].figure.colorbar(img, ax=axs[i], label=opt_metric, 
                           orientation='horizontal')
    if i == 0:
        axs[i].set_ylabel("C")
    axs[i].set_xlabel(r"$\gamma$")
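The reshape to `(nCs, 2, ngammas)` used above relies on the order in which the candidates are enumerated. The sketch below checks that order on a small made-up grid, assuming the search enumerates candidates in `ParameterGrid` order: keys are sorted alphabetically (`C`, `class_weight`, `gamma`) and the last key varies fastest.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

toy_Cs = [0.1, 1.0, 10.0]
toy_gammas = [0.01, 0.1]
toy_weights = [None, "balanced"]
toy_grid = {"C": toy_Cs, "gamma": toy_gammas, "class_weight": toy_weights}

# Candidates in enumeration order
candidates = list(ParameterGrid(toy_grid))

# Same reshape pattern as above: (C, class_weight, gamma)
index_grid = np.arange(len(candidates)).reshape(
    len(toy_Cs), len(toy_weights), len(toy_gammas)
)

# Position [i, j, k] should hold C[i], class_weight[j], gamma[k]
for i, j, k in np.ndindex(index_grid.shape):
    params = candidates[index_grid[i, j, k]]
    assert params == {
        "C": toy_Cs[i],
        "class_weight": toy_weights[j],
        "gamma": toy_gammas[k],
    }
print("reshape order matches (C, class_weight, gamma)")
```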
    

In [None]:
""" 
Plot the mean training and validation results from the grid search as a
colormap, for C (y-axis) vs the gamma (x-axis), for class_weight='balanced'
"""
fig, axs = plt.subplots(1, 2, figsize=(6,6))
axs = axs.ravel()
means = [("Training", results_grid_train),
         ("Validation", results_grid_val)]
for i, (name, result) in enumerate(means):
    img = axs[i].imshow(result[:,1,:], cmap="jet", vmin=0, vmax=1)
    axs[i].set_title(name)
    axs[i].set_xticks(range(ngammas))
    axs[i].set_yticks(range(nCs))
    axs[i].set_xticklabels(np.around(gammas, 3))
    axs[i].set_yticklabels(np.around(Cs, 3))
    axs[i].figure.colorbar(img, ax=axs[i], label=opt_metric, 
                           orientation='horizontal')
    if i == 0:
        axs[i].set_ylabel("C")
    axs[i].set_xlabel(r"$\gamma$")


In [None]:
""" TODO
Obtain the best model from the grid search and 
fit it to the full training data
"""
best_model = search.best_estimator_
best_model.fit(Xtrain, ytrain)

### Train Set Performance

In [None]:
""" TODO
For the best model, display the confusion matrix, KS plot, ROC curve, 
and PR curve for the training set
"""
# TODO: Confusion Matrix

# TODO: Curves

# Report results
pss_res = skillScore(ytrain, preds)
f1_res = f1_score(ytrain, preds)
print("PSS: %.4f" % pss_res[0])
print("F1 Score %.4f" % f1_res)

### Validation Set Performance

In [None]:
""" TODO
For the best model, display the confusion matrix, KS plot, ROC curve, 
and PR curve for the validation set
"""
# TODO: Confusion Matrix

# TODO: Curves

# Report results
pss_res_val = skillScore(yval, preds_val)
f1_res_val = f1_score(yval, preds_val)
print("PSS: %.4f" % pss_res_val[0])
print("F1 Score %.4f" % f1_res_val)

## TODO Reflection #4
Discuss and interpret the validation results for the best model. 

**TODO**

