<a href="https://colab.research.google.com/github/DonErnesto/masterclassSFI_2021/blob/main/notebooks/BitcoinSupervised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Crime detection with Supervised Machine Learning

**Introduction**


The purpose of this Jupyter notebook is to guide you through some essential steps in developing a machine learning model: hyperparameter tuning, model comparison, and model selection.

The data we will be using was taken from Kaggle: https://www.kaggle.com/ellipticco/elliptic-data-set 
and describes blockchain transactions, some of which are flagged as "illicit" (i.e., relating to illegal activity), others as "licit" or "unknown" (the majority, about 80%). We got rid of the unknown labels for simplicity. The authors give as examples of illicit categories: "scams, malware, terrorist organizations, ransomware, Ponzi schemes, etc."


Note that there are two types of cells in this notebook: Markdown cells (that contain text, like this one), and Code cells (that execute some code, like the next cell). 

By clicking the Play button on a cell, we execute a code cell. Lines that start with a "#" are comments, and not executed. 

Your input is required whenever there is a Question (in that case: write in the Markdown cell) or whenever you find some 'xxxxx' in the code cell (in this case, some code needs to be fixed or completed).


We start by downloading the data we will be training on, which has already been split into "X" (features) and "y" (labels).

In [None]:
## Data import from Github
import os
if not os.path.exists('X_train_supervised.csv.zip'): # then probably nothing was downloaded yet
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/ml_utils.py
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/X_train_supervised.csv.zip
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/y_train_supervised.csv.zip
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/X_test_supervised.csv.zip
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/y_test_supervised.csv.zip

We will be using pandas for data handling, and scikit-learn (sklearn) for supervised machine learning algorithms. 

In [None]:
## Package import: pandas for data handling and manipulation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from ml_utils import slice_gridsearch

Next, we load the data into a so-called DataFrame (a pandas object) and inspect it by displaying its first N rows.

In [None]:
X_train = pd.read_csv('X_train_supervised.csv.zip')
X_test = pd.read_csv('X_test_supervised.csv.zip')
y_train = pd.read_csv('y_train_supervised.csv.zip')['class']
# .head() returns the first n (per default 5) rows of a DataFrame
X_train.head() 

In [None]:
# Remove unwanted columns: the transaction identifier (txId) and the time step
X_train = X_train.drop(columns=['txId', 'Time step'])
X_test = X_test.drop(columns=['txId', 'Time step'])

**Further documentation on this dataset:**

From the website: "There are 166 features associated with each node. Due to intellectual property issues, we cannot provide an exact description of all the features in the dataset. There is a time step associated to each node, representing a measure of the time when a transaction was broadcasted to the Bitcoin network. The time steps, running from 1 to 49, are evenly spaced with an interval of about two weeks. Each time step contains a single connected component of transactions that appeared on the blockchain within less than three hours between each other; there are no edges connecting the different time steps.

The first 94 features represent local information about the transaction – including the time step described above, number of inputs/outputs, transaction fee, output volume and aggregated figures such as average BTC received (spent) by the inputs/outputs and average number of incoming (outgoing) transactions associated with the inputs/outputs. The remaining 72 features are aggregated features, obtained using transaction information one-hop backward/forward from the center node - giving the maximum, minimum, standard deviation and correlation coefficients of the neighbour transactions for the same information data (number of inputs/outputs, transaction fee, etc.)."

We only look at the node data (i.e., ignore the network topology), although many of the features are derived from the surrounding nodes and do therefore contain information regarding the network structure. 


In [None]:
print(X_train.shape, '\n')
print(y_train.value_counts(normalize=True))

There are 33.4k data points, of which 11% are positive (quite a large fraction in a financial crime context).

## Introduction: Decision Tree classifier

In [None]:
# First we import the classes we want to use 
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
# Then we instantiate the DecisionTreeClassifier and define the parameter space we want to explore
dtc = DecisionTreeClassifier() #Initialize with whatever parameters you want to

# we will vary the maximum depth of the tree, and the minimum required number of samples to make a split
param_grid = {'max_depth': [2, 5, 10, 20], 'min_samples_split': [2, 10]} #Note the dictionary notation

In [None]:
# We make use of the GridSearchCV estimator that does the parameter-space scanning for us
grid_dtc = GridSearchCV(dtc, param_grid, cv=5, scoring='roc_auc') #NB: uses StratifiedKFold when cv=int

# Finally, we fit the GridSearchCV estimator to our training data, using the .fit() method
_ = grid_dtc.fit(X_train, y_train)

In [None]:
?slice_gridsearch

We use slice_gridsearch (a self-made helper function) to show how the varied parameters influence classifier performance.
Note that the boxplots show:
- the median,
- a box spanning the first quartile (Q1) to the third quartile (Q3),
- and whiskers that extend to the lowest/highest data point lying within 1.5 times the InterQuartile Range (IQR) of the box edges, i.e. no further than Q1 - 1.5 IQR and Q3 + 1.5 IQR. Points beyond these limits are considered outliers and plotted explicitly.
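
To make the boxplot conventions concrete, here is a minimal sketch that computes the same quantities with NumPy. It assumes the common convention in which the whisker limits are measured from the box edges (Q1 and Q3); the scores are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Hypothetical cross-validation scores (made up for illustration)
scores = np.array([0.70, 0.86, 0.87, 0.88, 0.89, 0.90, 0.91])

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1

# Whisker limits sit 1.5 IQR beyond the box edges
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside those limits
inside = scores[(scores >= lower_limit) & (scores <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the limits is drawn as an individual outlier point
outliers = scores[(scores < lower_limit) | (scores > upper_limit)]

print(f"median={median:.3f}, box=[{q1:.3f}, {q3:.3f}]")
print(f"whiskers=[{whisker_low:.2f}, {whisker_high:.2f}], outliers={outliers}")
```

In this toy example the single low score falls outside the lower limit, so it would be drawn as a separate point rather than stretching the whisker.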



In [None]:
df = slice_gridsearch(grid_dtc, vary_parameter_name='max_depth', fix_parameter_name='min_samples_split',
                     fix_parameter_value=2)

In [None]:
# Inspection of the returned DataFrame shows us some more interesting statistics
df

In [None]:
df = slice_gridsearch(grid_dtc, vary_parameter_name='min_samples_split', fix_parameter_name='max_depth',
                     fix_parameter_value=10)

Even with the simple Decision Tree Classifier, ROC-AUC scores close to 0.90 are feasible.
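
As a reminder of what the ROC-AUC score measures: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it directly from that definition in plain Python. The function name `roc_auc_by_pairs` and the toy labels and scores are made up for illustration; in the notebook itself scikit-learn does this computation for us.

```python
from itertools import product

def roc_auc_by_pairs(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy labels and scores (made up for illustration)
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]

auc = roc_auc_by_pairs(y_true, y_score)
print(f"ROC-AUC = {auc:.3f}")
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric stays informative when the classes are imbalanced.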


## Naive Bayes classifier

The Naive Bayes classifier is a simple yet powerful classifier that has been used successfully in, for instance, spam filters. Here we will use the Gaussian Naive Bayes classifier, which assumes that each feature follows a Gaussian distribution within each class.
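
To illustrate the idea, here is a minimal NumPy sketch of Gaussian Naive Bayes on synthetic data. The helper names (`fit_gaussian_nb`, `predict_proba_gaussian_nb`) and the toy data are made up for this example. The smoothing term is meant to mimic the role of `var_smoothing` (a fraction of the largest feature variance added to all variances), but this is a sketch under that assumption, not scikit-learn's implementation.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    # Small variance offset for numerical stability (assumed smoothing rule)
    epsilon = var_smoothing * X.var(axis=0).max()
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + epsilon
    return classes, priors, means, variances

def predict_proba_gaussian_nb(X, priors, means, variances):
    """Posterior class probabilities, treating features as independent Gaussians."""
    diff = X[:, None, :] - means[None, :, :]            # (n_samples, n_classes, n_features)
    log_density = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    log_post = log_density.sum(axis=2) + np.log(priors)[None, :]
    log_post -= log_post.max(axis=1, keepdims=True)     # avoid overflow in exp
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# Synthetic, imbalanced two-class data (made up for illustration)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                   rng.normal(2.0, 1.0, size=(50, 2))])
y_toy = np.array([0] * 200 + [1] * 50)

classes, priors, means, variances = fit_gaussian_nb(X_toy, y_toy)
proba = predict_proba_gaussian_nb(X_toy, priors, means, variances)
accuracy = np.mean(classes[proba.argmax(axis=1)] == y_toy)
print(f"priors={priors}, training accuracy={accuracy:.3f}")
```

The "naive" part is the sum over features in the log-density: each feature contributes independently, which keeps the model cheap to fit even with many columns.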

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
param_grid = {'var_smoothing': [1.E-9, 1.E-6, 1.E-3, 1., 100.]} #Note the dictionary notation
nb = GaussianNB()
sc = StandardScaler()


In [None]:
# We make use of the GridSearchCV estimator that does the parameter-space scanning for us
grid_nb = GridSearchCV(nb, param_grid, cv=5, scoring='roc_auc') #NB: uses StratifiedKFold when cv=int

# Finally, we fit the GridSearchCV estimator to our training data, using the .fit() method
_ = grid_nb.fit(sc.fit_transform(X_train), y_train)

In [None]:
pd.DataFrame(grid_nb.cv_results_)

## Logistic Regression

Logistic Regression is the classification-counterpart of Linear Least Squares for regression. Similar to linear regression, we can impose a penalty on larger coefficient values to prevent overfitting. This is called regularization. 

Too large a penalty (a small C value in the sklearn model) leads to stable but sub-optimal performance (underfitting); too small a penalty may result in overfitting, especially when the number of features (columns) is high and the number of samples is low. When doing logistic regression, it is important to determine the optimal regularization strength using cross-validation.
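
The effect of the penalty on the coefficients can be sketched on synthetic data. The example below assumes scikit-learn's default L2 penalty and uses `make_classification` to generate a toy problem; the data and the chosen C values are for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic toy problem (not the Bitcoin data)
X_toy, y_toy = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X_toy)

# Smaller C means a stronger penalty and therefore smaller coefficients
coef_norms = {}
for C in [1e-3, 1e-1, 1e1]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_scaled, y_toy)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:g}: ||coef|| = {coef_norms[C]:.3f}")
```

The coefficient norm grows as C increases, which is the mechanism behind the underfitting/overfitting trade-off described above.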

It is important to scale the data before fitting a regularized model (why?). The correct way is to make scaling part of the cross-validation pipeline, which is done with the Pipeline class of sklearn.
The Pipeline object behaves just like a single classifier, with .fit() and .predict() methods.
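
A minimal sketch of this behaviour on synthetic data is given below. The step names `'scaler'` and `'lr'` are arbitrary labels chosen for the example, and the toy data comes from `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy problem (not the Bitcoin data)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(max_iter=1000)),
])

# Parameters of a step are addressed as "<step name>__<parameter name>"
pipe.set_params(lr__C=0.1)

# In each fold the whole pipeline is refitted, so the scaler only sees the
# training part of that fold and nothing leaks from the validation part
fold_scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring='roc_auc')
print(fold_scores.round(3))
```

Scaling the full training set once before cross-validation would let the validation folds influence the scaling statistics, which is what the pipeline avoids.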

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
?LogisticRegression

In [None]:
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(solver='saga', n_jobs=-1, random_state=10))
])
# define the parameter grid, preceding the argument name with "lr__" when it applies to the LogisticRegression
param_grid = {'lr__C': np.logspace(-5, 3, num=5), 
              #'lr__penalty': ['l1', 'l2']
             } 
grid_lr = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc') # score on ROC-AUC, as for the other classifiers
grid_lr.fit(X_train, y_train)

In [None]:
df = slice_gridsearch(grid_lr, vary_parameter_name='lr__C', fix_parameter_name=None,
                     fix_parameter_value=0.1)

In [None]:
df

## Random Forest

Random Forest Classifiers typically perform quite well over a wide range of parameters. The main parameter to tune is the depth of the individual trees ('max_depth'), which determines the model complexity. Typically, when the number of trees ('n_estimators') is chosen large enough (say, 100 or more), Random Forest classifiers do not easily overfit. This is because the classifier is an ensemble of many tree classifiers. 
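
The benefit of averaging many trees can be sketched on synthetic data by comparing a single tree with forests of different sizes. The toy data and the model labels below are made up for illustration; results on the Bitcoin data will differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy problem (not the Bitcoin data)
X_toy, y_toy = make_classification(n_samples=500, n_features=20,
                                   n_informative=5, random_state=0)

models = {
    'single tree': DecisionTreeClassifier(random_state=0),
    'forest (10 trees)': RandomForestClassifier(n_estimators=10, random_state=0),
    'forest (100 trees)': RandomForestClassifier(n_estimators=100, random_state=0),
}

# Cross-validated ROC-AUC for each model
mean_auc = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring='roc_auc')
    mean_auc[name] = scores.mean()
    print(f"{name}: mean ROC-AUC = {mean_auc[name]:.3f}")
```

A fully grown single tree outputs hard 0/1 predictions, whereas the forest averages over many trees and therefore produces a finer-grained ranking of the samples.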

In [None]:
from sklearn.ensemble import RandomForestClassifier


In [None]:
# Make a dictionary with the parameters you want to scan. 
param_grid = {'max_depth': [2, 5, 10, 20], 'n_estimators':[10, 100]}
rfc = RandomForestClassifier() #Initialize with whatever parameters you want to

grid_rfc = GridSearchCV(rfc, param_grid, cv=5, scoring='roc_auc') #NB: uses StratifiedKFold when cv=int
_ = grid_rfc.fit(X_train, y_train)

In [None]:
df = pd.DataFrame(grid_rfc.cv_results_)
df

In [None]:
# "Slice" the results (fix one parameter, vary another one)
df = slice_gridsearch(grid_rfc, vary_parameter_name='n_estimators', fix_parameter_name='max_depth',
                     fix_parameter_value=10)

In [None]:
# "Slice" the results (fix one parameter, vary another one)
df = slice_gridsearch(grid_rfc, vary_parameter_name='max_depth', fix_parameter_name='n_estimators',
                     fix_parameter_value=100)


In [None]:
df

## Gradient boosted Trees

Gradient boosted trees share some similarities with Random Forests, in that both are ensembles of trees. Whereas a Random Forest classifier consists of trees grown independently of each other, Gradient Boosting grows its trees sequentially, each one addressing the errors of the trees before it. Although scikit-learn does have a Gradient Boosting implementation, it is advised to use LightGBM, one of the most performant implementations in terms of speed and accuracy.
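
The sequential idea can be sketched in a few lines. Note the simplifying assumptions: the example uses squared-error loss on a toy regression problem (where "addressing errors" simply means fitting the residuals), not the classification loss used by LightGBM, and the learning rate and tree depth are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y_toy, y_toy.mean())   # start from a constant model
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(20):
    residuals = y_toy - prediction               # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)   # small corrective step
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

Each shallow tree on its own is a weak model; the training error falls because every new tree is fitted to what the current ensemble still misses. This is also why the number of iterations is a parameter worth tuning: too many rounds can start fitting noise.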

In [None]:
# !pip install lightgbm
from sklearn.ensemble import GradientBoostingClassifier #scikit-learn implementation. Not advised
from lightgbm import LGBMClassifier #roughly two orders of magnitude faster than scikit-learn's implementation

In [None]:
clf_gb = LGBMClassifier()
#clf_gb = GradientBoostingClassifier()
param_grid = {
    #'max_depth':[2, 5, 10], # sklearn and lightgbm implementation
    'num_leaves': [15, 30, 50], # lightgbm implementation
    'num_iterations': [20, 50, 100], # lightgbm implementation
    #'boosting_type': ['gbdt', 'dart', 'goss'], # lightgbm implementation
    }

grid_gb = GridSearchCV(estimator=clf_gb, param_grid=param_grid, cv=5, scoring='roc_auc') # score on ROC-AUC, as for the other classifiers

In [None]:
grid_gb.fit(X_train, y_train)

In [None]:
df = slice_gridsearch(grid_gb, vary_parameter_name='num_leaves', fix_parameter_name='num_iterations',
                     fix_parameter_value=100)

In [None]:
df = slice_gridsearch(grid_gb, vary_parameter_name='num_iterations', fix_parameter_name='num_leaves',
                     fix_parameter_value=50)

In [None]:
df

## Feedforward Neural Network

The feedforward neural network consists of consecutive layers of neurons. Their weights and biases need to be trained with the training data. We make use of TensorFlow for speed of calculation, and Keras as a wrapper around TensorFlow to make the construction of the neural network easier. We will not use dropout, although this is generally recommended as a regularization measure. If desired, feel free to add a dropout layer.

Neural networks typically have a lot of parameters that can be tuned: the number of layers, the width of the layers, the activation functions, the batch size, and the optimizer. Given the short time available, it is advised to scan only the width of the first two layers and the batch size.
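
To get a feeling for how the layer width drives model size, the sketch below counts the trainable parameters of a network with two hidden layers of equal width and a single output neuron. The function name `count_dense_parameters` is made up for this example, and the number of input features is a placeholder assumption; in the notebook, `X_train.shape[1]` gives the actual value.

```python
def count_dense_parameters(n_features, width):
    """Trainable parameters of Dense(width) -> Dense(width) -> Dense(1)."""
    layer_sizes = [n_features, width, width, 1]
    # Each dense layer has (inputs * outputs) weights plus one bias per output
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_features = 165  # placeholder assumption; use X_train.shape[1] in the notebook

for width in [4, 10]:
    n_params = count_dense_parameters(n_features, width)
    print(f"width={width}: {n_params} trainable parameters")
```

With many input features, the first layer dominates the parameter count, so the width of that layer is the main lever on model complexity.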


In [None]:
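from tensorflow import keras
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# NB: this legacy wrapper was removed in recent TensorFlow/Keras releases;
# if the import fails, the separate scikeras package offers a replacement wrapper
from keras.wrappers.scikit_learn import KerasClassifier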

In [None]:
def build_clf(width):
    ann = keras.models.Sequential()
    ann.add(keras.layers.Dense(units=width, activation='relu'))
    ann.add(keras.layers.Dense(units=width, activation='relu'))
    ann.add(keras.layers.Dense(units=1, activation='sigmoid'))
    ann.compile(optimizer='adam', loss='binary_crossentropy', 
                metrics=['accuracy', keras.metrics.AUC()])
    return ann

In [None]:
model = KerasClassifier(build_fn=build_clf, epochs=2)
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('ann', model)
])

In [None]:
?KerasClassifier

In [None]:
params={'ann__batch_size':[8, 16], 
        'ann__width':[4, 10],
        }
gs_nn = GridSearchCV(estimator=pipeline, param_grid=params, cv=2)
# now fit the dataset to the GridSearchCV object. 
_ = gs_nn.fit(X_train, y_train)

In [None]:
df = slice_gridsearch(gs_nn, vary_parameter_name='ann__batch_size', fix_parameter_name='ann__width',
                     fix_parameter_value=4)

In [None]:
df = slice_gridsearch(gs_nn, vary_parameter_name='ann__width', fix_parameter_name='ann__batch_size',
                     fix_parameter_value=16)

In [None]:
df

# Evaluation

Having optimized the hyperparameters of our chosen classifier using cross-validation, we will use this classifier to generate our predictions.

The most straightforward option is to use the .best_estimator_ attribute to access the best performing classifier
according to the cross-validation. By default (as determined by the `refit` argument of the grid search object), this best estimator is refitted on the entire training data.

We will use the .predict_proba() method to generate scores (that may be interpreted as probabilities) on the test data. It returns one column per class; we select the probabilities of the class being 1 (True) by indexing the second column with "[:, 1]".
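
The following minimal sketch shows these steps on synthetic data, so that the shapes involved are easy to inspect. The toy data, the small parameter grid and the variable names are chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy problem (not the Bitcoin data)
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy,
                                          random_state=0)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {'max_depth': [2, 5]}, cv=3, scoring='roc_auc')
grid.fit(X_tr, y_tr)   # refit=True by default: best model is refitted on all of X_tr

best = grid.best_estimator_            # an attribute, not a method
proba = best.predict_proba(X_te)       # shape (n_samples, n_classes)

# Columns follow best.classes_, so column 1 holds the scores for class 1
scores = proba[:, 1]
print(best.classes_, proba.shape)
print(f"test ROC-AUC: {roc_auc_score(y_te, scores):.3f}")
```

Checking `classes_` before slicing is a cheap safeguard against picking the wrong column when labels are not encoded as 0 and 1.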

In [None]:
from sklearn.metrics import roc_auc_score
from ml_utils import plot_outlier_scores, plot_top_N # module with helper functions
y_test = pd.read_csv('y_test_supervised.csv.zip')['class']

In [None]:
# Generate predictions on the test data (X_test) with our selected and optimized classifier 
# We will use the Gridsearch object that has the DecisionTreeClassifier, grid_dtc
# Replace this object with yours

y_pred_dtc = grid_dtc.best_estimator_.predict_proba(X_test)[:, 1]
print(f'The ROC-AUC test score: {roc_auc_score(y_test, y_pred_dtc):.3f}')

In [None]:
?plot_outlier_scores

In [None]:
_ = plot_outlier_scores(y_test, y_pred_dtc, bw=0.01)

In [None]:
_ = plot_top_N(y_test, y_pred_dtc, N=100)

In [None]:
y_pred = grid_lr.best_estimator_.predict_proba(X_test)[:, 1]
_ = plot_outlier_scores(y_test, y_pred, bw=0.01)

In [None]:
y_pred = grid_rfc.best_estimator_.predict_proba(X_test)[:, 1]
_ = plot_outlier_scores(y_test, y_pred, bw=0.01)

In [None]:
y_pred = grid_gb.best_estimator_.predict_proba(X_test)[:, 1]
_ = plot_outlier_scores(y_test, y_pred, bw=0.01)

In [None]:
y_pred = gs_nn.best_estimator_.predict_proba(X_test)[:, 1]
_ = plot_outlier_scores(y_test, y_pred, bw=0.01)