<a href="https://colab.research.google.com/github/DonErnesto/masterclassSFI_2021/blob/main/notebooks/BitcoinSupervised_AutoML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anti-Money Laundering with AutoML

**Introduction**

This notebook shows how to use AutoML to automate the hyperparameter optimization and model selection loop in supervised learning. We use H2O here, but other AutoML frameworks exist.
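
To make concrete what is being automated, here is a minimal sketch of a manual model selection loop with scikit-learn. The synthetic data, the two candidate models and the 3-fold setup are assumptions chosen for illustration; they are not the notebook's data or the search space H2O uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced toy data (an assumption; not the notebook's dataset)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, weights=[0.89, 0.11], random_state=0
)

# Hypothetical candidate list; an AutoML framework searches a much larger space
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score every candidate with 3-fold cross-validated ROC-AUC
cv_scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=3, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

# Keep the candidate with the highest mean score
best_name = max(cv_scores, key=cv_scores.get)
print(best_name, round(cv_scores[best_name], 3))
```

An AutoML run repeats this kind of loop over many model families and hyperparameter settings, and reports the results as a ranked leaderboard.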


In [None]:
# Download the helper module and the data from GitHub
import os
force_download = False
if force_download or not os.path.exists('X_train_supervised.csv.zip'): # then probably nothing was downloaded yet
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/ml_utils.py
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/X_train_supervised.csv.zip
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/y_train_supervised.csv.zip
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/X_test_supervised.csv.zip
    !curl -O https://raw.githubusercontent.com/DonErnesto/masterclassSFI_2021/main/data/y_test_supervised.csv.zip
    

In [None]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd # data I/O and manipulation
import numpy as np # numeric operations
import matplotlib.pyplot as plt
import seaborn as sns

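# plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2;
# RocCurveDisplay and PrecisionRecallDisplay are their replacements
from sklearn.metrics import roc_auc_score, average_precision_score, RocCurveDisplay, PrecisionRecallDisplay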
from ml_utils import grouped_boxplot_gridsearch, plot_conditional_distribution


In [None]:
# !pip install requests
# !pip install tabulate
!pip install "colorama>=0.3.8"
!pip install future
!pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o

In [None]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

**Data loading**

In [None]:
X_train = pd.read_csv('X_train_supervised.csv.zip')
X_test = pd.read_csv('X_test_supervised.csv.zip')
y_train = pd.read_csv('y_train_supervised.csv.zip')['class']
y_test = pd.read_csv('y_test_supervised.csv.zip')['class']


In [None]:
# Remove the unwanted columns: the transaction identifier (txId) and the time step
X_train = X_train.drop(columns=['txId', 'Time step'])
X_test = X_test.drop(columns=['txId', 'Time step'])

In [None]:
print(X_train.shape, '\n')
print(y_train.value_counts(normalize=True))

There are 33.4k data points, of which 11% are positive. This is quite a large fraction in a financial crime context.
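
The class balance matters for how scores should be read. As a small sketch on toy labels (hypothetical, built only to match the 11% positive rate, not the notebook's data), the following shows that a classifier which always predicts the majority class already reaches high accuracy, which is one reason the notebook evaluates with ROC-AUC instead.

```python
from collections import Counter

# Toy labels with 11% positives (hypothetical; not the notebook's data)
labels = [1] * 11 + [0] * 89

# Relative class frequencies, analogous to value_counts(normalize=True)
counts = Counter(labels)
fractions = {cls: n / len(labels) for cls, n in counts.items()}
print(fractions)

# A trivial classifier that always predicts the majority class
majority_class = counts.most_common(1)[0][0]
baseline_accuracy = sum(label == majority_class for label in labels) / len(labels)
print(f"Always predicting {majority_class} gives accuracy {baseline_accuracy:.2f}")
```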

In [None]:
# Data preparation
train = h2o.H2OFrame(pd.concat((X_train, y_train), axis=1))
test = h2o.H2OFrame(pd.concat((X_test, y_test), axis=1))

x = train.columns
y = "class"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()


# Model fitting
# The run stops at 20 models or after 900 seconds, whichever limit is hit first
aml = H2OAutoML(max_models=20, max_runtime_secs=900, seed=1)
aml.train(x=x, y=y, training_frame=train)

In [None]:
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

In [None]:
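# Predict with the leaderboard's best model; column 'p1' holds the predicted probability of class 1
preds = aml.predict(test)
y_pred = preds.as_data_frame()['p1']
print(f'ROC-AUC score of the best AutoML model: {roc_auc_score(y_test, y_pred):.3f}')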
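
To interpret the ROC-AUC score printed above: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it by brute force over all such pairs. The function name `pairwise_auc` and the toy inputs are illustrative assumptions; this is not how scikit-learn computes the metric internally, and the quadratic loop is only suitable for small inputs.

```python
def pairwise_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 (positive, negative) pairs are ordered correctly
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.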