# Part 1: train initial TransacCheck models

## Welcome

In this tutorial you will train a machine learning model to be used in production... 

We'll use a well-known dataset of credit card transactions, each classified as legitimate or fraudulent. Most of the features are components generated by a [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis), which was applied to remove any personally identifiable information.

During these assignments we will use the widely popular Python libraries [Pandas (v1.3.5)](https://pandas.pydata.org/pandas-docs/version/1.3/index.html), [sklearn (v1.0.2)](https://scikit-learn.org/1.0/user_guide.html) and [matplotlib (v3.2.2)](https://matplotlib.org/3.2.2/tutorials/index.html).

You will be programming in pairs, so try to form pairs in which at least one of you is comfortable with Python and preferably also with the libraries used. If not, it's no problem; we're here to help throughout the assignments.

[<img src="https://www.monkeyuser.com/assets/images/2020/178-pair-programming.png" alt="drawing" width="400"/>](https://www.monkeyuser.com/2020/pair-programming/)

## Download data

To start, run the next cell to download the data and other source code used during the assignments: highlight the cell and press Shift + Enter.


In [None]:
!git clone https://github.com/SIDN/tma22_ml.git

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import clone
from sklearn.base import BaseEstimator
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import *
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.naive_bayes import GaussianNB

## Data exploration

### Read data

Let's load all the available data and freely explore it, just to see what we're dealing with. For now, we'll discard the target class (fraudulent or legitimate transaction), because in real life we don't have this much ground truth data in the exploration phase.

In [None]:
url_creditcard_holdout = 'tma22_ml/data/creditcard_holdout.csv.gz'
url_creditcard_week1_2 = 'tma22_ml/data/creditcard_1-2.csv.gz'
url_creditcard_week3_52 = 'tma22_ml/data/creditcard_3-52.csv.gz'

all_data = pd.concat([pd.read_csv(url_creditcard_week1_2, compression='gzip'), 
                      pd.read_csv(url_creditcard_week3_52, compression='gzip')]).drop('target', axis=1)
all_data.describe()

### Explore data

We visualize the distribution of the raw transaction amount and the log transformed amount using [boxplots](https://en.wikipedia.org/wiki/Box_plot).

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
ax1.set_title('Transaction amount')
ax2.set_title('Transaction amount (log transformed)')

ax1.boxplot(all_data.amount)
ax2.boxplot(np.log(all_data.amount + 1))
plt.show()

It is clear that the scale of the amount variable is not well suited to the models we're dealing with. Most of the transactions are under $100, with outliers reaching thousands of dollars. Such large values can give the amount a disproportionate influence on the model compared to the other variables.

Let's scale the data to a smaller range with a log transform:

In [None]:
all_data['log_amount'] = np.log(all_data.amount + 1)
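
To get a feel for what this transform does, here is a minimal sketch on a handful of made-up amounts (the values are illustrative and not taken from the dataset). It checks that `np.log(x + 1)` matches NumPy's `np.log1p(x)`, and shows that a zero amount stays at zero while large amounts are compressed.

```python
import numpy as np

# Hypothetical transaction amounts, chosen only for illustration
amounts = np.array([0.0, 1.0, 10.0, 100.0, 10_000.0])

# The transform used in this notebook
log_amounts = np.log(amounts + 1)

# np.log1p computes log(1 + x) directly and should give the same values
same = np.allclose(log_amounts, np.log1p(amounts))

print(log_amounts.round(2))
print("equal to np.log1p:", same)
print("range before:", amounts.max() - amounts.min())
print("range after: ", round(log_amounts.max() - log_amounts.min(), 2))
```

Adding 1 before taking the logarithm keeps transactions with an amount of 0 well defined, since `log(0)` is not finite.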

**Assignment:** Explore the data further in the next cell! Create some graphs to visualize the data and get to know it better.

In [None]:
### Your code here
...

###

## Initial dataset

We select the first two weeks of data as our initial training set.
This represents a short period in which an equal number of suspicious and legitimate transactions were thoroughly examined by our support staff to determine their label. \
Don't forget to log-transform the amount column.

In [None]:
# Load all data from week 1 and 2
data_weeks_1_2 = pd.read_csv(url_creditcard_week1_2, compression='gzip')
# Transform amount to the log of amount
data_weeks_1_2['amount'] = np.log(data_weeks_1_2.amount + 1)

# We simulate the scenario where we only have the ground truth labels for 30 fraudulent and 30 legitimate transactions; this is what our support staff has labeled.
initial_dataset = (data_weeks_1_2
                  .sample(frac=1, random_state=42)  # Shuffle the data to get random data points
                  .groupby('target')  # Group the data in two groups: fraudulent and legitimate
                  .head(30))  # Take the first 30 data points from each group

initial_dataset.target.value_counts().plot(kind='bar', title='Ground truth labels')
initial_dataset.describe()

In [None]:
initial_dataset.shape
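
The shuffle, group, and `head` pattern used above can be tried on a tiny toy frame. This is only a sketch: the `toy` DataFrame and `n_per_class` are hypothetical stand-ins for the real data and for the 30 samples per class.

```python
import pandas as pd

# Toy data: 8 legitimate (0) and 4 fraudulent (1) rows, purely illustrative
toy = pd.DataFrame({
    'amount': range(12),
    'target': [0] * 8 + [1] * 4,
})

n_per_class = 3  # the notebook uses 30

balanced = (toy
            .sample(frac=1, random_state=42)  # shuffle all rows
            .groupby('target')                # split the rows by class
            .head(n_per_class))               # keep the first n rows of each class

counts = balanced.target.value_counts()
print(counts)
```

Because the rows are shuffled first, taking the first rows of each group amounts to drawing a random sample of equal size from each class.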

## Classifier & scoring

Now we have 60 data points with their ground truth labels; we can train our initial model!

**Assignment:** Define at least two classifiers (you can use those implemented in the [sklearn module](https://scikit-learn.org/stable/supervised_learning.html)) and evaluate their performance using [k-fold cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#k-fold) on our limited dataset. Try at least one interpretable model. Feel free to experiment with the model parameters.
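
Before running cross-validation, it may help to see how small the folds are. When `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified k-fold splitting, so the sketch below applies `StratifiedKFold` to a stand-in for our 60 labels (`X_toy` and `y_toy` are placeholders, not the real data).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-in for the 60 labeled transactions: 30 legitimate (0), 30 fraudulent (1)
y_toy = np.array([0] * 30 + [1] * 30)
X_toy = np.zeros((60, 1))  # the feature values do not matter for the split itself

skf = StratifiedKFold(n_splits=5)

fold_sizes = []
fraud_per_fold = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    fold_sizes.append(len(test_idx))                   # size of each test fold
    fraud_per_fold.append(int(y_toy[test_idx].sum()))  # fraudulent samples in it

print("test fold sizes:     ", fold_sizes)
print("fraudulent per fold: ", fraud_per_fold)
```

Each score is thus computed on only a dozen transactions, which is worth keeping in mind when reading the mean and standard deviation across folds.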

In [None]:
# Let's define some metrics by which we can measure the performance of our model
scoring = {
    'ap': 'average_precision',  # Average precision, computed from the classifier's scores (weighted mean of precisions achieved at each threshold)
    'precision': make_scorer(precision_score),  # Precision (true positives / all positive classifications)
    'recall': make_scorer(recall_score),  # Recall (fraction of fraudulent samples the classifier found)
    'specificity': make_scorer(recall_score, pos_label=0),  # Specificity (fraction of legitimate samples classified as legitimate)
}

### Your code here
# Some examples: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

clf = RandomForestClassifier()

clf2 = ...

###

classifier = clf

# Using cross-validation, we evaluate the model using the four scoring metrics defined above.
# Documentation on cross-validation https://scikit-learn.org/stable/modules/cross_validation.html
scores = pd.DataFrame(cross_validate(classifier, initial_dataset.loc[:, initial_dataset.columns != 'target'], initial_dataset.target, cv=5, scoring=scoring))
scores.agg(['mean', 'std'])
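
To make the metrics in `scoring` concrete, the following sketch derives precision, recall, and specificity by hand from a confusion matrix and compares them with the sklearn functions. The labels in `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = fraudulent, 0 = legitimate
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)    # flagged transactions that really are fraudulent
recall = tp / (tp + fn)       # fraudulent transactions that were found
specificity = tn / (tn + fp)  # legitimate transactions classified as legitimate

# The same values via the functions used in the scoring dictionary
matches = (
    np.isclose(precision, precision_score(y_true, y_pred))
    and np.isclose(recall, recall_score(y_true, y_pred))
    and np.isclose(specificity, recall_score(y_true, y_pred, pos_label=0))
)

print(precision, recall, specificity, matches)
```

Specificity is simply recall computed for the legitimate class, which is why the scorer above reuses `recall_score` with `pos_label=0`.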

### Testing on the holdout set

If we look at the table above, the model seems to perform well. The scores are high and the standard deviation between folds is low, indicating that the performance is stable.

Let's verify these results with a representative hold-out set. Keep in mind that in practice we do not have the luxury of such a large hold-out set to test our performance on. Often, we have to make do with heuristics and weakly labeled data.

In [None]:
holdout_dataset = pd.read_csv(url_creditcard_holdout, compression='gzip')  # Load data from csv
holdout_dataset['amount'] = np.log(holdout_dataset.amount + 1)  # Log-transform the amount column

**Assignment:** Visualize the target class distribution to get an idea about the skew we are dealing with.

In [None]:
### Your code here

###

**Assignment:** Given this distribution, which evaluation curve is best suited to our model (ROC, Precision-Recall, DET, etc.)? 

In the cell below we compute the model's probability of fraud for each transaction in the hold-out set. Use these probabilities to evaluate the model's performance. 
Check out the [`sklearn.metrics` documentation](https://scikit-learn.org/stable/modules/classes.html#classification-metrics) for a list of evaluation metrics. Also scroll down to the [plotting](https://scikit-learn.org/stable/modules/classes.html#id4) section for some useful plot generators. ([example](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.PrecisionRecallDisplay.html#sklearn.metrics.PrecisionRecallDisplay.from_predictions))
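
One thing that can inform the choice of curve is what an uninformative classifier scores under each summary. The sketch below uses synthetic labels with an assumed fraud rate of roughly 1% and purely random scores; none of it comes from the credit card data.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 100_000

# Synthetic, heavily skewed labels: roughly 1% positives (an assumed rate)
y_true = (rng.random(n) < 0.01).astype(int)

# Scores from a "classifier" that knows nothing: pure noise
y_score = rng.random(n)

prevalence = y_true.mean()
roc_auc = roc_auc_score(y_true, y_score)
ap = average_precision_score(y_true, y_score)

print(f"prevalence:        {prevalence:.3f}")
print(f"ROC AUC:           {roc_auc:.3f}")
print(f"average precision: {ap:.3f}")
```

For random scores the ROC AUC sits near 0.5 regardless of the skew, whereas the average precision sits near the fraction of positives. The baseline to compare against therefore differs between the two views.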

In [None]:
# First we fit our classifier to the initial dataset for which we have the ground truth data
X = initial_dataset.loc[:, initial_dataset.columns != 'target']  # All columns except the target
y = initial_dataset.target  # target class
classifier.fit(X, y)

# Now let's test the performance on the holdout set, starting from the model's probability scores
y_proba = classifier.predict_proba(holdout_dataset.loc[:, holdout_dataset.columns != 'target'])[:, 1]  # The model's probability scores of a transaction being fraudulent


### Your code here, plot the evaluation curve of your choice

###

### Tuning the model

As we have seen, most classifiers return probabilities between 0 and 1.
We thus need to choose a threshold above which our model should classify a transaction as fraudulent.

In [None]:
precision, recall, thresholds = precision_recall_curve(holdout_dataset.target, y_proba)
thresholds = np.append(thresholds, np.nan)

fig = plt.figure(figsize=(10, 6))
plt.title('Precision vs Recall at various thresholds')
plt.xlabel('Decision threshold')
plt.plot(thresholds, precision, label='Precision');
plt.plot(thresholds, recall, label='Recall');
plt.grid();
plt.legend();
plt.show()
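
The `np.append(thresholds, np.nan)` step in the cell above is needed because `precision_recall_curve` returns one more precision and recall value than thresholds: the final point (precision 1, recall 0) has no corresponding threshold. A small sketch with toy labels and scores, used here only for illustration:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores, for illustration only
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# One more precision/recall value than thresholds
print(len(precision), len(recall), len(thresholds))

# Padding with NaN aligns the arrays so they can be plotted against each other;
# matplotlib skips the NaN point
padded = np.append(thresholds, np.nan)
print(len(padded) == len(precision))
```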

**Assignment:** Play around with different thresholds in the cell below to see how the outcome changes with respect to true positives, false positives, true negatives and false negatives.


In [None]:
threshold = ...
assert isinstance(threshold, float), "Threshold should be a float."
assert 0 <= threshold <= 1, "Threshold must be between 0 and 1."
y_pred = y_proba > threshold

ConfusionMatrixDisplay.from_predictions(holdout_dataset.target, y_pred, display_labels=['Benign', 'Fraud']);
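
As a rough illustration of how the threshold shifts the four counts, the sketch below applies two thresholds to a few made-up probabilities. The arrays and the helper `counts_at` are hypothetical and independent of the hold-out set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels (1 = fraudulent) and predicted fraud probabilities
y_true = np.array([0, 0, 0, 0, 1, 1])
y_proba = np.array([0.1, 0.4, 0.6, 0.2, 0.7, 0.9])

def counts_at(threshold):
    """Return the confusion-matrix counts when classifying at `threshold`."""
    y_pred = (y_proba > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {'tn': int(tn), 'fp': int(fp), 'fn': int(fn), 'tp': int(tp)}

low = counts_at(0.5)   # lenient threshold: more transactions are flagged
high = counts_at(0.8)  # strict threshold: fewer transactions are flagged

print("threshold 0.5:", low)
print("threshold 0.8:", high)
```

In this toy case, raising the threshold removes the false positive but at the cost of missing one fraudulent transaction.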

**Assignment:** Think about what these values mean for a production environment. 

Take the suspicious registration checker we presented as an example. Evaluating a suspicious domain name can take up to 15 minutes. Some simple calculations reveal that, with 2500 new registrations per day, a precision of 99% means 25 false positives, taking around 6 hours to evaluate! That's 0.8 FTE wasted on checking false-positive results.
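
The calculation above can be reproduced in a few lines. This sketch follows the text in applying the 1% error to all 2500 daily registrations; strictly speaking, the number of false positives at a given precision depends on how many registrations the model flags. The 8-hour working day is an assumption that is not stated in the text.

```python
registrations_per_day = 2500
error_rate = 0.01        # the 1% that the text associates with 99% precision
minutes_per_check = 15   # upper bound mentioned in the text
hours_per_workday = 8    # assumption: length of one working day

false_positives = registrations_per_day * error_rate
hours_spent = false_positives * minutes_per_check / 60
fte = hours_spent / hours_per_workday

print(f"false positives per day: {false_positives:.0f}")
print(f"hours spent per day:     {hours_spent:.2f}")
print(f"FTE:                     {fte:.2f}")
```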

This shows that a model should be evaluated with the production environment it will be deployed in in mind. Advancing the state of the art, for example by obtaining 97% precision, sounds good on paper, but can still be unusable in real life. 

**Assignment:** Submit the classifier and threshold you selected through Menti (code 3393 6819).



