## Part 2: Improving the model with Active Learning

We simulate an operational environment in which analysts can label only 50 data points per week to use as additional training data; they simply don't have time to label more.

The challenge is to select the data points whose labels you expect to improve the model the most. Unfortunately, you can only determine afterwards whether the additional training data actually improved the model.

[<img src="https://miro.medium.com/max/1400/0*1K-VniGulGWWsQA9" alt="meme" width="500"/>](https://towardsdatascience.com/use-active-learning-to-boost-your-ml-problem-53c70f72b979)

#### Let's start by importing the necessary libraries and loading all the available data...

In [None]:
!git clone https://github.com/SIDN/tma22_ml.git

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from typing import List
from sklearn import clone
from sklearn.base import BaseEstimator
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import *

In [None]:
path_to_creditcard_holdout = 'tma22_ml/data/creditcard_holdout.csv.gz'
path_to_creditcard_week1_2 = 'tma22_ml/data/creditcard_1-2.csv.gz'
path_to_creditcard_week3_52 = 'tma22_ml/data/creditcard_3-52.csv.gz'

# All data from weeks 1 and 2
data_weeks_1_2 = pd.read_csv(path_to_creditcard_week1_2, compression='gzip')
data_weeks_1_2['amount'] = np.log(data_weeks_1_2.amount + 1)

# Our training data currently consists of the 60 initially labeled data points
training_data = (data_weeks_1_2
                  .sample(frac=1, random_state=42)  # Shuffle the data to get random data points
                  .groupby('target')  # Group the data into two groups: legitimate and fraudulent
                  .head(30))  # Take the first 30 data points from each group

# Load the holdout dataset
holdout_dataset = pd.read_csv(path_to_creditcard_holdout, compression='gzip')
holdout_dataset['amount'] = np.log(holdout_dataset.amount + 1)

# All data from weeks 3 through 52 -- including labels. We'll simulate going through this dataset week by week
data_weeks_3_52 = pd.read_csv(path_to_creditcard_week3_52, compression='gzip')
data_weeks_3_52['amount'] = np.log(data_weeks_3_52.amount + 1)

# Our omniscient oracle which we can query for the labels once we have made our sample selection
oracle = data_weeks_3_52.target

# The now unlabeled data for weeks 3 - 52.
data_weeks_3_52.drop('target', axis=1, inplace=True)

Now it is time to go through each week, selecting 50 samples following a sampling strategy. To get you started we implemented 3 common sampling methods:

- `UniformSampling`: Just select a number of random data points
- `UncertaintySampling`: Select the samples closest to the decision boundary of the model
- `CommitteeDisagreementSampling`: Select the samples on which a "committee" of models disagrees (some members say it's legitimate, others say it's fraudulent)
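
To make the intuition behind the last two strategies concrete, here is a minimal, dependency-free sketch. It is not the code from the repository: the function names `uncertainty_select` and `committee_disagreement_select` are made up for illustration, and the sketch assumes binary 0/1 labels with 0.5 as the decision threshold.

```python
from typing import List, Sequence


def uncertainty_select(fraud_probabilities: Sequence[float], nr_samples: int) -> List[int]:
    """Return the positions of the nr_samples scores closest to 0.5."""
    # The distance to 0.5 tells us how far a sample lies from the decision
    # boundary of a binary classifier that thresholds its probability at 0.5.
    distance = [abs(p - 0.5) for p in fraud_probabilities]
    ranked = sorted(range(len(distance)), key=lambda i: distance[i])
    return ranked[:nr_samples]


def committee_disagreement_select(votes: Sequence[Sequence[int]], nr_samples: int) -> List[int]:
    """Return the positions of the nr_samples samples with the most evenly split vote.

    votes[m][i] is the 0/1 prediction of committee member m for sample i.
    """
    n_members = len(votes)
    n_samples = len(votes[0])
    # Fraction of members voting "fraudulent" (1) for each sample
    fraud_fraction = [sum(member[i] for member in votes) / n_members for i in range(n_samples)]
    # A 50/50 split is maximal disagreement; 0.0 or 1.0 is a unanimous vote
    disagreement = [0.5 - abs(f - 0.5) for f in fraud_fraction]
    ranked = sorted(range(n_samples), key=lambda i: disagreement[i], reverse=True)
    return ranked[:nr_samples]
```

With scikit-learn models, the inputs would typically come from `predict_proba(...)[:, 1]` for the first function and from each committee member's `predict(...)` for the second.
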

These strategies are defined in the `al` directory. If you have some time left, try to come up with your own strategy!
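
As a starting point for your own strategy, the sketch below mixes uncertainty sampling with a bit of random exploration. Treat it as an assumption-laden template rather than part of the workshop code: the class name `MixedSampling` and the parameters `explore_fraction` and `seed` are hypothetical, and the `select_batch` signature is only inferred from how the strategies are called later in this notebook (a pool with an `.index`, a fitted `model` with `predict_proba`, and extra keyword arguments such as `committee`).

```python
import random
from typing import Any, List


class MixedSampling:
    """Hypothetical strategy: mostly uncertainty sampling, partly random exploration."""

    @staticmethod
    def select_batch(unlabeled_pool: Any, nr_samples: int, model: Any,
                     explore_fraction: float = 0.2, seed: int = 42, **kwargs: Any) -> List[Any]:
        # **kwargs swallows arguments this strategy does not need, e.g. committee=...
        index_labels = list(unlabeled_pool.index)
        nr_samples = min(nr_samples, len(index_labels))
        nr_random = int(nr_samples * explore_fraction)
        nr_uncertain = nr_samples - nr_random

        # Column 1 of predict_proba is the score for the positive (fraudulent) class
        fraud_scores = [row[1] for row in model.predict_proba(unlabeled_pool)]
        by_uncertainty = sorted(zip(index_labels, fraud_scores),
                                key=lambda pair: abs(pair[1] - 0.5))
        ranked = [label for label, _ in by_uncertainty]

        # Most of the batch: the points closest to the decision boundary
        selected = ranked[:nr_uncertain]
        # Rest of the batch: random picks from the points not selected yet
        rng = random.Random(seed)
        selected += rng.sample(ranked[nr_uncertain:], nr_random)
        return selected
```

Because `select_batch` is a static method here, both `strategy = MixedSampling` and `strategy = MixedSampling()` would work with a call of the form `strategy.select_batch(pool, nr_samples=50, model=classifier, committee=committee)`. The random share is a design choice: pure uncertainty sampling can keep querying the same region of the feature space, and a few random points guard against that.
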

In [None]:
from tma22_ml.al.uniform_sampling import UniformSampling
from tma22_ml.al.uncertainty_sampling import UncertaintySampling
from tma22_ml.al.committee_sampling import CommitteeDisagreementSampling

In [None]:
# Create a voting committee of models to be used in the CommitteeDisagreementSampling strategy:
# This should be a list of sklearn classifiers, and can be any combination.
# Try out different models and different model parameters

### Your code here

committee: List[BaseEstimator] = ... # e.g. [RandomForestClassifier(random_state=i) for i in range(10)]

###

In [None]:
# Here you can choose between two modes of exploring the Active Learning magic: 
# Interactive mode: where you can re-run the next cell multiple times to see the results of each week of added labeled data. This way you can choose a different sampling strategy for each week.
# Automatic mode: simulate a number of weeks of active learning with a sampling strategy
# If you want to try again using another strategy or mode, run all the cells from the top of this notebook. To be extra safe you can restart the runtime.

### Your code here

INTERACTIVE_MODE = False  # You choose: Interactive or automatic

classifier = ...  # Your best classifier from part 1 - you might need to import the model from the sklearn library

###

classifier.fit(training_data.loc[:, training_data.columns!='target'], training_data.target)

ap_per_week = []
precision_recall_per_week = []
labels_per_week = []

week_nr = 3  # Week from which to start simulating the active learning loop (3 - 52)

if not INTERACTIVE_MODE:  # Automatic mode
    weeks_to_simulate = 30
    assert weeks_to_simulate <= 50, "We have one year of data and start at week 3"

In [None]:
####### Choose your sample strategy:

strategy = ...

#######

# for each week in automatic mode, or one week in interactive mode:
for week_nr in [week_nr] if INTERACTIVE_MODE else tqdm(range(week_nr, week_nr + weeks_to_simulate)):
    if week_nr > 52:
        print('No more data to simulate active learning.')
        break
        
    # The pool of unlabeled data consists of all data of this one week
    unlabeled_pool = data_weeks_3_52[data_weeks_3_52.week_no == week_nr]
    
    # Fit the classifier to the currently available training data
    classifier.fit(training_data.loc[:, training_data.columns!='target'], training_data.target)
    
    # If the strategy is a Committee Disagreement Sampling strategy, we also need to fit the committee models
    if strategy is CommitteeDisagreementSampling or isinstance(strategy, CommitteeDisagreementSampling):
        for model in committee:
            model.fit(training_data.loc[:, training_data.columns!='target'], training_data.target)

    # Select unlabeled data points for which we want analysts to provide the ground truth label
    selected_indices = strategy.select_batch(unlabeled_pool, nr_samples=50, model=classifier, committee=committee)
    queried_data = unlabeled_pool.loc[selected_indices].copy()  # Copy, so adding the label column does not touch the pool
    queried_data['target'] = oracle.loc[selected_indices]

    # Add the now labeled data to our training dataset
    training_data = pd.concat([training_data, queried_data])

    # Evaluate the new model on the holdout set! (cheat mode)
    classifier.fit(training_data.loc[:, training_data.columns!='target'], training_data.target)
    # The model's hard class predictions (used for precision and recall)
    y_pred = classifier.predict(holdout_dataset.loc[:, holdout_dataset.columns != 'target'])
    # The model's probability scores of a transaction being fraudulent
    y_proba = classifier.predict_proba(holdout_dataset.loc[:, holdout_dataset.columns != 'target'])[:, 1]
    
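    # Sort by class label so the counts are always ordered [legitimate, fraudulent] (assuming labels 0 and 1)
    labels_per_week.append(training_data.target.value_counts().sort_index().values.tolist())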
    ap_per_week.append(average_precision_score(holdout_dataset.target, y_proba))
    precision_recall_per_week.append([precision_score(holdout_dataset.target, y_pred), recall_score(holdout_dataset.target, y_pred)])
    
    if INTERACTIVE_MODE:
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
        fig.suptitle(f'week_nr {week_nr} \n Re-run this cell for the next week', fontsize=16)
        ax2.set_title('Precision - recall curve')
        training_data.target.value_counts().plot(kind='bar', title='Training data labels', ax=ax1)
        ax1.grid(axis='y')
        PrecisionRecallDisplay.from_predictions(holdout_dataset.target, y_proba, ax=ax2);
        week_nr += 1

#### Now let's look at a summary of the active learning experiment. Did our model improve?

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(18, 14))
fig.suptitle('Active learning summary', fontsize=16)
plt.setp([ax1, ax2, ax3], xlabel='AL iteration')

# Samples per iteration
ax1.plot(range(1, len(labels_per_week) + 1), labels_per_week); ax1.grid(axis='y'); ax1.legend(['Number legitimate', 'Number fraudulent']); ax1.set_title('Training samples per iteration');
ax2.plot(range(1, len(ap_per_week) + 1), ap_per_week); ax2.grid(axis='y'); ax2.set_title('Average Precision per iteration');
ax3.plot(range(1, len(precision_recall_per_week) + 1), precision_recall_per_week); ax3.grid(axis='y'); ax3.legend(['Precision', 'Recall']); ax3.set_title('Precision and recall per iteration');
PrecisionRecallDisplay.from_predictions(holdout_dataset.target, y_proba, ax=ax4); ax4.set_title('Final precision / recall curve');

### Which of the sampling strategies worked best overall?
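
One possible way to approach this question is to keep the `ap_per_week` list of every run and rank the strategies on it. The helper below is a hedged sketch: `rank_strategies` is a hypothetical name, and ranking by the mean Average Precision over all iterations is just one reasonable choice (it rewards strategies that improve early, not only those that end high).

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_strategies(ap_by_strategy: Dict[str, List[float]]) -> List[Tuple[str, float, float]]:
    """Return (name, mean AP, final AP) per strategy, best mean AP first."""
    # Skip strategies without any recorded iteration
    summary = [(name, mean(aps), aps[-1]) for name, aps in ap_by_strategy.items() if aps]
    return sorted(summary, key=lambda row: row[1], reverse=True)
```

A call could look like `rank_strategies({'UniformSampling': ap_uniform, 'UncertaintySampling': ap_uncertainty})`, where `ap_uniform` and `ap_uncertainty` are placeholder names for copies of `ap_per_week` that you saved after each run. Note that re-running the notebook from the top resets `ap_per_week`, so the copies need to be stored somewhere that survives the re-run.
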
### How do different classifiers perform before and after active learning?
### How might we further improve the model?