# Part 2: Improving the model with Active Learning

We are simulating an operational environment in which analysts can label 50 data points per week for use as additional training data; they simply don't have the time to label more.

The challenge is to select the data points that you expect to improve the model the most. Unfortunately, you can only find out afterwards whether the model actually improved with the additional training data.

[<img src="https://miro.medium.com/max/1400/0*1K-VniGulGWWsQA9" alt="meme" width="500"/>](https://towardsdatascience.com/use-active-learning-to-boost-your-ml-problem-53c70f72b979)

## Download data


In [None]:
!git clone https://github.com/SIDN/ml_workshop.git

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from typing import List
from sklearn import clone
from sklearn.base import BaseEstimator
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import *

## Read data

In [None]:
path_to_creditcard_holdout = 'ml_workshop/data/creditcard_holdout.csv.gz'
path_to_creditcard_week1_2 = 'ml_workshop/data/creditcard_1-2.csv.gz'
path_to_creditcard_week3_52 = 'ml_workshop/data/creditcard_3-52.csv.gz'

# All data from weeks 1 and 2
data_weeks_1_2 = pd.read_csv(path_to_creditcard_week1_2, compression='gzip')
data_weeks_1_2['amount'] = np.log(data_weeks_1_2.amount + 1)

# Our training data currently consists of the 60 initially labeled data points
training_data = (data_weeks_1_2
                  .sample(frac=1, random_state=42)  # Shuffle the data to get random data points
                  .groupby('target')  # Group the data in two groups: legitimate and fraudulent
                  .head(30))  # Take the first 30 data points from each group

# Load the holdout dataset
holdout_dataset = pd.read_csv(path_to_creditcard_holdout, compression='gzip')
holdout_dataset['amount'] = np.log(holdout_dataset.amount + 1)

# All data from weeks 3 through 52 -- including labels. We'll simulate going through this dataset week by week
data_weeks_3_52 = pd.read_csv(path_to_creditcard_week3_52, compression='gzip')
data_weeks_3_52['amount'] = np.log(data_weeks_3_52.amount + 1)

# Our omniscient oracle which we can query for the labels once we have made our sample selection
oracle = data_weeks_3_52.target

# The now unlabeled data for weeks 3 - 52.
data_weeks_3_52.drop('target', axis=1, inplace=True)
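
The two preprocessing steps used above (log-scaling `amount` and drawing a class-balanced seed set) can be tried on a toy frame first. This is only a minimal sketch: the `toy` frame and its values are made up for illustration and are not taken from the workshop data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the workshop data set)
toy = pd.DataFrame({
    'amount': [0.0, 1.5, 20.0, 300.0, 4000.0, 12.0, 7.5, 99.0],
    'target': [0, 0, 0, 0, 0, 1, 1, 1],
})

# np.log1p(x) computes log(1 + x), the same transform as np.log(x + 1);
# an amount of zero stays zero
toy['log_amount'] = np.log1p(toy.amount)

# Shuffle, then keep the first 2 rows of each class: a balanced seed set
seed_set = (toy
            .sample(frac=1, random_state=42)
            .groupby('target')
            .head(2))

print(seed_set.target.value_counts().sort_index())
```

The log transform compresses the long right tail of the amounts, and `groupby(...).head(n)` after a shuffle is a compact way to draw `n` random rows per class.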

## Define initial classifier

The default is to use a Random Forest classifier. You can also change this to a different method or to the best classifier obtained in part 1.


In [None]:
clf = RandomForestClassifier()

## Introduce sampling strategies

Now it is time to move on to the active learning part. To get you started, we implemented 3 common sampling methods:

- `UniformSampling`: Just select a number of random data points
- `UncertaintySampling`: Select the samples closest to the decision boundary of the model
- `CommitteeDisagreementSampling`: Select samples on which a "committee" of models disagrees (some say it's legitimate, some say it's fraudulent).
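
To make the last two ideas concrete, here is a rough sketch of how "closeness to the decision boundary" and "committee disagreement" could be turned into scores for a binary problem. The function names and formulas are assumptions chosen for illustration; the classes in `ml_workshop/al` may well compute this differently.

```python
import numpy as np

def uncertainty_scores(proba_fraud):
    """Higher score = predicted probability closer to 0.5 (binary case)."""
    proba_fraud = np.asarray(proba_fraud, dtype=float)
    return 1.0 - 2.0 * np.abs(proba_fraud - 0.5)

def disagreement_scores(votes):
    """votes: shape (n_models, n_samples) with 0/1 predictions.
    Higher score = the committee is more evenly split."""
    votes = np.asarray(votes)
    frac_fraud = votes.mean(axis=0)
    return 1.0 - 2.0 * np.abs(frac_fraud - 0.5)

def top_k(scores, k):
    """Positions of the k highest scores."""
    return np.argsort(scores)[::-1][:k]

# Hypothetical fraud probabilities for four transactions
proba = [0.02, 0.48, 0.91, 0.55]
print(top_k(uncertainty_scores(proba), 2))

# Hypothetical votes of three committee members on the same four transactions
votes = [[0, 1, 1, 0],
         [0, 0, 1, 1],
         [0, 1, 1, 1]]
print(top_k(disagreement_scores(votes), 2))
```

Both scores are largest for the samples the current model(s) know least about, which is exactly where a new label is expected to be most informative.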

In [None]:
from ml_workshop.al.uniform_sampling import UniformSampling
from ml_workshop.al.uncertainty_sampling import UncertaintySampling
from ml_workshop.al.committee_sampling import CommitteeDisagreementSampling
from ml_workshop.al import SamplingMethod

All strategies implement the `SamplingMethod` abstract class. Let's read the docs:

In [None]:
?SamplingMethod

All sampling methods implement the `select_batch` method:

In [None]:
?CommitteeDisagreementSampling.select_batch

The strategies are defined in the `ml_workshop/al` directory. If you have some time left, try to come up with your own strategy!
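
As a starting point for your own strategy, the sketch below mixes uncertainty sampling with a bit of random exploration. It is a standalone, hypothetical class (`MixedSampling`, `explore_frac` and `random_state` are made-up names) that only mirrors the way `select_batch` is called in this notebook; how it should plug into `SamplingMethod` is not shown here, so check the class docs before subclassing. It assumes the pool has a unique index and that the model exposes `predict_proba` with the fraud class in the second column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

class MixedSampling:
    """Hypothetical strategy: mostly uncertain points, plus a few random ones."""

    @staticmethod
    def select_batch(unlabeled_pool, nr_samples, model, committee=None,
                     explore_frac=0.2, random_state=0):
        # `committee` is accepted only for call compatibility; it is unused here
        nr_samples = min(nr_samples, len(unlabeled_pool))
        nr_random = int(nr_samples * explore_frac)
        nr_uncertain = nr_samples - nr_random

        # Distance of the fraud probability from 0.5; small = uncertain
        proba = model.predict_proba(unlabeled_pool)[:, 1]
        distance = pd.Series(np.abs(proba - 0.5), index=unlabeled_pool.index)
        uncertain_idx = distance.nsmallest(nr_uncertain).index

        # Fill the rest of the batch with random points not already chosen
        remaining = unlabeled_pool.index.difference(uncertain_idx)
        rng = np.random.default_rng(random_state)
        random_idx = rng.choice(remaining, size=nr_random, replace=False)

        return uncertain_idx.append(pd.Index(random_idx))

# Toy usage on synthetic data (not the workshop data)
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
y = (X.a + X.b > 0).astype(int)
model = LogisticRegression().fit(X.iloc[:50], y.iloc[:50])
pool = X.iloc[50:]
picked = MixedSampling.select_batch(pool, nr_samples=10, model=model)
print(pool.loc[picked].shape)
```

The random part is there because pure uncertainty sampling can keep querying the same region of feature space; a few random picks keep the training set from becoming too one-sided.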

## Update model using sampling strategy

### Preparation

**Assignment:** Create a voting committee of models to be used in the `CommitteeDisagreementSampling` method. This should be a list of at least two sklearn classifiers, and can be any combination. Try out different models and different model parameters!

In [None]:
### Begin your code here

committee: List[BaseEstimator] = ...

### End your code here
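
If you are unsure what such a committee could look like, the sketch below builds one from three different model families and counts how often they disagree on synthetic data. The name `example_committee`, the chosen models and their parameters are illustrative assumptions, not a recommended answer.

```python
from typing import List

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# A committee is just a list of (unfitted) sklearn classifiers
example_committee: List[BaseEstimator] = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

# Synthetic toy data, only to show the members voting
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
for member in example_committee:
    member.fit(X_toy[:200], y_toy[:200])

# One row of 0/1 votes per committee member
votes = np.array([member.predict(X_toy[200:]) for member in example_committee])
n_split = int((votes.min(axis=0) != votes.max(axis=0)).sum())
print(f'{n_split} of {votes.shape[1]} toy samples split the committee')
```

Members that are too similar will rarely disagree, so mixing model families or hyperparameters tends to give the sampling strategy more to work with.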

In the next cells you will go through each week, selecting 50 samples following a sampling strategy.
This can be done using two different modes:

- *Interactive mode*: you can re-run the **Magic cell** below multiple times and select a different sampling strategy for each week. 
- *Automatic mode*: simulate a number of weeks of active learning using the same sampling strategy for each week.

**Assignment:** Select interactive or automatic mode. We recommend starting with interactive, because then you have more control over the strategy used. If you want to try again using another strategy or mode, run all the cells from the top of this notebook. To be extra safe you can restart the runtime.

In [None]:
### Begin your code here

INTERACTIVE_MODE = True  # You choose: Interactive or automatic

### End your code here

# With the following lists we keep track of the model's performance over the weeks
ap_per_week = []
precision_recall_per_week = []
labels_per_week = []

week_nr = 3  # Week from which to start simulating the active learning loop

if not INTERACTIVE_MODE:  # Automatic mode
    weeks_to_simulate = 30
    assert weeks_to_simulate <= 50, "We have one year of data and start at week 3"

### Magic cell 🪄

The cell below runs the active learning magic. In interactive mode, run the cell multiple times to simulate successive iterations of active learning.

In [None]:
### Begin your code here

strategy = ...  # Set the sampling strategy you want to use for the next iteration(s)

### End your code here


# for each week in automatic mode, or one week in interactive mode:
for week_nr in [week_nr] if INTERACTIVE_MODE else tqdm(range(week_nr, week_nr + weeks_to_simulate)):
    if week_nr > 52:
        print('No more data to simulate active learning.')
        break
        
    # The pool of unlabeled data consists of all data of this one week
    unlabeled_pool = data_weeks_3_52[data_weeks_3_52.week_no == week_nr]
    
    # Fit the classifier to the currently available training data
    clf.fit(training_data.loc[:, training_data.columns!='target'], training_data.target)
    
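    # If the strategy is a Committee Disagreement Sampling strategy, we also need to fit the committee models
    # (the check works whether `strategy` is the class itself or an instance of it)
    if strategy is CommitteeDisagreementSampling or isinstance(strategy, CommitteeDisagreementSampling):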
        for model in committee:
            model.fit(training_data.loc[:, training_data.columns!='target'], training_data.target)

    # Select unlabeled data points for which we want analysts to provide the ground truth label
    selected_indices = strategy.select_batch(unlabeled_pool, nr_samples=50, model=clf, committee=committee)
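    # Copy the selection so that adding the label column does not trigger a SettingWithCopyWarning
    queried_data = unlabeled_pool.loc[selected_indices].copy()
    queried_data['target'] = oracle.loc[selected_indices]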

    # Add the now labeled data to our training dataset
    training_data = pd.concat([training_data, queried_data])

    # Evaluate the new model on the holdout set! (cheat mode)
    clf.fit(training_data.loc[:, training_data.columns!='target'], training_data.target)
    # The model's hard class predictions on the holdout set
    y_pred = clf.predict(holdout_dataset.loc[:, holdout_dataset.columns != 'target'])
    # The model's probability scores of a transaction being fraudulent
    y_proba = clf.predict_proba(holdout_dataset.loc[:, holdout_dataset.columns != 'target'])[:, 1]
    
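    # Sort by label so the counts keep a fixed order (assuming 0 = legitimate, 1 = fraudulent);
    # value_counts() alone orders by frequency, which can flip from week to week
    labels_per_week.append(training_data.target.value_counts().sort_index().values.tolist())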
    ap_per_week.append(average_precision_score(holdout_dataset.target, y_proba))
    precision_recall_per_week.append([precision_score(holdout_dataset.target, y_pred), recall_score(holdout_dataset.target, y_pred)])
    
    if INTERACTIVE_MODE:
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
        fig.suptitle(f'week_nr {week_nr} \n Re-run this cell for the next week', fontsize=16)
        ax2.set_title('Precision - recall curve')
        training_data.target.value_counts().plot(kind='bar', title='Training data labels', ax=ax1)
        ax1.grid(axis='y')
        PrecisionRecallDisplay.from_predictions(holdout_dataset.target, y_proba, ax=ax2);
        week_nr += 1
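
The loop above can also be replayed in miniature on synthetic data, which is handy for getting a feel for how strategies compare before spending your weekly labels. This is a simplified sketch under several assumptions: it uses one fixed pool instead of a new pool every week, a logistic regression instead of `clf`, and made-up helper names (`run_active_learning`, `pick_uniform`, `pick_uncertain`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

def run_active_learning(pick, n_rounds=5, batch_size=20, seed=0):
    """Simulate pool-based active learning; return holdout AP after each round."""
    X, y = make_classification(n_samples=3000, n_features=10,
                               weights=[0.9, 0.1], random_state=seed)
    X_pool, y_pool = X[:2000], y[:2000]
    X_hold, y_hold = X[2000:], y[2000:]
    rng = np.random.default_rng(seed)

    # Balanced seed set: 10 labeled points per class
    labeled = np.concatenate([np.flatnonzero(y_pool == c)[:10] for c in (0, 1)])
    ap_per_round = []
    for _ in range(n_rounds):
        model = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        proba = model.predict_proba(X_pool[unlabeled])[:, 1]
        chosen = unlabeled[pick(proba, batch_size, rng)]
        labeled = np.concatenate([labeled, chosen])  # "query the oracle"
        model.fit(X_pool[labeled], y_pool[labeled])
        ap = average_precision_score(y_hold, model.predict_proba(X_hold)[:, 1])
        ap_per_round.append(ap)
    return ap_per_round

def pick_uniform(proba, k, rng):
    # k random positions in the unlabeled pool
    return rng.choice(len(proba), size=k, replace=False)

def pick_uncertain(proba, k, rng):
    # k positions whose fraud probability is closest to 0.5
    return np.argsort(np.abs(proba - 0.5))[:k]

results = {name: run_active_learning(pick)
           for name, pick in [('uniform', pick_uniform), ('uncertainty', pick_uncertain)]}
for name, ap in results.items():
    print(name, np.round(ap, 3))
```

Which strategy comes out ahead depends on the data, the model and the random seed, so treat a single run as an anecdote rather than a verdict.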

## Evaluate model improvement

Now let's look at a summary of the active learning experiment. Did our model improve?

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(18, 14))
fig.suptitle('Active learning summary', fontsize=16)
plt.setp([ax1, ax2, ax3], xlabel='AL iteration')

# Samples per iteration
ax1.plot(range(1, len(labels_per_week) + 1), labels_per_week); ax1.grid(axis='y'); ax1.legend(['Number legitimate', 'Number fraudulent']); ax1.set_title('Training samples per iteration');
ax2.plot(range(1, len(ap_per_week) + 1), ap_per_week); ax2.grid(axis='y'); ax2.set_title('Average Precision per iteration');
ax3.plot(range(1, len(precision_recall_per_week) + 1), precision_recall_per_week); ax3.grid(axis='y'); ax3.legend(['Precision', 'Recall']); ax3.set_title('Precision and recall per iteration');
PrecisionRecallDisplay.from_predictions(holdout_dataset.target, y_proba, ax=ax4); ax4.set_title('Final precision / recall curve');
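
When reading these plots, keep in mind that precision and recall are computed from hard predictions at one threshold, whereas Average Precision summarises the whole precision-recall curve from the probability scores. The toy example below illustrates the difference; the labels and scores are made up, and the 0.5 threshold is just an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score, recall_score

# Hypothetical ground truth and fraud scores for eight transactions
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
y_proba = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.6, 0.7, 0.05])

# Precision and recall need hard predictions, i.e. a threshold
y_pred = (y_proba >= 0.5).astype(int)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Average Precision only looks at how well the scores rank frauds first
ap = average_precision_score(y_true, y_proba)

print('precision:', precision)
print('recall:   ', recall)
print('AP:       ', ap)
```

Because Average Precision does not depend on a chosen threshold, it is the more convenient single number for comparing strategies across weeks.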

## Next steps

**Assignment:** Now that you've tested a sampling strategy, it's time to compare different strategies. Keep the following questions in mind: Which of the sampling strategies worked best overall? How do different classifiers perform before and after active learning? How might we further improve the model?

**Assignment:** Submit your best performing strategy and the achieved Average Precision through Menti.com (code 3393 6819).