## Quickstart with the active learning module

[Active learning](https://en.wikipedia.org/wiki/Active_learning_(machine_learning)) is a method that aims to collect and label new data in order to improve a machine learning model. Based on a predefined heuristic strategy, a certain number of data points are annotated and used to re-train the model.
In pyrelational, we use:
- an active learning ``strategy`` that queries new data points based on a specific selection criterion;
- a ``model manager`` that takes as input an uninstantiated ML model and a set of training arguments (e.g. the number of epochs);
- a ``data manager`` that updates, after each query, the pools of labelled and unlabelled data;
- an optional ``oracle`` that sits at the interface between the data manager and the strategy and gives the user access to the queried data so that it can be annotated manually with external tools;
- a ``pipeline`` that coordinates the strategy, the model manager and the data manager.
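
The interplay of these components can be sketched without pyrelational at all. The toy loop below is a minimal illustration in plain Python, not the library's implementation: `fit_threshold`, `uncertainty` and `oracle` are hypothetical stand-ins for the model manager, the strategy and the oracle, and the data is a synthetic one-dimensional problem whose true label is `x >= 0.5`.

```python
import random

random.seed(0)

# Synthetic pool: 200 points in [0, 1]; the hidden label is 1 when x >= 0.5.
pool = [random.random() for _ in range(200)]


def oracle(x):
    """Hypothetical oracle: returns the true label of a queried point."""
    return int(x >= 0.5)


def fit_threshold(labelled):
    """Toy 'model manager': put the decision threshold halfway between
    the largest labelled 0 and the smallest labelled 1."""
    zeros = [x for x, y in labelled if y == 0]
    ones = [x for x, y in labelled if y == 1]
    lo = max(zeros) if zeros else 0.0
    hi = min(ones) if ones else 1.0
    return (lo + hi) / 2


def uncertainty(x, threshold):
    """Toy 'strategy' score: points closest to the threshold score highest."""
    return -abs(x - threshold)


# Toy 'data manager' state: 4 labelled points, the rest unlabelled.
labelled = [(x, oracle(x)) for x in pool[:4]]
unlabelled = pool[4:]

for iteration in range(5):
    threshold = fit_threshold(labelled)  # re-train the model
    # rank the unlabelled pool, most uncertain first
    unlabelled.sort(key=lambda x: uncertainty(x, threshold), reverse=True)
    queried, unlabelled = unlabelled[:3], unlabelled[3:]  # query 3 points
    labelled += [(x, oracle(x)) for x in queried]  # annotate, update the pools
    print(f"iteration {iteration}: threshold={threshold:.3f}, labelled={len(labelled)}")

final_threshold = fit_threshold(labelled)
```

In pyrelational the `pipeline` plays the role of the `for` loop, so that the strategy, the model manager and the data manager can each be swapped independently.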

## The dataset

We will use the [Fashion-MNIST dataset](https://github.com/zalandoresearch/fashion-mnist), loaded through torchvision, where each datapoint is a 28x28 grayscale image of a clothing item that we aim to classify into one of 10 classes with a neural network.

We define a data manager that updates the pool of labelled data used to train the model. The validation set and the test set are fixed and always remain unchanged. In the following example, the 60,000 images are split into a training pool of 50,000 images and validation and test sets of 5,000 images each. The unlabelled images in the training pool are the ones we aim to query, based on a specific strategy, and annotate to improve the model's performance.

In [None]:
import torch
from torchvision import datasets, transforms
from pyrelational.data_managers import DataManager

# creating the dataset with pytorch
dataset = datasets.FashionMNIST(root="data", train=True, download=True, transform=transforms.ToTensor())
train_ds, val_ds, test_ds = torch.utils.data.random_split(dataset, [50000, 5000, 5000])
train_indices = train_ds.indices
val_indices = val_ds.indices
test_indices = test_ds.indices

# creating the data manager
data_manager = DataManager(
    dataset=dataset,
    train_indices=train_indices,
    validation_indices=val_indices,
    test_indices=test_indices,
    loader_batch_size=1000,
    label_attr="targets",
)
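
The bookkeeping behind this can be pictured as two index sets inside the training pool, with validation and test indices kept apart. The class below is a hypothetical, simplified stand-in written for illustration only; `IndexPool` and its `label` method are assumptions and do not reflect how `DataManager` is implemented.

```python
class IndexPool:
    """Hypothetical stand-in for the labelled/unlabelled bookkeeping."""

    def __init__(self, train_indices, labelled_indices):
        self.labelled = set(labelled_indices)
        # everything in the training pool that is not labelled yet
        self.unlabelled = set(train_indices) - self.labelled

    def label(self, indices):
        """Move queried indices from the unlabelled to the labelled pool."""
        indices = set(indices)
        unknown = indices - self.unlabelled
        if unknown:
            raise ValueError(f"not in the unlabelled pool: {sorted(unknown)}")
        self.unlabelled -= indices
        self.labelled |= indices


pool = IndexPool(train_indices=range(10), labelled_indices=[0, 1])
pool.label([4, 7])
print(sorted(pool.labelled))  # [0, 1, 4, 7]
print(len(pool.unlabelled))   # 6
```

Rejecting indices that are not in the unlabelled pool guards against labelling the same point twice or leaking validation and test points into training.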

## The model manager

The model manager here is built with the [PyTorch Lightning module](https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html).
It needs three inputs:
- a `model class` that contains the core structure of the model and can expose hyperparameters such as the number of layers or the dropout rate;
- a `model configuration` dictionary that contains the values for these hyperparameters. It can be empty if no hyperparameters are defined;
- a `trainer configuration` dictionary with all the parameters needed for training, written in a PyTorch Lightning fashion. The default dictionary can be inspected in the `models/lightning_model.py` file.

In [None]:
import torch.nn as nn
import torch.nn.functional as F
from lightning.pytorch import LightningModule
from sklearn.metrics import accuracy_score


class MnistClassification(LightningModule):
    """Custom module for a simple fully connected classifier"""

    def __init__(self, dropout=0):
        super(MnistClassification, self).__init__()

        # Fashion-MNIST images are (1, 28, 28): (channels, height, width)
        self.layer_1 = nn.Linear(28 * 28, 128)
        self.layer_2 = nn.Linear(128, 256)
        self.layer_3 = nn.Linear(256, 10)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        batch_size, channels, height, width = x.size()

        # (b, 1, 28, 28) -> (b, 1*28*28)
        x = x.view(batch_size, -1)
        x = self.dropout(self.layer_1(x))
        x = F.relu(x)
        x = self.dropout(self.layer_2(x))
        x = F.relu(x)
        x = self.dropout(self.layer_3(x))

        x = F.log_softmax(x, dim=1)
        return x

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        return loss

    def test_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log("test_loss", loss)

        # compute accuracy (tensors are moved to the CPU so that sklearn can consume them)
        y_pred = logits.argmax(dim=1)
        accuracy = accuracy_score(y.cpu(), y_pred.cpu())
        self.log("test_accuracy", accuracy)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

In [None]:
from pyrelational.model_managers import LightningMCDropoutModelManager


model_manager = LightningMCDropoutModelManager(
    model_class=MnistClassification,
    model_config={"dropout": 0.2},
    trainer_config={"epochs": 4},
)

## The query strategy and the active learning loop

Using more labelled data for training generally improves model performance. Yet labelling data can be time-consuming, and some data points are more influential than others. The idea is to query the most informative data points for annotation. How informativeness is measured depends on the strategy used. In this example, we consider four different strategies designed for a classification task:

- `least confidence strategy` queries the samples whose predictions are the most uncertain;
- `marginal confidence strategy` computes the difference between the top and second top predictions: the lower this difference, the higher the score;
- `ratio confidence strategy` is similar to the marginal confidence strategy, except that the score is computed as a ratio between the top and second top predictions;
- `entropy classification strategy` scores each sample by the Shannon entropy of its predictions.
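
To make the four scores concrete, here is a minimal sketch in plain Python using the textbook definitions. It is an illustration only: the function names are hypothetical, and pyrelational's own implementations may normalise or scale the scores differently. In each case a higher score means a more informative sample.

```python
import math


def least_confidence(probs):
    """1 - p_top: high when the top class has a low probability."""
    return 1.0 - max(probs)


def margin_confidence(probs):
    """1 - (p_top - p_second): high when the two best classes are close."""
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)


def ratio_confidence(probs):
    """p_second / p_top: approaches 1 when the two best classes are close."""
    top, second = sorted(probs, reverse=True)[:2]
    return second / top


def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Two predicted class distributions over three classes.
confident = [0.90, 0.05, 0.05]
uncertain = [0.40, 0.35, 0.25]

for score in (least_confidence, margin_confidence, ratio_confidence, entropy):
    print(f"{score.__name__}: confident={score(confident):.3f}, uncertain={score(uncertain):.3f}")
```

A strategy then ranks the unlabelled pool by one of these scores and queries the highest-scoring samples.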

In [None]:
%%capture output

from pyrelational.pipeline import Pipeline
from pyrelational.strategies.classification import (
    LeastConfidenceStrategy,
    MarginalConfidenceStrategy,
    RatioConfidenceStrategy,
    EntropyClassificationStrategy,
)

query = dict()
strategies = [
    LeastConfidenceStrategy,
    MarginalConfidenceStrategy,
    RatioConfidenceStrategy,
    EntropyClassificationStrategy,
]

for strategy in strategies:
    # the data manager is reinitialized for each strategy
    data_manager = DataManager(
        dataset=dataset,
        train_indices=train_indices,
        validation_indices=val_indices,
        test_indices=test_indices,
        loader_batch_size=10000,
        label_attr="targets",
    )
    pipeline = Pipeline(data_manager=data_manager, model_manager=model_manager, strategy=strategy())

    # annotate 10000 points per iteration until there is no unlabelled training data left;
    # the training pool consists of 50000 points, so this takes at most 5 iterations
    pipeline.run(num_annotate=10000)
    query[strategy.__name__] = pipeline


We can look at a specific strategy: the test accuracy is expected to increase as more data is labelled, although not necessarily at every iteration. More metrics can be stored in the pipeline, as long as they are logged in the model class.

In [None]:
# print performance after each iteration for one strategy
query['MarginalConfidenceStrategy'].summary()

In [None]:
import matplotlib.pyplot as plt

# test accuracy per iteration, one curve per strategy
for strategy in strategies:
    df = query[strategy.__name__].summary()
    plt.plot(df['test_accuracy'], label=strategy.__name__)
plt.legend()
plt.xlabel('Number of iterations')
plt.title('Accuracy for different strategies')
plt.show()

# test loss per iteration, one curve per strategy
for strategy in strategies:
    df = query[strategy.__name__].summary()
    plt.plot(df['test_loss'], label=strategy.__name__)
plt.legend()
plt.xlabel('Number of iterations')
plt.title('Loss for different strategies')
plt.show()