# Dark Experience Replay

DISCLAIMER: all required algorithms have been implemented from scratch in this project, except for the backbone models, which use PyTorch's `torch.nn` and `torch.optim` libraries, and the loss functions.

In [1]:
import warnings
import os
import sys

%load_ext autoreload
%autoreload 2

warnings.filterwarnings('ignore')
current_dir = %pwd

parent_dir = os.path.abspath(os.path.join(current_dir, '../'))
sys.path.append(parent_dir)

import main

## Introduction

In this notebook we show some running examples of the `Dark Experience Replay` method ([original paper](https://arxiv.org/pdf/2004.07211)). The documentation of everything we implemented can be found in the respective files, functions and methods.

## Implementation Choices
- Every dataset has an associated model, together with its hyperparameters. For example, `SequentialMNIST` has a `SingleHeadMLP` model attached to it, with a predefined number of neurons per layer. It also has a fixed number of epochs (just one, given the simplicity of the dataset) and a fixed batch size.
- DER has a fixed replay batch size and a fixed SGD optimizer with a configurable learning rate. The replay buffer is implemented with reservoir sampling and simply stores all the variables needed to perform the forward pass, the loss calculation and the optimization step.
- In CIL, we store all output neurons in the replay buffer; they are then masked according to the task id when the metrics are computed.
- In TIL, we store all output neurons in the replay buffer, as in CIL. We could have adopted a multi-head architecture (implemented in `src/models.py`), but we decided to stick with the single-head architecture and mask every output neuron that is not required (based on the input task id).
- In DIL, we use all neurons.
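
To make the reservoir idea above concrete, here is a minimal sketch of a fixed-capacity buffer filled with reservoir sampling. It is not the project's implementation: the class name `ReservoirBuffer`, its attributes and the tuple layout are hypothetical, and plain Python is used so the cell runs without dependencies.

```python
import random


class ReservoirBuffer:
    """Hypothetical fixed-capacity buffer filled with reservoir sampling."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []      # stored examples, e.g. (input, logits, label)
        self.num_seen = 0    # total number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        if len(self.items) < self.capacity:
            # The buffer is not full yet: always store the example.
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / (num_seen + 1),
            # overwriting a uniformly chosen slot.
            j = self.rng.randint(0, self.num_seen)  # both ends inclusive
            if j < self.capacity:
                self.items[j] = example
        self.num_seen += 1

    def sample(self, batch_size):
        # Draw a replay batch without replacement.
        k = min(batch_size, len(self.items))
        return self.rng.sample(self.items, k)


# Toy usage: placeholder inputs, 10 logits per example, buffer size 500.
buffer = ReservoirBuffer(capacity=500, seed=0)
for i in range(10_000):
    buffer.add((f"x_{i}", [0.0] * 10, i % 10))
replay_batch = buffer.sample(32)
```

With this scheme every example of the stream has the same probability of being in the buffer at any time, without knowing the stream length in advance.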

## Metrics
In all scenarios we compute an accuracy matrix $R \in [0,1]^{T \times T}$, where $R_{ij}$ is the accuracy of the model at time $i$ on task $j$ (the cell outputs below print it as percentages), and from it we derive all the relevant metrics:
- Average accuracy over all tasks: mean of the lower triangular part of the matrix
- Last model accuracy: bottom right element of the matrix
- Full stream accuracy: mean of the last row of the matrix
- Forgetting
- Backward Transfer
- Forward Transfer
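
The list above can be sketched in a few lines of plain Python. The first three metrics follow the definitions given in this notebook; the formulas for forgetting, backward transfer and forward transfer are the ones commonly used in the continual learning literature and are an assumption here, since they may differ from what `src/metric` computes. The function name `cl_metrics` and the `baseline` argument (accuracy of an untrained model on each task, taken as 0 by default) are hypothetical.

```python
def cl_metrics(R, baseline=None):
    """Summarise an accuracy matrix R (list of T rows, T >= 2)."""
    T = len(R)
    last_row = R[T - 1]
    baseline = baseline or [0.0] * T

    def mean(xs):
        return sum(xs) / len(xs)

    # Entries on and below the diagonal: tasks already seen at each step.
    lower = [R[i][j] for i in range(T) for j in range(i + 1)]
    # Assumed definition: best past accuracy on task j minus its final accuracy.
    forgetting = [max(R[i][j] for i in range(j, T - 1)) - last_row[j]
                  for j in range(T - 1)]
    # Assumed definition: final accuracy minus accuracy right after learning j.
    bwt = [last_row[j] - R[j][j] for j in range(T - 1)]
    # Assumed definition: accuracy on task j just before learning it, minus baseline.
    fwt = [R[j - 1][j] - baseline[j] for j in range(1, T)]

    return {
        "average_accuracy": mean(lower),
        "last_model_accuracy": R[T - 1][T - 1],
        "full_stream_accuracy": mean(last_row),
        "forgetting": mean(forgetting),
        "backward_transfer": mean(bwt),
        "forward_transfer": mean(fwt),
    }


# CIL matrix of the DER run on SequentialMNIST, rounded to two decimals.
R_cil = [
    [99.86, 0.00, 0.00, 0.00, 0.00],
    [99.81, 93.39, 0.00, 0.00, 0.00],
    [99.76, 85.99, 95.94, 0.00, 0.00],
    [99.67, 85.60, 73.85, 96.37, 0.00],
    [99.62, 87.81, 69.37, 87.87, 93.75],
]
metrics = cl_metrics(R_cil)
```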

In DIL, we are more interested in forward and backward transfer than in the other metrics, since all output neurons are always available and the data undergoes a change of distribution (as in `PermutedMNIST` and `RotatedMNIST`). In CIL and TIL, instead, transfer is not a good metric, and we rather focus on accuracies and forgetting.

## Hyperparameters

In this notebook all hyperparameters, such as learning rate, batch size and number of neurons per layer, are fixed, following the suggestions of the original paper. They are validated in the `notebooks/validation.ipynb` notebook (in the same folder). The buffer size for ALL methods defaults to 500, and it is tuned in that other notebook.

## Outline

In this notebook we are going to show the following examples:
1. DER and DER++ on `SequentialMNIST` (CIL and TIL)
2. DER and DER++ on `PermutedMNIST` (DIL)
3. DER and DER++ on `RotatedMNIST` (DIL)

We also provide examples on `CIFAR10`, which are in the notebook `notebooks/showcase_cifar10.ipynb`.

# Sequential MNIST - Task and Class Incremental Learning

Sequential MNIST consists of 5 tasks covering the classes 0-1, 2-3, 4-5, 6-7 and 8-9, respectively.
Here we train an MLP with one hidden layer using DER, and we show metrics for both the TIL and CIL scenarios.


## DER

In the cell output we can see the results of the metrics for both TIL and CIL scenarios:
- CIL: the upper part of the accuracy matrix is filled with zeros, which is reasonable since the model does not activate the neurons of unseen classes, whilst it remembers the old tasks (lower part);
- TIL: the upper part of the matrix suggests that we are randomly predicting one of the two possible classes, which is expected since we haven't seen the data yet (remember that the backbone is not a multi-head architecture).

Forgetting and BWT are better in the TIL scenario, which is expected due to the simplicity of this task with respect to CIL.
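
As an illustration of why TIL is the easier setting, the sketch below restricts the prediction to the two output neurons of the given task, in the spirit of the single-head masking described earlier. The function names, the `classes_per_task` argument and the logit values are made up for the example; the actual masking code in the project may look different.

```python
def til_predict(logits, task_id, classes_per_task=2):
    """Predict only among the output neurons that belong to `task_id`."""
    start = task_id * classes_per_task
    allowed = range(start, start + classes_per_task)
    return max(allowed, key=lambda c: logits[c])


def unmasked_predict(logits):
    """Predict among all output neurons (no task id available)."""
    return max(range(len(logits)), key=lambda c: logits[c])


# Made-up logits of a single-head model with 10 output neurons.
logits = [2.1, 0.3, -1.0, 0.5, 0.1, 0.0, -0.4, 1.7, 0.2, 0.9]

pred_all = unmasked_predict(logits)          # competes over all 10 classes
pred_task3 = til_predict(logits, task_id=3)  # competes over classes 6 and 7 only
pred_task1 = til_predict(logits, task_id=1)  # competes over classes 2 and 3 only
```

With the mask, a wrong but confident neuron from another task (class 0 here) can no longer win, and an untrained task reduces to a choice between two classes.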

In [2]:
main.run_experiment(
    DATASET='SequentialMNIST',
    alpha=1.0,
    lr=0.03
)

Experience (0) - Training Samples: 11399
Experience (1) - Training Samples: 10881
Experience (2) - Training Samples: 10137
Experience (3) - Training Samples: 10965
Experience (4) - Training Samples: 10620
Epoch 1/1 - Loss: 1.027967095375061
 ===  Accuracies - CIL === 

[[99.85815603  0.          0.          0.          0.        ]
 [99.8108747  93.38883448  0.          0.          0.        ]
 [99.76359338 85.99412341 95.94450374  0.          0.        ]
 [99.66903073 85.60235064 73.85272145 96.37462236  0.        ]
 [99.62174941 87.80607248 69.37033084 87.86505539 93.74684821]]

 ===  Accuracies - TIL === 

[[99.85815603 46.71890304 49.14621131 46.57603223 46.19263742]
 [99.8108747  94.31929481 60.24546425 49.64753273 48.91578417]
 [99.76359338 95.00489716 97.7054429  48.74118832 71.91124559]
 [99.66903073 94.07443683 97.38527215 99.34541793 66.96923853]
 [99.62174941 94.31929481 96.85165422 99.29506546 96.72213817]]

=== Task-IL (TIL) vs Class-IL (CIL) Metrics ===

Accuracy - Last Mo

<src.metric.Metric at 0x129c22ed0>

## DER++

Here we evaluate the same scenario as before, but with DER++. Looking at the final accuracies, we can see that the model focuses more on old tasks: forgetting decreases in the CIL scenario, which is the harder of the two.
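
To show how `alpha` and `beta` enter the objective, here is a per-example sketch of the DER and DER++ losses as described in the original paper: cross-entropy on the current example, plus `alpha` times a distance between the current and the stored logits of a replayed example, plus (DER++ only) `beta` times the cross-entropy on a second replayed example. The helper names are hypothetical, the logit distance is taken as a mean squared error (an assumption), and plain Python replaces the PyTorch losses used in the project.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mse(a, b):
    """Mean squared error between two logit vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def der_loss(cur_logits, cur_label, replay_logits, stored_logits, alpha,
             beta=0.0, replay2_logits=None, replay2_label=None):
    # Standard classification loss on the current example.
    loss = cross_entropy(cur_logits, cur_label)
    # DER term: match the logits stored in the buffer.
    loss += alpha * mse(replay_logits, stored_logits)
    # DER++ term: also fit the label of a second replayed example.
    if beta > 0.0:
        loss += beta * cross_entropy(replay2_logits, replay2_label)
    return loss


# Made-up logits for a 3-class toy problem.
cur_logits, cur_label = [2.0, 0.5, -1.0], 0
replay_now, replay_stored = [0.2, 1.1, 0.0], [0.0, 1.5, 0.1]
replay2_now, replay2_label = [0.3, 0.9, -0.2], 1

loss_der = der_loss(cur_logits, cur_label, replay_now, replay_stored, alpha=1.0)
loss_derpp = der_loss(cur_logits, cur_label, replay_now, replay_stored,
                      alpha=1.0, beta=0.5,
                      replay2_logits=replay2_now, replay2_label=replay2_label)
```

Setting `beta=0` recovers plain DER, which matches how the two experiments in this notebook differ only by the `beta` argument.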

In [3]:
main.run_experiment(
    DATASET='SequentialMNIST',
    lr=0.03,
    alpha=1.0,
    beta=0.5
)

Experience (0) - Training Samples: 11399
Experience (1) - Training Samples: 10881
Experience (2) - Training Samples: 10137
Experience (3) - Training Samples: 10965
Experience (4) - Training Samples: 10620
Epoch 1/1 - Loss: 1.1340641975402832
 ===  Accuracies - CIL === 

[[99.90543735  0.          0.          0.          0.        ]
 [99.71631206 90.40156709  0.          0.          0.        ]
 [99.71631206 89.37316357 94.87726788  0.          0.        ]
 [99.66903073 90.79333986 81.43009605 95.77039275  0.        ]
 [99.66903073 92.55631734 84.20490928 91.18831823 93.44427635]]

 ===  Accuracies - TIL === 

[[99.90543735 49.46131244 51.97438634 48.23766365 51.68935956]
 [99.71631206 91.77277179 55.54962647 61.83282981 48.10892587]
 [99.71631206 93.14397649 98.07897545 52.9204431  28.89561271]
 [99.66903073 94.22135162 98.29242263 99.39577039 25.97075139]
 [99.66903073 94.95592556 98.23906083 99.29506546 96.92385275]]

=== Task-IL (TIL) vs Class-IL (CIL) Metrics ===

Accuracy - Last M

<src.metric.Metric at 0x129e09dd0>

# Permuted MNIST - Domain Incremental Learning

This benchmark consists of 20 tasks, each of which applies a random permutation to the pixels of the images. Each task therefore contains all 10 classes.
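
A minimal sketch of how such a stream can be built is shown below. It assumes that every task, including the first one, draws its own permutation and applies it to all flattened 28x28 images of that task; the function names are hypothetical and the dataset class in this project may differ.

```python
import random


def make_permutations(num_tasks, num_pixels, seed=0):
    """Draw one random pixel permutation per task."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_tasks):
        perm = list(range(num_pixels))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def permute(flat_image, perm):
    """Reorder the pixels of a flattened image according to `perm`."""
    return [flat_image[p] for p in perm]


perms = make_permutations(num_tasks=20, num_pixels=28 * 28)

# Toy "image": pixel values equal to their index, so the shuffle is visible.
image = list(range(28 * 28))
image_task0 = permute(image, perms[0])
image_task1 = permute(image, perms[1])
```

The label of an image is unchanged by the permutation, which is why only the input distribution shifts from task to task.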

## DER

In this case the forward transfer is close to zero, as expected, since the permutations are totally random and unrelated from one task to the next. This means that the model cannot reuse the knowledge gained on previous tasks to infer something it has never seen.

In [4]:
main.run_experiment(
    DATASET='PermutedMNIST',
    lr=0.2,
    alpha=1.0,
)

Experience (0) - Training Samples: 54000
Experience (1) - Training Samples: 54000
Experience (2) - Training Samples: 54000
Experience (3) - Training Samples: 54000
Experience (4) - Training Samples: 54000
Experience (5) - Training Samples: 54000
Experience (6) - Training Samples: 54000
Experience (7) - Training Samples: 54000
Experience (8) - Training Samples: 54000
Experience (9) - Training Samples: 54000
Experience (10) - Training Samples: 54000
Experience (11) - Training Samples: 54000
Experience (12) - Training Samples: 54000
Experience (13) - Training Samples: 54000
Experience (14) - Training Samples: 54000
Experience (15) - Training Samples: 54000
Experience (16) - Training Samples: 54000
Experience (17) - Training Samples: 54000
Experience (18) - Training Samples: 54000
Experience (19) - Training Samples: 54000
Epoch 1/1 - Loss: 0.23940658569335938
 ===  Accuracies - DIL === 

[[92.1  11.26 10.04  9.63 12.07 11.65  9.18 10.16  5.75 11.95 10.91 12.44
  11.36 11.23 11.38  7.5  10.

<src.metric.Metric at 0x12bd37ad0>

## DER++

As in Sequential MNIST, DER++ improves the metrics that take the past into account.

In [5]:
main.run_experiment(
    DATASET='PermutedMNIST',
    lr=0.2,
    alpha=1.0,
    beta=0.5
)

Experience (0) - Training Samples: 54000
Experience (1) - Training Samples: 54000
Experience (2) - Training Samples: 54000
Experience (3) - Training Samples: 54000
Experience (4) - Training Samples: 54000
Experience (5) - Training Samples: 54000
Experience (6) - Training Samples: 54000
Experience (7) - Training Samples: 54000
Experience (8) - Training Samples: 54000
Experience (9) - Training Samples: 54000
Experience (10) - Training Samples: 54000
Experience (11) - Training Samples: 54000
Experience (12) - Training Samples: 54000
Experience (13) - Training Samples: 54000
Experience (14) - Training Samples: 54000
Experience (15) - Training Samples: 54000
Experience (16) - Training Samples: 54000
Experience (17) - Training Samples: 54000
Experience (18) - Training Samples: 54000
Experience (19) - Training Samples: 54000
Epoch 1/1 - Loss: 0.2792215645313263
 ===  Accuracies - DIL === 

[[92.88 13.01 12.55 13.64  8.54  6.82 10.96  8.07  7.62  9.41  8.05 11.86
  11.2  11.56  6.72 10.54  8.4

<src.metric.Metric at 0x13f807ad0>

# Rotated MNIST - Domain Incremental Learning

This benchmark consists of 20 tasks, each of which rotates the images by a fixed random angle. Each task therefore contains all 10 classes.

We expect the rotation not to be as impactful as the permutation, since it brings a gradual shift in the data distribution. The features are not completely scrambled as in the permuted case.

## DER

As a result we have a good forward transfer, which is expected since the model can still use the features learned in the past to predict unseen tasks. The backward transfer is also good, which is expected since the rotation is not as impactful as the permutation.

In [2]:
main.run_experiment(
    DATASET='RotatedMNIST',
    lr=0.2,
    alpha=1.0,
)

Experience (0) - Training Samples: 54000
Experience (1) - Training Samples: 54000
Experience (2) - Training Samples: 54000
Experience (3) - Training Samples: 54000
Experience (4) - Training Samples: 54000
Experience (5) - Training Samples: 54000
Experience (6) - Training Samples: 54000
Experience (7) - Training Samples: 54000
Experience (8) - Training Samples: 54000
Experience (9) - Training Samples: 54000
Experience (10) - Training Samples: 54000
Experience (11) - Training Samples: 54000
Experience (12) - Training Samples: 54000
Experience (13) - Training Samples: 54000
Experience (14) - Training Samples: 54000
Experience (15) - Training Samples: 54000
Experience (16) - Training Samples: 54000
Experience (17) - Training Samples: 54000
Experience (18) - Training Samples: 54000
Experience (19) - Training Samples: 54000
Epoch 1/1 - Loss: 0.279278963804245
 ===  Accuracies - DIL === 

[[90.95 65.5  22.89 20.77 17.03 28.81 84.04 17.22 90.45 83.93 64.14 76.51
  89.51 12.74 17.31 16.13 12.35

<src.metric.Metric at 0x133320590>

## DER++

This benchmark is less impacted by old predictions: DER++ performs similarly to DER.

In [3]:
main.run_experiment(
    DATASET='RotatedMNIST',
    lr=0.2,
    alpha=1.0,
    beta=0.5
)

Experience (0) - Training Samples: 54000
Experience (1) - Training Samples: 54000
Experience (2) - Training Samples: 54000
Experience (3) - Training Samples: 54000
Experience (4) - Training Samples: 54000
Experience (5) - Training Samples: 54000
Experience (6) - Training Samples: 54000
Experience (7) - Training Samples: 54000
Experience (8) - Training Samples: 54000
Experience (9) - Training Samples: 54000
Experience (10) - Training Samples: 54000
Experience (11) - Training Samples: 54000
Experience (12) - Training Samples: 54000
Experience (13) - Training Samples: 54000
Experience (14) - Training Samples: 54000
Experience (15) - Training Samples: 54000
Experience (16) - Training Samples: 54000
Experience (17) - Training Samples: 54000
Experience (18) - Training Samples: 54000
Experience (19) - Training Samples: 54000
Epoch 1/1 - Loss: 0.22809439897537231
 ===  Accuracies - DIL === 

[[92.63 84.24 84.88 84.8  13.51 12.25 33.92 20.16 24.23 82.53 17.68 20.16
  20.16 67.6  15.09 89.59 17.

<src.metric.Metric at 0x1357f1350>

## Conclusions

In this notebook we have shown the results of the `Dark Experience Replay` method on three different benchmarks. We have seen that the method is able to mitigate catastrophic forgetting and, in the case of Rotated MNIST, to achieve a good forward transfer. The DER++ variant improves the results on Sequential MNIST and Permuted MNIST by taking the past tasks into account, while on Rotated MNIST it performs similarly to DER. The results are consistent with the original paper and the method is able to achieve state-of-the-art results on the benchmarks used in this project.

Future work could explore more complex tasks and datasets, as well as different architectures.
