# Validation

In [8]:
import warnings
import os
import sys

%load_ext autoreload
%autoreload 2

warnings.filterwarnings('ignore')
current_dir = %pwd

parent_dir = os.path.abspath(os.path.join(current_dir, '../'))
sys.path.append(parent_dir)

from src.model_selection import continual_hyperparameter_selection
import utils


## Overview

In this part of the showcase we use the `Continual Hyperparameter Selection` framework [[M. De Lange et al. 2022](https://arxiv.org/pdf/1909.08383)] to validate the selected models and find their best hyperparameters on all MNIST datasets. We do not include the other datasets, since their backbone models are too expensive to train.

Each model has its own set of hyperparameters, specified in the corresponding yaml file in the `/hyperparameter` folder. Of these, only the learning rate and the buffer size are used to find the optimal **plasticity**, while the others are annealed with the _hyperparameter_drop_ constant when enforcing **stability**.
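
The two phases described above can be sketched as follows. This is a minimal illustration, not the code in `src.model_selection`: the `evaluate` callback, the function name `select_for_task`, the halving factor of 0.5 and the toy accuracy function are all assumptions made for the example.

```python
from typing import Callable, Dict, Sequence


def select_for_task(
    evaluate: Callable[[float, float], float],
    learning_rates: Sequence[float],
    alpha_init: float = 1.0,
    hyperparameter_drop: float = 0.5,  # assumed decay factor
    accuracy_drop: float = 0.03,
    min_alpha: float = 1e-4,
) -> Dict[str, float]:
    """Two-phase selection for a single task.

    `evaluate(lr, alpha)` is a hypothetical callback that trains on the
    current task and returns the validation accuracy in [0, 1].
    """
    # Phase 1 (plasticity): no regularization, search over the learning rate.
    best_lr = max(learning_rates, key=lambda lr: evaluate(lr, 0.0))
    reference_acc = evaluate(best_lr, 0.0)

    # Phase 2 (stability): start from the strongest regularization and
    # weaken it until the accuracy is within the allowed drop.
    alpha = alpha_init
    threshold = (1.0 - accuracy_drop) * reference_acc
    while alpha > min_alpha and evaluate(best_lr, alpha) < threshold:
        alpha *= hyperparameter_drop

    return {"best_lr": best_lr, "best_alpha": alpha, "reference_acc": reference_acc}


# Toy stand-in for training: accuracy peaks at lr = 0.05 and shrinks as alpha grows.
def toy_evaluate(lr: float, alpha: float) -> float:
    return 0.99 - abs(lr - 0.05) - 0.1 * alpha


result = select_for_task(toy_evaluate, [0.1, 0.05, 0.03, 0.01])
print(result)
```

In a real run `evaluate` would wrap a full training pass on the current task, so each step of the second phase costs one extra training run.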

For every dataset, the validation split is 10% of the training set; it keeps the original augmentation function and uses the same seed for the split.
Both $\alpha$ and $\beta$ are initially set to 1 ($\beta$ is only used by the DER++ model).
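
A seeded 10% hold-out split can be sketched with the standard library alone. The helper name `split_indices` and the seed value are hypothetical; 60,000 is the size of the full MNIST training set and is used here only as an example.

```python
import random
from typing import List, Tuple


def split_indices(
    n_samples: int, val_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, val_indices) for a reproducible hold-out split."""
    indices = list(range(n_samples))
    # A local RNG keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(60_000)
print(len(train_idx), len(val_idx))
```

Reusing the same seed gives every model the same validation samples, which keeps their validation accuracies comparable.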

The authors of the paper point out that most continual model selection methods in the literature _violate_ the constraint of not having access to old data. Respecting this constraint makes the algorithm more applicable to real-world scenarios, where model selection is performed continually using only the data currently available.

In our implementation the buffer size is specified in advance, since tuning it automatically would add further complexity:
- If model selection increases it, we can simply use the extra space to store more past samples;
- Otherwise, we would have to choose which examples to discard. We could discard samples so that the proportion of the seen classes stays uniform, or we could employ different heuristics.
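
To illustrate the second option, here is one possible way to shrink a buffer while keeping the seen classes balanced. This is not part of the showcase code: the function `shrink_buffer`, the `(sample, label)` buffer layout and the round-robin strategy are assumptions chosen for the sketch.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[object, int]  # (data, class label), an assumed buffer layout


def shrink_buffer(buffer: Sequence[Sample], new_size: int, seed: int = 0) -> List[Sample]:
    """Keep `new_size` samples, spreading them as evenly as possible over classes."""
    rng = random.Random(seed)
    by_class: Dict[int, List[Sample]] = defaultdict(list)
    for sample in buffer:
        by_class[sample[1]].append(sample)
    for samples in by_class.values():
        rng.shuffle(samples)  # which samples survive within a class is random

    kept: List[Sample] = []
    classes = sorted(by_class)
    # Round-robin over the classes: per-class counts differ by at most one
    # as long as every class still has samples left.
    while len(kept) < new_size and any(by_class[c] for c in classes):
        for c in classes:
            if len(kept) == new_size:
                break
            if by_class[c]:
                kept.append(by_class[c].pop())
    return kept


# Unbalanced toy buffer: 60 samples of class 0, 30 of class 1, 10 of class 2.
toy_buffer = [(f"x{i}", 0) for i in range(60)]
toy_buffer += [(f"y{i}", 1) for i in range(30)]
toy_buffer += [(f"z{i}", 2) for i in range(10)]

smaller = shrink_buffer(toy_buffer, new_size=30)
print(Counter(label for _, label in smaller))
```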


## Hyperparameters

For the validated models we look for the following plasticity hyperparameters.

In [2]:
print("SequentialMNIST Hyperparameters", utils.load_hparams('seq-mnist'))
print("PermutedMNIST Hyperparameters", utils.load_hparams('perm-mnist'))
print("RotatedMNIST Hyperparameters", utils.load_hparams('rot-mnist'))

SequentialMNIST Hyperparameters {'lr': [0.1, 0.05, 0.03, 0.01]}
PermutedMNIST Hyperparameters {'lr': [0.1, 0.05, 0.03, 0.01]}
RotatedMNIST Hyperparameters {'lr': [0.1, 0.05, 0.03, 0.01]}


## Sequential MNIST with DER

For this dataset we use as the metric, for each task, the average of the TIL and CIL accuracies, since it can be evaluated in both settings, and we allow a maximum drop of 3% with respect to the best accuracy.
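
As a small illustration of this metric, the sketch below averages the two accuracies and checks a candidate against the allowed drop. The function names and the example numbers are hypothetical, and the drop is assumed to be relative to the best accuracy.

```python
def combined_accuracy(til_acc: float, cil_acc: float) -> float:
    """Average of the task-incremental and class-incremental accuracy."""
    return (til_acc + cil_acc) / 2.0


def within_allowed_drop(candidate: float, best: float, accuracy_drop: float = 0.03) -> bool:
    """True if `candidate` loses at most `accuracy_drop` relative to `best`."""
    return candidate >= (1.0 - accuracy_drop) * best


best = combined_accuracy(til_acc=99.0, cil_acc=97.0)       # made-up best accuracy
candidate = combined_accuracy(til_acc=98.0, cil_acc=94.0)  # made-up regularized run
print(best, candidate, within_allowed_drop(candidate, best))
```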

As the output below shows, the model achieves good hold-out performance on the test sets after the continual selection process.
This run selects $\alpha=0.25$, meaning that the model can afford to be more plastic and needs less regularization.

In [3]:
continual_hyperparameter_selection('SequentialMNIST', accuracy_drop=0.03)

Task 0 - Best LR: 0.1 - Best Accuracy on Validation set: 99.76
Task 1 - Best LR: 0.05 - Best Accuracy on Validation set: 97.52
Task 2 - Best LR: 0.03 - Best Accuracy on Validation set: 99.20
Task 3 - Best LR: 0.1 - Best Accuracy on Validation set: 99.75
Task 4 - Best LR: 0.01 - Best Accuracy on Validation set: 98.31

 ===  Accuracies on test sets - CIL === 

[[100.           0.           0.           0.           0.        ]
 [ 99.71631206  96.22918707   0.           0.           0.        ]
 [ 99.90543735  93.97649363  97.49199573   0.           0.        ]
 [ 99.57446809  92.70323213  90.1814301   96.37462236   0.        ]
 [ 99.8108747   91.77277179  90.50160085  94.05840886  96.41956631]]

 ===  Accuracies on test sets - TIL === 

[[100.          52.20372184  48.07897545  51.76233635  49.11749874]
 [ 99.71631206  97.35553379  55.76307364  51.76233635  49.11749874]
 [ 99.90543735  96.62095984  99.62646745  51.40986908  46.94906707]
 [ 99.57446809  95.93535749  99.2529349   99.546827

{'best_lr': 0.01, 'best_alpha': 0.25, 'best_beta': None}

For both examples a small maximum accuracy drop (at most 5%) is reasonable, since Split MNIST is an easy task.

## Sequential MNIST with DER++

The setting is the same as for the standard DER model, but we also look for the best $\beta$ parameter. 

Here, we can see a slight improvement over DER, with a better hold-out performance on the test sets.

In [5]:
continual_hyperparameter_selection('SequentialMNIST', accuracy_drop=0.01, plus_plus=True)

Task 0 - Best LR: 0.03 - Best Accuracy on Validation set: 99.92
Task 1 - Best LR: 0.05 - Best Accuracy on Validation set: 97.85
Task 2 - Best LR: 0.01 - Best Accuracy on Validation set: 99.38
Task 3 - Best LR: 0.1 - Best Accuracy on Validation set: 100.00
Task 4 - Best LR: 0.01 - Best Accuracy on Validation set: 98.31

 ===  Accuracies on test sets - CIL === 

[[99.95271868  0.          0.          0.          0.        ]
 [99.76359338 97.35553379  0.          0.          0.        ]
 [99.8108747  96.03330069 98.98612593  0.          0.        ]
 [99.8108747  95.59255632 85.05869797 98.89224572  0.        ]
 [99.76359338 95.29872674 85.48559232 91.8429003  97.73071104]]

 ===  Accuracies on test sets - TIL === 

[[99.95271868 50.53868756 52.18783351 49.09365559 50.88250126]
 [99.76359338 97.94319295 52.50800427 25.93152064 49.72264246]
 [99.8108747  97.30656219 99.41302028 61.68177241 46.44478064]
 [99.8108747  96.52301665 98.71931697 99.6978852  32.87947554]
 [99.76359338 96.57198825 

{'best_lr': 0.01, 'best_alpha': 0.03125, 'best_beta': 0.03125}
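
The selected values can be read as a number of decay steps from the initial value of 1. The sketch below assumes that _hyperparameter_drop_ is a multiplicative factor of 0.5; that value is not stated in this notebook and is only inferred from the selected values all being powers of two.

```python
import math


def decay_steps(initial: float, selected: float, hyperparameter_drop: float = 0.5) -> int:
    """Number of multiplications by `hyperparameter_drop` that turn `initial` into `selected`."""
    return round(math.log(selected / initial) / math.log(hyperparameter_drop))


# Values of alpha selected on Sequential MNIST in the runs above.
for name, alpha in [("DER", 0.25), ("DER++", 0.03125)]:
    print(name, decay_steps(1.0, alpha))
```

Under this assumption, the tighter 1% margin used for DER++ forced more decay steps than the 3% margin used for DER.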

## Permuted MNIST with DER

In this case, if we accept a 15% drop in accuracy, the model needs stronger regularization, with $\alpha = 1.0$. The performance is comparable with the results already presented in the showcase with hard-coded hyperparameters, with a decent gain in accuracy.

In [6]:
continual_hyperparameter_selection('PermutedMNIST', accuracy_drop=0.15)

Task 0 - Best LR: 0.01 - Best Accuracy on Validation set: 93.05
Task 1 - Best LR: 0.01 - Best Accuracy on Validation set: 94.60
Task 2 - Best LR: 0.03 - Best Accuracy on Validation set: 94.78
Task 3 - Best LR: 0.01 - Best Accuracy on Validation set: 95.67
Task 4 - Best LR: 0.01 - Best Accuracy on Validation set: 95.83
Task 5 - Best LR: 0.03 - Best Accuracy on Validation set: 96.17
Task 6 - Best LR: 0.01 - Best Accuracy on Validation set: 96.07
Task 7 - Best LR: 0.01 - Best Accuracy on Validation set: 95.95
Task 8 - Best LR: 0.03 - Best Accuracy on Validation set: 96.25
Task 9 - Best LR: 0.01 - Best Accuracy on Validation set: 95.90
Task 10 - Best LR: 0.01 - Best Accuracy on Validation set: 96.03
Task 11 - Best LR: 0.01 - Best Accuracy on Validation set: 95.70
Task 12 - Best LR: 0.01 - Best Accuracy on Validation set: 96.17
Task 13 - Best LR: 0.01 - Best Accuracy on Validation set: 96.47
Task 14 - Best LR: 0.01 - Best Accuracy on Validation set: 96.37
Task 15 - Best LR: 0.01 - Best Accu

{'best_lr': 0.01, 'best_alpha': 1.0, 'best_beta': None}

## Rotated MNIST with DER

For this benchmark too, the model needs stronger regularization, with $\alpha = 1.0$, if we allow a maximum accuracy drop of 10%.

In [7]:
continual_hyperparameter_selection('RotatedMNIST', accuracy_drop=0.1)

Task 0 - Best LR: 0.03 - Best Accuracy on Validation set: 93.08
Task 1 - Best LR: 0.01 - Best Accuracy on Validation set: 95.07
Task 2 - Best LR: 0.01 - Best Accuracy on Validation set: 95.57
Task 3 - Best LR: 0.01 - Best Accuracy on Validation set: 95.63
Task 4 - Best LR: 0.01 - Best Accuracy on Validation set: 95.58
Task 5 - Best LR: 0.01 - Best Accuracy on Validation set: 96.55
Task 6 - Best LR: 0.03 - Best Accuracy on Validation set: 97.02
Task 7 - Best LR: 0.03 - Best Accuracy on Validation set: 97.22
Task 8 - Best LR: 0.01 - Best Accuracy on Validation set: 97.08
Task 9 - Best LR: 0.01 - Best Accuracy on Validation set: 96.65
Task 10 - Best LR: 0.01 - Best Accuracy on Validation set: 97.37
Task 11 - Best LR: 0.01 - Best Accuracy on Validation set: 96.63
Task 12 - Best LR: 0.01 - Best Accuracy on Validation set: 97.00
Task 13 - Best LR: 0.01 - Best Accuracy on Validation set: 97.23
Task 14 - Best LR: 0.01 - Best Accuracy on Validation set: 97.73
Task 15 - Best LR: 0.03 - Best Accu

{'best_lr': 0.03, 'best_alpha': 1.0, 'best_beta': None}

# Conclusion

We can see how continual hyperparameter selection is influenced by the choice of the allowed accuracy drop:
- A higher allowed drop in accuracy preserves the stability of the model, since the regularization stays strong; we should therefore expect less catastrophic forgetting and better backward transfer.
- A lower allowed drop in accuracy leads to a more plastic model, which fits each new task more closely, but with a higher risk of catastrophic forgetting.
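
Forgetting can be quantified directly from an accuracy matrix such as the ones printed above. The sketch below uses the CIL matrix of the Sequential MNIST run with DER and assumes that row `i` holds the test accuracies after training on task `i`. The backward transfer formula is the usual average of "final accuracy minus accuracy right after learning the task"; the function names are hypothetical.

```python
from typing import List

# CIL test accuracies of the Sequential MNIST run with DER (copied from the output above).
cil = [
    [100.0, 0.0, 0.0, 0.0, 0.0],
    [99.71631206, 96.22918707, 0.0, 0.0, 0.0],
    [99.90543735, 93.97649363, 97.49199573, 0.0, 0.0],
    [99.57446809, 92.70323213, 90.1814301, 96.37462236, 0.0],
    [99.8108747, 91.77277179, 90.50160085, 94.05840886, 96.41956631],
]


def final_average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks after training on the last one."""
    return sum(acc[-1]) / len(acc[-1])


def backward_transfer(acc: List[List[float]]) -> float:
    """Mean change in accuracy on each earlier task between the time it was
    learned and the end of training (negative values indicate forgetting)."""
    n_tasks = len(acc)
    return sum(acc[-1][i] - acc[i][i] for i in range(n_tasks - 1)) / (n_tasks - 1)


print(f"Final average accuracy: {final_average_accuracy(cil):.2f}")
print(f"Backward transfer: {backward_transfer(cil):.2f}")
```

Comparing these two numbers across different accuracy drop margins is one way to check the trade-off described in the list above.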

We could even optimize different metrics, depending on the scenario we are dealing with. 