# VisCOLL: training and evaluation

## Training / Inference

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

### Training

Configurations can be specified in YAML files (such as those under configs/mlmcaptioning/) or directly on the command line via --cfg. The following should be specified:
- Model architecture: cfg.MLMCAPTION.BASE=vlbert|lxmert
- Output directory (cfg.OUTPUT_DIR)
- Replay memory size (cfg.EXTERNAL.REPLAY.MEM_LIMIT)
- OCL algorithm (cfg.EXTERNAL.OCL.ALGO=naive|ER|AGEM)
    - To run MIR, set ALGO=ER and EXTERNAL.OCL.MIR=1. You should also specify the hyperparameter EXTERNAL.OCL.MIR_K. Finally, EXTERNAL.OCL.MIR_AGG decides whether to use the original MIR or the variant MIR-MAX.
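Command-line overrides use dotted keys such as EXTERNAL.OCL.MIR_K=64. As an illustration only, the hypothetical helper `apply_overrides` below shows how such `KEY=VALUE` strings map onto a nested configuration; train_mlm.py has its own config handling, which this sketch does not reproduce.

```python
# Hypothetical helper (not part of the repository): shows how dotted
# "KEY=VALUE" override strings map onto a nested configuration dict.
import ast
from typing import Any, Dict, List


def _parse_value(text: str) -> Any:
    """Interpret '64' as int, '0.0001' as float, anything else as str."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_overrides(cfg: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            # Create intermediate sections (e.g. EXTERNAL, OCL) on demand.
            node = node.setdefault(name, {})
        node[leaf] = _parse_value(raw)
    return cfg


cfg = apply_overrides({}, [
    "MLMCAPTION.BASE=vlbert",
    "OUTPUT_DIR=runs/",
    "EXTERNAL.OCL.ALGO=ER",
    "EXTERNAL.OCL.MIR=1",
    "EXTERNAL.OCL.MIR_K=64",
    "EXTERNAL.REPLAY.MEM_LIMIT=10000",
])
print(cfg["EXTERNAL"]["OCL"])
```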

For example, to train a VLBERT model with a memory of 10,000 examples on COCO using the ER continual learning algorithm, run:

In [None]:
!python train_mlm.py --name debug --config configs/mlmcaptioning/er.yaml --seed 0 --cfg MLMCAPTION.BASE=vlbert OUTPUT_DIR=runs/

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
epoch
loading from cache
** Buffer details **
* length: 638903
* task: continuous
loading from cache
** Buffer details **
* length: 28720
* task: continuous
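ER replays stored examples alongside each incoming batch, with the memory capped at EXTERNAL.REPLAY.MEM_LIMIT. A common way to fill such a fixed-size memory from a stream is reservoir sampling; the sketch below assumes that policy, and the repository's buffer may insert or evict examples differently. `ReservoirMemory` is a hypothetical name.

```python
# Minimal sketch of a fixed-size ER memory filled by reservoir sampling.
# Assumption: reservoir sampling is used as the insertion policy.
import random
from typing import Any, List


class ReservoirMemory:
    def __init__(self, mem_limit: int, seed: int = 0) -> None:
        self.mem_limit = mem_limit
        self.items: List[Any] = []
        self.num_seen = 0
        self.rng = random.Random(seed)

    def add(self, example: Any) -> None:
        self.num_seen += 1
        if len(self.items) < self.mem_limit:
            self.items.append(example)
        else:
            # Keep the new example with probability mem_limit / num_seen,
            # overwriting a uniformly chosen slot.
            j = self.rng.randrange(self.num_seen)
            if j < self.mem_limit:
                self.items[j] = example

    def sample(self, k: int) -> List[Any]:
        """Draw up to k stored examples to replay with the current batch."""
        return self.rng.sample(self.items, min(k, len(self.items)))


memory = ReservoirMemory(mem_limit=10_000)
for step in range(50_000):
    memory.add(step)
replay_batch = memory.sample(32)
print(len(memory.items), len(replay_batch))
```

Reservoir sampling keeps every example seen so far in memory with equal probability, which suits the online setting where the stream length is not known in advance.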

To train an LXMERT model with a memory of 10,000 examples on Flickr using AGEM, run:

In [None]:
!python train_mlm.py --name debug_flickr --config configs/mlmcaptioning/agem_flickr.yaml --seed 0 --cfg MLMCAPTION.BASE=lxmert OUTPUT_DIR=runs/
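AGEM constrains each update using a gradient computed on a batch drawn from the replay memory. The sketch below shows the standard A-GEM projection rule on plain Python lists; it illustrates the algorithm, not the repository's implementation, which operates on model parameters.

```python
# Standard A-GEM projection on flat gradient vectors (illustrative sketch).
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def agem_project(grad: Vector, ref_grad: Vector) -> Vector:
    """grad: gradient on the current batch.
    ref_grad: gradient on a batch sampled from the replay memory.
    If the two conflict (negative dot product), remove the component of
    grad that points against ref_grad; otherwise return grad unchanged.
    """
    g_dot_ref = dot(grad, ref_grad)
    if g_dot_ref >= 0:
        return list(grad)
    scale = g_dot_ref / dot(ref_grad, ref_grad)
    return [g - scale * r for g, r in zip(grad, ref_grad)]


print(agem_project([1.0, -1.0], [0.0, 1.0]))
```

After projection, the update no longer increases the loss on the memory batch to first order, since its dot product with `ref_grad` is zero.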

To train a VLBERT model with a memory of 10,000 examples on COCO using MIR (the MIR-MAX variant, MIR_AGG=max) with MIR_K=64, run:

In [None]:
!python train_mlm.py --name debug_mir --config configs/mlmcaptioning/er_mir.yaml --seed 0 --cfg MLMCAPTION.BASE=vlbert OUTPUT_DIR=runs/
#equivalent to !python train_mlm.py --name debug_mir --config configs/mlmcaptioning/er.yaml --seed 0 --cfg MLMCAPTION.BASE=vlbert OUTPUT_DIR=runs/ EXTERNAL.OCL.MIR=1 EXTERNAL.OCL.MIR_AGG=max EXTERNAL.OCL.MIR_K=64
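MIR retrieves the memory examples whose loss would increase most after a virtual update on the current batch. The sketch below shows only the selection step, taking per-token losses before and after the virtual update as given. Two points are assumptions: that MIR_K is the number of retrieved examples, and that MIR_AGG controls how per-token loss increases are combined into one score per example (mean for the original MIR, max for MIR-MAX). `mir_select` is a hypothetical name.

```python
# Sketch of MIR candidate selection; the virtual update itself is not shown.
from typing import List, Sequence


def mir_select(
    loss_before: Sequence[Sequence[float]],
    loss_after: Sequence[Sequence[float]],
    k: int,
    agg: str = "mean",
) -> List[int]:
    """Return indices of the k candidates with the largest loss increase.

    Each inner sequence holds per-token losses for one memory candidate.
    agg="mean" averages the per-token increases; agg="max" takes the largest.
    """
    scores = []
    for before, after in zip(loss_before, loss_after):
        deltas = [a - b for a, b in zip(after, before)]
        if agg == "max":
            scores.append(max(deltas))
        elif agg == "mean":
            scores.append(sum(deltas) / len(deltas))
        else:
            raise ValueError(f"unknown aggregation: {agg}")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


before = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
after = [[1.6, 1.6], [1.0, 2.0], [1.2, 1.1]]
print(mir_select(before, after, k=1, agg="mean"), mir_select(before, after, k=1, agg="max"))
```

The two aggregations can disagree: candidate 1 has one strongly interfered token, so max ranks it first while mean prefers candidate 0, whose loss rises evenly.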

Training ends after a single pass over the data (the online continual learning setup), and the model is saved to "<output_dir>/model00.pth". Checkpoints are also saved every 2000 iterations to "<output_dir>/results/model_0_<iter>.pth".
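To see which files a finished run should contain, the hypothetical helper below lists the expected checkpoint paths. It assumes that intermediate checkpoints are written at iterations 2000, 4000, … strictly below the final iteration and that `<iter>` is not zero-padded; neither detail is confirmed by the repository. The run directory in the usage line follows the `<OUTPUT_DIR>/<name>_<seed>` pattern that the paths later in this notebook appear to use.

```python
# Hypothetical helper: expected checkpoint paths for one online pass.
import os
from typing import List


def expected_checkpoints(output_dir: str, num_iters: int, every: int = 2000) -> List[str]:
    # Intermediate checkpoints (assumed at every, 2*every, ... < num_iters).
    paths = [
        os.path.join(output_dir, "results", f"model_0_{it}.pth")
        for it in range(every, num_iters, every)
    ]
    # Final model saved after the single pass over the data.
    paths.append(os.path.join(output_dir, "model00.pth"))
    return paths


# 19966 iterations, as in the COCO training log above.
ckpts = expected_checkpoints("runs/debug_0", num_iters=19966)
print(len(ckpts), ckpts[0], ckpts[-1])
```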

### Testing

To evaluate a model, run test.py. On the command line, you can specify the epoch and iteration of the checkpoint to evaluate. For COCO, you can also pass the "--novel_comps" flag to evaluate on the 24 held-out concept pairs.

In [None]:
!python test.py --name debug --config configs/mlmcaptioning/er.yaml --seed 0 --epoch 00 --cfg MLMCAPTION.BASE=vlbert OUTPUT_DIR=runs

In [None]:
!python test.py --name debug --config configs/mlmcaptioning/er.yaml --seed 0 --epoch 00 --novel_comps --cfg MLMCAPTION.BASE=vlbert OUTPUT_DIR=runs

Similarly, you can run evaluation on Flickr-shift. For example:

In [None]:
!python test.py --name flickr-lxmert-er-mem10k-lr0.0001 --config configs/mlmcaptioning/er_flickr.yaml --seed 1 --epoch 00 --cfg MLMCAPTION.BASE=lxmert OUTPUT_DIR=runs-bak

Computing metrics such as forgetting requires running evaluation on every intermediate checkpoint. The shell scripts in the scripts/ folder batch this evaluation over all seeds and checkpoints.
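When batch evaluation is interrupted, it helps to know which checkpoints still lack results. The hypothetical helper below pairs each `model_<epoch>_<iter>.pth` checkpoint with the `results_verbose_model_<epoch>_<iter>.json` name described in this notebook and reports the missing ones. Both directories are parameters because the exact layout of a run may differ; this is a sketch, not what the repository's scripts do.

```python
# Hypothetical helper: find checkpoints that have not been evaluated yet.
import os
import re
import tempfile
from typing import List, Tuple

CKPT_PATTERN = re.compile(r"^model_(\d+)_(\d+)\.pth$")


def pending_evaluations(ckpt_dir: str, results_dir: str) -> List[Tuple[str, str]]:
    pending = []
    for name in os.listdir(ckpt_dir):
        match = CKPT_PATTERN.match(name)
        if match is None:
            continue  # skip files that are not intermediate checkpoints
        epoch, it = match.groups()
        result_name = f"results_verbose_model_{epoch}_{it}.json"
        if not os.path.exists(os.path.join(results_dir, result_name)):
            pending.append((epoch, it))
    # Sort numerically, so 10000 comes after 4000.
    return sorted(pending, key=lambda p: (int(p[0]), int(p[1])))


# Usage on a throwaway directory that mimics a run.
with tempfile.TemporaryDirectory() as run_dir:
    ckpt_dir = os.path.join(run_dir, "results")
    os.makedirs(ckpt_dir)
    for it in (2000, 4000, 10000):
        open(os.path.join(ckpt_dir, f"model_0_{it}.pth"), "w").close()
    open(os.path.join(run_dir, "results_verbose_model_0_2000.json"), "w").close()
    todo = pending_evaluations(ckpt_dir, run_dir)
print(todo)
```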

In [None]:
!bash ./scripts/mlm_eval.sh
# or !bash ./scripts/mlm_eval_flickr.sh

## Evaluating continual learning

After running inference, you will find a file named "results_verbose_model_<epoch>_<iter>.json" in the output directory. It contains the raw predictions and scores used to compute the metrics.

In [None]:
# Final BLEU and PPL score
from metrics.final_scores import *

In [None]:
# Final BLEU 1 to BLEU 4
evaluate_bleu_score_from_records('runs-bak/flickr-vlbert-er-mem10k-lr0.0001_1/results_verbose_model_00.json')

In [None]:
# Final log PPL
evaluate_ppl_from_records('runs-bak/flickr-vlbert-er-mem10k-lr0.0001_1/results_verbose_model_00.json')
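Log perplexity is the mean negative log-likelihood (NLL) of the predicted tokens, and perplexity is its exponential. The sketch below computes it from per-token NLLs. The input format is hypothetical, and pooling all masked tokens before averaging is an assumption; evaluate_ppl_from_records may instead average per example first.

```python
# Sketch: log perplexity as the mean NLL over all masked tokens.
import math
from typing import List


def log_ppl(token_nlls: List[List[float]]) -> float:
    """token_nlls[i] holds the NLL of each masked token in example i."""
    flat = [nll for example in token_nlls for nll in example]
    return sum(flat) / len(flat)


nlls = [[1.0, 2.0], [3.0]]  # hypothetical per-token NLLs for two examples
lp = log_ppl(nlls)
print(lp, math.exp(lp))
```

Pooling tokens weights long captions more heavily than per-example averaging would, so the two choices can give different numbers on the same records.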

You can also compute the forgetting metrics reported in the paper. This requires knowing when each task is visited in the training data stream.
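The key quantity is the iteration at which each task is visited for the last time. `stat_task_end` derives this from the COCO buffer; the sketch below shows the idea on a toy stream. It assumes a hypothetical stream format (one set of task labels per example) and a fixed batch size, so example `i` is seen at iteration `i // batch_size`.

```python
# Sketch: iteration of each task's last visit in a training stream.
from typing import Dict, Hashable, Iterable, List


def task_end_iterations(
    stream: List[Iterable[Hashable]], batch_size: int = 1
) -> Dict[Hashable, int]:
    ends: Dict[Hashable, int] = {}
    for position, tasks in enumerate(stream):
        for task in tasks:
            # Later visits overwrite earlier ones, leaving the last visit.
            ends[task] = position // batch_size
    return ends


stream = [{"dog"}, {"dog", "frisbee"}, {"cat"}, {"dog"}, {"cat", "sofa"}]
print(task_end_iterations(stream, batch_size=2))
```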

In [None]:
from data.coco import COCO
from yacs.config import CfgNode

cfg = CfgNode(new_allowed=True)
cfg.merge_from_file('configs/mlmcaptioning/naive.yaml') # placeholder config file
coco_dataset = COCO(cfg=cfg, split='train')

In [None]:
coco_tasks = coco_dataset.all_tasks
coco_dataset.set_task('continuous')
coco_buffer = coco_dataset.current_task_buffer

In [None]:
from metrics.forgetting import *
task_ends = stat_task_end(coco_tasks, coco_buffer)

In [None]:
stat_forgetting_single(task_ends, coco_tasks, output_dir='runs-bak/mscoco-vlbert-er-mem10k-lr0.0001_1', dataset='coco')
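One simple way to turn task end points into a forgetting score is to compare each task's log PPL at the first evaluated checkpoint after its last visit with its log PPL at the final checkpoint, then average over tasks. The sketch below follows that simplified definition. stat_forgetting_single may define or aggregate forgetting differently, and the input layout is hypothetical.

```python
# Simplified forgetting sketch: average rise in per-task log PPL between
# the first checkpoint after a task's last visit and the final checkpoint.
import bisect
from typing import Dict, Hashable, List


def forgetting(
    task_ends: Dict[Hashable, int],
    ckpt_iters: List[int],
    scores: Dict[int, Dict[Hashable, float]],
) -> float:
    """ckpt_iters must be sorted; scores[it][task] is the log PPL of task at
    checkpoint it (lower is better, so a positive result means forgetting).
    Tasks whose first later checkpoint is already the final one are skipped.
    """
    final = ckpt_iters[-1]
    drops = []
    for task, end in task_ends.items():
        idx = bisect.bisect_left(ckpt_iters, end)  # first checkpoint >= end
        if idx >= len(ckpt_iters) - 1:
            continue
        start = ckpt_iters[idx]
        drops.append(scores[final][task] - scores[start][task])
    return sum(drops) / len(drops) if drops else 0.0


ckpt_iters = [2000, 4000, 6000]
task_ends = {"dog": 1500, "cat": 3900, "sofa": 5000}
scores = {
    2000: {"dog": 2.0, "cat": 3.0, "sofa": 4.0},
    4000: {"dog": 2.5, "cat": 2.0, "sofa": 3.5},
    6000: {"dog": 3.0, "cat": 2.4, "sofa": 3.0},
}
print(forgetting(task_ends, ckpt_iters, scores))
```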

## Evaluating compositional generalization

In [None]:
import importlib
import metrics.comp_gen
importlib.reload(metrics.comp_gen)
from metrics.comp_gen import *

First, tokenize the sentences in the COCO training buffer.

In [None]:
coco_buffer = tokenize_buffer(coco_buffer, coco_dataset)

The 24 novel pairs are defined in metrics/comp_gen.py as `novel_pairs`. We construct corresponding "seen pairs" from the same set of atoms.
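One reading of "seen pairs with the same set of atoms" is to recombine the first and second words of the novel pairs and keep the combinations that occur in the training captions. The sketch below implements that reading. It treats co-occurrence of both words in one tokenized caption as an occurrence, which is an assumption; get_possible_seen_pairs_from_novel_pairs may use a stricter rule.

```python
# Sketch: seen pairs built by recombining the atoms of the novel pairs.
from itertools import product
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def seen_pairs_from_atoms(novel_pairs: List[Pair], captions: Iterable[List[str]]) -> Set[Pair]:
    novel = set(novel_pairs)
    firsts = {a for a, _ in novel_pairs}
    seconds = {b for _, b in novel_pairs}
    # Every recombination of atoms that is not itself a novel pair.
    candidates = set(product(firsts, seconds)) - novel
    seen: Set[Pair] = set()
    for tokens in captions:
        words = set(tokens)
        for a, b in candidates:
            if a in words and b in words:
                seen.add((a, b))
    return seen


novel = [("black", "cat"), ("white", "dog")]  # toy pairs, not the real 24
captions = [
    ["a", "black", "dog", "runs"],
    ["a", "white", "cat", "sleeps"],
    ["a", "brown", "cat"],
]
print(sorted(seen_pairs_from_atoms(novel, captions)))
```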

In [None]:
possible_pairs = get_possible_seen_pairs_from_novel_pairs(novel_pairs, coco_buffer)

Compute performance on the novel pairs using the compositional test split, and on the seen pairs using the regular test split.
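Performance on a set of pairs can be summarized by averaging log PPL over the test examples that contain each pair, then averaging across pairs. The sketch below assumes a hypothetical record schema, `{"tokens": [...], "log_ppl": float}`; the real results_verbose_*.json records and compute_novel_seen_performance_verbose are organized differently.

```python
# Sketch: per-pair mean log PPL, then a macro average over pairs.
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def ppl_by_pair(records: List[dict], pairs: List[Pair]) -> Dict[Pair, float]:
    out: Dict[Pair, float] = {}
    for a, b in pairs:
        vals = [r["log_ppl"] for r in records if a in r["tokens"] and b in r["tokens"]]
        if vals:  # pairs absent from the split are left out
            out[(a, b)] = sum(vals) / len(vals)
    return out


records = [
    {"tokens": ["a", "black", "cat"], "log_ppl": 2.0},
    {"tokens": ["black", "cat", "sits"], "log_ppl": 4.0},
    {"tokens": ["a", "white", "dog"], "log_ppl": 1.0},
]
per_pair = ppl_by_pair(records, [("black", "cat"), ("white", "dog"), ("red", "bus")])
macro = sum(per_pair.values()) / len(per_pair)
print(per_pair, macro)
```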

In [None]:
_, seen_pairs_ppl, _, _, _, _ = compute_novel_seen_performance_verbose('runs-bak/mscoco-vlbert-er-mem10k-lr0.0001_1', 'results_verbose_model_00.json', novel_pairs, possible_pairs)

In [None]:
novel_pairs_ppl, _, _, _, _, _ = compute_novel_seen_performance_verbose('runs-bak/mscoco-vlbert-er-mem10k-lr0.0001_1', 'results_verbose_model_00_novel_comps.json', novel_pairs, possible_pairs)

In [None]:
seen_pairs_ppl, novel_pairs_ppl