<a href="https://colab.research.google.com/github/SiAce/AJAX/blob/master/Hw2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <font color='red'>! Warning</font>

This notebook might crash the Colab runtime after using all available RAM. If this happens, click <font color='#dddd00'><ins>Get more RAM</ins></font> in the pop-up message at the bottom-left corner:
![alt text](https://i.imgur.com/uEhO0k4.png)
and then click <font color='#0099ff'><ins>YES</ins></font> in the following message to switch to a high-RAM runtime:
![alt text](https://i.imgur.com/7yPvpLd.png)

The estimated execution time of this notebook is 2 hours.

<h1><center><font size="5">AI Project 2 - Continual Learning (CL) for Robotic Perception</font></center></h1>

Zete Dai /zd790  
Min Yang /my2138  
Xiaorann Sun /xs954  

# 1. Overview

## Introduction and background
One of the greatest goals of AI is building an artificial continual learning agent which can construct a sophisticated understanding of the external world from its own experience through the adaptive, goal-oriented and incremental development of ever more complex skills and knowledge.  

Continual Learning (CL) is built on the idea of learning continuously and adaptively about the external world, enabling the autonomous, incremental development of ever more complex skills and knowledge.
In the context of machine learning, this means smoothly updating the prediction model to account for new tasks and data distributions while still retaining and reusing useful knowledge and skills over time.

Gradient Episodic Memory (GEM) is an effective method for continual learning. Each gradient update for the current task is formulated as a quadratic programming problem whose inequality constraints alleviate catastrophic forgetting of previous tasks.
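
As a sketch of that formulation, following the GEM paper [7]: let $g$ be the loss gradient on the current minibatch, $g_k$ the loss gradient on the episodic memory of past task $k$, and $M$ the matrix whose rows are the $g_k$ (these symbols are chosen here for illustration). The projected update $\tilde{g}$ solves

$$
\min_{\tilde{g}} \; \tfrac{1}{2}\,\lVert g - \tilde{g} \rVert_2^2 \quad \text{subject to} \quad \langle \tilde{g}, g_k \rangle \ge 0 \;\; \text{for all } k < t .
$$

Because there are far fewer past tasks than parameters, it is cheaper to solve the dual problem, which has one multiplier $v_k$ per past task, and then recover the update:

$$
\min_{v} \; \tfrac{1}{2}\, v^\top M M^\top v + g^\top M^\top v \quad \text{subject to} \quad v \ge \gamma, \qquad \tilde{g} = M^\top v^\star + g .
$$

With $\gamma = 0$ this is the exact dual. The function `project2cone2` in Section 3 passes `memory_strength` as the margin $\gamma$, which biases the update toward preserving past tasks.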





## Dataset 
We chose MNIST for practical reasons: it is small and widely used as a benchmark.

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.

The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset. The original creators of the database keep a list of some of the methods tested on it.

We used the dataset as provided by the [GradientEpisodicMemory repository](https://github.com/facebookresearch/GradientEpisodicMemory).



## Design
![Flow Chart](https://i.imgur.com/yPhgP9R.jpg)

# 2. Prepare before implementation

## Install and load packages

*   quadprog - Functions to solve quadratic programming problems




In [0]:
!pip install quadprog





*   numpy
*   quadprog
*   torch
*   torchvision - The torchvision package consists of popular datasets, model architectures, and common image transformations for computer vision.
*   PIL - Python Imaging Library. It is a free library for the Python programming language that adds support for opening, manipulating, and saving many different image file formats.




In [0]:
#import packages
from __future__ import print_function
import importlib
import datetime
import argparse
import random
import uuid
import time
import os
import numpy as np
import subprocess
import torch
import torch.nn as nn
import torch.optim as optim
import quadprog
import math
from torch.nn.functional import relu, avg_pool2d
from torchvision import transforms
from PIL import Image

Creating namespace.

In [0]:
# class for holding arguments passed into files
class Namespace():
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)
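
A small usage sketch of the `Namespace` pattern. The argument values below are placeholders for illustration, and the class is repeated only so that the snippet runs on its own.

```python
class Namespace:
    """Expose keyword arguments as attributes, as the notebook's class does."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


# Hypothetical values, chosen only for illustration.
demo_args = Namespace(n_tasks=20, lr=0.1)

print(demo_args.n_tasks)  # attribute access instead of dictionary lookup
print(vars(demo_args))    # vars() recovers the underlying dictionary
```

The standard library's `argparse.Namespace` accepts keyword arguments in the same way and could likely be used instead.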

## Data processing

Load data.



*   Load data from path `mnist_raw_path`.
*   If the file doesn't exist, download it from the Amazon S3 bucket first.
*   Save the train and test splits as two separate `.pt` files.


In [0]:
%%time
#get raw data
mnist_raw_path = "mnist.npz"
if not os.path.exists(mnist_raw_path):
  subprocess.call("wget https://s3.amazonaws.com/img-datasets/mnist.npz", shell=True)

f = np.load('mnist.npz')
x_tr = torch.from_numpy(f['x_train'])
y_tr = torch.from_numpy(f['y_train']).long()
x_te = torch.from_numpy(f['x_test'])
y_te = torch.from_numpy(f['y_test']).long()
f.close()

torch.save((x_tr, y_tr), 'mnist_train.pt')
torch.save((x_te, y_te), 'mnist_test.pt')


CPU times: user 284 ms, sys: 66.5 ms, total: 350 ms
Wall time: 348 ms


Define rotate function.



In [0]:
#rotate function
def rotate_dataset(dataset, rotation):
    tensor = transforms.ToTensor()
    new_dataset = torch.FloatTensor(dataset.size(0), 784)

    for i in range(dataset.size(0)):
        img = Image.fromarray(dataset[i].numpy(), mode='L')
        new_dataset[i] = tensor(img.rotate(rotation)).view(784)
    return new_dataset

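Rotate the dataset to create the task sequence.

Each of the 20 tasks applies a single rotation angle to every MNIST image; the angle for task `i` is drawn at random from the `i`-th of 20 equal sub-intervals of [0°, 180°]. This resembles data augmentation, a strategy that increases the diversity of training data without collecting new data (common techniques include cropping, padding, and horizontal flipping). Here, however, each rotation defines a separate task for the continual learner rather than enlarging a single training set.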

In [0]:
%%time
#add argument and produce new dataset

rotation_args = Namespace(**
    {
      'i': './',
      'o': 'mnist_rotations.pt',
      'n_tasks': 20,
      'min_rotation': 0.,
      'max_rotation': 180.,
      'random': 0
    }
)

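torch.manual_seed(rotation_args.random)
# the rotation angles below come from Python's `random` module, so seed it as well
random.seed(rotation_args.random)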

x_tr, y_tr = torch.load(os.path.join(rotation_args.i, 'mnist_train.pt'))
x_te, y_te = torch.load(os.path.join(rotation_args.i, 'mnist_test.pt'))

new_train_dataset = []
new_test_dataset = []
for i in range(rotation_args.n_tasks):
  min_rotation = i * (rotation_args.max_rotation - rotation_args.min_rotation) / rotation_args.n_tasks + rotation_args.min_rotation
  max_rotation = (i + 1) * (rotation_args.max_rotation - rotation_args.min_rotation) / rotation_args.n_tasks + rotation_args.min_rotation
  rotation = random.random() * (max_rotation - min_rotation) + min_rotation
  new_train_dataset.append([rotation, rotate_dataset(x_tr, rotation), y_tr])
  new_test_dataset.append([rotation, rotate_dataset(x_te, rotation), y_te])

torch.save([new_train_dataset, new_test_dataset], rotation_args.o)

  

CPU times: user 3min 33s, sys: 6.75 s, total: 3min 40s
Wall time: 4min 48s
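
The per-task angle selection in the loop above can be isolated as follows. This is a sketch, not the notebook's code: `sample_task_rotations` is a hypothetical helper, and it uses a local `random.Random(seed)` generator (an assumption made here so that the draw is reproducible).

```python
import random


def sample_task_rotations(n_tasks, min_rotation, max_rotation, seed=0):
    """Draw one angle per task from equal-width, non-overlapping sub-intervals."""
    rng = random.Random(seed)  # local generator, independent of global state
    width = (max_rotation - min_rotation) / n_tasks
    angles = []
    for i in range(n_tasks):
        low = min_rotation + i * width
        high = low + width
        angles.append(rng.random() * (high - low) + low)
    return angles


angles = sample_task_rotations(n_tasks=20, min_rotation=0.0, max_rotation=180.0)
for task_id, angle in enumerate(angles[:3]):
    print(f"task {task_id}: {angle:.1f} degrees")
```

With 20 tasks over 180 degrees, each sub-interval is 9 degrees wide, so consecutive tasks differ only slightly while the first and last tasks are far apart.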


# 3. Implementation

## Continuum iterator

We can interpret GEM as a model that learns the subset of correlations common to a set of distributions (tasks). Furthermore, GEM can be used to predict target vectors associated with previous or new tasks without making use of task descriptors.  

![Algorithm](https://i.imgur.com/cVx1OlW.png)

In [0]:
%%time

# /main.py

# continuum iterator #########################################################

def load_datasets(args):
    d_tr, d_te = torch.load(args.data_path + '/' + args.data_file)
    n_inputs = d_tr[0][1].size(1)
    n_outputs = 0
    for i in range(len(d_tr)):
        n_outputs = max(n_outputs, d_tr[i][2].max().item())
        n_outputs = max(n_outputs, d_te[i][2].max().item())
    return d_tr, d_te, n_inputs, n_outputs + 1, len(d_tr)


class Continuum:

    def __init__(self, data, args):
        self.data = data
        self.batch_size = args.batch_size
        n_tasks = len(data)
        task_permutation = range(n_tasks)

        if args.shuffle_tasks == 'yes':
            task_permutation = torch.randperm(n_tasks).tolist()

        sample_permutations = []

        for t in range(n_tasks):
            N = data[t][1].size(0)
            if args.samples_per_task <= 0:
                n = N
            else:
                n = min(args.samples_per_task, N)

            p = torch.randperm(N)[0:n]
            sample_permutations.append(p)

        self.permutation = []

        for t in range(n_tasks):
            task_t = task_permutation[t]
            for _ in range(args.n_epochs):
                task_p = [[task_t, i] for i in sample_permutations[task_t]]
                random.shuffle(task_p)
                self.permutation += task_p

        self.length = len(self.permutation)
        self.current = 0

    def __iter__(self):
        return self

    def next(self):
        return self.__next__()

    def __next__(self):
        if self.current >= self.length:
            raise StopIteration
        else:
            ti = self.permutation[self.current][0]
            j = []
            i = 0
            while (((self.current + i) < self.length) and
                   (self.permutation[self.current + i][0] == ti) and
                   (i < self.batch_size)):
                j.append(self.permutation[self.current + i][1])
                i += 1
            self.current += i
            j = torch.LongTensor(j)
            return self.data[ti][1][j], ti, self.data[ti][2][j]


CPU times: user 32 µs, sys: 0 ns, total: 32 µs
Wall time: 35.3 µs
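
The batching behaviour of `Continuum` can be illustrated with a simplified, pure-Python sketch. The name `toy_continuum` and the tiny sizes are assumptions for illustration; unlike the class above, it yields sample indices rather than tensors and does not support task shuffling.

```python
import random


def toy_continuum(samples_per_task, n_tasks, n_epochs, batch_size, seed=0):
    """Yield (task_id, sample_indices) batches, one task after another."""
    rng = random.Random(seed)
    stream = []
    for task_id in range(n_tasks):
        for _ in range(n_epochs):
            epoch = [(task_id, i) for i in range(samples_per_task)]
            rng.shuffle(epoch)  # the same samples are reshuffled every epoch
            stream.extend(epoch)

    pos = 0
    while pos < len(stream):
        task_id = stream[pos][0]
        batch = []
        # a batch ends at batch_size or at a task boundary, whichever comes first
        while (pos < len(stream) and stream[pos][0] == task_id
               and len(batch) < batch_size):
            batch.append(stream[pos][1])
            pos += 1
        yield task_id, batch


batches = list(toy_continuum(samples_per_task=5, n_tasks=2, n_epochs=2, batch_size=4))
print(len(batches))
print([task for task, _ in batches])
```

Each task contributes 10 samples here (5 samples over 2 epochs), which split into batches of 4, 4, and 2, so no batch ever mixes two tasks. By the same arithmetic, the settings used later in this notebook (1000 samples, 100 epochs, batch size 10, 20 tasks) give 200,000 batches in total.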


## Data training

Define the training functions.

In [0]:
%%time

# train handle ###############################################################


def eval_tasks(model, tasks, args):
    model.eval()
    result = []
    for i, task in enumerate(tasks):
        t = i
        x = task[1]
        y = task[2]
        rt = 0
        
        eval_bs = x.size(0)

        for b_from in range(0, x.size(0), eval_bs):
            # slice ends are exclusive, so clamp to x.size(0) to keep the last sample
            b_to = min(b_from + eval_bs, x.size(0))
            xb = x[b_from:b_to]
            yb = y[b_from:b_to]
            if args.cuda:
                xb = xb.cuda()
            _, pb = torch.max(model(xb, t).data.cpu(), 1, keepdim=False)
            rt += (pb == yb).float().sum()

        result.append(rt / x.size(0))

    return result


def life_experience(model, continuum, x_te, args):
    result_a = []
    result_t = []

    current_task = 0
    time_start = time.time()

    for (i, (x, t, y)) in enumerate(continuum):
        if(((i % args.log_every) == 0) or (t != current_task)):
            result_a.append(eval_tasks(model, x_te, args))
            result_t.append(current_task)
            current_task = t

        v_x = x.view(x.size(0), -1)
        v_y = y.long()

        if args.cuda:
            v_x = v_x.cuda()
            v_y = v_y.cuda()

        model.train()
        model.observe(v_x, t, v_y)

    result_a.append(eval_tasks(model, x_te, args))
    result_t.append(current_task)

    time_end = time.time()
    time_spent = time_end - time_start

    return torch.Tensor(result_t), torch.Tensor(result_a), time_spent



CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.72 µs


Set the arguments `model`, `n_hiddens`, `n_layers`, `n_memories`, `memory_strength`, `finetune`, `n_epochs`, `batch_size`, `lr`, `cuda`, `seed`, `log_every`, `save_path`, `data_path`, `data_file`, `samples_per_task`, `shuffle_tasks`.

Set `n_epochs` to 100 and use CUDA to accelerate training.  

In [0]:
%%time

main_args = Namespace(**
    {
        'model': 'gem',
        'n_hiddens': 100,
        'n_layers': 2,
        'n_memories': 256,
        'memory_strength': 0.5,
        'finetune': 'no',
        'n_epochs': 100,
        'batch_size': 10,
        'lr': 0.1,
        'cuda': 'yes',
        'seed': 0,
        'log_every': 100,
        'save_path': './',
        'data_path': './',
        'data_file': 'mnist_rotations.pt',
        'samples_per_task': 1000,
        'shuffle_tasks': 'no'
    }
)

main_args.cuda = True if main_args.cuda == 'yes' else False
main_args.finetune = True if main_args.finetune == 'yes' else False

# multimodal model has one extra layer
if main_args.model == 'multimodal':
    main_args.n_layers -= 1

# unique identifier
uid = uuid.uuid4().hex

# initialize seeds
torch.backends.cudnn.enabled = False
torch.manual_seed(main_args.seed)
np.random.seed(main_args.seed)
random.seed(main_args.seed)
if main_args.cuda:
    torch.cuda.manual_seed_all(main_args.seed)

# load data
x_tr, x_te, n_inputs, n_outputs, n_tasks = load_datasets(main_args)

# set up continuum
continuum = Continuum(x_tr, main_args)


CPU times: user 9.4 s, sys: 3.05 s, total: 12.4 s
Wall time: 12.4 s


## Architecture

On the MNIST task, we use fully-connected neural networks with two hidden layers of 100 ReLU units.  
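
To make the architecture concrete, the sketch below derives the layer sizes the same way `Net` does (`[n_inputs] + [nh] * nl + [n_outputs]`) and counts the trainable parameters. The helper names are hypothetical; the sizes assume 784 inputs (28 x 28 pixels) and 10 output classes.

```python
def mlp_layer_sizes(n_inputs, n_hiddens, n_layers, n_outputs):
    """Input width, then n_layers hidden layers, then the output width."""
    return [n_inputs] + [n_hiddens] * n_layers + [n_outputs]


def count_parameters(sizes):
    """Each linear layer has fan_in * fan_out weights plus fan_out biases."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))


sizes = mlp_layer_sizes(n_inputs=784, n_hiddens=100, n_layers=2, n_outputs=10)
print(sizes)                    # [784, 100, 100, 10]
print(count_parameters(sizes))  # 89610
```

This count is also the length of each per-task gradient vector that GEM stores, since the gradients of all layers are flattened into one column per task.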

In [0]:
%%time

# load model
# /model/gem.py
# /model/common.py

def Xavier(m):
    if m.__class__.__name__ == 'Linear':
        fan_in, fan_out = m.weight.data.size(1), m.weight.data.size(0)
        std = 1.0 * math.sqrt(2.0 / (fan_in + fan_out))
        a = math.sqrt(3.0) * std
        m.weight.data.uniform_(-a, a)
        m.bias.data.fill_(0.0)


class MLP(nn.Module):
    def __init__(self, sizes):
        super(MLP, self).__init__()
        layers = []

        for i in range(0, len(sizes) - 1):
            layers.append(nn.Linear(sizes[i], sizes[i + 1]))
            if i < (len(sizes) - 2):
                layers.append(nn.ReLU())

        self.net = nn.Sequential(*layers)
        self.net.apply(Xavier)

    def forward(self, x):
        return self.net(x)


def conv3x3(in_planes, out_planes, stride=1):
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                    padding=1, bias=False)


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_planes, planes, stride=1):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(in_planes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion * planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion * planes, kernel_size=1,
                        stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion * planes)
            )

    def forward(self, x):
        out = relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        out = relu(out)
        return out


class ResNet(nn.Module):
    def __init__(self, block, num_blocks, num_classes, nf):
        super(ResNet, self).__init__()
        self.in_planes = nf

        self.conv1 = conv3x3(3, nf * 1)
        self.bn1 = nn.BatchNorm2d(nf * 1)
        self.layer1 = self._make_layer(block, nf * 1, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, nf * 2, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, nf * 4, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, nf * 8, num_blocks[3], stride=2)
        self.linear = nn.Linear(nf * 8 * block.expansion, num_classes)

    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        bsz = x.size(0)
        out = relu(self.bn1(self.conv1(x.view(bsz, 3, 32, 32))))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = avg_pool2d(out, 4)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out


def ResNet18(nclasses, nf=20):
    return ResNet(BasicBlock, [2, 2, 2, 2], nclasses, nf)


# Auxiliary functions useful for GEM's inner optimization.

def compute_offsets(task, nc_per_task, is_cifar):
    """
        Compute offsets for cifar to determine which
        outputs to select for a given task.
    """
    if is_cifar:
        offset1 = task * nc_per_task
        offset2 = (task + 1) * nc_per_task
    else:
        offset1 = 0
        offset2 = nc_per_task
    return offset1, offset2


def store_grad(pp, grads, grad_dims, tid):
    """
        This stores parameter gradients of past tasks.
        pp: parameters
        grads: gradients
        grad_dims: list with number of parameters per layers
        tid: task id
    """
    # store the gradients
    grads[:, tid].fill_(0.0)
    cnt = 0
    for param in pp():
        if param.grad is not None:
            beg = 0 if cnt == 0 else sum(grad_dims[:cnt])
            en = sum(grad_dims[:cnt + 1])
            grads[beg: en, tid].copy_(param.grad.data.view(-1))
        cnt += 1


def overwrite_grad(pp, newgrad, grad_dims):
    """
        This is used to overwrite the gradients with a new gradient
        vector, whenever violations occur.
        pp: parameters
        newgrad: corrected gradient
        grad_dims: list storing number of parameters at each layer
    """
    cnt = 0
    for param in pp():
        if param.grad is not None:
            beg = 0 if cnt == 0 else sum(grad_dims[:cnt])
            en = sum(grad_dims[:cnt + 1])
            this_grad = newgrad[beg: en].contiguous().view(
                param.grad.data.size())
            param.grad.data.copy_(this_grad)
        cnt += 1


def project2cone2(gradient, memories, margin=0.5, eps=1e-3):
    """
        Solves the GEM dual QP described in the paper given a proposed
        gradient "gradient", and a memory of task gradients "memories".
        Overwrites "gradient" with the final projected update.

        input:  gradient, p-vector
        input:  memories, (t * p)-vector
        output: x, p-vector
    """
    memories_np = memories.cpu().t().double().numpy()
    gradient_np = gradient.cpu().contiguous().view(-1).double().numpy()
    t = memories_np.shape[0]
    P = np.dot(memories_np, memories_np.transpose())
    P = 0.5 * (P + P.transpose()) + np.eye(t) * eps
    q = np.dot(memories_np, gradient_np) * -1
    G = np.eye(t)
    h = np.zeros(t) + margin
    v = quadprog.solve_qp(P, q, G, h)[0]
    x = np.dot(v, memories_np) + gradient_np
    gradient.copy_(torch.Tensor(x).view(-1, 1))


class Net(nn.Module):
    def __init__(self,
                n_inputs,
                n_outputs,
                n_tasks,
                args):
        super(Net, self).__init__()
        nl, nh = args.n_layers, args.n_hiddens
        self.margin = args.memory_strength
        self.is_cifar = (args.data_file == 'cifar100.pt')
        if self.is_cifar:
            self.net = ResNet18(n_outputs)
        else:
            self.net = MLP([n_inputs] + [nh] * nl + [n_outputs])

        self.ce = nn.CrossEntropyLoss()
        self.n_outputs = n_outputs

        self.opt = optim.SGD(self.parameters(), args.lr)

        self.n_memories = args.n_memories
        self.gpu = args.cuda

        # allocate episodic memory
        self.memory_data = torch.FloatTensor(
            n_tasks, self.n_memories, n_inputs)
        self.memory_labs = torch.LongTensor(n_tasks, self.n_memories)
        if args.cuda:
            self.memory_data = self.memory_data.cuda()
            self.memory_labs = self.memory_labs.cuda()

        # allocate temporary synaptic memory
        self.grad_dims = []
        for param in self.parameters():
            self.grad_dims.append(param.data.numel())
        self.grads = torch.Tensor(sum(self.grad_dims), n_tasks)
        if args.cuda:
            self.grads = self.grads.cuda()

        # allocate counters
        self.observed_tasks = []
        self.old_task = -1
        self.mem_cnt = 0
        if self.is_cifar:
            self.nc_per_task = int(n_outputs / n_tasks)
        else:
            self.nc_per_task = n_outputs

    def forward(self, x, t):
        output = self.net(x)
        if self.is_cifar:
            # make sure we predict classes within the current task
            offset1 = int(t * self.nc_per_task)
            offset2 = int((t + 1) * self.nc_per_task)
            if offset1 > 0:
                output[:, :offset1].data.fill_(-10e10)
            if offset2 < self.n_outputs:
                output[:, offset2:self.n_outputs].data.fill_(-10e10)
        return output

    def observe(self, x, t, y):
        # update memory
        if t != self.old_task:
            self.observed_tasks.append(t)
            self.old_task = t

        # Update ring buffer storing examples from current task
        bsz = y.data.size(0)
        endcnt = min(self.mem_cnt + bsz, self.n_memories)
        effbsz = endcnt - self.mem_cnt
        self.memory_data[t, self.mem_cnt: endcnt].copy_(
            x.data[: effbsz])
        if bsz == 1:
            self.memory_labs[t, self.mem_cnt] = y.data[0]
        else:
            self.memory_labs[t, self.mem_cnt: endcnt].copy_(
                y.data[: effbsz])
        self.mem_cnt += effbsz
        if self.mem_cnt == self.n_memories:
            self.mem_cnt = 0

        # compute gradient on previous tasks
        if len(self.observed_tasks) > 1:
            for tt in range(len(self.observed_tasks) - 1):
                self.zero_grad()
                # fwd/bwd on the examples in the memory
                past_task = self.observed_tasks[tt]

                offset1, offset2 = compute_offsets(past_task, self.nc_per_task,
                                                self.is_cifar)
                ptloss = self.ce(
                    self.forward(
                        self.memory_data[past_task],
                        past_task)[:, offset1: offset2],
                    self.memory_labs[past_task] - offset1)
                ptloss.backward()
                store_grad(self.parameters, self.grads, self.grad_dims,
                        past_task)

        # now compute the grad on the current minibatch
        self.zero_grad()

        offset1, offset2 = compute_offsets(t, self.nc_per_task, self.is_cifar)
        loss = self.ce(self.forward(x, t)[:, offset1: offset2], y - offset1)
        loss.backward()

        # check if gradient violates constraints
        if len(self.observed_tasks) > 1:
            # copy gradient
            store_grad(self.parameters, self.grads, self.grad_dims, t)
            indx = torch.cuda.LongTensor(self.observed_tasks[:-1]) if self.gpu \
                else torch.LongTensor(self.observed_tasks[:-1])
            dotp = torch.mm(self.grads[:, t].unsqueeze(0),
                            self.grads.index_select(1, indx))
            if (dotp < 0).sum() != 0:
                project2cone2(self.grads[:, t].unsqueeze(1),
                            self.grads.index_select(1, indx), self.margin)
                # copy gradients back
                overwrite_grad(self.parameters, self.grads[:, t],
                            self.grad_dims)
        self.opt.step()

model = Net(n_inputs, n_outputs, n_tasks, main_args)

if main_args.cuda:
    model.cuda()

# run model on continuum
result_t, result_a, spent_time = life_experience(
    model, continuum, x_te, main_args)

# prepare saving path and file name
if not os.path.exists(main_args.save_path):
    os.makedirs(main_args.save_path)

fname = main_args.model + '_' + main_args.data_file + '_'
fname += datetime.datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
fname += '_' + uid
fname = os.path.join(main_args.save_path, fname)



CPU times: user 2h 57min 58s, sys: 4h 20min 51s, total: 7h 18min 50s
Wall time: 1h 52min 32s
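
The constraint check in `observe` can be illustrated for the simplest case. The sketch below is not the quadratic program that `project2cone2` solves with `quadprog`: it is the closed-form solution for a single past task with zero margin, and `project_single_constraint` is a hypothetical name. It assumes the reference gradient is non-zero.

```python
import numpy as np


def project_single_constraint(g, g_ref):
    """Keep g if it does not conflict with g_ref, else drop the conflicting part."""
    dot = float(np.dot(g, g_ref))
    if dot >= 0.0:
        # the update does not increase the past-task loss to first order
        return g.copy()
    # remove the component of g that points against g_ref (assumes g_ref != 0)
    return g - (dot / float(np.dot(g_ref, g_ref))) * g_ref


g_ref = np.array([1.0, 0.0])        # gradient on the memory of a past task
g_conflict = np.array([-1.0, 1.0])  # negative dot product with g_ref
g_ok = np.array([0.5, 1.0])         # positive dot product with g_ref

print(project_single_constraint(g_conflict, g_ref))
print(project_single_constraint(g_ok, g_ref))
```

After projection, the conflicting gradient is orthogonal to `g_ref`, while the non-conflicting one passes through unchanged. With several past tasks the constraints interact, which is why the notebook solves the dual problem numerically instead.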


## Evaluation metrics



*   confusion matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It makes confusion between classes easy to identify, e.g. one class being commonly mislabeled as another, and most performance measures are computed from it.

In this notebook, `confusion_matrix` builds the task-level analogue: entry (i, j) is the test accuracy on task j after training on task i has finished. The final accuracy, backward transfer, and forward transfer are computed from this matrix.
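
Written out, with $R_{i,j}$ the test accuracy on task $j$ after training on task $i$, $b_j$ the accuracy on task $j$ at the first evaluation (before any training), and $T$ the number of tasks, the three statistics returned by `confusion_matrix` are as follows. The notation is introduced here for illustration and follows the GEM paper [7].

$$
\text{ACC} = \frac{1}{T}\sum_{j=1}^{T} R_{T,j}, \qquad
\text{BWT} = \frac{1}{T}\sum_{j=1}^{T} \left(R_{T,j} - R_{j,j}\right), \qquad
\text{FWT} = \frac{1}{T}\sum_{j=2}^{T} \left(R_{j-1,j} - b_j\right).
$$

The code takes `.mean()` over all $T$ entries, including the zero term for $j = T$ in BWT and the zero entry for $j = 1$ in FWT. The paper instead averages over $T - 1$ terms, so the BWT and FWT reported here are slightly smaller in magnitude than the paper's definition would give.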




In [0]:
%%time

# /metrics/metrics.py
def task_changes(result_t):
    n_tasks = int(result_t.max() + 1)
    changes = []
    current = result_t[0]
    for i, t in enumerate(result_t):
        if t != current:
            changes.append(i)
            current = t

    return n_tasks, changes


def confusion_matrix(result_t, result_a, fname=None):
    nt, changes = task_changes(result_t)

    baseline = result_a[0]
    changes = torch.LongTensor(changes + [result_a.size(0)]) - 1
    result = result_a[changes]

    # acc[t] equals result[t,t]
    acc = result.diag()
    fin = result[nt - 1]
    # bwt[t] equals result[T,t] - acc[t]
    bwt = result[nt - 1] - acc

    # fwt[t] equals result[t-1,t] - baseline[t]
    fwt = torch.zeros(nt)
    for t in range(1, nt):
        fwt[t] = result[t - 1, t] - baseline[t]

    if fname is not None:
        f = open(fname, 'w')

        print(' '.join(['%.4f' % r for r in baseline]), file=f)
        print('|', file=f)
        for row in range(result.size(0)):
            print(' '.join(['%.4f' % r for r in result[row]]), file=f)
        print('', file=f)
        # print('Diagonal Accuracy: %.4f' % acc.mean(), file=f)
        print('Final Accuracy: %.4f' % fin.mean(), file=f)
        print('Backward: %.4f' % bwt.mean(), file=f)
        print('Forward:  %.4f' % fwt.mean(), file=f)
        f.close()

    stats = []
    # stats.append(acc.mean())
    stats.append(fin.mean())
    stats.append(bwt.mean())
    stats.append(fwt.mean())

    return stats
    
# save confusion matrix and print one line of stats
stats = confusion_matrix(result_t, result_a, fname + '.txt')
one_liner = str(vars(main_args)) + ' # '
one_liner += ' '.join(["%.3f" % stat for stat in stats])
print(fname + ': ' + one_liner + ' # ' + str(spent_time))

# save all results in binary file
torch.save((result_t, result_a, model.state_dict(),
            stats, one_liner, main_args), fname + '.pt')


./gem_mnist_rotations.pt_2020_04_05_23_23_15_acb9dfeee10847b28ddf4c8105e45896: {'model': 'gem', 'n_hiddens': 100, 'n_layers': 2, 'n_memories': 256, 'memory_strength': 0.5, 'finetune': False, 'n_epochs': 100, 'batch_size': 10, 'lr': 0.1, 'cuda': True, 'seed': 0, 'log_every': 100, 'save_path': './', 'data_path': './', 'data_file': 'mnist_rotations.pt', 'samples_per_task': 1000, 'shuffle_tasks': 'no'} # 0.884 -0.024 0.729 # 6744.08793759346
CPU times: user 24.2 ms, sys: 0 ns, total: 24.2 ms
Wall time: 36.3 ms


# 4. Visualization

Visualise the result using Matplotlib.



*   Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+.




In [0]:
%%time

import matplotlib as mpl
from matplotlib import pyplot as plt
from glob import glob

#Visualization
mpl.use('Agg')

models = ['gem']
datasets = ['mnist_rotations']

names_datasets = {'mnist_rotations': 'MNIST rotations'}

names_models = {'gem': 'GEM'}

colors = {'gem': 'C0'}

barplot = {}

for dataset in datasets:
    barplot[dataset] = {}
    for model in models:
        barplot[dataset][model] = {}
        matches = glob(model + '*' + dataset + '*.pt')
        if len(matches):
            data = torch.load(matches[0], map_location=lambda storage, loc: storage)
            acc, bwt, fwt = data[3][:]
            barplot[dataset][model]['acc'] = acc
            barplot[dataset][model]['bwt'] = bwt
            barplot[dataset][model]['fwt'] = fwt

for dataset in datasets:
    x_lab = []
    y_acc = []
    y_bwt = []
    y_fwt = []

    for i, model in enumerate(models):
        if barplot[dataset][model] != {}:
            x_lab.append(model)
            y_acc.append(barplot[dataset][model]['acc'])
            y_bwt.append(barplot[dataset][model]['bwt'])
            y_fwt.append(barplot[dataset][model]['fwt'])

    x_ind = np.arange(len(y_acc))

    plt.figure(figsize=(7, 3))
    all_colors = []
    for xi, yi, li in zip(x_ind, y_acc, x_lab):
        plt.bar(xi, yi, label=names_models[li], color=colors[li])
        all_colors.append(colors[li])
    plt.bar(x_ind + (len(y_acc) + 1) * 1, y_bwt, color=all_colors)
    plt.bar(x_ind + (len(y_acc) + 1) * 2, y_fwt, color=all_colors)
    plt.xticks([0, 2, 4], ['ACC', 'BWT', 'FWT'], fontsize=16)
    plt.yticks(fontsize=16)
    plt.xlim(-1, len(y_acc) * 3 + 2)
    plt.ylabel('classification accuracy', fontsize=16)
    plt.title(names_datasets[dataset], fontsize=16)
    plt.legend(fontsize=12)
    plt.tight_layout()
    plt.savefig('barplot_%s.jpg' % dataset, bbox_inches='tight')

evoplot = {}

for dataset in datasets:
    evoplot[dataset] = {}
    for model in models:
        matches = glob(model + '*' + dataset + '*.pt')
        if len(matches):
            data = torch.load(matches[0], map_location=lambda storage, loc: storage)
            evoplot[dataset][model] = data[1][:, 0].numpy()

for dataset in datasets:

    plt.figure(figsize=(7, 3))
    for model in models:
        if model in evoplot[dataset]:
            x = np.arange(len(evoplot[dataset][model]))
            x = (x - x.min()) / (x.max() - x.min()) * 20
            plt.plot(x, evoplot[dataset][model], color=colors[model], lw=3)
            plt.xticks(range(0, 21, 2))

    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.xlabel('task number', fontsize=16)
    plt.title(names_datasets[dataset], fontsize=16)
    plt.tight_layout()
    plt.savefig('evoplot_%s.jpg' % dataset, bbox_inches='tight')
    

CPU times: user 517 ms, sys: 406 ms, total: 923 ms
Wall time: 435 ms


# 5. Result and Output

## Running log and statistics

0.1111 0.1215 0.1218 0.1232 0.0901 0.0901 0.0787 0.0837 0.0654 0.0720 0.0687 0.0643 0.0478 0.0423 0.0507 0.0556 0.0648 0.0718 0.0786 0.0767
|
0.9030 0.8121 0.6356 0.5782 0.3703 0.2999 0.2030 0.1716 0.1383 0.1245 0.1182 0.1094 0.1100 0.1287 0.1547 0.1750 0.1870 0.1924 0.1998 0.2128
0.9070 0.9199 0.8400 0.8047 0.5817 0.4926 0.3151 0.2370 0.1633 0.1516 0.1397 0.1177 0.1042 0.1159 0.1413 0.1766 0.1985 0.2191 0.2194 0.2340
0.8824 0.9323 0.9304 0.9163 0.8125 0.7323 0.5218 0.3946 0.2274 0.1977 0.1814 0.1479 0.1294 0.1274 0.1402 0.1633 0.1884 0.2218 0.2294 0.2423
0.8775 0.9368 0.9415 0.9356 0.8636 0.7932 0.5900 0.4568 0.2553 0.2081 0.1859 0.1406 0.1092 0.1068 0.1197 0.1394 0.1657 0.1957 0.2038 0.2142
0.8547 0.9217 0.9404 0.9391 0.9224 0.8886 0.7567 0.6308 0.3908 0.3036 0.2712 0.1860 0.1344 0.1230 0.1318 0.1430 0.1669 0.2026 0.2175 0.2358
0.8525 0.9164 0.9355 0.9383 0.9384 0.9260 0.8556 0.7694 0.5164 0.4230 0.3698 0.2336 0.1606 0.1419 0.1425 0.1594 0.1755 0.2087 0.2156 0.2294
0.8479 0.9049 0.9247 0.9277 0.9422 0.9428 0.9187 0.8849 0.6939 0.5997 0.5365 0.3428 0.2177 0.1679 0.1558 0.1639 0.1777 0.2057 0.2080 0.2178
0.8332 0.8965 0.9135 0.9216 0.9349 0.9378 0.9303 0.9132 0.8140 0.7516 0.6905 0.4884 0.3009 0.1996 0.1765 0.1668 0.1734 0.1987 0.2078 0.2122
0.8211 0.8874 0.9055 0.9106 0.9228 0.9258 0.9289 0.9245 0.9079 0.8689 0.8417 0.6878 0.4761 0.2933 0.2180 0.1814 0.1819 0.1983 0.1999 0.2025
0.8185 0.8871 0.9024 0.9077 0.9202 0.9220 0.9270 0.9272 0.9214 0.9066 0.8857 0.7883 0.5688 0.3789 0.2742 0.2083 0.1985 0.2065 0.2082 0.2080
0.8117 0.8882 0.9034 0.9046 0.9141 0.9169 0.9247 0.9260 0.9347 0.9282 0.9188 0.8512 0.6712 0.4666 0.3472 0.2466 0.2151 0.1994 0.1984 0.2086
0.8111 0.8850 0.8995 0.9024 0.9036 0.9056 0.9144 0.9163 0.9291 0.9312 0.9328 0.9128 0.7958 0.6207 0.4824 0.3376 0.2796 0.2337 0.2179 0.2087
0.7986 0.8734 0.8924 0.8983 0.9040 0.9047 0.9098 0.9090 0.9156 0.9217 0.9286 0.9225 0.8995 0.8128 0.7180 0.5489 0.4560 0.3543 0.2907 0.2513
0.7810 0.8690 0.8900 0.8942 0.8997 0.9002 0.9015 0.8971 0.9032 0.9062 0.9149 0.9172 0.9174 0.8922 0.8383 0.7181 0.6026 0.4502 0.3477 0.2721
0.7725 0.8674 0.8921 0.8965 0.9025 0.9001 0.8978 0.8919 0.8934 0.8990 0.9066 0.9162 0.9251 0.9195 0.8984 0.8101 0.7040 0.5298 0.3833 0.2645
0.7723 0.8668 0.8913 0.8948 0.8975 0.8996 0.8947 0.8868 0.8887 0.8956 0.9007 0.9053 0.9176 0.9220 0.9169 0.8949 0.8513 0.7330 0.5896 0.4232
0.7819 0.8705 0.8912 0.8940 0.8963 0.8960 0.8894 0.8835 0.8879 0.8912 0.8958 0.8992 0.9123 0.9238 0.9221 0.9202 0.9008 0.8183 0.6882 0.5044
0.7694 0.8635 0.8884 0.8916 0.8949 0.8939 0.8895 0.8812 0.8833 0.8837 0.8896 0.8933 0.9010 0.9111 0.9160 0.9280 0.9194 0.9002 0.8256 0.6756
0.7666 0.8600 0.8842 0.8882 0.8954 0.8908 0.8855 0.8754 0.8757 0.8803 0.8817 0.8842 0.8922 0.9041 0.9065 0.9193 0.9220 0.9209 0.8977 0.8195
0.7617 0.8611 0.8844 0.8917 0.8986 0.8948 0.8893 0.8809 0.8749 0.8781 0.8804 0.8830 0.8899 0.8987 0.8967 0.9112 0.9167 0.9229 0.9189 0.8912

Final Accuracy: 0.8863
Backward: -0.0233
Forward:  0.7292


CPU times: user 1h 30min 15s, sys: 1h 3min 5s, total: 2h 33min 20s
Wall time: 1h 21min 50s
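
The report above follows the layout written by `confusion_matrix`: one baseline row, a `|` separator, one row per task, then the summary statistics. A minimal sketch for reading such a report back is shown below; `parse_metrics_text` is a hypothetical helper, and the two-task sample contains made-up numbers used only to exercise it.

```python
def parse_metrics_text(text):
    """Split a saved report into baseline row, accuracy matrix, and summary stats."""
    lines = [line.strip() for line in text.strip().splitlines()]
    separator = lines.index('|')
    baseline = [float(v) for v in lines[0].split()]
    matrix, stats = [], {}
    for line in lines[separator + 1:]:
        if not line:
            continue  # blank line between the matrix and the summary
        if ':' in line:
            key, value = line.split(':')
            stats[key.strip()] = float(value)
        else:
            matrix.append([float(v) for v in line.split()])
    return baseline, matrix, stats


# Made-up two-task report in the same layout as the real output.
sample = """0.10 0.12
|
0.90 0.81
0.88 0.92

Final Accuracy: 0.9000
Backward: -0.0100
Forward:  0.3450
"""

baseline, matrix, stats = parse_metrics_text(sample)
print(baseline)
print(matrix)
print(stats)
```

Row `i` of the matrix holds the accuracy on every task after task `i` has been learned, so the diagonal shows how well each task was learned and the last row shows how much of it was retained.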

## Output

The visualization step produces two JPG files:

1. evoplot_mnist_rotations.jpg  

![evoplot_mnist_rotations.jpg](https://i.imgur.com/85kRLdi.jpg)

2. barplot_mnist_rotations.jpg

![barplot_mnist_rotations.jpg](https://i.imgur.com/VVeEabZ.jpg)


# 6. References

[1] https://en.wikipedia.org/wiki/MNIST_database  
[2] http://yann.lecun.com/exdb/mnist/  
[3] https://openreview.net/forum?id=H1g79ySYvB  
[4] https://medium.com/continual-ai/why-continuous-learning-is-the-key-towards-  
[5] https://www.geeksforgeeks.org/confusion-matrix-machine-learning/  
[6] https://github.com/facebookresearch/GradientEpisodicMemory  
[7] [Gradient Episodic Memory for Continual Learning](https://arxiv.org/pdf/1706.08840.pdf)  
[8] [Continual Lifelong Learning with Neural Networks: A Review](https://arxiv.org/pdf/1802.07569.pdf)  