# Project 8 - Lifelong Learning

## Tip
You can head to the course assignment area and try it hands-on first!

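## Project Description

* Use two regularization-based lifelong learning methods, EWC (Elastic Weight Consolidation) and MAS (Memory Aware Synapses), to train sequentially on two different datasets that ship with Paddle, MNIST and Cifar10
* Based on the two given lifelong learning regularization algorithms, implement the SCP algorithm yourself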

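## Dataset Introduction

This project uses the MNIST and Cifar10 datasets; each has 10 classes

MNIST dataset:
Each image is a 28x28 grayscale image; each pixel is one 8-bit byte (0~255)
The task is to recognize the digits 0-9 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Training set: 60000 images
Testing set: 10000 images

Cifar10 dataset:
Each image is a 32x32 color image; each pixel consists of three 8-bit bytes (0~255)
The task is to recognize 'airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'
Training set: 50000 images
Testing set: 10000 images

**Data format**

The datasets ship with Paddle; the data is downloaded from within the code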

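## Project Requirements

* Explain, in Chinese, the central concept of lifelong learning.
* List how EWC and MAS work. Based on your understanding, outline the general procedure for each.
* What is the biggest difference in the data that EWC and MAS require?
* Show the final result figures for part 1 and part 2, analyze the results, and describe what you found in your experiments.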

## Data Preparation

None

## Environment Setup / Installation

None

## Lifelong Machine Learning

### Methods
At the end of 2019, a survey categorized the lifelong learning methods proposed from 2016 to early 2019. They fall broadly into three families:
* Replay-based methods
* Regularization-based methods
* Parameter isolation methods

![](https://ai-studio-static-online.cdn.bcebos.com/dc2465afd2ee4e51a35d93b8129dba025afa68f7fbe44993a11bc73891f45214)


In this assignment we walk through two prior-focused methods from the regularization-based family: EWC and MAS.

Image source: [Continual Learning in Neural Networks](https://arxiv.org/pdf/1910.02718.pdf)





In [1]:
%%capture
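# Go to the official website, pick the paddlepaddle build that matches your environment, and run the corresponding install command (PaddlePaddle 2.0 rc is used here)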
# !python -m pip install paddlepaddle-gpu==2.0.0rc0.post90 -f https://paddlepaddle.org.cn/whl/stable.html

# Import libraries

In [7]:
%%capture
import paddle
import paddle.nn as nn
import paddle.optimizer as optim
import paddle.nn.functional as F
import paddle.fluid.data as data
import paddle.fluid.core as core
import paddle.fluid.dataloader.sampler as sampler
from paddle import vision
from paddle.io import DataLoader
from paddle.vision import datasets, transforms

import numpy as np
import os
import random
from copy import deepcopy
import json

# Train on the GPU if it is supported, otherwise train on the CPU
support_gpu = paddle.is_compiled_with_cuda()
place = paddle.CPUPlace()
if support_gpu:
    place = paddle.CUDAPlace(0)
paddle.disable_static(place)
print(paddle.__version__)
print("place:{} ".format(place))

device = paddle.set_device("gpu" if core.is_compiled_with_cuda() else "cpu")

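# Model

 >Because this assignment focuses on lifelong learning training methods rather than on stacking models, all examples use the same model and differ only in the lifelong learning training method applied. The model used here consists of six fully-connected layers with ReLU activation functions.

## Baseline model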

In [8]:
class Model(nn.Layer):

  def __init__(self):
    super(Model, self).__init__()
    self.fc1 = nn.Linear(3*32*32, 1024)
    self.fc2 = nn.Linear(1024, 512)
    self.fc3 = nn.Linear(512, 256)
    self.fc4 = nn.Linear(256, 128)
    self.fc5 = nn.Linear(128, 128)
    self.fc6 = nn.Linear(128, 10)
    self.relu = nn.ReLU()

  def forward(self, x):
    x = x.reshape((-1, 3*32*32))
    x = self.fc1(x)
    x = self.relu(x)
    x = self.fc2(x)
    x = self.relu(x)
    x = self.fc3(x)
    x = self.relu(x)
    x = self.fc4(x)
    x = self.relu(x)
    x = self.fc5(x)
    x = self.relu(x)
    x = self.fc6(x)
    return x

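Below we introduce the two methods, EWC and MAS, in turn.

## EWC

### Elastic Weight Consolidation

#### Concept
The lecture video has already covered the core idea of this method, so we go straight to the details.

The goal is to learn two tasks in sequence, task A and then task B.

Under EWC, the loss function is defined as
 $$\mathcal{L}_B = \mathcal{L}(\theta) + \sum_{i} \frac{\lambda}{2} F_i (\theta_{i} - \theta_{A,i}^{*})^2  $$

First, the variables in this loss function: $\mathcal{L}_B$ is the loss for task B. It equals the usual loss function $\mathcal{L}(\theta)$ (the cross entropy loss for a classification problem) plus a regularization term.

The regularization term has two parts. The first is $F_i$, which is the core of the method. The second is $(\theta_{i} - \theta_{A,i}^{*})^2$, where $\theta_{A,i}^{*}$ is the value of the i-th parameter of the model saved after training on task A and $\theta_i$ is the current value of the i-th parameter. Note that in regularization-based methods the model architecture is fixed: the current model and the model saved after task A share the same architecture and differ only in their parameter values. The following explains how $F_i$ is computed in practice.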
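As a minimal numerical sketch of the regularization term above, the penalty can be evaluated with NumPy instead of Paddle. The values of `theta_A`, `fisher` and `lam` are made up for illustration and are not taken from the notebook.

```python
import numpy as np

def ewc_penalty(theta, theta_A, fisher, lam):
    """Return sum_i (lam / 2) * F_i * (theta_i - theta_A_i)^2."""
    theta, theta_A, fisher = map(np.asarray, (theta, theta_A, fisher))
    return float(np.sum(0.5 * lam * fisher * (theta - theta_A) ** 2))

# Hypothetical values: parameter 0 matters a lot for task A, parameter 1 barely.
theta_A = np.array([1.0, -2.0])
fisher = np.array([10.0, 0.1])
lam = 2.0

# Move each parameter by the same amount (0.5) and compare the cost.
move_important = ewc_penalty([1.5, -2.0], theta_A, fisher, lam)
move_unimportant = ewc_penalty([1.0, -1.5], theta_A, fisher, lam)
print(move_important, move_unimportant)
```

Moving the parameter with the large $F_i$ costs a hundred times more than moving the other one by the same amount, which is how the term steers training on task B away from weights that task A relies on.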

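In the lecture video the example model has only two parameters. What if the model is a neural network with many more parameters?

In the lecture's terms, $F_i$ is the guard of the i-th parameter: if the parameter is important for task A, $F_i$ is large and the parameter should be changed as little as possible.

In practice it is computed from the following expression

$$ F = \mathbb{E}\left[ \nabla \log p(y_n | x_n, \theta_{A}^{*}) \, \nabla \log p(y_n | x_n, \theta_{A}^{*})^T \right] $$ 

Only the diagonal of $F$ is used to approximate each parameter's $F_i$.

$p(y_n | x_n, \theta_{A}^{*})$ is the posterior probability of $y_n$ (the label of $x_n$) given data $x_n$ from the previous task and the model parameters $\theta_A^*$ saved after training on task A.
To summarize: take the log of $p(y_n | x_n, \theta_{A}^{*})$, take its gradient, and square it, i.e. ( parameter.grad )^2.

For every parameter, $F_i$ can be obtained by calling backward in paddle and then reading that parameter's gradient.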
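The recipe above (log-probability, gradient, square, average) can be sketched for a tiny linear softmax classifier in NumPy, where the gradient has a closed form. This is an illustrative assumption rather than the notebook's Paddle implementation; `W`, `xs` and `ys` are random placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def diagonal_fisher(W, xs, ys):
    """Average squared gradient of log p(y | x, W) for a linear softmax model."""
    fisher = np.zeros_like(W)
    for x, y in zip(xs, ys):
        p = softmax(W @ x)
        onehot = np.zeros_like(p)
        onehot[y] = 1.0
        grad = np.outer(onehot - p, x)   # d log p(y | x) / dW
        fisher += grad ** 2              # keep only the diagonal of grad grad^T
    return fisher / len(xs)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))       # hypothetical theta_A^*: 3 classes, 4 features
xs = rng.normal(size=(32, 4))
xs[:, 3] = 0.0                    # feature 3 is always zero ...
ys = rng.integers(0, 3, size=32)

F_diag = diagonal_fisher(W, xs, ys)
print(F_diag.shape)               # one importance value per weight
print(F_diag[:, 3])               # ... so the weights attached to it get zero importance
```

Weights that never influence the prediction on task A data receive $F_i = 0$ and are left free to change when training on task B.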

The theory behind $F$ runs deep: it comes from the Fisher information matrix. A short derivation of how the Fisher information matrix is approximated by this term in lifelong learning can be found in Sections 2.4 and 2.4.1 of [Continual Learning in Neural Networks](https://arxiv.org/pdf/1910.02718.pdf).

For Your Information: [Elastic Weight Consolidation](https://arxiv.org/pdf/1612.00796.pdf)






In [25]:
class EWC(object):
  """
    @article{kirkpatrick2017overcoming,
        title={Overcoming catastrophic forgetting in neural networks},
        author={Kirkpatrick, James and Pascanu, Razvan and Rabinowitz, Neil and Veness, Joel and Desjardins, Guillaume and Rusu, Andrei A and Milan, Kieran and Quan, John and Ramalho, Tiago and Grabska-Barwinska, Agnieszka and others},
        journal={Proceedings of the national academy of sciences},
        year={2017},
        url={https://arxiv.org/abs/1612.00796}
    }
  """

  def __init__(self, model: nn.Layer, dataloaders: list, device):

      self.model = model
      self.dataloaders = dataloaders
      self.device = device

      self.params = {n: p for n, p in self.model.named_parameters() if not p.stop_gradient}  # collect all trainable parameters of the model
      self._means = {}  # snapshot of the parameter values after the previous task
      self._precision_matrices = self._calculate_importance()  # compute the EWC Fisher (F) matrix

      for n, p in self.params.items():
          self._means[n] = p.clone().detach()  # store the current values (theta_A^*) as the anchor used by the penalty

  def _calculate_importance(self):
      print('Computing EWC')

      precision_matrices = {}
      for n, p in self.params.items():  # initialize the Fisher (F) matrix (all zeros)
          t_val = p.clone().detach()
          print(type(t_val))
          t_val.set_value(np.zeros(shape=t_val.numpy().shape, dtype=np.float32))
          print(t_val)
          precision_matrices[n] = t_val.numpy()

      self.model.eval()
      dataloader_num = len(self.dataloaders)
      number_data = sum([len(loader) for loader in self.dataloaders])
      for dataloader in self.dataloaders:
          for data in dataloader:
              self.model.clear_gradients()
              input = data[0]
              output = self.model(input)
              label = np.argmax(output.numpy(), axis=-1)
              label = paddle.to_tensor(label)

              ############################################################################
              #####                  Compute the EWC Fisher (F) matrix                 #####
              ############################################################################
              loss = F.nll_loss(F.log_softmax(output, axis=1), label)
              loss.backward()
              for n, p in self.model.named_parameters():
                  precision_matrices[n] += np.power(p.grad, 2) / number_data

      precision_matrices = {n: p for n, p in precision_matrices.items()}
      return precision_matrices

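  def penalty(self, model: nn.Layer):
      loss = 0
      for n, p in model.named_parameters():
          # Stay in paddle tensors: converting the difference to numpy would detach
          # the penalty from the graph, so it could never influence the gradients.
          importance = paddle.to_tensor(self._precision_matrices[n])
          _loss = importance * (p - self._means[n]) ** 2
          loss += _loss.sum()
      return loss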

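## MAS

### Memory Aware Synapses
Concept:
In the lecture video this method is grouped with EWC; only the way the importance weight is computed differs. The following explains how to implement it.

MAS:
In MAS, when learning a sequence of tasks, task A and then task B, the loss function is defined as:

$$\mathcal{L}_B = \mathcal{L}(\theta) + \sum_{i} \frac{\lambda}{2} \Omega_i (\theta_{i} - \theta_{A,i}^{*})^2$$

Unlike EWC, $F_i$ in the formula is replaced by $\Omega_i$, which is given by:

$$\Omega_i = || \frac{\partial \ell_2^2(M(x_k; \theta))}{\partial \theta_i} || $$ 

$x_k$ is a data sample from the previous task. In words: take the l2 norm of the model's final output vector (last layer), square it, differentiate it with respect to each weight (take the gradient), and take the absolute value of that gradient. The paper also describes applying the l2 norm to the output vector of each layer (the local version); only the global version is implemented here.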
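A small NumPy sketch of the $\Omega_i$ computation for a single linear layer $M(x) = Wx$ may help make the formula concrete. The layer, weights and inputs are hypothetical; the gradient of $\|Wx\|_2^2$ with respect to $W$ is $2(Wx)x^T$, and no labels are involved.

```python
import numpy as np

def mas_importance(W, xs):
    """Average |d ||W x||_2^2 / dW| over the samples; labels are not needed."""
    omega = np.zeros_like(W)
    for x in xs:
        out = W @ x
        grad = 2.0 * np.outer(out, x)   # gradient of sum(out ** 2) w.r.t. W
        omega += np.abs(grad)
    return omega / len(xs)

W = np.array([[1.0, 0.0],
              [0.0, 0.0]])              # hypothetical weights; the second output is always zero
xs = np.array([[1.0, 2.0],
               [3.0, -1.0]])
omega = mas_importance(W, xs)
print(omega)
```

Because the importance depends only on the model output, MAS can be computed on unlabeled data, whereas the EWC Fisher term is defined through $p(y_n | x_n, \theta_A^*)$ and therefore needs a label for each sample.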


For Your Information: 
[Memory Aware Synapses](https://arxiv.org/pdf/1711.09601.pdf)
 






In [9]:
import paddle
import paddle.nn as nn
import numpy as np


class MAS(object):
    """
    @article{aljundi2017memory,
      title={Memory Aware Synapses: Learning what (not) to forget},
      author={Aljundi, Rahaf and Babiloni, Francesca and Elhoseiny, Mohamed and Rohrbach, Marcus and Tuytelaars, Tinne},
      booktitle={ECCV},
      year={2018},
      url={https://eccv2018.org/openaccess/content_ECCV_2018/papers/Rahaf_Aljundi_Memory_Aware_Synapses_ECCV_2018_paper.pdf}
    }
    """

    def __init__(self, model: nn.Layer, dataloaders: list, device):
        self.model = model
        self.dataloaders = dataloaders
        self.params = {n: p for n, p in self.model.named_parameters() if not p.stop_gradient}  # collect all trainable parameters of the model
        self._means = {}  # snapshot of the parameter values after the previous task
        self.device = device
        self._precision_matrices = self.calculate_importance()  # compute the MAS Omega(Ω) matrix

        for n, p in self.params.items():
            self._means[n] = p.clone().detach()

    def calculate_importance(self):
        print('Computing MAS')

        precision_matrices = {}
        for n, p in self.params.items():
            # initialize the Omega(Ω) matrix (all zeros)
            t_val = p.clone().detach()
            t_val.set_value(np.zeros(shape=t_val.numpy().shape, dtype=np.float32))
            precision_matrices[n] = t_val.numpy()

        self.model.eval()
        dataloader_num = len(self.dataloaders)
        num_data = sum([len(loader) for loader in self.dataloaders])
        for dataloader in self.dataloaders:
            for data in dataloader:
                self.model.clear_gradients()
                # output = self.model(data[0].to(self.device))
                output = self.model(data[0])

                #######################################################################################
                #####  Compute the MAS Omega(Ω) matrix (squared l2 norm of the output vector, then gradient)  ####
                #######################################################################################
                output = paddle.pow(output, 2)
                #output.pow_(2)
                loss = paddle.sum(output, axis=1)
                loss = loss.mean()
                loss.backward()

                for n, p in self.model.named_parameters():
                    precision_matrices[n] += abs(p.grad) / num_data

        precision_matrices = {n: p for n, p in precision_matrices.items()}
        return precision_matrices

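    def penalty(self, model: nn.Layer):
        loss = 0
        for n, p in model.named_parameters():
            # Stay in paddle tensors: converting the difference to numpy would detach
            # the penalty from the graph, so it could never influence the gradients.
            importance = paddle.to_tensor(self._precision_matrices[n])
            _loss = importance * (p - self._means[n]) ** 2
            loss += _loss.sum()
        return loss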


# Data

## Data preprocessing
- Convert MNIST ($1*28*28$) to ($3*32*32$)
- Cifar10 images are already ($3*32*32$)
- Normalize the images
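The preprocessing steps listed above can be sketched in plain NumPy as follows. The helper name `to_3x32x32` and the fill value are illustrative assumptions; the notebook's own `Pad` and `Convert2RGB` classes below work on Paddle tensors and pad with a different constant.

```python
import numpy as np

def to_3x32x32(img, size=32, channels=3, fill=0.0):
    """Pad a (1, H, W) image in [0, 1] to (1, size, size), repeat it to
    `channels` channels, and normalize to [-1, 1] with mean 0.5 and std 0.5."""
    _, h, w = img.shape
    top = (size - h) // 2
    left = (size - w) // 2
    padded = np.pad(img,
                    ((0, 0), (top, size - h - top), (left, size - w - left)),
                    mode="constant", constant_values=fill)
    rgb = np.repeat(padded, channels, axis=0)   # copy the gray channel three times
    return (rgb - 0.5) / 0.5

fake_digit = np.random.rand(1, 28, 28).astype(np.float32)  # stand-in for an MNIST image
out = to_3x32x32(fake_digit)
print(out.shape)  # (3, 32, 32)
```

After this step an MNIST image has the same shape as a Cifar10 image, so both tasks can share the `3*32*32` input layer of the model.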

In [10]:
import paddle
import numpy as np
from paddle.vision import transforms
import paddle.nn.functional as F
from paddle.vision.transforms import functional


class ResizeImg(object):

    def __init__(self, size):
        self.size = size

    def __call__(self, img):
        img_size_0 = img.shape[0]
        img_size_1 = img.shape[1]
        if (self.size - img_size_0) % 2 == 1:
            img_size_0 += 1
        if (self.size - img_size_1) % 2 == 1:
            img_size_1 += 1
        img = functional.resize(img, (img_size_0, img_size_1))
        idx_tuple = np.where(img == 0)
        if idx_tuple:
            for i in range(len(idx_tuple[0])):
                img[idx_tuple[0][i]][idx_tuple[1][i]][idx_tuple[2][i]] = 1
        print("shape of img:{} ".format(img.shape))
        return img


class Convert2RGB(object):

    def __init__(self, num_channel):
        self.num_channel = num_channel

    def __call__(self, img):
        # If the channel of img is not equal to desired size,
        # then expand the channel of img to desired size.
        img_channel = img.shape[0]
        img = paddle.concat([img] * (self.num_channel - img_channel + 1), 0)
        return img


class Pad(object):

    def __init__(self, size, fill=0, padding_mode='constant'):
        self.size = size
        self.fill = fill
        self.padding_mode = padding_mode

    def __call__(self, img):
        # If the H and W of img is not equal to desired size,
        # then pad the channel of img to desired size.'
        img = img.reshape((img.shape[1], img.shape[0], img.shape[2]))
        img_size_h = img.shape[1]
        img_size_w = img.shape[2]
        left_pad = (self.size - img_size_w) // 2
        right_pad = self.size - left_pad - img_size_w
        up_pad = (self.size - img_size_h) // 2
        low_pad = self.size - up_pad - img_size_h
        img = img.reshape((1, img.shape[0], img.shape[1], img.shape[2]))
        pad_rs = F.pad(img, [left_pad, right_pad, up_pad, low_pad], value=1, mode='constant', data_format="NCHW")
        pad_rs = pad_rs.reshape((pad_rs.shape[-3], pad_rs.shape[-2], pad_rs.shape[-1]))
        return pad_rs


def get_transform():
    transform = transforms.Compose([  # ResizeImg(32),
        transforms.ToTensor(),
        Pad(32),
        Convert2RGB(3),
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])])
    return transform


## Prepare the datasets
- MNIST   : image size $28*28*1$, grayscale, 10 classes
- Cifar10 : image size $32*32*3$, RGB, 10 classes


In [11]:
import os
from paddle.dataset import common
from paddle.vision import datasets


class Data():

    def __init__(self):
        transform = get_transform()

        print('download training data and load training data of mnist dataset ! ')
        self.MNIST_train_dataset = datasets.MNIST(mode='train',
                                                  transform=transform)
        # self.MNIST_test_dataset = datasets.MNIST(mode='test',
        #                                          transform=transform)
        print('load mnist dataset finished ! ')

        # If you need three datasets you can use the Flowers dataset, but the number of classes in the network must then be changed
        #print('download training data and load training data of flowers dataset ! ')
        #self.Flowers_train_dataset = datasets.Flowers(mode='train',
        #                                             transform=transform)
        # self.Flowers_test_dataset = datasets.MNIST(mode='test',
        #                                            transform=transform)
        #print('load flowers dataset finished ! ')

        print('download training data and load training data of cifar10 dataset ! ')
        self.Cifar10_train_dataset = datasets.Cifar10(mode='train',
                                                      transform=transform)
        # self.Cifar10_test_dataset = datasets.Cifar10(mode='test',
        #                                              transform=transform)
        print('load cifar10 dataset finished ! ')

        self.path = common.DATA_HOME

    def get_datasets(self):
        #a = [(self.MNIST_train_dataset, "MNIST"), (self.Cifar10_train_dataset, "Cifar10"),
        #    (self.Cifar10_train_dataset, "Cifar10")]
        a = [(self.MNIST_train_dataset, "MNIST"), (self.Cifar10_train_dataset, "Cifar10")]
        return a



## Build the Dataloader
- *.train_loader: fetches the training split, used for training
- *.val_loader: fetches the validation split, used for evaluation
In [12]:
import numpy as np
from paddle.io import DataLoader, BatchSampler
import paddle.fluid.dataloader.sampler as sampler


class Dataloader():

    def __init__(self, dataset, batch_size, split_ratio=0.1):
        self.dataset = dataset[0]
        self.name = dataset[1]
        train_sampler, val_sampler = self.split_dataset(split_ratio)

        self.train_dataset_size = len(train_sampler)
        self.val_dataset_size = len(val_sampler)

        bs_train = BatchSampler(sampler=train_sampler,
                                shuffle=False,
                                batch_size=batch_size,
                                drop_last=True)
        bs_val = BatchSampler(sampler=val_sampler,
                              shuffle=False,
                              batch_size=batch_size,
                              drop_last=True)
        self.train_loader = DataLoader(self.dataset, batch_sampler=bs_train)
        self.val_loader = DataLoader(self.dataset, batch_sampler=bs_val)
        # print("number of labels: {}".format(len(set(self.dataset.labels))))
        self.train_iter = self.infinite_iter()

    def split_dataset(self, split_ratio):
        data_size = len(self.dataset)
        split = int(data_size * split_ratio)
        indices = list(range(data_size))
        np.random.shuffle(indices)
        train_idx, valid_idx = indices[split:], indices[:split]

        train_sampler = sampler.RandomSampler(train_idx)
        val_sampler = sampler.RandomSampler(valid_idx)
        return train_sampler, val_sampler

    def infinite_iter(self):
        it = iter(self.train_loader)
        while True:
            try:
                ret = next(it)
                yield ret
            except StopIteration:
                it = iter(self.train_loader)


# Utilities

## Save the model

In [13]:
def save_model(model, optimizer, store_model_path):
    # save model and optimizer
    paddle.save(model.state_dict(), f'{store_model_path}.pdparams')
    paddle.save(optimizer.state_dict(), f'{store_model_path}.pdopt')
    return

## Load the model

In [14]:
def load_model(model, optimizer, load_model_path):
    # load model and optimizer
    print(f'Load model from {load_model_path}')
    model_state_dict = paddle.load(f'{load_model_path}.pdparams')
    opt_state_dict = paddle.load(f'{load_model_path}.pdopt')
    model.set_state_dict(model_state_dict)
    optimizer.set_state_dict(opt_state_dict)
    return model, optimizer

## Build the model & optimizer

In [15]:
def build_model(data_path, batch_size, learning_rate):
  # create model
    model = Model()
    optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=learning_rate)
    data = Data()
    datasets = data.get_datasets()
    tasks = []
    for dataset in datasets:
        tasks.append(Dataloader(dataset, batch_size))

    return model, optimizer, tasks

# Training

## Normal training ( baseline )

In [16]:
def normal_train(model, optimizer, task, total_epochs, summary_epochs):
    model.train()
    model.clear_gradients()
    ceriation = nn.CrossEntropyLoss()
    losses = []
    loss = 0.0
    for epoch in range(summary_epochs):
        imgs, labels = next(task.train_iter)
        outputs = model(imgs)
        ce_loss = ceriation(outputs, labels)

        optimizer.clear_gradients()
        ce_loss.backward()
        optimizer.step()

        loss += ce_loss.numpy()[0]
        print("loss:{} of epoch:{} ".format(ce_loss.numpy()[0], epoch))
        if (epoch + 1) % 50 == 0:
            loss = loss / 50
            print("\r", "train task {} [{}] loss: {:.3f}      \n".format(task.name, (total_epochs + epoch + 1), loss),
                  end=" ")
            losses.append(loss)
            loss = 0.0

    return model, optimizer, losses

## EWC training

In [18]:
def ewc_train(model, optimizer, task, total_epochs, summary_epochs, ewc, lambda_ewc):
    model.train()
    model.clear_gradients()
    ceriation = nn.CrossEntropyLoss()
    losses = []
    loss = 0.0
    for epoch in range(summary_epochs):
        imgs, labels = next(task.train_iter)
        outputs = model(imgs)
        ce_loss = ceriation(outputs, labels)
        total_loss = ce_loss
        ewc_loss = ewc.penalty(model)
        total_loss += lambda_ewc * ewc_loss

        optimizer.clear_gradients()
        total_loss.backward()
        optimizer.step()

        loss += total_loss.numpy()[0]
        print("loss:{} of epoch:{} ".format(ce_loss.numpy()[0], epoch))
        if (epoch + 1) % 50 == 0:
            loss = loss / 50
            print("\r", "train task {} [{}] loss: {:.3f}      \n".format(task.name, (total_epochs + epoch + 1), loss),
                  end=" ")
            losses.append(loss)
            loss = 0.0

    return model, optimizer, losses

## MAS training

In [19]:
def mas_train(model, optimizer, task, total_epochs, summary_epochs, mas_tasks, lambda_mas, alpha=0.8):
    model.train()
    model.clear_gradients()
    ceriation = nn.CrossEntropyLoss()
    losses = []
    loss = 0.0
    for epoch in range(summary_epochs):
        imgs, labels = next(task.train_iter)
        # imgs, labels = imgs.to(device), labels.to(device)
        outputs = model(imgs)
        ce_loss = ceriation(outputs, labels)
        total_loss = ce_loss
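        # Weight the most recent MAS object by alpha and the one before it by 1 - alpha.
        # A reversed copy is used because mas_tasks.reverse() would flip the caller's
        # list in place on every iteration.
        recent_first = mas_tasks[::-1]
        if len(recent_first) > 1:
            preprevious = 1 - alpha
            scalars = [alpha, preprevious]
            for mas, scalar in zip(recent_first[:2], scalars):
                mas_loss = mas.penalty(model)
                total_loss += lambda_mas * mas_loss * scalar
        elif len(recent_first) == 1:
            mas_loss = recent_first[0].penalty(model)
            total_loss += lambda_mas * mas_loss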

        optimizer.clear_gradients()
        total_loss.backward()
        optimizer.step()
        loss += total_loss.numpy()[0]
        print("loss:{} of epoch:{} ".format(ce_loss.numpy()[0], epoch))
        if (epoch + 1) % 50 == 0:
            loss = loss / 50
            print("\r", "train task {} [{}] loss: {:.3f}      \n".format(task.name, (total_epochs + epoch + 1), loss),
                  end=" ")
            losses.append(loss)
            loss = 0.0

    return model, optimizer, losses

## Validation

In [20]:
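def val(model, task):
    model.eval()
    correct_cnt = 0
    total_cnt = 0
    for imgs, labels in task.val_loader:
        outputs = model(imgs)
        pred_label = np.argmax(outputs.numpy(), axis=-1)
        # Flatten the labels so the comparison stays element-wise even if they arrive as [N, 1]
        true_label = labels.numpy().reshape(-1)

        correct_cnt += (pred_label == true_label).sum()
        total_cnt += true_label.shape[0]

    # Divide by the number of samples actually evaluated (val_loader drops the last incomplete batch)
    return correct_cnt / max(total_cnt, 1)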

## Main training procedure


In [21]:
def train_process(model, optimizer, tasks, config):
    task_loss, acc = {}, {}
    for task_id, task in enumerate(tasks):
        print('\n')
        total_epochs = 0
        task_loss[task.name] = []
        acc[task.name] = []
        if config.mode == 'basic' or task_id == 0:
            while (total_epochs < config.num_epochs):
                model, optimizer, losses = normal_train(model, optimizer, task, total_epochs,
                                                                    config.summary_epochs)
                task_loss[task.name] += losses

                for subtask in range(task_id + 1):
                    acc[tasks[subtask].name].append(val(model, tasks[subtask]))

                total_epochs += config.summary_epochs
                if total_epochs % config.store_epochs == 0 or total_epochs >= config.num_epochs:
                    save_model(model, optimizer, config.store_model_path)

        if config.mode == 'ewc' and task_id > 0:
            old_dataloaders = []
            for old_task in range(task_id):
                old_dataloaders += [tasks[old_task].val_loader]
            ewc = EWC(model, old_dataloaders, device)
            while (total_epochs < config.num_epochs):
                model, optimizer, losses = ewc_train(model, optimizer, task, total_epochs,
                                                                 config.summary_epochs, ewc,
                                                                 config.lifelong_coeff)
                task_loss[task.name] += losses

                for subtask in range(task_id + 1):
                    acc[tasks[subtask].name].append(val(model, tasks[subtask]))

                total_epochs += config.summary_epochs
                if total_epochs % config.store_epochs == 0 or total_epochs >= config.num_epochs:
                    save_model(model, optimizer, config.store_model_path)

        if config.mode == 'mas' and task_id > 0:
            old_dataloaders = []
            mas_tasks = []
            for old_task in range(task_id):
                old_dataloaders += [tasks[old_task].val_loader]
                mas = MAS(model, old_dataloaders, device)
                mas_tasks += [mas]
            while (total_epochs < config.num_epochs):
                model, optimizer, losses = mas_train(model, optimizer, task, total_epochs,
                                                                 config.summary_epochs,
                                                                 mas_tasks, config.lifelong_coeff)
                task_loss[task.name] += losses

                for subtask in range(task_id + 1):
                    acc[tasks[subtask].name].append(val(model, tasks[subtask]))

                total_epochs += config.summary_epochs
                if total_epochs % config.store_epochs == 0 or total_epochs >= config.num_epochs:
                    save_model(model, optimizer, config.store_model_path)

        if config.mode == 'scp' and task_id > 0:
            pass
            ########################################
            ##        TODO block ( PART 2 )       ##
            ########################################
            ##  PART 2 implementation goes here   ##
            ##  You may also implement another    ##
            ##  regularization method             ##
            ##  The TAs provide the SCP approach: ##
            ##  Sliced Cramer Preservation        ##
            ########################################
            ########################################
            ##        TODO block ( PART 2 )       ##
            ########################################
    return task_loss, acc

# Configuration

In [22]:
class configurations(object):
  def __init__(self):
    self.batch_size = 256
    self.num_epochs = 10000
    self.store_epochs = 250
    self.summary_epochs = 250
    self.learning_rate = 0.0005
    self.load_model = False
    self.store_model_path = "./model"
    self.load_model_path = "./model"
    self.data_path = "./data"
    self.mode = None
    self.lifelong_coeff = 0.5

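###### You may also set the parameters yourself   ########
###### but the values above are the defaults for this assignment #########

# Main program block
- Set the hyperparameter $\lambda$ for EWC and MAS
- Train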
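To build some intuition for what $\lambda$ does, consider a one-parameter toy problem (an assumption made purely for illustration): task B alone prefers $\theta = b$ through the loss $\frac{1}{2}(\theta - b)^2$, task A left the parameter at $\theta = a$ with importance $F$, and the combined loss has a closed-form minimizer.

```python
def best_theta(b, a, fisher, lam):
    """Minimizer of 0.5 * (theta - b)**2 + (lam / 2) * fisher * (theta - a)**2.

    Setting the derivative (theta - b) + lam * fisher * (theta - a) to zero
    gives theta = (b + lam * fisher * a) / (1 + lam * fisher).
    """
    return (b + lam * fisher * a) / (1.0 + lam * fisher)

a, b, fisher = 0.0, 1.0, 0.5   # hypothetical task-A optimum, task-B optimum, importance
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, round(best_theta(b, a, fisher, lam), 4))
```

With $\lambda = 0$ the solution is simply the task B optimum (the baseline, which is free to forget task A); as $\lambda$ grows the solution is pulled back toward the task A value, so a very large $\lambda$ protects task A at the cost of learning task B.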

In [26]:
"""
the order is mnist -> cifar10
==============================================

"""
if __name__ == '__main__':
    #mode_list = ['mas', 'ewc', 'basic']
    mode_list = ['ewc', 'mas', 'basic']

    ## hint: choose the lambda hyperparameter carefully / ewc: 80~400, mas: 0.1 - 10
    ############################################################################
    #####                         TODO block ( PART 1 )                     #####
    ############################################################################
    coeff_list = [0, 0, 0]  ## tune lambda here; the order follows mode_list: ewc, mas, basic (baseline = 0) ##
    ############################################################################
    #####                         TODO block ( PART 1 )                     #####
    ############################################################################

    config = configurations()
    count = 0
    for mode in mode_list:
        config.mode = mode
        config.lifelong_coeff = coeff_list[count]
        print("{} training".format(config.mode))
        model, optimizer, tasks = build_model(config.data_path, config.batch_size, config.learning_rate)
        print("Finish build model")
        if config.load_model:
            model, optimizer = load_model(model, optimizer, config.load_model_path)
        task_loss, acc = train_process(model, optimizer, tasks, config)
        with open(f'./{config.mode}_acc.txt', 'w') as f:
            json.dump(acc, f)
        count += 1

ewc training
download training data and load training data of mnist dataset ! 
load mnist dataset finished ! 
download training data and load training data of cifar10 dataset ! 
load cifar10 dataset finished ! 
Finish build model


loss:46.33511734008789 of epoch:0 
loss:33.42926025390625 of epoch:1 
loss:32.75161361694336 of epoch:2 
loss:27.345964431762695 of epoch:3 
loss:18.874662399291992 of epoch:4 
loss:12.846699714660645 of epoch:0 
loss:12.845219612121582 of epoch:1 
loss:10.548949241638184 of epoch:2 
loss:8.525252342224121 of epoch:3 
loss:6.8116655349731445 of epoch:4 


Computing EWC
<class 'paddle.VarBase'>
Tensor(shape=[3072, 1024], dtype=float32, place=CPUPlace, stop_gradient=True,
       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
<class 'paddle.VarBase'>
Tensor(shape=[1024],

# Plot the results

In [28]:
import json
import matplotlib.pyplot as plt


def plot_result(mode_list, task1, task2):
    # draw the lines
    count = 0
    for reg_name in mode_list:
        label = reg_name
        with open(f'./{reg_name}_acc.txt', 'r') as f:
            acc = json.load(f)
        if count == 0:
            color = 'red'
        elif count == 1:
            color = 'blue'
        else:
            color = 'purple'
        ax1 = plt.subplot(2, 1, 1)
        plt.plot(range(len(acc[task1])), acc[task1], color, label=label)
        ax1.set_ylabel(task1)
        ax2 = plt.subplot(2, 1, 2, sharex=ax1, sharey=ax1)
        plt.plot(range(len(acc[task2])), acc[task2], color, label=label)
        ax2.set_ylabel(task2)
        count += 1
    plt.ylim((0.02, 1.02))
    plt.legend()
    plt.show()
    return

mode_list = ['ewc', 'mas', 'basic']
plot_result(mode_list, 'MNIST', 'Cifar10')

A paper at ICLR 2020 uses these two methods as baselines, gives a geometric view of each of them, and proposes a new method. Those who are interested can refer to it.

Paper link: [Sliced Cramér Synaptic Consolidation for Preserving Deeply Learned Representations](https://openreview.net/pdf?id=BJge3TNKwH)


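# Advanced
Implement another regularization method; the TAs provide the SCP approach.

You may also consider implementing SI, Riemannian Walk, IMM, or one of the methods above.

You can follow the TAs' code above and write a similar class and training function to train with.

Remember to plot an evaluation chart like the one above (show result). If you need an example for comparison, refer to the slides provided by the TAs.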


In [29]:
def sample_spherical(npoints, ndim=3):
    vec = np.random.randn(ndim, npoints)
    vec /= np.linalg.norm(vec, axis=0)
    return vec

In [30]:
class SCP(object):
    """
    OPEN REVIEW VERSION:
    https://openreview.net/forum?id=BJge3TNKwH
    """
    def __init__(self, model: nn.Layer, dataloaders: list, L: int, device):
        self.model = model 
        self.dataloaders = dataloaders
        self.params = {n: p for n, p in self.model.named_parameters() if not p.stop_gradient}
        self._means = {}
        self.L= L
        self.device = device
        self._precision_matrices = self.calculate_importance()
    
        for n, p in self.params.items():
            self._means[n] = p.clone().detach()
    
    def calculate_importance(self):
        print('Computing SCP')

        precision_matrices = {}
        for n, p in self.params.items():  # initialize the Gamma (Γ) matrix (all zeros)
            t_val = p.clone().detach()
            t_val.set_value(np.zeros(shape=t_val.numpy().shape, dtype=np.float32))
            precision_matrices[n] = t_val.numpy()

        self.model.eval()
        dataloader_num = len(self.dataloaders)
        num_data = sum([len(loader) for loader in self.dataloaders])
        for dataloader in self.dataloaders:
            for data in dataloader:
                self.model.clear_gradients()
                output = self.model(data[0])

                ####################################################################################
                #####                           TODO block ( PART 2 )                          #####
                ####################################################################################
                ##### Compute the SCP Gamma(Γ) matrix (like the MAS Omega(Ω) matrix and the EWC Fisher(F) matrix) #####
                ####################################################################################
                #####   1. Average the output vectors over all data to get the mean vector φ(:,θ_A*)   #####
                ####################################################################################

                ####################################################################################
                #####   2. Randomly sample L vectors ξ from the unit sphere ( Hint: sample_spherical() )   #####
                ####################################################################################

                ####################################################################################
                #####   3. Take the inner product of each vector ξ with φ(:,θ_A*) to get a scalar ρ   #####
                #####      Call backward on ρ so that each parameter gets its own gradient ∇ρ          #####
                #####      Square each parameter's gradient ∇ρ and average over L to get its Γ value   #####
                #####      The Γ values of all parameters together form the Γ matrix                   #####
                ##### (hint: call clear_gradients after every backward, otherwise gradients accumulate) #####
                ####################################################################################

                ####################################################################################
                #####                           TODO block ( PART 2 )                          #####
                ####################################################################################

        precision_matrices = {n: p for n, p in precision_matrices.items()}
        return precision_matrices

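    def penalty(self, model: nn.Layer):
        loss = 0
        for n, p in model.named_parameters():
            # Stay in paddle tensors: converting the difference to numpy would detach
            # the penalty from the graph, so it could never influence the gradients.
            importance = paddle.to_tensor(self._precision_matrices[n])
            _loss = importance * (p - self._means[n]) ** 2
            loss += _loss.sum()
        return loss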
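The three steps described in the TODO comments of `calculate_importance` can be tried out on a toy linear model $\phi(x) = Wx$ in NumPy, where the gradient of $\rho = \xi \cdot \phi$ with respect to $W$ is simply $\xi \bar{x}^T$. This is only a sketch under that simplifying assumption, not the Paddle solution for the assignment. The helper mirrors the notebook's `sample_spherical` with an added, hypothetical `rng` argument for reproducibility.

```python
import numpy as np

def sample_spherical(npoints, ndim=3, rng=None):
    """Draw `npoints` random unit vectors of dimension `ndim` (one per column)."""
    rng = np.random.default_rng() if rng is None else rng
    vec = rng.standard_normal((ndim, npoints))
    vec /= np.linalg.norm(vec, axis=0)
    return vec

def scp_importance(W, xs, L=100, rng=None):
    """Gamma for the linear model phi(x) = W x, following steps 1 to 3."""
    x_mean = xs.mean(axis=0)                  # step 1: the mean output is W @ x_mean
    xis = sample_spherical(L, ndim=W.shape[0], rng=rng)   # step 2: L unit vectors
    gamma = np.zeros_like(W)
    for l in range(L):
        xi = xis[:, l]
        grad = np.outer(xi, x_mean)           # step 3: gradient of rho = xi . (W @ x_mean)
        gamma += grad ** 2
    return gamma / L

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 5))                  # hypothetical weights: 10 outputs, 5 inputs
xs = rng.normal(size=(64, 5))
gamma = scp_importance(W, xs, L=50, rng=rng)
print(gamma.shape)
```

In the Paddle version each $\rho$ needs its own `backward` call followed by `clear_gradients`, as the hint in the comments points out, because the gradient no longer has a closed form for a deep network.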

In [31]:
def scp_train(model, optimizer, task, total_epochs, summary_epochs, scp_tasks, lambda_scp,alpha=0.65):
  losses = []
  loss = 0.0
  ###############################
  #####  TODO block (PART 2) #####
  ###############################
  ##  Follow the MAS / EWC train functions  ##
  ###############################
  #####  TODO block (PART 2) #####
  ###############################
  return model, optimizer, losses

In [32]:
# if __name__ == "__main__": 
#   pass 
###############################
#####  TODO block (PART 2) #####
###############################
##  Combine the new method with code   ##
##  like the main block above          ##
###############################
#####  TODO block (PART 2) #####
###############################