# Code for Model Compression: Pruning and Distillation

                                                        Name: 岳天驰
                                                        Advisor: 张绍武
                                                        - April 17, 2020


# [Overview of BERT Model Compression](http://mitchgordon.me/machine/learning/2019/11/18/all-the-ways-to-compress-BERT.html)

![title](img/f0.png)
![title](img/f1.png)

# 1: Pruning

In [1]:
%%html
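<h3>Unstructured pruning: considers every weight individually and removes the unimportant parameters; also called sparse pruning</h3>
<img src='img/f2.png' width=400>
<h3>Structured pruning: removes whole structures at once, such as entire neurons</h3>
<img src='img/f3.png' width=400>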


## Pruning Process
- One-shot pruning:
    - train -> evaluate -> prune -> finetune -> stop
- Iterative pruning:
    - train -> evaluate -> prune -> finetune -> if continuing, return to step 2 (evaluate); otherwise stop
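
The two schedules differ only in whether the evaluate/prune/finetune steps repeat. A minimal sketch, assuming the four stages are supplied as plain callables; the function names and the `good_enough` and `max_rounds` parameters are hypothetical and only serve the illustration:

```python
def one_shot_pruning(train, evaluate, prune, finetune):
    """train -> evaluate -> prune -> finetune -> stop."""
    train()
    evaluate()
    prune()
    finetune()


def iterative_pruning(train, evaluate, prune, finetune, good_enough, max_rounds=10):
    """train once, then repeat (evaluate -> prune -> finetune) until good_enough()."""
    train()
    for round_idx in range(max_rounds):
        evaluate()
        prune()
        finetune()
        if good_enough():
            return round_idx + 1  # number of pruning rounds performed
    return max_rounds


# Toy usage: record the order in which the stages run.
trace = []
stages = {name: (lambda n=name: trace.append(n))
          for name in ("train", "evaluate", "prune", "finetune")}

one_shot_pruning(**stages)
one_shot_trace = list(trace)

trace.clear()
# Hypothetical stopping rule: stop after two pruning rounds.
rounds = iterative_pruning(**stages, good_enough=lambda: trace.count("prune") >= 2)
iterative_trace = list(trace)
print(one_shot_trace, iterative_trace, rounds, sep="\n")
```

In practice the stopping rule would more likely be a target sparsity or an accuracy budget than a fixed round count.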
    
    
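## [Rethinking the Value of Network Pruning](https://arxiv.org/pdf/1810.05270.pdf) ICLR 2019
- https://github.com/Eric-mingjie/rethinking-network-pruning
- Conjectures:
    - For a predefined target architecture (left figure), the small model can be trained directly from random initialization.
    - For a target architecture that is not predefined (right figure), training the pruned model from random initialization can match the result of fine-tuning it.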
![title](img/f9.png)

## Base Model: VGG
![title](img/f4.png)

In [None]:
import torch.nn as nn
import numpy as np
import torch
defaultcfg = {
    11 : [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512],
    13 : [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512],
    16 : [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512],
    19 : [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512],
}

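class vgg(nn.Module):
    def __init__(self, depth=19, init_weights=True, cfg=None):
        # Note: `init_weights` is accepted but not used in this excerpt.
        super(vgg, self).__init__()
        if cfg is None:
            cfg = defaultcfg[depth]

        # Output channels of each conv layer; 'M' marks a max-pooling layer.
        self.cfg = cfg

        # Convolutional feature extractor, built with batch normalization.
        self.feature = self.make_layers(cfg, True)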
        self.classifier = nn.Sequential(
              nn.Linear(cfg[-1], 512),
              nn.BatchNorm1d(512),
              nn.ReLU(inplace=True),
              nn.Linear(512, 10)
            )

    def make_layers(self, cfg, batch_norm=False):
        layers = []
        in_channels = 3
        for v in cfg:
            if v == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1, bias=False)
                if batch_norm:
                    layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
                else:
                    layers += [conv2d, nn.ReLU(inplace=True)]
                in_channels = v
        return nn.Sequential(*layers)

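    def forward(self, x):
        # Assumes 32x32 inputs (e.g. CIFAR-10): the four max-pools leave a 2x2 map,
        # which the average pool reduces to 1x1, so the flattened size equals cfg[-1].
        x = self.feature(x)
        x = nn.AvgPool2d(2)(x)
        x = x.view(x.size(0), -1)
        y = self.classifier(x)
        return y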


base_model = vgg(cfg=defaultcfg[19])

print(base_model)

## [Pruning Filters for Efficient ConvNets](https://arxiv.org/pdf/1608.08710.pdf) ICLR 2017 
- Structured pruning: in every convolutional layer, prune the filters with the smallest weights (smallest L1 norm).
![title](img/f5.png)

In [23]:

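# Example: a 3x3 2D convolution with in_channels=64, out_channels=128.
# The layer's weight has shape (128, 64, 3, 3). Sum the absolute weights of each (64, 3, 3) filter to get 128 numbers,
# sort them, prune the filters with the smallest sums, and keep the rest.
# For example, the weight can be pruned to (64, 64, 3, 3); the layer then has fewer outputs, so the next layer has fewer inputs.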
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=(3,3))
conv.weight.size()


torch.Size([128, 64, 3, 3])
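
The filter-selection step described in the comments above can be tried on a single layer. This is a minimal NumPy sketch rather than the paper's code: the random arrays stand in for `conv.weight` and for the weight of a hypothetical following layer, and `num_keep = 64` is chosen only to match the example shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for conv.weight: (out_channels, in_channels, kH, kW).
weight = rng.standard_normal((128, 64, 3, 3))

# One L1 norm per filter: sum |w| over the (in_channels, kH, kW) axes.
l1_norm = np.abs(weight).sum(axis=(1, 2, 3))      # shape (128,)

# Keep the 64 filters with the largest norms, in their original order.
num_keep = 64
keep_idx = np.sort(np.argsort(l1_norm)[-num_keep:])
pruned_weight = weight[keep_idx]                   # shape (64, 64, 3, 3)

# The next layer now receives 64 channels, so its input axis is sliced
# with the same indices (hypothetical next layer with 256 filters).
next_weight = rng.standard_normal((256, 128, 3, 3))
next_pruned = next_weight[:, keep_idx]             # shape (256, 64, 3, 3)

print(pruned_weight.shape, next_pruned.shape)
```

Slicing the following layer with the same indices is what keeps the two layers compatible after pruning.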

In [None]:
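# Key part of structured pruning (sketch adapted from the pseudocode)
# Assumption: `model` stands in for a trained VGG-16 built with the `vgg` class above.
model = vgg(depth=16)
# Target widths after pruning; 'M' marks a max-pooling layer.
cfg = [32, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 256, 256, 256, 'M', 256, 256, 256]

cfg_mask = []
layer_id = 0
# For each conv layer: sum the absolute weights of every filter, sort, and select filters with a mask.
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        out_channels = m.weight.data.shape[0]
        if out_channels == cfg[layer_id]:
            # Nothing to prune in this layer: keep every filter.
            cfg_mask.append(torch.ones(out_channels))
            layer_id += 1
            continue
        weight_copy = m.weight.data.abs().clone()
        weight_copy = weight_copy.cpu().numpy()
        # L1 norm of each filter (one value per output channel), then sort ascending.
        L1_norm = np.sum(weight_copy, axis=(1, 2, 3))
        arg_max = np.argsort(L1_norm)
        arg_max_rev = arg_max[::-1][:cfg[layer_id]]
        assert arg_max_rev.size == cfg[layer_id], "size of arg_max_rev not correct"
        mask = torch.zeros(out_channels)
        # The mask is 1 only for the filters with the largest norms.
        mask[arg_max_rev.tolist()] = 1
        cfg_mask.append(mask)
        layer_id += 1
    elif isinstance(m, nn.MaxPool2d):
        # Skip the 'M' entries so that layer_id stays aligned with cfg.
        layer_id += 1

newmodel = vgg(cfg=cfg)

# Walk the original and the new model in parallel and copy the kept weights across.
start_mask = torch.ones(3)
layer_id_in_cfg = 0
end_mask = cfg_mask[layer_id_in_cfg]
for [m0, m1] in zip(model.modules(), newmodel.modules()):
    if isinstance(m0, nn.BatchNorm2d):
        # Keep the BatchNorm parameters of the surviving channels, then advance to the next conv mask.
        idx1 = np.flatnonzero(end_mask.cpu().numpy())
        m1.weight.data = m0.weight.data[idx1.tolist()].clone()
        m1.bias.data = m0.bias.data[idx1.tolist()].clone()
        m1.running_mean = m0.running_mean[idx1.tolist()].clone()
        m1.running_var = m0.running_var[idx1.tolist()].clone()
        layer_id_in_cfg += 1
        start_mask = end_mask
        if layer_id_in_cfg < len(cfg_mask):
            end_mask = cfg_mask[layer_id_in_cfg]
    elif isinstance(m0, nn.Conv2d):
        idx0 = np.squeeze(np.argwhere(np.asarray(start_mask.cpu().numpy())))
        idx1 = np.squeeze(np.argwhere(np.asarray(end_mask.cpu().numpy())))
        print('In shape: {:d}, Out shape {:d}.'.format(idx0.size, idx1.size))
        if idx0.size == 1:
            idx0 = np.resize(idx0, (1,))
        if idx1.size == 1:
            idx1 = np.resize(idx1, (1,))
        # Copy the original weights into the new model, slicing both in_channels and out_channels.
        w1 = m0.weight.data[:, idx0.tolist(), :, :].clone()
        w1 = w1[idx1.tolist(), :, :, :].clone()
        m1.weight.data = w1.clone()
    elif isinstance(m0, nn.Linear) and m0.in_features != m1.in_features:
        # First classifier layer: its inputs are the channels kept by the last conv layer.
        idx0 = np.flatnonzero(start_mask.cpu().numpy())
        m1.weight.data = m0.weight.data[:, idx0.tolist()].clone()
        m1.bias.data = m0.bias.data.clone()
    elif isinstance(m0, (nn.Linear, nn.BatchNorm1d)):
        # Remaining classifier layers keep their shape and are copied unchanged.
        m1.load_state_dict(m0.state_dict())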

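## [The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks](https://arxiv.org/pdf/1803.03635.pdf)  ICLR 2019
- A randomly initialized large neural network contains a subnetwork that, with its initial weights and trained in isolation for at most the same number of iterations, can match the test accuracy of the original network.
- Treat all the parameters of a complex network as the prize pool; the subset of parameters that defines such a subnetwork is the winning ticket.
- Consider a dense feed-forward network f(x;θ) with initial parameters θ. Trained with stochastic gradient descent on a training set, f reaches loss l and accuracy a after j iterations.
- Now apply a 0/1 mask m to the parameters θ and train f(x;m⊙θ) on the same dataset; f reaches loss l' and accuracy a' after j' iterations.
- The lottery ticket hypothesis states that there exists an m such that j'<=j (training is no slower), a'>=a (accuracy is no lower), and ||m||_0 << |θ| (far fewer parameters).
![title](img/f9.jpg)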
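
The masked network f(x;m⊙θ) can be made concrete with a toy 2x2 weight matrix. A minimal NumPy sketch, assuming the mask is built from trained weight magnitudes and the surviving weights are reset to their initial values; the numbers and the `magnitude_mask` helper are made up for illustration, and the "trained" weights are hard-coded rather than produced by real training:

```python
import numpy as np

# theta at initialization (toy values).
theta_init = np.array([[0.3, -0.5],
                       [0.4, 0.1]])
# Pretend these are the same weights after j training iterations.
theta_trained = np.array([[0.1, -0.8],
                          [0.7, 0.2]])


def magnitude_mask(weights, threshold):
    """0/1 mask m that keeps the weights whose magnitude exceeds the threshold."""
    return (np.abs(weights) > threshold).astype(weights.dtype)


mask = magnitude_mask(theta_trained, threshold=0.5)   # [[0, 1], [1, 0]]

# Candidate winning ticket: surviving weights reset to their initial values, m ⊙ theta.
ticket_init = mask * theta_init                        # [[0, -0.5], [0.4, 0]]

# ||m||_0 versus |theta|: how many parameters survive.
kept, total = int(mask.sum()), mask.size
print(mask, ticket_init, f"{kept}/{total} weights kept", sep="\n")
```

Retraining `ticket_init` with the mask held fixed, and comparing j' and a' against the dense network, is the experiment the hypothesis refers to.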

In [27]:
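# Example: a 3x3 2D convolution with in_channels=64, out_channels=128.
# The layer's weight has shape (128, 64, 3, 3). Take the absolute value of every weight, sort them,
# choose a threshold, and prune the weights that fall below it.
"""
- weight [0.1,0.8,
          0.7,0.2]
- threshold 0.5
- mask [0,1,
       1,0]
- new_weight = mask*weight
"""

# Unstructured pruning snippet
# Assumptions: `model` stands in for a trained network (built here with the `vgg` class above), and
# `percent` plays the role of `args.percent` in the original script (fraction of conv weights to prune).
model = vgg(depth=19)
percent = 0.3

# pruning
total = 0
# First count all conv weights, then sort them and pick the threshold from the pruning ratio.
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        total += m.weight.data.numel()
conv_weights = torch.zeros(total)
index = 0
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        size = m.weight.data.numel()
        conv_weights[index:(index+size)] = m.weight.data.view(-1).abs().clone()
        index += size
y, i = torch.sort(conv_weights)
thre_index = int(total * percent)
thre = y[thre_index]
pruned = 0
print('Pruning threshold: {}'.format(thre))
zero_flag = False

# Visit every module and modify the conv weights in place: the mask has the same shape as the weight
# and is 0 wherever the weight's magnitude is not above the threshold.
for k, m in enumerate(model.modules()):
    if isinstance(m, nn.Conv2d):
        weight_copy = m.weight.data.abs().clone()
        # The mask lives on the same device as the weight, so no explicit .cuda() is needed.
        mask = weight_copy.gt(thre).float()
        remaining = int(torch.sum(mask))
        pruned = pruned + mask.numel() - remaining
        m.weight.data.mul_(mask)
        if remaining == 0:
            zero_flag = True
        print('layer index: {:d} \t total params: {:d} \t remaining params: {:d}'.
            format(k, mask.numel(), remaining))
print('Total conv params: {}, Pruned conv params: {}, Pruned ratio: {}'.format(total, pruned, pruned/total))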


----------------------------------------------------
# 2: Distillation

## [Patient Knowledge Distillation for BERT Model Compression](https://arxiv.org/pdf/1908.09355.pdf) - EMNLP 2019, Microsoft
- https://github.com/intersun/PKD-for-BERT-Model-Compression

## [TinyBERT](https://arxiv.org/pdf/1909.10351.pdf) - rejected at ICLR 2020, Huawei
- https://github.com/huawei-noah/Pretrained-Language-Model
![title](img/f8.png)

- TinyBERT's transformer distillation distills every k-th layer.
- For example, if the teacher BERT has 12 layers and the student BERT has 4, a transformer loss is computed every 3 layers.
- The mapping function is g(m) = 3 * m, where m is the index of the student encoder layer.
- Concretely, student transformer layer 1 maps to teacher layer 3, layer 2 to layer 6, layer 3 to layer 9, and layer 4 to layer 12.
- Each layer's transformer loss has two parts: attention-based distillation and hidden-states-based distillation.
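
The layer mapping and the two-part loss can be checked on random arrays. A minimal NumPy sketch, assuming the student and teacher share the same number of heads and the same hidden size (any projection needed when the sizes differ is left out); `layer_mapping`, `mse` and the toy tensor sizes are hypothetical names and values, not part of the TinyBERT code:

```python
import numpy as np


def layer_mapping(num_teacher_layers, num_student_layers):
    """g(m) = k * m with k = N / M: student layer m (1-based) learns from teacher layer g(m)."""
    assert num_teacher_layers % num_student_layers == 0
    k = num_teacher_layers // num_student_layers
    return {m: k * m for m in range(1, num_student_layers + 1)}


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))


mapping = layer_mapping(12, 4)          # {1: 3, 2: 6, 3: 9, 4: 12}

# Random stand-ins for per-layer outputs; index 0 is a placeholder so layers are 1-based.
rng = np.random.default_rng(0)
batch, heads, seq_len, hidden = 2, 2, 8, 16
teacher_atts = [None] + [rng.random((batch, heads, seq_len, seq_len)) for _ in range(12)]
teacher_reps = [None] + [rng.random((batch, seq_len, hidden)) for _ in range(12)]
student_atts = [None] + [rng.random((batch, heads, seq_len, seq_len)) for _ in range(4)]
student_reps = [None] + [rng.random((batch, seq_len, hidden)) for _ in range(4)]

# Transformer-layer loss = attention-based term + hidden-states-based term,
# summed over the student layers.
att_loss = sum(mse(student_atts[m], teacher_atts[t]) for m, t in mapping.items())
rep_loss = sum(mse(student_reps[m], teacher_reps[t]) for m, t in mapping.items())
loss = att_loss + rep_loss
print(mapping, loss)
```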


In [None]:
# Build student_model and teacher_model, match up their attention weights and hidden states, and compute the loss.
# `TinyBertForPreTraining` and `BertModel` are expected to come from the TinyBERT code linked above;
# `args`, `input_ids`, `segment_ids` and `input_mask` come from the surrounding training script.
loss_mse = torch.nn.MSELoss()
att_loss = 0.
rep_loss = 0.

student_model = TinyBertForPreTraining.from_scratch(args.student_model)
teacher_model = BertModel.from_pretrained(args.teacher_model)
student_atts, student_reps = student_model(input_ids, segment_ids, input_mask)
teacher_reps, teacher_atts, _ = teacher_model(input_ids, segment_ids, input_mask)
# Detach the teacher outputs so that no gradient flows into the teacher.
teacher_reps = [teacher_rep.detach() for teacher_rep in teacher_reps]  # speedup 1.5x
teacher_atts = [teacher_att.detach() for teacher_att in teacher_atts]

teacher_layer_num = len(teacher_atts)
student_layer_num = len(student_atts)
assert teacher_layer_num % student_layer_num == 0
layers_per_block = int(teacher_layer_num / student_layer_num)
# Pick the last teacher layer of each block (teacher layers 3, 6, 9, 12 for a 4-layer student).
new_teacher_atts = [teacher_atts[i * layers_per_block + layers_per_block - 1]
                    for i in range(student_layer_num)]

for student_att, teacher_att in zip(student_atts, new_teacher_atts):
    # Entries at or below -1e2 (e.g. masked positions) are set to zero before the comparison.
    student_att = torch.where(student_att <= -1e2, torch.zeros_like(student_att), student_att)
    teacher_att = torch.where(teacher_att <= -1e2, torch.zeros_like(teacher_att), teacher_att)
    att_loss += loss_mse(student_att, teacher_att)

# The hidden states include the embedding output, hence student_layer_num + 1 entries.
new_teacher_reps = [teacher_reps[i * layers_per_block] for i in range(student_layer_num + 1)]
new_student_reps = student_reps

for student_rep, teacher_rep in zip(new_student_reps, new_teacher_reps):
    rep_loss += loss_mse(student_rep, teacher_rep)

loss = att_loss + rep_loss

## [BERT-of-Theseus: Compressing BERT by Progressive Module Replacing](https://arxiv.org/pdf/2002.02925.pdf) - arXiv, February 2020, Microsoft
- https://github.com/JetRunner/BERT-of-Theseus
![title](img/f6.png)
- Pmodel is the original model and Smodel is the compressed model. To halve the number of layers, the original 12-layer bert-base is compressed to 6 layers.
- The i-th module of Smodel is scc_i, 0<=i<6; each module contains one transformer layer.
- The 12 layers of Pmodel are split into 6 modules of two transformer layers each, giving prd_i, 0<=i<6.
- scc_i and prd_i can then be put in one-to-one correspondence.
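
The correspondence between scc_i and prd_i is plain index arithmetic. A small sketch, where `module_mapping` is a hypothetical helper and the defaults (12 predecessor layers, 6 successor layers) follow the example above:

```python
def module_mapping(prd_n_layer=12, scc_n_layer=6):
    """Map successor module index i to the predecessor layer indices in prd_i."""
    assert prd_n_layer % scc_n_layer == 0
    compress_ratio = prd_n_layer // scc_n_layer
    return {i: [i * compress_ratio + offset for offset in range(compress_ratio)]
            for i in range(scc_n_layer)}


mapping = module_mapping()
for scc_i, prd_layers in mapping.items():
    print(f"scc_{scc_i} <-> prd_{scc_i} = predecessor layers {prd_layers}")
```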


In [None]:

from copy import deepcopy

import torch.nn as nn

# Excerpt from BertEncoder.__init__: predecessor layers (layer) and successor layers (scc_layer).
self.layer = nn.ModuleList([BertLayer(config) for _ in range(self.prd_n_layer)])
self.scc_layer = nn.ModuleList([BertLayer(config) for _ in range(self.scc_n_layer)])
# Only the way BertEncoder assembles its layers during training needs to change.
def forward(self, hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None,
            encoder_attention_mask=None):
    all_hidden_states = ()
    all_attentions = ()
    if self.training:
        inference_layers = []
        for i in range(self.scc_n_layer):
            # Randomly decide, according to the replacing rate, whether to replace this module.
            if self.bernoulli.sample() == 1:  # REPLACE
                inference_layers.append(self.scc_layer[i])
            else:  # KEEP the original
                for offset in range(self.compress_ratio):
                    inference_layers.append(self.layer[i * self.compress_ratio + offset])

    else:  # inference with compressed model
        inference_layers = self.scc_layer
    # (The excerpt ends here; the layers in inference_layers are then applied in order.)

# Initialization: successor layer ix starts as a copy of predecessor layer ix.
scc_n_layer = model.bert.encoder.scc_n_layer
model.bert.encoder.scc_layer = nn.ModuleList([deepcopy(model.bert.encoder.layer[ix]) for ix in range(scc_n_layer)])

# Replacement-probability schedule
class LinearReplacementScheduler:
    def __init__(self, bert_encoder: BertEncoder, base_replacing_rate, k):
        self.bert_encoder = bert_encoder
        self.base_replacing_rate = base_replacing_rate
        self.step_counter = 0
        self.k = k
        self.bert_encoder.set_replacing_rate(base_replacing_rate)

    def step(self):
        self.step_counter += 1
        current_replacing_rate = min(self.k * self.step_counter + self.base_replacing_rate, 1.0)
        self.bert_encoder.set_replacing_rate(current_replacing_rate)
        return current_replacing_rate
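
To see how the replacing rate and the module selection interact, the logic can be run without BERT at all. A minimal standard-library sketch: `ToyEncoder` and `linear_rate` are hypothetical stand-ins, layers are plain string labels, and `random.Random` plays the role of the Bernoulli sampler; the values `base_replacing_rate=0.3` and `k=0.1` are arbitrary.

```python
import random


class ToyEncoder:
    """Stand-in for the encoder: layers are plain labels instead of BertLayer modules."""

    def __init__(self, prd_layers, scc_layers, seed=0):
        assert len(prd_layers) % len(scc_layers) == 0
        self.prd_layers, self.scc_layers = prd_layers, scc_layers
        self.compress_ratio = len(prd_layers) // len(scc_layers)
        self.replacing_rate = 0.0
        self.training = True
        self.rng = random.Random(seed)

    def set_replacing_rate(self, rate):
        self.replacing_rate = rate

    def select_layers(self):
        if not self.training:            # inference: successor modules only
            return list(self.scc_layers)
        chosen = []
        for i, scc in enumerate(self.scc_layers):
            if self.rng.random() < self.replacing_rate:   # Bernoulli(replacing_rate)
                chosen.append(scc)                          # REPLACE
            else:                                           # KEEP the original module
                start = i * self.compress_ratio
                chosen.extend(self.prd_layers[start:start + self.compress_ratio])
        return chosen


def linear_rate(step, base_replacing_rate, k):
    """Same formula as LinearReplacementScheduler.step: min(k * step + base, 1.0)."""
    return min(k * step + base_replacing_rate, 1.0)


prd = [f"prd_layer_{i}" for i in range(12)]
scc = [f"scc_layer_{i}" for i in range(6)]
encoder = ToyEncoder(prd, scc)

rates = [linear_rate(t, base_replacing_rate=0.3, k=0.1) for t in range(1, 10)]
encoder.set_replacing_rate(rates[-1])     # by step 9 the rate has saturated at 1.0
final_training_layers = encoder.select_layers()
print(rates, final_training_layers, sep="\n")
```

Once the rate reaches 1.0, every training step already runs the successor modules only, so the training-time and inference-time networks coincide.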