# Background
HW7 is about model compression (Neural Network Compression).

There are many approaches to compression. Here we introduce four of those covered in class:

- Knowledge Distillation
- Network Pruning
- Architecture Design: building a CNN with fewer parameters
- Weight Quantization

This notebook covers Knowledge Distillation.
A pretrained large model is provided so that it can serve as the teacher.
The small model we use is one that has already gone through "Architecture Design".

- Architecture Design is in hw7_Architecture_Design.ipynb in the same directory.
- Download the pretrained large model (47.2M): https://drive.google.com/file/d/1B8ljdrxYXJsZv2vmTequdPOofp3VF3NN/view?usp=sharing
  - Use the ResNet18 provided by torchvision, set num_classes to 11, and load the weights. (An example appears later.)

In [1]:
import torch
import os
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.models as models
from hw7_data import hw7_Architecture_Design as Architecture_Design

# Knowledge Distillation
<img src="hw7_data/Knowledge Distillation.png" width = 50%>
Have a small model imitate large models that already perform well, using the logits predicted by the large model as the target for the small model.

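## Why does this work?
- When the data is not very clean, a bad label is just noise to an ordinary model and only interferes with learning. Learning from the logits predicted by a large model works better.
- Labels can be related to one another, and this can guide the small model. For example, the digit 8 may be related to 6, 9, and 0.
- It weakens targets that are already learned well (?), so that their gradients do not interfere with tasks that have not been learned yet.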
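
The point about related labels can be made concrete with a small sketch. It is not part of the homework code: the logits are made-up values for a hypothetical 4-class problem, and `softmax_with_temperature` is an illustrative helper. It shows how dividing the logits by a temperature T flattens the distribution, so the classes the teacher considers "related" receive visible probability mass.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, computed in a numerically stable way."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for one image (illustrative values only).
teacher_logits = [9.0, 5.0, 4.0, -2.0]

sharp = softmax_with_temperature(teacher_logits, T=1.0)   # almost one-hot
soft = softmax_with_temperature(teacher_logits, T=20.0)   # much flatter

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

With T=1 nearly all the mass sits on the first class. With T=20 the ranking is unchanged, but the second and third classes carry a substantial share, which is the extra signal the student learns from.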

## How is it implemented?
$$Loss = \alpha T^2 \times KL\left(\text{softmax}\left(\frac{\text{Teacher's Logits}}{T}\right) \,\Big\|\, \text{softmax}\left(\frac{\text{Student's Logits}}{T}\right)\right) + (1-\alpha)(\text{Original Loss})$$


- Why the code below applies log_softmax to the student: https://github.com/peterliht/knowledge-distillation-pytorch/issues/2
- reference: [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)
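
To make the formula concrete, here is a minimal pure-Python sketch of the loss for a single sample. The helper names (`log_softmax`, `kd_loss_single`) and the logits are illustrative assumptions, not the notebook's implementation. It also hints at why the student goes through log_softmax in PyTorch: `nn.KLDivLoss` expects log-probabilities as its input and probabilities as its target.

```python
import math

def log_softmax(logits, T=1.0):
    """Log of softmax(logits / T), computed with the log-sum-exp trick."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]

def kd_loss_single(student_logits, teacher_logits, label, T=20.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy, for one sample."""
    log_p_student = log_softmax(student_logits, T)
    log_p_teacher = log_softmax(teacher_logits, T)
    # KL(teacher || student) = sum_i p_t[i] * (log p_t[i] - log p_s[i])
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    # The hard-label cross-entropy uses the unscaled student logits (T = 1).
    hard = -log_softmax(student_logits, 1.0)[label]
    return alpha * T * T * kl + (1.0 - alpha) * hard

# Hypothetical logits for a 4-class problem (illustrative values only).
teacher = [9.0, 5.0, 4.0, -2.0]
student = [2.0, 1.0, 0.5, 0.0]
loss = kd_loss_single(student, teacher, label=0)
print(round(loss, 4))
```

When the student's logits equal the teacher's, the KL term is zero and only the cross-entropy term remains.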

# Data Processing

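Our dataset is the same one used in hw3 - CNN.

## Read image
For this model the images must be read in directly as they are; otherwise the accuracy drops to about half of that in the instructor's example. Doing so uses a lot of memory, so the training set has to be reduced in size.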

In [6]:
import time
import numpy as np
from PIL import Image

def readfile(path, label):  # label is a flag: True means the labels y are returned as well
    image_dir = sorted(os.listdir(path))
    x = []  # list of loaded PIL images
    # uint8 (range 0-255) is enough to hold the class indices
    y = np.zeros((len(image_dir)), dtype=np.uint8)
    # Iterate over the files in the folder together with their index
    for i, file in enumerate(image_dir):
        image = Image.open(os.path.join(path, file))
        image_fp = image.fp # Get File Descriptor
        image.load()
        image_fp.close() # Close File Descriptor (or it'll reach OPEN_MAX)
        x.append(image)
        if label:
            # Training images are named [class]_[index].jpg
            y[i] = int(file.split("_")[0])  # the part before "_" is the class, used as y
    if label:
        return x, y
    else:
        return x

# Read the training, validation, and testing sets with the readfile function
workspace_dir = 'hw7_data/'
start_time = time.time()
print("Reading data")
train_x, train_y = readfile(os.path.join(workspace_dir, "training"), True)
print("Size of training data = {}".format(len(train_x)))
val_x, val_y = readfile(os.path.join(workspace_dir, "validation"), True)
print("Size of validation data = {}".format(len(val_x)))
test_x = readfile(os.path.join(workspace_dir, "testing"), False)
print("Size of Testing data = {}".format(len(test_x)))
end_time = time.time()
print("Time taken:", end_time - start_time)

Reading data
Size of training data = 1000
Size of validation data = 500
Size of Testing data = 124
Time taken: 7.232342720031738
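
The label extraction in readfile relies on the `[class]_[index].jpg` naming scheme and on `sorted()`. A small sketch with made-up file names (the names and the `label_from_filename` helper are hypothetical) shows both, including the fact that `sorted()` orders strings lexicographically rather than numerically:

```python
def label_from_filename(filename):
    """Return the class index encoded before the first underscore."""
    return int(filename.split("_")[0])

# Hypothetical file names following the [class]_[index].jpg convention.
files = sorted(["10_3.jpg", "0_12.jpg", "2_0.jpg"])
labels = [label_from_filename(f) for f in files]

print(files)   # "10_3.jpg" sorts before "2_0.jpg" because "1" < "2"
print(labels)
```

The order itself does not matter for training, since each label is read from its own file name, but any later step that matches predictions back to files has to use the same sorted order.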


## Preprocessing

In [7]:
from PIL import Image
import torchvision.transforms as transforms
from torch.utils.data import Dataset

# From torchvision.transforms
# Apply data augmentation to the training data
train_transform = transforms.Compose([ # Compose chains several processing steps together
    #transforms.ToPILImage(), # not needed here: the inputs are already PIL images
    transforms.RandomCrop(256, pad_if_needed = True, padding_mode='symmetric'), # random 256x256 crop, padding smaller images
    transforms.RandomHorizontalFlip(), # randomly mirror the image left-right
    transforms.RandomRotation(15), # randomly rotate the image by up to 15 degrees
    transforms.ToTensor(), # convert to a Tensor with values scaled to [0,1], ready to feed into the network
])

# No data augmentation for the validation and test data
test_transform = transforms.Compose([
    #transforms.ToPILImage(),   
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

# Define the ImgDataset class, which subclasses torch.utils.data.Dataset and implements how samples are read.
# It is consumed later by torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False,...) 
class ImgDataset(Dataset):
    def __init__(self, x, y=None, transform=None):
        self.x = x
        self.y = y
        if y is not None: # store y as a LongTensor (64-bit signed integers)
            self.y = torch.LongTensor(y)
        self.transform = transform
    def __len__(self):
        return len(self.x)
    def __getitem__(self, index): # return one image and, if available, its class y
        X = self.x[index]
        if self.transform is not None:
            X = self.transform(X)
        if self.y is not None: # return y only when labels exist
            Y = self.y[index]
            return X, Y
        else:
            return X

train_set = ImgDataset(train_x, train_y, train_transform) # training set
val_set = ImgDataset(val_x, val_y, test_transform) # validation set

## Loading the data

In [8]:
from torch.utils.data import DataLoader

batch_size = 32

# Use torch.utils.data.DataLoader() to read the data in batches
# dataset: the Dataset object to load; batch_size: batch size; shuffle: whether to shuffle the data
train_loader = DataLoader(train_set, batch_size = batch_size, shuffle=True)
val_loader = DataLoader(val_set, batch_size = batch_size, shuffle=False)
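
As a quick sanity check on what these loaders yield, the number and size of the batches follow from simple arithmetic. The sketch below assumes the DataLoader default `drop_last=False`, so the final partial batch is kept, and uses the dataset sizes printed earlier; `batch_sizes` is an illustrative helper, not a PyTorch function.

```python
def batch_sizes(num_samples, batch_size):
    """Sizes of the batches produced when the last partial batch is kept."""
    full, rest = divmod(num_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

train_batches = batch_sizes(1000, 32)  # training set size from the output above
val_batches = batch_sizes(500, 32)     # validation set size from the output above

print(len(train_batches), train_batches[-1])
print(len(val_batches), val_batches[-1])
```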

# Loading the models
Load the provided model and set up the model parameters.

In [26]:
teacher_net = models.resnet18(pretrained = False, num_classes=11).cuda()
student_net = Architecture_Design.StudentNet(base=16).cuda()

teacher_net.load_state_dict(torch.load('hw7_data/teacher_resnet18.bin'))
optimizer = optim.AdamW(student_net.parameters(), lr = 0.001)

def loss_fn_kd(outputs, labels, teacher_outputs, T = 20, alpha = 0.5):
    hard_loss = F.cross_entropy(outputs, labels) * (1. - alpha) # ordinary cross-entropy on the hard labels
    # KL divergence between the student's log_softmax(logits/T) and the target probabilities softmax(teacher logits/T).
    soft_loss = nn.KLDivLoss(reduction='batchmean')(F.log_softmax(outputs/T, dim=1),F.softmax(teacher_outputs/T, dim=1)) * (alpha * T * T) 
    return hard_loss + soft_loss

# Start Training
- The remaining steps are the same as what you did in hw3 - CNN.

In [27]:
def train_epoch(train_loader, alpha=0.5):
    train_acc, train_loss = 0, 0
    for i, data in enumerate(train_loader): # iterate over the batches; data = (images, labels)
        optimizer.zero_grad() # reset the gradients of the model parameters
        with torch.no_grad(): # first run teacher_net (no gradients needed)
            soft_labels = teacher_net(data[0].cuda()) 
        train_pred = student_net(data[0].cuda()) # then run student_net
        
        loss = loss_fn_kd(train_pred, data[1].cuda(), soft_labels, 20, alpha) # distillation loss between student_net and teacher_net
        loss.backward() # back propagation: compute the gradient of every parameter
        optimizer.step() # let the optimizer update the parameters with the gradients

        train_acc += np.sum(np.argmax(train_pred.cpu().data.numpy(), axis=1) == data[1].numpy())
        train_loss += loss.item()
    # Note: loss.item() is a per-batch mean, so dividing the sum by the dataset size scales it by roughly 1/batch_size
    return train_loss/len(train_set), train_acc/len(train_set)

def val_epoch(val_loader, alpha=0.5):
    val_acc, val_loss = 0 ,0
    with torch.no_grad(): # torch.no_grad() is a context manager; nothing wrapped in it tracks gradients
        for i, data in enumerate(val_loader):
            soft_labels = teacher_net(data[0].cuda()) # first run teacher_net
            val_pred = student_net(data[0].cuda()) # then run student_net
            loss = loss_fn_kd(val_pred, data[1].cuda(), soft_labels, 20, alpha) # distillation loss between student_net and teacher_net
            
            val_acc += np.sum(np.argmax(val_pred.cpu().data.numpy(), axis=1) == data[1].numpy())
            val_loss += loss.item()
    return val_loss/len(val_set), val_acc/len(val_set)


teacher_net.eval() # teacher_net always stays in eval mode
now_best_acc = 0
for epoch in range(100):
    epoch_start_time = time.time()
    student_net.train()
    train_loss, train_acc = train_epoch(train_loader)
    student_net.eval()
    valid_loss, valid_acc = val_epoch(val_loader)

    # Save the best model so far.
    if valid_acc > now_best_acc:
        now_best_acc = valid_acc
        torch.save(student_net.state_dict(), 'hw7_data/student_model_Knowledge_Distillation.bin')
    print('Epoch {:>3d}: [train loss: {:6.4f}, acc: {:6.4f}] [valid loss: {:6.4f}, acc: {:6.4f}], time: {:4.2f} sec(s)'.format(
            epoch+1, train_loss, train_acc, valid_loss, valid_acc, time.time()-epoch_start_time))

Epoch   1: [train loss: 0.6292, acc: 0.1590] [valid loss: 0.8030, acc: 0.1300], time: 7.54 sec(s)
Epoch   2: [train loss: 0.5840, acc: 0.2670] [valid loss: 0.7957, acc: 0.2520], time: 7.51 sec(s)
Epoch   3: [train loss: 0.5687, acc: 0.2740] [valid loss: 0.5888, acc: 0.3220], time: 7.52 sec(s)
Epoch   4: [train loss: 0.5453, acc: 0.3120] [valid loss: 0.6964, acc: 0.2880], time: 7.54 sec(s)
Epoch   5: [train loss: 0.5368, acc: 0.3420] [valid loss: 0.5780, acc: 0.3780], time: 7.50 sec(s)
Epoch   6: [train loss: 0.5206, acc: 0.3460] [valid loss: 0.5602, acc: 0.3820], time: 7.53 sec(s)
Epoch   7: [train loss: 0.5367, acc: 0.3520] [valid loss: 0.5735, acc: 0.3460], time: 7.53 sec(s)
Epoch   8: [train loss: 0.5116, acc: 0.3500] [valid loss: 0.5265, acc: 0.4220], time: 7.54 sec(s)
Epoch   9: [train loss: 0.5034, acc: 0.3980] [valid loss: 0.5488, acc: 0.3660], time: 7.56 sec(s)
Epoch  10: [train loss: 0.4801, acc: 0.4010] [valid loss: 0.5131, acc: 0.4060], time: 7.54 sec(s)
Epoch  11: [train loss: 0.4737, acc: 0.4130] [valid loss: 0.5150, acc: 0.4160], time: 7.56 sec(s)

Epoch  92: [train loss: 0.2043, acc: 0.7920] [valid loss: 0.4194, acc: 0.5620], time: 7.59 sec(s)
Epoch  93: [train loss: 0.2080, acc: 0.7900] [valid loss: 0.4097, acc: 0.5400], time: 7.60 sec(s)
Epoch  94: [train loss: 0.2073, acc: 0.8030] [valid loss: 0.4083, acc: 0.5440], time: 7.60 sec(s)
Epoch  95: [train loss: 0.2073, acc: 0.8030] [valid loss: 0.4041, acc: 0.5640], time: 7.57 sec(s)
Epoch  96: [train loss: 0.1932, acc: 0.8390] [valid loss: 0.4202, acc: 0.5560], time: 7.63 sec(s)
Epoch  97: [train loss: 0.1907, acc: 0.8100] [valid loss: 0.3994, acc: 0.5460], time: 7.60 sec(s)
Epoch  98: [train loss: 0.2039, acc: 0.7880] [valid loss: 0.3891, acc: 0.5900], time: 7.61 sec(s)
Epoch  99: [train loss: 0.2039, acc: 0.8200] [valid loss: 0.4519, acc: 0.5240], time: 7.60 sec(s)
Epoch 100: [train loss: 0.1802, acc: 0.8450] [valid loss: 0.3791, acc: 0.5640], time: 7.60 sec(s)
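
A note on reading the loss columns: the training and validation functions add up per-batch mean losses and then divide by the number of samples, so the logged values are roughly 1/batch_size of a true per-sample average. The sketch below illustrates the difference with made-up batch losses; both helpers are hypothetical and not part of the notebook.

```python
def reported_loss(batch_mean_losses, num_samples):
    """What the training loop logs: sum of per-batch means / dataset size."""
    return sum(batch_mean_losses) / num_samples

def sample_weighted_mean(batch_mean_losses, batch_sizes):
    """Per-sample average: weight each batch mean by its batch size."""
    total = sum(l * n for l, n in zip(batch_mean_losses, batch_sizes))
    return total / sum(batch_sizes)

# Hypothetical values: three batches that each have a mean loss of 6.0.
losses = [6.0, 6.0, 6.0]
sizes = [32, 32, 8]

print(reported_loss(losses, sum(sizes)))
print(sample_weighted_mean(losses, sizes))
```

The accuracy columns are unaffected, because correct predictions are counted per sample before dividing by the dataset size.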


# Testing

In [24]:
test_set = ImgDataset(test_x, transform=test_transform)
test_loader = DataLoader(test_set, batch_size = batch_size, shuffle=False)
name = ['bread', 'dairy', 'dessert', 'egg', 'fried food', 'meat', 'noodles', 'rice', 'seafood', 'soup', 'fruit and vegetables']
prediction = []

student_net = Architecture_Design.StudentNet(base=16).cuda()
student_net.load_state_dict(torch.load('hw7_data/student_model_Knowledge_Distillation.bin'))
student_net.eval()
with torch.no_grad():
    for i, data in enumerate(test_loader):
        test_pred = student_net(data.cuda())
        test_label = np.argmax(test_pred.cpu().data.numpy(), axis=1)
        for y in test_label:
            prediction.append(y)
        
def rename_and_write_csv():
    # sorted() matches the file order used by readfile, so prediction[i] lines up with f[i]
    # (assuming testing_prediction holds copies of the testing images)
    f = sorted(os.listdir("hw7_data/testing_prediction"))
    for i in range(len(f)):
        oldname = f[i]
        newname = oldname.split(".")[0] + '-'+ name[prediction[i]] + '.jpg'
        # rename the file with os.rename
        os.rename("hw7_data/testing_prediction/" + oldname, "hw7_data/testing_prediction/" + newname)
    
    # write the results to a csv file (three columns per row)
    with open("hw7_data/prediction.csv", 'w') as f:
        f.write('Id,Category,Name\n')
        for i, y in  enumerate(prediction):
            f.write('{},{},{}\n'.format(i, y, name[y]))
        
#rename_and_write_csv()
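
As an alternative to formatting rows by hand, the predictions could be written with the standard `csv` module, which handles quoting if a class name ever contains a comma. This is only a sketch: `write_predictions`, the column names, and the sample data are illustrative assumptions, and it writes to a temporary directory rather than to `hw7_data`.

```python
import csv
import os
import tempfile

def write_predictions(csv_path, filenames, predictions, class_names):
    """Write one row per test image: index, file name, class id, class name."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "File", "Category", "Name"])
        for i, (filename, y) in enumerate(zip(filenames, predictions)):
            writer.writerow([i, filename, y, class_names[y]])

# Hypothetical data standing in for the real test files and predictions.
demo_names = ["bread", "dairy", "dessert"]
demo_files = sorted(["0002.jpg", "0000.jpg", "0001.jpg"])
demo_preds = [2, 0, 1]

out_path = os.path.join(tempfile.mkdtemp(), "prediction.csv")
write_predictions(out_path, demo_files, demo_preds, demo_names)

with open(out_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

Passing the sorted file list explicitly keeps each prediction paired with the file it was computed from.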