---
# IMPORTANT

**Please remember to save this notebook `SC201_Assignment5.ipynb` as you work on it!**

### Please make sure to use a GPU for this assignment.

Go to `Runtime -> Change runtime type` and set `Hardware Accelerator` to `GPU`.

In [None]:
# this mounts your Google Drive to the Colab VM.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Enter the path to your a5 folder
FOLDERNAME = '我的 筆記型電腦 MSI/Research data/stanCodeML/L14/SC201_Assignment5'
assert FOLDERNAME is not None, "[!] Enter the foldername."

# now that we've mounted your Drive, this ensures that
# the Python interpreter of the Colab VM can load
# python files from within it.
import sys
sys.path.append('/content/drive/Othercomputers/{}'.format(FOLDERNAME))
# /content/drive/Othercomputers/我的 筆記型電腦 MSI/Research data/stanCodeML/L14/SC201_Assignment5/SC201_Assignment5.ipynb
# this downloads the CIFAR-10 dataset to your Drive
# if it doesn't already exist.
%cd drive/Othercomputers/$FOLDERNAME/sc201/datasets/
!bash get_datasets.sh
%cd /content

Mounted at /content/drive
/content/drive/Othercomputers/我的 筆記型電腦 MSI/Research data/stanCodeML/L14/SC201_Assignment5/sc201/datasets
/content


# What is PyTorch?

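PyTorch is a computation framework for evaluating dynamic graphs (a neural network is one kind of graph). These graphs are built from PyTorch Tensor objects, which behave much like numpy arrays. PyTorch also has built-in automatic differentiation, so you do not have to write the backward pass by hand!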
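
As a minimal sketch of what automatic differentiation looks like in practice, the toy example below builds a tiny graph and lets autograd compute the gradients. The tensors `x`, `w`, and `b` are illustrative values, not part of the assignment.

```python
import torch

# A tiny computation graph, built on the fly as the code runs:
# y = sum(w * x + b)
x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

y = (w * x + b).sum()
y.backward()  # autograd fills in .grad for tensors with requires_grad=True

print(w.grad)  # dy/dw equals x
print(b.grad)  # dy/db equals the number of elements that b was broadcast to
```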

This notebook assumes that you are using **PyTorch version 1.4+**

## Why PyTorch?

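* PyTorch supports GPU computation, so training can run on the GPU and finish much faster!
* PyTorch uses a modular design, so you can take its existing modules (or define your own) and freely combine them into all kinds of neural networks!
* Machine learning in both academia and industry relies on PyTorch or similarly powerful frameworks, so you will be able to keep up with the latest research and applications!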

## How can I learn PyTorch on my own?

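If you are interested, see the PyTorch tutorials available online, such as https://github.com/jcjohnson/pytorch-examples

You can also consult the official PyTorch [API doc](http://pytorch.org/docs/stable/index.html). For PyTorch-related questions, we recommend asking on the [PyTorch forum](https://discuss.pytorch.org/) rather than on StackOverflow.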

# Section I. Preparation

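In previous assignments, data preparation was done by calling code that we provided.

PyTorch's built-in `DataLoader` and `sampler` classes automate this step. See the code below for details, in particular how the data is normalized and how it is partitioned into *train / val / test*.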

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import sampler

import torchvision.datasets as dset
import torchvision.transforms as T

import numpy as np

In [None]:
from torchvision.transforms.transforms import RandomHorizontalFlip
NUM_TRAIN = 49000

# The torchvision.transforms package provides tools for preprocessing data
# and for performing data augmentation; here we set up a transform to
# preprocess the data by subtracting the mean RGB value and dividing by the
# standard deviation of each RGB value; we've hardcoded the mean and std.
transform = T.Compose([
                T.ToTensor(),
                # T.RandomHorizontalFlip(p=0.5),
                T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
            ]) # (mean), (std)

# We set up a Dataset object for each split (train / val / test); Datasets load
# training examples one at a time, so we wrap each Dataset in a DataLoader which
# iterates through the Dataset and forms minibatches. We divide the CIFAR-10
# training set into train and val sets by passing a Sampler object to the
# DataLoader telling how it should sample from the underlying Dataset.
cifar10_train = dset.CIFAR10('./sc201/datasets', train=True, download=True,
                             transform=transform)
loader_train = DataLoader(cifar10_train, batch_size=64, 
                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))

cifar10_val = dset.CIFAR10('./sc201/datasets', train=True, download=True,
                           transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64, 
                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))

cifar10_test = dset.CIFAR10('./sc201/datasets', train=False, download=True, 
                            transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)

print('Total number of the data:', len(cifar10_train))

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./sc201/datasets/cifar-10-python.tar.gz


  0%|          | 0/170498071 [00:00<?, ?it/s]

Extracting ./sc201/datasets/cifar-10-python.tar.gz to ./sc201/datasets
Files already downloaded and verified
Files already downloaded and verified
Total number of the data: 50000


We enable PyTorch's GPU support through the `device` object.

(If CUDA is not enabled, `torch.cuda.is_available()` returns False and the notebook falls back to CPU mode.)

In [None]:
USE_GPU = True

dtype = torch.float32 # we will be using float throughout this tutorial

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Constant to control how frequently we print train loss
print_every = 100

print('using device:', device)

using device: cuda


In [None]:
def train_part34(model, optimizer, epochs=1):
    """
    Train a model on CIFAR-10 using the PyTorch Module API.
    
    Inputs:
    - model: A PyTorch Module giving the model to train.
    - optimizer: An Optimizer object we will use to train the model
    - epochs: (Optional) A Python integer giving the number of epochs to train for
    
    Returns: Nothing, but prints model accuracies during training.
    """
    model = model.to(device=device)  # move the model parameters to CPU/GPU
    for e in range(epochs):
      print('----------------------------Number of epochs:', e+1)
      for t, (x, y) in enumerate(loader_train):
          model.train()  # put model to training mode
          x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
          y = y.to(device=device, dtype=torch.long) # tensor is index

          scores = model(x)
          loss_function = nn.CrossEntropyLoss()
          loss = loss_function(scores, y)

          # Zero out all of the gradients for the variables which the optimizer
          # will update.
          optimizer.zero_grad()

          # This is the backwards pass: compute the gradient of the loss with
          # respect to each  parameter of the model.
          loss.backward()

          # Actually update the parameters of the model using the gradients
          # computed by the backwards pass.
          optimizer.step()

          if t % print_every == 0:
              print('Iteration %d, loss = %.4f' % (t, loss.item())) # d is decimal integer
              check_accuracy_part34(loader_val, model)
              print()

In [None]:
def check_accuracy_part34(loader, model):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')   
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            # scores.max(1) returns a namedtuple (values, indices): values holds
            # the maximum of each row along dimension 1, and indices holds the
            # position of each maximum (the argmax), i.e. the predicted class.
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
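
To make the `scores.max(1)` step in the accuracy check concrete, here is a small sketch on a hand-made score matrix. The values of `scores` and `y` are made up for illustration.

```python
import torch

scores = torch.tensor([[0.1, 2.0, -1.0],
                       [3.0, 0.5, 0.2]])   # 2 samples, 3 classes
y = torch.tensor([1, 2])                   # ground-truth class indices

values, preds = scores.max(1)              # max over the class dimension
num_correct = (preds == y).sum().item()    # count matching predictions

print(preds.tolist(), num_correct)
```

Only the first sample is classified correctly here, so the accuracy on this toy batch would be 1 / 2.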

# PyTorch Sequential API

### Sequential API: Two-Layer Network
Below is an `nn.Sequential` example of a two-layer fully connected network: we pass the built-in layers in order and train with the same training loop.

You don't need to tune any hyperparameters here, but even without tuning the model should reach above 40% accuracy within one epoch.

In [None]:
# nn.Flatten() wraps the flatten operation in a module so that it can be
# stacked in nn.Sequential

hidden_layer_size = 4000
learning_rate = 1e-2

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, hidden_layer_size),
    nn.ReLU(),
    nn.Linear(hidden_layer_size, 10),
)

# you can use Nesterov momentum in optim.SGD
optimizer = optim.SGD(model.parameters(), lr=learning_rate,
                     momentum=0.9, nesterov=True)

train_part34(model, optimizer, epochs=1)

----------------------------Number of epochs: 1
Iteration 0, loss = 2.3932
Checking accuracy on validation set
Got 176 / 1000 correct (17.60)

Iteration 100, loss = 1.5128
Checking accuracy on validation set
Got 370 / 1000 correct (37.00)

Iteration 200, loss = 1.7574
Checking accuracy on validation set
Got 414 / 1000 correct (41.40)

Iteration 300, loss = 2.0008
Checking accuracy on validation set
Got 434 / 1000 correct (43.40)

Iteration 400, loss = 1.8535
Checking accuracy on validation set
Got 428 / 1000 correct (42.80)

Iteration 500, loss = 1.5506
Checking accuracy on validation set
Got 397 / 1000 correct (39.70)

Iteration 600, loss = 1.8976
Checking accuracy on validation set
Got 442 / 1000 correct (44.20)

Iteration 700, loss = 1.6946
Checking accuracy on validation set
Got 464 / 1000 correct (46.40)



### Sequential API: Three-Layer ConvNet
Use `nn.Sequential` to define and train a three-layer ConvNet with the same architecture as before:

1. Convolutional layer (with bias) with 32 5x5 filters, with zero-padding of 2
2. ReLU
3. Convolutional layer (with bias) with 16 3x3 filters, with zero-padding of 1
4. ReLU
5. Fully-connected layer (with bias) to compute scores for 10 classes

Train it with stochastic gradient descent with Nesterov momentum 0.9.

You don't need to tune any hyperparameters here, but even without tuning the model should reach above 55% accuracy within one epoch.
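
When wiring up the fully-connected layer, it helps to work out the spatial size after each convolution. The sketch below applies the standard convolution output-size formula to the architecture listed above; `conv_out_size` is a hypothetical helper, not part of the assignment code.

```python
def conv_out_size(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1

h = conv_out_size(32, kernel=5, padding=2)   # layer 1: 5x5 filters, padding 2
h = conv_out_size(h, kernel=3, padding=1)    # layer 3: 3x3 filters, padding 1
fc_in_features = 16 * h * h                  # 16 channels from the second conv

print(h, fc_in_features)
```

Both paddings preserve the 32x32 resolution, so in the architecture as listed (no pooling) the final linear layer would take `16 * 32 * 32` input features.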

In [None]:
model = None
optimizer = None

################################################################################
# TODO: Rewrite the 2-layer ConvNet with bias from Part III with the           #
# Sequential API.                                                              #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
# model = nn.Sequential(
#     # Your Model Here
#     # N * 3 * 32 * 32
#     nn.Conv2d(in_channels=3, out_channels=32, kernel_size=5, padding=2),
#     nn.BatchNorm2d(32),
#     nn.ReLU(),
#     nn.MaxPool2d(kernel_size=2, stride=2),
#     # N * 32 * 16 * 16
#     nn.Conv2d(in_channels=32, out_channels=16, kernel_size=3, padding=1),
#     nn.BatchNorm2d(16),
#     nn.ReLU(),
#     nn.MaxPool2d(2, 2),
#     # N * 16 * 8 * 8
#     nn.Flatten(),
#     nn.Linear(in_features=16*8*8, out_features=10)
# )
class MyCNN(nn.Module):
  def __init__(self) -> None:
      super().__init__()
      self.conv1 = nn.Conv2d(3, 64, 3, 1, 1)
      self.conv2 = nn.Conv2d(64, 128, 3, 1, 1)
      self.fc = nn.Linear(128*8*8, 10)
  def forward(self, x):
    x = self.conv1(x)
    x = F.relu(x)
    x = F.max_pool2d(x, (2, 2))
    x = self.conv2(x)
    x = F.relu(x)
    x = F.max_pool2d(x, (2, 2))
    x = torch.flatten(x, 1)
    return self.fc(x)

model = MyCNN()
optimizer = optim.SGD(model.parameters(), lr=learning_rate,
                     momentum=0.9, nesterov=True)
# optimizer = optim.RMSprop(model.parameters(), lr=1e-3)

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
################################################################################
#                                 END OF YOUR CODE                             
################################################################################

train_part34(model, optimizer, epochs=3)

----------------------------Number of epochs: 1
Iteration 0, loss = 2.2998
Checking accuracy on validation set
Got 120 / 1000 correct (12.00)

Iteration 100, loss = 1.4314
Checking accuracy on validation set
Got 514 / 1000 correct (51.40)

Iteration 200, loss = 1.3789
Checking accuracy on validation set
Got 553 / 1000 correct (55.30)

Iteration 300, loss = 1.2837
Checking accuracy on validation set
Got 596 / 1000 correct (59.60)

Iteration 400, loss = 1.1367
Checking accuracy on validation set
Got 614 / 1000 correct (61.40)

Iteration 500, loss = 1.3291
Checking accuracy on validation set
Got 641 / 1000 correct (64.10)

Iteration 600, loss = 0.9429
Checking accuracy on validation set
Got 623 / 1000 correct (62.30)

Iteration 700, loss = 0.7880
Checking accuracy on validation set
Got 654 / 1000 correct (65.40)

----------------------------Number of epochs: 2
Iteration 0, loss = 0.8172
Checking accuracy on validation set
Got 647 / 1000 correct (64.70)

Iteration 100, loss = 0.8276
Checki

# Section V. CIFAR-10 open-ended challenge

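This last section is open-ended! Put your mind (and Google's GPUs) to work: use the `nn.Module` or `nn.Sequential` API to design and train a CNN that reaches at least 70% CIFAR-10 validation accuracy within 10 epochs. You may use the `check_accuracy_part34` and `train_part34` functions defined above.

Refer to the official API documentation: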

* Layers in torch.nn package: http://pytorch.org/docs/stable/nn.html
* Activations: http://pytorch.org/docs/stable/nn.html#non-linear-activations
* Loss functions: http://pytorch.org/docs/stable/nn.html#loss-functions
* Optimizers: http://pytorch.org/docs/stable/optim.html

### Things you might try:
- **Filter size**: The CNN above uses 5x5 filters; other sizes are worth trying.
- **Number of filters**: The CNN above uses 32 filters; try more or fewer.
- **Pooling vs Strided Convolution**: Do max pooling or strided convolutions work better?
- **Batch normalization**: You can add spatial batch normalization after convolution layers and vanilla batch normalization after affine layers. Does the network train faster with it?
- **Network architecture**: Is a deeper network more powerful? Architectures to try:
    - [conv-relu-pool] x N -> [affine] x M -> [softmax or SVM]
    - [conv-relu-conv-relu-pool] x N -> [affine] x M -> [softmax or SVM]
    - [batchnorm-relu-conv] x N -> [affine] x M -> [softmax or SVM]
- **Global Average Pooling**: A typical CNN flattens the feature maps after the convolutions and then feeds them into affine layers. An alternative is to apply global average pooling after the convolutions to obtain a 1x1 average image (of shape (1, 1, Filter#)), which is then reshaped into a vector of length Filter#. See [Google's Inception Network](https://arxiv.org/abs/1512.00567) (Table 1).
- **Regularization**: You can use an L2 regularization loss or Dropout.
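
As one illustration of the global average pooling idea from the list above, the sketch below uses `nn.AdaptiveAvgPool2d(1)` followed by `nn.Flatten()`. The batch size and channel count are arbitrary assumptions, and this is only one of several ways to express it in PyTorch.

```python
import torch
import torch.nn as nn

# Hypothetical feature maps in PyTorch's (N, C, H, W) layout.
feature_maps = torch.randn(8, 16, 4, 4)

gap = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # average each channel down to 1x1 -> (N, C, 1, 1)
    nn.Flatten(),             # -> (N, C), one value per filter
)
pooled = gap(feature_maps)

print(pooled.shape)
```

A linear layer placed after this would need only `C` input features, regardless of the spatial size of the feature maps.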

### Tips for training
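Remember to tune the learning rate and other hyperparameters to find the best values. While tuning, keep the following in mind:

- Good hyperparameter values should show an effect within a thousand iterations.
- Use coarse-to-fine tuning:
    - Start with a coarse search and don't train for long; bad hyperparameters can be discarded right away.
    - Once you have found a suitable range, search more finely and train for longer.
- Hyperparameter tuning should use the validation set, not the test set! The test set is reserved for the final evaluation of your best model.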
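
The coarse-to-fine procedure above could be sketched as follows for the learning rate. The search ranges, the number of trials, and the helper `sample_log_uniform` are all assumptions chosen for illustration; each sampled value would then be passed to a short training run.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose log10 is uniform between log10(low) and log10(high)."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Coarse stage: a wide range, with only a short training run per candidate.
coarse_lrs = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(5)]

# Fine stage: suppose (hypothetically) the coarse runs favoured values near
# 1e-2; narrow the range around it and train each candidate for longer.
fine_lrs = [sample_log_uniform(3e-3, 3e-2, rng) for _ in range(5)]

print(coarse_lrs)
print(fine_lrs)
```

Sampling on a log scale spreads the candidates evenly across orders of magnitude, which suits hyperparameters such as the learning rate.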

### Going above and beyond
If you are interested, you can write your own code to support more advanced features!

- Alternative optimizers: try update rules such as Adam, Adagrad, or RMSprop.
- Alternative activation functions: try leaky ReLU, parametric ReLU, ELU, or MaxOut.
- Model ensembles
- Data augmentation
- New architectures ([see this blog](https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32))
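  - [ResNets](https://arxiv.org/abs/1512.03385): the input of an earlier layer is added to the output of a later layer.
  - [DenseNets](https://arxiv.org/abs/1608.06993): the inputs of all preceding layers are concatenated and fed into the next layer.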
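
For the ResNet idea, a minimal residual block might look like the sketch below. `ResidualBlock` is a hypothetical module written for illustration, assuming the input and output have the same number of channels; it is not the exact block from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Hypothetical minimal residual block: out = relu(x + f(x))."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # skip connection: add the block's input back in

block = ResidualBlock(16)
x = torch.randn(4, 16, 8, 8)   # (N, C, H, W)
print(block(x).shape)
```

Because padding keeps the spatial size and the channel count is unchanged, the output has the same shape as the input, so blocks like this can be stacked inside `nn.Sequential`.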

### Have fun and happy training! 

In [None]:
from torch.nn.modules.activation import Softmax
################################################################################
# TODO:                                                                        #         
# Experiment with any architectures, optimizers, and hyperparameters.          #
# Achieve AT LEAST 70% accuracy on the *validation set* within 10 epochs.      #
#                                                                              #
# Note that you can use the check_accuracy function to evaluate on either      #
# the test set or the validation set, by passing either loader_test or         #
# loader_val as the second argument to check_accuracy. You should not touch    #
# the test set until you have finished your architecture and  hyperparameter   #
# tuning, and only run the test set once at the end to report a final value.   #
################################################################################
model = None
optimizer = None
learning_rate = 1e-2
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
model = nn.Sequential(
    # Your Model Here
    # N * 3 (input channels) * 32 (height) * 32 (width)
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    # nn.MaxPool2d(kernel_size=2, stride=2),
    nn.AvgPool2d(2, 2),

    # N * 64 * 16 * 16
    nn.Conv2d(in_channels=64, out_channels=32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    # nn.MaxPool2d(2, 2),
    nn.AvgPool2d(2, 2),

    # N * 32 * 8 * 8
    nn.Conv2d(in_channels=32, out_channels=16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    # nn.MaxPool2d(2, 2),
    nn.AvgPool2d(2, 2),

    # N * 16 * 4 * 4
    nn.Flatten(),
    nn.Linear(in_features=16 * 4 * 4, out_features=10),
    # nn.Linear(in_features=100, out_features=10),
    # nn.Softmax(dim=1)
)
optimizer = optim.SGD(model.parameters(), lr=learning_rate,
                     momentum=0.9, nesterov=True)
# optimizer = optim.RMSprop(model.parameters(), lr=learning_rate)
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
################################################################################
#                                 END OF YOUR CODE                             
################################################################################

# You should get at least 70% accuracy
train_part34(model, optimizer, epochs=10)

----------------------------Number of epochs: 1
Iteration 0, loss = 2.4677
Checking accuracy on validation set
Got 118 / 1000 correct (11.80)

Iteration 100, loss = 1.8249
Checking accuracy on validation set
Got 432 / 1000 correct (43.20)

Iteration 200, loss = 1.3821
Checking accuracy on validation set
Got 520 / 1000 correct (52.00)

Iteration 300, loss = 1.2878
Checking accuracy on validation set
Got 517 / 1000 correct (51.70)

Iteration 400, loss = 1.2537
Checking accuracy on validation set
Got 567 / 1000 correct (56.70)

Iteration 500, loss = 1.0419
Checking accuracy on validation set
Got 566 / 1000 correct (56.60)

Iteration 600, loss = 1.1310
Checking accuracy on validation set
Got 564 / 1000 correct (56.40)

Iteration 700, loss = 0.8799
Checking accuracy on validation set
Got 629 / 1000 correct (62.90)

----------------------------Number of epochs: 2
Iteration 0, loss = 1.0938
Checking accuracy on validation set
Got 625 / 1000 correct (62.50)

Iteration 100, loss = 1.1972
Checki

## Describe what you did 

Describe the strategy you used.

## Answer

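[[conv-relu-pool] x 3: the channel counts go 3 --> 64 --> 32 --> 16, with batch normalization after each convolution and 2x2 average pooling (`nn.AvgPool2d`) to downsample the feature maps --> [affine] x 1: a single fully-connected layer maps directly to the 10 classes; the weights are optimized with SGD with Nesterov momentum]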

## Test set -- run this only once

Store your best model in `best_model` and evaluate it on the test set. How does the test accuracy below relate to the validation accuracy above?

In [None]:
best_model = model
check_accuracy_part34(loader_test, best_model)

Checking accuracy on test set
Got 7527 / 10000 correct (75.27)


---
# IMPORTANT

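Congratulations on finishing the assignment! **Please enable sharing on the folder and submit the shared link in the stanCode assignment submission form!**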