# Try residual

論文
He et al., Deep Residual Learning for Image Recognition, https://arxiv.org/abs/1512.03385

conv + ReLU + ... を素通りさせるルートを加えるのが **residual**.

### 基本

Tensor をいろいろやった [$ x \mapsto F(x)$]  あとに元の $x$ を足す:
$ x \mapsto F(x) + x$.


具体的にどこの x をいつ戻すかというと、先人の知恵によれば

```
x → conv → batch norm → ReLU → conv → batch norm → + → ReLU
```

x を保存しておいて、+ のところで足す. 

### 変形版

x と F(x) は tensor の形が違うのでは？

Same padding + channel 数不変の場合は良い。しかし: 

1.  valid padding だと大きさがちょっと減る
2. channel数は同じのこともあれば、２倍くらいに増えることもよくある.
3. stride > 2 だと大きさが大きく減る


(1) についてはわからない。 $x$ の両端を切ればいいと思うけど、普通はpadding で大きさを同じにする.

PyTorch 公式の ResNet を読むと
https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py#L25

convolution (ReLU なし) と batch normalization を入れてある.

$$
F(x) + \mathrm{norm}(\mathrm{conv}(x))
$$



Conv は普通の学習されるパラメタを持っている。違いは ReLu の非線形性を持っていないこと.

```python
 if (stride != 1) or (self.in_channels != out_channels):
    downsample = nn.Sequential(
      conv3x3(self.in_channels, out_channels, stride=stride),
      nn.BatchNorm2d(out_channels))
 ```
                
(3) stride の対処法も同じ. F(x) でやっている stride と同じ stride で conv して ReLU 無し. F(x) 内では普通２回の conv をやるけど、stride > 2 をするのは初回だけで、２回目は形を変えない conv をしている.




In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import math
import time
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F  # F.relu とか

print('PyTorch version', torch.__version__)


# GPU が使えるなら使う
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('Device:', device)

if torch.cuda.is_available():
  print('GPU:', torch.cuda.get_device_name(0))

PyTorch version 1.1.0
Device: cuda:0
GPU: Tesla T4


In [3]:
# データが PIL.Image なので、torch.Tensor に変換する
to_tensor = torchvision.transforms.ToTensor()

train = torchvision.datasets.MNIST(root='./input', train=True,
                                   download=True, transform=to_tensor)
test = torchvision.datasets.MNIST(root='./input', train=False,
                                  transform=to_tensor)
print(train)
print(test)

0it [00:00, ?it/s]

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./input/MNIST/raw/train-images-idx3-ubyte.gz


9920512it [00:02, 4520147.93it/s]                             


Extracting ./input/MNIST/raw/train-images-idx3-ubyte.gz


0it [00:00, ?it/s]

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./input/MNIST/raw/train-labels-idx1-ubyte.gz


32768it [00:00, 68372.01it/s]                            
0it [00:00, ?it/s]

Extracting ./input/MNIST/raw/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./input/MNIST/raw/t10k-images-idx3-ubyte.gz


1654784it [00:01, 1145982.32it/s]                            
0it [00:00, ?it/s]

Extracting ./input/MNIST/raw/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./input/MNIST/raw/t10k-labels-idx1-ubyte.gz


8192it [00:00, 26013.02it/s]            

Extracting ./input/MNIST/raw/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Dataset MNIST
    Number of datapoints: 60000
    Root location: ./input
    Split: Train
Dataset MNIST
    Number of datapoints: 10000
    Root location: ./input
    Split: Test





In [0]:
batch_size = 100
train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True)
test_loader  = torch.utils.data.DataLoader(test, batch_size=batch_size, shuffle=False)

In [0]:
class ResidualNet(nn.Module):
    def __init__(self):
        # ここでは各レイヤーを定義しているだけ。繋がっていない
        # Sequential を使って __init__ 時点でつなぐ流儀もある
        super(ResidualNet, self).__init__()
        self.conv1a = nn.Conv2d(1,  16, 3)             # 28x28x1  -> 26x26x16
        self.norm1a = nn.BatchNorm2d(16)        
        self.conv1b = nn.Conv2d(16, 16, 3, padding=1)  # 26x24x16 -> 26x26x16
        self.norm1b = nn.BatchNorm2d(16)
        self.conv1c = nn.Conv2d(16, 16, 3, padding=1)  # 26x26x16 -> 26x26x16
        self.norm1c = nn.BatchNorm2d(16)
        self.pool1  = nn.MaxPool2d(2, 2)               # 26x26x16 -> 13x13x16
        
        self.conv2a = nn.Conv2d(16, 32, 3)             # 13x13x16 -> 11x11x32
        self.norm2a = nn.BatchNorm2d(32)
        self.conv2r = nn.Conv2d(16, 32, 3)             # 迂回経路
        self.norm2r = nn.BatchNorm2d(32)
        
        self.conv2b = nn.Conv2d(32, 32, 3, padding=1)  # 11x11x32 -> 11x11x32
        self.norm2b = nn.BatchNorm2d(32)
        self.conv2c = nn.Conv2d(32, 32, 3, padding=1)  # 11x11x32 -> 11x11x32
        self.norm2c = nn.BatchNorm2d(32)
        self.pool2  = nn.MaxPool2d(2, 2)               # 11x11x32 -> 5x5x32
        
        self.fc1    = nn.Linear(5 * 5 * 32, 50)  # fully connected -> 50
        self.dropout1 = nn.Dropout2d()           # default dropout rate is 0.5
        self.fc2      = nn.Linear(50, 10)        # 50 -> 10 number of classes
        #self.dropout2 = nn.Dropout2d()

    def forward(self, x):
        # レイヤーをつなげる
        x = F.relu(self.norm1a(self.conv1a(x)))
        
        # residual block 1
        residual = x   # x
        
        x = F.relu(self.norm1b(self.conv1b(x)))
        x = self.norm1c(self.conv1c(x))
        x += residual  # x + F(x)
        x = F.relu(x)
        x = F.relu(self.pool1(x))
        
        # 変形 residual
        out = F.relu(self.norm2a(self.conv2a(x)))  # 普通の conv + ReLU
        residual = self.norm2r(self.conv2r(x))     # ReLU 無し linear mapping
                                                   # 点線
        x = out + residual
        
        # residual block 2
        residual = x
        x = F.relu(self.norm2b(self.conv2b(x)))
        x = self.norm2c(self.conv2c(x))
        x + residual
        x = F.relu(self.pool2(x))
        
        x = x.view(-1, 5 * 5 * 32)
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)

        x = self.fc2(x)
        #x = self.dropout2(x)

        return x

In [30]:
# Model と 最適化方法
model = ResidualNet()
cross_entropy = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

model = model.to(device)  # send to GPU if available

# 訓練の進行具合を記録
def measure_scores(name, loader, history, *, nbatch=100):
    loss_sum = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for i, (images, labels) in enumerate(loader, 0):
            # send data to GPU
            images = images.to(device)
            labels = labels.to(device)
        
            outputs = model(images)
            loss = cross_entropy(outputs, labels)
            loss_sum += loss
                
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            
            if i + 1 == nbatch:
              break

    # average loss during the epoch
    loss_average = loss_sum / len(train_loader)
    accuracy = correct/total
    
    history[name + '_loss'].append(loss_average)
    history[name + '_accuracy'].append(accuracy)
            
            
epochs = 20
history = {'train_loss': [], 'test_loss': [],
           'train_accuracy': [], 'test_accuracy': []}
times_train = []
times_test = []

for iepoch in range(epochs):
    model.train()  # 訓練モード
    time_train = 0.0
    for i, (images, labels) in enumerate(train_loader, 0): 
        ts = time.time()
        
        # send data to GPU
        images = images.to(device)
        labels = labels.to(device)
            
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(images)
        loss = cross_entropy(outputs, labels)
        loss.backward()
        optimizer.step()
        
        te = time.time()
        time_train += (te - ts)
    
    times_train.append(time_train)
    
    # Measure test accuracy for each epoch
    ts = time.time()
    
    model.eval()  # evaluation モード; dropout しないなど
    
    measure_scores('train', train_loader, history)    
    measure_scores('test', test_loader, history)
    
    
    te = time.time()
    times_test.append(te - ts)
    
    print('Epoch %d: loss %.4e %.4e, Test accuracy %.6f' % (iepoch + 1,
          history['train_loss'][-1],
          history['test_loss'][-1],
          history['test_accuracy'][-1]))

    
print('Finished Training')
print('Training %.2f ± %.2f sec per epoch' % (np.mean(times_train), np.std(times_train)))
print('Test evaluation %.4f ± %.4f sec per epoch' % (np.mean(times_test), np.std(times_test)))
print('Total %.2f sec' % (np.sum(times_test) + np.sum(times_test)))

Epoch 1: loss 1.0304e-02 8.3961e-03, Test accuracy 0.983200
Epoch 2: loss 5.5120e-03 5.3586e-03, Test accuracy 0.989200
Epoch 3: loss 5.5392e-03 4.4336e-03, Test accuracy 0.991200
Epoch 4: loss 3.4171e-03 4.2596e-03, Test accuracy 0.992500
Epoch 5: loss 2.9798e-03 4.2762e-03, Test accuracy 0.991900
Epoch 6: loss 2.3169e-03 2.8138e-03, Test accuracy 0.994800
Epoch 7: loss 2.7404e-03 3.7591e-03, Test accuracy 0.993700
Epoch 8: loss 2.6596e-03 3.3486e-03, Test accuracy 0.993700
Epoch 9: loss 2.0004e-03 3.7963e-03, Test accuracy 0.993900
Epoch 10: loss 1.8190e-03 3.2798e-03, Test accuracy 0.994500
Epoch 11: loss 4.8030e-03 6.4935e-03, Test accuracy 0.989300
Epoch 12: loss 2.8197e-03 4.9198e-03, Test accuracy 0.991800
Epoch 13: loss 1.5670e-03 2.9438e-03, Test accuracy 0.995400
Epoch 14: loss 1.8783e-03 3.6530e-03, Test accuracy 0.993600
Epoch 15: loss 1.6968e-03 3.6187e-03, Test accuracy 0.995200
Epoch 16: loss 8.4926e-04 3.0151e-03, Test accuracy 0.995200
Epoch 17: loss 1.3241e-03 2.9245e

In [16]:
plt.figure(figsize=(8, 4))
plt.subplot(1, 2, 1)
plt.title('Loss')
plt.xlabel('epochs')
plt.plot(history['train_loss'], label='train')
plt.plot(history['test_loss'], label='test')
plt.legend()

plt.subplot(1, 2, 2)
plt.title('1 - Accuracy')
plt.xlabel('epochs')
plt.plot(1 - np.array(history['train_accuracy']))
plt.plot(1 - np.array(history['test_accuracy']))
plt.show()

1152

conv を追加して深くした分、精度が良くなった、0.995. Residual block によって学習が早くなったかどうかは resdual 無しで同じ深さの `PlainNet` と比べるべきだけど、まあ動いたのでいいや.