# Convolutional neural network

## Outline

1. Load dataset
2. Architectures general guidelines
3. Train and test functions
4. CNN models
 
Sources:

Deep learning
- [cs231n.stanford.edu](http://cs231n.stanford.edu/)

CNN
- [Stanford cs231n](http://cs231n.github.io/convolutional-networks/)

Pytorch
- [WWW tutorials](https://pytorch.org/tutorials/)
- [github tutorials](https://github.com/pytorch/tutorials)
- [github examples](https://github.com/pytorch/examples)

MNIST and pytorch:
- [MNIST nextjournal.com/gkoehler/pytorch-mnist](https://nextjournal.com/gkoehler/pytorch-mnist)
- [MNIST github/pytorch/examples](https://github.com/pytorch/examples/tree/master/mnist)
- [MNIST kaggle](https://www.kaggle.com/sdelecourt/cnn-with-pytorch-for-mnist)

## Architectures

Sources:

- [cv-tricks.com](https://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception)
- [zhenye-na.github.io](https://zhenye-na.github.io/2018/12/01/cnn-deep-leearning-ai-week2.html)


### LeNet

The first convolutional networks were developed by Yann LeCun in the 1990s.

<img src="figures/LeNet_Original_Image.jpg" width="1000">

### AlexNet

(2012, Alex Krizhevsky, Ilya Sutskever and Geoff Hinton)

- Deeper and bigger than LeNet.
- Stacked convolutional layers directly on top of each other (previously it was common to have a single CONV layer always immediately followed by a POOL layer).
- **ReLU (Rectified Linear Unit)** for the non-linearity, instead of tanh or sigmoid.

The advantage of ReLU over sigmoid is that it trains much faster: the derivative of the sigmoid becomes very small in the saturating region, so the updates to the weights almost vanish. This is called the **vanishing gradient problem**.
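
The saturation effect can be checked numerically. The following is a minimal sketch, assuming a handful of hand-picked input values (they are illustrative, not taken from any dataset), that uses autograd to compare the derivative of the sigmoid with that of ReLU.

```python
import torch

# Inputs ranging from the saturating regions (|x| large) to the centre
x = torch.tensor([-10.0, -2.0, 0.5, 2.0, 10.0], requires_grad=True)

# Gradient of sigmoid(x) with respect to x
torch.sigmoid(x).sum().backward()
sigmoid_grad = x.grad.clone()

# Gradient of relu(x) with respect to x
x.grad.zero_()
torch.relu(x).sum().backward()
relu_grad = x.grad.clone()

print("sigmoid'(x):", sigmoid_grad)
print("relu'(x):   ", relu_grad)
```

The sigmoid derivative never exceeds 0.25 and is close to zero at both ends, whereas the ReLU derivative is exactly 1 for every positive input.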

- **Dropout**: reduces over-fitting by placing a dropout layer after every FC layer. A dropout layer has an associated probability p and is applied to every neuron of the response map independently: it randomly switches off the activation with probability p.

<img src="figures/dropout.jpeg" width="500">

Why does dropout work?

The idea behind dropout is similar to model ensembling. Each set of switched-off neurons defines a different architecture, and all these architectures are trained in parallel, each with a weight, the weights summing to one. For n neurons attached to a dropout layer, the number of sub-architectures is 2^n. The prediction therefore amounts to an average over this ensemble of models. This provides a structured regularization of the model, which helps to avoid over-fitting. Another view is that, since the dropped neurons are chosen at random, neurons tend to avoid developing co-adaptations among themselves, which pushes each of them to learn meaningful features independently of the others.
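
As an illustration, here is a minimal sketch of how `nn.Dropout` behaves in training and evaluation mode. The input of ones and the seed are arbitrary choices made for this example. Note that PyTorch rescales the surviving activations by 1/(1-p) during training, so that no rescaling is needed at prediction time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5
drop = nn.Dropout(p=p)
x = torch.ones(1000)

# Training mode: each activation is zeroed with probability p,
# the survivors are scaled by 1 / (1 - p)
drop.train()
y_train = drop(x)

# Evaluation mode: dropout is the identity
drop.eval()
y_eval = drop(x)

print("fraction dropped in train mode:", (y_train == 0).float().mean().item())
print("values in train mode:", y_train.unique().tolist())
print("identity in eval mode:", torch.equal(y_eval, x))
```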


### GoogLeNet

(2014, Szegedy et al. from Google)

- Its main contribution was the **Inception module**, which dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M).
- There are also several follow-up versions of GoogLeNet, most recently Inception-v4.

### VGGNet

(2014, Karen Simonyan and Andrew Zisserman)

- 16 CONV/FC layers and, appealingly, an extremely homogeneous architecture.
- Only performs 3x3 convolutions and 2x2 pooling from beginning to end. It replaces the large kernels of AlexNet (11 and 5 in the first and second convolutional layers, respectively) with multiple 3x3 kernels stacked one after another.
- For a given receptive field (the effective area of the input image on which an output depends), a stack of small kernels is better than a single large kernel: the additional non-linear layers increase the depth of the network, which enables it to learn more complex features, and at a lower cost. For example, three 3x3 filters stacked with stride 1 have a receptive field of 7, but with C input and output channels they involve 3*(9C^2) = 27C^2 parameters, compared to 49C^2 for a single kernel of size 7.
- Uses a lot more memory and parameters (140M).
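
The comparison between three stacked 3x3 kernels and one 7x7 kernel can be verified directly. This is a minimal sketch, assuming C = 64 input and output channels and no bias terms (both are choices made for this example).

```python
import torch
import torch.nn as nn

C = 64  # assumed number of input and output channels

# Three stacked 3x3 convolutions (biases omitted to keep the count simple)
stack_3x3 = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, bias=False) for _ in range(3)])
# One 7x7 convolution covering the same receptive field
single_7x7 = nn.Conv2d(C, C, kernel_size=7, bias=False)

n_stack = sum(p.numel() for p in stack_3x3.parameters())    # 3 * 9 * C^2
n_single = sum(p.numel() for p in single_7x7.parameters())  # 49 * C^2

# A 7x7 input patch is reduced to a single output pixel in both cases,
# which shows that both see a 7x7 receptive field
patch = torch.zeros(1, C, 7, 7)
shape_stack = tuple(stack_3x3(patch).shape)
shape_single = tuple(single_7x7(patch).shape)

print("3 x (3x3):", n_stack, "parameters, output", shape_stack)
print("1 x (7x7):", n_single, "parameters, output", shape_single)
```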

### ResNet

(2015, Kaiming He et al.)

Residual Networks feature:

- Skip connections.
- Heavy use of batch normalization.
- They were the state-of-the-art CNN models and the default choice as of May 10, 2016.
- More recent developments tweak the original architecture: see Kaiming He et al., Identity Mappings in Deep Residual Networks (published March 2016).

[Models in pytorch](https://github.com/pytorch/vision/tree/master/torchvision/models)


## Architectures general guidelines

- ConvNets stack CONV, POOL and FC layers.
- Trend towards smaller filters and deeper architectures: stack 3x3 filters instead of 5x5.
- Trend towards getting rid of POOL/FC layers (just CONV).
- Historically, architectures looked like `[(CONV-RELU) x N, POOL?] x M, (FC-RELU) x K, SOFTMAX`, where N is usually up to ~5, M is large and 0 <= K <= 2.
- Recent advances such as ResNet and GoogLeNet have challenged this paradigm.
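
The historical pattern can be written as a small generator of `nn.Sequential` models. This is only a sketch: the helper `make_convnet`, its default widths and the hidden size of 128 are hypothetical choices made for this example, and the pooling layer is always included.

```python
import torch
import torch.nn as nn

def make_convnet(in_channels=1, width=16, n=2, m=2, k=1, in_size=28, n_classes=10):
    """Hypothetical helper: [(CONV-RELU) x n, POOL] x m, (FC-RELU) x k, FC."""
    layers, channels, size = [], in_channels, in_size
    for _ in range(m):
        for _ in range(n):
            # 3x3 convolution with padding 1 keeps the spatial size
            layers += [nn.Conv2d(channels, width, kernel_size=3, padding=1), nn.ReLU()]
            channels = width
        layers.append(nn.MaxPool2d(2))  # halves the spatial size
        size //= 2
        width *= 2                      # double the channels after each pooling
    layers.append(nn.Flatten())
    features = channels * size * size
    for _ in range(k):
        layers += [nn.Linear(features, 128), nn.ReLU()]
        features = 128
    # Final layer outputs logits; the softmax is applied by the loss function
    layers.append(nn.Linear(features, n_classes))
    return nn.Sequential(*layers)

model = make_convnet()
out = model(torch.zeros(4, 1, 28, 28))
print(out.shape)
```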

## MNIST digit classification

In [None]:
import os
import numpy as np
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

import matplotlib.pyplot as plt

Set working directory

In [None]:
import tempfile
WD = os.path.join(tempfile.gettempdir(), "dl_cnn_mnist_pytorch")
os.makedirs(WD, exist_ok=True)
os.chdir(WD)
print("Working dir is:", os.getcwd())
os.makedirs("data", exist_ok=True)
os.makedirs("models", exist_ok=True)

Hyperparameters

In [None]:
n_epochs = 5
batch_size_train = 64
batch_size_test = 1000
learning_rate = 0.01
momentum = 0.5
log_interval = 10
random_seed = 1
no_cuda = True

use_cuda = not no_cuda and torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

Load dataset: MNIST Handwritten Digit

In [None]:
def load_mnist(batch_size_train, batch_size_test):
    
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=batch_size_train, shuffle=True)
    
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('data', train=False, transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])),
        batch_size=batch_size_test, shuffle=True)
    return train_loader, test_loader

train_loader, test_loader = load_mnist(batch_size_train, batch_size_test)
data_shape = train_loader.dataset.data.shape[1:]
D_in = np.prod(data_shape)
D_out = len(train_loader.dataset.targets.unique())

## Train and test functions

In [None]:
def train(model, train_loader, optimizer, epoch, device, log_interval=10, batch_max=np.inf, save_model=True):
    train_losses, train_counter = list(), list()
    # epoch = 1; log_interval=10; train_losses=[]; train_counter=[]

    model.train()

    # Iterate over minibatch
    for batch_idx, (data, target) in enumerate(train_loader):
        if batch_idx >= batch_max:  # stop once batch_max mini-batches have been processed
            break
        # batch_idx, (data, target) = next(enumerate(train_loader))
        # print(data.shape)
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
    
        # Forward
        output = model(data)
        loss = F.nll_loss(output, target)
    
        # Backward
        loss.backward()

        # Update params
        optimizer.step()
        
        # Track losses
        train_losses.append(loss.item())
        train_counter.append(data.shape[0]) # (batch_idx * data.shape[0]) + ((epoch-1)*len(train_loader.dataset)))

        # Log progress and save model
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            
            if save_model:
                torch.save(model.state_dict(), 'models/mod-%s.pth' % model.__class__.__name__)
                torch.save(optimizer.state_dict(), 'models/mod-%s_opt-%s.pth' % (model.__class__.__name__, optimizer.__class__.__name__))

    return model, train_losses, train_counter


def test(model, test_loader, device, batch_max=np.inf):

    model.eval()

    test_loss = 0
    correct = 0
    output, pred, target = list(), list(), list()

    # Iterate over mini-batches
    with torch.no_grad():
        for batch_idx, (data, target_) in enumerate(test_loader):
            if batch_idx >= batch_max:  # stop once batch_max mini-batches have been processed
                break
            # batch_idx, (data, target) = next(enumerate(test_loader))
            # print(target_.shape)
            data, target_ = data.to(device), target_.to(device) # target.shape == 1000
            output_ = model(data) # output.shape == (1000, 10)
            
            # Compute loss
            test_loss += F.nll_loss(output_, target_, reduction='sum').item() # sum up batch loss
            pred_ = output_.argmax(dim=1) # get the index of the max log-probability
            
            # Count correct classifications
            correct += pred_.eq(target_.view_as(pred_)).sum().item() # view_as(other): View this tensor as the same size as other

            # Track output, class-prediction and true target
            output.append(output_)
            pred.append(pred_)
            target.append(target_)

    output = torch.cat(output)
    pred = torch.cat(pred)
    target = torch.cat(target)
    assert pred.eq(target.view_as(pred)).sum().item() == correct

    test_loss /= len(target)
    print('Average loss: {:.4f}, Accuracy: {}/{} ({:.1f}%)'.format(
        test_loss, correct, len(target),
        100. * correct / len(target)))
    return pred, output, target, test_loss
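
The functions above use `F.nll_loss`, which expects log-probabilities, so the models they are applied to must end with `F.log_softmax`. The following minimal sketch, using random logits and targets made up for the example, checks that this pairing gives the same value as `F.cross_entropy` applied to the raw logits.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)           # raw scores for 8 samples and 10 classes
target = torch.randint(0, 10, (8,))   # random class labels

# log_softmax followed by the negative log likelihood loss
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), target)
# cross_entropy performs both steps at once
loss_ce = F.cross_entropy(logits, target)

print(loss_nll.item(), loss_ce.item())
```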

## CNN models

### LeNet-5

Here we implement LeNet-5 with ReLU activations
([Source](https://github.com/bollakarthikeya/LeNet-5-PyTorch/blob/master/lenet5_cpu.py)).



In [None]:
# Defining the network (LeNet-5)  
class LeNet5(torch.nn.Module):
     
    def __init__(self):   
        super(LeNet5, self).__init__()
        # Convolution (In LeNet-5, 32x32 images are given as input. Hence padding of 2 is done below)
        self.conv1 = torch.nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2, bias=True)
        # Max-pooling
        self.max_pool_1 = torch.nn.MaxPool2d(kernel_size=2)
        # Convolution
        self.conv2 = torch.nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0, bias=True)
        # Max-pooling
        self.max_pool_2 = torch.nn.MaxPool2d(kernel_size=2)
        # Fully connected layer
        self.fc1 = torch.nn.Linear(16*5*5, 120)   # convert matrix with 16*5*5 (= 400) features to a matrix of 120 features (columns)
        self.fc2 = torch.nn.Linear(120, 84)       # convert matrix with 120 features to a matrix of 84 features (columns)
        self.fc3 = torch.nn.Linear(84, 10)        # convert matrix with 84 features to a matrix of 10 features (columns)
        
    def forward(self, x):
        # convolve, then perform ReLU non-linearity
        x = torch.nn.functional.relu(self.conv1(x))  
        # max-pooling with 2x2 grid
        x = self.max_pool_1(x)
        # convolve, then perform ReLU non-linearity
        x = torch.nn.functional.relu(self.conv2(x))
        # max-pooling with 2x2 grid
        x = self.max_pool_2(x)
        # first flatten 'max_pool_2_out' to contain 16*5*5 columns
        # read through https://stackoverflow.com/a/42482819/7551231
        x = x.view(-1, 16*5*5)
        # FC-1, then perform ReLU non-linearity
        x = torch.nn.functional.relu(self.fc1(x))
        # FC-2, then perform ReLU non-linearity
        x = torch.nn.functional.relu(self.fc2(x))
        # FC-3
        x = self.fc3(x)
        
        return F.log_softmax(x, dim=1)
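
The `16*5*5` input size of the first fully connected layer follows from the spatial size after each layer. A minimal sketch of that arithmetic, where `conv_out` is a hypothetical helper implementing the usual formula `floor((size + 2*padding - kernel) / stride) + 1`:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                   # MNIST images are 28x28
size = conv_out(size, kernel=5, padding=2)  # conv1      -> 28
size = conv_out(size, kernel=2, stride=2)   # max_pool_1 -> 14
size = conv_out(size, kernel=5)             # conv2      -> 10
size = conv_out(size, kernel=2, stride=2)   # max_pool_2 -> 5
n_features = 16 * size * size               # 16 channels of 5x5 maps

print(size, n_features)
```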

A variant: increase the number of channels of the conv layers, remove one FC layer, and simplify by using no padding.

In [None]:
class LeNet5bis(nn.Module):
    def __init__(self):
        super(LeNet5bis, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1) 
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(50*4*4, 120) # 50*4*4 = 800 input features
        self.fc2 = nn.Linear(120, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x)) # 20 x 24x24 # No padding, lose 4, since kernel size = 5
        x = F.max_pool2d(x, 2, 2) # 20 x 12x12
        x = F.relu(self.conv2(x)) # 50 x 8x8  # No padding, lose 4, since kernel size = 5
        x = F.max_pool2d(x, 2, 2) # 50 x 4x4
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

### Conv-relu blocks

In [None]:
# Defining the network: two blocks of (conv, leaky ReLU) x 2 followed by max-pooling
class ConvNet2Block(torch.nn.Module):
     
    def __init__(self):   
        super(ConvNet2Block, self).__init__()

        # Conv block 1
        #self.conv11 = nn.Conv2d(in_channels=1, out_channels=20, kernel_size=3, stride=1, padding=1, bias=True)
        #self.conv12 = nn.Conv2d(in_channels=20, out_channels=20, kernel_size=3, stride=1, padding=1, bias=True)
        self.conv11 = nn.Conv2d(in_channels=1, out_channels=20, kernel_size=3, stride=1, padding=0, bias=True)
        self.conv12 = nn.Conv2d(in_channels=20, out_channels=20, kernel_size=3, stride=1, padding=0, bias=True)
        self.max_pool_1 = nn.MaxPool2d(kernel_size=2) # output 20*12*12

        # Conv block 2
        self.conv21 = nn.Conv2d(in_channels=20, out_channels=50, kernel_size=3, stride=1, padding=0, bias=True)
        self.conv22 = nn.Conv2d(in_channels=50, out_channels=50, kernel_size=3, stride=1, padding=0, bias=True)
        self.max_pool_2 = nn.MaxPool2d(kernel_size=2) # output 50*4*4

        # Fully connected layer
        self.fc1 = nn.Linear(50*4*4, 120)
        self.fc2 = nn.Linear(120, 10)
    
    def forward(self, x):
        x = nn.functional.leaky_relu(self.conv11(x))
        x = nn.functional.leaky_relu(self.conv12(x))
        x = self.max_pool_1(x)

        x = nn.functional.leaky_relu(self.conv21(x))
        x = nn.functional.leaky_relu(self.conv22(x))
        x = self.max_pool_2(x)
        
        # first flatten the output of 'max_pool_2' to contain 50*4*4 columns
        # read through https://stackoverflow.com/a/42482819/7551231
        x = x.view(-1, 50*4*4)
        # FC-1, then perform leaky ReLU non-linearity
        x = nn.functional.leaky_relu(self.fc1(x))
        x = self.fc2(x)
        
        return F.log_softmax(x, dim=1)
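
To see where the parameters of these models come from, the layers can be counted individually. This is a minimal sketch in which `count_parameters` is a hypothetical helper, applied to two layer shapes used in the models above.

```python
import torch.nn as nn

def count_parameters(module):
    """Hypothetical helper: total number of elements over all parameter tensors."""
    return sum(p.numel() for p in module.parameters())

conv = nn.Conv2d(1, 20, kernel_size=5)  # 1*20*5*5 weights + 20 biases
fc = nn.Linear(50 * 4 * 4, 120)         # 800*120 weights + 120 biases

n_conv = count_parameters(conv)
n_fc = count_parameters(fc)
print("conv:", n_conv, "fc:", n_fc)
```

The first fully connected layer holds far more parameters than the convolution, which is typical for this kind of architecture.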

In [None]:
n_epochs = 5

#model = LeNet5() # 98.4% with 61706 parameters
model = LeNet5bis() # 99.1% with 122900 parameters
#model = ConvNet2Block() # 98.7% with  132750 parameters
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

# Explore the model
for parameter in model.parameters():
    print(parameter.shape)

print("Total number of parameters =", np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))
    
train_losses, train_counter, test_losses, test_counter = [], [], [], []
for epoch in range(1, n_epochs + 1):

    print("Train: ", end = '')
    model, train_losses_, train_counter_ = train(model, train_loader, optimizer, epoch, device, log_interval)
    train_losses += train_losses_
    train_counter += train_counter_
    
    print("Test : ", end = '')
    pred, output, target, test_loss = test(model, test_loader, device)
    test_counter.append(np.sum(train_counter))
    test_losses.append(test_loss)
    
    # Train accuracy
    pred_train, output_train, target_train, loss_train = test(model, train_loader, device)
    #print("Train accuracy = {:.1f}%".format((target_train == pred_train).sum().item() * 100. / len(target_train)))
    #print("Test accuracy = {:.1f}%".format((target == pred).sum().item() * 100. / len(target)))

fig = plt.figure()
plt.plot(np.cumsum(train_counter), train_losses, color='blue')
plt.plot(test_counter, test_losses, "or")
plt.legend(['Train Loss', 'Test Loss'], loc='upper right')
plt.xlabel('number of training examples seen')
plt.ylabel('negative log likelihood loss')

## Reduce the size of training dataset

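Reduce the size of the training dataset by considering only a subset of the mini-batches in each epoch.
Reduce the batch size to `16`, and consider `8` mini-batches for training. Since the training loader shuffles the data, each epoch draws a different random subset.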

In [None]:
train_loader, test_loader = load_mnist(16, batch_size_test)

In [None]:
n_epochs = 50
n_batch = 8

#model = LeNet5() # 89.2% with 61706 parameters
#model = LeNet5bis() # 93.4%
model = ConvNet2Block() # 92.4%
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

# Explore the model
for parameter in model.parameters():
    print(parameter.shape)

print("Total number of parameters =", np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))
    
train_losses, train_counter, test_losses, test_counter = [], [], [], []
for epoch in range(1, n_epochs + 1):
    print()
    model, train_losses_, train_counter_ = train(model, train_loader, optimizer, epoch, device, log_interval=100,
                                                 batch_max=n_batch, save_model=False)
    train_losses += train_losses_
    train_counter += train_counter_
    
    print("Test : ", end = '')
    pred_test, output_test, target_test, loss_test = test(model, test_loader, device)
    test_counter.append(np.sum(train_counter))
    test_losses.append(loss_test)
    
    # Train accuracy
    print("Train: ", end = '')
    pred_train, output_train, target_train, loss_train = test(model, train_loader, device, batch_max=n_batch)

    #print("Train accuracy = {:.1f}%".format((target_train == pred_train).sum().item() * 100. / len(target_train)))
    #print("Test accuracy = {:.1f}%".format((target_test == pred_test).sum().item() * 100. / len(target_test)))
    
fig = plt.figure()
plt.plot(np.cumsum(train_counter), train_losses, color='blue')
plt.plot(test_counter, test_losses, "or")
plt.legend(['Train Loss', 'Test Loss'], loc='upper right')
plt.xlabel('number of training examples seen')
plt.ylabel('negative log likelihood loss')
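
An alternative way to reduce the training set, which is not what the cells above do, is to fix the subset once with `torch.utils.data.Subset`, so that every epoch sees the same samples. This is a minimal sketch, assuming a synthetic dataset as a stand-in for MNIST so that it runs without a download.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Synthetic stand-in for MNIST (random images and labels)
full = TensorDataset(torch.randn(1000, 1, 28, 28), torch.randint(0, 10, (1000,)))

# Draw 8 * 16 = 128 sample indices once, with a fixed seed
g = torch.Generator().manual_seed(1)
indices = torch.randperm(len(full), generator=g)[:128].tolist()

small = Subset(full, indices)
small_loader = DataLoader(small, batch_size=16, shuffle=True)
n_batches = len(small_loader)

print(len(small), "samples in", n_batches, "mini-batches")
```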

## ResNet on CIFAR-10

[Source Yunjey Choi](https://github.com/yunjey/pytorch-tutorial)

In [None]:
WD = os.path.join(tempfile.gettempdir(), "dl_cnn_cifar10_pytorch")
os.makedirs(WD, exist_ok=True)
os.chdir(WD)
print("Working dir is:", os.getcwd())
os.makedirs("data", exist_ok=True)
os.makedirs("models", exist_ok=True)

Load CIFAR-10 dataset

In [None]:
# ---------------------------------------------------------------------------- #
# An implementation of https://arxiv.org/pdf/1512.03385.pdf                    #
# See section 4.2 for the model architecture on CIFAR-10                       #
# Some part of the code was referenced from below                              #
# https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py   #
# ---------------------------------------------------------------------------- #

import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms


# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
num_epochs = 5
learning_rate = 0.001

# Image preprocessing modules
transform = transforms.Compose([
    transforms.Pad(4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32),
    transforms.ToTensor()])

# CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root='data/',
                                             train=True, 
                                             transform=transform,
                                             download=True)

test_dataset = torchvision.datasets.CIFAR10(root='data/',
                                            train=False, 
                                            transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=100, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=100, 
                                          shuffle=False)

How many classes to predict?

In [None]:
D_out = len(set(train_loader.dataset.targets))
print(D_out)

ResNet Model:

Stack multiple resnet blocks



Resnet block variants ([Source](http://torch.ch/blog/2016/02/04/resnets.html)):
<img src="figures/resnets_modelvariants.png" width="1000">

In [None]:
# 3x3 convolution
def conv3x3(in_channels, out_channels, stride=1):
    return nn.Conv2d(in_channels, out_channels, kernel_size=3, 
                     stride=stride, padding=1, bias=False)

# Residual block
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResidualBlock, self).__init__()
        self.conv1 = conv3x3(in_channels, out_channels, stride)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(out_channels, out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample
        
    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out

# ResNet
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        super(ResNet, self).__init__()
        self.in_channels = 16
        self.conv = conv3x3(3, 16)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU(inplace=True)
        self.layer1 = self.make_layer(block, 16, layers[0])
        self.layer2 = self.make_layer(block, 32, layers[1], 2)
        self.layer3 = self.make_layer(block, 64, layers[2], 2)
        self.avg_pool = nn.AvgPool2d(8)
        self.fc = nn.Linear(64, num_classes)
        
    def make_layer(self, block, out_channels, blocks, stride=1):
        downsample = None
        if (stride != 1) or (self.in_channels != out_channels):
            downsample = nn.Sequential(
                conv3x3(self.in_channels, out_channels, stride=stride),
                nn.BatchNorm2d(out_channels))
        layers = []
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        self.in_channels = out_channels
        for i in range(1, blocks):
            layers.append(block(out_channels, out_channels))
        return nn.Sequential(*layers)
    
    def forward(self, x):
        out = self.conv(x)
        out = self.bn(out)
        out = self.relu(out)
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.avg_pool(out)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out
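
The choice of `nn.AvgPool2d(8)` and `nn.Linear(64, num_classes)` follows from the feature-map sizes on 32x32 CIFAR-10 images. The sketch below reproduces only the stem and the two stride-2 stages with plain convolutions (an assumption made to keep the example short; the residual blocks do not change these shapes).

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 3, 32, 32)  # one CIFAR-10 sized image

stem = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False)    # 16 x 32x32
down1 = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1, bias=False)  # 32 x 16x16
down2 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1, bias=False)  # 64 x 8x8

h = down2(down1(stem(x)))
pooled = nn.AvgPool2d(8)(h)             # 64 x 1x1
flat = pooled.view(pooled.size(0), -1)  # 64 features per image

print(tuple(h.shape), tuple(pooled.shape), tuple(flat.shape))
```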

Train and test functions

In [None]:
# For updating learning rate
def update_lr(optimizer, lr):    
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

# Train the model
def train(model, train_loader, optimizer, epoch, device, learning_rate):
    # learning_rate is kept in the signature; the optimizer holds the current value
    model.train()  # test() switches to eval mode, so switch back (matters for batch norm)

    # CrossEntropyLoss() combines nn.LogSoftmax() and nn.NLLLoss() in one single class.
    criterion = nn.CrossEntropyLoss()

    total_step = len(train_loader)
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print ("Epoch [{}/{}], Step [{}/{}] Loss: {:.4f}"
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

def test(model, test_loader, device):
    # Test the model
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in test_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        print('Accuracy of the model on the test images: {} %'.format(100 * correct / total))

    return correct / total
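
Because the network contains batch normalization layers, the mode set by `model.train()` and `model.eval()` changes its output: in training mode the batch statistics are used, in evaluation mode the running statistics. A minimal sketch with a single `nn.BatchNorm2d` layer and a random input (both chosen only for this example):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)
x = torch.randn(16, 3, 8, 8) * 5 + 2  # batch with mean about 2 and std about 5

bn.train()
y_train = bn(x)  # normalized with the statistics of this batch

bn.eval()
y_eval = bn(x)   # normalized with the running statistics

print("train mode, per-channel mean:", y_train.mean(dim=(0, 2, 3)))
print("eval mode, per-channel mean: ", y_eval.mean(dim=(0, 2, 3)))
```

After a call to `model.eval()`, a training loop therefore needs to call `model.train()` again before the next epoch.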

In [None]:
num_epochs = 5

model = ResNet(ResidualBlock, [2, 2, 2]).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
curr_lr = learning_rate

# Explore the model
for parameter in model.parameters():
    print(parameter.shape)

print("Total number of parameters =", np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))

acc_train, acc_test = [], []

for epoch in range(num_epochs):
    train(model, train_loader, optimizer, epoch, device, learning_rate)
    
    print("Train: ", end = '')
    acc_train_ = test(model, train_loader, device)
    print("Test : ", end = '')
    acc_test_ = test(model, test_loader, device)
    acc_train.append(acc_train_)
    acc_test.append(acc_test_)
    
    # Decay learning rate
    if (epoch+1) % 20 == 0:
        curr_lr /= 3
        update_lr(optimizer, curr_lr)

    # Save the model checkpoint
    torch.save(model.state_dict(), 'models/mod-%s.ckpt' % model.__class__.__name__)
    torch.save(optimizer.state_dict(), 'models/mod-%s_opt-%s.pth' % (model.__class__.__name__, optimizer.__class__.__name__))
    # save the current learning rate
    np.save("models/resnet_lr_epoch.npy", np.array([curr_lr, epoch]))


## Toward transfer learning

In [None]:
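# Load a ResNet-18 pretrained on ImageNet and freeze all of its parameters
# (recent torchvision versions prefer the `weights=` argument over `pretrained=True`)
model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False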


# Parameters of newly constructed modules have requires_grad=True by default
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, D_out)

model = model.to(device)

criterion = nn.CrossEntropyLoss()

# Observe that only parameters of final layer are being optimized as
# opposed to before.
optimizer = optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)

# Decay LR by a factor of 0.1 every 7 epochs
#from torch.optim import lr_scheduler
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
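
The effect of `StepLR` can be inspected without training anything. This is a minimal sketch, assuming a dummy linear model in place of the network, that records the learning rate over 15 epochs with the same `step_size` and `gamma` as above.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # dummy model, only needed to build an optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

lrs = []
for epoch in range(15):
    lrs.append(optimizer.param_groups[0]['lr'])
    optimizer.step()   # stands in for one epoch of training
    scheduler.step()   # called once per epoch, after the optimizer

print(lrs)
```

The learning rate stays at 0.001 for epochs 0 to 6, drops to 0.0001 for epochs 7 to 13, and is divided by 10 again at epoch 14.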

In [None]:
# Explore the model
for parameter in model.parameters():
    print(parameter.shape)

print("Total number of parameters =", np.sum([np.prod(parameter.shape) for parameter in model.parameters()]))

acc_train, acc_test = [], []

for epoch in range(num_epochs):
    train(model, train_loader, optimizer, epoch, device, learning_rate)
    
    print("Train: ", end = '')
    acc_train_ = test(model, train_loader, device)
    lr_scheduler.step()

    print("Test : ", end = '')
    acc_test_ = test(model, test_loader, device)
    acc_train.append(acc_train_)
    acc_test.append(acc_test_)
    
    # The learning rate decay is handled by lr_scheduler.step() above

    # Save the model checkpoint
    # torch.save(model.state_dict(), 'models/mod-%s.ckpt' % model.__class__.__name__)
    # torch.save(optimizer.state_dict(), 'models/mod-%s_opt-%s.pth' % (model.__class__.__name__, optimizer.__class__.__name__))
    # save the current learning rate
    # np.save("models/resnet_lr_epoch.npy", np.array([curr_lr, epoch]))
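
To check that only the new classification head is trained after freezing, the trainable parameters can be listed. This is a minimal sketch in which `TinyBackbone` is a hypothetical stand-in for the pretrained network, used so that the example runs without downloading weights.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained model with a final `fc` layer."""
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(32, 16)
        self.fc = nn.Linear(16, 5)

    def forward(self, x):
        return self.fc(torch.relu(self.features(x)))

model = TinyBackbone()

# Freeze every existing parameter
for param in model.parameters():
    param.requires_grad = False

# Replace the head: new modules have requires_grad=True by default
model.fc = nn.Linear(model.fc.in_features, 10)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable, n_trainable)
```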