# Training a ConvNet PyTorch

In this notebook, you'll learn how to use the powerful PyTorch framework to specify a conv net architecture and train it on the CIFAR-10 dataset.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.utils.data import sampler
import importlib

import torchvision.datasets as dset
import torchvision.transforms as T

import numpy as np

import timeit
import customDataset
from customDataset import CustomDataset
from importlib import reload

In [2]:
reload(customDataset)

class ChunkSampler(sampler.Sampler):
    """Samples elements sequentially from some offset. 
    Arguments:
        num_samples: # of desired datapoints
        start: offset where we should start selecting from
    """
    def __init__(self, num_samples, start = 0):
        self.num_samples = num_samples
        self.start = start

    def __iter__(self):
        return iter(range(self.start, self.start + self.num_samples))

    def __len__(self):
        return self.num_samples
    
synth_dataset = CustomDataset()

-----Synth 2 Loaded------
-----Synth 3 Loaded------
tensor([[76.4000, 76.0500, 76.2000,  ..., 73.0500, 72.8000, 72.6500],
        [75.9000, 75.6500, 75.7500,  ..., 72.8500, 72.8000, 72.7000],
        [75.3500, 75.1500, 75.1000,  ..., 72.8500, 72.7500, 72.8500],
        ...,
        [71.2500, 71.8500, 72.0000,  ..., 83.3500, 83.0500, 82.3000],
        [71.1000, 71.6000, 71.6000,  ..., 83.5000, 83.3000, 82.4500],
        [70.1500, 70.5500, 70.4500,  ..., 83.2500, 83.0500, 82.2000]])
______________________________
tensor([[77.8000, 77.4500, 77.6000,  ..., 74.2000, 73.9000, 73.7500],
        [77.3000, 77.0500, 77.1500,  ..., 74.0000, 73.9000, 73.8000],
        [76.7500, 76.5500, 76.5000,  ..., 73.9500, 73.8500, 73.9500],
        ...,
        [74.7000, 75.1500, 75.2000,  ..., 82.5500, 82.4000, 81.8000],
        [74.3500, 74.9000, 74.8000,  ..., 82.6500, 82.7000, 82.0000],
        [73.1000, 73.5500, 73.3500,  ..., 82.5500, 82.5500, 81.6000]])
______________________________
tensor([[72.0500, 

AttributeError: module 'PIL.Image' has no attribute 'show'

In [9]:
train_loader = torch.utils.data.DataLoader(dataset=synth_dataset,
                                           batch_size=5, 
                                           shuffle=True)

In [10]:
dtype = torch.FloatTensor # the CPU datatype

# Constant to control how frequently we print train loss
print_every = 100

# This is a little utility that we'll use to reset the model
# if we want to re-initialize all our parameters
def reset(m):
    if hasattr(m, 'reset_parameters'):
        m.reset_parameters()

In [11]:
class Flatten(nn.Module):
    def forward(self, x):
        N, C, H, W = x.size() # read in N, C, H, W
        return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image

In [6]:
# Here's where we define the architecture of the model... 
simple_model = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=7, stride=2),
                nn.ReLU(inplace=True),
                Flatten(), # see above for explanation
                nn.Linear(5408, 10), # affine layer
              )

# Set the type of all data in this model to be FloatTensor 
simple_model.type(dtype)

loss_fn = nn.CrossEntropyLoss().type(dtype)
optimizer = optim.Adam(simple_model.parameters(), lr=1e-2) # lr sets the learning rate of the optimizer

PyTorch supports many other layer types, loss functions, and optimizers - you will experiment with these next. Here's the official API documentation for these (if any of the parameters used above were unclear, this resource will also be helpful). One note: what we call in the class "spatial batch norm" is called "BatchNorm2D" in PyTorch.

* Layers: http://pytorch.org/docs/nn.html
* Activations: http://pytorch.org/docs/nn.html#non-linear-activations
* Loss functions: http://pytorch.org/docs/nn.html#loss-functions
* Optimizers: http://pytorch.org/docs/optim.html#algorithms

In [7]:
fixed_model_base = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=7, stride=1),
                nn.ReLU(inplace=True),
                nn.BatchNorm2d(32),
                nn.MaxPool2d(2, stride=2),
                Flatten(), # see above for explanation
                nn.Linear(5408, 1024), # affine layer
                nn.ReLU(inplace=True),
                nn.Linear(1024, 10)
              )

fixed_model = fixed_model_base.type(dtype)

### GPU!

Now, we're going to switch the dtype of the model and our data to the GPU-friendly tensors, and see what happens... everything is the same, except we are casting our model and input tensors as this new dtype instead of the old one.

If this returns false, or otherwise fails in a not-graceful way (i.e., with some error message), you may not have an NVIDIA GPU available on your machine. If you're running locally, we recommend you switch to Google Colab and follow the instructions to set up a GPU there. If you're already on Google Colab, something is wrong -- make sure you followed the instructions on how to request and use a GPU on your instance. If you did, post on Piazza or come to Office Hours so we can help you debug.

In [8]:
# Verify that CUDA is properly configured and you have a GPU available

torch.cuda.is_available()

True

In [10]:
import copy
gpu_dtype = torch.cuda.FloatTensor

fixed_model_gpu = copy.deepcopy(fixed_model_base).type(gpu_dtype)

x_gpu = torch.randn(64, 3, 32, 32).type(gpu_dtype)
x_var_gpu = Variable(x.type(gpu_dtype)) # Construct a PyTorch Variable out of your input data
ans = fixed_model_gpu(x_var_gpu)        # Feed it through the model! 

# Check to make sure what comes out of your model
# is the right dimensionality... this should be True
# if you've done everything correctly
np.array_equal(np.array(ans.size()), np.array([64, 10]))

True

In [13]:
loss_fn = nn.MSELoss()
optimizer = optim.Adam(fixed_model_gpu.parameters(),lr = 1e-3)

In [14]:
reload(customDataset)

# This sets the model in "training" mode. This is relevant for some layers that may have different behavior
# in training mode vs testing mode, such as Dropout and BatchNorm. 
fixed_model_gpu.train()

# Load one batch at a time.
for t, (x, y) in enumerate(train_loader):
    x_var = Variable(x.type(gpu_dtype))
    y_var = Variable(y.type(gpu_dtype).long())

    # This is the forward pass: predict the scores for each class, for each x in the batch.
    scores = fixed_model_gpu(x_var)
    
    # Use the correct y values and the predicted y values to compute the loss.
    loss = loss_fn(scores, y_var)
    
    if (t + 1) % print_every == 0:
        print('t = %d, loss = %.4f' % (t + 1, loss.item()))

    # Zero out all of the gradients for the variables which the optimizer will update.
    optimizer.zero_grad()
    
    # This is the backwards pass: compute the gradient of the loss with respect to each 
    # parameter of the model.
    loss.backward()
    
    # Actually update the parameters of the model using the gradients computed by the backwards pass.
    optimizer.step()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 1: invalid start byte

Now you've seen how the training process works in PyTorch. To save you writing boilerplate code, we're providing the following helper functions to help you train for multiple epochs and check the accuracy of your model:

In [25]:
def train(model, loss_fn, optimizer, num_epochs = 1):
    for epoch in range(num_epochs):
        print('Starting epoch %d / %d' % (epoch + 1, num_epochs))
        model.train()
        for t, (x, y) in enumerate(train_loader):
            x_var = Variable(x.type(gpu_dtype))
            y_var = Variable(y.type(gpu_dtype))
            
            scores = model(x_var)
            scores = torch.unsqueeze(scores, 2)
            scores = scores[:,:,np.ones(3)]
            
            print("scores=", scores.shape)
            print(scores[0])
            print("y_var=",y_var.shape)
            print(y_var[0])
            
            loss = loss_fn(scores, y_var)
            
            if (t + 1) % print_every == 0:
                print('t = %d, loss = %.4f' % (t + 1, loss.item()))

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def check_accuracy(model, loader):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')   
    num_correct = 0
    num_samples = 0
    model.eval() # Put the model in test mode (the opposite of model.train(), essentially)
    for x, y in loader:
        x_var = Variable(x.type(gpu_dtype), volatile=True)

        scores = model(x_var)
        _, preds = scores.data.cpu().max(1)
        num_correct += (preds == y).sum()
        num_samples += preds.size(0)
    acc = float(num_correct) / num_samples
    print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))

### Check the accuracy of the model.

Let's see the train and check_accuracy code in action -- feel free to use these methods when evaluating the models you develop below.

You should get a training loss of around 1.2-1.4, and a validation accuracy of around 50-60%. As mentioned above, if you re-run the cells, you'll be training more epochs, so your performance will improve past these numbers.

But don't worry about getting these numbers better -- this was just practice before you tackle designing your own model.

In [None]:
torch.cuda.random.manual_seed(12345)
fixed_model_gpu.apply(reset)
train(fixed_model_gpu, loss_fn, optimizer, num_epochs=1)
check_accuracy(fixed_model_gpu, loader_val)

### Don't forget the validation set!

And note that you can use the check_accuracy function to evaluate on either the test set or the validation set, by passing either **loader_test** or **loader_val** as the second argument to check_accuracy. You should not touch the test set until you have finished your architecture and hyperparameter tuning, and only run the test set once at the end to report a final value. 

## Train a _great_ model on CIFAR-10!

Now it's your job to experiment with architectures, hyperparameters, loss functions, and optimizers to train a model that achieves **>=70%** accuracy on the CIFAR-10 **validation** set. You can use the check_accuracy and train functions from above.

### Things you should try:
- **Filter size**: Above we used 7x7; this makes pretty pictures but smaller filters may be more efficient
- **Number of filters**: Above we used 32 filters. Do more or fewer do better?
- **Pooling vs Strided Convolution**: Do you use max pooling or just stride convolutions?
- **Batch normalization**: Try adding spatial batch normalization after convolution layers and vanilla batch normalization after affine layers. Do your networks train faster?
- **Network architecture**: The network above has two layers of trainable parameters. Can you do better with a deep network? Good architectures to try include:
    - [conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]
    - [conv-relu-conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]
    - [batchnorm-relu-conv]xN -> [affine]xM -> [softmax or SVM]
- **Global Average Pooling**: Instead of flattening and then having multiple affine layers, perform convolutions until your image gets small (7x7 or so) and then perform an average pooling operation to get to a 1x1 image picture (1, 1 , Filter#), which is then reshaped into a (Filter#) vector. This is used in [Google's Inception Network](https://arxiv.org/abs/1512.00567) (See Table 1 for their architecture).
- **Regularization**: Add l2 weight regularization, or perhaps use Dropout.

### Tips for training
For each network architecture that you try, you should tune the learning rate and regularization strength. When doing this there are a couple important things to keep in mind:

- If the parameters are working well, you should see improvement within a few hundred iterations
- Remember the coarse-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.
- Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs.
- You should use the validation set for hyperparameter search, and save your test set for evaluating your architecture on the best parameters as selected by the validation set.

### Going above and beyond
If you are feeling adventurous there are many other features you can implement to try and improve your performance. You are **not required** to implement any of these; however they would be good things to try for extra credit.

- Alternative update steps: For the assignment we implemented SGD+momentum, RMSprop, and Adam; you could try alternatives like AdaGrad or AdaDelta.
- Alternative activation functions such as leaky ReLU, parametric ReLU, ELU, or MaxOut.
- Model ensembles
- Data augmentation
- New Architectures
  - [ResNets](https://arxiv.org/abs/1512.03385) where the input from the previous layer is added to the output.
  - [DenseNets](https://arxiv.org/abs/1608.06993) where inputs into previous layers are concatenated together.
  - [This blog has an in-depth overview](https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32)

If you do decide to implement something extra, clearly describe it in the "Extra Credit Description" cell below.

### What we expect
At the very least, you should be able to train a ConvNet that gets at least 70% accuracy on the validation set. This is just a lower bound - if you are careful it should be possible to get accuracies much higher than that! Extra credit points will be awarded for particularly high-scoring models or unique approaches.

You should use the space below to experiment and train your network. 

Have fun and happy training!

In [26]:
torch.cuda.empty_cache()
# Train your model here, and make sure the output of this cell is the accuracy of your best model on the 
# train, val, and test sets. Here's some code to get you started. The output of this cell should be the training
# and validation accuracy on your best model (measured by validation accuracy).
fixed_model_base = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, stride=1),        
                nn.ReLU(inplace=True),
                nn.BatchNorm2d(32),
    
                nn.Conv2d(32, 64, kernel_size=3, stride=1),        
                nn.ReLU(inplace=True),
                nn.BatchNorm2d(64),
                nn.MaxPool2d(2, stride=2),
    
                Flatten(), # see above for explanation
                nn.Linear(5408, 1024), # affine layer
                nn.ReLU(inplace=True),
                nn.BatchNorm1d(1024),
                nn.Linear(1024, 256),
                nn.ReLU(inplace=True),
                nn.BatchNorm1d(256),
                nn.Linear(256, 10),
                nn.ReLU(inplace=True),
                nn.BatchNorm1d(10)
              )

fixed_model = fixed_model_base.type(dtype)
gpu_dtype = torch.cuda.FloatTensor
fixed_model_gpu = copy.deepcopy(fixed_model).type(gpu_dtype)


model = fixed_model_gpu
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.RMSprop(fixed_model_gpu.parameters(),lr = 1e-3)

train(model, loss_fn, optimizer, num_epochs=3)
check_accuracy(model, loader_val)

Starting epoch 1 / 3
scores= torch.Size([5, 21, 3])
tensor([[ 6.6533e-01,  6.6533e-01,  6.6533e-01],
        [ 6.7023e-01,  6.7023e-01,  6.7023e-01],
        [ 3.1788e-01,  3.1788e-01,  3.1788e-01],
        [ 9.1276e-01,  9.1276e-01,  9.1276e-01],
        [-1.0107e-03, -1.0107e-03, -1.0107e-03],
        [ 1.6729e+00,  1.6729e+00,  1.6729e+00],
        [ 2.8337e-01,  2.8337e-01,  2.8337e-01],
        [-1.4446e-01, -1.4446e-01, -1.4446e-01],
        [-2.1223e-01, -2.1223e-01, -2.1223e-01],
        [ 1.0059e-01,  1.0059e-01,  1.0059e-01],
        [-5.2417e-03, -5.2417e-03, -5.2417e-03],
        [-7.6966e-01, -7.6966e-01, -7.6966e-01],
        [-3.1626e-01, -3.1626e-01, -3.1626e-01],
        [-3.2141e-01, -3.2141e-01, -3.2141e-01],
        [-7.5093e-01, -7.5093e-01, -7.5093e-01],
        [ 1.5285e+00,  1.5285e+00,  1.5285e+00],
        [-8.0646e-03, -8.0646e-03, -8.0646e-03],
        [ 1.6593e+00,  1.6593e+00,  1.6593e+00],
        [-5.4032e-01, -5.4032e-01, -5.4032e-01],
        [-4.8986e

RuntimeError: index 1 is out of bounds for dimension 2 with size 1

In [None]:
model = fixed_model
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.RMSprop(fixed_model_gpu.parameters(),lr = 1e-3)

train(model, loss_fn, optimizer, num_epochs=3)
check_accuracy(model, loader_val)

### Describe what you did 

In the cell below you should write an explanation of what you did, any additional features that you implemented, and any visualizations or graphs that you make in the process of training and evaluating your network.

I read through the information on densenets and liked the architecture of doing multiple convolutions and decreasing in size. However i tried this while still flattening and using linear to bring down the output and got abysmal results, around 15% accuracy on validation.

Naturally, I then tried the exact opposite.
I did maxpool every other convolution rather than every convolution. Then multiple affine layers bringing the output sizes down. I also did 1d batch norm after the affine layers which nearly doubled the validation accuracy.

## Test set -- run this only once

Now that we've gotten a result we're happy with, we test our final model on the test set (which you should store in best_model).  This would be the score we would achieve on a competition. Think about how this compares to your validation set accuracy.

In [None]:
best_model = model
check_accuracy(best_model, loader_test)

## Going further with PyTorch

The next assignment will make heavy use of PyTorch. You might also find it useful for your projects. 

Here's a nice tutorial by Justin Johnson that shows off some of PyTorch's features, like dynamic graphs and custom NN modules: http://pytorch.org/tutorials/beginner/pytorch_with_examples.html

If you're interested in reinforcement learning for your final project, this is a good (more advanced) DQN tutorial in PyTorch: http://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html