<img align="center" src="figures/course.png" width="800">

# 16720 (B) Neural Networks for Recognition - Assignment 3

     Instructor: Kris Kitani                       TAs: Arka, Jinkun, Rawal, Rohan, Sheng-Yu

## Q4 PyTorch (40 points)

**Please include all write-up answers below in HW3:PDF. For questions that need code, include a screenshot of the code in the PDF submission to get points.**

While you were able to derive manual back-propagation rules for sigmoid and fully-connected layers, wouldn't it be nice if someone did that for lots of useful primitives and made it fast and easy to use for general computation? Meet [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation). Since we have high-dimensional inputs (images) and low-dimensional outputs (a scalar loss), it turns out **reverse mode AD** is very efficient: a single backward pass yields the gradient of the loss with respect to every parameter. Popular autodiff packages include [pytorch](https://pytorch.org/) (Facebook), [tensorflow](https://www.tensorflow.org/) (Google), and [autograd](https://github.com/HIPS/autograd) (Boston-area academics). Autograd wraps numpy's operators and works as a drop-in replacement for numpy, except that you can now ask for gradients. The other two can act as shim layers for [cuDNN](https://developer.nvidia.com/cudnn), a library of GPU-accelerated deep learning primitives made by Nvidia for use on their GPUs. Since GPUs are able to perform large amounts of math much faster than CPUs, this makes those two packages very popular among researchers who train large networks. Tensorflow asks you to build a computational graph using its API, and then is able to pass data through that graph. PyTorch builds a dynamic graph and allows you to mix autograd functions with normal python code much more smoothly, so it is currently more popular among CMU students.
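As a small illustration of what an autodiff package does for you, the sketch below lets PyTorch compute the gradient of a scalar loss and compares it with a hand-derived gradient. The weight vector, input, and target are made-up values chosen only for this example.

```python
import torch

# A tiny "network": one weight vector and a scalar squared-error loss.
w = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
x = torch.tensor([0.5, 1.0, 2.0])
y = torch.tensor(1.0)

loss = (torch.dot(w, x) - y) ** 2   # scalar output
loss.backward()                      # one backward pass fills w.grad

# Hand-derived gradient for comparison: d/dw (w.x - y)^2 = 2 (w.x - y) x
manual_grad = 2 * (torch.dot(w, x) - y).detach() * x
```

Here `w.grad` and `manual_grad` should agree, which is the same check you would do by hand for the back-propagation rules derived earlier.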

We will use [pytorch](https://pytorch.org/) as a framework. Many computer vision projects use neural networks as a basic building block, so familiarity with one of these frameworks is a good skill to develop. Here, we basically replicate and slightly expand our handwritten character recognition networks, but do it in PyTorch instead of doing it ourselves. Feel free to use any tutorial you like, but we like [the official one](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html), [this tutorial](http://cs231n.stanford.edu/notebooks/pytorch_tutorial.ipynb) (in a jupyter notebook), or [these slides](http://cs231n.stanford.edu/slides/2018/cs231n_2018_lecture08.pdf) (starting from number 35).

**For this section, you're free to implement these however you like. All of the tasks required here are fairly small and don't require a GPU if you use small networks, including 4.2.**

### Q4.1 Train a neural network in PyTorch

#### Q4.1.1 (10 Points Code+WriteUp)
 
Re-write and re-train your fully-connected network on NIST36 in PyTorch. Plot training accuracy and loss over time.

<font color="red">**Please include your answer in HW3:PDF**</font>

<font color="red">**For this question, please also include a screenshot of your code snippets in the write-up**</font>

In [4]:
import numpy as np
import pickle
import scipy.io

In [30]:
train_data = scipy.io.loadmat('data/nist36_train.mat')
valid_data = scipy.io.loadmat('data/nist36_valid.mat')
test_data = scipy.io.loadmat('data/nist36_test.mat')

train_x, train_y = train_data['train_data'], train_data['train_labels']
valid_x, valid_y = valid_data['valid_data'], valid_data['valid_labels']
test_x, test_y = test_data['test_data'], test_data['test_labels']

In [9]:
train_x.shape

(10800, 1024)

In [107]:
train_y.shape

(10800, 36)

In [32]:
valid_x.shape

(3600, 1024)

In [31]:
test_x.shape

(1800, 1024)

The labels are one-hot encoded, so take the argmax of each row to convert them to multiclass indices.

In [73]:
target = [2]
target = torch.Tensor(target).type(torch.LongTensor)

In [75]:
train_y.shape

(10800, 36)

In [122]:
train_y_multiclass = np.argmax(train_y, axis=1)
train_y_multiclass

array([ 0,  0,  0, ..., 35, 35, 35], dtype=int64)

In [121]:
# train_y_multiclass = np.apply_over_axes(np.argmax, train_y, 1)
# train_y_multiclass

In [118]:
# train_y_multiclass_torch = np.array([torch.Tensor(x).type(torch.LongTensor) for x in train_y_multiclass])
# train_y_multiclass_torch


array([tensor([0]), tensor([0]), tensor([0]), ..., tensor([35]),
       tensor([35]), tensor([35])], dtype=object)

In [11]:
 """
    Train the network with two sequential layers: 
    (1) one layer named "layer1" with sigmoid activation
    (2) one layer named "output" with softmax activation
"""


In [59]:
import torch
import torch.nn.functional as F
from torch import nn

# define the network class
class MyNetwork(nn.Module):
    def __init__(self):
        # call constructor from superclass
        super().__init__()
        
        # define network layers
        self.layer1 = nn.Linear(1024, 64)
        self.output = nn.Linear(64, 36)
        
    def forward(self, x):
        # hidden layer with sigmoid activation
        x = torch.sigmoid(self.layer1(x.float()))
        # output layer: softmax taken in log space, because nn.NLLLoss
        # expects log-probabilities rather than probabilities
        x = F.log_softmax(self.output(x), dim=-1)
        return x

# instantiate the model
model = MyNetwork()

# print model architecture
print(model)

MyNetwork(
  (layer1): Linear(in_features=1024, out_features=64, bias=True)
  (output): Linear(in_features=64, out_features=36, bias=True)
)


In [24]:
# hyperparams, I just set these arbitrarily
learning_rate = 0.01
epochs = 30

In [25]:
import torch.optim as optim  

# create a stochastic gradient descent optimizer
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)
# create a loss function
criterion = nn.NLLLoss()
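A note on this loss: `nn.NLLLoss` expects log-probabilities as input and class indices as targets. The sketch below, on random stand-in scores rather than the real network outputs, checks that `log_softmax` followed by `NLLLoss` matches `nn.CrossEntropyLoss` on raw scores, and that passing plain softmax probabilities gives a different value.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
scores = torch.randn(4, 36)            # raw (pre-activation) outputs for 4 samples
target = torch.tensor([0, 5, 17, 35])  # class indices, not one-hot rows

# NLLLoss on log-probabilities matches CrossEntropyLoss on raw scores.
nll = nn.NLLLoss()(F.log_softmax(scores, dim=1), target)
ce = nn.CrossEntropyLoss()(scores, target)

# Feeding plain softmax probabilities gives a different (and negative) value.
nll_on_probs = nn.NLLLoss()(F.softmax(scores, dim=1), target)
```

If the loss printed during training is negative or barely moves, this mismatch is one thing worth checking.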

In [116]:
train_x.shape

(10800, 1024)

In [123]:
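# NLLLoss expects class indices, so convert every split's one-hot labels
train_y_in = train_y_multiclass
valid_y_in = np.argmax(valid_y, axis=1)
test_y_in = np.argmax(test_y, axis=1)

# the training samples are ordered by class, so shuffle them every epoch
train = torch.utils.data.TensorDataset(torch.tensor(train_x), torch.tensor(train_y_in))
train_loader = torch.utils.data.DataLoader(train, batch_size=64, shuffle=True)

# each loader must wrap its own dataset (not `train`)
valid = torch.utils.data.TensorDataset(torch.tensor(valid_x), torch.tensor(valid_y_in))
valid_loader = torch.utils.data.DataLoader(valid, batch_size=64, shuffle=False)

test = torch.utils.data.TensorDataset(torch.tensor(test_x), torch.tensor(test_y_in))
test_loader = torch.utils.data.DataLoader(test, batch_size=64, shuffle=False)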

In [68]:
# train_loader = torch.utils.data.DataLoader(
#     datasets.MNIST('../data', train=True, download=True,
#                     transform=transforms.Compose([
#                         transforms.ToTensor(),
#                         transforms.Normalize((0.1307,), (0.3081,))
#                     ])),
#     batch_size=batch_size, shuffle=True)


In [69]:
# test_loader = torch.utils.data.DataLoader(
#     datasets.MNIST('../data', train=False, transform=transforms.Compose([
#         transforms.ToTensor(),
#         transforms.Normalize((0.1307,), (0.3081,))
#     ])),
#     batch_size=batch_size, shuffle=True)

In [104]:
from torch.autograd import Variable

In [125]:
batch_size=200
learning_rate=0.01
log_interval=10

In [126]:
# run the main training loop
model.train()
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        # each NIST36 sample is a flattened 32x32 image, i.e. (batch_size, 1024)
        data = data.view(-1, 32*32)
        optimizer.zero_grad()
        net_out = model(data)              # shape (batch_size, 36)
        loss = criterion(net_out, target)  # target holds class indices, shape (batch_size,)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            # loss is a 0-dim tensor: read it with .item() (loss.data[0] raises IndexError)
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.dataset),
                           100. * batch_idx / len(train_loader), loss.item()))

In [None]:
# run a test loop
model.eval()
test_loss = 0.0
correct = 0
with torch.no_grad():  # replaces the deprecated Variable(..., volatile=True)
    for data, target in test_loader:
        data = data.view(-1, 32 * 32)
        net_out = model(data)
        # criterion returns the batch mean, so weight it by the batch size before summing
        test_loss += criterion(net_out, target).item() * len(data)
        pred = net_out.argmax(dim=1)  # index of the highest-scoring class
        correct += pred.eq(target).sum().item()

test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
    test_loss, correct, len(test_loader.dataset),
    100. * correct / len(test_loader.dataset)))

In [None]:


max_iters = 50
# pick a batch size, learning rate
batch_size = None
learning_rate = None
# YOUR CODE HERE
raise NotImplementedError()
hidden_size = 64

batches = get_random_batches(train_x,train_y,batch_size)
batch_num = len(batches)

params = {}

# initialize layers (named "layer1" and "output") here
# YOUR CODE HERE
raise NotImplementedError()


# with default settings, you should get loss < 150 and accuracy > 80%
for itr in range(max_iters):
    total_loss = 0
    total_acc = 0
    for xb,yb in batches:
        
        # training loop can be exactly the same as q2!
        # YOUR CODE HERE
        raise NotImplementedError()
    if itr % 2 == 0:
        print("itr: {:02d} \t loss: {:.2f} \t acc : {:.2f}".format(itr,total_loss,total_acc))

# run on validation set and report accuracy! should be above 70%
valid_acc = None
# YOUR CODE HERE
raise NotImplementedError()
print('Validation accuracy: ',valid_acc)

In [None]:
# YOUR CODE HERE
# a single hidden layer with 64 hidden units, and train for at least 30 epochs.
raise NotImplementedError()

#### Q4.1.2 (3 Points Code+WriteUp)
 
Train a convolutional neural network with PyTorch on MNIST. Plot training accuracy and loss over time.

<font color="red">**Please include your answer in HW3:PDF**</font>

<font color="red">**For this question, please also include a screenshot of your code snippets in the write-up**</font>
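When sizing the first fully-connected layer after a stack of convolutions, it helps to work out the spatial size layer by layer. The sketch below does that arithmetic for a hypothetical stack on 28x28 MNIST images; the layer choices and channel count are assumptions for illustration, not a required architecture.

```python
def conv_out_size(n, kernel, stride=1, padding=0):
    """Spatial size after a conv or pooling layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

# Hypothetical stack on a 28x28 MNIST image:
# conv 5x5 -> maxpool 2x2 -> conv 5x5 -> maxpool 2x2
size = 28
for kernel, stride in [(5, 1), (2, 2), (5, 1), (2, 2)]:
    size = conv_out_size(size, kernel, stride)

channels = 16                            # assumed channel count of the last conv layer
flat_features = channels * size * size   # in_features of the first fully-connected layer
```

The same helper applies to the 32x32 NIST36 images; only the starting `size` changes.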

In [None]:
# Conv2d

# YOUR CODE HERE
raise NotImplementedError()

#### Q4.1.3 (2 Points Code+WriteUp)
 
Train a convolutional neural network with PyTorch on the included NIST36 dataset. Plot training accuracy and loss over time.

<font color="red">**Please include your answer in HW3:PDF**</font>

<font color="red">**For this question, please also include a screenshot of your code snippets in the write-up**</font>
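One practical detail for this question: the NIST36 arrays loaded earlier are flat `(N, 1024)` rows, while `nn.Conv2d` expects `(batch, channels, height, width)`. The sketch below uses random stand-in data rather than the real arrays, and an arbitrary 4-channel layer, to show the reshape.

```python
import torch

# Stand-in for train_x: 8 flattened 32x32 images, same layout as the (N, 1024) arrays
flat = torch.rand(8, 1024)

# Conv2d expects (batch, channels, height, width); NIST36 is single-channel
images = flat.view(-1, 1, 32, 32)

# padding=1 with a 3x3 kernel keeps the 32x32 spatial size
conv = torch.nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
features = conv(images)
```

The reshape can be done once when building the dataset or at the top of `forward`; either way the fully-connected layers that follow need the flattened feature size.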

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### Q4.1.4 (15 Points Code+WriteUp)
 
Train a convolutional neural network with PyTorch on the EMNIST Balanced dataset (available in *torchvision.datasets*, use the *balanced* split) and evaluate it on the findLetters bounding boxes from the images folder. Report the accuracy on these bounding boxes.

<font color="red">**Please include your answer in HW3:PDF**</font>

<font color="red">**For this question, please also include a screenshot of your code snippets in the write-up**</font>
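To score predictions on the bounding boxes you need to turn class indices back into characters. The sketch below assumes the usual label order of the EMNIST Balanced split (47 classes: digits, uppercase letters, then the lowercase letters that are kept separate); please verify it against the mapping file that ships with the dataset. The helper names and the case-insensitive comparison are illustrative choices, not part of the assignment.

```python
import string

# Assumed label order of EMNIST Balanced (47 classes): digits, uppercase letters,
# then the 11 lowercase letters that are not merged with their uppercase form.
# Check this against the dataset's own mapping file before relying on it.
EMNIST_BALANCED_CLASSES = string.digits + string.ascii_uppercase + "abdefghnqrt"

def label_to_char(index):
    """Map a predicted class index (0..46) to its character."""
    return EMNIST_BALANCED_CLASSES[index]

def accuracy(pred_indices, truth):
    """Fraction of boxes whose predicted character matches the truth, ignoring case."""
    hits = sum(label_to_char(i).upper() == t.upper()
               for i, t in zip(pred_indices, truth))
    return hits / len(truth)
```

For example, `accuracy([10, 36, 1], "AAB")` counts the first two boxes as correct and the third as wrong.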

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Q4.2 Fine Tuning

#### Q4.2.1 (10 Points Code+WriteUp)
 
Fine-tune a single layer classifier using pytorch on the [flowers 17](http://www.robots.ox.ac.uk/~vgg/data/flowers/17/index.html) (or [flowers 102](http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html)!) dataset using [squeezenet1\_1](https://pytorch.org/docs/stable/torchvision/models.html), and also train from scratch an architecture you've designed yourself (*3 conv layers followed by 2 fc layers, a standard design; see [slide 6](http://cs231n.stanford.edu/slides/2018/cs231n_2018_lecture09.pdf)*). How do they compare?
    
We include a script in `scripts/` to fetch the flowers dataset and extract it in a layout that [PyTorch ImageFolder](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder) can consume from **data/oxford-flowers17** (see [an example](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#afterword-torchvision)). You should look at how SqueezeNet is [defined](https://github.com/pytorch/vision/blob/master/torchvision/models/squeezenet.py), and just replace the classifier layer. There is also a pretty good example of [fine-tuning](https://gist.github.com/jcjohnson/6e41e8512c17eae5da50aebef3378a4c) in PyTorch.

<font color="red">**Please include your answer in HW3:PDF**</font>

<font color="red">**For this question, please also include a screenshot of your code snippets in the write-up**</font>

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()