In [1]:
import numpy as np
import torch
from torch import nn
import torch.optim as optim
import torch.nn.functional as F

import matplotlib.pyplot as plt
import seaborn as sns
import time

# Homework 3: Neural Networks

Please read all instructions for this homework carefully. Many of the questions ask for specific things to be in your notebooks and many of your questions may be answered in the step-by-step instructions. You will need include _all_ parts of the questions in your answers to recieve full credit.

This homework is going to focus heavily on the practical use of neural networks. You will be required to learn the basics of PyTorch (https://pytorch.org/) to implement and train fully connected and convolutional neural networks.

We strongly recommend working through the PyTorch beginner guide [here](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html), which will both give you a good idea of how PyTorch works and provide you with some boilerplate code you can adapt for this assignment. 

## Question 1: Backpropagation

A computational graph is a framework to represent complex mathematical expressions as a directed graph of function compositions, starting from the simplest, to the most complex. For instance, the expression $e = (a+b)\times(b+1)$ can be represented as a computational graph, as shown in the figure below. Each node represents a variable, where the direction represents the flow of inputs to (intermediate) outputs. All incoming variables are collected to be operated on by a function. 

$$
c = a + b \\
d = b + 1 \\
e = c \times d
$$


In the graph below, these functions are multiplications and additions, and a sequence of simple operations leads to the relatively complex expression of $e$.

![A Computational Graph (http://colah.github.io/posts/2015-08-Backprop/)](http://colah.github.io/posts/2015-08-Backprop/img/tree-def.png)

Computational graphs allow for a computational abstraction to compute _exact_ partial derivatives of complex expressions, as a realization of the chain rule from calculus. For instance, computing $\frac{\partial e}{\partial a}$ involves essentially reversing the arrows and transporting intermediate partial derivatives back to $a$. This method is termed _backpropagation_ (short: _backprop_), and is a key ingredient of modern neural network frameworks that implement gradient descent (including PyTorch). The first-three sections of [Calculus on Computational Graphs: Backpropagation](http://colah.github.io/posts/2015-08-Backprop/) should allow reasonable understanding for the next questions.

### Part A: Computational Graph for Logistic Regression

Build the computational graph of logistic regression for a single input/output pair. As in earlier homeworks/labs, use the variables $\mathbf{w} = \begin{bmatrix}\mathbf{w}_0 & \mathbf{w}_1 & \cdots & \mathbf{w}_d\end{bmatrix}^T \in \mathbb{R}^{d+1}$ for the vector of parameters and $\mathbf{x} = \begin{bmatrix}1 & x_1 & x_2 & \cdots & x_d \end{bmatrix} \in \mathbb{R}^{d}$ for the vector input, with $y$ as the corresponding output. Write the computational graph as a sequence of algebraic equations in terms of simple unary/binary functions only (no graphic needed).

$v_{-1}= {w}, v_{0}={x}, v_1= {wx}, v_2= e^{-wx}= e^{-v_1}, v_3= 1+e^{-wx}= 1+v_2, v_4= \frac{1}{v_3}, y= v_4$

### Part B: Backprop on Logistic Regression

For the computational graph built in I., compute the derivatives of the loss in logistic regression with respect to the parameters $\mathbf{w}$ via backpropagation. Write these as a sequence of equations (no graphic needed), that clearly show _backward_ flow of gradients via the chain rule.

**Hint**: Table 3 of [this manuscript](https://arxiv.org/abs/1502.05767) may be a helpful resource.

$ \bar{v}_4= \bar{y} $ 

$\bar{v}_3 = \bar{v}_4 \frac{\partial v_4}{\partial v_3}=   \bar{v}_4 \cdot -\frac{1}{{v_3}^{2}}  $

$\bar{v}_2 = \bar{v}_3 \frac{\partial v_3}{\partial v_2}=  \bar{v}_3 \cdot 1= \bar{v}_3 $

$\bar{v}_1 = \bar{v}_2 \frac{\partial v_2}{\partial v_1}= \bar{v}_2 \cdot - e^{-v_1}$

$\bar{v}_0 = \bar{v}_1 \frac{\partial v_1}{\partial v_0} = \bar{v}_1 \cdot v_{-1}$

$\bar{v}_{-1}= \bar{v}_1 \frac{\partial v_1}{\partial v_{-1}} = \bar{v}_1 \cdot v_{0}$

### Part C: Challenges in _deep_ neural networks

Now consider a _deep_ neural network with L layers, such that the function is defined via weight matrices at each layer are defined as $W^{(l)}$ for all $l \in [1, \dots, L]$, along with $\sigma$ (sigmoid) non-linearity which is applied pointwise to the activation $h^{(l)}$ of each layer.

$$
h^{(l)} = \begin{cases}
\sigma(W^{(l)}\mathbf{x}),\quad\quad\text{if}~l = 1 \\
\sigma(W^{(l)}h^{(l-1)}),\quad\text{if}~l < L \\
W^{(l)}h^{(l-1)},\quad\quad\text{if}~l = L
\end{cases}
$$

The definition simply defines the operations at the input layer (which involves $\mathbf{x}$), hidden layers (which only involve activations at the previous layer $h^{(l-1)}$), and the output layer where no $\sigma$ (sigmoid) activation is applied. Therefore, the neural network can be recursively defined as $f_\mathbf{w}(h^{(L)})$. Why can very deep networks (i.e. large $L$) with sigmoid non-linearities be hard to learn via backpropagation?

**Hint**: Consider a two hidden layer neural network (i.e. L = 3), and think about how repeated application of chain rule affects the qualitative nature of the gradients as we backpropagate from the output layer to the input layer, as we increase L.

The maximum derivative of the sigmoid activation function is 0.25 and when the sigmoid activation function value is either too high or too low, the derivative of the sigmoid function is very close to 0. Considering a deep network, the derivative of the weight is a product of term, and this small gradient will stack up to a extremly small number. For example, if we have two layers, then the early layer would learn slower than the later hidden layer, and this is called the vansihing gradient.  


## Question 2: A Simple Dataset

The aim of this question is to implement a first neural network, and start to get an intuition about what sorts of functions neural networks are capable of producing.
First, use the provided code to generate 1000 training points and 500 testing points from the _two spirals_ dataset with `noise=1.5`.


**Note** We've also included a two functions to help you. You should _not_ alter these functions (nor should you need to).

`plotter` takes in your model and the training and testing data and will plot the predictions generated by your neural network.

`accuracy` is a simple function that will compute the accuracy of your model on some given `x` and `y` data, i.e. either `train_x` and `train_y`, or `test_x` and `test_y`. 

In [1]:
def twospirals(n_points, noise=1.5, random_state=42):
    """
     Returns the two spirals dataset.
     Note: n_points is points PER CLASS
    """
    n = np.sqrt(np.random.rand(n_points,1)) * 600 * (2*np.pi)/360
    d1x = -1.5*np.cos(n)*n + np.random.randn(n_points,1) * noise
    d1y =  1.5*np.sin(n)*n + np.random.randn(n_points,1) * noise
    return (np.vstack((np.hstack((d1x,d1y)),np.hstack((-d1x,-d1y)))),
            np.hstack((np.zeros(n_points),np.ones(n_points))))

ntrain = 1000
ntest = 500
noise = 1.5

train_x, train_y = twospirals(int(ntrain/2), noise=noise)

train_x, train_y = torch.FloatTensor(train_x), torch.FloatTensor(train_y).unsqueeze(-1)
print(train_y)
test_x, test_y = twospirals(int(ntest/2), noise=noise)
test_x, test_y = torch.FloatTensor(test_x), torch.FloatTensor(test_y).unsqueeze(-1)

NameError: name 'np' is not defined

#### Helper Functions

In [None]:
def plotter(model, train_x, train_y, test_x, test_y):
    '''
    This is just a simple plotting function, you should NOT need to change anything here
    '''
    buffer = 1.
    h = 0.1
    x_min, x_max = train_x[:, 0].min() - buffer, train_x[:, 0].max() + buffer
    y_min, y_max = train_x[:, 1].min() - buffer, train_x[:, 1].max() + buffer

    xx,yy=np.meshgrid(np.arange(x_min.cpu(), x_max.cpu(), h), 
                      np.arange(y_min.cpu(), y_max.cpu(), h))
    in_grid = torch.FloatTensor([xx.ravel(), yy.ravel()]).t()

    pred = torch.sigmoid(model(in_grid)).detach().reshape(xx.shape)
    plt.figure(figsize=(15, 10))
    cmap = sns.color_palette("crest_r", as_cmap=True)
    plt.contourf(xx, yy, pred, alpha=0.5, cmap=cmap)
    plt.title("Classifier", fontsize=24)
    cbar= plt.colorbar()
    cbar.set_label(label=r"$P(Y = 1)$", size=18)
    cbar.ax.tick_params(labelsize=18)
    plt.scatter(train_x[:, 0].cpu(), train_x[:, 1].cpu(), c=train_y[:, 0].cpu(), cmap=plt.cm.binary, alpha=0.5, label="Train")
    plt.scatter(test_x[:, 0].cpu(), test_x[:, 1].cpu(), c=test_y[:, 0].cpu(), cmap=plt.cm.binary, marker='+', s=150, label="Test")
    plt.legend(fontsize=18)
    plt.xticks([])
    plt.yticks([])
    plt.show()
    
def accuracy(model, X, Y):
    
    preds = torch.round(torch.sigmoid((model(X))))
    return 100 * len(torch.where(preds == Y)[0])/Y.numel()

### Part A: Network Definition

Write a class that defines a _fully connected_ neural network with ReLU activations. Your class should be a child class of the `torch.nn.Module` class, i.e. the first line of your class definition should be something like:
```{python}
class NeuralNetwork(nn.Module):
```
The number of layers and width of each layer is up to you - but keep this implementation flexible, as you'll need to come back and tweak your network in order to achieve the desired accuracy on our dataset.
Additionally, this network should be built to work with the two spirals dataset, meaning it should take 2 dimensional inputs and produce a 1 dimensional output. Specifically we are constructing a network $f(x): \mathbb{R}^2 \rightarrow \mathbb{R}$.

To ensure that your model is correctly defined, the last cell in your notebook for this section should contain the following lines, with the correct handle for your neural network class:

```{python}
model = your_network_class(appropriate_arguments)
print(model(train_x).shape)
```

**Answer Here**

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 
class NeuralNetwork(nn.Module):
    def __init__(self, insize):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(insize, 14)
        self.fc2 = nn.Linear(14, 6 )
        self.fc3 = nn.Linear(6, 1 )

        
    def forward(self, x):
 #       x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))

        x = self.fc3(x)
        return x
model = NeuralNetwork(2)

print(model(train_x).shape)

### Part B: Training

Write and execute code to train your neural network. You should train with SGD, using the `torch.nn.BCEWithLogitsLoss` loss function. Tune your network and training routine until you get **at least 90% training accuracy**. You should think about changing: width and depth of your network, the learning rate of your optimizer, the number of training iterations, and including bias terms in your network layers.

After your model is trained plot the predicted function using the `plotter` function. Report the _test_ accuracy achieved by your model as well as the architecture of your network - how many layers did you use, how wide was each layer, did you include a bias term, etc.


**Note**: with the right network setup you can achieve this training accuracy in less than 5 seconds on a laptop, you can use a large model that takes longer to train if you want, but be advised that you don't _need_ a large model to be successful here. 

**Answer Here**

In [None]:
print(torch.cuda.get_device_name(0))

In [None]:
batch_size= 50
training_dataset= torch.utils.data.TensorDataset(train_x, train_y)
train_dataloader = torch.utils.data.DataLoader(training_dataset, batch_size=batch_size, shuffle= True)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=0.0001)

n_epoch=5000
schedule = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=n_epoch)

def train(dataloader, model, loss_fn, optim):
    size = len(dataloader.dataset)
    for epoch in range(n_epoch):
        acc_loss = 0.
        for batch, (X, y) in enumerate(dataloader):
            pred = model(X)
            loss = loss_fn(pred, y)
            # Backpropagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            schedule.step()
        if (epoch%100==0):
          acc=accuracy(model,train_x,train_y)
          print("Epoch {:d},  Accuracy = {:.2f}".format(epoch,acc))
          if(acc>90):
            print (acc)
            break

train(train_dataloader, model, loss_fn, optimizer)
plotter(model, train_x, train_y, test_x, test_y)

## Question 3: Convolutional Networks and Image Recognition

In this question we are going to build up to larger models on larger datasets. If you don't already have access to a CUDA-enabled GPU you will need to complete this section of the homework on Google Colab. Google Colab allows you to run hardware accelerated notebooks online for free. By uploading this notebook into Colab then selecting `Runtime` -> `Change Runtime Type` -> And choosing `GPU` as your hardware accelerator you will be able to run your code on a GPU.

Before starting we need to setup our data. We'll use the `torchvision` package to handle sourcing and normalizing the CIFAR-10 dataset, which we'll need to download to continue.

In [None]:
import torchvision
import torchvision.transforms as transforms

cuda0 = torch.device('cuda:0')
device = cuda0

The code cell below sets up a data transformation which casts the data to tensors (so we can pass it through a network), and normalizes each of the 3 RGB layers of the images. The first time you run this you will see a progress bar showing that the dataset is being downloaded. 

After running this code you will have a `trainloader` and a `testloader`. These are iterable python objects that allow you to sample mini-batches of data according the `batch_size` you passed in to the `DataLoader` function call. You can loop through the full dataset by running:
```{python}
for inputs, targets in trainloader:
    ...
```

In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                          shuffle=True, num_workers=2,
                                          pin_memory=True)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=128,
                                         shuffle=False, num_workers=2,
                                         pin_memory=True)

## Part A: Network Definition

A common strategy for defining convolutional neural networks is to first pass the data through a set of convolutional layers to perform feature extraction, then to pass through a set of linear, or fully connected, layers to perform classification. 

Construct a class that defines a pytorch model to be used with CIFAR-10 containing the following layers, in order:

- A convolutional layer outputting 8 channels with a kernel size of 3
- A ReLU activation
- A max-pooling layer with kernel size of 2
- A convolutional layer outputting 16 channels with a kernel size of 3
- A ReLU activation
- A max-pooling layer with kernel size of 2
- A convolutional layer outputting 32 channels with a kernel size of 3
- A ReLU activation
- A max-pooling layer with kernel size of 2
  - After this operation you will need to reshape the outputs of the max-pooling layer to a tensor of size `batch_size` $\times$ `128`
- A linear layer with an output size of 128
- A ReLU activation
- A linear layer with an output size of 64
- A ReLU activation
- A linear layer with an output size of 10



To show that your network is correctly implemented, after you define this class, create an instance of it and show that if you pass a batch of CIFAR-10 images from `trainloader` through the network you get an output of size `batch_size` $\times$ `10`.

**Answer Here**

In [None]:
class CNN(nn.Module):
    def __init__(self, batch_size):
        super().__init__()
        self.conv1 = nn.Conv2d(3,8,3)
        self.conv2 = nn.Conv2d(8,16,3)
        self.conv3 = nn.Conv2d(16,32,3)
        self.fc1 = nn.Linear(128,128)
        self.fc2 = nn.Linear(128,64 )
        self.fc3 = nn.Linear(64,10 )
        
    def forward(self, x):
        x = torch.nn.functional.max_pool2d(F.relu(self.conv1(x)),2)
        x = torch.nn.functional.max_pool2d(F.relu(self.conv2(x)),2)
        x = torch.nn.functional.max_pool2d(F.relu(self.conv3(x)),2)
        x = torch.flatten(x, start_dim = 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

batch_CNN = 128
model = CNN(batch_CNN)
#model.to(device) 
for batch_CNN, (X, y) in enumerate(trainloader):
#  X= X.to(device)
#
  y= y.to(device)
  print(model(X).shape)
  break

## Part B: Training Loop

Using `torch.nn.CrossEntropyLoss` as your objective function and `torch.optim.SGD` as your optimizer. Train your network for 2 epochs while running on a _CPU_ using a batch size of 128. How long does it take? 

**Answer Here**

In [None]:
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(),lr=0.1)
n_epoch=2
def train_CNN_CPU(dataloader, model, loss_fn, optim, n_epoch):
    start.record()
    size = len(dataloader.dataset)
    for epoch in range(n_epoch):
        acc_loss = 0.
        for X, y in dataloader:
            pred = model(X)
            loss = loss_fn(pred, y)
            # Backpropagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    end.record()
    torch.cuda.synchronize()
    time = start.elapsed_time(end)
    print("time consumed running on CPU = {:.2f} s ".format(time/1000.0))
    

train_CNN_CPU(trainloader, model, loss_fn, optimizer,2)

## Part C: Porting to a GPU

CUDA-enabled GPUs can speed up training time signficantly, and have really enabled the widespread success of neural networks. Using a GPU, reinitialize your network and train for 2 epochs using a batch size of 128. How long does 2 epochs of training take? If you are correctly utilizing a GPU a single epoch should take only about 50-60% as much time as in Part B. 

To be sure you are doing things correctly: if you have a torch tensor, `X`, you can see what device it is on using the command `X.device`; if the tensor is stored on the GPU you will see `type='cuda'` in the output.

**Answer Here**

In [None]:
model.to(device) 

def train_CNN_GPU(dataloader, model, loss_fn, optim,n_epoch):
    start.record()
    size = len(dataloader.dataset)
    for epoch in range(n_epoch):
        acc_loss = 0.
        for X, y in dataloader:
            X = X.to(device)
            y = y.to(device)
            pred = model(X)
            loss = loss_fn(pred, y)
            # Backpropagation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    end.record()
    torch.cuda.synchronize()
    time = start.elapsed_time(end)
    print("time consumed running on GPU = {:.2f} s ".format(time/1000.0))
    

train_CNN_GPU(trainloader, model, loss_fn, optimizer,2)

## Part D: Training Your Network

Now you're going to fully train your network on CIFAR-10. Using SGD with a learning rate of 0.01 and a momentum parameter of 0.9 train your network for 30 epochs using a GPU. 

You should log the training accuracy and average loss per training input at every epoch, and test accuracy and average loss per test input every other epoch. Make sure that you are **not** training the network when evaluating the performance on testing data. 

If you have constructed your network and your training loop correctly training your network should take about 5-10 minutes, including the evaluation on the test data. If it is taking significantly longer than that make sure that you are utilizing the GPU correctly, and that you are not doing any redundant computations in your code. 

**Answer Here**

In [None]:
model = CNN(batch_CNN)
model.to(device) 
accuracy_list_train=[]
accuracy_list_test=[]
loss_list_train=[]
loss_list_test=[]
def accuracy1(model, dataloader):
    acc_loss1 = 0.
    n_correct1 = 0
    for X, y in dataloader:
        X = X.to(device)
        y = y.to(device)
        pred1 = model(X).to(device)
        loss1 = loss_fn(pred1, y)
        acc_loss1 += loss1.item()
        n_correct1 += torch.sum(torch.argmax(pred1, -1) == y)
    print("\t  testset Loss = {:.6f}; Accuracy = {:.2f}".format( acc_loss1/len(testset), 100*n_correct1.item()/len(testset)))
    accuracy_list_test.append(100*n_correct1.item()/len(testset))
    loss_list_test.append(acc_loss1/len(testset))
def train_CNN_GPU1(dataloader, model, loss_fn, optim,n_epoch):
    size = len(dataloader.dataset)
    for epoch in range(n_epoch):
        acc_loss = 0.
        n_correct = 0
        for X, y in dataloader:
            X = X.to(device)
            y = y.to(device)
            pred = model(X).to(device)
            loss = loss_fn(pred, y)
            # Backpropagation
            optimizer.zero_grad()
            loss.backward()
            acc_loss += loss.item()
            n_correct += torch.sum(torch.argmax(pred, -1) == y)
#            print(torch.argmax(pred, -1))
#            print(y)

            optimizer.step()
        print("Epoch {:d}, trainset Loss = {:.6f}; Accuracy = {:.2f}".format(epoch, acc_loss/len(trainset), 100*n_correct.item()/len(trainset)))
        accuracy_list_train.append(100*n_correct.item()/len(trainset))
        loss_list_train.append(acc_loss/len(trainset))
        accuracy1(model, testloader)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum = 0.9)
epoch1= 30
train_CNN_GPU1(trainloader, model, loss_fn, optimizer, epoch1)


## Part E: Examining Your Results

Take the stored training and testing evaluations from the previous question and make two plots:
  1. A plot of loss vs. time including curves for both training and testing,
  2. A plot of accuracy vs. time including curves for both training and testing.

You should see the gap between the performance of your model start to increase as training progresses. This gap is referred to as the _generalization gap_. Why do you think the generalization gap grows later in training, and why might this be an issue for neural networks specifically.


Please use the following function to compute the accuracy of your trained network - you should **not** need to modify this function.

In [None]:
def plot(loss_list_train,loss_list_test,accuracy_list_train,accuracy_list_test):
    fig1=plt.figure(dpi=100)
    plt.plot(loss_list_train, alpha=0.75, label="training loss")
    plt.plot(loss_list_test, alpha=0.75, label="testing loss")
    plt.ylabel("Loss")
    plt.xlabel("epoch ")
    sns.despine()
    plt.legend()
    plt.show()

    fig2=plt.figure(dpi=100)
    plt.plot(accuracy_list_train, alpha=0.75, label="training accuracy")
    plt.plot(accuracy_list_test, alpha=0.75, label="testing accuracy")
    plt.ylabel("Accuracy")
    plt.xlabel("epoch ")
    sns.despine()
    plt.legend()
    plt.show()

def cifar_accuracy(net, loader):
    acc = 0
    total_num = 0
    for inputs, labels in loader:
        inputs, labels = inputs.cuda(), labels.cuda()
        total_num += inputs.shape[0]
        with torch.no_grad():
            outputs = net(inputs)

            _, predicted = torch.max(outputs.data, 1)
            acc += (predicted == labels).sum().item()

    return acc/total_num
    
plot(loss_list_train,loss_list_test,accuracy_list_train,accuracy_list_test)


**Answer Here**

## Part F: Closing the generalization gap

Re-initialize and re-train your CNN in a way that will reduce the overall generalization gap. There are a lot of ways you can do this, including writing your own custom loss function to use, or simply making the right modifications to the optimizer you are using. 

Again make the following two plots:
  1. A plot of loss vs. time including curves for both training and testing,
  2. A plot of accuracy vs. time including curves for both training and testing.

What do you observe?

A note: you may not see dramatic improvement here, because we are using a fairly small model, however this type of thinking and the problem of closing the generalization gap is an important consideration in how we train very large state of the art models.

**Answer Here**

In [None]:
model = CNN(batch_CNN)
model.to(device) 
accuracy_list_train1=[]
accuracy_list_test1=[]
loss_list_train1=[]
loss_list_test1=[]
def accuracy2(model, dataloader):
    acc_loss1 = 0.
    n_correct1 = 0
    for X, y in dataloader:
        X = X.to(device)
        y = y.to(device)
        pred1 = model(X).to(device)
        loss1 = loss_fn(pred1, y)
        acc_loss1 += loss1.item()
        n_correct1 += torch.sum(torch.argmax(pred1, -1) == y)
    print("\t  testset Loss = {:.6f}; Accuracy = {:.2f}".format( acc_loss1/len(testset), 100*n_correct1.item()/len(testset)))
    accuracy_list_test1.append(100*n_correct1.item()/len(testset))
    loss_list_test1.append(acc_loss1/len(testset))
def train_CNN_GPU2(dataloader, model, loss_fn, optim,n_epoch):
    size = len(dataloader.dataset)
    for epoch in range(n_epoch):
        acc_loss = 0.
        n_correct = 0
        for X, y in dataloader:
            X = X.to(device)
            y = y.to(device)
            pred = model(X).to(device)
            loss = loss_fn(pred, y)
            # Backpropagation
            optimizer.zero_grad()
            loss.backward()
            acc_loss += loss.item()
            n_correct += torch.sum(torch.argmax(pred, -1) == y)
#            print(torch.argmax(pred, -1))
#            print(y)

            optimizer.step()
            schedule.step()
        print("Epoch {:d}, trainset Loss = {:.6f}; Accuracy = {:.2f}".format(epoch, acc_loss/len(trainset), 100*n_correct.item()/len(trainset)))
        accuracy_list_train1.append(100*n_correct.item()/len(trainset))
        loss_list_train1.append(acc_loss/len(trainset))
        accuracy2(model, testloader)
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum = 0.9, weight_decay=0.001)
schedule = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=n_epoch)
epoch1= 30
train_CNN_GPU2(trainloader, model, loss_fn, optimizer, epoch1)
plot(loss_list_train,loss_list_test,accuracy_list_train,accuracy_list_test)


I use the adapative learning rate, and set the initial learning rate higher as it might be helpful to find a wider local minima and better generalization. Second, I set a weight decay value of 0.001 which is a regularization term to prevent the model from overfitting and therefore generalize better. When I train the network, the accuracy rate goes from (76.00, 65.86) to (73.30, 67.05)

## Part G: Final Step!

Now that we've seen how well even simple CNNs can perform at image classification, let's go back and use our trained models to understand what convolutional layers are doing to help classify images. 

In this step you'll need to take a single image and pass it through just the first convolutional layer in your _trained_ network. 
The input to your first convolutional layer should have dimensions $1 \times 3 \times 32 \times 32$ and the output should have dimensions $1 \times 8 \times 30 \times 30$. 

Now, take your output and make a figure containing $9$ subplots in a $3 \times 3$ grid. The first entry in the subplot should be the image you passed through your convolutional layer, and the $8$ remaining subplots should contain each of the channels output by your convolutional layer as $30\times 30$ images.

You should see that each of the channels output by the convolutional filter highlights a different component of the input image - this behavior is central to the success of CNN! They are able to decompose the image into distinct features and use those features to classify images. Once you have your code working try generating your plot with different input images and see what patterns you can pick up on. Note that not all examples will look great, since the images are very low resolution. And congratulations on building and training your first neural networks!

**Hints:** If you have an $n \times m$ array you can plot it using the `plt.imshow` command, and if you have an $n \times m \times 3$ array, using this same command will generate a color plot where the last dimension is interepreted as the RGB channels. You will need to normalize your tensors so that their entries are in $[0, 1]$ for the plots to be rendered correctly.

Thus, to plot your input image which should be a torch tensor with dimensions $1 \times 3 \times 32 \times 32$, after you normalize it you can call:
```
plt.imshow(normalized_input[0].transpose(2, 1).transpose(0, 2))
```

**Answer Here**

In [None]:
trainloader = torch.utils.data.DataLoader(trainset, batch_size=1,
                                          shuffle=True, num_workers=2,
                                          pin_memory=True)
class oneConv(nn.Module):
    def __init__(self, batch_size):
        super().__init__()
        self.conv1 = nn.Conv2d(3,8,3)

    def forward(self, x):  
        x = self.conv1(x)
        return x

def normalization(a):
  min = torch.min(a)     
  max = torch.max(a)
  normalized = (a- min) / (max-min)
  return normalized

oneConv = oneConv(1)
list=[]
for (X, y) in trainloader:
  list = oneConv(X)
  normalized_input = normalization(list[0])
  fig, axes = plt.subplots(figsize=(10, 10), nrows=3, ncols=3, sharex=True, sharey=True)
  count=0
  normalized_input_filter = normalization(list).detach().numpy()
  for i in range(3):
      for j in range(3):
          if((i == 0) & (j == 0)):
              normalized_input = normalization(X)
              axes[i,j].imshow(normalized_input[0].transpose(2, 1).transpose(0, 2))
          else:
              axes[i,j].imshow(normalized_input_filter[0, count-1, :, :], cmap='gray')
              count=count+1
  break


In [None]:
reak