# Homework 1: Feed-forward Neural Networks (100 points)

### Overview

Below you will find a PyTorch implementation of a feed-forward neural network for image recognition. We use the popular MNIST dataset, where the model predicts a single digit (0-9) for a grayscale image of a handwritten digit. This is a _classification_ task.

### NN Architecture

Each image is a 28x28 grid of grayscale pixel values between 0 and 255. In preprocessing, we flatten each image to a single vector of length $28^2 = 784$, which serves as the entire input for the model.

For each image, we aim to predict one of ten classes (0-9). We could use an output layer $y$ of size 1 (a single neuron) -- for example, using a naive mapping like prediction $p = \mathrm{int}(10y)$. But this presupposes that a handwritten 0 is similar to a handwritten 1 and very different from a handwritten 9, which isn't the case. So instead we use an output layer $y$ of size 10, where the prediction is $p = \mathrm{argmax}(y)$, so each output neuron controls the likelihood of a particular class.
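
As a minimal sketch of the flattening and argmax steps described above, the following uses plain Python lists rather than tensors. The all-zero image and the ten scores are made-up placeholders, not values from the model.

```python
# Hypothetical 28x28 image: a nested list of pixel intensities (all zeros here).
image = [[0 for _ in range(28)] for _ in range(28)]

# Flatten row by row into a single input vector of length 28 * 28 = 784.
flat = [pixel for row in image for pixel in row]

# Hypothetical output-layer scores, one per digit class 0-9.
scores = [0.1, -1.2, 0.3, 2.5, 0.0, -0.7, 1.1, 0.4, -0.3, 0.9]

# The prediction is the index of the largest score (argmax).
prediction = max(range(len(scores)), key=lambda k: scores[k])

print(len(flat), prediction)
```

Here the largest score sits at index 3, so the predicted digit is 3 regardless of how close the other scores are to each other.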

We use a simple two-layer neural network. To begin, we will have an input size of 784, a hidden layer of size 5, and an output layer of size 10.

### Your Task

At the bottom of this notebook file, there is a series of questions testing your understanding of this neural network architecture. Some questions include instructions where you will need to modify hyperparameters (notated in the code below) and re-run the model to investigate the changed results. __There is no need to read through the following code in depth to answer the questions, but it may be useful as a reference.__

Below each question is a cell with the text “Type Markdown and LaTex.” Double-click the cell and type your response to the question. Save your responses by clicking on the floppy disk icon or choosing File - Save and Checkpoint.

After responding to the questions, download your notebook as a `.html` file by choosing File - Download as - html (.html). You will be submitting this `.html` file to your instructor for grading.

In [1]:
import torch
import torch.nn as nn
import torchvision.datasets as datasets
import torchvision.transforms as transforms
torch.manual_seed(0)

<torch._C.Generator at 0x7f6294778e90>

In [2]:
root_dir = 'assets_week1'
trainDataset = datasets.MNIST(root=root_dir, train=True, transform=transforms.ToTensor(), download=True)
testDataset = datasets.MNIST(root=root_dir, train=False, transform=transforms.ToTensor())

In [3]:
class NNModel(nn.Module):
    def __init__(self, inputSize, outputSize, hiddenSize, activate):
        super().__init__()
        
        self.activate = nn.Sigmoid() if activate == "Sigmoid" else nn.Tanh() if activate == "Tanh" else nn.ReLU()
        self.layer1 = nn.Linear(inputSize, hiddenSize)
        self.layer2 = nn.Linear(hiddenSize, outputSize)
        
    def forward(self, X):
        hidden = self.activate(self.layer1(X))
        return self.layer2(hidden)

In [15]:
# The dimensionality of the input
inputSize = 784
# Number of neurons in the first layer
hiddenSize = 300
# Number of neurons in the second layer
outputSize = 10
# Activation function (default: ReLU, options: Sigmoid, Tanh, ReLU)
activation = "ReLU"
# Learning rate
learningRate = .001
# Number of training epochs
numEpochs = 5
# Number of training examples per batch
batchSize = 200

In [16]:
trainLoader = torch.utils.data.DataLoader(dataset=trainDataset, batch_size=batchSize, shuffle=True)
testLoader = torch.utils.data.DataLoader(dataset=testDataset, batch_size=batchSize, shuffle=False)

net = NNModel(inputSize, outputSize, hiddenSize, activation)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=learningRate)

print('>>> Beginning training!')
for epoch in range(numEpochs):
    for i, (images, labels) in enumerate(trainLoader):
        images = images.view(-1, 28*28)
        
        optimizer.zero_grad()
        
        # Forward propagation
        outputs = net(images)
        
        # Backpropagation
        loss = criterion(outputs, labels)
        loss.backward()
        
        # Gradient descent
        optimizer.step()
        
        # Logging
        if (i+1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {}'.format(epoch+1, numEpochs, i+1, len(trainDataset)//batchSize, loss.item()))

print()
print('>>> Beginning validation!')
correct, total = 0, 0
# Gradients are not needed for evaluation, so disable autograd tracking
with torch.no_grad():
    for images, labels in testLoader:
        images = images.view(-1, 28*28)

        outputs = net(images)
        # The predicted class is the index of the largest output score
        _, prediction = torch.max(outputs, dim=1)
        correct += torch.sum(prediction == labels).item()
        total += labels.size(0)
print('Validation accuracy: {}%'.format(correct/total*100))

>>> Beginning training!
Epoch [1/5], Step [100/300], Loss: 0.3068412244319916
Epoch [1/5], Step [200/300], Loss: 0.28995949029922485
Epoch [1/5], Step [300/300], Loss: 0.1639423370361328
Epoch [2/5], Step [100/300], Loss: 0.24531342089176178
Epoch [2/5], Step [200/300], Loss: 0.11324802041053772
Epoch [2/5], Step [300/300], Loss: 0.08485922962427139
Epoch [3/5], Step [100/300], Loss: 0.0982745960354805
Epoch [3/5], Step [200/300], Loss: 0.06796352565288544
Epoch [3/5], Step [300/300], Loss: 0.11601042747497559
Epoch [4/5], Step [100/300], Loss: 0.06404197961091995
Epoch [4/5], Step [200/300], Loss: 0.0819862112402916
Epoch [4/5], Step [300/300], Loss: 0.09553879499435425
Epoch [5/5], Step [100/300], Loss: 0.07789668440818787
Epoch [5/5], Step [200/300], Loss: 0.06512964516878128
Epoch [5/5], Step [300/300], Loss: 0.07752872258424759

>>> Beginning validation!
Validation accuracy: 97.55%


## Homework Questions

Your goal is to improve the model's accuracy by tuning hyperparameters. If a question asks you to modify a hyperparameter and you obtain improved results, retain that hyperparameter change for subsequent questions. Otherwise, revert to the original hyperparameter value.

**To make sure your code produces consistent results, it is advisable to click "Kernel -> Restart & Run All" every time you want to run your code.**

### Question 1: Loss Minimization & Gradient Descent (5 points)

Given a neural network with model parameters $\theta$, loss function $E$, and learning rate $\alpha$, what is the correct method to perform gradient descent?

a) $\theta_i += \alpha E$

b) $\theta_i -= \alpha E$

c) $\theta_i += \alpha\frac{\partial E}{\partial \theta_i}$

d) $\theta_i -= \alpha\frac{\partial E}{\partial \theta_i}$

D. Update the parameters in the opposite direction of the gradient of the loss function.
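
The rule of stepping against the gradient can be checked numerically. The sketch below assumes a made-up one-parameter loss $E(\theta) = (\theta - 3)^2$, which is not the notebook's loss, and compares subtracting the scaled gradient with adding it.

```python
def loss(theta):
    # Toy loss E(theta) = (theta - 3)^2, minimized at theta = 3 (an assumption
    # chosen for illustration only).
    return (theta - 3.0) ** 2

def grad(theta):
    # Analytic derivative dE/dtheta = 2 * (theta - 3).
    return 2.0 * (theta - 3.0)

alpha = 0.1  # learning rate

# Gradient descent: step against the gradient.
theta = 0.0
for _ in range(50):
    theta -= alpha * grad(theta)

# Stepping with the gradient instead moves away from the minimum.
theta_up = 0.0
for _ in range(5):
    theta_up += alpha * grad(theta_up)

print(loss(0.0), loss(theta), loss(theta_up))
```

Starting from the same point, the subtracting update drives the loss toward zero, while the adding update makes it larger than the starting loss.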

### Question 2: Class Imbalance (10 points)

Imagine you are an engineer tasked with helping a company to identify faulty parts early using a machine learning-based image recognition system. What evaluation metric would you use? More specifically, explain why a raw percent accuracy score would be a poor choice of evaluation metric for this problem space.

I would use recall as the evaluation metric. To identify faulty parts early, we want to limit false negatives (a faulty part slipping past our detection), and recall, also called sensitivity, measures the fraction of faulty parts that are actually caught. Raw percent accuracy is a poor choice for imbalanced classification because a model that always predicts the majority class can achieve high accuracy while catching no faulty parts at all.
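
To make the gap between accuracy and recall concrete, here is a small sketch with invented numbers: 100 inspected parts, 5 of them faulty, and a trivial model that always predicts "good".

```python
# Hypothetical inspection data: 1 = faulty part, 0 = good part.
labels = [1] * 5 + [0] * 95
# A trivial model that always predicts the majority class ("good").
predictions = [0] * 100

pairs = list(zip(labels, predictions))
true_positives = sum(1 for y, p in pairs if y == 1 and p == 1)
false_negatives = sum(1 for y, p in pairs if y == 1 and p == 0)
correct = sum(1 for y, p in pairs if y == p)

accuracy = correct / len(labels)
recall = true_positives / (true_positives + false_negatives)

print(accuracy, recall)
```

The always-"good" model scores 95% accuracy but 0% recall, since every faulty part is missed.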

### Question 3a:  Size of a Hidden Layer (10 points)

Explain how the hidden layer size influences the architecture of a feed-forward neural network. In doing so, note what can happen if the hidden size is too large and what can happen if the hidden size is too small.

Hidden layer size determines the complexity of the functions our feed-forward neural network can fit. The more nodes we have in the hidden layer, the more parameters we must optimize. If the hidden size is too large, the computational requirements grow and we risk overfitting the training data. If the hidden size is too small, the model lacks the capacity to capture the patterns in the data and we risk underfitting.
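
The link between hidden size and parameter count can be sketched with a small helper. `count_parameters` is a hypothetical function written for this illustration; it assumes two fully connected layers, each with a weight matrix and a bias vector, as in the `NNModel` class above.

```python
def count_parameters(input_size, hidden_size, output_size):
    # Each fully connected layer has (inputs * outputs) weights plus one bias
    # per output neuron.
    layer1 = input_size * hidden_size + hidden_size
    layer2 = hidden_size * output_size + output_size
    return layer1 + layer2

small = count_parameters(784, 5, 10)
large = count_parameters(784, 300, 10)
print(small, large)
```

Going from a hidden size of 5 to 300 grows the model from 3,985 to 238,510 trainable parameters.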

### Question 3b: Size of a Hidden Layer  (10 points)

Increase the hidden size from 5 to 300 and re-run your trial. How does the accuracy change?

_a) It increases, since the model learns more quickly_

_b) It increases, since the model has more memory and can learn more complex features_

_c) It decreases, since the model has to learn more parameters and it doesn't have enough time_

_d) It decreases, since the model has less memory_

B. It increases, since the model has more memory and can learn more complex features

### Question 4a: Learning Rate  (10 points)

Explain the purpose of a learning rate. In doing so, note what can happen if the learning rate is too large and what can happen if the learning rate is too small.

The learning rate controls how big a step we take along the negative gradient each time we update the model weights on the way to a minimum. If the learning rate is too small, the model takes too long to converge, which is computationally expensive. If the learning rate is too large, the steps may overshoot minima, so training oscillates, settles on suboptimal weights, or diverges.
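
The effect of step size can be seen on a toy problem. The sketch below assumes plain gradient descent on $E(x) = x^2$ rather than the Adam optimizer used in the notebook, and the three learning rates are arbitrary illustrative choices.

```python
def run_gradient_descent(alpha, steps=20, start=1.0):
    # Minimize E(x) = x^2, whose gradient is 2x, starting from `start`.
    x = start
    for _ in range(steps):
        x -= alpha * 2.0 * x
    return x

too_small = abs(run_gradient_descent(0.001))  # barely moves from the start
reasonable = abs(run_gradient_descent(0.1))   # approaches the minimum at 0
too_large = abs(run_gradient_descent(1.1))    # overshoots further every step

print(too_small, reasonable, too_large)
```

After the same 20 steps, the smallest rate is still close to its starting point, the middle rate is near the minimum, and the largest rate has moved further away than where it began.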

### Question 4b: Learning Rate  (10 points)

Increase the learning rate from 0.001 to 1. How does the accuracy change?

a) It increases, since the model learns more quickly

b) It increases, since the model is better able to converge

c) It decreases, since the model learns too slowly

d) It decreases, since the model is not able to converge

D. It decreases, since the model is not able to converge.

### Question 5a: Activation Functions (10 points)

Explain the main purpose of an activation function in neural networks. Also, explain the main benefit of the Tanh activation function over the Sigmoid activation function, and the main benefit of the ReLU activation function over the Sigmoid activation function.

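The activation function applies a non-linear transformation to each layer's weighted input, which allows the network to approximate complex, non-linear decision boundaries; without it, stacked linear layers would collapse into a single linear map. The main benefit of Tanh over Sigmoid is that Tanh is bounded between -1 and 1 (vs. between 0 and 1 for Sigmoid), so its outputs are zero-centered, and its derivative is steeper (at most 1, vs. at most 0.25 for Sigmoid), which makes gradient-based training more efficient. ReLU replaces all negative values with 0 and passes positive values through unchanged, so it is simpler and cheaper to compute than Sigmoid, and it does not saturate for positive inputs, which mitigates vanishing gradients.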
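
The derivative comparison can be checked directly. This is a standalone sketch using Python's `math` module rather than the `nn.Sigmoid`, `nn.Tanh`, and `nn.ReLU` modules from the notebook; the helper names are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; peaks at 1 when x = 0.
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    # ReLU is max(0, x): slope 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

for x in (0.0, 5.0):
    print(x, sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```

At a large positive input such as 5, both Sigmoid and Tanh gradients are close to zero, while the ReLU gradient stays at 1.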

### Question 5b: Activation Functions (5 points)

Change the activation function in the hyperparameter list above to determine which activation function is most effective at this task.

a) ReLU

b) Sigmoid

c) Tanh

A. ReLU

### Question 6: Overfitting  (10 points)

Define overfitting and explain how it can damage model training and results.

Models that are tuned too specifically to the training set are said to be overfit. An overfit model's parameters have been fit too closely to the training data, including its noise, which results in poor generalization to unseen data: training accuracy keeps improving while accuracy on held-out data stalls or degrades.

### Question 7: Early Stopping  (10 points)

Outline a procedure for early stopping to prevent overfitting. Clearly describe how you’d use the training, validation, and test sets accuracy to decide when to stop.

To put early stopping into practice, I would split the training data into a training set and a validation set, keeping the test set separate. I would train the model only on the training set and evaluate its performance on the validation set after a set number of epochs. As soon as the validation set error is higher than it was at the previous evaluation, I would halt training and use the parameters from that previous evaluation. The test set plays no part in deciding when to stop; it is used only once, at the end, to report the final accuracy.
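
The stopping decision can be sketched as a small function over a list of validation losses. `early_stopping_epoch` and its `patience` parameter are assumptions added for illustration; `patience=1` corresponds to stopping the first time the validation error fails to improve, and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=2):
    # Returns (best_epoch, stop_epoch): the epoch whose parameters we keep,
    # and the epoch at which training halts.
    best_loss = float("inf")
    best_epoch = -1
    stop_epoch = -1
    epochs_without_improvement = 0
    for stop_epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            # Validation loss improved: remember this epoch as the checkpoint.
            best_loss = val_loss
            best_epoch = stop_epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, stop_epoch

# Hypothetical validation losses, one per evaluation.
val_losses = [0.50, 0.40, 0.35, 0.36, 0.38, 0.41]
print(early_stopping_epoch(val_losses, patience=2))
print(early_stopping_epoch(val_losses, patience=1))
```

A larger patience tolerates brief fluctuations in the validation loss at the cost of a few extra epochs of training.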

### Question 8: Regularization  (10 points)

Briefly explain a few common methods of regularization to prevent overfitting.

A few common methods of regularization to prevent overfitting are dropout, early stopping, and data augmentation. Dropout modifies the model itself by randomly selecting neurons to drop from the neural network (creating a simpler model) during each iteration of training. Early stopping involves halting the iterative training of the neural network once the model stops improving on a held-out validation dataset. Data augmentation involves transforming the input data (via rotation, scaling, or brightness changes) and adding the transformed data to the training data. This method pushes the model to generalize, since the more diverse training data is harder to memorize.
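
As a rough illustration of dropout, the sketch below applies "inverted" dropout to a list of activations in plain Python. The `dropout` helper is hypothetical; it zeroes each value with probability `p` and rescales the survivors by `1 / (1 - p)`, which is also how PyTorch's `nn.Dropout` behaves in training mode.

```python
import random

def dropout(activations, p, rng):
    # Inverted dropout: zero each activation with probability p and scale the
    # survivors by 1 / (1 - p) so the expected value is unchanged.
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]

rng = random.Random(0)  # seeded for repeatability
hidden = [1.0] * 8      # hypothetical hidden-layer activations
dropped = dropout(hidden, p=0.5, rng=rng)
print(dropped)
```

Each call drops a different random subset, so no single neuron can be relied on during training; at evaluation time dropout is switched off and all activations pass through unchanged.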