# CS549 Machine Learning 


# Section 1: Simple Neural Network

**Total points: 50**

In this assignment, you will implement a 2-layer shallow neural network model. 

We will use the model to conduct the same binary classification task, i.e., to classify two categories of the sign language dataset.

The input size is the number of pixels in an image ($64\times 64$). The size of the hidden layer is determined by the hyperparameter `n_h`, and the size of the output layer is 1.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from utils import *

%matplotlib inline
np.random.seed(1)

In [None]:
# Load data
X_train, Y_train, X_test, Y_test = load_data()

print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

### 1.1 Initialize parameters
**4 points**

The parameters associated with the hidden layer are $W^{[1]}$ and $b^{[1]}$, and the parameters associated with the output layer are $W^{[2]}$ and $b^{[2]}$.

We use **tanh** as the activation function for the hidden layer, and **sigmoid** for the output layer.

**Instructions:**
- Initialize parameters randomly
- Use `np.random.randn(size_out, size_in) * 0.01` to initialize $W^{[l]}$, in which `size_out` is the output size of the current layer, and `size_in` is the input size from the previous layer. Note that `np.random.randn` takes the dimensions as separate arguments, not as a tuple.
- Use `np.zeros()` to initialize $b^{[l]}$
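
To make the expected shapes concrete, here is a minimal sketch for a single layer. The sizes and the seed are hypothetical values chosen only for illustration; they are not the ones used by the grader.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, used only in this illustration

size_in, size_out = 3, 4  # hypothetical layer sizes

# randn takes each dimension as a separate argument.
W = np.random.randn(size_out, size_in) * 0.01
# zeros takes the shape as one tuple; a column vector broadcasts over examples.
b = np.zeros((size_out, 1))

print(W.shape, b.shape)
```

The weight matrix has one row per unit of the current layer, and the bias is a column vector so that it can be added to every example at once.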

In [None]:
def init_params(n_i, n_h, n_o):
    """
    Args:
    n_i -- size of input layer
    n_h -- size of hidden layer
    n_o -- size of output layer
    
    Return:
    params -- a dict object containing all parameters:
        W1 -- weight matrix of layer 1
        b1 -- bias vector of layer 1
        W2 -- weight matrix of layer 2
        b2 -- bias vector of layer 2
    """
    np.random.seed(2) # DO NOT change this line! 
    
    ### START TODO ### 
    
    ### END TODO ###
    
    params = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
    
    return params

In [None]:
# Evaluate Task
ps = init_params(3, 4, 1)
print('W1 =', ps['W1'])
print('b1 =' ,ps['b1'])
print('W2 =', ps['W2'])
print('b2 =', ps['b2'])

**Expected output**

|&nbsp;|&nbsp; |          
|--|--|
|**W1 =**|[[-0.00416758 -0.00056267 -0.02136196] <br>[ 0.01640271 -0.01793436 -0.00841747] <br> [ 0.00502881 -0.01245288 -0.01057952]<br>[-0.00909008  0.00551454  0.02292208]]|
|**b1 =**|[[0.]<br>[0.]<br>[0.]<br>[0.]]|
|**W2 =**|[[ 0.00041539 -0.01117925  0.00539058 -0.0059616 ]]|
|**b2 =**|[[0.]]|

***

### 1.2 Forward propagation

**8 points**

Use the following formulas to implement forward propagation:
- $Z^{[1]} = W^{[1]}X + b^{[1]}$
- $A^{[1]} = \tanh(Z^{[1]})$ --> use the `np.tanh` function
- $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$
- $A^{[2]} = \sigma(Z^{[2]})$ --> directly use the `sigmoid` function provided in `utils` package
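
The sketch below traces the shapes through these four formulas on random data. It is only an illustration: the sizes are hypothetical, and the local `sigmoid` is a stand-in for the helper in `utils`, whose code is not shown here.

```python
import numpy as np

def sigmoid(z):
    # Local stand-in for the `sigmoid` helper provided in `utils`.
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)        # arbitrary seed for this illustration
n_in, n_h, m = 3, 4, 5   # hypothetical sizes: inputs, hidden units, examples

X = np.random.randn(n_in, m)
W1, b1 = np.random.randn(n_h, n_in) * 0.01, np.zeros((n_h, 1))
W2, b2 = np.random.randn(1, n_h) * 0.01, np.zeros((1, 1))

Z1 = W1 @ X + b1    # (n_h, m): b1 broadcasts across the m columns
A1 = np.tanh(Z1)    # (n_h, m)
Z2 = W2 @ A1 + b2   # (1, m)
A2 = sigmoid(Z2)    # (1, m), every entry strictly between 0 and 1

print(Z1.shape, A1.shape, Z2.shape, A2.shape)
```

Each column corresponds to one example, which is why the assertions in `forward_prop` check that the second dimension equals `m`.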

In [None]:
def forward_prop(X, params):
    """
    Args:
    X -- input data of shape (n_in, m)
    params -- a python dict object containing all parameters (output of init_params)
    
    Return:
    A2 -- the activation of the output layer
    cache -- a python dict containing all intermediate values for later use in backprop
             i.e., 'Z1', 'A1', 'Z2', 'A2'
    """
    m = X.shape[1]
    
    # Retrieve parameters
    ### START TODO ###
    
    ### END TODO ###
    
    # Implement forward propagation
    ### START TODO ###
    
    ### END TODO ###
    
    assert A1.shape[1] == m
    assert A2.shape[1] == m
    
    cache = {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
    
    return A2, cache

In [None]:
# Evaluate Task
X_tmp, params_tmp = forwardprop_testcase()

A2, cache = forward_prop(X_tmp, params_tmp)
print('mean(Z1) =', np.mean(cache['Z1']))
print('mean(A1) =', np.mean(cache['A1']))
print('mean(Z2) =', np.mean(cache['Z2']))
print('mean(A2) =', np.mean(cache['A2']))

**Expected output**

|&nbsp;|&nbsp; |          
|--|--|
|**mean(Z1) =**|0.13208981347443063|
|**mean(A1) =**|-0.01294750224234301|
|**mean(Z2) =**|-0.028697749001905536|
|**mean(A2) =**|0.5329353691451202|

***

### 1.3 Backward propagation
**17 points**

Use the following formulas to implement backward propagation:
- $dZ^{[2]} = A^{[2]} - Y$
- $dW^{[2]} = \frac{1}{m}dZ^{[2]}A^{[1]T}$ --> $m$ is the number of examples
- $db^{[2]} = \frac{1}{m}$ np.sum( $dZ^{[2]}$, axis=1, keepdims=True)
- $dA^{[1]} = W^{[2]T}dZ^{[2]}$
- $dZ^{[1]} = dA^{[1]}*g'(Z^{[1]})$
    - Here $*$ denotes element-wise multiply
    - $g(z)$ is the tanh function, therefore its derivative $g'(Z^{[1]}) = 1 - (g(Z^{[1]}))^2 = 1 - (A^{[1]})^2$
- $dW^{[1]} = \frac{1}{m} dZ^{[1]}X^T$
- $db^{[1]} = \frac{1}{m}$ np.sum( $dZ^{[1]}$, axis=1, keepdims=True)
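
If you want to convince yourself of the tanh derivative used for $dZ^{[1]}$, a quick numerical check such as the following sketch compares $1 - (A^{[1]})^2$ with a central finite difference. The grid of test points and the step `eps` are arbitrary choices.

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)  # arbitrary test points
a = np.tanh(z)

# Analytic derivative expressed through the activation itself.
analytic = 1.0 - a ** 2

# Central finite-difference approximation of d/dz tanh(z).
eps = 1e-6
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

max_gap = np.max(np.abs(analytic - numeric))
print(max_gap)
```

The printed gap should be tiny, which is why backprop can reuse the cached `A1` instead of recomputing `np.tanh(Z1)`.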

In [None]:
def backward_prop(X, Y, params, cache):
    """
    Args:
    X -- input data of shape (n_in, m)
    Y -- input label of shape (1, m)
    params -- a python dict containing all parameters
    cache -- a python dict containing 'Z1', 'A1', 'Z2' and 'A2' (output of forward_prop)
    
    Return:
    grads -- a python dict containing the gradients w.r.t. all parameters,
             i.e., dW1, db1, dW2, db2
    """
    m = X.shape[1]
    
    # Retrieve parameters
    ### START TODO ###
    
    ### END TODO ###
    
    # Retrieve intermediate values stored in cache
    ### START TODO ###
    
    ### END TODO ###
    
    # Implement backprop
    ### START TODO ###
    
    ### END TODO ###
    
    grads = {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}
    
    return grads

In [None]:
# Evaluate Task
X_tmp, Y_tmp, params_tmp, cache_tmp = backprop_testcase()

grads = backward_prop(X_tmp, Y_tmp, params_tmp, cache_tmp)
print('mean(dW1) =', np.mean(grads['dW1']))
print('mean(db1) =', np.mean(grads['db1']))
print('mean(dW2) =', np.mean(grads['dW2']))
print('mean(db2) =', np.mean(grads['db2']))

**Expected output**

|&nbsp;|&nbsp; |          
|--|--|
|**mean(dW1) =**|-0.039558211695590706|
|**mean(db1) =**|0.001467912358907287|
|**mean(dW2) =**|0.1250823230639841|
|**mean(db2) =**|0.13293536800000003|

***

### 1.4 Update parameters
**8 points**

Update $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$ accordingly:
- $W^{[1]} = W^{[1]} - \alpha\ dW^{[1]}$
- $b^{[1]} = b^{[1]} - \alpha\ db^{[1]}$
- $W^{[2]} = W^{[2]} - \alpha\ dW^{[2]}$
- $b^{[2]} = b^{[2]} - \alpha\ db^{[2]}$
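
The same update rule can be seen in isolation on a toy problem. The sketch below applies it to a single scalar `w` with the made-up cost $J(w) = (w - 3)^2$; the cost, the starting point, and the learning rate are assumptions for illustration only.

```python
def grad(w):
    # Gradient of the toy cost J(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w = 0.0      # arbitrary starting point
alpha = 0.1  # learning rate

for _ in range(100):
    # Same rule as above: parameter minus learning rate times gradient.
    w = w - alpha * grad(w)

print(w)
```

After enough steps `w` settles near 3, the minimizer of the toy cost. In `update_params` you apply one such step to each of the four parameter arrays.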

In [None]:
def update_params(params, grads, alpha):
    """
    Args:
    params -- a python dict containing all parameters
    grads -- a python dict containing the gradients w.r.t. all parameters (output of backward_prop)
    alpha -- learning rate
    
    Return:
    params -- a python dict containing all updated parameters
    """
    ### START TODO ###
    # Retrieve parameters 2 points

    # Retrieve gradients 2 points

    # Update each parameter 4 points

    ### END TODO ###
    
    params = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
    
    return params

In [None]:
# Evaluate Task
params_tmp, grads_tmp = update_params_testcase()

params = update_params(params_tmp, grads_tmp, 0.01)
print('W1 =', params['W1'])
print('b1 =', params['b1'])
print('W2 =', params['W2'])
print('b2 =', params['b2'])

**Expected output**

|&nbsp;|&nbsp; |          
|--|--|
|**W1 =**|[[-1.9959083  -1.06667372 -0.75475925]<br>[ 0.29418098 -0.98950432 -1.22186039]<br>[-0.27355221 -1.6082775   0.11104952]<br>[-0.14229044  1.07321512  0.4208789 ]<br>[ 0.59648267  0.86048013 -1.60750032]]|
|**b1 =**|[[-0.00062513]<br>[ 0.00046574]<br>[ 0.00069734]<br>[-0.0002741 ]<br>[-0.00033725]]|
|**W2 =**|[[ 1.08049444 -0.25269532  1.0989616   0.20063139  1.45914531]]|
|**b2 =**|[[-0.00132935]]|

***

### 1.5 Integrated model
**8 points**

Integrate `init_params`, `forward_prop`, `backward_prop` and `update_params` into one model.
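
The training loop also computes a cost, which this section does not spell out; `utils` may already provide a helper for it. The expected cost at iteration 0 in Task 1.7 (about 0.6931, close to $\ln 2$) is consistent with mean binary cross-entropy, so the sketch below assumes that cost. The function name `cross_entropy_cost` is hypothetical.

```python
import numpy as np

def cross_entropy_cost(A2, Y):
    # Mean binary cross-entropy over the m examples (columns).
    # Assumed cost; the assignment text does not define it explicitly.
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

Y = np.array([[0.0, 1.0, 1.0, 0.0]])  # made-up labels
A2 = np.full((1, 4), 0.5)             # an uninformed model predicts 0.5 everywhere
cost = cross_entropy_cost(A2, Y)
print(cost)
```

A model that outputs 0.5 for every example has a cost of $\ln 2$, which is roughly what you should see before training has made progress.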

In [None]:
def nn_model(X, Y, n_h, num_iters=10000, alpha=0.01, verbose=False):
    """
    Args:
    X -- training data of shape (n_in, m)
    Y -- training label of shape (1, m)
    n_h -- size of hidden layer
    num_iters -- number of iterations for gradient descent
    alpha -- learning rate
    verbose -- print cost every 1000 steps
    
    Return:
    params -- parameters learned by the model. Use these to make predictions on new data
    """
    np.random.seed(3)
    m = X.shape[1]
    n_in = X.shape[0]
    n_out = 1
    
    # Initialize parameters and retrieve them
    ### START TODO ###
    
    ### END TODO ###
    
    # Gradient descent loop
    for i in range(num_iters):
        ### START TODO ###
        # Forward propagation
        
        # Backward propagation
        
        # Update parameters
        
        # Compute cost
        
        ### END TODO ###
        
        # Print cost
        if i % 1000 == 0 and verbose:
            print('Cost after iter {}: {}'.format(i, cost))
    
    return params

In [None]:
# Evaluate Task 1.5
X_tmp, Y_tmp = nn_model_testcase()

params_tmp = nn_model(X_tmp, Y_tmp, n_h=5, num_iters=5000, alpha=0.01)
print('W1 =', params_tmp['W1'])
print('b1 =', params_tmp['b1'])
print('W2 =', params_tmp['W2'])
print('b2 =', params_tmp['b2'])

**Expected output**

|&nbsp;|&nbsp; |          
|--|--|
|**W1 =**|[[ 0.728558   -0.60417473 -0.24274211]<br>[ 0.88560809 -0.67439594 -0.3043778 ]<br>[-0.20781606 -0.59195986 -0.27344463]<br>[-0.51914662  0.61152697  0.17713134]<br>[ 0.00864946 -0.25198231 -0.1411225 ]]|
|**b1 =**|[[ 0.29073376]<br>[ 0.29189656]<br>[ 0.28876041]<br>[-0.32656432]<br>[ 0.09711243]]|
|**W2 =**|[[-1.25312586 -1.40689892 -0.69967068  1.13815825 -0.31472553]]|
|**b2 =**|[[-0.80345148]]|

***

### 1.6 Predict
**2 points**

Use the learned parameters to make predictions on new data. 
- Compute $A^{[2]}$ by calling `forward_prop`. Note that the `cache` returned will not be used in making predictions.
- Convert $A^{[2]}$ into a vector of 0 and 1.
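
One common way to do the conversion is to threshold the activations. The sketch below assumes a threshold of 0.5, which the text does not state explicitly, and uses made-up activations.

```python
import numpy as np

A2 = np.array([[0.12, 0.87, 0.50, 0.73]])  # made-up output activations

# Entries strictly above the assumed 0.5 threshold become 1, the rest 0.
pred = (A2 > 0.5).astype(float)

print(pred)  # [[0. 1. 0. 1.]]
```

Casting to `float` gives output in the same `0.`/`1.` style as the expected output of this task.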

In [None]:
def predict(X, params):
    """
    Args:
    X -- input data of shape (n_in, m)
    params -- a python dict containing the learned parameters
    
    Return:
    pred -- predictions of model on X, a vector of 0s and 1s
    """
    ### START TODO ###
    
    ### END TODO ###
    
    return pred

In [None]:
# Evaluate Task 1.6
# NOTE: the X_tmp and params_tmp are the ones generated in evaluating Task 1.5 (two cells above)
pred = predict(X_tmp, params_tmp)
print('predictions = ', pred)

**Expected output**

|&nbsp;|&nbsp; |          
|--|--|
|**predictions =**|[[0. 1. 0. 0. 1.]]|

***

### 1.7 Train and evaluate

**3 points**

Train the neural network model on X_train and Y_train, and evaluate it on X_test and Y_test.

You can use the code from the previous assignment for Logistic Regression and Evaluation Metrics to compute the accuracy of your predictions.
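
As a reminder, accuracy is the fraction of predictions that match the labels. A minimal sketch on made-up arrays of shape (1, m):

```python
import numpy as np

pred = np.array([[0.0, 1.0, 1.0, 0.0, 1.0]])  # made-up predictions
Y = np.array([[0, 1, 0, 0, 1]])               # made-up labels

# Element-wise comparison yields booleans; their mean is the accuracy.
acc = np.mean(pred == Y)

print('Accuracy = {0:.2f}%'.format(acc * 100))
```

Here four of the five predictions are correct, so `acc` is 0.8.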

In [None]:
# Train the model on X_train and Y_train, and print cost
# DO NOT change the hyperparameters, so that your output matches the expected one.
params = nn_model(X_train, Y_train, n_h = 10, num_iters=10000, verbose=True)

# Make predictions on X_test
predictions = predict(X_test, params)

# Compute accuracy by comparing predictions and Y_test
### START TODO ###

### END TODO ###
print('Accuracy = {0:.2f}%'.format(acc * 100))

**Expected output**

|&nbsp;|&nbsp; |          
|--|--|
|**Cost after iter 0:**|0.6931077265775999|
|**Cost after iter 1000:**|0.2482306581297105|
|**Cost after iter 2000:**|0.05471507033938196|
|**Cost after iter 3000:**|0.024326463013581715|
|**Cost after iter 4000:**|0.014595754197204438|
|**Cost after iter 5000:**|0.010131520880123288|
|**Cost after iter 6000:**|0.00764463387660483|
|**Cost after iter 7000:**|0.0060842030981856435|
|**Cost after iter 8000:**|0.005023835721723831|
|**Cost after iter 9000:**|0.0042610856757679645|
|**Accuracy =** |95.20%|

***

# Section 2: Convolutional Neural Network -- ConvNet for image classification

**Total points: 50**

In this assignment, you will implement a fully functioning ConvNet model using PyTorch. You will use the model to conduct image classification on the FashionMNIST dataset.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

# PyTorch is needed for this assignment
# You can install it following the instructions on the official website: https://pytorch.org/get-started/locally/
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)

from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

%matplotlib inline
np.random.seed(1)

## Load data

Load the FashionMNIST dataset provided by PyTorch. You can also change the `download` param to `False`, and copy the "data" folder used in the previous assignment to the current folder.

See <https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader> for more information.

In [None]:
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

batch_size = 64

train_loader = DataLoader(training_data, batch_size=batch_size)
test_loader = DataLoader(test_data, batch_size=batch_size)

## Examine data size

Now you can examine the size of the training/test data, which is important for determining some of the parameters of your model.

In [None]:
for i, (X, y) in enumerate(train_loader):
    if i > 0:
        break

print('X.shape: ', X.shape)
print('y.shape: ', y.shape)

**Expected output**:

X.shape:  torch.Size([64, 1, 28, 28])
y.shape:  torch.Size([64])

***

## Task 2.1. Build the model
**20 points**

You will need to define your ConvNet model as a subclass of `torch.nn.Module`. Because we have already imported `torch.nn` as `nn`, we can specify the base class simply as `nn.Module`.

You need to override two functions in defining the class, `__init__()` and `forward()`.
- All the layers, including the convolutional, pooling, and fully-connected layers, are defined in `__init__()`. They are declared and initialized as members of the class, using the `self.` notation in Python.
- The forward pass of the computational graph is defined in `forward()`. This function takes the input data and calls all operations (conv, pool, etc.) sequentially on it. The output of each operation is used as the input of the next one.

**Instructions:**

- Define the model so that the architecture is as follows: <br>
    Conv1 -> ReLU -> BatchNorm-> MaxPool1 -> \
    Conv2 -> ReLU -> BatchNorm-> MaxPool2 -> \
    FullyConnected -> Softmax.
  <br> in which,\
    - `conv1` has filter size $f=3$, stride $s=1$, padding $p=0$, and number of filters $n_f=6$;
    - `conv2` has filter size $f=3$, stride $s=2$, padding $p=0$, and number of filters $n_f=12$;
    - all max-pool layers use filter size $f=2$ (stride $s=2$ by default).
  <br>
- *Note* that the *ReLU* activation function is applied in `forward()` rather than defined in `__init__()`, using `F.relu()`, in which `F` is short for `torch.nn.functional` (imported at the beginning).

- The `in_features` of `self.fc` is the total number of output units after the `self.pool2` layer.
- The `out_features` of `self.fc` should match the number of classes in FashionMNIST dataset, which is 10.
- Use the following formula to compute the height and width of the outputs from the conv layers, where $n$ is the input height/width.
\begin{equation}\text{Output} = (\lfloor\frac{n+2p-f}{s}\rfloor + 1)\times(\lfloor\frac{n+2p-f}{s}\rfloor + 1)\end{equation}
- For the output of the model, use `nn.LogSoftmax()`.
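
The output-size formula can be applied layer by layer to find `in_features`. The sketch below does this for the $28\times 28$ FashionMNIST images, assuming the pooling layers follow the same formula with $p=0$. The helper name `out_size` is hypothetical.

```python
def out_size(n, f, s=1, p=0):
    # floor((n + 2p - f) / s) + 1, for a square input of side n.
    return (n + 2 * p - f) // s + 1

n = 28                      # input height/width
n = out_size(n, f=3, s=1)   # conv1 -> 26
n = out_size(n, f=2, s=2)   # pool1 -> 13
n = out_size(n, f=3, s=2)   # conv2 -> 6
n = out_size(n, f=2, s=2)   # pool2 -> 3

# Units after pool2 = channels of conv2 * height * width.
flat = 12 * n * n
print(n, flat)
```

You can cross-check these numbers against the shapes printed in `debug` mode.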

In [None]:
class ConvNetModel(nn.Module):
    def __init__(self, debug=False):
        super(ConvNetModel, self).__init__()
        self.debug = debug
        
        # The first convolutional layer has in_channels=1, out_channels=6, kernel_size=3, with default stride=1 and padding=0
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.bn1 = nn.BatchNorm2d(6)
        # The first pooling layer is a maxpool with a square window of kernel_size=2 (default stride is same as kernel_size)
        self.pool1 = nn.MaxPool2d(2)
        
        ### START TODO ###
        # The second convolutional layer
        # NOTE: Its in_channels should match the out_channels of conv1 
       
        # The second pooling layer is maxpool with a square window of kernel_size=2
                    
        
        # The fully-connected layer
        # NOTE: Use nn.Linear, and you need to specify the correct in_features
        
        
        # Softmax layer
       
        ### END TODO ###
        
    
    def forward(self, x):
        # Conv1 -> ReLU -> Batchnorm1-> Pool1
        x = self.pool1(self.bn1(F.relu(self.conv1(x))))
        if self.debug:
            print('output shape of pool1:', x.shape)
        
        ### START TODO ###
        # Conv2 -> ReLU -> Batchnorm2 -> Pool2
    
        
        # Flatten the output from the last pooling layer

        
        # Call the fully-connected layer
        
        
        # Call softmax layer
        
        ### END TODO ###
        
        return x

In [None]:
model = ConvNetModel(debug=False) # You can use debug mode to help

# Do not change the test code below
torch.manual_seed(0)
input_data = torch.randn(64, 1, 28, 28)
output = model(input_data)

print('output.size():', output.size())

### Expected output

output.size(): torch.Size([64, 10])

***

## Task 2.2. Train and evaluate
**30 points**

Now you will use the model you have implemented above to build the full training pipeline. You will then train the model on the FashionMNIST dataset and evaluate it on the test set.

You can refer to the previous assignment or to the official documentation. See <https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html> and <https://pytorch.org/docs/stable/optim.html> for more information.

In [None]:
def train_loop(dataloader, model, loss_fn, optimizer, verbose=True):
    for i, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        ### START TODO ###
        
        ### END TODO ###

        # Backpropagation
        ### START TODO ###
        
        ### END TODO ###

        if verbose and i % 100 == 0:
            loss = loss.item()
            current_step = i * len(X)
            print(f"loss: {loss:>7f}  [{current_step:>5d}/{len(dataloader.dataset):>5d}]")

In [None]:
@torch.no_grad()
def test_loop(dataloader, model, loss_fn):
    test_loss, correct = 0, 0

    for X, y in dataloader:
        ### START YOUR CODE ###
        
        ### END YOUR CODE ###

    test_loss /= len(dataloader)
    ### START YOUR CODE ###
    
    ### END YOUR CODE ###

    print(f"Test Error: \n Accuracy: {(100*test_acc):>0.1f}%, Avg loss: {test_loss:>8f} \n")

Next, execute the following cell to start the training and testing loop, using the SGD optimizer.

In [None]:
model = ConvNetModel() # Reset the model
learning_rate = 1e-3


### START YOUR CODE ###

### END YOUR CODE ###

epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    ### START YOUR CODE ###
    
    ### END YOUR CODE ###

print("Done!")

### Expected output

You should be able to reach above 70% test accuracy.

Next, repeat the training with the Adam optimizer.


In [None]:
model = ConvNetModel() # Reset the model
learning_rate = 1e-3


### START YOUR CODE ###

### END YOUR CODE ###

epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    ### START YOUR CODE ###
    
    ### END YOUR CODE ###

print("Done!")

### Expected output

You should be able to reach above 85% test accuracy.
You should observe that the Adam optimizer converges faster than SGD.

***

## Congratulations!
Now you have successfully built a convolutional neural network model for image classification! 
Hopefully this experience of using PyTorch will help you with your final project.