# Lecture 07

In this assignment you will design and train a linear classifier and a multilayer perceptron (MLP) on a simple dataset, and compare the results. The models should be implemented from "scratch" using low-level PyTorch tensor operations (i.e., without the convenience of the `torch.nn` module). In future lectures, we will transition to the high-level `torch.nn` module. In practice, you will likely use `torch.nn` frequently for existing layer types. However, it is valuable to know how to program in PyTorch using tensors directly, as this is necessary for implementing novel layer types, loss functions, and models that are not implemented in `torch.nn`. Additionally, this practice will reinforce your theoretical understanding, which is easy to bypass when using the abstract building blocks defined in `torch.nn`.

In [None]:
import os
import glob
import math
import numpy as np
import torch
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from torchsummary import summary
from utils import callback_plot, create_gif


if torch.backends.mps.is_available():
    print("MPS is available.")
    device = torch.device("mps")
elif torch.cuda.is_available():
    print("CUDA is available.")
    device = torch.device("cuda")
else:
    print("Using CPU")
    device = torch.device("cpu")

print("device: {}".format(device))

# Create the output directories for the training visualizations if needed
LIN_OUTPUT_DIR = "./linear_vizualizations/"
os.makedirs(LIN_OUTPUT_DIR, exist_ok=True)
MLP_OUTPUT_DIR = "./mlp_vizualizations/"
os.makedirs(MLP_OUTPUT_DIR, exist_ok=True)

## Create Dataset

In this assignment, you will use a simple toy dataset for binary classification. The data is stored in a matrix of size $N\times M$, where $N$ is the number of samples and $M$ is the number of features; in this case $N=1000$ and $M=2$. The labels are stored in a vector of size $N$, with values 0 or 1 corresponding to the two classes.

In [None]:
N_SAMPLES = 1000
TEST_SIZE = 100 

X, Y = make_moons(n_samples = N_SAMPLES, noise=0.2, random_state=100)
X = X.astype(np.float32)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=TEST_SIZE, random_state=42)
print("X_train shape: {}".format(X_train.shape))
print("Y_train shape: {}".format(Y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("Y_test shape: {}".format(Y_test.shape))

<br>
Visualize the dataset using a matplotlib scatter plot:

In [None]:
fig, ax = plt.subplots(figsize=(4,4))
ax.scatter(X_train[:,0],X_train[:,1], c=Y_train, cmap=plt.cm.coolwarm, edgecolors='black')


## NumPy to torch

PyTorch tensors are very similar to NumPy arrays, except that tensors can seamlessly run on GPUs for accelerated computing. PyTorch tensors also support automatic differentiation (autograd), greatly simplifying neural network backpropagation. You will build an MLP using `torch.Tensor` objects to store all matrices (data, model parameters, intermediate layer activations) and perform calculations. Conveniently, the APIs of PyTorch and NumPy are very similar. For example, `torch.sum(x, dim=1, keepdim=True)` and `np.sum(x, axis=1, keepdims=True)` perform the same operation; the main differences are the argument names (`dim`/`keepdim` versus `axis`/`keepdims`) and that torch operates on `torch.Tensor` objects whereas NumPy operates on `np.ndarray` objects.
Our dataset is currently stored as NumPy arrays, so first we convert it to PyTorch tensors and then transfer them to the desired computing device. Similar to NumPy, we can access the size of a tensor using the `shape` attribute.
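
As a small illustration of the API similarity described above, the sketch below sums the rows of the same toy matrix in NumPy and in PyTorch. The array `x_np` and its values are made up for this example. Note that NumPy names the arguments `axis` and `keepdims`, while PyTorch names them `dim` and `keepdim`.

```python
import numpy as np
import torch

# Toy 2x3 matrix (values chosen only for illustration)
x_np = np.arange(6, dtype=np.float32).reshape(2, 3)

# from_numpy shares memory with the NumPy array and keeps its dtype
x_t = torch.from_numpy(x_np)

# Row sums, keeping the reduced dimension so the result has shape (2, 1)
row_sums_np = np.sum(x_np, axis=1, keepdims=True)
row_sums_t = torch.sum(x_t, dim=1, keepdim=True)

print(row_sums_np.shape, tuple(row_sums_t.shape))
print(np.allclose(row_sums_np, row_sums_t.numpy()))
```

Both results have one row sum per sample, which is the same pattern you will need when normalizing each row of a matrix.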

In [None]:
inputs = torch.from_numpy(X_train).to(device)
labels = torch.from_numpy(Y_train).to(device)

inputs_test = torch.from_numpy(X_test).to(device)
labels_test = torch.from_numpy(Y_test).to(device)

print("train inputs shape: {}".format(inputs.shape))
print("train labels shape: {}".format(labels.shape))

print("test inputs shape: {}".format(inputs_test.shape))
print("test labels shape: {}".format(labels_test.shape))


## Part 1: Linear Classifier with torch

Recall that the output of a linear classifier is calculated as $z = Wx + b$ where $W\in R^{K\times M}$, $x\in R^M$, $b \in R^{K}$ and the output $z\in R^{K}$. In practice, this operation is performed in parallel on a **batch** of data, i.e., a subset of samples. Instead of $x$ being a vector representing a single sample, it will be a matrix $x \in R^{N\times M}$ for a batch of $N$ samples, where each row represents a different sample. For the dimensions to be valid for matrix multiplication, the equation needs to be manipulated slightly. Assuming the input is $x \in R^{N\times M}$ and the output is $z \in R^{N\times K}$, write the equation for a linear classifier (in the batch setting) and specify the size of the weights and biases in the markdown cell below.

XXX

Below you will define a `LinearClassifier` class to represent a simple linear classifier. The `__init__` function should initialize the model parameters (weights and biases) using PyTorch tensors. The weights should be initialized with random values using `torch.randn` and the biases can be initialized to zeros using `torch.zeros`. For example, `torch.randn(4,8, device=device)` will create a random matrix with size $4 \times 8$. `torch.randn` generates random values from a standard normal distribution (i.e., mean 0, standard deviation 1). Ensure all weights and biases have gradient tracking enabled by calling the `requires_grad_()` method on all tensors.
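
The following minimal sketch shows these calls in isolation, outside of any class. The sizes (4, 8) are arbitrary and the tensors are created on the CPU for simplicity; neither choice is prescribed by the assignment.

```python
import torch

torch.manual_seed(0)  # only to make the random values reproducible

# Random tensor drawn from a standard normal distribution
W = torch.randn(4, 8)
# Tensor of zeros
b = torch.zeros(8)

# requires_grad_() modifies the tensor in place and enables gradient tracking
W.requires_grad_()
b.requires_grad_()

print(W.shape, b.shape)
print(W.requires_grad, b.requires_grad)
```

Tensors created this way are leaf tensors, so after a call to `backward()` their gradients are stored in the `grad` attribute.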


The `__call__` function should implement the forward pass (evaluate the linear classifier equation), using the input data $X \in R^{N\times M}$.


In [None]:
class LinearClassifier:
    """
    A linear classifier
    
    Parameters:
    ----------
    - input_size (int): number of input features (M)
    - output_size (int): number of output classes (K)
    
    Attributes:
    - W1 (torch.tensor): weight matrix 
    - b1 (torch.tensor): bias vector 

    Methods:
    """
    def __init__(self, input_size, output_size):
        ######################################
        #######         TODO           #######
        ######################################
        # replace XXX and create the remaining parameters
        self.W1 = torch.randn(XXX, XXX, device = device)
        self.W1.requires_grad_()
        

    def __call__(self, X):
        ######################################
        #######         TODO           #######
        ######################################
        # implement forward pass
        
        return z1

By implementing the `__call__` function, an object of the `LinearClassifier` class can be treated as a callable instance. In other words, a `LinearClassifier` instance can be called like a function. For example, the code below first creates an instance of the `LinearClassifier` class and then executes the forward pass (by running the code in `__call__`):

`mymodel = LinearClassifier(2,2)`

`logit = mymodel(input_data)`
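
To see the mechanism on its own, here is a toy sketch using a hypothetical `Scaler` class (not part of the assignment) whose `__call__` method simply multiplies its input by a stored factor.

```python
class Scaler:
    """Hypothetical toy class: multiplies its input by a fixed factor."""

    def __init__(self, factor):
        # State is stored on the instance, like model parameters would be
        self.factor = factor

    def __call__(self, x):
        # Runs whenever the instance is called like a function
        return self.factor * x


double = Scaler(2)
result = double(5)  # equivalent to double.__call__(5)
print(result)
```

The instance keeps its own state (`factor`) between calls, which is the same reason a model object can hold its weights and biases and still be used like a function.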


Next, implement the softmax function using torch operations (not NumPy). The inputs to the softmax function are the predicted logits, and the outputs are the predicted probabilities. The softmax function will be called by the `categorical_cross_entropy` loss function, which is defined below.

In [None]:
def softmax(x):
    """
    Calculates the softmax on input x with size NxC

    N is the number of samples

    C is the number of classes

    Each row of the output will be a valid probability distribution
    i.e., all values between 0 and 1, and each row sums to 1

    Inputs:
    - x (torch.tensor): tensor with size NxC

    Returns:
    - p (torch.tensor): tensor with size NxC
    """
    ######################################
    #######         TODO           #######
    ######################################
    e_x = XXX
    p = XXX
    
    return p

def categorical_cross_entropy(y_pred, y_true):
    """
    Calculates the categorical cross entropy.

    y_pred are logits and y_true are class labels

    The function first calculates softmax of the logits
    and then computes the cross entropy 

    Inputs:
    - y_pred (torch.tensor): tensor of logits with size NxC
    - y_true (torch.tensor): tensor with size N,

    Returns:
    - loss (torch.tensor): scalar loss (average cross entropy over samples)
    """
    N = y_pred.shape[0]
    y_prob = softmax(y_pred)
    
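    # For each row i, select the predicted probability of the true class y_true[i];
    # the row index is created on the same device as the predictions
    cross_entropy = -torch.log(y_prob[torch.arange(N, device=y_pred.device), y_true.long()])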
    loss = torch.sum(cross_entropy) / N
    return loss
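
The indexing expression inside `categorical_cross_entropy` can be hard to read at first. The sketch below applies the same pattern to a small hand-written probability matrix (the values of `y_prob` and `y_true` are made up for illustration) to show that it selects, for each row, the probability assigned to the true class.

```python
import torch

# Hand-written probabilities for 3 samples and 2 classes (illustrative values)
y_prob = torch.tensor([[0.7, 0.3],
                       [0.2, 0.8],
                       [0.5, 0.5]])
# True class label for each sample
y_true = torch.tensor([0, 1, 1])

N = y_prob.shape[0]
# Pairs row i with column y_true[i]: picks y_prob[0, 0], y_prob[1, 1], y_prob[2, 1]
picked = y_prob[torch.arange(N), y_true]
print(picked)

# Average negative log-probability of the true class
loss = -torch.log(picked).mean()
print(loss.item())
```

The loss is small when the selected probabilities are close to 1 and grows as the model assigns less probability to the correct class.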

Write the training loop below. The training loop will iterate for 5000 epochs. Since this is a toy example and the network is small, we can pass the entire training dataset through the model each epoch (rather than sampling batches). For each epoch the following steps are performed: forward pass, calculate loss, backward pass, update weights. To perform the backward pass you can call `loss.backward()`, where `loss` is the tensor returned by the `categorical_cross_entropy` function. The `backward()` functionality is available because we are using PyTorch tensors with gradient tracking enabled; this is the autograd feature. After calling `backward()`, you can access the gradient of the loss with respect to each parameter using the `grad` attribute, e.g. `model.W1.grad`. Implement gradient descent to update the values of all the weights and biases in the model.
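
The four steps can be sketched on a much simpler problem than the assignment: minimizing the one-dimensional function $(w - 3)^2$ with gradient descent. The function, the starting value, the learning rate, and the number of steps are all assumptions chosen for illustration; this is not the solution for the classifier.

```python
import torch

# A single trainable scalar parameter (toy stand-in for model weights)
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1  # step size, chosen arbitrarily for this toy problem

for step in range(100):
    # Forward pass and loss: a quadratic with its minimum at w = 3
    loss = (w - 3.0) ** 2

    # Backward pass: autograd stores d(loss)/dw in w.grad
    loss.backward()

    # Parameter update, excluded from the computation graph
    with torch.no_grad():
        w -= lr * w.grad
        # Gradients accumulate across backward() calls, so reset them
        w.grad.zero_()

print(w.item())
```

After the loop, `w` should be very close to 3. If the gradient were not zeroed each step, successive calls to `backward()` would add to the stored gradient and the updates would be wrong.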

In [None]:
######################################
#######         TODO           #######
######################################

# TODO: initialize model by creating an instance of the LinearClassifier class
# with two input features and two outputs
model = XXX

# Set hyperparameters for training
lr = 0.01 # learning rate/step size for gradient descent
epochs = 5000 

# Set the loss function
loss_func = categorical_cross_entropy # loss function

# Create empty list to store loss value each epoch
history = []


for epoch in range(epochs):

    # TODO: forward pass to calculate logits
    logit = XXX

    # calculate loss 
    loss = loss_func(logit, labels)
    
    history.append(loss.item())
    
    # Backward pass to compute gradients
    loss.backward()

    # Exclude weight update from computation graph
    with torch.no_grad(): 
       
        # TODO: update weights using gradient descent on all parameters. 
        # gradient of W1: model.W1.grad 
       
       
        # After updating the weights, clear the gradients for the next epoch
        model.W1.grad.zero_()
        model.b1.grad.zero_()
       
        # Plot decision boundary
        if epoch % 50 == 0:
            callback_plot(epoch, X_train, Y_train, model, device, LIN_OUTPUT_DIR)


# Plot loss per epoch
fig, ax = plt.subplots()
ax.plot(history)
ax.set_xlabel("epoch")
ax.set_ylabel("loss")


The code below will create a gif of the training visualizations. After running the code, open the linear.gif file to visualize how the classification decision boundary changes as the model parameters are updated during training.

In [None]:
output_gif_path = "linear.gif"
image_paths = sorted(glob.glob(LIN_OUTPUT_DIR+"/*"))
create_gif(image_paths, output_gif_path, duration=10)

Calculate the accuracy of the model on the test dataset (`inputs_test` and `labels_test`).

In [None]:
def accuracy(y_pred, y_true):
    preds = torch.argmax(y_pred, dim=1)
    return (preds == y_true).float().mean()

######################################
#######         TODO           #######
######################################

# TODO: get predictions on test dataset and print accuracy


## Part 2: MLP with torch

Now consider an MLP with two hidden layers, each followed by a ReLU nonlinearity. The network has the following form:

$x \rightarrow \boxed{fc1} \rightarrow z1 \rightarrow \boxed{ReLU} \rightarrow a1 \rightarrow \boxed{fc2} \rightarrow z2 \rightarrow \boxed{ReLU} \rightarrow a2 \rightarrow \boxed{fc3}  \rightarrow output$

where the boxes represent mathematical operations, and non-boxed values represent input and output tensors of each operation. 
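
As a reminder of what the ReLU boxes do, the sketch below applies ReLU elementwise to a small made-up tensor `z` using `torch.clamp`, and checks the result against the built-in `torch.relu`. Using `torch.clamp` here is one possible choice, not the only way to write it.

```python
import torch

# Made-up pre-activation values for 2 samples and 3 neurons
z = torch.tensor([[-2.0, 0.0, 3.0],
                  [1.5, -0.5, 0.0]])

# ReLU(z) = max(z, 0), applied elementwise: negatives become 0
a = torch.clamp(z, min=0)
print(a)

# Same result as the built-in function
print(torch.equal(a, torch.relu(z)))
```

The output has the same shape as the input, so the activation does not change the tensor sizes flowing between layers.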

In the markdown cell below, write the equation of the MLP above and specify the size of all weights and biases (the equation and sizes should be appropriate for the batch setting). Assume the input is $x \in R^{N\times M}$ and the output is $z \in R^{N\times K}$.

XXX

Below you will define the `MLP` class to represent the two-hidden-layer MLP above. The `__init__` function should initialize the model parameters (weights and biases) using PyTorch tensors. The weights should be initialized with random values using `torch.randn` and the biases can be initialized to zeros using `torch.zeros`. Improved training dynamics can be achieved by instead initializing the weights with a standard deviation of $\sqrt{2.0 / n}$, where $n$ is the number of inputs to the layer; this is known as "He initialization". Note that `torch.randn` uses a standard deviation of 1. Ensure all weights and biases have gradient tracking enabled by calling the `requires_grad_()` method on all tensors.
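
Because `torch.randn` samples with standard deviation 1, multiplying its output by a constant rescales the standard deviation to that constant. The sketch below checks this empirically for the He scaling factor; the sizes `n = 512` and `m = 256` are arbitrary values chosen so that the sample estimate is stable, not sizes recommended for the assignment.

```python
import math
import torch

torch.manual_seed(0)  # only to make the check reproducible

n, m = 512, 256  # arbitrary layer sizes for illustration

# Standard normal samples rescaled to standard deviation sqrt(2 / n)
target_std = math.sqrt(2.0 / n)
W = torch.randn(n, m) * target_std

print(target_std)
print(W.std().item())
```

The two printed values should agree closely. If you rescale the tensor like this, call `requires_grad_()` on the rescaled tensor so that the parameter you update is the one being tracked.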


The `__call__` function should implement the forward pass, using the input data $X \in R^{N\times M}$.

In [None]:
class MLP:
    """
    A two-hidden layer multilayer perceptron 
    
    Parameters:
    ----------
    - input_size (int): number of input features (M)
    - h1_size (int): number of neurons in first hidden layer
    - h2_size (int): number of neurons in second hidden layer
    - output_size (int): number of output classes (K)
    
    Attributes:
    - W1 (torch.tensor): weight matrix from input to hidden layer 1
    - b1 (torch.tensor): bias vector for hidden layer 1
    - W2 (torch.tensor): weight matrix from hidden layer 1 to hidden layer 2
    - b2 (torch.tensor): bias vector for hidden layer 2
    - W3 (torch.tensor): weight matrix from hidden layer 2 to output
    - b3 (torch.tensor): bias vector for output

    Methods:
    """
    def __init__(self, input_size, h1_size, h2_size, output_size):
        
        ######################################
        #######         TODO           #######
        ######################################
        # create all parameters
       
        

    def __call__(self, X):
        
        ######################################
        #######         TODO           #######
        ######################################
        # implement forward pass

      
        return z3

By implementing the `__call__` function, an object of the `MLP` class can be treated as a callable instance. In other words, an `MLP` instance can be called like a function. For example, the code below first creates an instance of the `MLP` class and then executes the forward pass (by running the code in `__call__`):

`mymodel = MLP(2,8,16,2)`

`logit = mymodel(input_data)`


Below write the training loop.

In [None]:
######################################
#######         TODO           #######
######################################

# TODO: initialize model by creating an instance of the MLP class
# You can choose the sizes for hidden layers


# Set hyperparameters for training
lr = 0.01 # learning rate/step size for gradient descent
epochs = 5000 

# Set the loss function
loss_func = categorical_cross_entropy # loss function

# Create empty list to store loss value each epoch
history = []



for epoch in range(epochs):
    
    # TODO: forward pass to calculate logits

    
    # TODO: calculate loss (assign to variable named loss)

    
    history.append(loss.item())
    
    # Backward pass to compute gradients
    loss.backward()

    # Exclude weight update from computation graph
    with torch.no_grad(): 

        # TODO: update weights using gradient descent on all parameters. 
        # gradient of W1: model.W1.grad 
        
        
        # After updating the weights, clear the gradients for the next epoch
        model.W1.grad.zero_()
        model.b1.grad.zero_()
        model.W2.grad.zero_()
        model.b2.grad.zero_()
        model.W3.grad.zero_()
        model.b3.grad.zero_()

        # Plot decision boundary
        if epoch % 50 == 0:
            callback_plot(epoch, X_train, Y_train, model, device, MLP_OUTPUT_DIR)


# Plot loss per epoch
fig, ax = plt.subplots()
ax.plot(history)
ax.set_xlabel("epoch")
ax.set_ylabel("loss")


The code below will create a gif of the training visualizations. After running the code, open the mlp.gif file to visualize how the classification decision boundary changes as the model parameters are updated during training.

In [None]:
output_gif_path = "mlp.gif"
image_paths = sorted(glob.glob(MLP_OUTPUT_DIR+"/*"))
create_gif(image_paths, output_gif_path, duration=10)

Next, calculate the accuracy of the model on the test dataset (`inputs_test` and `labels_test`).

In [None]:

######################################
#######         TODO           #######
######################################

# TODO: get predictions on test dataset and print accuracy


Lastly, compare the results for the linear classifier and the MLP. Include a comparison of the decision boundaries learned by the two models.

XXX