# Neural Networks and Gradient Descent


In this project, we will implement a neural network and train it from scratch. In other words, we will temporarily put aside the wheels that others have built (e.g., TensorFlow, PyTorch, Keras, MXNet) and take a closer look at the mechanisms of neural networks inside those wrappers (although we will still benefit greatly from the linear algebra routines of, say, NumPy).

## Neural Networks

Recall that a neural network is composed of an input layer, an output layer, and an arbitrary number of hidden layers. 

<img src="https://miro.medium.com/max/1000/1*g9agaYlewb0vzuJIOupf3w.png" height=400 />

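For example, for a network with one hidden layer and no bias terms, the hidden activation $z$ and the output $y$ for an input $x$ are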
$$
\begin{align*}
    z & = \sigma(W_1x) \\
    y & = \sigma(W_2z) = \sigma(W_2\sigma(W_1x))
\end{align*}
$$
where $\sigma$ is the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function).

One technical caveat is that we can omit the bias term as long as we append a $1$ to the end of $x$, which somewhat simplifies backpropagation.
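
As a quick sanity check of this caveat, the sketch below (with made-up shapes and the hypothetical names `W`, `b`, `x`) shows that an affine map `W @ x + b` equals a single matrix product once `b` is appended to `W` as an extra column and a 1 is appended to `x`.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weights of one layer (hypothetical shape)
b = rng.normal(size=3)        # explicit bias vector
x = rng.normal(size=4)        # one input vector

# Fold the bias into the weight matrix and append a constant 1 to x.
W_aug = np.hstack([W, b[:, None]])   # shape (3, 5)
x_aug = np.append(x, 1.0)            # shape (5,)

with_bias = W @ x + b
folded = W_aug @ x_aug
print(np.allclose(with_bias, folded))
```

With the bias folded in, backpropagation only has to handle one weight matrix per layer.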

In [None]:
import numpy as np

# useful activation function
def sigmoid(x):
    # Logistic sigmoid 1 / (1 + exp(-x)), as described on the Wikipedia page
    # hyperlinked above. np.exp is used instead of math.exp so that the
    # function also works elementwise on NumPy arrays.
    return 1 / (1 + np.exp(-x))

# test
sigmoid(0)

0.5

Exercise: what is the derivative of sigmoid? Is it easy to compute numerically?
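
One way to check a candidate answer to this exercise numerically is a central finite difference. The sketch below is only an illustration: `numerical_derivative` and the step size `h` are hypothetical choices, and the sigmoid is redefined so the block runs on its own.

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid, applied elementwise.
    return 1.0 / (1.0 + np.exp(-x))

def numerical_derivative(f, x, h=1e-5):
    # Central difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2.0 * h)

xs = np.linspace(-4.0, 4.0, 9)
approx = numerical_derivative(sigmoid, xs)
print(approx)
```

Compare `approx` against your closed-form derivative evaluated at `xs`; the two should agree to several decimal places.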

In [None]:
import numpy as np

# global parameters
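INPUT_DIMENSION = 1024  # should equal the length of the input vector x (e.g., 784 for a flattened 28x28 image, plus 1 if you append a bias entry)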
OUTPUT_DIMENSION = 1

class NeuralNetwork:
    def __init__(self, hidden_dimension):
        # naive random weight initialization
        self.w1 = np.random.rand(hidden_dimension, INPUT_DIMENSION)
        self.w2 = np.random.rand(OUTPUT_DIMENSION, hidden_dimension)

def forward(nn, x):
    # TODO please implement the forward pass
    # try to write it without any for loop
    pass
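
For reference, here is one possible loop-free forward pass. It is a sketch under assumptions rather than the intended solution: `TinyNetwork` is a hypothetical stand-in with small dimensions, and a batch of inputs is assumed to be stored as columns.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyNetwork:
    # Hypothetical stand-in for NeuralNetwork with small, explicit dimensions.
    def __init__(self, input_dimension, hidden_dimension, output_dimension, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.random((hidden_dimension, input_dimension))
        self.w2 = rng.random((output_dimension, hidden_dimension))

def forward(nn, x):
    # x has shape (input_dimension,) for one example,
    # or (input_dimension, n) for a batch of n examples stored as columns.
    z = sigmoid(nn.w1 @ x)   # hidden activations
    y = sigmoid(nn.w2 @ z)   # network output
    return y

nn = TinyNetwork(input_dimension=4, hidden_dimension=3, output_dimension=1)
single = forward(nn, np.ones(4))        # shape (1,)
batch = forward(nn, np.ones((4, 5)))    # shape (1, 5)
print(single.shape, batch.shape)
```

Because `@` handles both vectors and matrices, the same two lines cover a single example and a whole batch.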


## Gradient Descent

Gradient descent is a way of finding approximately optimal weights with respect to some loss function. It is not guaranteed to produce THE optimal weights (in fact, it almost never produces them in practical applications), but it is extremely effective for optimizing machine learning models such as neural networks.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Gradient_descent.svg/700px-Gradient_descent.svg.png" height=400>

Algorithmically speaking, gradient descent simply repeats the update

$$
w = w - \eta \frac{\partial L}{\partial w}
$$

given some loss function $L$ and some positive number $\eta$, known as the learning rate.
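
To make the update rule concrete, here is a minimal sketch on a toy problem chosen purely for illustration: minimizing $L(w) = (w - 3)^2$, whose gradient is $2(w - 3)$. The starting point, the learning rate `eta`, and the number of steps are arbitrary assumptions.

```python
def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w = 0.0      # arbitrary starting point
eta = 0.1    # learning rate
for _ in range(100):
    w = w - eta * grad(w)   # the gradient descent update
print(w)
```

Each step shrinks the distance to the minimizer $w = 3$ by a factor of $1 - 2\eta = 0.8$, so `w` ends up very close to 3. A much larger `eta` (here, anything above 1) would make the iterates diverge instead.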

Commonly used loss functions include $L_1$ [loss](https://pytorch.org/docs/stable/generated/torch.nn.L1Loss.html?highlight=loss#torch.nn.L1Loss "PyTorch documentation on L1 loss"), $L_2$ [loss](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html "PyTorch documentation on related MSE loss"), and <a href="https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html?highlight=bce#torch.nn.BCELoss" title="PyTorch documentation on BCE loss">cross entropy loss</a>.
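
For illustration, these three losses could be written in NumPy roughly as follows. The function names, the averaging over examples, and the clipping constant `eps` are assumptions of this sketch, and the cross entropy shown is the binary version.

```python
import numpy as np

def l1_loss(y_hat, y):
    # Mean absolute error.
    return np.mean(np.abs(y_hat - y))

def l2_loss(y_hat, y):
    # Mean squared error.
    return np.mean((y_hat - y) ** 2)

def bce_loss(y_hat, y, eps=1e-12):
    # Binary cross entropy; predictions are clipped to avoid log(0).
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])        # targets in {0, 1}
y_hat = np.array([0.9, 0.2, 0.6])    # predicted probabilities
print(l1_loss(y_hat, y), l2_loss(y_hat, y), bce_loss(y_hat, y))
```

Note that the binary cross entropy expects targets in $\{0, 1\}$ and predictions strictly between 0 and 1, which is what a sigmoid output layer provides.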

## Back Propagation

Gradient descent is simple yet powerful. But to actually implement it, we still need to find $\frac{\partial L}{\partial w}$. To do this, we use a method called back propagation (BP). In essence, BP is just a series of chain rule applications that propagate the loss from the output layer back to the input layer.

Taking the previous neural network as an example, we have
$$
\begin{align*}
\frac{\partial L}{\partial W_2} & = \frac{\partial L}{\partial y} \frac{\partial y}{\partial W_2} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial (W_2z)} \frac{\partial (W_2z)}{\partial W_2} \\
\end{align*}
$$
where all the terms are easy to compute.

Note that if we do bookkeeping properly, one single backward pass is enough to update all the weights.
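
As a sanity check on this chain rule, the sketch below evaluates the three factors for a tiny network and compares the result with a finite-difference estimate. It assumes a squared-error loss $L = \frac{1}{2}(y - t)^2$ with a hypothetical target `t`, uses the standard identity $\sigma'(a) = \sigma(a)(1 - \sigma(a))$, and picks arbitrary small shapes.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(W1, W2, x, t):
    # Squared-error loss of the two-layer network (an assumed choice).
    y = sigmoid(W2 @ sigmoid(W1 @ x))
    return 0.5 * np.sum((y - t) ** 2)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(1, 3))
x = rng.normal(size=4)
t = np.array([1.0])

# Forward pass, keeping the hidden activation z.
z = sigmoid(W1 @ x)
y = sigmoid(W2 @ z)

# Chain rule: dL/dy, then dy/d(W2 z), then d(W2 z)/dW2.
dL_dy = y - t
dy_da = y * (1.0 - y)
delta = dL_dy * dy_da          # dL/d(W2 z), shape (1,)
dL_dW2 = np.outer(delta, z)    # shape (1, 3), same as W2

# Finite-difference estimate of the same gradient, entry by entry.
h = 1e-6
numeric = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        W2_plus, W2_minus = W2.copy(), W2.copy()
        W2_plus[i, j] += h
        W2_minus[i, j] -= h
        numeric[i, j] = (loss(W1, W2_plus, x, t) - loss(W1, W2_minus, x, t)) / (2 * h)

print(np.max(np.abs(dL_dW2 - numeric)))
```

The same kind of check is a cheap way to validate a hand-derived gradient before plugging it into training.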

Exercise: derive a similar formula for $\frac{\partial L}{\partial W_1}$.

## Prepare your dataset

In this project, we will continue using the Fashion MNIST dataset or the MNIST dataset from the previous project. However, feel free to use your favorite dataset instead, for example:

- [Iris](https://archive.ics.uci.edu/ml/datasets/Iris/) dataset
- [Heart Attack](https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset?select=heart.csv) dataset or any other numerical dataset on Kaggle with a target variable (a Kaggle account is required for download, but it is free)

If the download link does not work, you can also download the dataset from [here](https://www.kaggle.com/zalando-research/fashionmnist "Fashion MNIST dataset on Kaggle") with a Kaggle account.

In [None]:
# Load the fashion MNIST dataset and perform preprocessing as in the last project
import pandas as pd
train_data = pd.read_csv('PATH_TO_TRAIN_DATA')
test_data = pd.read_csv('PATH_TO_TEST_DATA')

# Keep only the tops (0) and pants (1) labels using the data.loc method for both datasets.
train_data = train_data.loc[(train_data['label']==0) | (train_data['label']==1)]
test_data = test_data.loc[(test_data['label']==0) | (test_data['label']==1)]

# Convert the data to a numpy array.
# Change the zero labels to -1 so that the same linear solver technique from before will work.
# Note: the cross-entropy loss used in train_model below assumes labels in {0, 1},
# so skip this relabeling if you train with that loss.
train_np = train_data.to_numpy()
train_np[train_np[:, 0] == 0, 0] = -1
test_np = test_data.to_numpy()
test_np[test_np[:, 0] == 0, 0] = -1

import matplotlib.pyplot as plt
# -1 lets NumPy infer the number of images instead of hard-coding 12000
train_images = train_np[:,1:].reshape(-1,28,28)

plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i])
    plt.xlabel(train_np[i,0])
plt.show()


In [None]:
# backprop parameters
MAX_EPOCH = 100
data = []
lr = 1e-3

class NeuralNetwork:
    def __init__(self, hidden_dimension):
        # naive random weight initialization
        self.w1 = np.random.rand(hidden_dimension, INPUT_DIMENSION)
        self.w2 = np.random.rand(OUTPUT_DIMENSION, hidden_dimension)

def backward_propagation(nn, train_x, train_y, alpha):
    # First, we need to do a forward propagation while keeping track of
    # the hidden activations because we need them for back propagation.
    # You might also decide to append a 1 to the end of x as the bias term
    # to make back prop easier

    # Calculate the derivative of the loss w.r.t. y with the loss of your choice
    dL_dy = None  # TODO

    # The main course
    # Calculate the derivative of the loss w.r.t. the product of W_2 and z
    # (the input of the activation function of the last layer)
    delta_1 = None  # TODO

    # Calculate the derivative of the loss w.r.t. W_2
    dL_dw2 = None  # TODO

    # Calculate the derivative of the loss w.r.t. z
    dL_dz = None  # TODO

    # Calculate the derivative of the loss w.r.t. the product of W_1 and x
    # (the input of the activation function of the hidden layer)
    delta_2 = None  # TODO

    # Calculate the derivative of the loss w.r.t. W_1
    dL_dw1 = None  # TODO

    # update weights with learning rate alpha
    nn.w2 -= alpha * dL_dw2
    nn.w1 -= alpha * dL_dw1

def train_model(model, train_ys, train_xs, dev_ys, dev_xs, args):
    training_data = list(zip(train_xs, train_ys))
    for _ in range(MAX_EPOCH):
        for (x, y) in training_data:
            # back propagate
            backward_propagation(model, x, y, args.lr)
        # calculate and print the cross-entropy loss (assumes labels in {0, 1}
        # and that forward returns predictions with the same shape as train_ys)
        train_y_hat = forward(model, train_xs)
        loss = -np.sum(train_ys * np.log(train_y_hat)) - np.sum((1 - train_ys) * np.log(1 - train_y_hat))
        print("Total training loss: " + str(loss))
    return model
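
To see the whole loop in action, the sketch below trains a tiny two-layer network with per-example gradient descent. It is one possible way to fill in the steps above, not the intended solution, and it rests on several assumptions: synthetic 2-D data instead of Fashion MNIST, labels in $\{0, 1\}$, the cross-entropy loss, a bias handled by appending a 1 to each input, and the hypothetical helper names `sgd_step` and `total_loss`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step(w1, w2, x, t, alpha):
    # Forward pass, keeping the hidden activation z.
    z = sigmoid(w1 @ x)
    y = sigmoid(w2 @ z)
    # With cross-entropy loss and a sigmoid output, dL/d(W2 z) simplifies to y - t.
    delta_1 = y - t
    dL_dw2 = np.outer(delta_1, z)
    # Propagate back through W2 and the hidden sigmoid.
    dL_dz = w2.T @ delta_1
    delta_2 = dL_dz * z * (1.0 - z)
    dL_dw1 = np.outer(delta_2, x)
    # Gradient descent update for both weight matrices.
    return w1 - alpha * dL_dw1, w2 - alpha * dL_dw2

def total_loss(w1, w2, xs, ts):
    # Summed binary cross entropy over all examples (rows of xs).
    y_hat = sigmoid(w2 @ sigmoid(w1 @ xs.T)).reshape(-1)
    y_hat = np.clip(y_hat, 1e-12, 1 - 1e-12)
    return -np.sum(ts * np.log(y_hat) + (1 - ts) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
# Synthetic data: the label is 1 when the first feature is positive.
xs = rng.normal(size=(200, 2))
ts = (xs[:, 0] > 0).astype(float)
xs = np.hstack([xs, np.ones((200, 1))])   # append 1 as the bias entry

w1 = rng.normal(scale=0.5, size=(4, 3))   # 4 hidden units, 3 inputs
w2 = rng.normal(scale=0.5, size=(1, 4))   # 1 output unit

loss_before = total_loss(w1, w2, xs, ts)
for _ in range(50):
    for x, t in zip(xs, ts):
        w1, w2 = sgd_step(w1, w2, x, t, alpha=0.1)
loss_after = total_loss(w1, w2, xs, ts)
print(loss_before, loss_after)
```

The intermediate quantities follow the same order as the comments in `backward_propagation`: the output-layer delta, the gradient for $W_2$, the derivative with respect to $z$, the hidden-layer delta, and finally the gradient for $W_1$.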