## Gradient Descent

In this notebook we will implement gradient descent for a single neuron. To keep things as simple as possible, we assume the neuron has no bias (`b = 0`) and no activation function. The image below shows the neuron we will be coding and the equations needed to update the weight using gradient descent.

<img src="Gradient Descent Notebook-1.jpg" width=600 align="center">

The recipe we are following is:
1. make a prediction for our single sample
2. calculate the loss for this single sample
3. calculate the value of the derivative for this single sample
4. use the derivative and the learning rate to update the value of the weight
5. repeat until the loss is a minimum (that is, stops decreasing)
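
For reference, the steps above can be written out as equations. This is a sketch derived from the code in the next section (squared-error loss, no bias), using the chain rule for the derivative; `lr` denotes the learning rate.

$$
\hat{y} = w\,x, \qquad
L = (y - \hat{y})^2, \qquad
\frac{dL}{dw} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dw} = -2\,(y - \hat{y})\,x, \qquad
w \leftarrow w - \text{lr} \cdot \frac{dL}{dw}
$$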

#### Stochastic Gradient Descent: A single sample

In [None]:
y = 4 # correct answer
x = 2 # we have one sample and one feature 
w = 0.1 # initial value for our weight

lr = 0.0127 # the learning rate

n_iterations = 100 # number of iterations of gradient descent, that is, number of times we update w

for i in range(n_iterations):
    y_hat = w * x # make prediction using current value for w
    L = (y - y_hat)**2  # calculate the loss (squared error; with one sample this equals the mean squared error)
    dL_dw = -2 * (y - y_hat)*x   # calculate derivative needed to update the weight (see image above) 
    w = w - lr*dL_dw # update the weight
    # print progress; to print only every k iterations, change the 1 in `i%1` to k
    if (i%1 == 0):
        print(f"iteration: {i} weight: {w:.4f}  prediction: {y_hat:.4f}  Loss: {L:.8f}")
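
As a sanity check on the derivative used in the loop above, the following sketch compares the analytic expression with a central finite difference. The helper names `loss` and `analytic_grad` and the step size `eps` are illustrative assumptions and are not part of the original notebook.

```python
def loss(w, x, y):
    """Squared error for a single sample with prediction w * x."""
    return (y - w * x) ** 2


def analytic_grad(w, x, y):
    """Derivative of the squared error with respect to w."""
    return -2 * (y - w * x) * x


x, y, w = 2, 4, 0.1  # same sample and initial weight as above
eps = 1e-6  # small step for the finite difference (assumed value)

# central difference: (L(w + eps) - L(w - eps)) / (2 * eps)
numeric_grad = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

print(f"analytic: {analytic_grad(w, x, y):.6f}  numeric: {numeric_grad:.6f}")
```

If the two values agree to several decimal places, the derivative in the update step is very likely correct.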

#### Stochastic Gradient Descent: Multiple samples

We will now repeat what we did above, this time with training data containing 5 samples.

In [None]:
X = [1, 2, 3, 3, 4]  # 5 samples with 1 feature value each
Y = [2, 1, 4, 2, 5] # correct answers for each sample

Let's plot the data so we can see what we are dealing with. 

In [None]:
import matplotlib.pyplot as plt

plt.scatter(X, Y)

Using the plot above, convince yourself that our neuron cannot reach `L = 0` as it could with the single sample. That is, no straight line passes through all of the points, which is what a loss of 0 would require. (Two samples share `x = 3` but have different targets, so no function of `x` alone can fit both.)
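
A quick way to check this numerically, as a sketch: for a model `y_hat = w * x` with no bias, the weight that minimises the total squared error has the closed form `w = sum(x*y) / sum(x*x)`. The names `w_best` and `min_total_loss` are illustrative and do not appear elsewhere in the notebook.

```python
X = [1, 2, 3, 3, 4]
Y = [2, 1, 4, 2, 5]

# closed-form least-squares weight for a line through the origin
w_best = sum(x * y for x, y in zip(X, Y)) / sum(x * x for x in X)

# total squared error at that weight; no other value of w can do better
min_total_loss = sum((y - w_best * x) ** 2 for x, y in zip(X, Y))

print(f"best w: {w_best:.4f}  minimum total loss: {min_total_loss:.4f}")
# prints: best w: 1.0769  minimum total loss: 4.7692
```

The minimum total loss is well above 0, so gradient descent on this data should level off near that value rather than reach 0.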

In the following code we will apply stochastic gradient descent (updating the weight after each sample in the training data) to the 5 training samples we created above.

In [None]:
w = 0.1 # initial value for our weight
lr = 0.0013 # the learning rate

n_epochs = 100 # number of passes through the entire training data
epoch_counter = 0

for i in range(n_epochs):
    epoch_counter = epoch_counter + 1 # use this to keep track of how many times we have gone through entire training data
    Total_L = 0 # reset Total_L (sum of the per-sample losses) to 0 at the start of each epoch
    for x, y in zip(X, Y): # iterate through the training data one sample at a time
        y_hat = w * x # make prediction for current sample
        L = (y - y_hat)**2  # calculate loss for current sample
        Total_L = Total_L + L # add the loss for current sample to total loss
        dL_dw = -2 * (y - y_hat)*x   # calculate derivative for current sample
        w = w - lr*dL_dw # update the weight using the derivative for the current sample and the learning rate
    print(f"Epoch {epoch_counter} weight: {w:.8f}  Total Loss: {Total_L:.6f}")

#### Compare with scikit-learn

Our neuron is really just performing linear regression since we have no activation function. So, let's perform linear regression on the data and see what value scikit-learn gives us for `w`. 

In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np

X_arr = np.array(X).reshape(-1, 1)
Y_arr = np.array(Y) 

lin_reg = LinearRegression(fit_intercept=False) # no intercept, to match our neuron with b = 0
lin_reg.fit(X_arr, Y_arr)

lin_reg.coef_

#### Batch Gradient Descent

We will now implement batch gradient descent, where we update the weight only after processing all of the samples. The recipe is as follows:
1. make predictions for all samples in our training data
2. calculate the average loss for all samples in our training data
3. calculate the average value of the derivative for all samples in our training data
4. update the value of the weight
5. repeat until the loss is a minimum (that is, stops decreasing)
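
Written as equations, the recipe above looks as follows. This is a sketch that assumes the same squared-error loss and bias-free neuron as before, with `n` the number of samples and `lr` the learning rate.

$$
\hat{y}_i = w\,x_i, \qquad
L(w) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
\frac{dL}{dw} = \frac{1}{n}\sum_{i=1}^{n} -2\,(y_i - \hat{y}_i)\,x_i, \qquad
w \leftarrow w - \text{lr} \cdot \frac{dL}{dw}
$$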


In [None]:
X_new = np.array([1, 2, 3, 3, 4]).reshape(-1, 1) 
Y_new = np.array([2, 1, 4, 2, 5]).reshape(-1, 1) 

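w = 0.1 # initial value for our weight
lr = 0.013 # the learning rate

n_epochs = 150 # number of weight updates (one per pass through the training data)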

for i in range(n_epochs):
    Y_hat = w*X_new # make predictions for all samples (remember X_new is an array with 5 samples)
    sample_losses = (Y_new - Y_hat)**2 # calculate the loss for all 5 samples (result is an array with 5 values)
    avg_loss = sample_losses.mean() # calculate the average loss for the training data
    dLoss_dw = (-2 * np.multiply((Y_new - Y_hat), X_new)).mean() # calculate the average of the derivatives
    w = w - lr*dLoss_dw # update the weight using the average value of the derivative
    print(f"Epoch {i + 1} weight: {w:.4f}  AVG Loss: {avg_loss:.4f}")
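
The loop above runs for a fixed number of epochs. Step 5 of the recipe ("repeat until the loss stops decreasing") could instead be sketched with a simple stopping rule, as below. The tolerance `tol` and the cap `max_epochs` are assumed values chosen for illustration, not settings from the original notebook.

```python
import numpy as np

X_new = np.array([1, 2, 3, 3, 4]).reshape(-1, 1)
Y_new = np.array([2, 1, 4, 2, 5]).reshape(-1, 1)

w = 0.1
lr = 0.013
max_epochs = 10_000  # safety cap so the loop always terminates (assumed value)
tol = 1e-10          # stop when the loss improves by less than this (assumed value)

prev_loss = float("inf")
for epoch in range(1, max_epochs + 1):
    Y_hat = w * X_new                              # predictions for all samples
    avg_loss = ((Y_new - Y_hat) ** 2).mean()       # average loss over the training data
    if prev_loss - avg_loss < tol:                 # loss has stopped decreasing
        break
    grad = (-2 * (Y_new - Y_hat) * X_new).mean()   # average derivative
    w = w - lr * grad                              # update the weight
    prev_loss = avg_loss

print(f"stopped after {epoch} epochs  weight: {w:.4f}  AVG Loss: {avg_loss:.4f}")
```

A smaller `tol` trades extra epochs for a weight closer to the minimum; the cap guards against a learning rate that is too large for the loss to settle.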