# Gradient Descent
<hr style="border:2px solid black">

![gradient_descent.png](attachment:gradient_descent.png)

## 1. Introduction

### 1.1 What is Gradient Descent?
- optimization algorithm commonly used to train machine learning models
- iteratively updates the parameters $\bf{w}$ using the gradient of the loss function at every step
- the model learns from training data over time, ultimately minimizing the loss function $L(\bf{w})$

### 1.2 The Algorithm

**Learning rate** 

> The learning rate $\alpha$, also referred to as the step size, determines the size of the steps taken towards the minimum.

![learning_rate.png](attachment:learning_rate.png)

**Parameter update equation**

>$$
w_j \rightarrow w_j - \alpha \frac{\partial L(\bf{w})}{\partial w_j}, \quad j=1,2,\ldots,N
$$
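
As a rough illustration of the update rule and of the role of $\alpha$, the sketch below applies it to a one-parameter toy loss $L(w) = (w - 3)^2$, whose derivative is $2(w - 3)$. The toy loss, the starting point, the number of steps and the three learning rates are assumptions chosen for this example; they are not taken from the notebook's data.

```python
def gradient_step(w, grad, alpha):
    """Apply one update: w -> w - alpha * dL/dw."""
    return w - alpha * grad


def run_descent(alpha, w_start=0.0, n_steps=25):
    """Minimize the toy loss L(w) = (w - 3)**2 with a fixed learning rate."""
    w = w_start
    for _ in range(n_steps):
        grad = 2 * (w - 3)  # derivative of the toy loss
        w = gradient_step(w, grad, alpha)
    return w


# Hypothetical learning rates: small, moderate, and too large
w_small = run_descent(alpha=0.01)
w_moderate = run_descent(alpha=0.3)
w_large = run_descent(alpha=1.1)

print(w_small, w_moderate, w_large)
```

With these assumed values, the small rate is still far from the minimum at $w = 3$ after 25 steps, the moderate rate has effectively reached it, and the large rate overshoots further on every step and diverges.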

### 1.3 Types of Gradient Descent

**Batch gradient descent**

>Batch gradient descent sums the error over every point in the training set and updates the model only after all training examples have been evaluated. One such pass over the full training set is referred to as a training epoch.

**Stochastic gradient descent**

>Stochastic gradient descent (SGD) processes the training examples one at a time and updates the model parameters after each individual example, so a single epoch involves as many updates as there are examples.

**Mini-batch gradient descent**

>Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic gradient descent. It splits the training dataset into small batches and performs an update on each of those batches.
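
One way to see the difference between the three variants is to count how many parameter updates each performs per epoch. The sketch below is only an illustration: the helper `iterate_batches`, the dataset size of 1000 and the mini-batch size of 32 are assumptions made for the example.

```python
import numpy as np


def iterate_batches(n_samples, batch_size, rng):
    """Yield index arrays that together cover one epoch of shuffled data."""
    indices = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


rng = np.random.default_rng(0)  # seed chosen arbitrarily
n_samples = 1000

# Number of parameter updates per epoch for each variant
updates_batch = sum(1 for _ in iterate_batches(n_samples, n_samples, rng))
updates_sgd = sum(1 for _ in iterate_batches(n_samples, 1, rng))
updates_mini = sum(1 for _ in iterate_batches(n_samples, 32, rng))

print(updates_batch, updates_sgd, updates_mini)
```

Batch gradient descent makes a single update per epoch, SGD makes one per example, and mini-batch gradient descent sits in between (the last batch is smaller when the batch size does not divide the dataset size evenly).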

### 1.4 Challenges with Gradient Descent

**Local minima and saddle points**

>For convex problems, gradient descent can find the global minimum with ease. For nonconvex problems, it can get stuck in a local minimum or at a saddle point, where the gradient is (close to) zero, and so fail to reach the global minimum, where the model achieves the best results.

**Vanishing and exploding gradients**

> If the gradient becomes too small (vanishing) or too large (exploding) as it is propagated backwards, training becomes difficult in deeper neural networks, particularly recurrent neural networks.
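
A rough numerical illustration of why depth matters: during backpropagation, the gradient that reaches an early layer is, loosely speaking, a product of one factor per layer. The sketch below assumes every layer contributes the same constant factor, which is a deliberate simplification; the factors 0.5 and 1.5 and the depth of 50 are hypothetical values.

```python
def backpropagated_scale(factor, n_layers):
    """Scale of a gradient after passing through n_layers identical factors."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= factor
    return scale


shrinking = backpropagated_scale(0.5, 50)  # factors below 1: gradient vanishes
growing = backpropagated_scale(1.5, 50)    # factors above 1: gradient explodes

print(f"{shrinking:.2e}, {growing:.2e}")
```

Even modest per-layer factors lead to gradients that are many orders of magnitude too small or too large once enough layers (or time steps, in a recurrent network) are chained together.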

<hr style="border:2px solid black">

## 2. Gradient Descent from Scratch

In [1]:
import numpy as np
import matplotlib.pyplot as plt 

### Step 0. Generation of input data

In [2]:
Xtrue = [np.random.randint(1, 100)*0.01 for x in range(1000)]
len(Xtrue)

1000

### Step 1. Generating the theoretical line

$$
y = w_0 + w_1x + \epsilon
$$

where $\epsilon$ is normally distributed random noise. To simulate data we have to choose some true values for $w_0$ and $w_1$.

In [None]:
SLOPE = 2.0
INTERCEPT = -1.5

ytrue = [INTERCEPT + SLOPE*x + np.random.normal(0, 0.2) for x in Xtrue]
len(ytrue)
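
The same data can also be generated in a vectorized and reproducible way. This is a sketch of an alternative, not a replacement for the cells above: the names `X_arr` and `y_arr` and the seed are assumptions introduced for the example, and the rest of the notebook keeps using `Xtrue` and `ytrue`.

```python
import numpy as np

rng = np.random.default_rng(42)  # the seed is an arbitrary choice

SLOPE = 2.0
INTERCEPT = -1.5

# Same ranges as above: x in {0.01, ..., 0.99}, Gaussian noise with std 0.2
X_arr = rng.integers(1, 100, size=1000) * 0.01
y_arr = INTERCEPT + SLOPE * X_arr + rng.normal(0, 0.2, size=1000)

print(X_arr.shape, y_arr.shape)
```

Fixing the seed makes the noise, and therefore the fitted parameters, identical from run to run.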

$$
\hat{y} = w_0 + w_1x
$$

In [None]:
ypred = [INTERCEPT + SLOPE*x for x in Xtrue]
len(ypred)

#### Generating the theoretical line with a for loop and in a function.

In [None]:
def make_line(xdata, slope, intercept):
    """This function takes x values and makes a line with the
    given intercept and slope."""

    ypred = []

    for x in xdata:
        line = intercept + slope*x 
        ypred.append(line)
    
    return ypred
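
For comparison, the same line can be computed without an explicit loop. The sketch below assumes NumPy arrays are acceptable in place of Python lists; `make_line_np` is a hypothetical name used only here.

```python
import numpy as np


def make_line_np(xdata, slope, intercept):
    """Vectorized variant of make_line: returns a NumPy array of predictions."""
    return intercept + slope * np.asarray(xdata)


line_demo = make_line_np([0.0, 0.5, 1.0], slope=2.0, intercept=-1.5)
print(line_demo)
```

For the three x values 0, 0.5 and 1, this gives the predictions -1.5, -0.5 and 0.5.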

In [None]:
plt.scatter(Xtrue,  ytrue, s=0.8)
plt.plot(Xtrue, make_line(xdata = Xtrue, slope = 2, intercept = -1.5));

### Step 2. The loss function

- We, as humans, can tell visually whether or not a line is "good" or "bad".
- The computer, which obviously can't "see" the picture, needs some kind of measure / number to let it know how "good" or "bad" its guess is.

In [None]:
def mse(ytrue, ypred): 
    """This function calculates the mean squared
    error between the true and the predicted data.
    This will be our loss function."""

    vals = []

    for i in range(len(ytrue)):

        vals.append((ytrue[i]-ypred[i])**2)
    
    error = sum(vals)/len(ytrue)

    return error
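
A small worked example can make the loss concrete. The sketch below uses a vectorized variant (`mse_np` is a hypothetical name) on three hand-picked points, so the result can be checked by hand.

```python
import numpy as np


def mse_np(ytrue, ypred):
    """Mean squared error computed with NumPy arrays."""
    ytrue = np.asarray(ytrue, dtype=float)
    ypred = np.asarray(ypred, dtype=float)
    return float(np.mean((ytrue - ypred) ** 2))


# Worked example: the errors are 0, 1 and -2, so MSE = (0 + 1 + 4) / 3
example = mse_np([1.0, 2.0, 3.0], [1.0, 1.0, 5.0])
print(example)
```

Squaring the errors means that large deviations are penalized more heavily than small ones, and that positive and negative errors cannot cancel out.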

#### What we did so far:

- found a way to make a line given any slope and intercept
- based on the resulting line, we can calculate the error between that line and the actual points
    - so this helps us determine how good or bad our line is (e.g. compared to a previous attempt)

#### What do we need to do now?

- now that we can measure the "goodness" or "badness" of a line for some initial guess at the slope and intercept, we need a way to figure out how to change the slope and intercept so that our error gets lower!!

### Step 3. Calculating the gradient.

In [None]:
def calc_gradient(Xdata, ytrue, slope, intercept):
    """Approximate the gradient of the MSE loss with respect to
    slope and intercept using forward finite differences."""

    # small perturbation used for the finite differences
    dw = 0.00001
    ypred = make_line(Xdata, slope, intercept)  # predictions for the current guess
    loss = mse(ytrue, ypred)

    # tweak the first parameter: only the slope changes
    slope_change = slope + dw
    ypred_slope = make_line(Xdata, slope_change, intercept)
    deriv_slope = (mse(ytrue, ypred_slope) - loss) / dw

    # tweak the second parameter: only the intercept changes
    intercept_change = intercept + dw
    ypred_intercept = make_line(Xdata, slope, intercept_change)
    deriv_intercept = (mse(ytrue, ypred_intercept) - loss) / dw

    return [deriv_slope, deriv_intercept]  # both partial derivatives as a list

The function above basically gives us directionality: 
- Would the error go up or down if I increased slope ever so slightly?
- Would the error go up or down if I increased intercept ever so slightly?
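
For a straight line and the MSE loss, the partial derivatives can also be written down exactly: $\partial L / \partial w_1 = -\frac{2}{n}\sum_i x_i (y_i - \hat{y}_i)$ and $\partial L / \partial w_0 = -\frac{2}{n}\sum_i (y_i - \hat{y}_i)$. The sketch below implements these formulas as an alternative to the numerical approximation; the name `calc_gradient_analytic` and the three-point demo data are assumptions made for the example.

```python
import numpy as np


def calc_gradient_analytic(xdata, ytrue, slope, intercept):
    """Exact MSE gradient for the line y = intercept + slope * x.

    Returns [dL/dslope, dL/dintercept], the same order as calc_gradient.
    """
    x = np.asarray(xdata, dtype=float)
    y = np.asarray(ytrue, dtype=float)
    residual = y - (intercept + slope * x)
    deriv_slope = -2.0 * np.mean(x * residual)
    deriv_intercept = -2.0 * np.mean(residual)
    return [float(deriv_slope), float(deriv_intercept)]


# Tiny hand-made data set lying exactly on y = 1 + 2x (assumed for the example)
x_demo = [0.0, 1.0, 2.0]
y_demo = [1.0, 3.0, 5.0]

grad_at_truth = calc_gradient_analytic(x_demo, y_demo, slope=2.0, intercept=1.0)
grad_off = calc_gradient_analytic(x_demo, y_demo, slope=0.0, intercept=0.0)

print(grad_at_truth, grad_off)
```

At the true parameters both derivatives are zero. At slope 0 and intercept 0 both are negative, which says that increasing either parameter would lower the error.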

### Now time for the algorithm!!

### Step 4. Implementing the Gradient Descent Algorithm

Repeat the following steps `MAX_ITER` times.
In each iteration of the loop:

      1. Calculate the gradient of the loss function
         with respect to each model parameter.
      2. If the gradient becomes really close to zero,
         break out of the loop.
      3. For each model parameter,
         - calculate the updated parameter value
           using the formula from above.
         - overwrite the value with its updated value.
      4. Print all variables to check if they
         are converging to their expected values.

In [None]:
!pip install imageio

In [None]:
import time
import imageio
import math

images = []

SLOPE_START = -8.0
INTERCEPT_START = 2.0

# Parameters for the gradient descent:
# learning rate, maximum number of iterations, and convergence threshold
# (all chosen arbitrarily).
LR = 0.2
MAX_ITER = 600
THRESHOLD = 0.01

# THRESHOLD is the gradient magnitude below which we accept that we have reached the minimum.

# We hope that this algorithm will eventually take us 
# to the best possible parameters:
# i.e. slope ~ 2.0 and intercept ~ -1.5

for i in range(MAX_ITER):
    
    time.sleep(0.2)
        
    """1. In each iteration of the loop, calculate the gradient 
    of your loss function."""
    
    deriv_slope, deriv_intercept = calc_gradient(Xtrue, ytrue, SLOPE_START, INTERCEPT_START)
        
    """2. If the gradient becomes smaller than some pre-determined 
    threshold value, break out of the loop."""
        
    if abs(deriv_slope) <= THRESHOLD and abs(deriv_intercept) <= THRESHOLD:
        
        print("CONVERGED: gradient is below the threshold, slope=", SLOPE_START, "intercept=", INTERCEPT_START)
        print("Derivative", deriv_slope, deriv_intercept, THRESHOLD)
        break
   
    else:        
        
        """3. For each parameter, multiply the corresponding partial derivative by the
        learning rate, then negate it. Add the resulting product to
        the previous value of the parameter to get the updated parameter value."""
    
        SLOPE_NEW = (-deriv_slope * LR + SLOPE_START)
        INTERCEPT_NEW = (-deriv_intercept * LR + INTERCEPT_START)

        """Then overwrite the value of each parameter with its updated value."""
    
        SLOPE_START = SLOPE_NEW
        INTERCEPT_START = INTERCEPT_NEW
  
    if i % 20 == 0:
        
        print(f"Iteration Number: {i}, Deriv_Slope: {deriv_slope}, "
              f"SLOPE: {SLOPE_NEW:.3f}, INTERCEPT: {INTERCEPT_NEW:.3f}")
        
        
        plt.figure()
        ypred = make_line(Xtrue, SLOPE_START, INTERCEPT_START)
        plt.plot(Xtrue, ypred)
        plt.title(f'Iteration Number: {i}, Error: {mse(ytrue, ypred)}')
        plt.scatter(Xtrue, ytrue, s=0.8)
        filename = f'iter_{i}.png'
        plt.savefig(filename)
        plt.close()  # free the figure once it has been saved
        images.append(imageio.imread(filename))
        
imageio.mimsave('output.gif', images, fps=1)

print('Printed GIF!!!')
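
As a sanity check on the result of the loop, the parameters found by gradient descent can be compared with the closed-form least-squares fit. The sketch below uses `np.polyfit` on stand-in data generated with the same slope, intercept and noise level as in Step 1; the seed and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, stand-in data for the example
x = rng.integers(1, 100, size=1000) * 0.01
y = -1.5 + 2.0 * x + rng.normal(0, 0.2, size=1000)

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line
slope_ls, intercept_ls = np.polyfit(x, y, deg=1)

print(f"slope={slope_ls:.3f}, intercept={intercept_ls:.3f}")
```

In the notebook itself, passing `Xtrue` and `ytrue` to `np.polyfit` and comparing the output with the final `SLOPE_START` and `INTERCEPT_START` should show how close the iterative solution has come to the least-squares optimum.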
        



<hr style="border:2px solid black">

## References

- [Gradient Descent, by IBM Cloud Education](https://www.ibm.com/cloud/learn/gradient-descent)

- [Visualizing the gradient descent method](https://scipython.com/blog/visualizing-the-gradient-descent-method/)

- [Andrew Ng's Lectures](https://www.youtube.com/watch?v=kHwlB_j7Hkc&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=4)