# Stochastic and batch gradient descent



For this exercise, we will reuse the gradient descent code we wrote from scratch for simple linear regression:

$f(x) = \beta_1 \times x + \beta_0$

* Import the following libraries: 
  * NumPy
  * random

In [2]:
import numpy as np 
import random


In [3]:
class Model():
  def __init__(self):
    self.beta_1 = np.random.randn(1)
    self.beta_0 = np.random.randn(1)
  
  def __call__(self, x):
    return self.beta_1 * x + self.beta_0

In [4]:
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
print(diabetes.DESCR)
diabetes_data = diabetes.data
y = diabetes.target

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

* The `diabetes_data` array contains more features than we need. Keep only its third column (index 2, the `bmi` feature) and store it in a `diabetes_X` variable.

In [5]:
# Use only one feature
diabetes_X = diabetes_data[:,2]
diabetes_X[:5]

array([ 0.06169621, -0.05147406,  0.04445121, -0.01159501, -0.03638469])

In [6]:
def mse(y_true,y_pred):
  return np.mean((y_true - y_pred)**2)

In [7]:
# Derivative of the MSE with respect to model.beta_1
def derivative_mse_beta_1(y_pred, y_true, x):
  # x @ (y_pred - y_true) is already the scalar sum of x_i * (y_pred_i - y_true_i)
  return 2/len(y_pred) * (x @ (y_pred - y_true))

In [8]:
# Derivative of the MSE with respect to model.beta_0
def derivative_mse_beta_0(y_pred, y_true):
  return 2/len(y_pred)*(np.sum(y_pred - y_true))
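
Both derivative functions follow from differentiating the MSE, $\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$ with $\hat{y}_i = \beta_1 x_i + \beta_0$, which gives $\frac{2}{n}\sum_i x_i (\hat{y}_i - y_i)$ for $\beta_1$ and $\frac{2}{n}\sum_i (\hat{y}_i - y_i)$ for $\beta_0$. As a sanity check, the sketch below compares these formulas with central finite differences. The synthetic data, the test coefficients and `eps` are assumptions chosen only for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def derivative_mse_beta_1(y_pred, y_true, x):
    return 2 / len(y_pred) * (x @ (y_pred - y_true))

def derivative_mse_beta_0(y_pred, y_true):
    return 2 / len(y_pred) * np.sum(y_pred - y_true)

# Synthetic data and arbitrary coefficients (assumptions, not the diabetes data)
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y_true = rng.normal(size=50)
beta_1, beta_0 = 0.7, -0.3
eps = 1e-6

def loss(b1, b0):
    return mse(y_true, b1 * x + b0)

# Analytic gradients from the formulas above
y_pred = beta_1 * x + beta_0
analytic_beta_1 = derivative_mse_beta_1(y_pred, y_true, x)
analytic_beta_0 = derivative_mse_beta_0(y_pred, y_true)

# Central finite differences of the loss
numeric_beta_1 = (loss(beta_1 + eps, beta_0) - loss(beta_1 - eps, beta_0)) / (2 * eps)
numeric_beta_0 = (loss(beta_1, beta_0 + eps) - loss(beta_1, beta_0 - eps)) / (2 * eps)

print(analytic_beta_1, numeric_beta_1)
print(analytic_beta_0, numeric_beta_0)
```

If the two values on each printed line agree to several decimal places, the analytic derivatives are consistent with the loss function.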

In [9]:
# Define learning rate and a number of iterations 
lr = 0.1
epochs = 1000

We previously coded the gradient descent algorithm as follows. Here we only add two lines that record the value of the loss function at each epoch. With classical gradient descent, one epoch corresponds to exactly one update of the coefficients:

In [10]:
%%time
loss_history = []
model = Model()
for epoch in range(epochs):
  # Calculate the loss function
  current_loss = mse(model(diabetes_X), y)
  loss_history.append(current_loss)

  # Update variables
  model.beta_1 -= lr * derivative_mse_beta_1(model(diabetes_X), y, diabetes_X)
  model.beta_0 -= lr * derivative_mse_beta_0(model(diabetes_X), y)

  # Show updated variables
  if epoch % 100 == 0 or epoch == epochs - 1:
    print("-------------------- Epoch {} --------------------".format(epoch))
    print("Current Loss: {}".format(current_loss))
    print("beta_1 = {}".format(model.beta_1))
    print("beta_0 = {}".format(model.beta_0))

-------------------- Epoch 0 --------------------
Current Loss: 28860.347946206937
beta_1 = [0.51537404]
beta_0 = [30.99004745]
-------------------- Epoch 100 --------------------
Current Loss: 5753.051358511014
beta_1 = [42.5052629]
beta_0 = [152.13348414]
-------------------- Epoch 200 --------------------
Current Loss: 5591.858108112863
beta_1 = [82.63709104]
beta_0 = [152.13348416]
-------------------- Epoch 300 --------------------
Current Loss: 5444.614893626913
beta_1 = [120.99307801]
beta_0 = [152.13348416]
-------------------- Epoch 400 --------------------
Current Loss: 5310.114446749616
beta_1 = [157.65180513]
beta_0 = [152.13348416]
-------------------- Epoch 500 --------------------
Current Loss: 5187.2539789627
beta_1 = [192.68837646]
beta_0 = [152.13348416]
-------------------- Epoch 600 --------------------
Current Loss: 5075.026139611429
beta_1 = [226.17457272]
beta_0 = [152.13348416]
-------------------- Epoch 700 --------------------
Current Loss: 4972.510756491618
b

The model took 54ms to train in total!

In [11]:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=diabetes_X, y=y,
                    mode='markers',
                    name='target'))
fig.add_trace(go.Scatter(x=diabetes_X, y=model(diabetes_X),
                    mode='lines',
                    name='predictions'))
fig.update_layout(
    title="Target vs Predictions",
    xaxis_title="BMI",
    yaxis_title="Diabetes metric"
    )
fig.show()

## Stochastic gradient descent

Let's now implement stochastic gradient descent!
Reproduce the training loop, this time defining:
* `sample_size`: the number of observations randomly selected at each step
* `steps_per_epochs`: the number of steps needed for the model to train on roughly as many observations as the dataset contains
* `stochastic_loss_history`: a list that will contain the loss on the full dataset, recorded once per epoch
* `stochastic_loss_by_step_history`: a list that will contain the loss on the current sample, recorded at each step

⚠️ Don't forget to add `%%time` at the beginning of the cell to measure how long the stochastic gradient descent took to run over 1000 epochs ⚠️ 
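
To make the definitions above concrete, here is a minimal sketch of how the number of steps and the random samples could be computed. It uses only index arithmetic (no model), and the seed and the `seen` set are assumptions added for illustration.

```python
import random

n_observations = 442      # number of rows in the diabetes dataset
sample_size = 100
steps_per_epochs = n_observations // sample_size  # integer division

random.seed(0)            # seed chosen only to make the sketch reproducible
seen = set()
for step in range(steps_per_epochs):
    # random.sample draws without replacement inside one step,
    # but two different steps may pick the same observation
    index = random.sample(range(n_observations), sample_size)
    assert len(set(index)) == sample_size
    seen.update(index)

print("steps per epoch:", steps_per_epochs)
print("distinct observations seen in one epoch:", len(seen))
```

Because each step draws independently, one epoch usually touches fewer distinct observations than `steps_per_epochs * sample_size`.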

In [12]:
%%time
sample_size = 100
steps_per_epochs = int(len(diabetes_X) / sample_size)
stochastic_loss_history = []
stochastic_loss_by_step_history = []
model = Model()
for epoch in range(epochs):
  # Calculate epoch loss 
  current_loss = mse(model(diabetes_X), y)
  stochastic_loss_history.append(current_loss)
  for step in range(steps_per_epochs):
    # define  random sample :
    index = random.sample(range(len(diabetes_X)), sample_size)
    data_sample = diabetes_X[index]
    target_sample = y[index]

    # calculate step loss
    step_loss = mse(model(data_sample), target_sample)
    stochastic_loss_by_step_history.append(step_loss)

    # Update variables
    model.beta_1 -= lr * derivative_mse_beta_1(model(data_sample), target_sample, data_sample)
    model.beta_0 -= lr * derivative_mse_beta_0(model(data_sample), target_sample)

  # Show updated variables
  if epoch % 100 == 0 or epoch == epochs - 1:
    print("-------------------- Epoch {} --------------------".format(epoch))
    print("Current Loss: {}".format(current_loss))
    print("beta_1 = {}".format(model.beta_1))
    print("beta_0 = {}".format(model.beta_0))

-------------------- Epoch 0 --------------------
Current Loss: 28696.48792338513
beta_1 = [0.45872272]
beta_0 = [89.71100572]
-------------------- Epoch 100 --------------------
Current Loss: 5316.445502884823
beta_1 = [156.86806333]
beta_0 = [152.05032075]
-------------------- Epoch 200 --------------------
Current Loss: 4880.068771072977
beta_1 = [289.3453042]
beta_0 = [152.01018549]
-------------------- Epoch 300 --------------------
Current Loss: 4580.65702204444
beta_1 = [398.04223162]
beta_0 = [150.36889519]
-------------------- Epoch 400 --------------------
Current Loss: 4370.358568206217
beta_1 = [489.79152754]
beta_0 = [150.31213182]
-------------------- Epoch 500 --------------------
Current Loss: 4233.809125716643
beta_1 = [565.07898773]
beta_0 = [148.52352764]
-------------------- Epoch 600 --------------------
Current Loss: 4124.947273550393
beta_1 = [628.29912305]
beta_0 = [151.56429351]
-------------------- Epoch 700 --------------------
Current Loss: 4053.855313261767

Let's now compare the loss of classical gradient descent with the loss of stochastic gradient descent in a visualization.

In [13]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=[i for i in range(epochs)][10:], y=loss_history[10:],
              mode="markers+lines",
              name="gradient descent loss"))
fig.add_trace(go.Scatter(x=[i for i in range(epochs)][10:], y=stochastic_loss_history[10:],
              mode="markers+lines",
              name="stochastic gradient descent loss"))
fig.update_layout(
    title="Gradient descent vs. Stochastic gradient descent",
    xaxis_title="epochs",
    yaxis_title="loss"
    )
fig.show()

## Batch gradient descent

Now let's implement batch gradient descent. For this you will need:
* `batch_size`: the number of observations in each batch
* `steps_per_epochs`: the number of batches per epoch, that is, the number of steps needed for the model to train on roughly as many observations as the dataset contains
* `batch_loss_history`: a list that will contain the loss on the full dataset, recorded once per epoch
* `batch_loss_by_step_history`: a list that will contain the loss on the current batch, recorded at each step

⚠️ Don't forget to add `%%time` at the beginning of the cell to measure how long the batch gradient descent took to run over 1000 epochs ⚠️ 
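
The difference from the stochastic version is that the indices are shuffled once per epoch and then cut into disjoint batches. A minimal sketch of that slicing, using only index arithmetic (the seed and the `batches`, `used` and `leftover` names are assumptions for illustration):

```python
import random

n_observations = 442      # number of rows in the diabetes dataset
batch_size = 100
steps_per_epochs = n_observations // batch_size

random.seed(0)            # seed chosen only to make the sketch reproducible
# Sampling all indices without replacement yields a random permutation
index = random.sample(range(n_observations), n_observations)

# Consecutive slices of the permutation form non-overlapping batches
batches = [index[step * batch_size:(step + 1) * batch_size]
           for step in range(steps_per_epochs)]

used = set().union(*batches)
leftover = n_observations - len(used)
print("batches per epoch:", len(batches))
print("observations used:", len(used), "left out:", leftover)
```

With these sizes, every epoch uses 400 distinct observations and leaves out 42, because 442 is not a multiple of 100. Reshuffling at each epoch means a different subset is left out each time.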

In [14]:
%%time
batch_size = 100
steps_per_epochs = int(len(diabetes_X) / batch_size)
batch_loss_history = []
batch_loss_by_step_history = []
model = Model()
for epoch in range(epochs):
  # Calculate epoch loss 
  current_loss = mse(model(diabetes_X), y)
  batch_loss_history.append(current_loss)
  index = random.sample(range(len(diabetes_X)), len(diabetes_X))
  for step in range(steps_per_epochs):
    # define the batch index
    index_step = index[step*batch_size:(step+1)*batch_size]
    # define  random sample :
    data_sample = diabetes_X[index_step]
    target_sample = y[index_step]

    # calculate step loss
    step_loss = mse(model(data_sample), target_sample)
    batch_loss_by_step_history.append(step_loss)

    # Update variables
    model.beta_1 -= lr * derivative_mse_beta_1(model(data_sample), target_sample, data_sample)
    model.beta_0 -= lr * derivative_mse_beta_0(model(data_sample), target_sample)

  # Show updated variables
  if epoch % 100 == 0 or epoch == epochs - 1:
    print("-------------------- Epoch {} --------------------".format(epoch))
    print("Current Loss: {}".format(current_loss))
    print("beta_1 = {}".format(model.beta_1))
    print("beta_0 = {}".format(model.beta_0))

-------------------- Epoch 0 --------------------
Current Loss: 29201.441829758343
beta_1 = [1.06705907]
beta_0 = [89.39302718]
-------------------- Epoch 100 --------------------
Current Loss: 5314.742633617406
beta_1 = [157.93144278]
beta_0 = [153.62603049]
-------------------- Epoch 200 --------------------
Current Loss: 4883.214588460467
beta_1 = [288.42374849]
beta_0 = [151.79959347]
-------------------- Epoch 300 --------------------
Current Loss: 4582.357958309155
beta_1 = [397.52239271]
beta_0 = [152.72515193]
-------------------- Epoch 400 --------------------
Current Loss: 4371.975516389186
beta_1 = [488.90457524]
beta_0 = [151.28920909]
-------------------- Epoch 500 --------------------
Current Loss: 4226.8184894449105
beta_1 = [564.59406064]
beta_0 = [152.18206765]
-------------------- Epoch 600 --------------------
Current Loss: 4124.685497327683
beta_1 = [628.40630428]
beta_0 = [152.85921721]
-------------------- Epoch 700 --------------------
Current Loss: 4053.28371828

Let's compare all three methods in a visualization :

In [15]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=[i for i in range(epochs)][10:], y=loss_history[10:],
              mode="markers+lines",
              name="gradient descent loss"))
fig.add_trace(go.Scatter(x=[i for i in range(epochs)][10:], y=stochastic_loss_history[10:],
              mode="markers+lines",
              name="stochastic gradient descent loss"))
fig.add_trace(go.Scatter(x=[i for i in range(epochs)][10:], y=batch_loss_history[10:],
              mode="markers+lines",
              name="batch gradient descent loss"))
fig.update_layout(
    title="Gradient descent vs. stochastic vs. batch gradient descent",
    xaxis_title="epochs",
    yaxis_title="loss"
    )
fig.show()

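**We can conclude from the graphs that, for the same number of epochs, stochastic and batch gradient descent converge much faster than classical gradient descent. With a sample or batch size of 100 and 442 observations, they update the coefficients 4 times per epoch instead of once.**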
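
To illustrate where the speed-up comes from, here is a self-contained sketch on synthetic data. It is not the diabetes dataset: the data-generating coefficients, noise level, seed and function names are all assumptions. It compares full gradient descent and batch gradient descent after the same number of epochs, then full gradient descent after the same number of updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the exercise data (assumption): 442 observations and
# a small-scale feature, loosely imitating the normalized BMI column.
n = 442
x = rng.normal(scale=0.05, size=n)
y = 900.0 * x + 150.0 + rng.normal(scale=50.0, size=n)

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def update(beta_0, beta_1, xb, yb, lr):
    """One gradient step computed on the observations (xb, yb)."""
    residual = beta_1 * xb + beta_0 - yb
    grad_0 = 2 * np.mean(residual)
    grad_1 = 2 * np.mean(xb * residual)
    return beta_0 - lr * grad_0, beta_1 - lr * grad_1

def full_gradient_descent(n_epochs, lr=0.1):
    beta_0, beta_1 = 0.0, 0.0
    for _ in range(n_epochs):                  # one update per epoch
        beta_0, beta_1 = update(beta_0, beta_1, x, y, lr)
    return mse(y, beta_1 * x + beta_0)

def batch_gradient_descent(n_epochs, batch_size=100, lr=0.1):
    beta_0, beta_1 = 0.0, 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n)             # reshuffle every epoch
        for step in range(n // batch_size):    # 4 updates per epoch here
            idx = order[step * batch_size:(step + 1) * batch_size]
            beta_0, beta_1 = update(beta_0, beta_1, x[idx], y[idx], lr)
    return mse(y, beta_1 * x + beta_0)

loss_gd_200 = full_gradient_descent(200)       # 200 updates
loss_batch_200 = batch_gradient_descent(200)   # 800 updates
loss_gd_800 = full_gradient_descent(800)       # 800 updates

print("full GD, 200 epochs: ", loss_gd_200)
print("batch GD, 200 epochs:", loss_batch_200)
print("full GD, 800 epochs: ", loss_gd_800)
```

Under these assumptions, batch gradient descent should finish 200 epochs with a lower loss than full gradient descent, and with a loss close to what full gradient descent reaches after 800 epochs. This suggests that the faster convergence per epoch comes mainly from the larger number of coefficient updates.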