In [1]:
#Project Description

In [2]:
#Action items

- The **gradient** is a vector that tells us in what direction the weights need to go. More precisely, it tells us how to change the weights to make the loss change fastest. 
- We call our process **gradient descent** because it uses the gradient to descend the loss curve towards a minimum. 
- **Stochastic** means "determined by chance." Our training is stochastic because the minibatches are random samples from the dataset. And that's why it's called SGD! 
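
The three ideas above can be put together in a few lines. What follows is a minimal sketch, not a reference implementation: it assumes a linear model trained on the mean squared error, and the function name `minibatch_sgd`, the synthetic data, and the hyperparameter values are all hypothetical choices made for illustration. Note that the gradient points in the direction of steepest increase of the loss, so the update steps against it.

```python
import numpy as np

def minibatch_sgd(X, y, learning_rate=0.1, batch_size=8, steps=500, seed=0):
    """Fit y ~ X @ w + b by minibatch SGD on the mean squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        # Stochastic: each step sees a random sample (minibatch) of the dataset.
        idx = rng.choice(len(X), size=batch_size, replace=False)
        X_batch, y_batch = X[idx], y[idx]
        error = X_batch @ w + b - y_batch
        # Gradient: how the minibatch loss changes with respect to w and b.
        grad_w = 2.0 * X_batch.T @ error / batch_size
        grad_b = 2.0 * error.mean()
        # Descent: move the weights against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Synthetic, noise-free data (hypothetical values chosen for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([3.0, -2.0]), 0.5
y = X @ true_w + true_b

w, b = minibatch_sgd(X, y)
print("weights:", w, "bias:", b)
```

Because the data here is noise-free, the learned weights should end up very close to `true_w` and `true_b`.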

# Example 1

In [3]:
'''
Sample code
'''
import time
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [4]:
def stochastic_gradient_descent(feature_array, target_array, to_predict, learn_rate_type="invscaling"):
    """ Fits an Ordinary Least Squares Linear Regression with Stochastic Gradient Descent as the optimization algorithm.
        :param feature_array: array with all feature vectors used to train the model
        :param target_array: array with all target values used to train the model
        :param to_predict: feature vector that is not contained in the training set. Used to make a new prediction
        :param learn_rate_type: schedule used to set the learning rate at each iteration.
        :return: None. Prints the runtime, the model coefficients, the number of iterations,
                 the predicted cooking time for to_predict and the R-squared of the model on the training data.
    """
    # Pipeline of transformations to apply before an estimator. First applies Standard Scaling to the feature array.
    # Then, when the model is fitting the data it runs Stochastic Gradient Descent as the optimization algorithm.
    # The estimator is always the last element.
    # Note: with its default settings SGDRegressor minimizes the squared error plus a small L2 penalty.

    start_time = time.time()
    linear_regression_pipeline = make_pipeline(StandardScaler(), SGDRegressor(learning_rate=learn_rate_type))

    linear_regression_pipeline.fit(feature_array, target_array)
    stop_time = time.time()

    print("Total runtime: %.6fs" % (stop_time - start_time))
    print("Algorithm used to set the learning rate: " + learn_rate_type)
    print("Model Coeffiecients: " + str(linear_regression_pipeline[1].coef_))
    print("Number of iterations: " + str(linear_regression_pipeline[1].n_iter_))

    # Make a prediction for a feature vector not in the training set
    prediction = np.round(linear_regression_pipeline.predict(to_predict), 0)[0]
    print("Predicted cooking time: " + str(prediction) + " minutes")

    # R-squared on the training data; score() returns a scalar, so no reshape is needed
    r_squared = np.round(linear_regression_pipeline.score(feature_array, target_array), 2)
    print("R-squared: " + str(r_squared))
    


In [5]:
feature_array = [[500, 80, 30, 10],
                 [550, 75, 25, 0],
                 [475, 90, 35, 20],
                 [450, 80, 20, 25],
                 [465, 75, 30, 0],
                 [525, 65, 40, 15],
                 [400, 85, 33, 0],
                 [500, 60, 30, 30],
                 [435, 45, 25, 0]]

In [6]:
target_array = [17, 11, 21, 23, 22, 15, 25, 18, 16]

In [7]:
to_predict = [[510, 50, 35, 10]]


In [8]:
stochastic_gradient_descent(feature_array, target_array, to_predict)

Total runtime: 0.001998s
Algorithm used to set the learning rate: invscaling
Model Coeffiecients: [-3.44034236  1.64723444  0.28599174  1.10821407]
Number of iterations: 249
Predicted cooking time: 13.0 minutes
R-squared: 0.9


In [9]:
stochastic_gradient_descent(feature_array, target_array, to_predict)

Total runtime: 0.002000s
Algorithm used to set the learning rate: invscaling
Model Coeffiecients: [-3.44226981  1.64672679  0.2866046   1.10902962]
Number of iterations: 248
Predicted cooking time: 13.0 minutes
R-squared: 0.9


In [10]:
stochastic_gradient_descent(feature_array, target_array, to_predict, learn_rate_type="adaptive")

Total runtime: 0.001999s
Algorithm used to set the learning rate: adaptive
Model Coeffiecients: [-3.49884884  1.64454487  0.3091154   1.15922034]
Number of iterations: 97
Predicted cooking time: 13.0 minutes
R-squared: 0.91


'''
With the limitations of Gradient Descent in mind, Stochastic Gradient Descent emerged as a way to tackle performance issues and speed up the convergence in large datasets.

Stochastic Gradient Descent is a probabilistic approximation of Gradient Descent. It is an approximation because, at each step, the algorithm calculates the gradient for one observation picked at random, instead of calculating the gradient for the entire dataset.
'''
src: https://towardsdatascience.com/stochastic-gradient-descent-explained-in-real-life-predicting-your-pizzas-cooking-time-b7639d5e6a32
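
To make the word "approximation" in the quote concrete, here is a small sketch that compares the gradient computed on the whole dataset with the gradient computed on one randomly picked observation. It assumes a linear model without intercept and a squared-error loss; the helper names `full_gradient` and `single_observation_gradient` and the random data are hypothetical and used only for illustration.

```python
import numpy as np

def full_gradient(w, X, y):
    """Gradient of the mean squared error over the whole dataset."""
    return 2.0 * X.T @ (X @ w - y) / len(X)

def single_observation_gradient(w, X, y, i):
    """Gradient of the squared error for observation i only."""
    return 2.0 * X[i] * (X[i] @ w - y[i])

# Random data, used only to have something to differentiate.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)
w = np.zeros(3)

i = rng.integers(len(X))
print("full gradient:         ", full_gradient(w, X, y))
print("one random observation:", single_observation_gradient(w, X, y, i))

# Averaging the single-observation gradients over every observation
# recovers the full gradient, so a randomly picked one is a noisy
# but unbiased estimate of it.
average = np.mean(
    [single_observation_gradient(w, X, y, j) for j in range(len(X))], axis=0
)
print("average of all single-observation gradients:", average)
```

A single-observation gradient is much cheaper to compute than the full one, which is where the speed-up on large datasets comes from; the price is a noisier update direction.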

# Example 2

In [17]:
def gradient_loop(runs=3):
    """ Repeatedly computes the gradient of a function
        Computes the gradient given the starting points and then uses the result of the gradient to feed the next iteration, with new points.
        Prints out the result of the function at each iteration
        :param runs: number of iterations to compute
    """
    # starting points
    x = np.array([1, 2, 3])

    # quadratic function, a parabola
    y = x ** 2

    for run in range(runs):
        print("Iter " + str(run) + ": Y=" + str(y))
        # compute the first derivative numerically (finite differences with unit spacing)
        x = np.gradient(y, 1)
        # update the function output
        y = x ** 2


In [18]:
gradient_loop(runs=10)

Iter 0: Y=[1 4 9]
Iter 1: Y=[ 9. 16. 25.]
Iter 2: Y=[49. 64. 81.]
Iter 3: Y=[225. 256. 289.]
Iter 4: Y=[ 961. 1024. 1089.]
Iter 5: Y=[3969. 4096. 4225.]
Iter 6: Y=[16129. 16384. 16641.]
Iter 7: Y=[65025. 65536. 66049.]
Iter 8: Y=[261121. 262144. 263169.]
Iter 9: Y=[1046529. 1048576. 1050625.]
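
The values above grow at every iteration because the loop feeds the derivative back in as the next input rather than stepping against it. For comparison, here is a minimal sketch of a descent update on the same parabola. It assumes the analytic derivative 2x instead of `np.gradient`, and the function name `descend_parabola` and the step size of 0.1 are hypothetical choices for illustration.

```python
import numpy as np

def descend_parabola(x_start, learning_rate=0.1, runs=10):
    """Gradient descent on y = x**2, applied to each starting point independently."""
    x = np.asarray(x_start, dtype=float)
    history = [x ** 2]
    for _ in range(runs):
        gradient = 2.0 * x                 # derivative of x**2
        x = x - learning_rate * gradient   # step against the gradient
        history.append(x ** 2)
    return x, history

x_final, history = descend_parabola([1, 2, 3], runs=10)
for run, y in enumerate(history):
    print("Iter " + str(run) + ": Y=" + str(y))
```

With this step size each update multiplies x by 0.8, so Y shrinks toward the minimum at 0 instead of growing.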


# Example 3: Neural Network-Based SGD


In addition to the training data we have, we need two more things:
- A **loss function** that measures how good the network's predictions are. It measures the disparity between the target's true value and the value the model predicts.
  -- A common loss function for regression problems is the **mean absolute error**, or **MAE**. For each prediction y_pred, MAE measures the disparity from the true target y_true by an absolute difference abs(y_true - y_pred). The MAE over a dataset is the mean of these absolute differences.
- An **optimizer** that can tell the network how to change its weights. The optimizer is an algorithm that adjusts the weights to minimize the loss.
  -- Virtually all of the optimization algorithms used in deep learning belong to a family called **stochastic gradient descent**. They are iterative algorithms that train a network in steps. One step of training goes like this:

    * Sample some training data and run it through the network to make predictions.
    * Measure the loss between the predictions and the true values.
    * Finally, adjust the weights in a direction that makes the loss smaller.
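
The loss function and the three-part training step described above can be sketched in plain NumPy. This is an illustration under stated assumptions, not how a deep learning library implements it: the "network" is a single line `y_pred = w * x + b`, the weight update uses a subgradient of the absolute error, and the names `mae` and `training_step`, the data, and the hyperparameters are all hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the mean of abs(y_true - y_pred)."""
    return np.mean(np.abs(y_true - y_pred))

def training_step(w, b, x, y, rng, batch_size=16, learning_rate=0.05):
    """One SGD step for the line y_pred = w * x + b under the MAE loss."""
    # 1. Sample some training data and run it through the model.
    idx = rng.choice(len(x), size=batch_size, replace=False)
    x_batch, y_batch = x[idx], y[idx]
    y_pred = w * x_batch + b
    # 2. Measure the loss between the predictions and the true values.
    loss = mae(y_batch, y_pred)
    # 3. Adjust the weights in a direction that makes the loss smaller.
    #    sign(y_pred - y_true) is a subgradient of the absolute error.
    sign = np.sign(y_pred - y_batch)
    w = w - learning_rate * np.mean(sign * x_batch)
    b = b - learning_rate * np.mean(sign)
    return w, b, loss

# Hypothetical data lying exactly on the line y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0

w, b = 0.0, 0.0
initial_loss = mae(y, w * x + b)
for _ in range(500):
    w, b, batch_loss = training_step(w, b, x, y, rng)
final_loss = mae(y, w * x + b)
print("MAE before training:", initial_loss)
print("MAE after training: ", final_loss)
```

Because the step size is fixed, the weights keep jittering around the best fit rather than settling exactly on it; optimizers used in practice typically shrink or adapt the step size over time.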



In [19]:
from IPython.display import Image
Image(url='img/batch-sgd.gif')  

Each iteration's sample of training data is called a minibatch (or often just "batch"), while a complete round of the training data is called an epoch. The number of epochs you train for is how many times the network will see each training example.
- The pale red dots depict the entire training set, while the solid red dots are the minibatches. 
- Every time SGD sees a new minibatch, it will shift the weights (w the slope and b the y-intercept) toward their correct values on that batch. Batch after batch, the line eventually converges to its best fit. 
- We can see that the loss gets smaller as the weights get closer to their true values.
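
The relationship between minibatches, epochs, and training steps can be checked with a short sketch. The generator `iterate_minibatches` and the sizes used below are hypothetical; the sketch assumes the data is reshuffled at the start of each epoch and that the last batch of an epoch may be smaller than the others.

```python
import numpy as np

def iterate_minibatches(n_examples, batch_size, epochs, seed=0):
    """Yield (epoch, indices) pairs; every example appears once per epoch."""
    rng = np.random.default_rng(seed)
    for epoch in range(epochs):
        order = rng.permutation(n_examples)  # reshuffle at the start of each epoch
        for start in range(0, n_examples, batch_size):
            yield epoch, order[start:start + batch_size]

seen = np.zeros(10, dtype=int)  # how often each training example was used
steps = 0
for epoch, batch in iterate_minibatches(n_examples=10, batch_size=4, epochs=3):
    seen[batch] += 1
    steps += 1

print("training steps:", steps)
print("times each example was seen:", seen)
```

With 10 examples, a batch size of 4, and 3 epochs, this takes 9 training steps (3 per epoch), and every example is seen exactly 3 times, once per epoch.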