![uc3m](img/uc3m.jpg)

# The Stochastic Gradient Descent and Least Squares

<a href="http://www.est.uc3m.es/nogales" target="_blank">Javier Nogales</a>

- If you rewrite the model for linear regression, you get an additional 0.25 points
- The model can be changed to any ML model
- Deadline: 11.12

## Summary

The general framework in machine learning and statistics is:

Data = Model + Noise

In Statistics, the Model is usually known up to some parameters, whereas in Machine Learning the Model is learnt through the data.

Consider the following linear model and data:
$$y=\beta_{0}+\beta_{1}x_1+\dots+\beta_{p}x_p+e$$
where $e\sim N(0,\sigma^{2})$.

We will learn how to apply the Stochastic Gradient method to the least-squares problem.

<img src="img/SG.png" width="450">



### Generate random data from the previous model



In [1]:
%matplotlib notebook
import numpy as np
import time
import random

nsample = 1000 # number of observations
nvariables = 10 # number of explanatory variables

# Build the design matrix following the formula above
X0 = np.ones([nsample,1]) # the first column is all ones, for beta_0
X1 = np.random.uniform(0,10,([nsample,nvariables])) # uniform random numbers in [0, 10)
X = np.concatenate([X0, X1],axis=1)

beta=np.random.randint(-10,10,size=([nvariables+1,1])) # random integers from -10 to 9
error=np.random.normal(0,6,(nsample,1)) # normal random error with standard deviation 6

Y=np.dot(X,beta)+error





Fit a linear relation between a set of variables ($X$) and a response variable ($y$).

Model: $y = X\beta + e$

Classical estimation: least squares

  \begin{align*}
\text{minimize}\quad & \frac{1}{n}||y-X\beta||_2^2
\end{align*}

The LS solution using the textbook formula is: $\beta_{ls}=(X^T X)^{-1}X^T y$

In [2]:
beta_ls_exact=np.dot(np.dot(np.linalg.inv(np.dot(np.transpose(X),X)),np.transpose(X)),Y)

print(beta_ls_exact)

[[ 8.66005715]
 [ 6.96292181]
 [-0.02783633]
 [ 6.07039611]
 [-1.06777975]
 [ 2.99381691]
 [-0.02835074]
 [-5.96335833]
 [-2.95418012]
 [-6.98044486]
 [-0.12237511]]


Compare with the true solution $\beta$

In [3]:
print(beta)

[[ 8]
 [ 7]
 [ 0]
 [ 6]
 [-1]
 [ 3]
 [ 0]
 [-6]
 [-3]
 [-7]
 [ 0]]
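
The explicit inverse in the textbook formula is fine for a problem of this size, but it is usually not how the solution is computed in practice. As a minimal sketch (the names `X_demo`, `y_demo` and `beta_demo` and the fixed seed are assumptions made for illustration, not part of the notebook), the same estimate can be obtained by solving the normal equations as a linear system or by calling NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
n, p = 200, 4

# Small synthetic problem with an intercept column (hypothetical data)
X_demo = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, p - 1))])
beta_demo = np.array([2.0, -1.0, 0.5, 3.0])
y_demo = X_demo @ beta_demo + rng.normal(0, 1, n)

# Textbook formula with an explicit inverse
beta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Normal equations solved as a linear system (no explicit inverse)
beta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Least-squares solver applied directly to (X, y)
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(beta_inv, beta_solve), np.allclose(beta_inv, beta_lstsq))
```

On a well-conditioned problem like this one the three results should agree to numerical precision; the solver-based variants tend to matter more when the columns of $X$ are nearly collinear.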


### The Stochastic Gradient

The loss function is $l(\beta)=\frac{1}{n}||y-X\beta||_2^2=\frac{1}{n}\sum_{i=1}^n(y_i-x_i'\beta)^2$, hence the gradient at $\beta_k$ is

$$
  g_k = \nabla l(\beta_k) = -\frac{2}{n}X'(y-X\beta_k)
$$

This gradient is expensive because every evaluation uses all the information in $(X,y)$.

The Stochastic Gradient uses only one random sample ($i$) at each iteration, hence $\hat{g}_k = -2 x_i(y_i-x_i'\beta_k)$, so that starting at an initial value $\beta_0$, we compute $\beta_{k+1}$ as

$$\beta_{k+1}=\beta_k-\alpha \hat{g}_k$$

where $\alpha$ is the learning rate.
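
A quick numerical check can make it clearer why a single-sample gradient is a reasonable stand-in for the full one: averaged over all samples, the per-sample gradients $-2x_i(y_i-x_i'\beta)$ recover the full gradient exactly, so a uniformly drawn sample gives an unbiased estimate. The sketch below uses small hypothetical arrays (`X_demo`, `y_demo`, `beta_k`) rather than the notebook's data:

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed: an assumption, for reproducibility only
n, p = 50, 3
X_demo = rng.normal(size=(n, p))   # hypothetical design matrix
y_demo = rng.normal(size=n)        # hypothetical response
beta_k = rng.normal(size=p)        # arbitrary current iterate

# Full gradient of (1/n) * ||y - X beta||^2
residuals = y_demo - X_demo @ beta_k
full_grad = -2.0 / n * X_demo.T @ residuals

# Per-sample gradients: row i holds -2 * x_i * (y_i - x_i' beta)
per_sample = -2.0 * X_demo * residuals[:, None]

# The mean of the per-sample gradients equals the full gradient
print(np.allclose(per_sample.mean(axis=0), full_grad))
```

Each individual row of `per_sample` can be far from `full_grad`; it is only their average that matches, which is why the stochastic iterates are noisy and a small learning rate is used.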


In [4]:
(n, p) = X.shape
# one epoch = n = 1000 iterations
niter = 200*n # equivalent to 200 epochs
k = 0
alpha = 0.001 # learning rate (hyper-parameter)
beta_stoch = np.zeros(p)

while (k < niter):
    
    i = random.choice(range(n)) # pick one sample at random
    stoch_grad = -2*X[i,]*(Y[i]-np.dot(beta_stoch,X[i,])) # stochastic gradient from sample i
    beta_stoch = beta_stoch-alpha*stoch_grad # gradient step
    # The two lines above are the most important lines in DL
    k +=1
    
print(beta_stoch)
print('iterations:',k)
print('error', np.linalg.norm(np.transpose(beta_ls_exact) - beta_stoch)/np.linalg.norm(beta_ls_exact)) 


[ 8.54237796  6.99622862 -0.01694278  5.46794307 -0.9557095   2.89029924
 -0.36252295 -6.23839327 -2.77163606 -6.86774362 -0.35740443]
iterations: 200000
error 0.05117128465167299


### The Mini-Batch SG
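- Uses a small random sample (a mini-batch of $B$ observations) at each iteration

The mini-batch gradient is $\hat{g}_k = -\frac{2}{B}X_B'(y_B-X_B\beta_k)$, where $(X_B, y_B)$ are the rows of $(X, y)$ in the batch.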

In [5]:
import random

### Gradient function redefined to use only the rows whose indexes are passed in `subset`
def least_sq_reg_der_stoc(beta_ls,X,Y,subset,B):
    Xsub = X[subset,:]                      # (B, p) rows of the mini-batch
    Ysub = np.asarray(Y)[subset].ravel()    # (B,) responses of the mini-batch
    residual = Ysub - Xsub @ beta_ls        # (B,) residuals y_B - X_B beta
    return -2 * Xsub.T @ residual / B       # (p,) averaged mini-batch gradient

(n,p)=X.shape

B = 20 # Batch size (hyper-parameter)
niter = 500*n/B # equivalent to 500 epochs
alpha=0.001 # learning rate (hyper-parameter)
beta_lsg=np.zeros(p) #initial value for beta

k=0

while (k <= niter):
    # Same algorithm as before; only the gradient now uses a random mini-batch of B samples
    subset = np.random.choice([x for x in range(0,n)],B)
    grad = least_sq_reg_der_stoc(beta_lsg, X, Y,subset,B)
    beta_lsg = beta_lsg - alpha * grad

    k +=1

print(beta_lsg)
print('iterations =',k)
print('error=',np.linalg.norm(np.transpose(beta_ls_exact)-beta_lsg)/np.linalg.norm(beta_ls_exact))


[ 6.94684906e+00  7.06260470e+00  6.42263303e-03  6.12142080e+00
 -1.07563110e+00  3.02509750e+00  6.96097543e-03 -5.91618612e+00
 -2.93315331e+00 -6.97238233e+00 -1.34539725e-01]
iterations = 25001
error= 0.10588688686205568


### Momentum
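
Momentum keeps a running direction $v$ that accumulates past mini-batch gradients. Read off from the code in the cell that follows, with $v_0=0$, momentum rate $\nu$ and learning rate $\alpha$, the update is:

$$v_{k+1} = \hat{g}_k + \nu v_k, \qquad \beta_{k+1}=\beta_k-\alpha v_{k+1}$$

Unrolling the recursion gives $v_{k+1}=\sum_{j=0}^{k}\nu^{j}\hat{g}_{k-j}$, so older gradients are discounted geometrically. If the gradient were constant, the effective step would approach $\alpha/(1-\nu)$, which is $2.5\alpha$ for $\nu=0.6$.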

In [6]:
(n,p)=X.shape

B = 10 # Batch size (hyper-parameter)
niter = 200*n/B # equivalent to 200 epochs
alpha=0.001 # learning rate (hyper-parameter)
beta_mom=np.zeros(p) #initial value for beta

v=np.zeros(p) # momentum vector

# Weight given to the previous direction (60% importance)
nu = 0.6 # momentum rate (hyper-parameter)

k=0

while (k <= niter):
    
    subset = np.random.choice([x for x in range(0,n)],B)
    grad = least_sq_reg_der_stoc(beta_mom, X, Y,subset,B)
    # Extra line for the momentum:
    # v carries information from past iterations (memory),
    # which improves the speed of the algorithm
    v = grad + nu*v  # new direction = current gradient + nu * previous direction, so consecutive directions stay close

    beta_mom = beta_mom - alpha * v

    k +=1

print(beta_mom)
print('iterations =',k)
print('error=',np.linalg.norm(np.transpose(beta_ls_exact)-beta_mom)/np.linalg.norm(beta_ls_exact))

[ 8.32867826e+00  7.02310657e+00  5.73771144e-02  6.00864264e+00
 -1.06925269e+00  3.03903524e+00  4.78202967e-03 -5.79081422e+00
 -3.06299864e+00 -7.02836938e+00 -1.48720795e-01]
iterations = 20001
error= 0.025569876382360912


# Exercise

Apply the same algorithms to a logistic regression problem, or to another machine learning tool that can be defined through a loss function.
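
One possible starting point for the logistic case is sketched below. It is an illustration under stated assumptions rather than a reference solution: the data, the coefficients in `beta_true`, the helper names `logistic_grad` and `logistic_loss`, and the hyper-parameter values are all hypothetical. Only the gradient changes with respect to the least-squares loop: for the average logistic loss it is $\frac{1}{B}X_B'(\sigma(X_B\beta)-y_B)$, with $\sigma$ the sigmoid function.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
n, p = 1000, 3

# Hypothetical binary-response data generated from a logistic model
X_log = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([0.5, 1.0, -2.0, 1.5])
prob = 1.0 / (1.0 + np.exp(-X_log @ beta_true))
y_log = rng.binomial(1, prob)

def logistic_loss(beta, X, y):
    """Average logistic loss (negative log-likelihood) over all rows."""
    z = X @ beta
    return np.mean(np.log1p(np.exp(z)) - y * z)

def logistic_grad(beta, X, y, subset):
    """Mini-batch gradient of the average logistic loss."""
    Xs, ys = X[subset], y[subset]
    ps = 1.0 / (1.0 + np.exp(-Xs @ beta))  # predicted probabilities
    return Xs.T @ (ps - ys) / len(subset)

B, alpha, n_epochs = 20, 0.1, 50  # hyper-parameters chosen for illustration
beta_hat = np.zeros(p + 1)
for _ in range(n_epochs * n // B):
    subset = rng.choice(n, B)  # mini-batch drawn with replacement
    beta_hat = beta_hat - alpha * logistic_grad(beta_hat, X_log, y_log, subset)

print(beta_hat)
print('loss at zero:', logistic_loss(np.zeros(p + 1), X_log, y_log))
print('loss at beta_hat:', logistic_loss(beta_hat, X_log, y_log))
```

Adding momentum follows the same pattern as in the least-squares case: keep a vector `v`, update it with the mini-batch gradient, and step along `v` instead of the raw gradient.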