In [11]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 120
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [12]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Gradient Descent Algorithm

Consider an objective function of two or more variables, e.g. $$\large L(\beta_1, \beta_2)$$

We focus on the case where $$L(\vec{\beta})$$ is convex in the vector $\vec{\beta}$.

In particular, we are interested in the situation where $L$ is the regularized sum of squared errors, such as:

$$L: = \displaystyle\sum_{i=1}^{n} (y_i-\sum_{j=1}^{p} x_{ij} \cdot \beta_j)^2 +P_{\alpha}(\beta)$$

For example consider the Elastic Net:

$$P_{\alpha}(\beta) = \alpha \cdot (\lambda\cdot \sum |\beta_j| + (1-\lambda ) \cdot \sum (\beta_j^2))$$
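
To make the roles of $\alpha$ and $\lambda$ concrete, here is a minimal sketch of the penalty above. The function name `elastic_net_penalty` and the demo vector are illustrative assumptions, not part of the class code.

```python
import numpy as np

def elastic_net_penalty(beta, alpha, lam):
    """alpha * (lam * sum|beta_j| + (1 - lam) * sum beta_j^2)."""
    beta = np.asarray(beta, dtype=float)
    l1 = np.sum(np.abs(beta))   # Lasso part
    l2 = np.sum(beta ** 2)      # Ridge part
    return alpha * (lam * l1 + (1.0 - lam) * l2)

beta_demo = np.array([1.0, -2.0, 0.5])  # hypothetical coefficients
lasso_part = elastic_net_penalty(beta_demo, alpha=0.1, lam=1.0)  # only the L1 term survives
ridge_part = elastic_net_penalty(beta_demo, alpha=0.1, lam=0.0)  # only the squared L2 term survives
mixed = elastic_net_penalty(beta_demo, alpha=0.1, lam=0.5)       # equal weight on both
```

Setting $\lambda=1$ recovers the Lasso penalty and $\lambda=0$ the Ridge penalty; values in between blend the two.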

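We can compute the partial derivatives with respect to $\beta_j$ (valid for $\beta_j \neq 0$; at $\beta_j = 0$ the term $\text{sign}(\beta_j)$ is a subgradient). The inner sum is written with index $k$ to avoid a clash with $j$:

$$\large \frac{\partial L}{\partial \beta_j}=-2\cdot\displaystyle\sum_{i=1}^{n} (y_i-\sum_{k=1}^{p} x_{ik} \cdot \beta_k)\cdot x_{ij} + \alpha\cdot\lambda\cdot\text{sign}(\beta_j) + \alpha\cdot (1-\lambda)\cdot 2\cdot\beta_j$$

In sklearn the mixing parameter is called `l1_ratio`, but sklearn scales the loss and the penalty terms differently, so its `alpha` and `l1_ratio` are not equal to our $\alpha$ and $\lambda$. For the mean-squared-error version of the objective implemented below, the conversion used in the `ElasticNet` comparison further down is `alpha` $=\alpha\cdot(1-\lambda/2)$ and `l1_ratio` $=\frac{\lambda}{2-\lambda}$.

We want to know how to update $\vec{\beta}$ so that we make the best progress in minimizing $L.$ We can think of updating $\vec{\beta}$ by going $t$ units in a direction $\vec{v}$. So we ask: which $\vec{v}$ makes the best progress?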

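We consider $g(t):= L(\vec{\beta} + t \cdot \vec{v})$, so by the chain rule the derivative of $g$ is

$\large g'(t)= \nabla L(\vec{\beta} + t \cdot \vec{v}) \cdot\vec{v}$. At $t=0$, among all unit vectors $\vec{v}$, the value $\nabla L \cdot\vec{v}$ is most negative when $\vec{v}$ points along $-\nabla L$, so for the most dramatic decrease of the objective we need $\vec{v}=-\nabla L$ (up to scaling).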
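
The claim about the steepest direction can be checked numerically. The sketch below assumes a plain least-squares objective $\|y - X\beta\|_2^2$ on synthetic data (all names are illustrative) and compares the slope $g'(0)$ along the normalized negative gradient with the slopes along many random unit directions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # synthetic design matrix
y = rng.normal(size=50)             # synthetic response
beta = np.array([1.0, -1.0, 2.0])   # current point

grad = -2.0 * X.T @ (y - X @ beta)        # gradient of ||y - X beta||^2
steepest = -grad / np.linalg.norm(grad)   # unit vector along -grad

# Directional derivative g'(0) = grad . v for 1000 random unit directions v
V = rng.normal(size=(1000, 3))
V /= np.linalg.norm(V, axis=1, keepdims=True)
slopes = V @ grad

best_random = slopes.min()          # most negative slope among the random directions
slope_steepest = grad @ steepest    # slope along -grad, equal to -||grad||
```

By the Cauchy-Schwarz inequality no unit direction can have a slope below $-\|\nabla L\|$, so `slope_steepest` is never larger than `best_random`.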

<font color='red' size=5pt>This works well if $L$ is convex in $\beta$; the only issue arises when the basin around the minimum of $L$ is very shallow.</font>

The main goal is to recover the ground-truth coefficient vector $\beta^*$ (in probability), where $$ \large y=X\cdot \beta^* +\text{noise}$$

We can write the noise as $$\large \sigma\cdot \epsilon$$

so that $$ \large y=X\cdot \beta^* + \sigma\epsilon$$

and then $$\large y - X\beta^* = \sigma\epsilon$$

Now think about taking the partial derivative of $\|y-X\beta\|_2$ with respect to $\beta_j$.

We get $$\large -\frac{(y-X\beta)\cdot h_j(X)}{\|y-X\beta\|_2}$$ where we take $h_j(X)$ to be the $j$-th column of $X$. At $\beta=\beta^*$ the factor $\sigma$ cancels between numerator and denominator, so this gradient does not depend on the noise level. This suggests that we should rather minimize $\|y-X\beta\|_2$ plus a regularization term.
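
The derivative of the unsquared norm can be verified with finite differences. This is a small sketch on synthetic data; `root_loss` and the other names are assumptions made for illustration, and $h_j(X)$ is taken to be the $j$-th column of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = rng.normal(size=40)
beta = rng.normal(size=4)

def root_loss(b):
    # ||y - X b||_2, the unsquared residual norm
    return np.linalg.norm(y - X @ b)

residual = y - X @ beta
# Analytic gradient: entry j is -(residual . X[:, j]) / ||residual||
analytic = -(X.T @ residual) / np.linalg.norm(residual)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (root_loss(beta + eps * e) - root_loss(beta - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
max_gap = np.max(np.abs(analytic - numeric))  # should be tiny
```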

In [13]:
def gradient_mse(beta, x, y, alpha,l): # returns the gradient and the value of the elastic-net objective
    n = len(y) # the number of observations
    y_hat = x.dot(beta).flatten() # predictions (no intercept term)
    error = (y - y_hat) # residuals
    mse = (1.0 /n) * np.sum(np.square(error)) + alpha*(l*np.sum(np.abs(beta))+(1-l)*np.sum(beta**2)) # mean squared error plus the elastic-net penalty (l is the weight of the L1 part)
    gradient = -(2.0 /n) * error.dot(x) + 2*(1-l)*alpha*beta+alpha*l*np.sign(beta) # gradient of the MSE plus the gradient of the penalty (np.sign is a subgradient at 0)
    return gradient, mse
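
Before trusting a hand-derived gradient, it is worth comparing it against finite differences. The sketch below re-states the same objective and gradient under the hypothetical names `objective` and `gradient` (so that the block is self-contained) and checks them on synthetic data at a point with no zero coefficients, where the absolute value is differentiable.

```python
import numpy as np

def objective(beta, x, y, alpha, lam):
    # mean squared error plus the elastic-net penalty
    err = y - x @ beta
    return np.mean(err ** 2) + alpha * (lam * np.sum(np.abs(beta)) + (1 - lam) * np.sum(beta ** 2))

def gradient(beta, x, y, alpha, lam):
    # same formula as in gradient_mse above
    err = y - x @ beta
    return -(2.0 / len(y)) * (x.T @ err) + alpha * lam * np.sign(beta) + 2 * alpha * (1 - lam) * beta

rng = np.random.default_rng(2)
x_demo = rng.normal(size=(30, 3))
y_demo = rng.normal(size=30)
beta_demo = np.array([0.7, -1.3, 2.1])  # no zero entries

eps = 1e-6
numeric = np.array([
    (objective(beta_demo + eps * e, x_demo, y_demo, 0.01, 0.5)
     - objective(beta_demo - eps * e, x_demo, y_demo, 0.01, 0.5)) / (2 * eps)
    for e in np.eye(3)
])
max_gap = np.max(np.abs(numeric - gradient(beta_demo, x_demo, y_demo, 0.01, 0.5)))
```

A `max_gap` close to zero indicates that the analytic gradient matches the objective it is supposed to differentiate.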

In [14]:
data = pd.read_csv('Data/Concrete_Data.csv')
data.head()

| | Cement (component 1)(kg in a m^3 mixture) | Blast Furnace Slag (component 2)(kg in a m^3 mixture) | Fly Ash (component 3)(kg in a m^3 mixture) | Water (component 4)(kg in a m^3 mixture) | Superplasticizer (component 5)(kg in a m^3 mixture) | Coarse Aggregate (component 6)(kg in a m^3 mixture) | Fine Aggregate (component 7)(kg in a m^3 mixture) | Age (day) | Concrete compressive strength(MPa, megapascals) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.3 |


In [15]:
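y = data[data.columns[-1]] # target: the last column (concrete compressive strength)
x = data[data.columns[0:7]] # features: the first seven columns (the mixture components; Age is not included)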
 
scaler = StandardScaler()
xscaled = scaler.fit_transform(x)
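
As a reminder of what `StandardScaler` does, the sketch below standardizes a small synthetic matrix (the data and names are made up for illustration) and compares the result with the manual computation: subtract each column's mean and divide by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Two synthetic features on very different scales
x_demo = rng.normal(loc=[300.0, 10.0], scale=[100.0, 2.0], size=(200, 2))

scaled = StandardScaler().fit_transform(x_demo)

# Manual equivalent; numpy's std defaults to ddof=0, which matches StandardScaler
manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)
```

Putting all features on the same scale matters for gradient descent because a single learning rate is applied to every coefficient.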

In [16]:
xscaled

array([[ 2.47791487, -0.85688789, -0.84714393, ..., -0.62044832,
         0.86315424, -1.21767004],
       [ 2.47791487, -0.85688789, -0.84714393, ..., -0.62044832,
         1.05616419, -1.21767004],
       [ 0.49142531,  0.79552649, -0.84714393, ..., -1.03914281,
        -0.52651741, -2.24091709],
       ...,
       [-1.27008832,  0.75957923,  0.85063487, ..., -0.01752826,
        -1.03606368,  0.0801067 ],
       [-1.16860982,  1.30806485, -0.84714393, ...,  0.85335628,
         0.21464081,  0.19116644],
       [-0.19403325,  0.30849909,  0.3769452 , ...,  0.40116623,
        -1.39506219, -0.15074782]])

In [17]:
beta = np.random.uniform(-4000,4000,size=x.shape[1]) # a very imprecise guess to initialize the coefficients
lr = .05 # the learning rate
tolerance = 1e-8 # this would be our stopping criteria
alpha = 0.01
 
old_w = []
mse = []

beta

array([-2265.54256373,  3580.98997707,  -511.40669934,   379.83896177,
       -3453.80288249, -3076.50317533, -1586.56032983])

In [18]:
# Perform Gradient Descent
iterations = 1
for i in range(5000):
    gradient, mse_temp = gradient_mse(beta, xscaled, y, alpha,0.5)
    new_beta = beta - lr * gradient # here we update the coefficients in the direction of the negative gradient
 
    # Print error every 10 iterations
    if iterations % 10 == 0:
        print("Iteration: %d - Mean Squarred Error: %.4f" % (iterations, mse_temp))
        old_w.append(new_beta)
        mse.append(mse_temp)
 
    # Stopping Condition
    if np.sum(abs(new_beta - beta)) < tolerance:
        print('The Gradient Descent Algorithm has converged')
        break
 
    iterations += 1
    beta = new_beta
 
print('beta =', beta)

Iteration: 10 - Mean Squarred Error: 3472394.4412
Iteration: 20 - Mean Squarred Error: 996089.8478
Iteration: 30 - Mean Squarred Error: 634165.3602
Iteration: 40 - Mean Squarred Error: 446856.2151
Iteration: 50 - Mean Squarred Error: 320750.4728
Iteration: 60 - Mean Squarred Error: 233471.0378
Iteration: 70 - Mean Squarred Error: 172700.1198
Iteration: 80 - Mean Squarred Error: 130169.6130
Iteration: 90 - Mean Squarred Error: 100214.3080
Iteration: 100 - Mean Squarred Error: 78942.8880
Iteration: 110 - Mean Squarred Error: 63680.7709
Iteration: 120 - Mean Squarred Error: 52588.7121
Iteration: 130 - Mean Squarred Error: 44401.0951
Iteration: 140 - Mean Squarred Error: 38246.2799
Iteration: 150 - Mean Squarred Error: 33523.2676
Iteration: 160 - Mean Squarred Error: 29817.0230
Iteration: 170 - Mean Squarred Error: 26840.3356
Iteration: 180 - Mean Squarred Error: 24393.8976
Iteration: 190 - Mean Squarred Error: 22338.8890
Iteration: 200 - Mean Squarred Error: 20578.1487
Iteration: 210 - Me
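
One way to sanity-check a fixed-learning-rate loop like the one above is to run it on a problem whose minimizer is known in closed form. The sketch below assumes a pure Ridge penalty (the mixing weight set to 0) and synthetic data; every name in it is illustrative. In that case the minimizer solves $(X^TX/n + \alpha I)\beta = X^Ty/n$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)
alpha = 0.01

def ridge_gradient(beta):
    # gradient of (1/n) * ||y - X beta||^2 + alpha * ||beta||^2
    return -(2.0 / n) * X.T @ (y - X @ beta) + 2.0 * alpha * beta

beta = np.zeros(p)
lr = 0.05
for _ in range(5000):
    step = lr * ridge_gradient(beta)
    beta = beta - step                 # move against the gradient
    if np.sum(np.abs(step)) < 1e-12:   # same kind of stopping rule as above
        break

# Closed-form minimizer from setting the gradient to zero
closed_form = np.linalg.solve(X.T @ X / n + alpha * np.eye(p), X.T @ y / n)
```

If the loop is implemented correctly, `beta` and `closed_form` agree to many decimal places.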

In [33]:
#beta = np.array([ 3967.03094142,  2248.80620345, -2070.12405066,  1780.21380294, 971.48969182, -2181.90887589, -3261.74361629])
beta = np.zeros(xscaled.shape[1])

lr = 0.05 # the learning rate
tolerance = 1e-10 # this would be our stopping criteria
alpha = 0.01
 
old_w = []
mse = []

In [34]:
# here we implement an idea of a theoretical learning rate (Borwein 2009)
# Perform Gradient Descent with Borwein's learning rate: lr is recomputed at every iteration
# from the change in beta and the change in the gradient
# Here you should notice how much faster the algorithm converges
iterations = 1
for i in range(500):
    gradient, mse_temp = gradient_mse(beta, xscaled, y, alpha,0.5)
    new_beta = beta - lr * gradient # here we update the coefficients in the direction of the negative gradient
 
    # Print error every 10 iterations
    if iterations % 10 == 0:
        print("Iteration: %d - Mean Squarred Error: %.4f" % (iterations, mse_temp))
        old_w.append(new_beta)
        mse.append(mse_temp)
 
    # Stopping Condition
    if np.sum(abs(new_beta - beta)) < tolerance:
        print('The Gradient Descent Algorithm has converged in : '+str(i+1)+' iterations')
        break
    delta_b = new_beta - beta
    new_gradient, new_mse = gradient_mse(new_beta,xscaled,y,alpha,0.5)
    delta_grad = new_gradient - gradient
    lr = np.abs(np.dot(delta_b,delta_grad))/np.linalg.norm(delta_grad)**2
    iterations += 1
    beta = new_beta
 
print('beta =', beta)

Iteration: 10 - Mean Squarred Error: 1438.5920
Iteration: 20 - Mean Squarred Error: 1437.9555
Iteration: 30 - Mean Squarred Error: 1437.6949
Iteration: 40 - Mean Squarred Error: 1437.6940
Iteration: 50 - Mean Squarred Error: 1437.6940
Iteration: 60 - Mean Squarred Error: 1437.6940
Iteration: 70 - Mean Squarred Error: 1437.6940
Iteration: 80 - Mean Squarred Error: 1437.6940
Iteration: 90 - Mean Squarred Error: 1437.6940
The Gradient Descent Algorithm has converged in : 94 iterations
beta = [10.7150067   6.27858256  3.07142242 -2.5938316   2.14331086  0.23933734
 -0.47963964]


previously determined coefficients: beta = [10.71499348  6.27856966  3.07141116 -2.59384323  2.14330929  0.23932698
 -0.47965246]
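
The adaptive learning rate above has the form of a Barzilai-Borwein step size, $|\Delta\beta\cdot\Delta g| / \|\Delta g\|^2$. The sketch below compares it with a fixed learning rate on a synthetic least-squares problem without penalty; the helper `run`, the data, and the settings are assumptions chosen for illustration, not the class setup.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 4
X = rng.normal(size=(n, p)) * np.array([1.0, 2.0, 0.5, 1.5])  # unequal column scales
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=n)

def grad(beta):
    # gradient of the mean squared error (1/n) * ||y - X beta||^2
    return -(2.0 / n) * X.T @ (y - X @ beta)

def run(use_bb, lr=0.01, max_iter=20000, tol=1e-10):
    """Gradient descent; returns the final beta and the number of iterations used."""
    beta = np.zeros(p)
    g = grad(beta)
    for k in range(1, max_iter + 1):
        new_beta = beta - lr * g
        if np.sum(np.abs(new_beta - beta)) < tol:
            return new_beta, k
        new_g = grad(new_beta)
        if use_bb:
            s, d = new_beta - beta, new_g - g   # change in beta, change in gradient
            lr = abs(s @ d) / (d @ d)           # adaptive step size for the next iteration
        beta, g = new_beta, new_g
    return beta, max_iter

beta_fixed, iters_fixed = run(use_bb=False)
beta_bb, iters_bb = run(use_bb=True)
```

Both runs reach essentially the same coefficients, but the adaptive version needs far fewer iterations because the step size adapts to the curvature along the most recent step.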

### Attention: sklearn uses a different parameterization of alpha and l1_ratio

<font color='red'>Below I worked out the conversion for you, so you can compare sklearn's coefficients with what we obtained in class via gradient descent</font>

In [22]:
from sklearn.linear_model import ElasticNet, Ridge, Lasso, LinearRegression
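l = 0.5 # the weight of the L1 part, as used in gradient_mse above
# convert our (alpha, l) to sklearn's parameterization: alpha*(1-l/2) and l/(2-l)
model= ElasticNet(alpha=alpha*(1-l/2),l1_ratio=alpha*l/(2*alpha-alpha*l),fit_intercept=False)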

In [23]:
model.fit(xscaled,y)

ElasticNet(alpha=0.0075, fit_intercept=False, l1_ratio=0.33333333333333337)

In [24]:
model.coef_

array([10.71449477,  6.27809228,  3.07100032, -2.59426765,  2.14325534,
        0.23895578, -0.48010823])
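
The conversion used above can be checked directly. The scikit-learn documentation states the `ElasticNet` objective as $\frac{1}{2n}\|y-Xw\|_2^2 + \text{alpha}\cdot\text{l1\_ratio}\cdot\|w\|_1 + 0.5\cdot\text{alpha}\cdot(1-\text{l1\_ratio})\cdot\|w\|_2^2$. Our objective should therefore be exactly twice sklearn's once the parameters are converted. In the sketch below `to_sklearn_params` is a hypothetical helper and the data are synthetic.

```python
import numpy as np

def to_sklearn_params(alpha, lam):
    """Hypothetical helper: map our (alpha, lam) to sklearn's (alpha, l1_ratio)."""
    return alpha * (1.0 - lam / 2.0), lam / (2.0 - lam)

def our_objective(beta, X, y, alpha, lam):
    # mean squared error plus our elastic-net penalty
    err = y - X @ beta
    return np.mean(err ** 2) + alpha * (lam * np.sum(np.abs(beta)) + (1 - lam) * np.sum(beta ** 2))

def sklearn_objective(beta, X, y, alpha_sk, l1_ratio):
    # objective in the form given by the scikit-learn ElasticNet documentation
    err = y - X @ beta
    return (np.sum(err ** 2) / (2 * len(y))
            + alpha_sk * l1_ratio * np.sum(np.abs(beta))
            + 0.5 * alpha_sk * (1 - l1_ratio) * np.sum(beta ** 2))

rng = np.random.default_rng(6)
X_demo = rng.normal(size=(25, 3))
y_demo = rng.normal(size=25)
beta_demo = rng.normal(size=3)

alpha_sk, l1_ratio = to_sklearn_params(0.01, 0.5)
ratio = (our_objective(beta_demo, X_demo, y_demo, 0.01, 0.5)
         / sklearn_objective(beta_demo, X_demo, y_demo, alpha_sk, l1_ratio))
```

Since the two objectives differ only by a constant factor, they share the same minimizer, which is why the sklearn coefficients agree with the gradient descent result up to solver tolerance.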

In [35]:
model = Lasso(alpha=1, fit_intercept=False)
model.fit(xscaled,y)
model.coef_

array([ 7.58251759,  3.17593773,  0.        , -1.69984606,  3.3160567 ,
       -0.        , -0.71392935])

### Good Reads
How to write your own sklearn-compatible estimators (objects that support `.fit` and `.predict`), using soft-thresholding with Lasso regression as the example:

https://www.kaggle.com/residentmario/soft-thresholding-with-lasso-regression

https://eeweb.engineering.nyu.edu/iselesni/lecture_notes/SoftThresholding.pdf
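
For orientation before reading those notes, here is a minimal sketch of the soft-thresholding operator they discuss, $S(z, t) = \text{sign}(z)\cdot\max(|z| - t, 0)$. The function name and the demo values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t; entries with |z| <= t become zero."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# 3.0 and -2.5 shrink to 2.5 and -2.0; the two small entries are set to zero
shrunk = soft_threshold([3.0, -0.4, 0.2, -2.5], t=0.5)
```

This operator is what lets Lasso solvers produce coefficients that are exactly zero, as in the `Lasso` output above, whereas a plain gradient step with `np.sign` generally only gets close to zero.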