In this week's lab you will implement the gradient descent algorithm from scratch. Understanding how it works requires some basic calculus and logical thinking. Gradient descent is used to train many machine learning models, including neural networks. In this lab you will build it for a linear regression problem, because that case is easy to understand and visualize.

### Linear Regression

To fit the regression line, we tune two parameters: the slope ($b_1$) and the intercept ($b_0$). Once the optimal parameters are found, we usually evaluate the result with the mean squared error (MSE). The smaller the MSE, the better the fit; in other words, we are trying to minimize it. For a refresher, see the previous week's lab [here](https://www.kaggle.com/redwankarimsony/simple-linear-regression-for-beginners).


### Gradient Descent
Minimizing a function is exactly what the gradient descent algorithm does. It starts from initial parameter values and adjusts them step by step until a local minimum is reached.

Let’s break down the process in steps and explain what is actually going on under the hood:
1. First, we choose a function to minimize; very often it is the mean squared error (MSE).
2. We identify the parameters, here the slope $b_1$ and the intercept $b_0$ of the regression function, and take the partial derivatives of the MSE with respect to them. This is the most crucial and hardest part. Each derivative tells us in which direction to adjust its parameter and by how much.
3. We update the parameters iteratively using these derivatives, gradually reducing the MSE. In this process we use an additional parameter, the **learning rate**, which sets the size of the step taken at each iteration. A smaller learning rate makes it less likely that the model jumps over the minimum of the MSE and helps it converge smoothly, at the cost of needing more iterations.
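
The effect of the learning rate in the last step can be seen on a much simpler problem than regression. The sketch below is only an illustration: the function $f(w) = (w - 3)^2$, the helper name `minimize_quadratic`, and the two learning rates are assumptions chosen for the example, not part of the lab.

```python
def minimize_quadratic(lr, steps=50, w=0.0):
    """Run plain gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2 with respect to w
        w = w - lr * grad       # step against the gradient, scaled by the learning rate
    return w


small = minimize_quadratic(lr=0.1)  # each step shrinks the distance to 3
large = minimize_quadratic(lr=1.1)  # each step overshoots and the distance to 3 grows

print(small, large)
```

With the small learning rate the iterate settles very close to 3, while the large one jumps over the minimum by a wider margin at every step and moves away from it.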


The formula for the mean squared error (MSE) is:
$$ MSE = \frac{1}{n}\sum\limits_{i=1}^n(y_{i} - \hat{y}_{i})^2 $$
where
$$ \hat{y}_{i} = b_0 + b_1x_i $$
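
To make the formula concrete, here is a minimal sketch that evaluates the MSE of a candidate line by hand. The helper name `mse_for_line` and the four toy data points are assumptions made for this illustration only; they are not the lab's dataset.

```python
import numpy as np


def mse_for_line(x, y, b0, b1):
    """Mean squared error of the line y_hat = b0 + b1 * x on the points (x, y)."""
    y_hat = b0 + b1 * x
    return np.mean((y - y_hat) ** 2)


# Toy points that lie exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

print(mse_for_line(x, y, b0=1.0, b1=2.0))  # the true line: every residual is 0
print(mse_for_line(x, y, b0=0.0, b1=2.0))  # intercept too low: every residual is 1
```

The second call is off by exactly 1 at every point, so the average of the squared residuals is 1.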

## Loading Libraries

In [8]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

In [9]:
# data = pd.read_csv('../input/auto-insurance-in-sweden/swedish_insurance.csv')
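# Raw string so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r'D:\Microsoft\OneDrive\OneDrive - iut-dhaka.edu\Projects\LabWorksML\lab1\swedish_insurance.csv')
# Sort by X so the fitted line is drawn from left to right;
# reset_index keeps the old row labels in a new 'index' column
data.sort_values(['X'], inplace = True)
data.reset_index(inplace = True)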

In [10]:
data.head()

|   | index | X | Y |
|---|-------|---|------|
| 0 | 30 | 0 | 0.0 |
| 1 | 15 | 2 | 6.6 |
| 2 | 49 | 3 | 39.9 |
| 3 | 23 | 3 | 13.2 |
| 4 | 18 | 3 | 4.4 |


In [11]:
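def gradient_descent(X, y, lr=0.0001, epoch=12):
    '''
    Gradient descent for simple linear regression with a single feature.

    Returns the final b1 (slope) and b0 (intercept), the per-epoch
    log of (b1, b0) pairs, and the per-epoch MSE.
    '''

    b1, b0 = 0.0, 0.0 # parameters
    log, mse = [], [] # lists to store learning process
    #### START YOUR CODE ####
    n = len(X)
    ym = np.array(y).reshape((n, 1))
    # Design matrix with columns [x, 1], so that Xm @ b = b1*x + b0
    Xm = np.array(X).reshape((n, 1))
    Xm = np.concatenate((Xm, np.ones((n, 1))), axis=1)

    b = np.array([b1, b0]).reshape((2, 1))

    for i in range(epoch):
        # Per-sample terms: -x_i * residual_i (column 0) and -residual_i (column 1)
        db = (-1 * Xm) * (ym - np.dot(Xm, b))
        # Average over the samples. The constant factor 2 of the MSE derivative
        # is left out, so each step equals a full-gradient step with learning rate lr / 2.
        db = np.sum(db, axis=0).reshape((2, 1)) / n

        b -= lr * db
        b1, b0 = b[0, 0], b[1, 0]
        #### YOUR CODE ENDS HERE ###
        log.append((b1, b0))
        mse.append(mean_squared_error(y, (b1*X + b0)))

    return b1, b0, log, mse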


b1, b0, log, mse = gradient_descent(data['X'], data['Y'] , epoch = 20)

In [21]:
b1, b0, log, mse = gradient_descent(data['X'], data['Y'] , epoch = 40)

In [22]:
(b0,b1,mse[-1])

(0.1633426848035021, 3.841750711943148, 1449.6078980268794)

In [23]:
mse

[11187.829822207941,
 7491.59212524454,
 5198.493787054008,
 3775.8792311913635,
 2893.2980028901707,
 2345.7442636979827,
 2006.0358802769233,
 1795.2712215196657,
 1664.5010582535226,
 1583.358121728882,
 1533.0030752103369,
 1501.7483147989751,
 1482.343049937738,
 1470.2890135600949,
 1462.7955573299344,
 1458.1314105108295,
 1455.2225222004556,
 1453.402569496842,
 1452.2581770106203,
 1451.5328936845233,
 1451.0676216463426,
 1450.7636592830972,
 1450.559773793198,
 1450.4179773879853,
 1450.3147030892371,
 1450.2353303441037,
 1450.170788809791,
 1450.115451399364,
 1450.0658271632246,
 1450.019750375029,
 1449.975877454385,
 1449.9333748662561,
 1449.8917254990681,
 1449.8506085449746,
 1449.8098249796237,
 1449.7692513310697,
 1449.7288109987942,
 1449.688456460516,
 1449.6481582335941,
 1449.6078980268794]

In [25]:
import plotly.graph_objects as go
fig = go.Figure()

(b1, b0) = log[-1]
y_hat = b0 + b1 * data['X']
fig.add_trace(go.Scatter(x=data['X'], y=data['Y'], name='train', mode='markers', marker_color='rgba(152, 0, 0, .8)'))
fig.add_trace(go.Scatter(x=data['X'], y=y_hat, name='prediction', mode='lines+markers', marker_color='rgba(0, 152, 0, .8)'))
# Plotly titles use <br> for line breaks; a plain '\n' is not rendered as a new line
fig.update_layout(title='Swedish Automobiles Data<br>(visual comparison for correctness)', title_x=0.5, xaxis_title="Number of Claims", yaxis_title="Payment in Claims")
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

In [27]:
from plotly.subplots import make_subplots

rows = 3
cols = 4
# One subplot per epoch for the first rows*cols epochs, each titled with that epoch's MSE
fig = make_subplots(rows=rows, cols=cols, subplot_titles=tuple(["Iter: {} MSE {:.2f}".format(idx+1, mse[idx] ) for idx in range(rows*cols)]))

In [28]:
idx = 0
for i in range(rows):
    for j in range(cols):
        # Draw the data and the line fitted after epoch idx+1 in subplot (i+1, j+1)
        (b1, b0) = log[idx]
        y_hat = b0 + b1*data['X']
        fig.add_trace(go.Scatter(x=data['X'], y=data['Y'], mode='markers', marker_color='rgba(152, 0, 0, .8)'), row=i+1, col=j+1)
        fig.add_trace(go.Scatter(x=data['X'], y=y_hat, mode='lines+markers', marker_color='rgba(0, 152, 0, .8)'), row=i+1, col=j+1)
        idx += 1
fig.show()

# <u>Challenges and Outcomes:</u>

The task was to code the gradient computation (back propagation) for linear regression. Given the number of epochs and the learning rate as hyperparameters of the function, we had to log $b_{0}$ and $b_{1}$, along with the MSE, at every epoch.

To compute the gradients, we used the following partial derivatives of the MSE, written here as $f(b_{0},b_{1})$:

$$
\frac{\partial}{\partial b_{1}} f(b_{0},b_{1}) = \frac{1}{n} \sum\limits_{i=1}^n (-2 x_{i}) \left(y_{i}-(b_{1}x_{i}+b_{0})\right)
$$

$$
\frac{\partial}{\partial b_{0}} f(b_{0},b_{1}) = \frac{1}{n} \sum\limits_{i=1}^n (-2) \left(y_{i}-(b_{1}x_{i}+b_{0})\right)
$$
Then, finally, we updated the parameters as follows, where $\eta$ is the learning rate:
$$
 b_1 := b_1 - \eta \cdot \frac{\partial}{\partial b_{1}} f(b_{0},b_{1})
$$
$$
 b_0 := b_0 - \eta \cdot \frac{\partial}{\partial b_{0}} f(b_{0},b_{1})
$$
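
One way to gain confidence in hand-derived partial derivatives like these is to compare them with a finite-difference approximation. The following is a minimal sketch under stated assumptions: the helper names (`mse`, `analytic_gradient`, `numeric_gradient`), the synthetic data, and the evaluation point are all made up for the illustration and are not part of the lab code.

```python
import numpy as np


def mse(b0, b1, x, y):
    """Mean squared error of the line b0 + b1 * x."""
    return np.mean((y - (b0 + b1 * x)) ** 2)


def analytic_gradient(b0, b1, x, y):
    """Partial derivatives of the MSE with respect to b0 and b1 (factor 2 included)."""
    residual = y - (b1 * x + b0)
    d_b0 = np.mean(-2.0 * residual)
    d_b1 = np.mean(-2.0 * x * residual)
    return d_b0, d_b1


def numeric_gradient(b0, b1, x, y, eps=1e-6):
    """Central finite differences of the MSE in each parameter."""
    d_b0 = (mse(b0 + eps, b1, x, y) - mse(b0 - eps, b1, x, y)) / (2 * eps)
    d_b1 = (mse(b0, b1 + eps, x, y) - mse(b0, b1 - eps, x, y)) / (2 * eps)
    return d_b0, d_b1


# Synthetic data (an assumption for this check): y = 4 + 3x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 4.0 + 3.0 * x + rng.normal(0, 1, size=50)

print(analytic_gradient(0.5, 1.5, x, y))
print(numeric_gradient(0.5, 1.5, x, y))
```

If the two printed pairs agree to several decimal places, the derivation and its implementation are very likely consistent with each other.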

In conclusion, $b_{0}$ and $b_{1}$ moved gradually toward their final values with each epoch of back propagation, and the MSE fell from about 11187.8 after the first epoch to about 1449.6 after the 40th.
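
A further sanity check, beyond the visual comparison, is to compare the parameters found by gradient descent with the closed-form least-squares fit. The sketch below does this on small synthetic data rather than on the lab's dataset; the function name `fit_line_gd`, the data, the learning rate, and the epoch count are assumptions chosen so that the loop converges fully.

```python
import numpy as np


def fit_line_gd(x, y, lr=0.05, epochs=5000):
    """Gradient descent on the MSE of y_hat = b0 + b1 * x, starting from zeros."""
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        residual = y - (b0 + b1 * x)  # computed once, so both updates use the same point
        b0 -= lr * np.mean(-2.0 * residual)
        b1 -= lr * np.mean(-2.0 * x * residual)
    return b0, b1


# Synthetic data (an assumption for this check): y = 1 + 2.5x plus small noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=100)
y = 1.0 + 2.5 * x + rng.normal(0, 0.1, size=100)

b0_gd, b1_gd = fit_line_gd(x, y)
b1_ls, b0_ls = np.polyfit(x, y, deg=1)  # polyfit returns the highest-degree coefficient first

print(b0_gd, b1_gd)
print(b0_ls, b1_ls)
```

When gradient descent has run long enough with a suitable learning rate, both lines of output should match closely; a visible gap usually means the loop needs more epochs or a different learning rate.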



In [32]:
mse_arr = []
l_rates = np.array(range(1,19,1))*0.0001
epo = 40
print(l_rates)
for l_rate in l_rates:
    # Keep only the MSE history; separate names avoid overwriting the earlier mse list
    _, _, _, mse_hist = gradient_descent(data['X'], data['Y'] , epoch = epo, lr = l_rate)
    mse_arr.append(mse_hist[-1])

print(mse_arr)

import plotly.graph_objects as go
fig = go.Figure()

fig.add_trace(go.Scatter(x=l_rates, y=mse_arr, name='MSE', mode='lines+markers', marker_color='rgba(152, 0, 0, .8)'))

fig.update_layout(title = f'Final MSE after {epo} epochs vs. learning rate',title_x=0.5, xaxis_title= "Learning Rate", yaxis_title="MSE")
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()

[0.0001 0.0002 0.0003 0.0004 0.0005 0.0006 0.0007 0.0008 0.0009 0.001
 0.0011 0.0012 0.0013 0.0014 0.0015 0.0016 0.0017 0.0018]
[1452.3898161191175, 1449.6078980268794, 1448.8052352177879, 1448.0058509708965, 1447.209652874392, 1446.4166283884363, 1445.6267650226505, 1444.8400503351884, 1444.056471932548, 1443.2760174693863, 1442.498674648334, 1441.724431219813, 1440.953274981851, 1440.1851937799013, 1439.4201755066588, 1438.6582081074564, 1437.899737233181, 1446.369482347725]
