# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: Calculus for Linear Regression

## Learning Objectives

At the end of the mini-project, you will be able to:

* perform linear regression using different optimization algorithms: full-batch gradient descent, RMSProp, Adam, and Momentum

## Information

### Linear Regression

Linear regression assumes a linear or straight line relationship between the input variables (X) and the single output variable (y). More specifically, that output (y) can be calculated from a linear combination of the input variables (X). When there is a single input variable, the method is referred to as a simple linear regression; for more than one, the process is called multiple linear regression.

Here we restrict ourselves to two dimensions, i.e. the Cartesian plane. We build up gradually from the ground up, starting with a line of the form $y = mx$ and then moving to the full $y = mx + c$ regression.

## Grading = 10 Points

Marks:

Exercise 1, 2, 5: Total marks = 0.5 x 3 = 1.5

Exercise 3, 4, 6, 7: Total marks = 1 x 4 = 4

Exercise 8, 9, 10: Total marks = 1.5 x 3 = 4.5

#### Import required packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

### Dataset

We will use the Swedish insurance dataset to demonstrate simple linear regression. The dataset is called the “Auto Insurance in Sweden” dataset and involves predicting the **total payment for all the claims in thousands of Swedish Kronor (y)** given the total **number of claims (x)**.

**Exercise 1: Read the swedish_insurance.csv dataset and visualize total payment (y) vs number of claims (x).**

**Hint:** pd.read_csv()

In [None]:
#@title Download Dataset
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/swedish_insurance.csv

In [None]:
# Read dataset

# YOUR CODE HERE

In [None]:
# Visualize y vs x

# YOUR CODE HERE

### Simplified Scenario of $y = mx$

In this case, we fit a line that passes through the origin to the data. Let's develop the loss function and see how it behaves.

$$y = mx \rightarrow h_{\theta}(x) = \theta x$$

$$J(\theta) = \frac{1}{2n} \sum_1^n (h_{\theta}(x_i) - y_i)^2$$

Note that we divide the sum of squared errors by 2n, where n is the number of data points, so the loss can be read as a mean value: precisely, it is half of the mean squared error. The division by 2 is a convenience: it cancels the factor of 2 produced by differentiating the square, which gives a simpler derivative of the loss function.

If we plot the loss function for varying θ values, we get a parabola with a single minimum, as the following exercise shows.
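As a minimal sketch of the formula above, the loss can be evaluated on a tiny toy dataset. The function name `half_mse_loss` and the arrays `x_toy` and `y_toy` are illustrative assumptions, not part of the project files or the insurance data.

```python
import numpy as np

def half_mse_loss(theta, x, y):
    """Half mean squared error J(theta) for the model h(x) = theta * x."""
    residuals = theta * x - y
    return np.sum(residuals ** 2) / (2 * len(x))

# Toy data lying exactly on y = 2x (assumed for illustration only)
x_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([2.0, 4.0, 6.0])

loss_at_true_slope = half_mse_loss(2.0, x_toy, y_toy)   # all residuals are zero
loss_at_wrong_slope = half_mse_loss(1.0, x_toy, y_toy)  # residuals are -1, -2, -3
```

For this toy data the loss is zero at θ = 2 and grows as θ moves away from it, which is the behaviour the loss plot is meant to reveal.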

**Exercise 2: Create and plot the loss function for varying θ values.**

* create a function to compute mean squared error loss
$$J(\theta) = \frac{1}{2n} \sum_1^n (\theta x_i - y_i)^2$$
* compute loss for different theta values (50 points between 0 and 10)
* plot the computed loss corresponding to theta values

**Hint:** np.linspace()

In [None]:
# YOUR CODE HERE

Now we have an approximate value for the minimum of the loss function. Let us see how we can arrive at this minimum computationally.

### The Gradient Descent

Gradient descent is an iterative optimization algorithm that is widely used when training machine learning models. Starting from an initial guess, it repeatedly adjusts the parameters in the direction of the negative gradient to reduce the value of a given function. For a convex function, such as the squared-error loss used here, any local minimum it reaches is also the global minimum.

The learning rate represents the step size and is denoted as $\alpha$. A smaller $\alpha$ means smaller steps and slower convergence, while a larger $\alpha$ can overshoot the minimum and may even cause the iterates to diverge.

So we can formulate the change of θ as follows:

$$\theta = \theta - \alpha \frac{∂}{∂\theta}J(\theta)$$

where, 
$$\frac{∂}{∂\theta}J(\theta) = \frac{∂}{∂\theta} \frac{1}{2n} \sum_1^n (h_{\theta}(x_i) - y_i)^2$$

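$$\frac{∂}{∂\theta}J(\theta) = \frac{∂}{∂\theta} \frac{1}{2n} \sum_1^n (\theta x_i - y_i)^2$$

Applying the chain rule to each squared term, the factor of 2 cancels the 1/2:

$$\frac{∂}{∂\theta}J(\theta) = \frac{1}{2n} \sum_1^n 2(\theta x_i - y_i)x_i$$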

$$\frac{∂}{∂\theta}J(\theta) = \frac{1}{n} \sum_1^n (\theta x_i - y_i)x_i$$
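The update rule and the derivative above can be sketched in a few lines. This is one possible arrangement on toy data, not the reference solution; the names `gradient`, `gradient_descent`, `theta0`, `alpha`, and `n_iters` are assumptions made for the example.

```python
import numpy as np

def gradient(theta, x, y):
    """dJ/dtheta = (1/n) * sum((theta * x_i - y_i) * x_i)."""
    return np.mean((theta * x - y) * x)

def gradient_descent(x, y, theta0=0.0, alpha=0.1, n_iters=100):
    """Apply theta <- theta - alpha * dJ/dtheta and record every theta."""
    theta = theta0
    history = [theta]
    for _ in range(n_iters):
        theta = theta - alpha * gradient(theta, x, y)
        history.append(theta)
    return theta, history

# Toy data lying exactly on y = 2x (assumed for illustration only)
x_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([2.0, 4.0, 6.0])

theta_hat, theta_history = gradient_descent(x_toy, y_toy)

# A step size that is too large makes the iterates move away from the minimum
theta_diverged, _ = gradient_descent(x_toy, y_toy, alpha=0.5, n_iters=20)
```

On this toy data the gradient equals (14/3)(θ − 2), so the iteration contracts toward θ = 2 only while α stays below roughly 0.43; the second call illustrates what happens beyond that threshold. Suitable values for the insurance data will differ because its inputs are on a different scale.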


**Exercise 3: Create a function to evaluate the derivative of the loss function with respect to θ then perform gradient descent.**

* create a gradient function
* perform gradient descent for n number of iterations
* visualize the updates of theta
* tune learning rate and number of iterations for faster convergence

Hint: [Calculus behind Linear Regression](https://towardsdatascience.com/calculus-behind-linear-regression-1396cfd0b4a9).

In [None]:
# Create a gradient function

# YOUR CODE HERE

Perform gradient descent iteratively until we reach the minimum or the maximum number of iterations.

In [None]:
# YOUR CODE HERE

Visualize the gradient traversal:

In [None]:
# Visualize the updates of theta

# YOUR CODE HERE

**Exercise 4: Visualize the final regression line along with the intermediate lines.**

* plot a regression line for every updated theta
* plot final regression line using the last theta value

In [None]:
# Visualize updates of Full Batch Gradient descent

# YOUR CODE HERE

### Complete Scenario $y = mx + c$

This is the extension of the previous case and we can model the estimated equation and the loss function as follows:

$$y = mx + c \rightarrow h_{\theta}(x) = \theta_1 + \theta_2 x$$

$$J(\theta_1, \theta_2) = \frac{1}{2n} \sum_1^n (h_{\theta}(x_i) - y_i)^2$$



Since the loss function now depends on two variables, it forms a surface that can be shown in a 3D plot, with the third axis corresponding to the loss value.
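One way to obtain the values for such a surface is to evaluate the loss on a grid of (θ1, θ2) pairs. The sketch below uses toy data and assumed grid ranges; the variable names are illustrative, and the plotting call is left out so the block stays independent of any display backend.

```python
import numpy as np

# Toy data lying exactly on y = 1 + 2x (assumed for illustration only)
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy

theta1_values = np.linspace(-2.0, 4.0, 7)   # candidate intercepts
theta2_values = np.linspace(-1.0, 5.0, 7)   # candidate slopes
T1, T2 = np.meshgrid(theta1_values, theta2_values)

# Broadcasting: residuals has shape (7, 7, 4), one residual per grid point and data point
residuals = T1[..., np.newaxis] + T2[..., np.newaxis] * x_toy - y_toy
loss_grid = np.mean(residuals ** 2, axis=-1) / 2

# Grid point with the smallest loss
min_index = np.unravel_index(np.argmin(loss_grid), loss_grid.shape)
best_theta1, best_theta2 = T1[min_index], T2[min_index]
```

The arrays `T1`, `T2`, and `loss_grid` have matching shapes, which is the form that 3D plotting routines for surfaces and contours generally expect.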

**Exercise 5: Create and plot the loss function for varying θ1 and θ2 values.**

**Hint:** np.arange(), np.meshgrid(), plt.axes(projection='3d'), ax.contour3D()

In [None]:
# Create loss function

# YOUR CODE HERE

In [None]:
# Visualize loss function

# YOUR CODE HERE

The gradient descent can be similarly derived to reach the following set of equations:

$$\theta_1 = \theta_1 - \alpha \frac{∂}{∂\theta_1}J(\theta_1,\theta_2) $$

$$\theta_2 = \theta_2 - \alpha \frac{∂}{∂\theta_2}J(\theta_1,\theta_2) $$

$$\frac{∂}{∂\theta_1}J(\theta_1, \theta_2) = \frac{1}{n} \sum_1^n (\theta_1 + \theta_2 x_i - y_i)$$

$$\frac{∂}{∂\theta_2}J(\theta_1, \theta_2) = \frac{1}{n} \sum_1^n (\theta_1 + \theta_2 x_i - y_i)x_i$$

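Note that the parameters must be updated simultaneously: compute both gradients from the current values of θ1 and θ2 before changing either, so that the update of one θ value does not affect the gradient used for the other.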
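The simultaneous update can be sketched as follows, again on toy data rather than the insurance dataset. The function names `gradients` and `gradient_descent_2d` and the default hyperparameters are assumptions chosen for this example.

```python
import numpy as np

def gradients(theta1, theta2, x, y):
    """Partial derivatives of J with respect to theta1 (intercept) and theta2 (slope)."""
    residuals = theta1 + theta2 * x - y
    return np.mean(residuals), np.mean(residuals * x)

def gradient_descent_2d(x, y, alpha=0.1, n_iters=2000):
    theta1, theta2 = 0.0, 0.0
    history = [(theta1, theta2)]
    for _ in range(n_iters):
        # Both gradients are computed from the current thetas before either changes
        grad1, grad2 = gradients(theta1, theta2, x, y)
        theta1, theta2 = theta1 - alpha * grad1, theta2 - alpha * grad2
        history.append((theta1, theta2))
    return theta1, theta2, history

# Toy data lying exactly on y = 1 + 2x (assumed for illustration only)
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy

theta1_hat, theta2_hat, theta_path = gradient_descent_2d(x_toy, y_toy)
```

Updating `theta1` first and then reusing the new value when computing the gradient for `theta2` would be a different algorithm; the tuple assignment avoids that.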

**Exercise 6: Create a function to evaluate the derivative of the loss function with respect to θ1 and θ2 then perform gradient descent.**

* create one gradient function each for theta1 and theta2
* perform gradient descent for n number of iterations
* visualize the updates of theta1 and theta2
* tune learning rate and number of iterations for faster convergence

In [None]:
# Create gradient functions w.r.t theta1 and theta2

# YOUR CODE HERE

Run gradient descent iteratively until we reach the minimum or the maximum number of iterations.

In [None]:
# Perform gradient descent

# YOUR CODE HERE

Visualize the path of the descent over the loss surface:

In [None]:
# Visualize the updates of theta1, theta2

# YOUR CODE HERE

**Exercise 7: Visualize the final regression line along with the intermediate lines.**

* plot a regression line for every updated theta1 and theta2
* plot final regression line using the last theta1 and theta2 values

In [None]:
# Visualize updates of Gradient descent

# YOUR CODE HERE

In [None]:
# Parameter values

# YOUR CODE HERE

### RMSProp 

Use RMSProp optimization to find the minimum of the cost function.

The weight update is given by:

$$E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta)(\frac{∂c}{∂w})^2$$

$$w_t = w_{t-1} - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}(\frac{∂c}{∂w})$$

where $E[g^2]$ is the moving average of squared gradients, $∂c/∂w$ is the gradient of the cost function with respect to the weight, $\eta$ is the learning rate and $\beta$ is the moving average parameter (default value 0.9). The $\epsilon$ is a small scalar (e.g. $10^{-8}$) used to prevent division by 0.
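A minimal sketch of these two update equations, applied to the two-parameter linear model on toy data, might look like this. The function name `rmsprop`, the helper `loss`, and the chosen `eta` and `n_iters` are assumptions for illustration; they are not tuned for the insurance data.

```python
import numpy as np

def loss(theta, x, y):
    """Half mean squared error for h(x) = theta[0] + theta[1] * x."""
    return np.mean((theta[0] + theta[1] * x - y) ** 2) / 2

def rmsprop(x, y, eta=0.01, beta=0.9, eps=1e-8, n_iters=2000):
    theta = np.zeros(2)          # [theta1 (intercept), theta2 (slope)]
    avg_sq_grad = np.zeros(2)    # E[g^2], one entry per parameter
    history = [theta.copy()]
    for _ in range(n_iters):
        residuals = theta[0] + theta[1] * x - y
        grad = np.array([np.mean(residuals), np.mean(residuals * x)])
        # Moving average of squared gradients
        avg_sq_grad = beta * avg_sq_grad + (1 - beta) * grad ** 2
        # Each parameter gets its own effective step size
        theta = theta - eta * grad / np.sqrt(avg_sq_grad + eps)
        history.append(theta.copy())
    return theta, np.array(history)

# Toy data lying exactly on y = 1 + 2x (assumed for illustration only)
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy

theta_rms, path_rms = rmsprop(x_toy, y_toy)
initial_loss = loss(np.zeros(2), x_toy, y_toy)
final_loss = loss(theta_rms, x_toy, y_toy)
```

Because the gradient is divided by its own running magnitude, the steps stay on the order of `eta` even close to the minimum, so with a constant learning rate the parameters tend to hover near the optimum rather than settle exactly on it.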

**Exercise 8: Perform the optimization steps using RMSProp optimization.**

* perform rmsprop optimization for n number of iterations
* visualize the updates of theta1 and theta2
* tune learning rate and number of iterations for faster convergence
* plot a regression line for every updated theta1 and theta2
* plot final regression line using the last theta1 and theta2 values

In [None]:
# RMSProp optimization

# YOUR CODE HERE

In [None]:
# Visualize the updates of theta1, theta2

# YOUR CODE HERE

In [None]:
# Visualize updates of RMSProp

# YOUR CODE HERE

In [None]:
# Parameter values

# YOUR CODE HERE

### Adam

Use Adam optimization to find the minimum of the cost function.

The weight update is given by:

$$m_w^{(t+1)} \leftarrow \beta_1m_w^{(t)} + (1-\beta_1)∇_wL^{(t)},$$

$$v_w^{(t+1)} \leftarrow \beta_2v_w^{(t)} + (1-\beta_2)(∇_wL^{(t)})^2,$$

$$\hat{m}_w = \frac{m_w^{(t+1)}}{1-\beta_1^{t+1}},$$  

$$\hat{v}_w = \frac{v_w^{(t+1)}}{1-\beta_2^{t+1}},$$

$$w^{(t+1)} \leftarrow w^{(t)} - \eta\frac{\hat{m}_w}{\sqrt{\hat{v}_w}+\epsilon}$$

where $\epsilon$  is a small scalar (e.g. $10^{-8}$) used to prevent division by 0, and $\beta _{1}$ (e.g. 0.9) and $\beta _{2}$ (e.g. 0.999) are the forgetting factors for gradients and second moments of gradients, respectively. Squaring and square-rooting is done elementwise.
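The five equations above translate almost line by line into code. The sketch below applies them to the two-parameter linear model on toy data; the function name `adam` and the values of `eta` and `n_iters` are assumptions for the example, while `beta1`, `beta2`, and `eps` use the example values quoted in the text.

```python
import numpy as np

def adam(x, y, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=5000):
    theta = np.zeros(2)   # [theta1 (intercept), theta2 (slope)]
    m = np.zeros(2)       # first moment: moving average of gradients
    v = np.zeros(2)       # second moment: moving average of squared gradients
    history = [theta.copy()]
    for t in range(1, n_iters + 1):
        residuals = theta[0] + theta[1] * x - y
        grad = np.array([np.mean(residuals), np.mean(residuals * x)])
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias correction compensates for m and v starting at zero
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        history.append(theta.copy())
    return theta, np.array(history)

# Toy data lying exactly on y = 1 + 2x (assumed for illustration only)
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy

theta_adam, path_adam = adam(x_toy, y_toy)
```

The iteration counter `t` starts at 1 so that the bias-correction denominators are never zero.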

**Exercise 9: Perform the optimization steps using Adam optimization.**

* perform adam optimization for n number of iterations
* visualize the updates of theta1 and theta2
* tune learning rate and number of iterations for faster convergence
* plot a regression line for every updated theta1 and theta2
* plot final regression line using the last theta1 and theta2 values

In [None]:
# Adam optimization

# YOUR CODE HERE

In [None]:
# Visualize the updates of theta1, theta2

# YOUR CODE HERE

In [None]:
# Visualize updates of Adam

# YOUR CODE HERE

In [None]:
# Parameter values

# YOUR CODE HERE

### Momentum optimization

Use Momentum optimization to find the minimum of the cost function.

Momentum involves maintaining the change in the position and using it in the subsequent calculation of the change in position.

If we think of updates over time, then the update at the current iteration or time (t) will add the change used at the previous time (t-1) weighted by the momentum hyperparameter, as follows:

$$change\_w_t = step\_size * f'(w_{t-1})\ +\ momentum * change\_w_{t-1}$$

The update to the position is then performed as before.

$$w_t = w_{t-1}\ -\ change\_w_t$$

The change in the position accumulates the magnitude and direction of changes over the iterations of the search, proportional to the size of the momentum hyperparameter.
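The two momentum equations can be sketched as follows for the two-parameter linear model on toy data. The function name `momentum_descent` and the values of `step_size`, `momentum`, and `n_iters` are assumptions picked for this example.

```python
import numpy as np

def momentum_descent(x, y, step_size=0.05, momentum=0.9, n_iters=1000):
    theta = np.zeros(2)    # [theta1 (intercept), theta2 (slope)]
    change = np.zeros(2)   # change applied at the previous iteration
    history = [theta.copy()]
    for _ in range(n_iters):
        residuals = theta[0] + theta[1] * x - y
        grad = np.array([np.mean(residuals), np.mean(residuals * x)])
        # New change = current gradient step + a fraction of the previous change
        change = step_size * grad + momentum * change
        theta = theta - change
        history.append(theta.copy())
    return theta, np.array(history)

# Toy data lying exactly on y = 1 + 2x (assumed for illustration only)
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy

theta_mom, path_mom = momentum_descent(x_toy, y_toy)
```

Setting `momentum=0.0` reduces this to plain gradient descent, which makes it a convenient baseline when comparing the paths of the two methods.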

**Exercise 10: Perform the optimization steps using Momentum optimization.**

* perform momentum optimization for n number of iterations
* visualize the updates of theta1 and theta2
* tune learning rate and number of iterations for faster convergence
* plot a regression line for every updated theta1 and theta2
* plot final regression line using the last theta1 and theta2 values

In [None]:
# Momentum optimization

# YOUR CODE HERE

In [None]:
# Visualize the updates of theta1, theta2

# YOUR CODE HERE

In [None]:
# Visualize updates of Momentum optimizer

# YOUR CODE HERE

In [None]:
# Parameter values

# YOUR CODE HERE

Discussions:

* Compare the RMSProp and Adam optimizers.
* What happens if we use different learning rates for theta1 and theta2 with each optimizer?
* What is the significance of the momentum factor in Momentum optimization?