# Lab 3. Linear regression & gradient descent

#### Table of contents

1. Overview
2. Diffusion background
3. Prepare the data
    - 3.1. Load the data
    - 3.2. Clean the data
    - 3.3. Standardize the data
4. Cost function
5. Gradient descent

## 1. Overview

The goal of this Lab is to extract the activation energy for the diffusion of water in its liquid state. This will be achieved by performing linear regression on water diffusion coefficients measured at various temperatures. You will learn how to clean and plot data, and how to perform linear regression with gradient descent.

## 2. Diffusion background

Diffusion is the net movement of anything (for example atoms, ions or molecules) from a region of higher concentration to a region of lower concentration. It is driven by a gradient in concentration. For example, if you spray perfume at one end of a room, the gas particles will eventually spread all over the room [wikipedia]. When there is no concentration gradient, this net molecular diffusion ceases, but the molecules keep moving: their motion is then governed by self-diffusion, which originates from the random motion of the molecules. Molecular diffusion is a thermally activated process and is therefore governed by an Arrhenius equation:

$D = Ae^{-\frac{E_A}{k_BT}}$

where $D$, $A$, $E_A$, $k_B$ and $T$ are the diffusion coefficient (at temperature $T$), a prefactor, the activation energy, the Boltzmann constant and the temperature, respectively. The diffusion coefficient measures how quickly a particular substance spreads through a particular medium and has units of distance$^2$/time. The activation energy represents the height of the energy barrier the substance has to overcome to successfully perform a diffusion step and has units of energy. Taking the natural logarithm of the Arrhenius equation gives the linear function:

$\ln{D} = \ln{A} - \frac{E_A}{k_BT}$

A plot of the natural logarithm of the diffusion coefficient $\ln{D}$ against $1/T$ is a straight line if the diffusing substance obeys the Arrhenius equation, and the activation energy $E_A$ can be extracted from its slope, which equals $-E_A/k_B$. In this lab, based on diffusion coefficients of water molecules at different temperatures, we will extract the corresponding activation energy.
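
To make the link between the slope and $E_A$ concrete, here is a minimal sketch on synthetic data. The prefactor `A_true`, the activation energy `EA_true` and the temperature range are made-up values chosen for illustration only (they are not the water data), and `np.polyfit` is used here as a ready-made straight-line fit rather than the gradient descent developed in this lab.

```python
import numpy as np

kB = 8.61733e-5          # Boltzmann constant in eV/K
A_true = 500.0           # assumed prefactor (arbitrary units), illustration only
EA_true = 0.15           # assumed activation energy in eV, illustration only

# Synthetic temperatures (K) and the diffusion coefficients Arrhenius would give
T = np.linspace(280.0, 360.0, 9)
D = A_true * np.exp(-EA_true / (kB * T))

# Fit ln(D) against 1/T with a straight line:
# the slope is -E_A/k_B and the intercept is ln(A)
slope, intercept = np.polyfit(1.0 / T, np.log(D), 1)

EA_fit = -slope * kB
A_fit = np.exp(intercept)
print(EA_fit, A_fit)
```

Because the synthetic data follow the Arrhenius equation exactly, the fitted values should match the assumed ones up to rounding errors.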

## 3. Prepare the data

Several publications provide values of self-diffusion coefficients of water at different temperatures (we will use those summarized [here](https://dtrx.de/od/diff/)). To help you get started, these data were compiled and stored in the file `water_diffusion.csv`, which you should download from the blackboard and upload to your workspace.

### 3.1. Load the data

__Q.1.__ Load the dataset `water_diffusion.csv` as a pandas DataFrame and store it into the variable `wd`. Write your answer between the `### BEGIN SOLUTION` and `### END SOLUTION` comment lines (1 mark).

In [None]:
import pandas as pd # We first load pandas
### BEGIN SOLUTION
### END SOLUTION

Let's have a look at the DataFrame `wd` in more detail. If you get errors in the following code lines, this means you did not load the DataFrame properly.

In [None]:
wd.head()

In [None]:
wd.tail()

In [None]:
wd.info()

Based on last week's Lab we see that there are 56 entries, 3 columns and no null objects. We note that the temperatures (in C and K) are both stored as strings: these entries are a mixture of digits and letters, so Python identifies them as `str`. The diffusion coefficients are floats.

### 3.2. Clean the data

Let's prepare the diffusion data. Here we just need to create a Series with all diffusion coefficients. Let's call this Series `diff`.

In [None]:
diff = wd['D (um2/ms)']

Let's create the List `y` containing the log of the diffusion coefficients.

In [None]:
import math
y = []
for di in diff:
    y.append(math.log(di))
print(y)
# Note that the loop above (3 lines) can be written in 1 line as a list comprehension: y = [math.log(di) for di in diff]

Here is the line-by-line explanation of the above code:

- we first import the `math` module to use its function `log`
- we initialize an empty List `y`
- we loop over the elements of diff and store each in the variable `di`
- we append at the end of the List `y` the natural log of the variable `di`
- we print the List `y`

You can verify that, for example, the log of the first value of the Series `diff`: 1.149 gives the results stored in the first element of the List `y`: 0.13889199886661865.
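
As a side note, NumPy can take the logarithm of a whole sequence at once. The sketch below compares the element-by-element approach with `np.log` on a small toy list; apart from 1.149, the numbers are made up for illustration.

```python
import math
import numpy as np

d_values = [1.149, 2.0, 0.5]   # toy values; only 1.149 comes from the text above

# Element-by-element natural log, as in the loop above
y_loop = [math.log(d) for d in d_values]

# Vectorized natural log: np.log accepts the whole list and returns an array
y_numpy = np.log(d_values)

print(y_loop[0])
print(np.allclose(y_loop, y_numpy))
```

Both approaches give the same numbers; the list version is kept in this lab because the rest of the code works with Lists.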

__Q.2.__ Create the Series `temp` containing the temperature in Kelvin of the measurements reported in the DataFrame `wd`. Do not clean the data for now, we will do it right after. Just keep the raw temperatures with the 'K', as they appear in the original DataFrame (1 mark).

In [None]:
### BEGIN SOLUTION
### END SOLUTION

Let's now transform the `temp` Series into a List, clean the data and store the inverse temperature in the List `x`. Again, if you get errors when you execute the following lines, you did not answer question 2 (or even question 1) correctly and you should go back to it.

In [None]:
temp_list = list(temp)
x = [1/float(ti.split('K')[0]) for ti in temp_list]

Here is a quick explanation of the previous lines of code:

- We first transform the Series `temp` to the List `temp_list`
- The second line is a list comprehension (a loop written in one line). We loop over the elements `ti` of the List `temp_list`
- We split each element `ti` at the character 'K', which should return a List of two elements (because there is only one instance of 'K' in each element `ti`): the temperature as a string of digits and an empty string (because there is nothing after the character 'K')
- We take the number (element in the list at index 0: `ti.split('K')[0]`) and transform it to a float
- We finally take the inverse of this float and store it into the List `x`
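
The steps above can be followed on a single value. The string `'298.15K'` below is a hypothetical entry written in the same format as the Kelvin temperatures; it is not necessarily present in the dataset.

```python
raw = '298.15K'                       # hypothetical entry, same format as the Kelvin temperatures

parts = raw.split('K')                # split at 'K': the number as text, then an empty string
temperature = float(parts[0])         # convert the text before 'K' to a float
inverse_temperature = 1 / temperature

print(parts)
print(temperature, inverse_temperature)
```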

We now have two lists, `x` and `y` containing the inverse temperature and the log of the corresponding diffusion coefficients, respectively. We can plot $y = f(x)$ to see how these quantities correlate.

In [None]:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(x,y,marker='o',lw=0)
# x holds inverse temperatures (unit 1/K) and y holds the natural log of D
plt.xlabel('1/T (1/K)')
plt.ylabel('ln D (D in um2/ms)')

This looks quite linear.

### 3.3. Standardize the data

Feature scaling is a technique to bring the independent features present in the data into a fixed range. It is performed during data pre-processing to handle features with very different magnitudes, values or units. We will see that feature scaling is essential when dealing with multiple features; it is also useful with a single feature because it helps gradient descent converge faster.

Standardization is a common feature scaling technique that re-scales the values of a feature so that their distribution has a mean of 0 and a variance of 1.

$x_{\text{std}}^{(i)} = \frac{x^{(i)}-\mu}{\sigma}$

with $\mu$ and $\sigma$ the mean and standard deviation of the values $x^{(i)}$.
The mean and standard deviation of the values in a list can be computed as:

In [None]:
import numpy as np
mu = np.mean(x)
std = np.std(x)
print(mu,std)

__Q.3.__ Complete the function below so that it standardizes the list of numbers taken as argument. The function should return the standardized list. You should verify that the mean of the standardized data is close to zero and that the standard deviation is close to 1.0 (2 marks).

In [None]:
def standard(x_list):
    mu = np.mean(x_list)
    std = np.std(x_list)
    x_standard = []
    ### BEGIN SOLUTION
    ### END SOLUTION
    return x_standard

Let's have a look at the plot based on the standardized data.

In [None]:
x_standard = standard(x)
y_standard = standard(y)
plt.plot(x_standard,y_standard,marker='o',lw=0)
# Standardized quantities are dimensionless
plt.xlabel('standardized 1/T')
plt.ylabel('standardized ln D')

Both sets of data should now vary within a similar range.
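
One possible way to check that a list is standardized is sketched below. The helper name `is_standardized` and the tolerance `tol` are illustrative choices, not part of the lab code, and the toy lists are made up. Note that `np.std` computes the population standard deviation by default (`ddof=0`).

```python
import numpy as np

def is_standardized(values, tol=1e-8):
    """Return True if the values have mean ~0 and standard deviation ~1."""
    arr = np.asarray(values, dtype=float)
    mean_ok = np.isclose(arr.mean(), 0.0, atol=tol)
    std_ok = np.isclose(arr.std(), 1.0, atol=tol)   # population std (ddof=0)
    return bool(mean_ok and std_ok)

# Toy lists: the first one has mean 0 and standard deviation 1, the second does not
print(is_standardized([-1.0, 1.0, -1.0, 1.0]))
print(is_standardized([1.0, 2.0, 3.0]))
```

You could call such a helper on `x_standard` and `y_standard` once your function `standard` is complete.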

## 4. Cost function

We would now like to perform linear regression on the standardized data.
As discussed in the lecture, this corresponds to minimizing the cost function:

$J\left(\theta_0,\theta_1\right) = \frac{1}{2m}\sum_{i=1}^{m}\left(h(x^{(i)})-y^{(i)}\right)^2$

with respect to the coefficients $\theta_0$ and $\theta_1$, given the hypothesis $h$ defined as:

$h(x^{(i)}) = \theta_0+\theta_1x^{(i)}$

$x^{(i)}$ and $y^{(i)}$ represent the input data and the corresponding target output, respectively, and $m$ the number of input examples. Let's first define the cost function.
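
As a quick sanity check of the formula, here is a hand-worked example on three made-up points, $(0,1)$, $(1,3)$ and $(2,5)$, which are not taken from the water data. With $m=3$, $\theta_0=0$ and $\theta_1=2$, the hypothesis gives $h = 0, 2, 4$, so every residual $h(x^{(i)})-y^{(i)}$ equals $-1$ and:

$J(0,2) = \frac{1}{2\times 3}\left((-1)^2+(-1)^2+(-1)^2\right) = \frac{3}{6} = 0.5$

With $\theta_0=1$ and $\theta_1=2$ the line passes exactly through the three points, every residual is zero and:

$J(1,2) = 0$

You can use these two values to test your implementation of the cost function.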

__Q.4.__ Complete the cost function below that takes as arguments the two coefficients $\theta_0$ and $\theta_1$ (`t0` and `t1`) and the two lists `x` and `y` of data (2 marks).

In [None]:
def J(t0,t1,x,y):
    m = len(x) # this is the number of examples in the data
    err = 0.0 # We initialize the variable err
    ### BEGIN SOLUTION
    ### END SOLUTION
    return err

## 5. Gradient descent

We now wish to find the optimal parameters $\theta_0$ and $\theta_1$ that minimize the cost function. We will implement gradient descent. The idea is to update the coefficients $\theta_0$ and $\theta_1$ by changing their values such that it will bring the cost function toward smaller and smaller values. The update of the coefficients $\theta_0$ and $\theta_1$ corresponds to the mathematical equations:

$\theta_0 := \theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left(h(x^{(i)})-y^{(i)}\right)$

$\theta_1 := \theta_1-\alpha\frac{1}{m}\sum_{i=1}^m\left(h(x^{(i)})-y^{(i)}\right)x^{(i)}$

with $\alpha$ the learning rate. It is important that the two coefficients are updated simultaneously. This means that in the second equation, the hypothesis must be computed from the value of $\theta_0$ __before__ its update in the first equation! A simple way to satisfy this is to evaluate $h$ beforehand and then pass it to the functions that update the coefficients $\theta_0$ and $\theta_1$.
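
The difference between a simultaneous and a sequential update can be seen on a toy problem. The sketch below uses the made-up function $f(a,b) = a^2 + ab + b^2$, chosen only because its gradient is easy to write; it is not the cost function of this lab, and the names `grad_a` and `grad_b` are illustrative.

```python
def grad_a(a, b):
    # Partial derivative of f(a, b) = a**2 + a*b + b**2 with respect to a
    return 2 * a + b

def grad_b(a, b):
    # Partial derivative of f(a, b) = a**2 + a*b + b**2 with respect to b
    return a + 2 * b

alpha = 0.1
a, b = 1.0, 1.0

# Simultaneous update: both gradients are evaluated at the old point (a, b)
a_sim, b_sim = a - alpha * grad_a(a, b), b - alpha * grad_b(a, b)

# Sequential update (not what gradient descent prescribes):
# the gradient for b is evaluated with the already-updated a
a_seq = a - alpha * grad_a(a, b)
b_seq = b - alpha * grad_b(a_seq, b)

print(a_sim, b_sim)
print(a_seq, b_seq)
```

The first coefficient is the same in both cases, but the second one differs because the sequential version mixes old and new values.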

__Q.5.__ Complete the function below to update $\theta_0$ (`t0`). The function takes the current value of `t0`, the learning rate $\alpha$, the hypothesis $h$ (the list $\theta_0+\theta_1x^{(i)}$ of length $m$) and the target values of the data $y$. Just define the variable `grad_t0` as the gradient of the cost function with respect to $\theta_0$. You can also look at the gradient descent code below to understand better the shape of the arguments (2 marks).

In [None]:
def update_t0(t0,alpha,h,y):
    m = len(y)
    grad_t0 = 0.0
    ### BEGIN SOLUTION
    ### END SOLUTION
    new_t0 = t0-alpha*grad_t0
    return new_t0

__Q.6.__ Complete the function below to update `t1`. Just define the variable `grad_t1` as the gradient of the cost function with respect to $\theta_1$ (2 marks).

In [None]:
def update_t1(t1,alpha,h,x,y):
    m = len(y)
    grad_t1 = 0.0
    ### BEGIN SOLUTION
    ### END SOLUTION
    new_t1 = t1-alpha*grad_t1
    return new_t1

If you have done everything right, the code below performs gradient descent with the 3 functions you implemented, and it should converge within approximately 100 steps. Read the commands carefully and make sure you understand the whole code. Try playing with the learning rate.

In [None]:
# Here we initialize t0 and t1
t0, t1 = 0,0
# Define the learning rate alpha
alpha = 0.1
# Define m as the number of examples
m = len(x_standard)

# We will loop for 100 steps (step goes from 1 to 100)
for step in range(1,101):

    # This is the hypothesis computed with t0 and t1
    h = [t0+t1*x_standard[i] for i in range(m)]
    
    # Here we perform the update of the coefficients
    t0 = update_t0(t0,alpha,h,y_standard)
    t1 = update_t1(t1,alpha,h,x_standard,y_standard)
    
    # We now compute the error based on the cost function defined above and the updated coefficients t0 and t1
    err = J(t0,t1,x_standard,y_standard)
    
    # Here we print the step number, t0, t1 and the error value
    print(step,t0,t1,err)

# We print the final value of the coefficients
print("Final values of the coefficients t0 and t1:", t0, t1)

Based on the optimized coefficients, we can plot the hypothesis $h$ that should reproduce the data quite well.

In [None]:
# We plot the original data
plt.plot(x_standard,y_standard,marker='o',lw=0,label="data")
# Here we define an array of 100 evenly spaced numbers between the min and max of x_standard
x_fit = np.linspace(min(x_standard),max(x_standard),100)
# We compute the value of the hypothesis over x_fit based on the final coefficients t0 and t1
y_fit = [t0+t1*xi for xi in x_fit]
# Plot the straight line
plt.plot(x_fit,y_fit,lw=1,label="best fit")
# Standardized quantities are dimensionless
plt.xlabel('standardized 1/T')
plt.ylabel('standardized ln D')
plt.legend()

We have obtained the values of the standardized regression coefficients $\theta_0$ and $\theta_1$. The regression coefficients of the original data, $\beta_0$ and $\beta_1$, can be deduced as:

$\beta_1 = \theta_1\frac{\sigma_y}{\sigma_x}$

$\beta_0 = \frac{1}{m}\sum_i (y^{(i)}-\beta_1x^{(i)})$

with $\sigma_x$ and $\sigma_y$ the standard deviation of the initial $x$ and $y$ data.
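
These two formulas can be checked on synthetic data. In the sketch below the toy data (a noisy straight line with made-up coefficients) and the use of `np.polyfit` in place of gradient descent are assumptions made to keep the example short and self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0028, 0.0036, 20)                    # toy inputs, same order of magnitude as 1/T
y = 7.0 - 2000.0 * x + rng.normal(0.0, 0.02, x.size)   # toy targets with a little noise

# Standardize both variables (np.std uses the population standard deviation)
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()

# A least-squares fit on the standardized data stands in for gradient descent
t1, t0 = np.polyfit(x_std, y_std, 1)

# Back-transform to the original scale with the two formulas above
b1 = t1 * y.std() / x.std()
b0 = np.mean(y - b1 * x)

# Direct least-squares fit on the original data, for comparison
b1_direct, b0_direct = np.polyfit(x, y, 1)

print(b0, b1)
print(b0_direct, b1_direct)
```

The back-transformed coefficients should agree with the direct fit up to rounding errors, and `t0` should be very close to zero because both standardized variables have zero mean.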

In [None]:
b1 = t1*np.std(y)/np.std(x)
b0 = np.mean([y[i]-b1*x[i] for i in range(m)])
print("The regression coefficients b0 and b1 are:", b0, b1)

Given the Boltzmann constant $k_B$=8.61733$\times$10$^{-5}$ eV/K, the activation energy (in eV) of liquid water can be computed as $E_A = -k_B\beta_1$.

In [None]:
kB = 8.61733*1e-5
EA = -b1*kB
print("The activation energy of liquid water is:", EA, "eV")

Here you should find an activation energy of approximately 0.2 eV. If you find a very different number you might want to review the steps that brought you here.