# Lab 4. Multivariate linear regression

#### Table of contents

1. Overview
2. About PM$_{2.5}$
3. Prepare the data
4. Univariate linear regression
5. Multivariate linear regression

## 1. Overview

In this lab session we will use multivariate linear regression to predict the PM$_{2.5}$ concentration in Hong Kong from atmospheric data. Building on Lab 3, we will extend the code that computes the cost function and performs gradient descent so that it handles multiple variables.

## 2. About PM$_{2.5}$

PM$_{2.5}$ refers to fine particles with aerodynamic diameters of 2.5 microns or less. It is recognized as a major component of air pollution and has been shown to lead to multiple adverse health outcomes. Usually, the concentration of PM$_{2.5}$ in the air is measured by ground stations and the coverage is extended via spatial interpolation. However, the results may be uncertain because of the limited number of monitoring stations and sampling points available for the interpolation. To fill this information gap, satellite, meteorological and additional air quality index data have been used to monitor air quality. In this lab, we will investigate the relationship between the concentration of PM$_{2.5}$ and air quality indicators such as the concentrations of NO$_2$, O$_3$, etc.

## 3. Prepare the data

We will use data available from Hong Kong's Environmental Protection Department. The original data can be downloaded from their [website](https://cd.epic.epd.gov.hk/EPICDI/air/station/?lang=en); however, we have already compiled and partially cleaned the data from 1 January 2019 to 31 December 2019, recorded by the Central/Western station. You can download the CSV file from Blackboard.

__Q.1.__ Load the data, drop the empty column for the CO concentration (labelled `CO`), and drop all rows with missing data. Store the final dataframe in the variable `pm25` (2 marks).

In [None]:
import pandas as pd
### BEGIN SOLUTION
### END SOLUTION

In [None]:
pm25.info()

In [None]:
pm25.head()

Here we have a time series with dates and hours, and the concentrations of various chemicals. FSP stands for fine suspended particulates and corresponds to the concentration of PM$_{2.5}$. All pollutant concentrations are given in $\mu$g/m$^3$. Our goal is to predict FSP from the concentrations of NO$_2$, NO$_x$, O$_3$ and SO$_2$. Respirable suspended particulates (RSP) are another type of suspended particles (larger in size). Their concentration strongly correlates with FSP (see below), so we will leave RSP out of the input features.

Let's have a look at the scatter matrix first to have an idea of the data.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
from pandas.plotting import scatter_matrix
scatter_matrix(pm25[['FSP','RSP','NO2','NOX','O3','SO2']],figsize=(12,12))

In [None]:
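# numeric_only=True skips non-numeric columns such as the date
# (needed with pandas >= 2.0 when such columns are present)
pm25.corr(numeric_only=True)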

## 4. Univariate linear regression

This first attempt at predicting the concentration of PM$_{2.5}$ is based on what we learnt during our previous lab session. Let's redefine below the model we developed. All key ingredients to perform feature scaling and linear regression have been compiled into functions. You must review the functions carefully and be sure you understand all the details.

In [None]:
import numpy as np

def standard(x):
    # Standardize a list x
    mu = np.mean(x)
    std = np.std(x)
    x_std = [(xi-mu)/std for xi in x]
    return x_std, mu, std

def recover_beta(t0,t1,x,y,stdx,stdy):
    # Recover beta linear regression coefficients from scaled ones
    m = len(x)
    b1 = t1*stdy/stdx
    b0 = np.mean([y[i]-b1*x[i] for i in range(m)])
    return b0,b1

def hypothesis_uni(b0,b1,x):
    # The univariate hypothesis
    h_uni = [b0+b1*xi for xi in x]
    return h_uni

def cost_function(h,y):
    # Computes the cost: half the mean squared error
    m = len(y)
    err = (1.0/(2.0*m))*sum([(h[i]-y[i])**2 for i in range(m)])
    return err

def update_t0(t0,alpha,h,y):
    # Update t0 coefficient during GD
    m = len(y)
    grad_t0 = (1.0/m)*sum([h[i]-y[i] for i in range(m)])
    new_t0 = t0-alpha*grad_t0
    return new_t0

def update_t1(t1,alpha,h,x,y):
    # Update t1 coefficient during GD
    m = len(y)
    grad_t1 = (1.0/m)*sum([x[i]*(h[i]-y[i]) for i in range(m)])
    new_t1 = t1-alpha*grad_t1
    return new_t1

def gd(x,y,Niter):
    # Gradient descent
    t0, t1 = 0,0
    alpha = 0.1
    for step in range(1,Niter):
        h = hypothesis_uni(t0,t1,x)
        t0 = update_t0(t0,alpha,h,y)
        t1 = update_t1(t1,alpha,h,x,y)
        h = hypothesis_uni(t0,t1,x)
        err = cost_function(h,y)
        print(step,t0,t1,err)
    print("Final values of the coefficients t0 and t1:", t0, t1)
    return t0,t1
    
def plot_data(x,y,t0,t1):
    # Plot the (x, y) data and the line t0+t1*x
    plt.plot(x,y,marker='.',lw=0,label="data")
    x_fit = np.linspace(min(x),max(x),100)
    y_fit = [t0+t1*xi for xi in x_fit]
    plt.plot(x_fit,y_fit,lw=1,label="linear fit",color='r')
    plt.xlabel('NO2')
    plt.ylabel('FSP')
    plt.legend()
    plt.show()

Now let's select the concentration of NO$_2$ as the input feature and try to linearly fit the concentration of PM$_{2.5}$ with gradient descent.

In [None]:
x = list(pm25['NO2'])
y = list(pm25['FSP'])

x_std, mux, stdx = standard(x)
y_std, muy, stdy = standard(y)
t0,t1 = gd(x_std,y_std,100)
b0,b1 = recover_beta(t0,t1,x,y,stdx,stdy)
print("Coefficients b0 and b1 corresponding to the best linear fit:",b0,b1)
plot_data(x,y,b0,b1)

As a reminder, the Pearson correlation between NO$_2$ and FSP is approximately 0.5, hence the poor linear fit.
To evaluate how well the observed outcomes are replicated by the model, we can use various quantitative and qualitative analyses:

- the final cost
- the $R^2$ score
- plots of the data vs. the model prediction

This is illustrated in the following.
The final cost (half the mean squared error) and the root mean square error (RMSE) both measure the error between the model and the data; however, their values depend on the scale of the values in the dataset.
Therefore, in linear regression, we often compute the coefficient of determination ($R^2$), defined as:

$R^2 = 1-\frac{SS_{res}}{SS_{tot}}$

with $SS_{res}$, the sum of squares of residuals, also called the residual sum of squares (proportional to the cost function):

$SS_{res} = \sum_i\left(y^{(i)}-h^{(i)}\right)^2$

and, $SS_{tot}$ the total sum of squares (proportional to the variance of the data):

$SS_{tot} = \sum_i\left(y^{(i)}-\mu_y\right)^2$

For a least-squares linear fit with an intercept, evaluated on the data used for the fit, $R^2$ lies between 0 and 1. The closer the value is to 1, the better the fit.

__Q.2.__ Complete the function r2 below that computes the coefficient of determination based on values of the output data `y` and the hypothesis `h`, both lists of length `m` (the number of examples in the dataset) (2 marks).

In [None]:
def r2(y,h):
    m = len(y)
    mu = np.mean(y)
    ### BEGIN SOLUTION
    ### END SOLUTION
    return 1-ss_res/ss_tot

h_uni = hypothesis_uni(b0,b1,x)
print("R2 score :{:.4f}".format(r2(y,h_uni)))

This is not a great value for $R^2$, and we will try to improve our model prediction later by introducing additional features.
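The link between the Pearson correlation quoted earlier and $R^2$ can be checked numerically. For a univariate least-squares fit with an intercept, $R^2$ equals the squared Pearson correlation, so a correlation of about 0.5 corresponds to an $R^2$ of about 0.25. The sketch below uses synthetic data (not the PM$_{2.5}$ dataset) and `np.polyfit` as a stand-in for gradient descent; all variable names are illustrative.

```python
import numpy as np

# Synthetic data with a moderate linear relationship (illustrative only)
rng = np.random.default_rng(0)
x_syn = rng.normal(size=500)
y_syn = 0.5 * x_syn + rng.normal(scale=0.9, size=500)

# Least-squares line; np.polyfit returns [slope, intercept] for degree 1
b1_syn, b0_syn = np.polyfit(x_syn, y_syn, 1)
h_syn = b0_syn + b1_syn * x_syn

# Coefficient of determination from its definition
ss_res = np.sum((y_syn - h_syn) ** 2)
ss_tot = np.sum((y_syn - y_syn.mean()) ** 2)
r2_syn = 1 - ss_res / ss_tot

# Pearson correlation between input and output
r_syn = np.corrcoef(x_syn, y_syn)[0, 1]

print("r^2 = {:.4f}, R2 = {:.4f}".format(r_syn ** 2, r2_syn))
```

The two printed values should agree up to rounding, which is why a modest correlation already tells us that a single feature cannot give a high $R^2$.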

Another, more qualitative, way to assess the model accuracy is to plot the actual data together with the predicted data. Since the data are measured over time, we can plot the measured PM$_{2.5}$ concentration and the value predicted by the linear model as functions of time. Each row in the dataframe represents one hour, so we use the elapsed time since the first data point (1 January 2019, midnight) as the x-axis. It can also be useful to zoom in on a specific time period, so we define an initial and a final time in hours, `Ni` and `Nf`, respectively. From the dataframe information, we have approximately 8300 rows, a little less than a full year in hours (365 × 24 = 8760) because of the rows with missing values that we deleted.

In [None]:
Ni = 0
Nf = min(8300, len(y))  # cap at the number of rows so the x and y lengths match
plt.figure(figsize=(20,10))
plt.plot(range(Ni,Nf),y[Ni:Nf],marker='.',ms=0,color='r',lw=1.0,label='data')
plt.plot(range(Ni,Nf),h_uni[Ni:Nf],marker='o',ms=0,color='b',lw=1.0,label='univariate model prediction')
plt.legend()
plt.ylabel(r"PM$_{2.5}$ ($\mu$g/m$^3$)",fontsize=22)
plt.xlabel("time since Jan 1st (h)",fontsize=22)
plt.show()

We can also plot the output data (here PM$_{2.5}$) as a function of the predicted output based on the linear model. Moreover, it is common to add the line `y = x` corresponding to a perfect model.

In [None]:
# Predicted values on the x-axis and measured data on the y-axis, matching the axis labels
plt.plot(h_uni,y,marker='.',color='r',lw=0.0,ms=2.0)
plt.plot(range(150),range(150),color='k',lw=0.5,label='y = x')
plt.xlabel('PM predicted',fontsize=22)
plt.ylabel('PM data',fontsize=22)
plt.legend()
plt.show()

Since most of the red dots lie below the `y = x` line, we often overestimate the actual PM$_{2.5}$ concentration. This can also be seen in the time series plot.
Overall, the univariate linear model captures some of the variation of the PM$_{2.5}$ concentration, but with limited accuracy.

Note that here we actually train and test the model on the same dataset. The proper way of evaluating the performance of a model is to use separate training and testing datasets. The model is trained on the training set and its performance is evaluated on the test set. This is a very important point that we will explore further in the class.
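As a minimal sketch of this idea, the example below splits a synthetic dataset (not the PM$_{2.5}$ data) into 80% training and 20% test examples, fits a line on the training part only, and compares the RMSE on both parts. The split ratio, the use of `np.polyfit` for the fit, and all variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic univariate data (illustrative only)
rng = np.random.default_rng(42)
x_all = rng.uniform(0, 100, size=200)
y_all = 5.0 + 0.3 * x_all + rng.normal(scale=4.0, size=200)

# Shuffle the row indices, then keep 80% for training and 20% for testing
idx = rng.permutation(len(x_all))
n_train = int(0.8 * len(x_all))
train_idx, test_idx = idx[:n_train], idx[n_train:]

# Fit the line on the training set only
slope, intercept = np.polyfit(x_all[train_idx], y_all[train_idx], 1)

def rmse(x_part, y_part):
    # Root mean square error of the fitted line on the given subset
    h_part = intercept + slope * x_part
    return np.sqrt(np.mean((h_part - y_part) ** 2))

rmse_train = rmse(x_all[train_idx], y_all[train_idx])
rmse_test = rmse(x_all[test_idx], y_all[test_idx])
print("RMSE train: {:.3f}, RMSE test: {:.3f}".format(rmse_train, rmse_test))
```

A random shuffle is only one option; for time series such as ours, a chronological split (training on earlier hours, testing on later ones) is a common alternative.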

## 5. Multivariate linear regression

Let's now modify the functions to perform multivariate linear regression.
The goal is to modify the previous functions to take as input not just a feature vector but a feature matrix (often called the design matrix). To illustrate this, we will only consider 4 features; however, the model must be general and valid for `n` features. The multivariate linear regression hypothesis can be written as follows, with $x_0 = 1$ for the intercept term:

$h(x_0,x_1,x_2,x_3,x_4) = \theta_0x_0+\theta_1x_1+\theta_2x_2+\theta_3x_3+\theta_4x_4$

Or in matrix notation:

\begin{align}
\begin{bmatrix}
h(X^{(1)})\\
h(X^{(2)})\\
\vdots\\
h(X^{(m)})\\
\end{bmatrix}
=
\begin{bmatrix}
1 & x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & x_4^{(1)}\\
1 & x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & x_4^{(2)}\\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & x_1^{(m)} & x_2^{(m)} & x_3^{(m)} & x_4^{(m)}
\end{bmatrix}
\begin{bmatrix}
\theta_0\\
\theta_1\\
\theta_2\\
\theta_3\\
\theta_4
\end{bmatrix}
= X\Theta
\end{align}
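The matrix form can be checked on a toy example: with one row per example and a leading column of ones, the product of the design matrix with the coefficient vector gives one hypothesis value per row. The numbers and the names `features`, `X_toy` and `theta_toy` below are made up for illustration.

```python
import numpy as np

# Three examples with two features each (toy values)
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])

# Prepend a column of ones for the intercept: shape (m, n) = (3, 3)
X_toy = np.c_[np.ones(features.shape[0]), features]

# One coefficient per column of the design matrix
theta_toy = np.array([0.5, 2.0, -1.0])

# Matrix-vector product: one hypothesis value per example
h_toy = X_toy @ theta_toy
print(X_toy.shape, h_toy)
```

For the first row this evaluates to 0.5·1 + 2.0·1 − 1.0·2 = 0.5, which is the scalar hypothesis applied to that example.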

First we prepare the design matrix `X` as a 2D NumPy array. As discussed in class, we need to add a column of ones to the design matrix `X` to account for the intercept in the hypothesis.

In [None]:
X = pm25[['NO2','NOX','O3','SO2']].to_numpy()
X = np.c_[np.ones(X.shape[0]), X]
y = list(pm25['FSP'])

print("Shape as row, columns",X.shape)
print(X)

For multiple features, it is fundamental to scale data before we use gradient descent. We will use standardization.

__Q.3.__ Complete the function `standard_multi` below, which applies standardization to each column (except the first) of a given feature matrix `X`. The function must return the standardized design matrix and the lists of means and standard deviations of each column of the matrix. We will assume that the mean and standard deviation of the first column are both equal to 1 (2 marks).

In [None]:
def standard_multi(X):
    m,n = X.shape
    X_std = np.ones(X.shape) # The standardized feature matrix, initialized with ones
    # The following 2 lists will contain the mean and standard deviation of each column
    # We initialize the lists with the mean and standard deviation of the first column set to 1
    mu,std = [1],[1] 
    ### BEGIN SOLUTION
    ### END SOLUTION
    return X_std, mu, std

__Q.4.__ Define the hypothesis function `hypothesis_multi` that returns a list of the hypothesis values evaluated for each row of the feature matrix. The list must therefore be of length `m`, the number of examples in the dataset (i.e. the number of rows of the feature matrix) (2 marks).

In [None]:
def hypothesis_multi(ts,X):
    m,n = X.shape
    h = []
    ### BEGIN SOLUTION
    ### END SOLUTION
    return h

__Q.5.__ Complete the function `update_ts` that updates the coefficients theta during gradient descent. This function takes in a list of the theta values, the learning rate (alpha), the hypothesis list evaluated previously based on previous theta values, the feature matrix `X`, and the list of output values `y`. The function should return a list of updated values of theta. You should look at the `gd` function below to understand better the role of the `update_ts` function (2 marks).

In [None]:
def update_ts(ts,alpha,h,X,y):
    m,n = X.shape
    grads = []
    
    ### BEGIN SOLUTION
    ### END SOLUTION
    
    for i in range(n):
        ts[i] = ts[i]-alpha*grads[i]
    return ts

Finally, we provide the gradient descent function. You should minimize the cost and obtain the parameters $\Theta$.

In [None]:
def gd(X,y,Niter):
    ts = [0]*X.shape[1]
    alpha = 0.1
    for step in range(1,Niter):
        h = hypothesis_multi(ts,X)
        ts = update_ts(ts,alpha,h,X,y)
        h = hypothesis_multi(ts,X)
        err = cost_function(h,y)
        print(step,ts,err)
    print("Final values of the coefficients ts and MSE:", ts,err*2.0)
    return ts

X_std,mux,stdx = standard_multi(X)
y_std,muy,stdy = standard(y)
ts = gd(X_std,y_std,100)

You can now retrieve the unscaled coefficients $\beta$ and plot the time series together with the univariate linear regression.
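The rescaling used to recover the coefficients follows from writing the standardized model back in the original units. As a short derivation, using $\mu_y,\sigma_y$ for the mean and standard deviation of the output and $\mu_i,\sigma_i$ for those of feature $i$ (notation introduced here for illustration), with sums running over the features $i\geq 1$:

\begin{align}
\frac{y-\mu_y}{\sigma_y} &= \theta_0+\sum_{i\geq 1}\theta_i\,\frac{x_i-\mu_i}{\sigma_i}\\
y &= \underbrace{\mu_y+\sigma_y\theta_0-\sum_{i\geq 1}\beta_i\mu_i}_{\beta_0}+\sum_{i\geq 1}\beta_i x_i,
\qquad \beta_i=\theta_i\frac{\sigma_y}{\sigma_i}
\end{align}

Since the standardized features and output have zero mean, gradient descent started from zero keeps $\theta_0$ at zero (up to rounding), so $\beta_0=\mu_y-\sum_{i\geq 1}\beta_i\mu_i$, which is the mean of $y-\sum_{i\geq 1}\beta_i x_i$ over the dataset.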

In [None]:
# retrieve the unscaled coefficients bs (bs[0] is the intercept)
m,n = X.shape
bs = [ts[i]*stdy/stdx[i] for i in range(n)]
bs[0] = np.mean([y[j]-sum([bs[i]*X[j][i] for i in range(1,n)]) for j in range(m)])
h_multi = hypothesis_multi(bs,X)

print("R2 score uni: {:.4f} and final cost: {:.4f}".format(r2(y,h_uni),cost_function(h_uni,y)))
print("R2 score multi: {:.4f} and final cost: {:.4f}".format(r2(y,h_multi),cost_function(h_multi,y)))

In [None]:
Ni = 0
Nf = min(8300, len(y))  # cap at the number of rows so the x and y lengths match
plt.figure(figsize=(20,10))
plt.plot(range(Ni,Nf),y[Ni:Nf],marker='.',ms=0,color='r',lw=1.0,label='data')
plt.plot(range(Ni,Nf),h_uni[Ni:Nf],marker='o',ms=0,color='b',lw=1.0,label='uni model')
plt.plot(range(Ni,Nf),h_multi[Ni:Nf],marker='o',ms=0,color='g',lw=1.0,label='multi model')
plt.legend()
plt.ylabel(r"PM$_{2.5}$ ($\mu$g/m$^3$)",fontsize=22)
plt.xlabel("time since Jan 1st (h)",fontsize=22)
plt.show()

In [None]:
# Predicted values on the x-axis and measured data on the y-axis, matching the axis labels
plt.plot(h_uni,y,marker='.',color='r',lw=0.0,ms=2.0,label='uni')
plt.plot(h_multi,y,marker='.',color='g',lw=0.0,ms=2.0,label='multi')
plt.plot(range(150),range(150),color='k',lw=0.5,label='y = x')
plt.xlabel('PM predicted',fontsize=22)
plt.ylabel('PM data',fontsize=22)
plt.legend()
plt.show()

From both plots, we can see the improvement in the predicted PM$_{2.5}$ concentration with multivariate linear regression compared to the univariate case.