<a href="https://colab.research.google.com/github/MJMortensonWarwick/WBS2003/blob/main/1_2_Linear_Regression_(ML).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Linear Regression: Machine learning approach
Following on from our [previous Notebook](https://colab.research.google.com/github/MJMortensonWarwick/WBS2003/blob/main/1_1_Linear_Regression_(statistics).ipynb), this tutorial addresses the same regression problem, this time taking a machine learning approach.

We will begin by importing the packages and data, and setting up our variables (all as before):

In [1]:
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd  
import seaborn as sns 

# Only works on Jupyter/Anaconda
%matplotlib inline  

from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# read in the data
df = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv", header=None)

# separate the x and Y values
x_values = df.drop([13], axis = 1)
print(f'X values: \n {x_values.head()}\n')

y_value = df[13]
print(f'Y value: \n {y_value[0:5]}')

X values: 
         0     1     2   3      4      5     6       7   8      9     10  \
0  0.00632  18.0  2.31   0  0.538  6.575  65.2  4.0900   1  296.0  15.3   
1  0.02731   0.0  7.07   0  0.469  6.421  78.9  4.9671   2  242.0  17.8   
2  0.02729   0.0  7.07   0  0.469  7.185  61.1  4.9671   2  242.0  17.8   
3  0.03237   0.0  2.18   0  0.458  6.998  45.8  6.0622   3  222.0  18.7   
4  0.06905   0.0  2.18   0  0.458  7.147  54.2  6.0622   3  222.0  18.7   

       11    12  
0  396.90  4.98  
1  396.90  9.14  
2  392.83  4.03  
3  394.63  2.94  
4  396.90  5.33  

Y value: 
 0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: 13, dtype: float64


###Train-Test Split
As per the slides, our next step will be to split the data such that part can be used for training the model, and part can be reserved for testing the model. We can do this with just a couple of lines of code:  

In [3]:
# split data into training and test
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x_values, y_value, test_size = 0.2, random_state=1234)

# print the shapes to check everything is OK
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(404, 13)
(102, 13)
(404,)
(102,)


In our code we have split using an 80:20 ratio (80% of the data for training; 20% for testing). We specify _random\_state_ as a fixed number so that this split will be the same each time (the splitting algorithm is based on random numbers). We can confirm this has worked by looking at the size of our different datasets:


*   _X\_train_ (the $x$ values we use for training) is 404 rows and 13 columns. 13 $x$ columns is what we would expect;
*   _X\_test_ (the $x$ values we use for testing) is 102 rows and 13 columns. Again we expect the 13 $x$ columns, and we can also compare the number of rows with _X\_train_: 102 rows is approximately 20% of the 506 rows in total;
*   _Y\_train_ (the $Y$ values we use for training) is 404 rows and a single column - again, what we would expect;
*   _Y\_test_ (the $Y$ values we use for testing) is 102 rows and a single column. All seems to be correct!
<br>
<br>
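
To make the row counts concrete, here is a minimal sketch of a seeded hold-out split using only the standard library. The helper `simple_train_test_split` is hypothetical and is not scikit-learn's implementation, so it will not pick the same rows as `train_test_split`. It assumes the test share is rounded up, which is consistent with the 404/102 shapes printed above.

```python
import math
import random

def simple_train_test_split(n_rows, test_size=0.2, seed=1234):
    """Return (train_idx, test_idx) for a shuffled hold-out split.

    Hypothetical helper for illustration only; not scikit-learn's code.
    """
    n_test = math.ceil(n_rows * test_size)  # round the test share up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # seeded shuffle, so repeatable
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_train_test_split(506)
print(len(train_idx), len(test_idx))  # 404 102
```

Calling the helper again with the same `seed` returns the same indices, which is the role _random\_state_ plays in the real function.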

###Regularisation
Now we can turn our attention to the model. As per the slides, our linear regression will remain the same, but we will use a modified algorithm to determine it. Specifically we will use $L1$ regression (LASSO) where the objective function (cost/loss function) is:

$\text{OLS objective} + \alpha \cdot \sum_i |\beta_i|$

I.e. we add a penalty to the ordinary least squares (OLS) objective to limit the sum of the absolute values of the co-efficients ($\beta$ weights). We can see the amount of influence this penalty has is controlled by the hyperparameter $\alpha$. Hyperparameters are something we will return to later in the module but for the moment we will just use an arbitrary value of 0.25. 
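
As a rough illustration, the objective can be written out in plain Python so that each term is visible. The scikit-learn documentation for `Lasso` states the objective as the squared error divided by twice the number of samples, plus $\alpha$ times the sum of absolute $\beta$ values. The function name `lasso_objective` and the toy data below are made up for this sketch.

```python
def lasso_objective(X, y, betas, intercept, alpha):
    """Scaled squared error plus alpha times the L1 norm of the betas."""
    n = len(y)
    residuals = [
        y_i - (intercept + sum(b * x for b, x in zip(betas, row)))
        for row, y_i in zip(X, y)
    ]
    squared_error = sum(r ** 2 for r in residuals) / (2 * n)
    l1_penalty = alpha * sum(abs(b) for b in betas)  # intercept is not penalised
    return squared_error + l1_penalty

# Made-up toy data: 3 rows, 2 features (not taken from the housing dataset)
X_toy = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y_toy = [5.0, 4.0, 8.0]

dense = lasso_objective(X_toy, y_toy, betas=[2.0, 1.0], intercept=0.0, alpha=0.25)
sparse = lasso_objective(X_toy, y_toy, betas=[2.0, 0.0], intercept=0.0, alpha=0.25)
print(dense, sparse)
```

Dropping the second weight lowers the penalty term but raises the squared error. The algorithm searches for the $\beta$ values that give the best trade-off between the two for the chosen $\alpha$.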

With this in mind we can learn our model from the data:


In [6]:
l1_model = Lasso(alpha=0.25)

# fit the model to the training data
l1_model_fit = l1_model.fit(X_train, Y_train)

# predict Y for the unseen test data
predict = l1_model_fit.predict(X_test)

# calculate R^2 by comparing the predicted values and real values
r2 = r2_score(Y_test, predict)
print(f'R^2 score is {round(r2, 2)}')

R^2 score is 0.76


We can see that $R^2$ is lower than our previous (statistical) model, which achieved 96%. However, we need to recall that in this case we are testing the model against previously unseen data ... a much harder problem. An $R^2$ of 76% would generally be considered quite high for this kind of task - so good news. That is not to say that this is the "optimal" model, and in practice we would experiment further.
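
For reference, the score reported by `r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that calculation follows; the helper name `r_squared` and the toy numbers are made up and are not model output.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values for illustration (not model output)
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.0, 5.0, 8.0, 9.0]
print(round(r_squared(y_true_toy, y_pred_toy), 2))  # 0.9
```

A model that always predicts the mean of the true values scores 0 on this measure, so 0.76 on unseen data means the model explains roughly three quarters of the variation in the test-set $Y$ values.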

You may also notice the conspicuous absence of any hypothesis testing ($p$-values). Typically we would not report or test these in a machine learning approach. We do not have the same underlying (implicit) beliefs of the model as generator of the data, and ultimately no hypotheses to test (in the traditional sense). In essence the model is the hypothesis: our belief that this model will be able to score reasonably well ($R^2$ in this case) on unseen/test data. In machine learning terminology, you will often hear the possible models to be tested described as the 'hypothesis space'. Given this change in focus, $p$-values are simply irrelevant.

However, we may also want to inspect the $\beta$ values of the model:

In [7]:
# print the beta values of the model (co-efficients)
betas_l1 = l1_model_fit.coef_
print("Beta weights/co-efficients - LASSO")
print("-----------------------------------------")
# pair each column label with its beta weight
for col, beta in zip(x_values.columns, betas_l1):
    print(f"{col} : {round(beta, 4)}")

Beta weights/co-efficients - LASSO
-----------------------------------------
0 : -0.0831
1 : 0.0657
2 : -0.0228
3 : 0.0
4 : -0.0
5 : 2.3925
6 : -0.0087
7 : -1.2545
8 : 0.305
9 : -0.0165
10 : -0.7983
11 : 0.011
12 : -0.6034


We may observe two things. Firstly, we can see the $\beta$ weights for features 3 and 4 have been set to zero - effectively deleting these features entirely (any number multiplied by zero is zero). From the class metaphor, the algorithm has decided these two decision makers are not improving the decision quality and their influence increases the variance of the model.

Secondly, we will notice that nearly all of the $\beta$ weights are smaller (closer to zero) than in the statistical approach of the last Notebook. For instance, feature 5 had a $\beta$ weight of 5.93 in the original model, compared with 2.39 here (less than half). Again, this fits with our metaphor - the algorithm has tried to ensure that no voices are too loud in our group discussion. I.e. our model will be less likely to over-respond to excess noise in any particular feature, as the associated $\beta$ weight is smaller.
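
Both effects can be seen in a simplified special case. When the features are uncorrelated and identically scaled, the LASSO weight for each feature is the OLS weight passed through a "soft-threshold": moved towards zero by a fixed amount, and set to exactly zero if it is smaller than that amount. The sketch below assumes that special case, with a threshold of 0.25 and made-up weights. The housing features are correlated, and scikit-learn fits `Lasso` by coordinate descent, so this will not reproduce the numbers above.

```python
def soft_threshold(beta_ols, threshold):
    """Shrink a coefficient towards zero; set it to exactly zero if small."""
    if beta_ols > threshold:
        return beta_ols - threshold
    if beta_ols < -threshold:
        return beta_ols + threshold
    return 0.0

# Made-up OLS weights for illustration (not the housing coefficients)
ols_betas = [5.0, -1.5, 0.2, -0.1]
lasso_betas = [soft_threshold(b, 0.25) for b in ols_betas]
print(lasso_betas)  # [4.75, -1.25, 0.0, 0.0]
```

The large weights survive but are pulled towards zero (the quieter voices), while the small ones are removed altogether (the deleted features).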

So there we have it. Hopefully the comparison of the two approaches gives you some insight into how these two methodologies differ, and how the statistical approach has effectively been adapted to the supervised machine learning problem ... building models that can learn from data. 