<a href="https://colab.research.google.com/github/MJMortensonWarwick/WBS2003/blob/main/1_1_Linear_Regression_(statistics).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression: Traditional statistics approach
We begin the module by looking at the simple linear regression algorithm, but from two contrasting methodological approaches. This first tutorial uses a traditional statistical/scientific approach to linear regression, which we can then contrast (in tutorial two) with a machine learning approach.

We'll begin by importing the libraries/packages and the data:

In [1]:
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd  
import seaborn as sns 

# IPython magic: only works in Jupyter/IPython environments
%matplotlib inline  

import statsmodels.api as sm

df = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv", header=None)

df.head()

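|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |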


As this tutorial focuses on the differences in model building, we will forgo the usual data understanding and cleaning/feature engineering steps and use the data as-is (we will, of course, return to these topics in much more detail later in the module). 

In our data set we have 13 $x$ values (independent variables/features) as columns 0-12, and one $Y$ value (dependent variable/target) as column 13. We'll separate these out ready for modelling:

In [3]:
x_values = df.drop([13], axis = 1)
print(f'X values: \n {x_values.head()}\n')

y_value = df[13]
print(f'Y value: \n {y_value[0:5]}')

X values: 
         0     1     2   3      4      5     6       7   8      9     10  \
0  0.00632  18.0  2.31   0  0.538  6.575  65.2  4.0900   1  296.0  15.3   
1  0.02731   0.0  7.07   0  0.469  6.421  78.9  4.9671   2  242.0  17.8   
2  0.02729   0.0  7.07   0  0.469  7.185  61.1  4.9671   2  242.0  17.8   
3  0.03237   0.0  2.18   0  0.458  6.998  45.8  6.0622   3  222.0  18.7   
4  0.06905   0.0  2.18   0  0.458  7.147  54.2  6.0622   3  222.0  18.7   

       11    12  
0  396.90  4.98  
1  396.90  9.14  
2  392.83  4.03  
3  394.63  2.94  
4  396.90  5.33  

Y value: 
 0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: 13, dtype: float64


As we can see, our $Y$ value is a float (real number) and a sensible candidate for linear regression (assuming the relationship is indeed linear). More specifically, we are going to build a model of the form:

$Y = \alpha + \beta_1x_1 + \beta_2x_2 + \dots + \beta_{13}x_{13} + \epsilon$

Where $Y$ is our dependent variable (target), $\alpha$ is the intercept, the $\beta$ values (1 to 13) are the coefficients of the corresponding $x$ values (1 to 13) and $\epsilon$ is the error.
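
To make the equation concrete, here is a minimal sketch of estimating $\alpha$ and the $\beta$ values by least squares with NumPy. It uses synthetic data with three features rather than the housing data, and the names `alpha_true`, `beta_true` and `design` are illustrative assumptions, not part of the tutorial's code.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 100 rows, 3 features
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))

# "True" parameters used to generate Y = alpha + X @ beta + epsilon
alpha_true = 2.0
beta_true = np.array([1.0, -2.0, 0.5])
y = alpha_true + X @ beta_true + rng.normal(scale=0.1, size=n)

# A column of ones lets the intercept be estimated alongside the betas
design = np.column_stack([np.ones(n), X])

# Least-squares solution: minimises the sum of squared errors
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha_hat, beta_hat = coef[0], coef[1:]

# Residuals are the sample estimate of the error term epsilon
residuals = y - design @ coef

print("Estimated intercept:", alpha_hat)
print("Estimated coefficients:", beta_hat)
```

Because the data were generated from known parameters, the estimates should land close to `alpha_true` and `beta_true`, which is a useful sanity check before moving to real data.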

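We'll next build this model (using _statsmodels_, as we did in the previous module) by fitting it to the data. Note that `sm.OLS` does not add an intercept automatically, so the call below, as written, fits the model without the $\alpha$ term; this is why the summary reports an uncentered $R^2$: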

In [5]:
mod = sm.OLS(y_value, x_values)
res = mod.fit()
print(res.summary())

                                 OLS Regression Results                                
Dep. Variable:                     13   R-squared (uncentered):                   0.959
Model:                            OLS   Adj. R-squared (uncentered):              0.958
Method:                 Least Squares   F-statistic:                              891.3
Date:                Sun, 19 Mar 2023   Prob (F-statistic):                        0.00
Time:                        11:19:19   Log-Likelihood:                         -1523.8
No. Observations:                 506   AIC:                                      3074.
Df Residuals:                     493   BIC:                                      3128.
Df Model:                          13                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

Following our statistical approach, there are several necessary steps in interpreting these results:


1.   _Examine the 'goodness of fit' metrics_: Depending on our preferred method, this could mean examining $R^2$, adjusted $R^2$, AIC, BIC or some combination of these. In this case we will limit ourselves to just the first ($R^2$). At 96% this is clearly high, and strong support for our implicit theory that this model would represent an appropriate data generation process for our observations. Note, however, that the summary reports the uncentered $R^2$, which is not directly comparable to the usual centred $R^2$ and tends to be considerably higher;
2.   _Evaluate the hypothesis tests_: The $p$-value of our overall $F$-statistic is ~0 (the statistic itself is 891.3), so we comfortably reject the null hypothesis at the 5% significance level (95% confidence). Most of our independent variables report $p$-values below 0.05, with the exception of 3, 5 and 7. We would probably want to experimentally remove these to arrive at a model where all $x$ values are significant;
3.   _Evaluate the other metrics and information_: In a 'proper' analysis we would also want to consider measures such as the skewness and kurtosis of the residuals, and we also get a helpful warning that there may be some multicollinearity in the data. Inspection of the correlation matrix reveals this is the case, so in the real world we would have further work to do to finalise this model.

However, for the purposes of this tutorial we have reached our goal! We have fitted a linear regression model (in the statistical style) and performed an initial interpretation of the results. Our next tutorial will do this all again, but this time with a machine learning mindset.

