# Regression models for prediction

In this notebook I will explore both the statistics and the implementation of regression models. I will focus on linear regression, but the same methods should apply to other forms of regression, such as decision trees or neural networks.

Fundamentally, a regression model tries to predict a target random variable $Y$, given a set of $p$ predictor (feature) variables $\pmb{X} = \{X_i\}_{i=1}^p$. This can be extended to multivariate regression if multiple variables have to be predicted, but I will not consider that case in this notebook.

A dataset $\mathcal{D} = \{\pmb{x}_i,y_i\}_{i=1}^{D}$ is obtained from the (true) data-generating distribution $p_{X,Y}^{true}(x,y)$, and we will use it to train and test a model. We therefore divide the dataset into a training set and a test set, $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$, which are used for training and testing the model respectively.

In [3]:
import pandas as pd
dataset_df = pd.read_csv("Real estate.csv", sep=',', index_col=0)
dataset_df.head()

No,X1 transaction date,X2 house age,X3 distance to the nearest MRT station,X4 number of convenience stores,X5 latitude,X6 longitude,Y house price of unit area
1,2012.917,32.0,84.87882,10,24.98298,121.54024,37.9
2,2012.917,19.5,306.5947,9,24.98034,121.53951,42.2
3,2013.583,13.3,561.9845,5,24.98746,121.54391,47.3
4,2013.5,13.3,561.9845,5,24.98746,121.54391,54.8
5,2012.833,5.0,390.5684,5,24.97937,121.54245,43.1


In [4]:
dataset_df.dtypes

X1 transaction date                       float64
X2 house age                              float64
X3 distance to the nearest MRT station    float64
X4 number of convenience stores             int64
X5 latitude                               float64
X6 longitude                              float64
Y house price of unit area                float64
dtype: object

In [6]:
X_df = dataset_df.drop('Y house price of unit area', axis=1)
Y_df = dataset_df['Y house price of unit area']

In [9]:
X_df['X1 transaction date']

No
1      2012.917
2      2012.917
3      2013.583
4      2013.500
5      2012.833
         ...   
410    2013.000
411    2012.667
412    2013.250
413    2013.000
414    2013.500
Name: X1 transaction date, Length: 414, dtype: float64
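
The division into $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$ described earlier can be sketched as a random partition of the row positions. This is a minimal sketch, not necessarily the split used in the rest of the notebook: the helper `split_indices`, the 20% test fraction and the seed are all assumptions chosen for illustration.

```python
import numpy as np

def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx): a random, disjoint partition of range(n_rows).

    Hypothetical helper; the test fraction and seed are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)           # random order of row positions
    n_test = int(round(test_fraction * n_rows))  # size of the test set
    return shuffled[n_test:], shuffled[:n_test]

# 414 is the number of rows reported in the output above.
train_idx, test_idx = split_indices(414, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

With the DataFrames defined above, the two subsets could then be selected positionally, for example `X_df.iloc[train_idx]` and `Y_df.iloc[train_idx]` for training.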

## Linear regression
### Part 1: the MLE estimate

In linear regression, the joint distribution $p_{X,Y}(x,y)$ is not modeled directly; instead we model the conditional distribution $p_{Y|X}(y|\pmb{x})$. This is easier to model than the joint distribution and, in practice, often gives better predictions, since nothing has to be assumed about how $X$ itself is distributed. It is a distribution over $Y$ only, since $X$ is considered given. In normal linear regression, the model is a normal distribution whose mean is a linear function of $\pmb{x}$ and whose variance is constant (homoscedasticity). 
$$ p_{Y|X}(y|\pmb{x}, \pmb{\theta}) = \mathcal{N}(\pmb{w}^T\pmb{x}, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\bigg(\frac{-(y-\pmb{w}^T\pmb{x})^2}{2\sigma^2}\bigg) $$

Once such a model is fitted, it can make predictions $\hat{y} = \pmb{w}^T\pmb{x}$ for new data, which can be tested with the test set. 
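
The model above can be sketched in a few lines of NumPy: one function for the prediction $\hat{y} = \pmb{w}^T\pmb{x}$ and one for the log of the normal density. The function names and the toy numbers are hypothetical, and the column of ones used as an intercept is an assumption that the text does not state.

```python
import numpy as np

def predict(X, w):
    """Mean of p(y | x, w): the linear prediction w^T x for each row of X."""
    return X @ w

def gaussian_log_density(y, X, w, sigma):
    """log N(y | w^T x, sigma^2), evaluated row by row."""
    resid = y - predict(X, w)
    return -0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2)

# Hypothetical toy values, not taken from the real-estate data.
X_toy = np.array([[1.0, 2.0],
                  [1.0, 3.0]])     # first column of ones acts as an intercept (assumption)
w_toy = np.array([0.5, 2.0])
y_toy = np.array([4.5, 6.0])
sigma_toy = 1.0

y_hat = predict(X_toy, w_toy)                                 # point predictions
log_p = gaussian_log_density(y_toy, X_toy, w_toy, sigma_toy)  # log-density of each target
print(y_hat, log_p)
```

The first target coincides with its predicted mean, so it receives the highest possible log-density for this $\sigma$; the second is half a unit off and scores lower.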

One way to find the parameters $\pmb{\theta}=\pmb{w}$ that fit the dataset is to find the parameters that maximize the likelihood of obtaining the training dataset $\mathcal{D}_{train}$. This is the maximum likelihood estimate (MLE), and it is the argmax of the likelihood function $L(\mathcal{D}_{train}, \pmb{w}) = \prod_{i=1}^D \frac{1}{\sqrt{2\pi}\sigma} \exp\bigg(\frac{-(y_i-\pmb{w}^T\pmb{x}_i)^2}{2\sigma^2}\bigg)$. Maximizing the likelihood is equivalent to minimizing the negative log likelihood (NLL), because the logarithm is a monotonically increasing function. Here $\text{NLL} = \frac{D}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^D (y_i-\pmb{w}^T\pmb{x}_i)^2$, and neither the first term nor the factor $\frac{1}{2\sigma^2}$ depends on $\pmb{w}$. Therefore the MLE of the parameters $\pmb{w}$ is given by: 

$$\pmb{w}^{MLE} = \argmin_{\pmb{w}} \text{NLL} = \argmin_{\pmb{w}} \sum_{i=1}^D (y_i-\pmb{w}^T\pmb{x}_i)^2$$
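
The least-squares objective above can be minimized numerically with `np.linalg.lstsq`. The sketch below uses synthetic data rather than the real-estate dataset; the names `fit_mle` and `rss`, the true weights, and the noise level are all assumptions made for illustration, as is the column of ones used as an intercept.

```python
import numpy as np

def fit_mle(X, y):
    """Least-squares weights: the minimizer of sum_i (y_i - w^T x_i)^2."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def rss(X, y, w):
    """Residual sum of squares for weights w."""
    resid = y - X @ w
    return float(resid @ resid)

# Synthetic data generated from the assumed model (hypothetical values).
rng = np.random.default_rng(0)
n, p = 200, 3
X_syn = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column + p features
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y_syn = X_syn @ w_true + rng.normal(scale=0.1, size=n)

w_mle = fit_mle(X_syn, y_syn)

# Moving away from the minimizer should never lower the RSS.
w_perturbed = w_mle + 0.01
print(rss(X_syn, y_syn, w_mle) <= rss(X_syn, y_syn, w_perturbed))
```

Because the noise is small relative to the signal, `w_mle` should land close to `w_true`; on real data there is no `w_true` to compare against, which is why the test set is needed.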

The MLE thus minimizes the Residual Sum of Squares RSS=$\sum_{i=1}^D (y_i-\pmb{w}^T\pmb{x}_i)^2$. This can be compared to the Total Sum of Squares TSS=$\sum_{i=1}^D (y_i-\bar{y})^2$, which is $D$ times the variance of the targets in the dataset, and can be seen as the RSS of a prediction that always outputs the mean $\bar{y}=\frac{1}{D}\sum_{i=1}^Dy_i$ of the targets (therefore not using $X$ at all). The **coefficient of determination**, denoted by $R^2$, measures this comparison: $$ R^2 = 1-\frac{RSS}{TSS}$$ 
If the prediction is perfect, i.e. RSS = 0, $R^2$ will be 1. If the model does no better than predicting the mean, i.e. RSS = TSS, $R^2$ will be 0.
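
As a small sketch of this definition, $R^2$ can be computed directly from the targets and the predictions. The function name `r_squared` and the toy values are hypothetical.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - RSS / TSS."""
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1.0 - rss / tss

# Hypothetical toy targets.
y_obs = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r_squared(y_obs, y_obs)                           # perfect prediction
r2_mean = r_squared(y_obs, np.full_like(y_obs, y_obs.mean()))  # always predict the mean
print(r2_perfect, r2_mean)
```

Note that $R^2$ can become negative on a test set, when the model predicts worse than the mean of the test targets.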

### Part 2: the MAP estimate

The MLE can easily result in overfitting, since it is fitted to the training dataset alone. In a Bayesian setting, however, we can add some 'prior' information that the weights should be close to zero and only depart from zero when there is (strong) evidence to do so. Ridge regression assumes a Gaussian prior for the weights: $p(\pmb{w}) = \mathcal{N}(0, \lambda^{-1}\mathbb{I})$, where $\lambda$ is the 'strength' of our belief. This is called $\ell_2$ regularization. The maximum a posteriori (MAP) estimate is the maximum of the posterior distribution, which equals:
$$ \pmb{w}^{\text{MAP}} = \argmax_{\pmb{w}} p(\pmb{w}| x, y) = \argmax_{\pmb{w}} \frac{p(y| \pmb{w}, x)p(\pmb{w})}{p(y| x)} = \argmax_{\pmb{w}} p(y| \pmb{w}, x)p(\pmb{w})$$

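To optimize this objective, we minimize the negative log posterior, which is the negative log likelihood plus the negative log prior. Absorbing the constant factor $\sigma^2$ into $\lambda$, this gives: 
$$ \pmb{w}^{\text{MAP}} = \argmin_{\pmb{w}} \sum_{i=1}^D (y_i-\pmb{w}^T\pmb{x}_i)^2 + \lambda \|\pmb{w}\|_2^2 $$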
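
The penalized objective above has a standard closed-form minimizer, $\pmb{w} = (X^TX + \lambda\mathbb{I})^{-1}X^T\pmb{y}$, obtained by setting its gradient to zero. This formula is not derived in the text, so the sketch below should be read as an illustration under that assumption; the name `fit_ridge`, the synthetic data and the value $\lambda = 10$ are hypothetical. For simplicity every weight is penalized and no intercept is used.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimizer of sum_i (y_i - w^T x_i)^2 + lam * ||w||^2.

    Solves the linear system (X^T X + lam * I) w = X^T y.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data (hypothetical values, no intercept column).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)

w_ols = fit_ridge(X_syn, y_syn, lam=0.0)     # lam = 0 recovers the MLE
w_ridge = fit_ridge(X_syn, y_syn, lam=10.0)  # lam > 0 shrinks the weights towards zero

print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))
```

Setting `lam=0.0` reduces the problem to ordinary least squares, which makes the shrinking effect of the prior easy to see by comparing the two weight norms.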