# Regression models for prediction

In this notebook I explore both the statistics and the implementation of regression models. I focus on linear regression, but the same methods should carry over to other forms of regression, such as decision trees or neural networks.

Fundamentally, a regression model tries to predict a target random variable $Y$ given a set of $p$ predictor (feature) variables $\pmb{X} = \{X_i\}_{i=1}^p$. This can be extended to multivariate regression if multiple target variables have to be predicted, but I will not consider that case in this notebook.

A dataset $\mathcal{D} = \{\pmb{x}_i,y_i\}_{i=1}^{D}$ is obtained from the (true) data generating distribution $p_{X,Y}^{true}(x,y)$, and we use it to train and test a model. We therefore divide the dataset into a training set $\mathcal{D}_{train}$ and a test set $\mathcal{D}_{test}$, used for fitting and evaluating the model respectively.

In [6]:
import pandas as pd
dataset_df = pd.read_csv('https://raw.githubusercontent.com/probml/probml-data/main/data/prostate/prostate.csv', sep='\t', index_col=0)
dataset_df.head()

     lcavol   lweight  age      lbph  svi       lcp  gleason  pgg45      lpsa train
1 -0.579818  2.769459   50 -1.386294    0 -1.386294        6      0 -0.430783     T
2 -0.994252  3.319626   58 -1.386294    0 -1.386294        6      0 -0.162519     T
3 -0.510826  2.691243   74 -1.386294    0 -1.386294        7     20 -0.162519     T
4 -1.203973  3.282789   58 -1.386294    0 -1.386294        6      0 -0.162519     T
5  0.751416  3.432373   62 -1.386294    0 -1.386294        6      0  0.371564     T


In [13]:
# split in training and test set
dataset_train = dataset_df.loc[dataset_df['train']=='T'].drop(['train'], axis=1)
dataset_test = dataset_df.loc[dataset_df['train']=='F'].drop(['train'], axis=1)
print(dataset_train.head())
print("Train, test dataset size: ", len(dataset_train), len(dataset_test))

     lcavol   lweight  age      lbph  svi       lcp  gleason  pgg45      lpsa
1 -0.579818  2.769459   50 -1.386294    0 -1.386294        6      0 -0.430783
2 -0.994252  3.319626   58 -1.386294    0 -1.386294        6      0 -0.162519
3 -0.510826  2.691243   74 -1.386294    0 -1.386294        7     20 -0.162519
4 -1.203973  3.282789   58 -1.386294    0 -1.386294        6      0 -0.162519
5  0.751416  3.432373   62 -1.386294    0 -1.386294        6      0  0.371564
Train, test dataset size:  67 30


In [18]:
dataset_df.gleason.describe()

count    97.000000
mean      6.752577
std       0.722134
min       6.000000
25%       6.000000
50%       7.000000
75%       7.000000
max       9.000000
Name: gleason, dtype: float64

## Linear regression
### Part 1: the MLE estimate

In linear regression, the joint (generative) distribution $p_{X,Y}(x,y)$ is not modeled directly; instead we model the conditional distribution $p_{Y|X}(y|\pmb{x})$. This is easier than modeling the joint distribution and often achieves better predictions in practice, since the model only has to describe the quantity we want to predict. It is a distribution over $Y$, since $\pmb{x}$ is considered given. In ordinary linear regression, the model is a normal distribution whose mean is a linear function of $\pmb{x}$ and whose variance is constant (homoscedasticity):
$$ p_{Y|X}(y|\pmb{x}, \pmb{\theta}) = \mathcal{N}(\pmb{w}^T\pmb{x}, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\bigg(\frac{-(y-\pmb{w}^T\pmb{x})^2}{2\sigma^2}\bigg) $$

Once such a model is fitted, it can make predictions $\hat{y} = \pmb{w}^T\pmb{x}$ for new data, which can be tested with the test set. 
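
To make the model concrete, here is a minimal sketch that evaluates the point prediction and the conditional density with NumPy. The function names `predict` and `conditional_density`, and the example values for the weights, features and `sigma`, are illustrative assumptions and are not taken from the prostate dataset.

```python
import numpy as np

def predict(w, x):
    """Point prediction y_hat = w^T x, the mean of the conditional Gaussian."""
    return np.dot(w, x)

def conditional_density(y, x, w, sigma):
    """Density of a normal with mean w^T x and variance sigma^2, evaluated at y."""
    mean = predict(w, x)
    return np.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Hypothetical values, chosen only for illustration.
w_example = np.array([0.5, -1.0])
x_example = np.array([2.0, 1.0])

y_hat_example = predict(w_example, x_example)  # 0.5 * 2.0 - 1.0 * 1.0
density_at_mean = conditional_density(y_hat_example, x_example, w_example, sigma=1.0)
density_off_mean = conditional_density(y_hat_example + 2.0, x_example, w_example, sigma=1.0)
```

The density is highest at the predicted mean and falls off symmetrically around it, at a rate set by `sigma`.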

One way to find the parameters $\pmb{\theta}=\pmb{w}$ that fit the dataset is to find the parameters that maximize the likelihood of obtaining the training dataset $\mathcal{D}_{train}$. This is the maximum likelihood estimate (MLE): the argmax of the likelihood function $L(\mathcal{D}_{train}, \pmb{w}) = \prod_{i=1}^D \frac{1}{\sqrt{2\pi}\sigma} \exp\bigg(\frac{-(y_i-\pmb{w}^T\pmb{x}_i)^2}{2\sigma^2}\bigg)$. Maximizing the likelihood is equivalent to minimizing the negative log likelihood (NLL), because the logarithm is a monotonically increasing function. The NLL equals $\frac{D}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^D (y_i-\pmb{w}^T\pmb{x}_i)^2$, and for fixed $\sigma$ only the sum depends on $\pmb{w}$. Therefore the MLE of the parameters $\pmb{w}$ is given by:

$$\pmb{w}^{MLE} = \operatorname*{arg\,min}_{\pmb{w}} \text{NLL}(\pmb{w}) = \operatorname*{arg\,min}_{\pmb{w}} \sum_{i=1}^D (y_i-\pmb{w}^T\pmb{x}_i)^2$$
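
As a minimal sketch of this minimization, the least-squares weights can be computed with `np.linalg.lstsq`. The helper name `fit_mle` and the synthetic data below (including `w_true` and the noise level) are assumptions made for illustration; they are not the notebook's dataset.

```python
import numpy as np

def fit_mle(X, y):
    """Least-squares weights, i.e. the minimizer of sum_i (y_i - w^T x_i)^2."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n_samples = 200
# The first column of ones lets the first weight act as an intercept.
X_syn = np.column_stack([np.ones(n_samples), rng.normal(size=(n_samples, 2))])
w_true = np.array([1.0, 2.0, -3.0])
y_syn = X_syn @ w_true + rng.normal(scale=0.1, size=n_samples)

w_mle = fit_mle(X_syn, y_syn)
print(w_mle)  # should be close to w_true
```

On the notebook's data, `X` could be built from the feature columns of `dataset_train` plus a column of ones, and `y` from the target column.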

The MLE thus minimizes the Residual Sum of Squares RSS=$\sum_{i=1}^D (y_i-\pmb{w}^T\pmb{x}_i)^2$. This can be compared to the Total Sum of Squares TSS=$\sum_{i=1}^D (y_i-\bar{y})^2$, which is $D$ times the variance of the targets and can be seen as the RSS of a baseline that always predicts the mean $\bar{y}=\frac{1}{D}\sum_{i=1}^Dy_i$ of the targets (therefore not using $X$ at all). The **coefficient of determination**, denoted by $R^2$, measures this difference: $$ R^2 = 1-\frac{RSS}{TSS}$$
If the prediction is perfect, i.e. RSS = 0, $R^2$ will be 1. If the model does no better than predicting the mean, i.e. RSS = TSS, $R^2$ will be 0.
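
The definition translates directly into code. The following is a small sketch; the function name `r_squared` and the toy targets are illustrative assumptions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - RSS / TSS."""
    rss = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    tss = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - rss / tss

# Toy targets, chosen only for illustration.
y_toy = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r_squared(y_toy, y_toy)                              # RSS = 0
r2_baseline = r_squared(y_toy, np.full_like(y_toy, y_toy.mean())) # RSS = TSS
```

Note that when $R^2$ is computed on held-out data it can become negative, which means the model predicts worse than the mean of those targets.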

### Part 2: the MAP estimate

The MLE can easily result in overfitting, since the model is fitted only to the training dataset. In a Bayesian setting, however, we can add 'prior' information that the weights should be close to zero and only depart from zero when there is (strong) evidence to do so. Ridge regression assumes a Gaussian prior for the weights: $p(\pmb{w}) = \mathcal{N}(0, \lambda^{-1}\mathbb{I})$, where $\lambda$ is the precision of the prior, i.e. the 'strength' of our belief. This is called $\ell_2$ regularization. The maximum a posteriori (MAP) estimate is the maximum of the posterior distribution, which equals:
$$ \pmb{w}^{\text{MAP}} = \operatorname*{arg\,max}_{\pmb{w}} p(\pmb{w}| x, y) = \operatorname*{arg\,max}_{\pmb{w}} \frac{p(y| \pmb{w}, x)p(\pmb{w})}{p(y| x)} = \operatorname*{arg\,max}_{\pmb{w}} p(y| \pmb{w}, x)p(\pmb{w})$$
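
For the Gaussian likelihood and Gaussian prior above, taking the negative log of $p(y|\pmb{w},x)p(\pmb{w})$ gives the objective $\frac{1}{2\sigma^2}\text{RSS} + \frac{\lambda}{2}\|\pmb{w}\|^2$, whose minimizer has a closed form. The sketch below assumes that closed form; the function name `fit_map_ridge`, the default `sigma2=1.0` and the synthetic data are illustrative assumptions. For simplicity it penalizes every weight, including any intercept.

```python
import numpy as np

def fit_map_ridge(X, y, lam, sigma2=1.0):
    """MAP weights for a Gaussian likelihood (variance sigma2) and prior N(0, I / lam).

    Minimizes RSS / (2 * sigma2) + (lam / 2) * ||w||^2, which is solved by
    (X^T X + lam * sigma2 * I) w = X^T y.
    """
    n_features = X.shape[1]
    A = X.T @ X + lam * sigma2 * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

w_unregularized = fit_map_ridge(X_demo, y_demo, lam=0.0)  # lam = 0 reduces to the MLE
w_shrunk = fit_map_ridge(X_demo, y_demo, lam=100.0)       # a strong prior pulls weights toward zero
```

Comparing the norms of `w_unregularized` and `w_shrunk` shows the shrinkage effect of the prior: the larger $\lambda$, the closer the weights stay to zero.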