# Accuracy Metrics for Regression

When an error metric such as the MSE is computed on the training set of a regression model, it usually understates the true error: the model was fitted to that data, so the training MSE tends to be lower than the test MSE. Here we look at metrics that compensate for this optimistic bias by penalising model complexity.

## Mallows's $C_p$

Mallows's $C_p$ adjusts the RSS of a model with $d$ predictors fitted to $n$ observations using $\hat{\sigma}^2$, an estimate of the variance of the error term:

$$C_p=\frac 1 n (\text{RSS}+2d\hat{\sigma}^2)$$

Let's implement it below:

In [1]:
import numpy as np
np.random.seed(42)

In [2]:
#generate the data
X = 2 * np.random.rand(100,1)
y = 4 + 3 * X + np.random.randn(100,1)

In [9]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression() #apply model and make predictions
lin_reg.fit(X, y)
y_pred = lin_reg.predict(X)

residuals = y - y_pred #observed minus fitted values
RSS = np.sum(residuals**2) #residual sum of squares
mse = 1/len(y)*RSS #MSE formula
print("MSE: " + str(mse))

MSE: 0.8065845639670532


In [14]:
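d = X.shape[1] #number of predictors
sigma2_hat = np.var(residuals) #estimate of the error variance from the residuals
MSE_C_p = 1/len(y)*(RSS+2*d*sigma2_hat)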
MSE_C_p

0.8227162552463942
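
To see how the penalty term changes a model comparison, here is a minimal sketch of the formula above as a reusable function. The helper name `mallows_cp` and all the numbers passed to it are hypothetical and chosen only for illustration; they do not come from the data in this notebook.

```python
def mallows_cp(rss, n, d, sigma2_hat):
    """Mallows's C_p as defined above: (RSS + 2 * d * sigma2_hat) / n."""
    return (rss + 2 * d * sigma2_hat) / n


# Hypothetical values: the larger model has a slightly lower RSS,
# but uses three predictors instead of one.
n = 100
sigma2_hat = 1.0
cp_small = mallows_cp(rss=90.0, n=n, d=1, sigma2_hat=sigma2_hat)
cp_large = mallows_cp(rss=89.0, n=n, d=3, sigma2_hat=sigma2_hat)

print("C_p, 1 predictor: ", cp_small)
print("C_p, 3 predictors:", cp_large)
```

With these made-up inputs the smaller model has the lower $C_p$ even though its RSS is higher, because the small drop in RSS does not pay for the two extra predictors.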

## Akaike information criterion (AIC)

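AIC estimates the information lost when a model is used to represent the data. If the model has $k$ estimated parameters and $\hat{L}$ is the maximum value of the log-likelihood function, then the AIC is:

$$\text{AIC}=2k-2\hat{L}$$

A smaller AIC indicates less estimated information loss, so among candidate models the one with the lowest AIC is preferred; the $2k$ term penalises complexity. For least squares regression with Gaussian errors the maximised log-likelihood is $-\frac{n}{2}\ln (\text{RSS}/n)+C$, where $C$ is a constant that is the same for every model fitted to the same data, and $k$ is the number of features plus 1 for the intercept (2 in our case). Because $C$ does not affect the comparison between models, it is dropped in the computation below. For OLS it can even be shown that $C_p$ is proportional to AIC.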
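
The expression for the maximised log-likelihood can be recovered with a short derivation. This is a sketch that assumes independent Gaussian errors with variance $\sigma^2$; the symbols $\ell$, $\beta$ and $\hat{\sigma}^2_{\text{ML}}$ are introduced only for this sketch. The log-likelihood of the coefficients $\beta$ and the variance is:

$$\ell(\beta,\sigma^2)=-\frac{n}{2}\ln(2\pi\sigma^2)-\frac{\text{RSS}(\beta)}{2\sigma^2}$$

Maximising over $\beta$ gives the least squares fit, and maximising over $\sigma^2$ gives $\hat{\sigma}^2_{\text{ML}}=\text{RSS}/n$. Substituting both back in:

$$\hat{L}=-\frac{n}{2}\ln(\text{RSS}/n)-\frac{n}{2}\left(1+\ln 2\pi\right)$$

so $C=-\frac{n}{2}(1+\ln 2\pi)$ depends only on $n$, and up to an additive constant $\text{AIC}=2k+n\ln(\text{RSS}/n)$.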



In [31]:
#AIC = 2k - 2*L_hat with k = 2 and L_hat = -(n/2)*ln(RSS/n), constant C dropped
2*2+2*len(y)*np.log(RSS/len(y))/2

-17.49465338856038

## Bayesian information criterion (BIC)

BIC is very similar to AIC, but its penalty on the number of parameters grows with the sample size $n$:

$$\text{BIC}=k\ln n-2\hat{L}$$
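
The two criteria differ only in their penalty terms, which the following sketch makes explicit for least squares. The helper names `aic_ls` and `bic_ls` and the RSS value are hypothetical, and the model-independent constant $C$ is dropped from both, which is an assumption that only holds when comparing models fitted to the same data.

```python
import math


def aic_ls(rss, n, k):
    """AIC for least squares up to an additive constant: 2k + n*ln(RSS/n)."""
    return 2 * k + n * math.log(rss / n)


def bic_ls(rss, n, k):
    """BIC for least squares up to an additive constant: k*ln(n) + n*ln(RSS/n)."""
    return k * math.log(n) + n * math.log(rss / n)


n, k = 100, 2
rss = 80.0  # hypothetical residual sum of squares

# The likelihood terms cancel, leaving only the difference in penalties.
diff = bic_ls(rss, n, k) - aic_ls(rss, n, k)
print("BIC - AIC:", diff)
print("k*(ln n - 2):", k * (math.log(n) - 2))
```

The difference is $k(\ln n-2)$, so BIC penalises each parameter more heavily than AIC whenever $\ln n>2$, that is for $n\geq 8$.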

In [32]:
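#BIC = k*ln(n) - 2*L_hat with k = 2 and L_hat = -(n/2)*ln(RSS/n), constant C dropped
bic = 2*np.log(len(y)) + len(y)*np.log(RSS/len(y))
round(bic, 4)

-12.2843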

## Adjusted $R^2$

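Recall that $R^2$ is defined as $1-\text{RSS}/\text{TSS}$, where TSS is the total sum of squares $\sum_i (y_i-\bar{y})^2$. Because $R^2$ never decreases when predictors are added, the adjusted version divides each sum of squares by its degrees of freedom, which depend on $n$ and $d$: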

$$\text{Adjusted }R^2=1-\frac{\text{RSS}/(n-d-1)}{\text{TSS}/(n-1)}$$

Unlike the previous metrics, adjusted $R^2$ should be maximised rather than minimised.
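
As a small worked example of the formula, here is a sketch of a standalone helper. The function name `adjusted_r2` and the five data points are hypothetical and serve only to make the arithmetic easy to check by hand.

```python
import numpy as np


def adjusted_r2(y_true, y_pred, d):
    """Adjusted R^2 = 1 - (RSS/(n-d-1)) / (TSS/(n-1)) for d predictors."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    n = y_true.size
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares about the mean
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))


# Hypothetical observations and fitted values: RSS = 0.1 and TSS = 10.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_fit = [1.1, 1.9, 3.2, 3.8, 5.0]

r2_adj = adjusted_r2(y_true, y_fit, d=1)
print("Adjusted R^2:", r2_adj)
```

For these values the plain $R^2$ is $1-0.1/10=0.99$, while the adjusted value is slightly lower because the RSS is divided by $n-d-1=3$ and the TSS by $n-1=4$.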

In [34]:
TSS = np.sum((y - y.mean())**2) #total sum of squares about the mean of y
adjusted_R_squared = 1 - (RSS/(len(y)-1-1))/(TSS/(len(y)-1))
adjusted_R_squared