# Regression Error Estimation Performance Metrics

When a traditional performance metric (say, RMSE) is computed on the training set of a regression model, it usually understates the true error: the model was fitted to those very points, so the training RMSE is typically lower than the test RMSE. To compensate for this systematic optimism caused by overfitting, the training error can be adjusted with a penalty that grows with model complexity.
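The gap between training and test error can be illustrated with a minimal sketch. The synthetic data, the 50/50 split, and the names `x_demo`, `y_demo`, and `rmse` are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): linear signal plus unit-variance noise
x_demo = rng.uniform(0, 2, size=40)
y_demo = 4 + 3 * x_demo + rng.normal(size=40)

# Hold out half of the points as a test set
x_train, y_train = x_demo[:20], y_demo[:20]
x_test, y_test = x_demo[20:], y_demo[20:]

def rmse(y_true, y_hat):
    """Root mean squared error between two 1-D arrays."""
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

results = {}
for degree in (1, 2, 5):
    # Fit a polynomial of the given degree on the training half only
    coeffs = np.polyfit(x_train, y_train, degree)
    train_rmse = rmse(y_train, np.polyval(coeffs, x_train))
    test_rmse = rmse(y_test, np.polyval(coeffs, x_test))
    results[degree] = (train_rmse, test_rmse)
    print(f"degree {degree}: train RMSE {train_rmse:.3f}, test RMSE {test_rmse:.3f}")
```

The training RMSE can only go down as the degree increases, because each polynomial contains the lower-degree ones as special cases. The test RMSE carries no such guarantee, which is why the training error alone is a poor guide for choosing between models.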

## Mallows's $C_p$

Mallows's $C_p$ adjusts the training RSS (recall that MSE = RSS$/n$ and RMSE = $\sqrt{\text{MSE}}$) of a model with $d$ predictors using an estimate $\hat{\sigma}^2$ of the variance of the error term:

$$C_p=\frac 1 n (\text{RSS}+2d\hat{\sigma}^2)$$

Let's implement it below:

In [2]:
import numpy as np
np.random.seed(42)

In [3]:
#generate the data
X = 2 * np.random.rand(100,1)
y = 4 + 3 * X + np.random.randn(100,1)

In [4]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
y_pred = lin_reg.predict(X)

residuals = y_pred - y
RSS = np.sum(residuals**2)
MSE = 1/len(y)*RSS #mean squared error; the RMSE would be np.sqrt(MSE)
print("MSE: " + str(MSE))

MSE: 0.8065845639670532


In [5]:
#sigma^2 is estimated here by the variance of this model's own residuals
MSE_C_p = 1/len(y)*(RSS+2*X.shape[1]*np.var(residuals))
MSE_C_p

0.8227162552463942
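$C_p$ is mainly useful for comparing candidate models. A common convention is to estimate $\hat{\sigma}^2$ once, from the largest model under consideration, and reuse it for every candidate. The sketch below follows that convention; the polynomial candidates, the synthetic data, and the helper names `mallows_cp` and `poly_rss` are assumptions for illustration.

```python
import numpy as np

def mallows_cp(rss, n, d, sigma2):
    """C_p = (RSS + 2 * d * sigma2) / n for a model with d predictors."""
    return (rss + 2 * d * sigma2) / n

rng = np.random.default_rng(0)
n_demo = 100
x_cp = rng.uniform(0, 2, size=n_demo)
y_cp = 4 + 3 * x_cp + rng.normal(size=n_demo)

def poly_rss(degree):
    """Training RSS of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x_cp, y_cp, degree)
    return np.sum((y_cp - np.polyval(coeffs, x_cp)) ** 2)

max_degree = 5
# Error-variance estimate from the largest candidate: RSS / (n - d - 1)
sigma2_full = poly_rss(max_degree) / (n_demo - max_degree - 1)

rows = []
for d in range(1, max_degree + 1):
    rss_d = poly_rss(d)
    rows.append((d, rss_d / n_demo, mallows_cp(rss_d, n_demo, d, sigma2_full)))
    print(f"d={d}: training MSE {rows[-1][1]:.4f}, C_p {rows[-1][2]:.4f}")
```

The training MSE keeps shrinking as predictors are added, while $C_p$ adds $2d\hat{\sigma}^2/n$ on top, so the model with the lowest $C_p$ need not be the largest one.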

## Akaike information criterion (AIC)

AIC estimates the information lost when a model is used to represent the process that generated the data. If the model has $k$ estimated parameters and $\hat{L}$ is the maximum value of its log-likelihood function, then the AIC is:

$$\text{AIC}=2k-2\hat{L}$$

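A smaller AIC indicates less estimated information loss, so among candidate models the one with the lowest AIC is preferred; the $2k$ term penalises model complexity. For ordinary least squares regression with Gaussian errors it can be shown that, up to constants that do not depend on the model, the AIC is proportional to $C_p$ and can be written as:

$$\text{AIC}=\frac{1}{n\hat \sigma^2}(\text{RSS}+2d\hat \sigma^2)$$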
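
One way to see where this expression comes from is sketched here, under two assumptions made only for this derivation: the errors are Gaussian with a known variance equal to $\hat{\sigma}^2$, and $k$ is taken to be the number of predictors $d$. The maximised log-likelihood is then

$$\hat{L}=-\frac{n}{2}\ln(2\pi\hat{\sigma}^2)-\frac{\text{RSS}}{2\hat{\sigma}^2}$$

so that

$$\text{AIC}=2d-2\hat{L}=\frac{\text{RSS}}{\hat{\sigma}^2}+2d+n\ln(2\pi\hat{\sigma}^2)$$

The last term is identical for every model fitted to the same data with the same $\hat{\sigma}^2$, so it can be dropped. Dividing what remains by $n$ gives

$$\frac{1}{n}\left(\frac{\text{RSS}}{\hat{\sigma}^2}+2d\right)=\frac{1}{n\hat{\sigma}^2}(\text{RSS}+2d\hat{\sigma}^2)=\frac{C_p}{\hat{\sigma}^2}$$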

In [9]:
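#with sigma^2 estimated from this model's own residuals, RSS/(n*sigma^2) is 1, so this reduces to 1 + 2d/n
AIC = 1/(len(y)*np.var(residuals))*(RSS+2*X.shape[1]*np.var(residuals))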
AIC

1.0199999999999998
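The general definition $2k-2\hat{L}$ can also be evaluated directly. The sketch below assumes Gaussian errors with the variance set to its maximum-likelihood value RSS$/n$, and counts $k$ as the number of fitted coefficients plus one for the variance. Both choices, the synthetic data, and the helper names are assumptions for illustration.

```python
import numpy as np

def gaussian_max_log_likelihood(resid):
    """Maximised Gaussian log-likelihood with the variance set to its MLE, RSS / n."""
    n = resid.size
    sigma2_mle = np.mean(resid ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2_mle) + 1)

def aic(max_log_likelihood, k):
    """AIC = 2k - 2 * (maximised log-likelihood)."""
    return 2 * k - 2 * max_log_likelihood

rng = np.random.default_rng(0)
x_aic = rng.uniform(0, 2, size=100)
y_aic = 4 + 3 * x_aic + rng.normal(size=100)

aic_by_degree = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(x_aic, y_aic, degree)
    resid = y_aic - np.polyval(coeffs, x_aic)
    k = degree + 2  # assumed count: degree + 1 coefficients plus the noise variance
    aic_by_degree[degree] = aic(gaussian_max_log_likelihood(resid), k)
    print(f"degree {degree}: AIC {aic_by_degree[degree]:.2f}")
```

Values computed this way are on a different scale from the rescaled formula above, so only differences between models fitted to the same data are meaningful.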

## Bayesian information criterion (BIC)

BIC is very similar to AIC but defines information loss as:

$$\text{BIC}=k\ln n-2\hat{L}$$

For ordinary least squares this gives:

$$\text{BIC}=\frac{1}{n\hat \sigma^2}(\text{RSS}+\ln(n)d\hat \sigma^2)$$

As we can see, the $2$ is replaced by $\ln(n)$. Since $\ln(n)>2$ whenever $n>7$, BIC places a heavier penalty than AIC on models with many parameters.

In [8]:
BIC = 1/(len(y)*np.var(residuals))*(RSS+np.log(len(y))*X.shape[1]*np.var(residuals))
BIC

1.046051701859881
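To make the difference between the two penalties concrete, the short sketch below tabulates the per-model penalty terms $2d$ and $\ln(n)\,d$ for a few sample sizes. The function names `aic_penalty` and `bic_penalty` are hypothetical helpers introduced for this illustration.

```python
import math

def aic_penalty(d):
    """Complexity term of AIC: 2 per parameter."""
    return 2 * d

def bic_penalty(d, n):
    """Complexity term of BIC: ln(n) per parameter."""
    return math.log(n) * d

for n_obs in (5, 7, 8, 100, 10_000):
    print(f"n={n_obs}: AIC penalty {aic_penalty(1)}, BIC penalty {bic_penalty(1, n_obs):.3f}")
```

The crossover sits between $n=7$ and $n=8$ because $e^2\approx 7.39$; for any realistic sample size BIC therefore favours smaller models than AIC does.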

## Adjusted $R^2$

We know that $R^2$ is defined as $1-\text{RSS}/\text{TSS}$, where TSS is the total sum of squares. To make it account for the sample size $n$ and the number of predictors $d$, both sums of squares are divided by their degrees of freedom:

$$\text{Adjusted }R^2=1-\frac{\text{RSS}/(n-d-1)}{\text{TSS}/(n-1)}$$

Unlike the previous metrics, the adjusted $R^2$ should be maximised rather than minimised. Unlike the traditional $R^2$, which never decreases when a predictor is added, the adjusted $R^2$ penalises unnecessary variables through $d$:

In [10]:
TSS = np.sum((y - y.mean())**2) #total sum of squares around the mean of y
adjusted_R_squared = 1 - (RSS/(len(y)-X.shape[1]-1))/(TSS/(len(y)-1))
adjusted_R_squared
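
The effect of the penalty can be checked by adding predictors that are unrelated to the response. The sketch below is a minimal illustration; the synthetic data, the five pure-noise columns, and the helper `r2_scores` are assumptions, and the fit uses `np.linalg.lstsq` instead of scikit-learn so the block stands on its own.

```python
import numpy as np

def r2_scores(features, target):
    """Return (R^2, adjusted R^2) for an OLS fit with an intercept."""
    n, d = features.shape
    design = np.column_stack([np.ones(n), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    rss = np.sum((target - design @ beta) ** 2)
    tss = np.sum((target - target.mean()) ** 2)
    r2 = 1 - rss / tss
    adj = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return r2, adj

rng = np.random.default_rng(0)
n_r2 = 100
x_r2 = rng.uniform(0, 2, size=(n_r2, 1))
y_r2 = 4 + 3 * x_r2[:, 0] + rng.normal(size=n_r2)
noise_cols = rng.normal(size=(n_r2, 5))  # predictors unrelated to the response

r2_small, adj_small = r2_scores(x_r2, y_r2)
r2_big, adj_big = r2_scores(np.hstack([x_r2, noise_cols]), y_r2)

print(f"1 predictor:  R^2 {r2_small:.4f}, adjusted R^2 {adj_small:.4f}")
print(f"6 predictors: R^2 {r2_big:.4f}, adjusted R^2 {adj_big:.4f}")
```

The plain $R^2$ of the larger model is guaranteed to be at least as high as that of the smaller one. The adjusted $R^2$ is always at most the plain $R^2$, and it rises only if the extra predictors reduce the RSS by more than the lost degrees of freedom cost.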