# Model Selection

## Selection Criteria: MLR

In [None]:
## removing everything from memory
rm(list=ls())
## turning all warnings off
options(warn=-1)

## for easy data manipulation and visualization
if (!require(tidyverse)) install.packages('tidyverse')
library(tidyverse)

## provides helper functions for computing regression model performance metrics
if (!require(modelr)) install.packages('modelr')
library(modelr)

## creates easily a tidy data frame containing the model statistical metrics
if (!require(broom)) install.packages('broom')
library(broom)

We first list selection criteria for the linear regression model $y_{i}=x_{i}^{\prime} \boldsymbol{\beta}+e_{i}$ with $\sigma^{2}=E\left(e_{i}^{2}\right)$ and a $(k+1)\times 1$ coefficient vector $\boldsymbol{\beta}$. Let $\widehat{\boldsymbol{\beta}}$ be the OLS estimator, $\widehat{e}_{i}$ the OLS residual, and $\widehat{\sigma}^{2}=n^{-1} \sum_{i=1}^{n} \widehat{e}_{i}^{2}$ be the variance estimator. The number of estimated parameters ( $\boldsymbol{\beta}$ and $\sigma^{2}$ ) is $K=k+2$.
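
As a minimal sketch of these definitions, the following R code computes $\widehat{\sigma}^{2}$ and $K$ from a fitted `lm` object. The built-in `mtcars` data and the formula `mpg ~ wt + hp` are stand-ins chosen only for illustration; they are not part of the housing analysis below.

```r
## illustrative fit on a built-in data set (an assumption, not the housing data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)                      # sample size
sigma2.hat <- mean(resid(fit)^2)    # variance estimator: RSS / n
K <- length(coef(fit)) + 1          # the (k+1) coefficients plus sigma^2

## the usual residual variance divides RSS by n - (k+1) instead of n,
## so it is slightly larger than sigma2.hat
s2 <- summary(fit)$sigma^2

c(n = n, K = K, sigma2.hat = sigma2.hat, s2 = s2)
```

The distinction between `sigma2.hat` and `s2` matters because the criteria listed in this section are written in terms of $\widehat{\sigma}^{2}$.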

In [None]:
## installing the 'wooldridge' package if not previously installed
if (!require(wooldridge)) install.packages('wooldridge')
library(wooldridge)

data(hprice3)

## Obs:   321

##  1. year                     1978, 1981
##  2. age                      age of house
##  3. agesq                    age^2
##  4. nbh                      neighborhood, 1 to 6
##  5. cbd                      dist. to central bus. dstrct, feet
##  6. inst                     dist. to interstate, feet
##  7. linst                    log(inst)
##  8. price                    selling price
##  9. rooms                    # rooms in house
## 10. area                     square footage of house
## 11. land                     square footage lot
## 12. baths                    # bathrooms
## 13. dist                     dist. from house to incin., feet
## 14. ldist                    log(dist)
## 15. lprice                   log(price)
## 16. y81                      =1 if year = 1981
## 17. larea                    log(area)
## 18. lland                    log(land)
## 19. linstsq                  linst^2

hprice3.copy <- hprice3
hprice3.copy$lcbd <- log(hprice3.copy$cbd)
hprice3.copy$y81 <- as.factor(hprice3.copy$y81)

## provides dummyVars() for creating dummy (indicator) variables
if (!require(caret)) install.packages('caret')
library(caret)

## creating dummy variable for y81==1 and dropping dummy y81==0
dmy <- dummyVars(" ~ y81", data = hprice3.copy, fullRank = TRUE)
datos <- subset(hprice3.copy, select=c("lprice","lland","larea","lcbd","nbh","rooms",
                                       "linst","linstsq","ldist","baths","age","agesq"))
datos <- cbind(datos,data.frame(predict(dmy, newdata = hprice3.copy)))
head(datos)

💻 The following code estimates two linear regression models: ```model1``` uses all the predictors, while ```model2``` excludes ```linstsq``` and ```agesq```.

In [None]:
model1 <- lm(lprice ~., data = datos)
model2 <- lm(lprice ~. -linstsq-agesq, data = datos)

***
**_Adjusted $\bar{R}^2$_**
$$
\bar{R}^{2}=1-\left(1-R^{2}\right) \frac{n-1}{n-k-1},
$$
where $R^2$ is the standard coefficient of determination and $n-k-1$ is the residual degrees of freedom (sample size minus the $k+1$ estimated coefficients).

**_Bayesian Information Criterion_**
$$
\mathrm{BIC}=n+n \log \left(2 \pi \widehat{\sigma}^{2}\right)+K \log (n).
$$
**_Akaike Information Criterion_**
$$
\mathrm{AIC}=n+n \log \left(2 \pi \widehat{\sigma}^{2}\right)+2 K.
$$
***
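
To see how the $\mathrm{AIC}$ and $\mathrm{BIC}$ formulas relate to R's built-in functions, the sketch below evaluates them by hand and compares the results with `AIC()` and `BIC()`. It uses the built-in `mtcars` data as a stand-in (an assumption made only to keep the example self-contained) and takes $K$ as the number of coefficients plus one for $\sigma^{2}$.

```r
## illustrative fit on a built-in data set (an assumption, not the housing data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)
K <- length(coef(fit)) + 1          # coefficients plus sigma^2
sigma2.hat <- mean(resid(fit)^2)    # RSS / n

## criteria evaluated directly from the formulas
aic.manual <- n + n*log(2*pi*sigma2.hat) + 2*K
bic.manual <- n + n*log(2*pi*sigma2.hat) + K*log(n)

## comparison with the built-in functions
all.equal(aic.manual, AIC(fit))
all.equal(bic.manual, BIC(fit))
```

Both comparisons should agree up to numerical tolerance, since `AIC()` and `BIC()` for `lm` objects are based on the Gaussian log-likelihood evaluated at $\widehat{\sigma}^{2}$.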

💻 Here, we use the function ```glance()``` to simply compare the overall quality of our two models:

In [None]:
## metrics for model 1
glance(model1) %>%
  dplyr::select(adj.r.squared, BIC, AIC)

## metrics for model 2
glance(model2) %>%
  dplyr::select(adj.r.squared, BIC, AIC)

💻 The first model has a higher $\bar{R}^2$ and lower values of $\mathrm{BIC}$ and $\mathrm{AIC}$, so we would choose ```model1``` over ```model2```.

***
**_Mallows' $C_p$_**
$$
C_{p}=n \widehat{\sigma}^{2}+2 K \widetilde{\sigma}^{2},
$$
where $\widetilde{\sigma}^{2}$ is a preliminary estimator of $\sigma^{2}$ (typically based on fitting a large model, i.e., the one containing all the predictors).
***

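In [None]:
## recovering sigma-hat = sqrt(RSS/n) for models 1 and 2; the sigma reported by
## glance() is the residual standard error sqrt(RSS/(n-k-1)), not the estimator above
sigma1 <- sqrt(mean(resid(model1)^2))
sigma2 <- sqrt(mean(resid(model2)^2))

## preliminary estimator sigma-tilde^2 taken from the large model (model1)
sigma.tilde.sq <- summary(model1)$sigma^2

## calculating Mallows' Cp for models 1 and 2, with K = rank + 1
## (the k+1 coefficients plus sigma^2)
Cp1 <- nobs(model1)*(sigma1^2) + 2*(model1$rank + 1)*sigma.tilde.sq
Cp2 <- nobs(model2)*(sigma2^2) + 2*(model2$rank + 1)*sigma.tilde.sq

cbind(Cp1,Cp2)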

💻 Again, ```model1``` has a smaller $C_p$ than ```model2```, so ```model1``` is preferred.

***
**_Shibata_**
$$
\text{Shibata}=\widehat{\sigma}^{2}\left(1+\frac{2 K}{n}\right).
$$

**_Final Prediction Error_**
$$
\mathrm{FPE}=\widehat{\sigma}^{2}\left(\frac{1+K / n}{1-K / n}\right).
$$

**_Generalized Cross-Validation_**
$$
\mathrm{GCV}=\frac{n \widehat{\sigma}^{2}}{(n-K)^{2}}.
$$
***

In [None]:
## sigma-hat^2 = RSS/n, K = rank + 1 (coefficients plus sigma^2), and n for each model
s2.hat1 <- mean(resid(model1)^2); K1 <- model1$rank + 1; n1 <- nobs(model1)
s2.hat2 <- mean(resid(model2)^2); K2 <- model2$rank + 1; n2 <- nobs(model2)

## manually calculating Shibata, FPE, and GCV for model1
shibata1 <- s2.hat1*(1 + 2*K1/n1)
FPE1 <- s2.hat1*(1 + K1/n1)/(1 - K1/n1)
GCV1 <- n1*s2.hat1/(n1 - K1)^2

## manually calculating Shibata, FPE, and GCV for model2
shibata2 <- s2.hat2*(1 + 2*K2/n2)
FPE2 <- s2.hat2*(1 + K2/n2)/(1 - K2/n2)
GCV2 <- n2*s2.hat2/(n2 - K2)^2

data.frame(shibata=c(shibata1,shibata2),
           FPE=c(FPE1,FPE2),
           GCV=c(GCV1,GCV2),
           row.names=c("model1","model2"))

💻 Again, ```model1``` has smaller $\mathrm{Shibata}$, $\mathrm{FPE}$, and $\mathrm{GCV}$ values than ```model2```, so all three criteria prefer ```model1```. This is in line with the $\mathrm{AIC}$ and $\mathrm{BIC}$ comparison above, since these criteria differ only in how heavily they penalize the two extra parameters.

***
**_Cross-Validation_**
$$
\mathrm{CV}=\frac{1}{n}\sum_{i=1}^{n} \widetilde{e}_{i}^{2},
$$
where $\widetilde{e}_{i}$ are the least squares leave-one-out prediction errors.

<ins>Prediction errors</ins>: We define the leave-one-out estimator as the one obtained by applying the estimation formula to the sample with the $i$th observation omitted, i.e.,

$$
\widehat{\boldsymbol{\beta}}_{(-i)}=\widehat{\boldsymbol{\beta}}-\frac{1}{\left(1-h_{i i}\right)}\left(\boldsymbol{X}^{\prime} \boldsymbol{X}\right)^{-1} \boldsymbol{x}_{i} \widehat{e}_{i},
$$

where $\widehat{e}_{i}$ are the least squares residuals and $h_{ii}$ are the [leverage](https://en.wikipedia.org/wiki/Leverage_(statistics)) values. We also define the leave-one-out residual or prediction error as that obtained using the leave-one-out regression estimator, thus

$$
\tilde{e}_{i}=y_{i}-x_{i}^{\prime} \widehat{\boldsymbol{\beta}}_{(-i)}=\left(1-h_{i i}\right)^{-1} \widehat{e}_{i}.
$$
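
The second equality can be checked in one line. Substituting the leave-one-out estimator and using the definition of the leverage, $h_{i i}=x_{i}^{\prime}\left(\boldsymbol{X}^{\prime} \boldsymbol{X}\right)^{-1} x_{i}$, gives

$$
\tilde{e}_{i}=y_{i}-x_{i}^{\prime} \widehat{\boldsymbol{\beta}}+\frac{1}{1-h_{i i}} x_{i}^{\prime}\left(\boldsymbol{X}^{\prime} \boldsymbol{X}\right)^{-1} x_{i} \widehat{e}_{i}=\widehat{e}_{i}+\frac{h_{i i}}{1-h_{i i}} \widehat{e}_{i}=\frac{\widehat{e}_{i}}{1-h_{i i}}.
$$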

We define the out-of-sample mean squared error, which is exactly the $\mathrm{CV}$ criterion above, as
$$
\tilde{\sigma}^{2}=\frac{1}{n} \sum_{i=1}^{n} \widetilde{e}_{i}^{2}=\frac{1}{n} \sum_{i=1}^{n}\left(1-h_{i i}\right)^{-2} \widehat{e}_{i}^{2}.
$$
***
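
The leverage shortcut means that no model has to be refitted $n$ times. The sketch below checks this numerically by comparing the shortcut against a brute-force loop that actually drops one observation at a time. It uses the built-in `mtcars` data as a stand-in (an assumption made to keep the example small and self-contained); the names `loo.brute` and `loo.short` are illustrative.

```r
## illustrative fit on a built-in data set (an assumption, not the housing data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

## brute force: refit without observation i, then predict observation i
loo.brute <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit.i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit.i, newdata = mtcars[i, ])
})

## shortcut: rescale the full-sample residuals by the leverage values
loo.short <- resid(fit)/(1 - hatvalues(fit))

## the two sets of prediction errors should coincide
all.equal(unname(loo.brute), unname(loo.short))

## the CV criterion is the mean of the squared prediction errors
mean(loo.short^2)
```

Because every leverage value lies strictly between 0 and 1 here, each prediction error is larger in absolute value than the corresponding residual, so the criterion is never smaller than the in-sample $\widehat{\sigma}^{2}$.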

In [None]:
## calculating manually the CV criteria for models 1 and 2
CV1 <- mean((resid(model1)/(1 - hatvalues(model1)))^2) 
CV2 <- mean((resid(model2)/(1 - hatvalues(model2)))^2)
data.frame(CV=unname(rbind(CV1,CV2)))

💻 Again, ```model1``` has a smaller $\mathrm{CV}$ than ```model2```, so it is also preferred in terms of leave-one-out prediction error.
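
The criteria of this section can also be collected in a single table. The helper below is a sketch that follows the formulas as written above, with $\widehat{\sigma}^{2}=\mathrm{RSS}/n$ and $K$ equal to the number of coefficients plus one. The function name `selection_criteria`, its argument `sigma2.tilde`, and the `mtcars` models used to demonstrate it are hypothetical choices made for illustration.

```r
## hypothetical helper: all criteria of this section for one fitted lm object;
## sigma2.tilde is the preliminary variance estimate used in Mallows' Cp
selection_criteria <- function(fit, sigma2.tilde) {
  n  <- nobs(fit)
  K  <- fit$rank + 1                 # coefficients plus sigma^2
  s2 <- mean(resid(fit)^2)           # sigma-hat^2 = RSS / n
  c(adj.r.squared = summary(fit)$adj.r.squared,
    BIC     = n + n*log(2*pi*s2) + K*log(n),
    AIC     = n + n*log(2*pi*s2) + 2*K,
    Cp      = n*s2 + 2*K*sigma2.tilde,
    Shibata = s2*(1 + 2*K/n),
    FPE     = s2*(1 + K/n)/(1 - K/n),
    GCV     = n*s2/(n - K)^2,
    CV      = mean((resid(fit)/(1 - hatvalues(fit)))^2))
}

## illustrative models on a built-in data set (an assumption, not the housing data)
big   <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
small <- lm(mpg ~ wt + hp, data = mtcars)

## preliminary variance estimate from the larger model
s2.tilde <- summary(big)$sigma^2

round(rbind(big   = selection_criteria(big, s2.tilde),
            small = selection_criteria(small, s2.tilde)), 4)
```

Calling `selection_criteria(model1, summary(model1)$sigma^2)` and the same for ```model2``` would give the corresponding table for the housing models. Except for $\bar{R}^2$, a smaller value indicates the preferred model.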