# Model Selection in Linear Regression

## Index

* [Data-set ](#1)

  <br>
  
* [F-test: test to compare models](#2)
* * [F-test in `Python`](#3)
* * [F-test in `R`](#4)
* * [ANOVA test as an F-test](#3)
* * [Significance test as an F-test](#4)

<br>

* [Iterative Algorithms](#5) 
* * [Metrics]
* * * [$\widehat{R}^2$] 
* * * [AIC]
* * * [BIC]
* * * [Cp]
* * [Best Subset Selection](#6) 
* * [Forward](#7) 
* * [Backward](#8) 

  


## Data-set <a class="anchor" id="1"></a>

The description of the data-set that we are going to use can be found in the following article:

https://fabioscielzoortiz.github.io/Estadistica4all.github.io/Articulos/Linear%20Regression%20in%20Python%20and%20R.html

### Loading the data-set in `Python` <a class="anchor" id="2"></a>

In [48]:
import pandas as pd
import numpy as np

In [49]:
data_Python = pd.read_csv('data_Python_copy.csv')

Converting $quality$ variable to categorical:

In [50]:
data_Python['quality'] = data_Python['quality'].astype('category')

In [51]:
data_Python.head(7)

Unnamed: 0,price,size_in_m_2,longitude,latitude,no_of_bedrooms,no_of_bathrooms,quality
0,2700000,100.242337,55.138932,25.113208,1,2,1
1,2850000,146.972546,55.151201,25.106809,2,2,1
2,1150000,181.253753,55.137728,25.063302,3,5,1
3,2850000,187.66406,55.341761,25.227295,2,3,0
4,1729200,47.101821,55.139764,25.114275,0,1,1
5,3119900,94.296545,55.139764,25.114275,1,2,1
6,8503600,191.565986,55.139764,25.114275,2,3,2


### Loading the data-set in `R` <a class="anchor" id="2"></a>

In [52]:
import rpy2

%load_ext rpy2.ipython

import rpy2.robjects as robjects



In [53]:
%%R 

library(tidyverse)

data_R = read_csv('data_Python_copy.csv')

Rows: 1905 Columns: 7
-- Column specification --------------------------------------------------------
Delimiter: ","
dbl (7): price, size_in_m_2, longitude, latitude, no_of_bedrooms, no_of_bath...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.


Converting $quality$ variable to categorical:

In [54]:
%%R
data_R$quality <- as.factor(data_R$quality)

In [55]:
%%R

data_R[1:7, ]

# A tibble: 7 x 7
    price size_in_m_2 longitude latitude no_of_bedrooms no_of_bathrooms quality
    <dbl>       <dbl>     <dbl>    <dbl>          <dbl>           <dbl> <fct>  
1 2700000       100.       55.1     25.1              1               2 1      
2 2850000       147.       55.2     25.1              2               2 1      
3 1150000       181.       55.1     25.1              3               5 1      
4 2850000       188.       55.3     25.2              2               3 0      
5 1729200        47.1      55.1     25.1              0               1 1      
6 3119900        94.3      55.1     25.1              1               2 1      
7 8503600       192.       55.1     25.1              2               3 2      


## F-test: test to compare models <a class="anchor" id="2"></a>


The ANOVA test and the significance test are particular cases of a more general test that is useful for comparing nested linear regression models, under the assumptions of the model.

-   We have a linear regression model with $k$ coefficients (betas)
    $\hspace{0.1cm} \Rightarrow \hspace{0.1cm} \Omega_k$

     -  $\Omega_k \ : \  \ y_i = \beta_0 + \beta_1\cdot x_{i1} +...+ \beta_{k-1}\cdot x_{i(k-1)} + \varepsilon_i$ 

<br>

-   We have another linear regression model with only
    $q<k$ coefficients of $\Omega_k$ $\hspace{0.1cm} \Rightarrow \hspace{0.1cm}  \omega_q$

     -  $\omega_q \ : \  \ y_i = \beta_0 + \beta_1\cdot x_{i1} +...+ \beta_{q-1}\cdot x_{i(q-1)} + \varepsilon_i \hspace{0.2cm}$ , with $q < k$

<br>

-   $\omega_q$ is a sub-model of $\Omega_k$ , we can denote this as
    $\hspace{0.1cm} \omega_q \subset \Omega_k$
    




The hypothesis test we want to carry out is the following:

$$
H_0: \hspace{0.15cm} \omega_q  \\
H_1: \hspace{0.15cm} \Omega_k
$$


Where:

- $\omega_q \subset \Omega_k$

- Rejecting $H_0$ means that $\Omega_k$ is a "better" model than $\omega_q$.

- Not rejecting $H_0$ means that there is not enough evidence that $\Omega_k$ is a "better" model than $\omega_q$.

<br>



Now we have to determine a rule for deciding whether to reject $H_0$ in favor of $H_1$.

A first approach is the following:

Let :

$$
RSS_{\Omega_k} =  \sum_{i=1}^n ( \hat{\varepsilon}_{\Omega_k \hspace{0.05cm},\hspace{0.05cm} i} )^2 =  \sum_{i=1}^n \left( y_i - \hat{y}_{\hspace{0.01cm} \Omega_k \hspace{0.05cm},\hspace{0.02cm} i}\right)^2   
$$

$$
RSS_{\omega_q} =  \sum_{i=1}^n ( \hat{\varepsilon}_{\omega_q \hspace{0.05cm},\hspace{0.05cm} i} )^2 =  \sum_{i=1}^n \left( y_i - \hat{y}_{\omega_q \hspace{0.05cm},\hspace{0.02cm} i}\right)^2
$$



-   If $RSS_{\omega_q} - RSS_{\Omega_k}$ is **small**, then the
    predictions of the smaller model are almost as good as the larger
    model, so we would prefer the smaller model on the grounds of
    simplicity.

-   If $RSS_{\omega_q} - RSS_{\Omega_k}$ is **large**, then the
    predictions of the smaller model are much worse than the larger
    model, so we would prefer the larger model.




This suggests that something like

$$
\dfrac{RSS_{\omega_q} - RSS_{\Omega_k}}{RSS_{\Omega_k}}
$$

would be a potentially good test statistic, where the denominator is used for scaling purposes.





### Test Statistic

Finally, we arrive at a test statistic based on the previous
expression, called the F-statistic, which follows an F distribution under $H_0$ and the model assumptions:




$$
F=\dfrac{(RSS_{\omega_q} - RSS_{\Omega_k})/(k-q)}{RSS_{\Omega_k}/(n-k)} \sim F_{\hspace{0.05cm} k-q \hspace{0.05cm},\hspace{0.05cm} n-k}
$$



Where:

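$k$ is the number of coefficients (betas) of the model $\Omega_k$

$q$ is the number of coefficients (betas) of the model $\omega_q$

$n$ is the number of observations

$H_0$ is rejected at level $\alpha$ when $F$ exceeds the $1-\alpha$ quantile of the $F_{\hspace{0.05cm} k-q \hspace{0.05cm},\hspace{0.05cm} n-k}$ distribution, that is, when the p-value is below $\alpha$.


The beauty of this approach is that you only need to know the general form.
In any particular case, you just need to figure out which models
represent the null and alternative hypotheses, fit them and compute the test statistic.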
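As a minimal sketch of that recipe, the statistic can be computed by hand from the two residual sums of squares. The helper name `f_statistic` is ours (it is not part of any library), and the inputs are the rounded `ssr` and `df_resid` values that `anova_lm(M2_py, M1_py)` reports in the Python section of this article, so the result only agrees with that table up to rounding.

```python
def f_statistic(rss_small, rss_large, df_resid_small, df_resid_large):
    """F statistic for H0: small model (omega_q) vs H1: large model (Omega_k).

    df_resid_small = n - q and df_resid_large = n - k, so their
    difference is k - q, the number of coefficients being tested.
    """
    df_diff = df_resid_small - df_resid_large        # k - q
    numerator = (rss_small - rss_large) / df_diff    # (RSS_omega - RSS_Omega) / (k - q)
    denominator = rss_large / df_resid_large         # RSS_Omega / (n - k)
    return numerator / denominator


# Rounded values copied from the anova_lm(M2_py, M1_py) table of this article
F = f_statistic(rss_small=4.882082e15, rss_large=4.694567e15,
                df_resid_small=1896, df_resid_large=1893)
print(round(F, 2))
```

The p-value is the upper tail of the $F_{k-q,\, n-k}$ distribution at this value; with SciPy installed it could be obtained with `scipy.stats.f.sf(F, 3, 1893)`.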




### F-test in Python <a class="anchor" id="3"></a>

In [56]:
import statsmodels.formula.api as smf

from statsmodels.stats.anova import anova_lm

from statsmodels.formula.api import ols

In [57]:
M1_py = smf.ols(formula = 'price ~ size_in_m_2*quality  + no_of_bedrooms + no_of_bathrooms + latitude + longitude', 
                 data =data_Python).fit()

M2_py = smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude ', 
                 data =data_Python).fit()

M3_py = smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + latitude + longitude', 
                 data =data_Python).fit()

M4_py = smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms', 
                 data =data_Python).fit()

M5_py = smf.ols(formula = 'price ~ size_in_m_2', 
                 data =data_Python).fit()

$M1$:$\hspace{0.15cm}$ price ~ size_in_m_2*quality  + no_of_bedrooms + no_of_bathrooms + latitude + longitude

$M2$:$\hspace{0.15cm}$  price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude 

$M3$:$\hspace{0.15cm}$  price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + latitude + longitude

$M4$:$\hspace{0.15cm}$  price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms

$M5$:$\hspace{0.15cm}$  price ~ size_in_m_2



Note that:

$M5 \subset M4 \subset M3 \subset M2 \subset M1$

> anova_lm(Model $H_0$ , Model $H_1$)

Remember that: $\hspace{0.15cm}$ Model $H_0$ $\subset$ Model $H_1$

In [58]:
anova_lm(M2_py, M1_py)

Unnamed: 0,df_resid,ssr,df_diff,ss_diff,F,Pr(>F)
0,1896.0,4882082000000000.0,0.0,,,
1,1893.0,4694567000000000.0,3.0,187514700000000.0,25.203983,5.483226e-16


In [59]:
anova_lm(M3_py, M2_py)

Unnamed: 0,df_resid,ssr,df_diff,ss_diff,F,Pr(>F)
0,1899.0,4898667000000000.0,0.0,,,
1,1896.0,4882082000000000.0,3.0,16584980000000.0,2.146975,0.092395


In [60]:
anova_lm(M4_py, M3_py)

Unnamed: 0,df_resid,ssr,df_diff,ss_diff,F,Pr(>F)
0,1901.0,5080302000000000.0,0.0,,,
1,1899.0,4898667000000000.0,2.0,181634500000000.0,35.205892,9.703235e-16


In [61]:
anova_lm(M5_py, M1_py)

Unnamed: 0,df_resid,ssr,df_diff,ss_diff,F,Pr(>F)
0,1903.0,5593743000000000.0,0.0,,,
1,1893.0,4694567000000000.0,10.0,899175600000000.0,36.257642,2.127326e-65


### F-test in R <a class="anchor" id="4"></a>

In [62]:
%%R

M1_R <- lm( price ~ size_in_m_2*quality  + no_of_bedrooms + no_of_bathrooms + latitude + longitude , data = data_R)

M2_R <- lm( price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude , data = data_R)

M3_R <- lm( price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + latitude + longitude , data = data_R)

M4_R <- lm( price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms , data = data_R)

M5_R <- lm( price ~ size_in_m_2 , data = data_R)

In [63]:
%%R

anova(M2_R, M1_R)

Analysis of Variance Table

Model 1: price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + 
    latitude + longitude
Model 2: price ~ size_in_m_2 * quality + no_of_bedrooms + no_of_bathrooms + 
    latitude + longitude
  Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
1   1896 4.8821e+15                                   
2   1893 4.6946e+15  3 1.8751e+14 25.204 5.483e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


In [64]:
%%R

anova(M3_R, M2_R)

Analysis of Variance Table

Model 1: price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + latitude + 
    longitude
Model 2: price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + 
    latitude + longitude
  Res.Df        RSS Df  Sum of Sq     F Pr(>F)  
1   1899 4.8987e+15                             
2   1896 4.8821e+15  3 1.6585e+13 2.147 0.0924 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


In [65]:
%%R

anova(M4_R, M3_R)

Analysis of Variance Table

Model 1: price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms
Model 2: price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + latitude + 
    longitude
  Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
1   1901 5.0803e+15                                   
2   1899 4.8987e+15  2 1.8163e+14 35.206 9.703e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


In [66]:
%%R

anova(M5_R, M1_R)

Analysis of Variance Table

Model 1: price ~ size_in_m_2
Model 2: price ~ size_in_m_2 * quality + no_of_bedrooms + no_of_bathrooms + 
    latitude + longitude
  Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
1   1903 5.5937e+15                                   
2   1893 4.6946e+15 10 8.9918e+14 36.258 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


### ANOVA test as an F-test <a class="anchor" id="3"></a>



Remember that the hypotheses of the ANOVA test are these:

\begin{gather*}
\hspace{-0.7 cm} H_0: \hspace{0.15cm} \beta_1=...=\beta_p=0 \\
H_1: \hspace{0.15cm} \exists \ j=1,...,p , \hspace{0.2cm} \beta_j \neq 0
\end{gather*}


Let us consider the following models:

-   $\Omega_k \ : \  \ y_i = \beta_0 + \beta_1\cdot x_{i1} +...+ \beta_{p}\cdot x_{ip} + \varepsilon_i$

-   $\omega_q \ : \ \ y_i = \beta_0 + \varepsilon_i \hspace{0.15cm}$ (The
    Null Model)



Then, the ANOVA test is equivalent to the following:

\begin{gather*}
\hspace{-3.7cm} H_0: \hspace{0.15cm} y_i = \beta_0 + \varepsilon_i  \ \hspace{0.2cm} (  \omega_q ) \\
H_1: \hspace{0.15cm}  \ y_i = \beta_0 + \beta_1\cdot x_{i1} +...+ \beta_{p}\cdot x_{ip} + \varepsilon_i  \ \hspace{0.2cm} ( \Omega_k )
\end{gather*}


Where:

- $k=p+1$

- $q=1$

-   $RSS_{\Omega_k} = \sum_{i=1}^n ( y_i - \hat{y}_{\hspace{0.01cm} \Omega_k \hspace{0.05cm},\hspace{0.02cm} i})^2 = \sum_{i=1}^n \left( y_i - ( \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1} +...+ \hat{\beta}_{p}\cdot x_{ip} ) \right)^2$

-   $RSS_{\omega_q} = \sum_{i=1}^n ( y_i - \hat{y}_{\hspace{0.01cm} \omega_q \hspace{0.05cm},\hspace{0.02cm} i})^2 = \sum_{i=1}^n ( y_i - \hat{\beta}_0 )^2$

-   Note that in the null model $\hspace{0.1cm} \hat{\beta}_0=\overline{y} \hspace{0.1cm}$, therefore we have $\hspace{0.1cm} RSS_{\omega_q}=\sum_{i=1}^n ( y_i - \overline{y} )^2= TSS_{\omega_q}= TSS_{\Omega_k}=TSS$





Using these facts and the F-statistic we get the test statistic of the ANOVA test:

<br>

\begin{gather*}
F=\dfrac{(RSS_{\omega_q} - RSS_{\Omega_k})/(k-q)}{RSS_{\Omega_k}/(n-k)} = \dfrac{(TSS-RSS)/(k-1)}{RSS/(n-k)} \sim F_{k-1 \hspace{0.03cm},\hspace{0.03cm} n-k}
\end{gather*}





Where:

$TSS= RSS_{\omega_q}$

$RSS= RSS_{\Omega_k}$
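Dividing the numerator and the denominator of this statistic by $TSS$ shows that the ANOVA F-statistic is a function of $R^2 = 1 - RSS/TSS$ alone. The sketch below checks this numerically; the function name `anova_f_from_r2` is ours, and the two sums of squares are the rounded values reported by the null-versus-full comparison in this article.

```python
def anova_f_from_r2(r2, n, k):
    """Overall F statistic written in terms of R^2 (k coefficients, intercept included)."""
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))


tss = 1.615874e16   # RSS of the null model (rounded value from the article's table)
rss = 4.882082e15   # RSS of the full model (rounded value from the article's table)
n, k = 1905, 9      # 1905 observations; 8 slopes plus the intercept

r2 = 1 - rss / tss
F_from_rss = ((tss - rss) / (k - 1)) / (rss / (n - k))
F_from_r2 = anova_f_from_r2(r2, n, k)

print(round(r2, 4), round(F_from_rss, 2), round(F_from_r2, 2))
```

Both routes give the same value because they are the same expression written in two ways.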




#### Anova test as an F-test in Python


In [67]:
full_model_py = smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude', 
                        data =data_Python).fit()

null_model_py = smf.ols(formula = 'price ~ 1', data =data_Python).fit()

In [68]:
anova_lm(null_model_py , full_model_py)

Unnamed: 0,df_resid,ssr,df_diff,ss_diff,F,Pr(>F)
0,1904.0,1.615874e+16,0.0,,,
1,1896.0,4882082000000000.0,8.0,1.127666e+16,547.423881,0.0


#### Anova test as an F-test in R


In [69]:
%%R

full_model_R <- lm( price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude , data = data_R)
null_model_R <- lm( price ~ 1 , data = data_R)

In [70]:
%%R

anova(null_model_R, full_model_R)

Analysis of Variance Table

Model 1: price ~ 1
Model 2: price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + 
    latitude + longitude
  Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
1   1904 1.6159e+16                                   
2   1896 4.8821e+15  8 1.1277e+16 547.42 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


### Significance test as an F-test <a class="anchor" id="4"></a>

Remember that the hypotheses of the significance test of $\beta_j$ are these:

\begin{gather*}
H_0: \beta_j=0 \\
H_1: \beta_j \neq 0
\end{gather*}

Let us consider the following models:

-   $\omega_q  : \  \ y_i = \beta_0 + \beta_1\cdot x_{i1} +..+\beta_{j-1} \cdot x_{i,j-1}+\beta_{j+1} \cdot x_{i,j+1}+..+ \beta_{p}\cdot x_{ip} + \varepsilon_i$

-   $\Omega_k : \ \ y_i = \beta_0 + \beta_1\cdot x_{i1} +..+\beta_j \cdot x_{ij}+..+ \beta_{p}\cdot x_{ip} + \varepsilon_i$
    




Then, the significance test of $\beta_j$ is equivalent to the following:

\begin{gather*}
H_0: \hspace{0.2cm} y_i = \beta_0 + \beta_1\cdot x_{i1} +..+\beta_{j-1} \cdot x_{i,j-1}+\beta_{j+1} \cdot x_{i,j+1}+..+ \beta_{p}\cdot x_{ip} + \varepsilon_i \ \hspace{0.2cm} (\omega_q) \\
\hspace{-2.7cm} H_1: \hspace{0.2cm}  y_i = \beta_0 + \beta_1\cdot x_{i1} +...+\beta_j \cdot x_{ij}+...+ \beta_{p}\cdot x_{ip} + \varepsilon_i  \ \hspace{0.2cm} ( \Omega_k )
\end{gather*}




Where:

- $k=p+1$

- $q=k-1=p$

-   $RSS_{\omega_q} = \sum_{i=1}^n ( y_i - \hat{y}_{\hspace{0.01cm} \omega_q \hspace{0.02cm},\hspace{0.02cm} i})^2 = \sum_{i=1}^n \left( y_i - ( \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1}  +...+ \hat{\beta}_{j-1} \cdot x_{i,j-1} +  \hat{\beta}_{j+1} \cdot x_{i,j+1}+...+ \hat{\beta}_{p}\cdot x_{ip} ) \right)^2$

-   $RSS_{\Omega_k} = \sum_{i=1}^n ( y_i - \hat{y}_{\hspace{0.01cm} \Omega_k \hspace{0.02cm},\hspace{0.02cm} i})^2 = \sum_{i=1}^n\left( y_i -  ( \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1}  +... +  \hat{\beta}_{j} \cdot x_{i,j}+...+ \hat{\beta}_{p}\cdot x_{ip} )  \right)^2$





So, the test statistic is obtained by applying the F-statistic formula:

\begin{gather*}
F=\dfrac{(RSS_{\omega_q} - RSS_{\Omega_k})/(k-q)}{RSS_{\Omega_k}/(n-k)} \sim F_{k-q, n-k}
\end{gather*}



For a single coefficient, this F-test is equivalent to the usual two-sided t-test: the F-statistic equals the square of the t-statistic ($F = t^2$), so both tests give the same p-value.
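The relation between the F-test of a single coefficient and its t-test can be checked on a toy example. The sketch below fits a simple regression by hand on made-up numbers (they are not taken from the data-set) and compares the F-statistic of "null model vs. model with $x$" with the squared t-statistic of the slope.

```python
import math

# Toy data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# OLS estimates of the model y = b0 + b1 * x
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

rss_full = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))  # Omega: k = 2
rss_null = sum((yi - y_bar) ** 2 for yi in y)                     # omega: q = 1
k, q = 2, 1

F = ((rss_null - rss_full) / (k - q)) / (rss_full / (n - k))

# t statistic of the slope: estimate divided by its standard error
se_b1 = math.sqrt(rss_full / (n - k) / s_xx)
t = b1 / se_b1

print(F, t ** 2)
```

The two printed numbers coincide up to floating-point error, which is why the F-test only adds something new when several coefficients are tested at once.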

**Important**: $\hspace{0.1cm}$ this is the way to test the significance of categorical variables (compare the model without the categorical variable vs the model with it), and also to test the significance of two or more variables at the same time.

#### Significance test as an F-test in Python

$$
H_0: \beta_{quality} = 0 \\
H_1: \beta_{quality} \neq 0
$$

In [71]:
Model_with_quality_py =  smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude', data =data_Python).fit()
Model_without_quality_py =  smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + latitude + longitude', data =data_Python).fit()

In [72]:
anova_lm(Model_without_quality_py, Model_with_quality_py)

Unnamed: 0,df_resid,ssr,df_diff,ss_diff,F,Pr(>F)
0,1899.0,4898667000000000.0,0.0,,,
1,1896.0,4882082000000000.0,3.0,16584980000000.0,2.146975,0.092395


$$
\hspace{-1.4cm} H_0: \beta_{longitude}= \beta_{latitude} = 0 \\
H_1: \hspace{0.3cm} \beta_{longitude}\neq 0 \hspace{0.2cm} or \hspace{0.2cm} \beta_{latitude}\neq 0
$$

In [73]:
M1_py = smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality', data=data_Python).fit()
M2_py = smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude', data =data_Python).fit()

In [74]:
anova_lm(M1_py, M2_py)

Unnamed: 0,df_resid,ssr,df_diff,ss_diff,F,Pr(>F)
0,1898.0,5066797000000000.0,0.0,,,
1,1896.0,4882082000000000.0,2.0,184715000000000.0,35.867848,5.131535e-16


#### Significance test as an F-test in R

$$
H_0: \beta_{quality} = 0 \\
H_1: \beta_{quality} \neq 0
$$

In [75]:
%%R

Model_with_quality_R <- lm(price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude , data = data_R)
Model_without_quality_R <- lm(price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms  + latitude + longitude , data = data_R)

In [76]:
%%R

anova(Model_without_quality_R, Model_with_quality_R)

Analysis of Variance Table

Model 1: price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + latitude + 
    longitude
Model 2: price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + 
    latitude + longitude
  Res.Df        RSS Df  Sum of Sq     F Pr(>F)  
1   1899 4.8987e+15                             
2   1896 4.8821e+15  3 1.6585e+13 2.147 0.0924 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


$$
\hspace{-1.4cm} H_0: \beta_{longitude}= \beta_{latitude} = 0 \\
H_1: \hspace{0.3cm} \beta_{longitude}\neq 0 \hspace{0.2cm} or \hspace{0.2cm} \beta_{latitude}\neq 0
$$

In [77]:
%%R

M1_R <- lm('price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality', data=data_R)
M2_R <- lm('price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude', data =data_R)

In [78]:
%%R

anova(M1_R , M2_R)

Analysis of Variance Table

Model 1: price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality
Model 2: price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + 
    latitude + longitude
  Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
1   1898 5.0668e+15                                   
2   1896 4.8821e+15  2 1.8471e+14 35.868 5.132e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


## Iterative Algorithms <a class="anchor" id="5"></a>

Now we are going to see various algorithms aimed at choosing one linear regression model among many or all of the possible ones, according to a criterion based on some metric.

These metrics are among the most important concepts in modern statistics and machine learning. 

Some of them are:

- Cross validation test error
- $\widehat{R}^2$  
- AIC
- BIC
- Cp



A detailed review of cross-validation methods will be given in a future article on my blog. This criterion will not be used in our practical implementation of the following iterative algorithms, because in the selection of linear regression models it is most common to use AIC, BIC or $\widehat{R}^2$, so we will review these instead.

In model selection in general, however, cross-validation plays a very prominent role.

###  $\widehat{R}^2$ 

This metric is explained in more detail in the following article about linear regression:

https://fabioscielzoortiz.github.io/Estadistica4all.github.io/Articulos/Linear%20Regression%20in%20Python%20and%20R.html

Here we will just show the formula that characterizes the adjusted $R^2$:

Given a linear regression model $\hspace{0.05cm} M \hspace{0.05cm}$  with $\hspace{0.05cm} p_M \hspace{0.051cm}$ predictors (plus the intercept) and $n$ observations:

\begin{gather*}
\widehat{R}^2 = 1 - \left( 1- R^2 \right) \cdot \dfrac{n-1}{n-p_M-1}
\end{gather*}

Where:

\begin{gather*}
R^2 = \dfrac{RegSS}{TSS}
\end{gather*}

This metric is often used as a criterion to select linear regression models.

**$\widehat{R}^2$ criterion**: 

Given $h$ linear regression models $\hspace{0.1cm}M_1 , M_2, \dots, M_h$

If $\hspace{0.1cm}  \widehat{R}^2 (M_j) > \widehat{R}^2 (M_r) \hspace{0.2cm} , \forall r \in \lbrace 1,...,h\rbrace - \lbrace j \rbrace \hspace{0.2cm} \Rightarrow \hspace{0.2cm} $ $M_j$ is selected instead of $M_r$  $ \hspace{0.1cm} , \forall r \in \lbrace 1,...,h\rbrace - \lbrace j \rbrace$

That is, the model with the **highest** $\widehat{R}^2$ is selected over the rest.
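As a small illustration of this criterion, the sketch below computes the adjusted $R^2$ from the residual and total sums of squares, dividing by the residual degrees of freedom $n - p_M - 1$ of a model with an intercept. The function name `adjusted_r2` is ours; the sums of squares are the rounded values reported in the F-test tables of this article for the models with and without `quality` (8 and 5 slope coefficients respectively).

```python
def adjusted_r2(rss, tss, n, p):
    """Adjusted R^2 of a model with p predictors plus an intercept."""
    r2 = 1 - rss / tss
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


tss = 1.615874e16   # total sum of squares (RSS of the null model), rounded
n = 1905

with_quality = adjusted_r2(rss=4.882082e15, tss=tss, n=n, p=8)
without_quality = adjusted_r2(rss=4.898667e15, tss=tss, n=n, p=5)

print(round(with_quality, 4), round(without_quality, 4))
```

Under this criterion the model with the larger value would be kept. Note that this can disagree with an F-test at the usual significance levels, because the adjusted $R^2$ increases whenever the F-statistic of the added variables is greater than 1.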


### $AIC$

Given a linear regression model $\hspace{0.05cm} M \hspace{0.05cm}$  with $\hspace{0.05cm} p_M \hspace{0.051cm}$ predictors and $n$ observations:

$$ AIC(M) = -2 \cdot ln\left(\widehat{L}(M)\right) + 2 \cdot  \left(\hspace{0.1cm} p_M +1 \hspace{0.1cm}\right)  $$


Where:

$\hat{L}(M)$ is the likelihood function of the model $M$ evaluated at the MLE (Maximum Likelihood Estimators)



This metric is often used as a criterion to select linear regression models.

**$AIC$ criterion**:

Given $h$ linear regression models $\hspace{0.1cm}M_1 , M_2, \dots, M_h$

If $\hspace{0.1cm}  AIC (M_j) < AIC(M_r) \hspace{0.2cm} , \forall r \in \lbrace 1,...,h\rbrace - \lbrace j \rbrace \hspace{0.2cm} \Rightarrow \hspace{0.2cm} $ $M_j$ is selected instead of $M_r$  $ \hspace{0.1cm} , \forall r \in \lbrace 1,...,h\rbrace - \lbrace j \rbrace$

That is, the model with the **lowest** $AIC$ is selected over the rest.


#### $AIC$ in Python
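A minimal sketch of the criterion, assuming the maximized log-likelihood of each candidate is already available. The function name `aic` and the two candidate models with their log-likelihoods are hypothetical, chosen only to show how the penalty can reverse a ranking based on fit alone.

```python
def aic(loglik, p):
    """AIC of a model with p predictors plus an intercept, given its maximized log-likelihood."""
    return -2 * loglik + 2 * (p + 1)


# Hypothetical candidates: name -> (maximized log-likelihood, number of predictors)
candidates = {"M_a": (-120.4, 2), "M_b": (-119.9, 5)}

scores = {name: aic(loglik, p) for name, (loglik, p) in candidates.items()}
best = min(scores, key=scores.get)   # the AIC criterion keeps the lowest value

print(scores, best)
```

Here `M_b` has the higher likelihood, but its three extra coefficients cost more than the fit they buy, so `M_a` is selected. For models fitted with statsmodels, the fitted results object exposes the log-likelihood as `.llf` and the AIC as `.aic`.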

### $BIC$

Given a linear regression model $\hspace{0.05cm} M \hspace{0.05cm}$  with $\hspace{0.05cm} p_M \hspace{0.051cm}$ predictors and $n$ observations:

$$ BIC(M) = -2 \cdot ln\left(\widehat{L}(M)\right) + \left(\hspace{0.1cm} p_M +1 \hspace{0.1cm}\right) \cdot ln(n) $$

Where:

$\hat{L}(M)$ is the likelihood function of the model $M$ evaluated at the MLE (Maximum Likelihood Estimators)

This metric is often used as a criterion to select linear regression models.

**$BIC$ criterion**:


Given $h$ linear regression models $\hspace{0.1cm}M_1 , M_2, \dots, M_h$

If $\hspace{0.1cm}  BIC (M_j) < BIC(M_r) \hspace{0.2cm} , \forall r \in \lbrace 1,...,h\rbrace - \lbrace j \rbrace \hspace{0.2cm} \Rightarrow \hspace{0.2cm} $ $M_j$ is selected instead of $M_r$  $ \hspace{0.1cm} , \forall r \in \lbrace 1,...,h\rbrace - \lbrace j \rbrace$

That is, the model with the **lowest** $BIC$ is selected over the rest.

#### BIC in Python
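A minimal sketch, again assuming the maximized log-likelihood is given. The function name `bic` and the log-likelihood values are hypothetical; only the sample size $n = 1905$ comes from the data-set. The point of the example is the size of the penalty: each extra coefficient costs $ln(n)$ under BIC but only $2$ under AIC, so BIC is the stricter criterion as soon as $n \geq 8$.

```python
import math


def bic(loglik, p, n):
    """BIC of a model with p predictors plus an intercept, fitted on n observations."""
    return -2 * loglik + (p + 1) * math.log(n)


n = 1905                             # sample size of the article's data-set
per_coefficient_bic = math.log(n)    # cost of one extra coefficient under BIC
per_coefficient_aic = 2              # cost of one extra coefficient under AIC

# Hypothetical log-likelihoods: the larger model fits slightly better
small = bic(loglik=-120.4, p=2, n=n)
large = bic(loglik=-119.9, p=5, n=n)

print(round(per_coefficient_bic, 3), round(small, 2), round(large, 2))
```

With this sample size the small model is preferred by a wide margin, since the gain in log-likelihood is far smaller than the additional penalty.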

### Maximum Likelihood Estimation in the Linear Regression Model


Given a linear regression model $\hspace{0.1cm} M \hspace{0.1cm}$  with $\hspace{0.1cm} p_M \hspace{0.1cm}$ predictors and $n$ observations:

$$ y_i \sim N(\hspace{0.1cm} x_i^t  \cdot \beta \hspace{0.1cm} , \hspace{0.1cm} \sigma^2 \hspace{0.1cm} )$$

$$ y_i \sim f(y_i) = \dfrac{1}{\sqrt{2\pi \sigma^2}} \cdot exp\lbrace \hspace{0.1cm} - \dfrac{1}{2\sigma^2} \cdot (y_i - x^t_i\cdot \beta)^2 \hspace{0.1cm} \rbrace $$


The likelihood function of the model $M$ is:

$$ L(  M )=L(\beta, \sigma \hspace{0.1cm}|\hspace{0.1cm} x_i, y_i)= \prod_{i=1}^{n} f(y_i) = (2\pi \sigma^2)^{-n/2} \cdot exp\lbrace \hspace{0.1cm} - \dfrac{1}{2\sigma^2}\cdot \sum_{i=1}^{n} (y_i - x^t_i\cdot \beta)^2 \hspace{0.1cm} \rbrace$$

Taking natural logarithm we have:

$$ln\left(\hspace{0.1cm}L(M)\hspace{0.1cm}\right)=ln(\hspace{0.1cm} L(\beta, \sigma \hspace{0.1cm}|\hspace{0.1cm} x_i, y_i)\hspace{0.1cm}) = - \dfrac{n}{2} \left(ln(2\pi) + ln(\sigma^2) \right) - \dfrac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - x^t_i\cdot \beta \right) ^2   $$



The maximum likelihood estimators  of the parameters $\hspace{0.1cm} \beta$ , $\sigma \hspace{0.1cm}$ of the linear regression model $M$ are calculated as the solution of the following optimization problem:

$$
\underset{\beta \hspace{0.05cm},\hspace{0.05cm} \sigma}{Max} \hspace{0.2cm} ln(\hspace{0.1cm}L(M)\hspace{0.1cm})
$$




Solutions:

$$
\hat{\beta}_{MLE}=(X^t \cdot X)^{-1} \cdot X^t \cdot Y = \hat{\beta}_{OLS}
$$
$$
\hat{\sigma}^2_{MLE} = \dfrac{RSS(M)}{n}
$$
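The estimator $\hat{\sigma}^2_{MLE} = RSS(M)/n$ can be checked numerically. With $\beta$ fixed at $\hat{\beta}_{MLE}$, the log-likelihood depends on $\sigma^2$ only through the $RSS$, so it is enough to evaluate it on a few values around $RSS/n$. The function name `profile_loglik` and the toy numbers are ours, not taken from the data-set.

```python
import math


def profile_loglik(sigma2, rss, n):
    """Gaussian log-likelihood with beta fixed at its OLS value, as a function of sigma^2."""
    return -n / 2 * (math.log(2 * math.pi) + math.log(sigma2)) - rss / (2 * sigma2)


rss, n = 50.0, 20          # toy values, for illustration only
sigma2_mle = rss / n       # candidate maximizer from the formula above

# Evaluate the log-likelihood on a small grid around the candidate
grid = [sigma2_mle * factor for factor in (0.5, 0.9, 1.0, 1.1, 2.0)]
best = max(grid, key=lambda s2: profile_loglik(s2, rss, n))

print(sigma2_mle, best)
```

The grid point with the highest log-likelihood is $RSS/n$ itself. Note that this estimator divides by $n$, unlike the unbiased residual variance, which divides by the residual degrees of freedom.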



Note that:

$$
arg \hspace{0.1cm} \underset{\beta \hspace{0.05cm},\hspace{0.05cm} \sigma}{Max} \hspace{0.2cm} L(M) \hspace{0.1cm}=\hspace{0.1cm} arg \hspace{0.1cm} \underset{\beta \hspace{0.05cm},\hspace{0.05cm} \sigma}{Max} \hspace{0.2cm} ln(\hspace{0.1cm}L(M)\hspace{0.1cm}) 
$$



So, the function $\hspace{0.1cm} ln \left( \hspace{0.1cm}L(M)\hspace{0.1cm} \right) \hspace{0.1cm}$ evaluated at $\hspace{0.1cm} \beta=\hat{\beta}_{MLE} \hspace{0.1cm}, \hspace{0.1cm}\sigma^2 = \hat{\sigma}^2_{MLE} \hspace{0.1cm}$, where the last term becomes $\hspace{0.1cm} \dfrac{RSS(M)}{2 \cdot RSS(M)/n} = \dfrac{n}{2} \hspace{0.1cm}$, is :

$$ ln \left( \hspace{0.1cm}\widehat{L}(M)\hspace{0.1cm} \right) =  - \dfrac{n}{2} \left( ln(2\pi) + ln(RSS(M)) - ln(n) + 1 \right) $$




Then, in the linear regression model:

$$ AIC(M) = n \cdot \left( \hspace{0.1cm}  ln(2\pi) + ln(RSS(M)) - ln(n) \hspace{0.1cm} \right) + n + 2\cdot (\hspace{0.1cm}p(M) + 1\hspace{0.1cm}) $$

$$ BIC(M) =  n \cdot \left(  \hspace{0.1cm} ln(2\pi) + ln(RSS(M)) - ln(n) \hspace{0.1cm} \right) + n + ln(n)\cdot(\hspace{0.1cm} p(M) + 1 \hspace{0.1cm}) $$

#### $AIC$ in Python
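For a Gaussian linear model, both criteria can be computed from the $RSS$ alone, using $-2 \cdot ln(\widehat{L}(M)) = n \cdot \left( ln(2\pi) + ln(RSS(M)/n) + 1 \right)$. The sketch below does this with the standard library only. The function name `gaussian_aic_bic` is ours, and the inputs are the rounded sums of squares of the null model and of the model with all predictors reported in this article.

```python
import math


def gaussian_aic_bic(rss, n, p):
    """AIC and BIC of an OLS fit with p predictors plus an intercept."""
    minus_2_loglik = n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    aic = minus_2_loglik + 2 * (p + 1)
    bic = minus_2_loglik + math.log(n) * (p + 1)
    return aic, bic


n = 1905
aic_null, bic_null = gaussian_aic_bic(rss=1.615874e16, n=n, p=0)   # intercept only
aic_full, bic_full = gaussian_aic_bic(rss=4.882082e15, n=n, p=8)   # all predictors

print(round(aic_null, 2), round(aic_full, 2))
print(round(bic_null, 2), round(bic_full, 2))
```

Up to the rounding of the inputs, the two AIC values agree with the ones statsmodels reports for the same models in the forward selection output of this article (about 62118.10 for the null model and 59854.03 for the full one).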

### $C_p$

Given a full linear regression model  with $\hspace{0.1cm} p\hspace{0.1cm}$ predictors $\hspace{0.1cm} M_{Full}:\hspace{0.1cm} y_i = \beta_0 + \sum_{j=1}^{p} \beta_j \cdot x_{ij} + \varepsilon_i$

Given a linear regression model $\hspace{0.1cm} M \subseteq M_{Full}\hspace{0.1cm}$  with $\hspace{0.1cm} p(M) \leq p\hspace{0.1cm}$ predictors and $n$ observations:

$$ C_p(M) = \dfrac{RSS(M)}{\hat{\sigma}_{M_{Full}}^2} - n + 2\cdot \left(\hspace{0.1cm} p(M)+1 \hspace{0.1cm}\right) $$

Where:

$ \hat{\sigma}_{M_{Full}}^2 =  \dfrac{RSS(M_{Full})}{n-p-1} = \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_{M_{Full}, i})^2}{n-p-1}  \hspace{0.2cm} $ is the residual variance of the full model.

#### $C_p$ in Python <a class="anchor" id="6"></a>
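A minimal sketch of the formula above, using only the sums of squares. The function name `mallows_cp` is ours; the inputs are the rounded $RSS$ values that this article reports for the model with all predictors (8 slope coefficients) and for the model without `quality` (5 slope coefficients).

```python
def mallows_cp(rss_sub, p_sub, rss_full, p_full, n):
    """Mallows' C_p of a sub-model, using the residual variance of the full model."""
    sigma2_full = rss_full / (n - p_full - 1)
    return rss_sub / sigma2_full - n + 2 * (p_sub + 1)


n = 1905
rss_full, p_full = 4.882082e15, 8

cp_full = mallows_cp(rss_full, p_full, rss_full, p_full, n)
cp_no_quality = mallows_cp(4.898667e15, 5, rss_full, p_full, n)

print(round(cp_full, 3), round(cp_no_quality, 3))
```

A useful sanity check is that the full model always has $C_p = p + 1$, because its first term reduces to $n - p - 1$. Sub-models are then judged by whether their $C_p$ falls below that value.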

### Best Subset Selection <a class="anchor" id="6"></a>

Best subset selection consists of the following algorithm:

We have $p$ predictors: $X_1,...,X_p$

- We train the null linear model $(M_0)$ 
- We train all the possible linear models with  $1$ predictor, and we select one $(M_1)$ under some criterion, for example the one with the **lowest** $train \hspace{0.1cm} error$. Since all the candidates in a step have the same number of predictors, comparing them by training error is fair.

- We train all the possible linear models with $2$ predictors, and we select one $(M_2)$ under the same criterion.
  
   $\dots$ 

- We train all the possible linear models with $p-1$ predictors, and we select one $(M_{p-1})$ under the same criterion.

- We train the full linear model $(M_p)$


We select one of the models $(M_0, M_1, M_2,...,M_{p-1},M_p)$ under some criterion that accounts for model size, for example the one with the **lowest**  $AIC$, $BIC$ or $C_p$, or the **highest**  $\widehat{R}^2$ . 




Scheme of the algorithm:

- $M_0$

- $\lbrace \text{models with 1 predictor} \rbrace \underset{ \text{train  error} }{\Rightarrow}M_1$

- $\lbrace \text{models with 2 predictors} \rbrace \underset{ \text{train  error} }{\Rightarrow}M_2$

$\hspace{0.8cm} \dots$

- $\lbrace \text{models with}$ $p-1$ $\text{predictors} \rbrace \underset{ \text{train  error} }{\Rightarrow}M_{p-1}$

- $M_p$



- $\lbrace M_0, M_1 , ..., M_p \rbrace \underset{ AIC, BIC, C_p, \widehat{R}^2 }{\Rightarrow} \hspace{0.1cm} M^* $

**Problems:**

- Large computational requirements: fitting $2^p$ models is required, which becomes computationally infeasible beyond roughly $p=40$ predictors.
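To get a feeling for this cost, the sketch below counts the models that have to be fitted. For comparison it also counts the fits needed by forward selection as described in the next section: the null model, then $p$ candidates, then $p-1$, and so on down to the single full model. Both function names are ours.

```python
def n_models_best_subset(p):
    """Every subset of the p predictors is fitted once."""
    return 2 ** p


def n_models_forward(p):
    """Null model plus p + (p - 1) + ... + 1 candidate fits."""
    return 1 + p * (p + 1) // 2


for p in (5, 10, 20, 40):
    print(p, n_models_best_subset(p), n_models_forward(p))
```

For $p = 40$ best subset selection needs more than $10^{12}$ fits, while forward selection needs fewer than a thousand, at the price of not exploring every subset.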

### Best Subset Selection in Python <a class="anchor" id="6"></a>
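A minimal sketch of the algorithm, assuming a NumPy array of numeric predictors rather than the article's DataFrame. The helper names `rss_of`, `aic_of` and `best_subset`, as well as the simulated data, are ours and serve only as an illustration. Within each size the subsets are compared by training $RSS$, and the winners of each size are then compared by $AIC$, computed from the Gaussian closed form based on the $RSS$.

```python
from itertools import combinations
import math

import numpy as np


def rss_of(X, y, cols):
    """RSS of the OLS fit of y on an intercept plus the given columns of X."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def aic_of(rss, n, p):
    """Gaussian AIC of an OLS fit with p predictors plus an intercept."""
    return n * (math.log(2 * math.pi) + math.log(rss / n) + 1) + 2 * (p + 1)


def best_subset(X, y):
    n, p = X.shape
    # Step 1: for each size k, keep the subset with the lowest training RSS (M_k)
    best_per_size = []
    for k in range(p + 1):
        cols = min(combinations(range(p), k), key=lambda c: rss_of(X, y, c))
        best_per_size.append((cols, rss_of(X, y, cols)))
    # Step 2: compare M_0, ..., M_p with a criterion that penalizes size (AIC here)
    return min(best_per_size, key=lambda m: aic_of(m[1], n, len(m[0])))[0]


# Simulated data: only columns 0 and 2 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

selected = best_subset(X, y)
print(selected)
```

The selected tuple should contain the two informative columns; whether a noise column is also kept depends on the sample, since AIC tolerates an extra predictor whenever it improves the fit enough. The loop fits all $2^p$ subsets, so this sketch is only practical for small $p$.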

### Forward Selection <a class="anchor" id="7"></a>

Forward selection consists of the following algorithm:

We have $p$ predictors: $X_1,...,X_p$

- We train the null linear model $(M_0)$ 
  
- We train all the linear models that result from adding one predictor to the model $M_0$ , and we select one $(M_1)$ under some criterion, for example the one with the **lowest** $train \hspace{0.1cm} error$.


- We train all the linear models that result from adding one predictor to the model $M_1$ , and we select one $(M_2)$ under the same criterion.
  
   $\dots$ 

- We train all the linear models that result from adding one predictor to the model $M_{p-2}$ , and we select one $(M_{p-1})$ under the same criterion.

- We train the full linear model $(M_p)$


We select one of the models $(M_0, M_1,...,M_{p-1},M_p)$ under some criterion, for example the one with the **lowest** $\hspace{0.1cm}cross\hspace{0.1cm} validation\hspace{0.1cm} test\hspace{0.1cm} error$, $AIC$, $BIC$ or $C_p$, or the **highest**  $\widehat{R}^2$ . 


Scheme of the algorithm:

- $M_0$

- $\lbrace  M_0 \hspace{0.1cm} \text{+ 1 predictor} \rbrace \underset{ \text{train  error} }{\Rightarrow}M_1$

- $\lbrace  M_1 \hspace{0.1cm} \text{+ 1 predictor} \rbrace \underset{ \text{train  error} }{\Rightarrow}M_2$

$\hspace{0.8cm} \dots$

- $\lbrace  M_{p-2} \hspace{0.1cm} \text{+ 1 predictor} \rbrace \underset{ \text{train  error} }{\Rightarrow}M_{p-1}$

- $M_p$

- $\lbrace M_0, M_1 , ..., M_p \rbrace \underset{ AIC, BIC, C_p, \widehat{R}^2 }{\Rightarrow} \hspace{0.1cm} M^* $

Where:

$\lbrace  M_j \hspace{0.1cm} \text{+ 1 predictor} \rbrace \hspace{0.1cm} $ is the set of the linear regression models that are the result of adding one predictor to the model $M_j$

#### Forward Selection in Python

In [79]:
def __varcharProcessing__(X, varchar_process = "dummy_dropfirst"):
    
    dtypes = X.dtypes
    # Columns stored as Python objects (strings). `np.object` was removed in NumPy 1.24, so plain `object` is used.
    char_cols = dtypes[dtypes == object].index.tolist()

    if varchar_process == "drop":   
        X = X.drop(columns = char_cols)
        print("Character Variables (Dropped):", char_cols)
    elif varchar_process == "dummy":
        # dtype=float keeps the dummies numeric (recent pandas versions return booleans by default)
        X = pd.get_dummies(X, drop_first=False, dtype=float)
        print("Character Variables (Dummies Generated):", char_cols)
    else:
        # "dummy_dropfirst" and any unrecognized option: dummies without the first level.
        # Note that get_dummies also encodes `category` columns such as quality,
        # although they are not listed in the message below.
        X = pd.get_dummies(X, drop_first=True, dtype=float)
        print("Character Variables (Dummies Generated, First Dummies Dropped):", char_cols)
    
    # Add the intercept column and move it to the first position
    X["intercept"] = 1
    cols = X.columns.tolist()
    cols = cols[-1:] + cols[:-1]
    X = X[cols]
    
    return X

In [80]:
import statsmodels.formula.api as smf
import statsmodels.api as sm

In [227]:
def forward(X, y, varchar_process="dummy_dropfirst"):

    # Add the intercept column and dummy-encode categorical predictors
    X = __varcharProcessing__(X, varchar_process=varchar_process)

    selected_cols = ["intercept"]
    other_cols = [c for c in X.columns.tolist() if c != "intercept"]

    # M_0: the null model (intercept only)
    model = sm.OLS(y, X[selected_cols]).fit()
    rows = [[selected_cols.copy(), model.aic]]

    for i in range(len(other_cols)):

        # Train error (MSE) of every model that adds one predictor to the current one
        errors = []
        for j in other_cols:
            cols_j = selected_cols + [j]
            candidate = sm.OLS(y, X[cols_j]).fit()
            errors.append([j, ((y - candidate.predict(X[cols_j])) ** 2).mean()])

        train_errors = pd.DataFrame(errors, columns=["Cols", "train_error_MSE"])
        train_errors = train_errors.sort_values(by=["train_error_MSE"]).reset_index(drop=True)

        # The predictor with the lowest train error enters the model
        best = train_errors["Cols"][0]
        selected_cols.append(best)
        other_cols.remove(best)

        # Record the model after the predictor has been added, together with its AIC
        model = sm.OLS(y, X[selected_cols]).fit()
        rows.append([selected_cols.copy(), model.aic])

    Models = pd.DataFrame(rows, columns=["model", "AIC"])

    Final_Model = sm.OLS(y, X[selected_cols]).fit()

    return Models, Final_Model, train_errors

In [213]:
selected_cols = ["intercept"]


In [195]:
Models = pd.DataFrame([[ selected_cols , 1 ]], columns = ["model","AIC"])
Models

Unnamed: 0,model,AIC
0,[intercept],1


In [218]:
selected_cols.append('hola')
selected_cols.append('adios')

In [226]:
selected_cols[0:1]

['intercept']

In [198]:
pd.concat([Models, pd.DataFrame([[ selected_cols , 2 ]], columns = ["model","AIC"]) ])

Unnamed: 0,model,AIC
0,"[intercept, hola]",1
0,"[intercept, hola]",2


In [88]:
X = data_Python[['size_in_m_2', 'longitude', 'latitude', 'no_of_bedrooms', 'no_of_bathrooms', 'quality']]
y = data_Python['price']


In [124]:
import warnings
warnings.filterwarnings('ignore')

In [223]:
A = forward(X,y)

Character Variables (Dummies Generated, First Dummies Dropped): []


In [224]:
Models , Final_Model, train_errors = A

In [228]:
pd.options.display.max_columns
Models

Unnamed: 0,model,AIC
0,[intercept],62118.101353
1,"[intercept, size_in_m_2]",60099.253467
2,"[intercept, size_in_m_2, no_of_bedrooms]",59917.927231
3,"[intercept, size_in_m_2, no_of_bedrooms, latit...",59856.412019
4,"[intercept, size_in_m_2, no_of_bedrooms, latit...",59853.04137
5,"[intercept, size_in_m_2, no_of_bedrooms, latit...",59852.348316
6,"[intercept, size_in_m_2, no_of_bedrooms, latit...",59852.522224
7,"[intercept, size_in_m_2, no_of_bedrooms, latit...",59852.729811
8,"[intercept, size_in_m_2, no_of_bedrooms, latit...",59854.02705
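The last step of the algorithm, choosing $M^*$ among the fitted models, amounts to taking the row with the lowest AIC. A minimal sketch in plain Python, using the AIC values printed above and assuming (as the algorithm implies) that position $k$ of the list holds $M_k$, the model with $k$ predictors. The name `aic_by_size` is introduced here only for illustration:

```python
# AIC values copied from the table above; position k is the model with k predictors
aic_by_size = [
    62118.101353, 60099.253467, 59917.927231, 59856.412019, 59853.04137,
    59852.348316, 59852.522224, 59852.729811, 59854.02705,
]

# Index of the smallest AIC = number of predictors in the selected model
best_k = min(range(len(aic_by_size)), key=aic_by_size.__getitem__)
print(best_k, aic_by_size[best_k])
```

Under AIC the five-predictor model is preferred, although the models with four to eight predictors differ by less than 2 AIC units.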


In [208]:
train_errors['Cols'][0]

'no_of_bathrooms'

In [141]:
Final_Model.aic

59854.027050178316

### Backward Selection <a class="anchor" id="6"></a>

Backward selection consists of the following algorithm:

We have $p$ predictors: $X_1,...,X_p$

- We train the full linear model $(M_p)$ 
  
- We train all the linear models that result from removing one predictor from the model $M_p$, and we select one $(M_{p-1})$ under some criterion, for example the one with the **lowest** $train \hspace{0.1cm} error$.


- We train all the linear models that result from removing one predictor from the model $M_{p-1}$, and we select one $(M_{p-2})$ under the same criterion.
  
   $\dots$ 

- We train all the linear models that result from removing one predictor from the model $M_{2}$, and we select one $(M_{1})$ under the same criterion.

- We train the null linear model $(M_0)$


We select one of the models $(M_0,M_1,...,M_{p-1},M_p)$ under some criterion, for example the one with the **lowest** $\hspace{0.1cm}cross\hspace{0.1cm} validation\hspace{0.1cm} test\hspace{0.1cm} error$, $AIC$, $BIC$ or $C_p$, or the **highest** $\widehat{R}^2$.


Scheme of the algorithm:

- $M_p$

- $\lbrace  M_p \hspace{0.1cm} \text{- 1 predictor} \rbrace \underset{ \text{train  error} }{\Rightarrow}M_{p-1}$

- $\lbrace  M_{p-1} \hspace{0.1cm} \text{- 1 predictor} \rbrace \underset{ \text{train  error} }{\Rightarrow}M_{p-2}$

$\hspace{0.8cm} \dots$

- $\lbrace  M_{2} \hspace{0.1cm} \text{- 1 predictor} \rbrace \underset{ \text{train  error} }{\Rightarrow}M_{1}$

- $M_0$

- $\lbrace M_0, M_1 , ..., M_p \rbrace \underset{ AIC, BIC, C_p, \widehat{R}^2 }{\Rightarrow} \hspace{0.1cm} M^*$
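The scheme above can be sketched in a few lines. This is a minimal NumPy-only illustration under stated assumptions, not the implementation used in this article: `ols_aic_rss` and `backward` are hypothetical helper names, the train error is taken to be the residual sum of squares, the final model is chosen by AIC (computed from the Gaussian log-likelihood of the OLS fit), and the data are synthetic rather than the housing data-set.

```python
import numpy as np


def ols_aic_rss(y, X):
    """Fit OLS by least squares and return (AIC, residual sum of squares)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return -2 * log_lik + 2 * k, rss


def backward(X, y, names):
    """Backward selection: drop predictors by train error, choose M* by AIC.

    X holds the predictors only; the intercept column is added here.
    """
    ones = np.ones((len(y), 1))
    current = list(range(X.shape[1]))   # indices of the predictors still in the model
    path = []                           # (predictor names, AIC) for M_p, ..., M_0

    # M_p: the full model
    aic, _ = ols_aic_rss(y, np.hstack([ones, X[:, current]]))
    path.append(([names[c] for c in current], aic))

    while current:
        # All models with one predictor removed; keep the one with the lowest RSS
        candidates = []
        for c in current:
            kept = [d for d in current if d != c]
            aic, rss = ols_aic_rss(y, np.hstack([ones, X[:, kept]]))
            candidates.append((rss, aic, kept))
        rss, aic, current = min(candidates, key=lambda t: t[0])
        path.append(([names[c] for c in current], aic))

    # M*: the model with the lowest AIC among M_0, ..., M_p
    best = min(path, key=lambda t: t[1])
    return path, best


# Synthetic example: only x0 and x2 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)

path, (best_vars, best_aic) = backward(X, y, ["x0", "x1", "x2", "x3"])
for variables, aic in path:
    print(variables, round(aic, 2))
print("Selected:", best_vars)
```

The last pass of the loop removes the only remaining predictor, so the null model $M_0$ is fitted without a separate step.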

In [None]:
model = smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude', data =data_Python)

model = model.fit()
 
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.698
Model:                            OLS   Adj. R-squared:                  0.697
Method:                 Least Squares   F-statistic:                     547.4
Date:                Wed, 20 Jul 2022   Prob (F-statistic):               0.00
Time:                        19:56:10   Log-Likelihood:                -29918.
No. Observations:                1905   AIC:                         5.985e+04
Df Residuals:                    1896   BIC:                         5.990e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept       -6.207e+07   2.99e+07     

In [None]:
X = data_Python[['size_in_m_2', 'longitude', 'latitude', 'no_of_bedrooms', 'no_of_bathrooms', 'quality']]
X

Unnamed: 0,size_in_m_2,longitude,latitude,no_of_bedrooms,no_of_bathrooms,quality
0,100.242337,55.138932,25.113208,1,2,1
1,146.972546,55.151201,25.106809,2,2,1
2,181.253753,55.137728,25.063302,3,5,1
3,187.664060,55.341761,25.227295,2,3,0
4,47.101821,55.139764,25.114275,0,1,1
...,...,...,...,...,...,...
1900,100.985561,55.310712,25.176892,2,2,3
1901,70.606280,55.276684,25.166145,1,2,1
1902,179.302790,55.345056,25.206500,3,5,1
1903,68.748220,55.229844,25.073858,1,2,1


In [None]:
y = data_Python['price']
y

0       2700000
1       2850000
2       1150000
3       2850000
4       1729200
         ...   
1900    1500000
1901    1230000
1902    2900000
1903     675000
1904     760887
Name: price, Length: 1905, dtype: int64

In [None]:
import numpy as np

In [None]:
__varcharProcessing__(X, varchar_process = "dummy_dropfirst")


Character Variables (Dummies Generated, First Dummies Dropped): []




Unnamed: 0,intercept,size_in_m_2,longitude,latitude,no_of_bedrooms,no_of_bathrooms,quality_1,quality_2,quality_3
0,1,100.242337,55.138932,25.113208,1,2,1,0,0
1,1,146.972546,55.151201,25.106809,2,2,1,0,0
2,1,181.253753,55.137728,25.063302,3,5,1,0,0
3,1,187.664060,55.341761,25.227295,2,3,0,0,0
4,1,47.101821,55.139764,25.114275,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...
1900,1,100.985561,55.310712,25.176892,2,2,0,0,1
1901,1,70.606280,55.276684,25.166145,1,2,1,0,0
1902,1,179.302790,55.345056,25.206500,3,5,1,0,0
1903,1,68.748220,55.229844,25.073858,1,2,1,0,0


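In [None]:
# Same steps as __varcharProcessing__, run one at a time
X = pd.get_dummies(X, drop_first=True, dtype=int)
X["intercept"] = 1

In [None]:
# Move the "intercept" column from the last position to the first
cols = X.columns.tolist()
cols = cols[-1:] + cols[:-1]
X = X[cols]
X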

In [None]:
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.698
Model:                            OLS   Adj. R-squared:                  0.697
Method:                 Least Squares   F-statistic:                     547.4
Date:                Wed, 20 Jul 2022   Prob (F-statistic):               0.00
Time:                        20:52:13   Log-Likelihood:                -29918.
No. Observations:                1905   AIC:                         5.985e+04
Df Residuals:                    1896   BIC:                         5.990e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
intercept       -6.207e+07   2.99e+07     

In [None]:
model.aic

59854.027050178316

In [None]:
model.bic

59903.99718576636

In [None]:
model.rsquared_adj

0.6965926130210287
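These three values can be checked by hand from the summary above. A small sketch, assuming the usual definitions $AIC = -2\ell + 2k$, $BIC = -2\ell + k\ln(n)$ and $\widehat{R}^2 = 1 - (1-R^2)\frac{n-1}{n-k}$, where $\ell$ is the log-likelihood and $k$ counts the estimated coefficients including the intercept. The inputs are the rounded figures printed in the summary, so the results match only approximately:

```python
import math

n = 1905          # No. Observations
k = 9             # estimated coefficients: intercept + 8 (Df Model = 8)
log_lik = -29918  # Log-Likelihood, as rounded in the summary
r2 = 0.698        # R-squared, as rounded in the summary

aic = -2 * log_lik + 2 * k
bic = -2 * log_lik + k * math.log(n)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)

print(round(aic, 2), round(bic, 2), round(adj_r2, 4))
```

The results agree with `model.aic`, `model.bic` and `model.rsquared_adj` up to the rounding of the summary values.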

In [None]:
def __forwardSelectionRaw__(X, y, model_type ="linear",elimination_criteria = "aic", sl=0.05):

    iterations_log = ""
    cols = X.columns.tolist()
    
    def regressor(y,X, model_type=model_type):
        if model_type == "linear":
            regressor = sm.OLS(y, X).fit()
        elif model_type == "logistic":
            regressor = sm.Logit(y, X).fit()
        else:
            print("\nWrong Model Type : "+ model_type +"\nLinear model type is selected.")
            model_type = "linear"
            regressor = sm.OLS(y, X).fit()
        return regressor
    
    selected_cols = ["intercept"]
    other_cols = cols.copy()
    other_cols.remove("intercept")
    
    model = regressor(y, X[selected_cols])
    
    if elimination_criteria == "aic":
        criteria = model.aic
    elif elimination_criteria == "bic":
        criteria = model.bic
    elif elimination_criteria == "r2" and model_type =="linear":
        criteria = model.rsquared
    elif elimination_criteria == "adjr2" and model_type =="linear":
        criteria = model.rsquared_adj
    
    
    for i in range(X.shape[1]):
        # DataFrame.append was removed in pandas 2.0: collect the rows, then build the frame
        rows = []
        for j in other_cols:
            model = regressor(y, X[selected_cols+[j]])
            rows.append([j, model.pvalues[j]])
        pvals = pd.DataFrame(rows, columns = ["Cols","Pval"])
        pvals = pvals.sort_values(by = ["Pval"]).reset_index(drop=True)
        pvals = pvals[pvals.Pval<=sl]
        if pvals.shape[0] > 0:
            
            model = regressor(y, X[selected_cols+[pvals["Cols"][0]]])
            iterations_log += str("\nEntered : "+pvals["Cols"][0] + "\n")    
            iterations_log += "\n\n"+str(model.summary())+"\nAIC: "+ str(model.aic) + "\nBIC: "+ str(model.bic)+"\n\n"
                    
        
            if  elimination_criteria == "aic":
                new_criteria = model.aic
                if new_criteria < criteria:
                    print("Entered :", pvals["Cols"][0], "\tAIC :", model.aic)
                    selected_cols.append(pvals["Cols"][0])
                    other_cols.remove(pvals["Cols"][0])
                    criteria = new_criteria
                else:
                    print("break : Criteria")
                    break
            elif  elimination_criteria == "bic":
                new_criteria = model.bic
                if new_criteria < criteria:
                    print("Entered :", pvals["Cols"][0], "\tBIC :", model.bic)
                    selected_cols.append(pvals["Cols"][0])
                    other_cols.remove(pvals["Cols"][0])
                    criteria = new_criteria
                else:
                    print("break : Criteria")
                    break        
            elif  elimination_criteria == "r2" and model_type =="linear":
                new_criteria = model.rsquared
                if new_criteria > criteria:
                    print("Entered :", pvals["Cols"][0], "\tR2 :", model.rsquared)
                    selected_cols.append(pvals["Cols"][0])
                    other_cols.remove(pvals["Cols"][0])
                    criteria = new_criteria
                else:
                    print("break : Criteria")
                    break           
            elif  elimination_criteria == "adjr2" and model_type =="linear":
                new_criteria = model.rsquared_adj
                if new_criteria > criteria:
                    print("Entered :", pvals["Cols"][0], "\tAdjR2 :", model.rsquared_adj)
                    selected_cols.append(pvals["Cols"][0])
                    other_cols.remove(pvals["Cols"][0])
                    criteria = new_criteria
                else:
                    print("Break : Criteria")
                    break
            else:
                print("Entered :", pvals["Cols"][0])
                selected_cols.append(pvals["Cols"][0])
                other_cols.remove(pvals["Cols"][0])            
                
        else:
            print("Break : Significance Level")
            break
        
    model = regressor(y, X[selected_cols])
    if elimination_criteria == "aic":
        criteria = model.aic
    elif elimination_criteria == "bic":
        criteria = model.bic
    elif elimination_criteria == "r2" and model_type =="linear":
        criteria = model.rsquared
    elif elimination_criteria == "adjr2" and model_type =="linear":
        criteria = model.rsquared_adj
    
    print(model.summary())
    print("AIC: "+str(model.aic))
    print("BIC: "+str(model.bic))
    print("Final Variables:", selected_cols)

    return selected_cols, iterations_log
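The four branches inside the loop differ only in which attribute of the fitted model is read and in whether a lower or a higher value counts as an improvement. One possible way to express that is sketched below. It is not part of the original function: `CRITERIA`, `criterion_value` and `improves` are hypothetical names, and a `SimpleNamespace` stands in for a fitted model so that the snippet runs on its own. The attributes `aic`, `bic`, `rsquared` and `rsquared_adj` are the ones already used above.

```python
from types import SimpleNamespace

# criterion name -> (attribute read from the fitted model, True if lower is better)
CRITERIA = {
    "aic": ("aic", True),
    "bic": ("bic", True),
    "r2": ("rsquared", False),
    "adjr2": ("rsquared_adj", False),
}


def criterion_value(model, name):
    """Read the requested criterion from a fitted model object."""
    attribute, _ = CRITERIA[name]
    return getattr(model, attribute)


def improves(name, new, old):
    """True when `new` is strictly better than `old` under criterion `name`."""
    _, lower_is_better = CRITERIA[name]
    return new < old if lower_is_better else new > old


# Stand-in for a fitted model (values rounded from the four-predictor summary in this article)
fake_model = SimpleNamespace(aic=59853.04, bic=5.988e+04, rsquared=0.697, rsquared_adj=0.696)

# Did adding the fourth predictor lower the AIC of the three-predictor model?
print(improves("aic", criterion_value(fake_model, "aic"), 59856.41))
```

With helpers of this kind, the body of the loop needs a single `if improves(...)` test instead of one branch per criterion.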

### forward AIC

In [None]:
X = __varcharProcessing__(X , varchar_process = "dummy_dropfirst")

Character Variables (Dummies Generated, First Dummies Dropped): []




In [None]:
X

Unnamed: 0,intercept,size_in_m_2,longitude,latitude,no_of_bedrooms,no_of_bathrooms,quality_1,quality_2,quality_3
0,1,100.242337,55.138932,25.113208,1,2,1,0,0
1,1,146.972546,55.151201,25.106809,2,2,1,0,0
2,1,181.253753,55.137728,25.063302,3,5,1,0,0
3,1,187.664060,55.341761,25.227295,2,3,0,0,0
4,1,47.101821,55.139764,25.114275,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...
1900,1,100.985561,55.310712,25.176892,2,2,0,0,1
1901,1,70.606280,55.276684,25.166145,1,2,1,0,0
1902,1,179.302790,55.345056,25.206500,3,5,1,0,0
1903,1,68.748220,55.229844,25.073858,1,2,1,0,0


In [None]:
cols = X.columns.tolist()
cols

['intercept',
 'size_in_m_2',
 'longitude',
 'latitude',
 'no_of_bedrooms',
 'no_of_bathrooms',
 'quality_1',
 'quality_2',
 'quality_3']

In [None]:
selected_cols = ["intercept"]
selected_cols

['intercept']

In [None]:
other_cols = cols.copy()
other_cols

['intercept',
 'size_in_m_2',
 'longitude',
 'latitude',
 'no_of_bedrooms',
 'no_of_bathrooms',
 'quality_1',
 'quality_2',
 'quality_3']

In [None]:
other_cols.remove("intercept")

In [None]:
other_cols

['size_in_m_2',
 'longitude',
 'latitude',
 'no_of_bedrooms',
 'no_of_bathrooms',
 'quality_1',
 'quality_2',
 'quality_3']

In [None]:
X[selected_cols]

Unnamed: 0,intercept
0,1
1,1
2,1
3,1
4,1
...,...
1900,1
1901,1
1902,1
1903,1


In [None]:
X

model = sm.OLS(y, X).fit()


In [None]:
y

0       2700000
1       2850000
2       1150000
3       2850000
4       1729200
         ...   
1900    1500000
1901    1230000
1902    2900000
1903     675000
1904     760887
Name: price, Length: 1905, dtype: int64

In [None]:
((y - model.predict(X)).abs()).mean()

938065.2280944814

In [None]:
def forward_AIC_pvalue(X, y, sl=0.05, varchar_process="dummy_dropfirst"):

    # Add the intercept column and dummy-encode categorical predictors
    X = __varcharProcessing__(X, varchar_process=varchar_process)

    selected_cols = ["intercept"]
    other_cols = [c for c in X.columns.tolist() if c != "intercept"]

    # AIC of the null model (intercept only) is the starting reference
    criteria = sm.OLS(y, X[selected_cols]).fit().aic

    for i in range(X.shape[1]):

        # p-value of each candidate when it is added to the current model
        rows = []
        for j in other_cols:
            model = sm.OLS(y, X[selected_cols + [j]]).fit()
            rows.append([j, model.pvalues[j]])

        pvals = pd.DataFrame(rows, columns=["Cols", "Pval"])
        pvals = pvals.sort_values(by=["Pval"]).reset_index(drop=True)
        pvals = pvals[pvals.Pval <= sl]

        if pvals.shape[0] == 0:
            print("Break : Significance Level")
            break

        # The most significant candidate enters only if it lowers the AIC
        best = pvals["Cols"][0]
        model = sm.OLS(y, X[selected_cols + [best]]).fit()

        if model.aic < criteria:
            print("Entered :", best, "\tAIC :", model.aic)
            selected_cols.append(best)
            other_cols.remove(best)
            criteria = model.aic
        else:
            print("break : Criteria")
            break

    model = sm.OLS(y, X[selected_cols]).fit()

    print(model.summary())
    print("Final Variables:", selected_cols)

    return selected_cols

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
forward_AIC_pvalue (X, y, sl=0.05 , varchar_process="dummy_dropfirst") 

Character Variables (Dummies Generated, First Dummies Dropped): []
Entered : size_in_m_2 	AIC : 60099.2534671769
Entered : no_of_bedrooms 	AIC : 59917.92723088584
Entered : latitude 	AIC : 59856.41201860482
Entered : longitude 	AIC : 59853.0413699519
Break : Significance Level
                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.697
Model:                            OLS   Adj. R-squared:                  0.696
Method:                 Least Squares   F-statistic:                     1091.
Date:                Wed, 20 Jul 2022   Prob (F-statistic):               0.00
Time:                        23:17:40   Log-Likelihood:                -29922.
No. Observations:                1905   AIC:                         5.985e+04
Df Residuals:                    1900   BIC:                         5.988e+04
Df Model:                           4                                         
Covariance 

['intercept', 'size_in_m_2', 'no_of_bedrooms', 'latitude', 'longitude']

In [None]:
pvals = pd.DataFrame(columns = ["Cols","Pval"])

In [None]:
pvals

Unnamed: 0,Cols,Pval


In [None]:
print( sm.OLS(y, X[ selected_cols+['size_in_m_2']] ).fit().summary() )

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.654
Model:                            OLS   Adj. R-squared:                  0.654
Method:                 Least Squares   F-statistic:                     3594.
Date:                Wed, 20 Jul 2022   Prob (F-statistic):               0.00
Time:                        21:06:03   Log-Likelihood:                -30048.
No. Observations:                1905   AIC:                         6.010e+04
Df Residuals:                    1903   BIC:                         6.011e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
intercept   -1.658e+06   7.38e+04    -22.478      

In [None]:
pd.concat([pvals, pd.DataFrame([[ 'size_in_m_2' , model.pvalues['size_in_m_2'] ]], columns = ["Cols","Pval"])], ignore_index=True)


Unnamed: 0,Cols,Pval
0,size_in_m_2,0.0


In [None]:
rows = []
for j in other_cols:
    model = sm.OLS(y, X[ selected_cols + [j]] ).fit()
    rows.append([ j , model.pvalues[j] ])

pvals = pd.DataFrame(rows, columns = ["Cols","Pval"])


In [None]:
pvals

Unnamed: 0,Cols,Pval
0,size_in_m_2,0.0
1,longitude,0.5018901
2,latitude,7.668705e-20
3,no_of_bedrooms,4.867487e-129
4,no_of_bathrooms,2.688943e-122
5,quality_1,0.1298061
6,quality_2,0.01291513
7,quality_3,0.0002262572


In [None]:
pvals = pvals.sort_values(by = ["Pval"]).reset_index(drop=True)
pvals

Unnamed: 0,Cols,Pval
0,size_in_m_2,0.0
1,no_of_bedrooms,4.867487e-129
2,no_of_bathrooms,2.688943e-122
3,latitude,7.668705e-20
4,quality_3,0.0002262572
5,quality_2,0.01291513
6,quality_1,0.1298061
7,longitude,0.5018901


In [None]:
pvals = pvals[pvals.Pval<=0.05]
pvals

Unnamed: 0,Cols,Pval
0,size_in_m_2,0.0
1,no_of_bedrooms,4.867487e-129
2,no_of_bathrooms,2.688943e-122
3,latitude,7.668705e-20
4,quality_3,0.0002262572
5,quality_2,0.01291513


In [None]:
pvals.shape[0]

6

In [None]:
[pvals["Cols"][0]]

['size_in_m_2']
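The screening step walked through in the cells above (sort by p-value, keep the candidates at or below the significance level, take the first) can be reproduced without pandas. A short sketch using the p-values printed above; the dictionary `pvalues` and the list `eligible` are names introduced here only for illustration:

```python
# p-values copied from the table above, each from a model with the
# intercept plus that single candidate
pvalues = {
    "size_in_m_2": 0.0,
    "longitude": 0.5018901,
    "latitude": 7.668705e-20,
    "no_of_bedrooms": 4.867487e-129,
    "no_of_bathrooms": 2.688943e-122,
    "quality_1": 0.1298061,
    "quality_2": 0.01291513,
    "quality_3": 0.0002262572,
}
sl = 0.05  # significance level

# Sort candidates by p-value and keep those at or below the significance level
ranked = sorted(pvalues.items(), key=lambda item: item[1])
eligible = [name for name, p in ranked if p <= sl]

print(len(eligible), eligible[0])
```

Six candidates pass the filter and `size_in_m_2` is the first to be tried, which matches the outputs of `pvals.shape[0]` and `pvals["Cols"][0]` above.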

In [None]:
def forward_AIC_train_error_MAD(X, y, varchar_process="dummy_dropfirst"):

    # Add the intercept column and dummy-encode categorical predictors
    X = __varcharProcessing__(X, varchar_process=varchar_process)

    selected_cols = ["intercept"]
    other_cols = [c for c in X.columns.tolist() if c != "intercept"]

    # AIC of the null model (intercept only) is the starting reference
    criteria = sm.OLS(y, X[selected_cols]).fit().aic

    # Stop when no candidate is left or when the AIC no longer improves
    while other_cols:

        # Train error (mean absolute deviation) of each model with one more predictor
        rows = []
        for j in other_cols:
            model = sm.OLS(y, X[selected_cols + [j]]).fit()
            train_error_MAD = (y - model.predict(X[selected_cols + [j]])).abs().mean()
            rows.append([j, train_error_MAD])

        train_errors = pd.DataFrame(rows, columns=["Cols", "train_error_MAD"])
        train_errors = train_errors.sort_values(by=["train_error_MAD"]).reset_index(drop=True)

        # The candidate with the lowest train error enters only if it lowers the AIC
        best = train_errors["Cols"][0]
        model = sm.OLS(y, X[selected_cols + [best]]).fit()

        if model.aic < criteria:
            print("Entered :", best, "\tAIC :", model.aic)
            selected_cols.append(best)
            other_cols.remove(best)
            criteria = model.aic
        else:
            print("break : Criteria")
            break

    model = sm.OLS(y, X[selected_cols]).fit()

    print(model.summary())
    print("Final Variables:", selected_cols)

    return selected_cols

In [None]:
forward_AIC_train_error_MAD (X,y, varchar_process="dummy_dropfirst")

Character Variables (Dummies Generated, First Dummies Dropped): []
Entered : size_in_m_2 	AIC : 60099.2534671769
Entered : no_of_bedrooms 	AIC : 59917.92723088584
Entered : latitude 	AIC : 59856.41201860482
Entered : longitude 	AIC : 59853.0413699519
Entered : quality_2 	AIC : 59852.348315523595
break : Criteria
                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.697
Model:                            OLS   Adj. R-squared:                  0.696
Method:                 Least Squares   F-statistic:                     874.4
Date:                Wed, 20 Jul 2022   Prob (F-statistic):               0.00
Time:                        23:22:18   Log-Likelihood:                -29920.
No. Observations:                1905   AIC:                         5.985e+04
Df Residuals:                    1899   BIC:                         5.989e+04
Df Model:                           5                 

['intercept',
 'size_in_m_2',
 'no_of_bedrooms',
 'latitude',
 'longitude',
 'quality_2']


## Bibliography

https://nathancarter.github.io/how2data/site/

https://github.com/talhahascelik/python_stepwiseSelection/blob/master/stepwiseSelection.py

http://www.science.smith.edu/~jcrouser/SDS293/labs/lab8-py.html

https://statweb.stanford.edu/~jtaylo/courses/stats203/notes/selection.pdf

Linear Models with R. Julian Faraway.