In [None]:
import pandas as pd

from dfply import *

import warnings
warnings.filterwarnings('ignore')

url = 'https://raw.githubusercontent.com/FabioScielzoOrtiz/Estadistica4all-blog/main/Linear%20Regression%20in%20Python%20and%20R/properties_data.csv'

data_Python = pd.read_csv(url)

data_Python['size_in_m_2'] = 0.092903*data_Python['size_in_sqft']
data_Python['price_per_m_2'] = data_Python['price_per_sqft']/0.092903

## F-test: test to compare models

\

The ANOVA test and the significance test are particular cases of a more
general test that is useful for comparing nested linear regression models,
under the usual assumptions of the linear regression model.

-   We have a linear regression model $\Omega_k$ with $k$
    coefficients.

-   We have another linear regression model $\omega_q$ with only
    $q<k$ of the coefficients of $\Omega_k$.

-   $\omega_q$ is a sub-model of $\Omega_k$, which we denote as
    $\omega_q \subset \Omega_k$.
    
    \

The hypothesis test we want to carry out is the following:

```{=tex}
\begin{gather*}
H_0: \omega_q \\
H_1: \Omega_k
\end{gather*}
```

Rejecting $H_0$ means that $\Omega_k$ is a significantly better model than $\omega_q$,
while not rejecting $H_0$ means that there is no evidence that $\Omega_k$ is better than $\omega_q$.


Now we have to determine a rule for deciding whether to reject $H_0$ in favor of $H_1$.

A first approach is the following:

-   If $RSS_{\omega_q} - RSS_{\Omega_k}$ is small, then the
    predictions of the smaller model are almost as good as those of the
    larger model, so we would prefer the smaller model on the grounds of
    simplicity.

-   If $RSS_{\omega_q} - RSS_{\Omega_k}$ is large, then the
    predictions of the smaller model are much worse than those of the
    larger model, so we would prefer the larger model.

This suggests that something like

```{=tex}
\begin{gather*}
\dfrac{RSS_{\omega_q} - RSS_{\Omega_k}}{RSS_{\Omega_k}}
\end{gather*}
```
would be a potentially good test statistic, where the denominator is used for scaling purposes.

\

### Statistic Test

Finally, building on the previous expression, we obtain the test
statistic known as the F-statistic, which under $H_0$ follows an $F$
distribution with $k-q$ and $n-k$ degrees of freedom:

```{=tex}
\begin{gather*}
F=\dfrac{(RSS_{\omega_q} - RSS_{\Omega_k})/(k-q)}{RSS_{\Omega_k}/(n-k)} \sim F_{k-q, n-k}
\end{gather*}
```

Where:

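$k$ is the number of coefficients of the model $\Omega_k$

$q$ is the number of coefficients of the model $\omega_q$

$n$ is the number of observations

We reject $H_0$ at significance level $\alpha$ when $F$ exceeds the $1-\alpha$ quantile of the $F_{k-q, n-k}$ distribution, that is, when the p-value is below $\alpha$.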


The beauty of this approach is that you only need to know the general form.
In any particular case, you just need to figure out which models
represent the null and alternative hypotheses, fit them, and compute the test statistic.
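
As a minimal sketch of this recipe in Python, the general F-test can be computed from the two residual sums of squares. The helper names `rss` and `nested_f_test` and the synthetic data are assumptions made for illustration (they are not part of the properties data set used in this post); the fits use `numpy.linalg.lstsq` and the p-value comes from `scipy.stats.f.sf`.

```python
import numpy as np
from scipy import stats


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta_hat
    return float(residuals @ residuals)


def nested_f_test(X_small, X_big, y):
    """F-test of the sub-model (H_0) against the larger model (H_1).

    Both design matrices must already contain the intercept column, and
    the columns of X_small must be a subset of those of X_big.
    """
    n, k = X_big.shape
    q = X_small.shape[1]
    rss_small, rss_big = rss(X_small, y), rss(X_big, y)
    f_stat = ((rss_small - rss_big) / (k - q)) / (rss_big / (n - k))
    p_value = stats.f.sf(f_stat, k - q, n - k)  # P(F_{k-q, n-k} > f_stat)
    return f_stat, p_value


# Synthetic data (hypothetical): y depends on x1 and x2, but not on x3.
rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

ones = np.ones(n)
X_big = np.column_stack([ones, x1, x2, x3])  # Omega_k with k = 4
X_small = np.column_stack([ones, x1])        # omega_q with q = 2

f_stat, p_value = nested_f_test(X_small, X_big, y)
print(f"F = {f_stat:.3f}, p-value = {p_value:.4g}")
```

Because the sub-model drops `x2`, which does affect `y`, a small p-value is expected here, leading to the rejection of $H_0$ in favor of the larger model.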

\

#### F-test in R

## ANOVA test as an F-test

\

Remember that the hypotheses of the ANOVA test are:

\begin{gather*}
H_0: \beta_1=\beta_2=...=\beta_p=0 \\
H_1: \exists \ j=1,...,p , \ \beta_j \neq 0
\end{gather*}

\

Let us consider the following models:

-   $\Omega_k \ : \  \ y_i = \beta_0 + \beta_1\cdot x_{i1} +...+ \beta_{p}\cdot x_{ip} + \varepsilon_i$

-   $\omega_q \ : \ \ y_i = \beta_0 + \varepsilon_i$ (The
    Null Model)

Then, the ANOVA test is equivalent to the following:

\begin{gather*}
H_0: \ y_i = \beta_0 + \varepsilon_i \ ( \omega_q ) \\
H_1: \ y_i = \beta_0 + \beta_1\cdot x_{i1} +...+ \beta_{p}\cdot x_{ip} + \varepsilon_i \ ( \Omega_k )
\end{gather*}


And we also have the following facts:

- $k=p+1$

- $q=1$

-   $RSS_{\Omega_k} = \sum_{i=1}^n ( y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left( y_i - ( \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1} +...+ \hat{\beta}_{p}\cdot x_{ip} ) \right)^2$

-   $RSS_{\omega_q} = \sum_{i=1}^n ( y_i - \hat{y}_i)^2 = \sum_{i=1}^n ( y_i - \hat{\beta}_0 )^2$

-   Note that in the null model $\hat{\beta}_0=\overline{y}$, therefore we have $RSS_{\omega_q}=\sum_{i=1}^n ( y_i - \overline{y} )^2= TSS_{\omega_q}= TSS_{\Omega_k}=TSS$

\

Using these facts and the F-statistic, we get the test statistic of the ANOVA test:


\begin{gather*}
F=\dfrac{(RSS_{\omega_q} - RSS_{\Omega_k})/(k-q)}{RSS_{\Omega_k}/(n-k)} = \dfrac{(TSS-RSS)/(k-1)}{RSS/(n-k)} \sim F_{k-1, n-k}
\end{gather*}

\

Where:

$TSS= RSS_{\omega_q}=\sum_{i=1}^n ( y_i - \overline{y} )^2$

$RSS= RSS_{\Omega_k} = \sum_{i=1}^n \left( y_i - ( \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1} +...+ \hat{\beta}_{p}\cdot x_{ip} ) \right)^2$
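
The facts above can be checked numerically. The following sketch uses synthetic data (an assumption for illustration, not the properties data set) to confirm that the RSS of the null model equals $TSS$, and then computes the ANOVA F-statistic and its p-value. The helper `rss` is a hypothetical name introduced only for this example.

```python
import numpy as np
from scipy import stats


def rss(design, response):
    """Residual sum of squares of the least-squares fit."""
    beta_hat, *_ = np.linalg.lstsq(design, response, rcond=None)
    residuals = response - design @ beta_hat
    return float(residuals @ residuals)


# Synthetic data (hypothetical): p = 3 predictors, the second has no effect.
rng = np.random.default_rng(1)
n, p = 150, 3
X = rng.normal(size=(n, p))
y = 0.5 + X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), X])  # Omega_k with k = p + 1
X_null = np.ones((n, 1))                   # omega_q with q = 1 (intercept only)
k = p + 1

rss_full = rss(X_full, y)
rss_null = rss(X_null, y)
tss = float(np.sum((y - y.mean()) ** 2))

# In the null model the fitted value is the sample mean, so RSS_null = TSS.
print(np.isclose(rss_null, tss))

f_anova = ((tss - rss_full) / (k - 1)) / (rss_full / (n - k))
p_value = stats.f.sf(f_anova, k - 1, n - k)
print(f"F = {f_anova:.3f}, p-value = {p_value:.4g}")
```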


\

### Anova test as an F-test in R

\










\

## Significance test as an F-test

\


Remember that the hypotheses of the significance test of $\beta_j$ are:

\begin{gather*}
H_0: \beta_j=0 \\
H_1: \beta_j \neq 0
\end{gather*}

Let us consider the following models:

-   $\omega_q \ : \  \ y_i = \beta_0 + \beta_1\cdot x_{i1} +..+\beta_{j-1} \cdot x_{i,j-1}+\beta_{j+1} \cdot x_{i,j+1}+..+ \beta_{p}\cdot x_{ip} + \varepsilon_i$

-   $\Omega_k \ : \ \ y_i = \beta_0 + \beta_1\cdot x_{i1} +..+\beta_j \cdot x_{ij}+..+ \beta_{p}\cdot x_{ip} + \varepsilon_i$
    
\

Then, the significance test of $\beta_j$ is equivalent to the following:

\begin{gather*}
H_0: y_i = \beta_0 + \beta_1\cdot x_{i1} +..+\beta_{j-1} \cdot x_{i,j-1}+\beta_{j+1} \cdot x_{i,j+1}+..+ \beta_{p}\cdot x_{ip} + \varepsilon_i \ ( \omega_q ) \\
H_1: y_i = \beta_0 + \beta_1\cdot x_{i1} +..+\beta_j \cdot x_{ij}+..+ \beta_{p}\cdot x_{ip} + \varepsilon_i \ ( \Omega_k )
\end{gather*}


And we also have the following facts:

- $k=p+1$

- $q=k-1=p$

-   $RSS_{\omega_q} = \sum_{i=1}^n ( y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left( y_i - ( \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1} +...+ \hat{\beta}_{j-1} \cdot x_{i,j-1} + \hat{\beta}_{j+1} \cdot x_{i,j+1} +...+ \hat{\beta}_{p}\cdot x_{ip} ) \right)^2$

-   $RSS_{\Omega_k} = \sum_{i=1}^n ( y_i - \hat{y}_i)^2 = \sum_{i=1}^n\left( y_i -  ( \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1}  +... +  \hat{\beta}_{j} \cdot x_{i,j}+...+ \hat{\beta}_{p}\cdot x_{ip} )  \right)^2$

\

So the test statistic is obtained by applying the F-statistic formula:

\begin{gather*}
F=\dfrac{(RSS_{\omega_q} - RSS_{\Omega_k})/(k-q)}{RSS_{\Omega_k}/(n-k)} \sim F_{k-q, n-k}
\end{gather*}



\

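When a single coefficient is tested, the F-test is equivalent to the alternative t-test: $F = t_j^2$, the square of the t-statistic of $\beta_j$, and both tests give the same p-value.

This is also the way to determine the significance of a categorical variable: compare the model without the categorical variable against the model with it. A categorical variable with more than two categories enters the model through more than one dummy coefficient, so $k-q>1$ and a single t-test is not enough.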
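
The relation between the F-test and the t-test for a single coefficient can be verified with a short sketch. The data are synthetic and the helper `fit` is a hypothetical name (both are assumptions for illustration); the t-statistic is computed from the usual estimate $\hat{\sigma}^2 (X^tX)^{-1}$ of the covariance matrix of the coefficients.

```python
import numpy as np
from scipy import stats


def fit(design, response):
    """Least-squares coefficients and residual sum of squares."""
    beta_hat, *_ = np.linalg.lstsq(design, response, rcond=None)
    residuals = response - design @ beta_hat
    return beta_hat, float(residuals @ residuals)


# Synthetic data (hypothetical): we test the coefficient of the 2nd predictor.
rng = np.random.default_rng(2)
n, p, j = 120, 3, 2
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([0.8, 0.3, -1.2]) + rng.normal(size=n)

X_big = np.column_stack([np.ones(n), X])  # Omega_k: column 0 is the intercept
X_small = np.delete(X_big, j, axis=1)     # omega_q: same model without x_j
k, q = X_big.shape[1], X_small.shape[1]

beta_big, rss_big = fit(X_big, y)
_, rss_small = fit(X_small, y)

# F-test comparing the two nested models (k - q = 1).
f_stat = ((rss_small - rss_big) / (k - q)) / (rss_big / (n - k))
p_f = stats.f.sf(f_stat, k - q, n - k)

# t-test of beta_j in the larger model.
sigma2_hat = rss_big / (n - k)
cov_beta = sigma2_hat * np.linalg.inv(X_big.T @ X_big)
t_stat = beta_big[j] / np.sqrt(cov_beta[j, j])
p_t = 2 * stats.t.sf(abs(t_stat), n - k)

print(f"F = {f_stat:.4f}, t^2 = {t_stat**2:.4f}")
print(f"p-value (F) = {p_f:.6f}, p-value (t) = {p_t:.6f}")
```

For a categorical variable, the same comparison applies with `X_small` obtained by deleting all of its dummy columns at once, so that $k-q$ equals the number of dummies removed.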

### Best Subset Selection