In [2]:
import pandas as pd

In [3]:
data_Python = pd.read_csv('data_Python_copy.csv')

In [4]:
data_Python['quality'] = data_Python['quality'].astype('category')

In [5]:
data_Python.dtypes

price                 int64
size_in_m_2         float64
longitude           float64
latitude            float64
no_of_bedrooms        int64
no_of_bathrooms       int64
quality            category
dtype: object

## F-test: test to compare models

\

The ANOVA test and the significance test are particular cases of a more
general test that is useful for comparing nested linear regression models,
under the assumptions of the linear regression model.

-   We have a linear regression model with $k$ coefficients
    $\Omega_k$

-   We have another linear regression model $\omega_q$ with only
    $q<k$ of the coefficients of $\Omega_k$

-   $\omega_q$ is a sub-model of $\Omega_k$ , we can denote this as
    $\omega_q \subset \Omega_k$
    
    \

The hypothesis test we want to carry out is the following:

```{=tex}
\begin{gather*}
H_0: \omega_q \\
H_1: \Omega_k
\end{gather*}
```

Rejecting $H_0$ means that $\Omega_k$ is a better model than $\omega_q$,
and not rejecting $H_0$ means that $\Omega_k$ is not a better model than $\omega_q$.


Now we have to determine a rule for deciding whether to reject $H_0$ in favor of $H_1$.

A first approach is the following:

-   If $RSS_{\omega_q} - RSS_{\Omega_k}$ is small, then the
    predictions of the smaller model are almost as good as the larger
    model, so we would prefer the smaller model on the grounds of
    simplicity.

-   If $RSS_{\omega_q} - RSS_{\Omega_k}$ is large, then the
    predictions of the smaller model are much worse than the larger
    model, so we would prefer the larger model.

This suggests that something like

```{=tex}
\begin{gather*}
\dfrac{RSS_{\omega_q} - RSS_{\Omega_k}}{RSS_{\Omega_k}}
\end{gather*}
```
would be a potentially good test statistic, where the denominator is used for scaling purposes.

\

### Test statistic

Finally, we arrive at a test statistic based on the previous
expression, called the F-statistic, which follows an F distribution under $H_0$:

```{=tex}
\begin{gather*}
F=\dfrac{(RSS_{\omega_q} - RSS_{\Omega_k})/(k-q)}{RSS_{\Omega_k}/(n-k)} \sim F_{k-q, n-k}
\end{gather*}
```

Where:

$n$ is the number of observations

$k$ is the number of coefficients of the model $\Omega_k$

$q$ is the number of coefficients of the model $\omega_q$


The beauty of this approach is that you only need to know the general form.
In any particular case, you just need to figure out which models
represent the null and alternative hypotheses, fit them, and compute the test statistic.
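The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not part of the original analysis: the helper names `rss` and `f_statistic` and the simulated variables are assumptions made for the example.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def f_statistic(y, X_small, X_big):
    """F-statistic for H0: small model against H1: big model (nested designs)."""
    n, k = X_big.shape
    q = X_small.shape[1]
    rss_small, rss_big = rss(y, X_small), rss(y, X_big)
    F = ((rss_small - rss_big) / (k - q)) / (rss_big / (n - k))
    return F, k - q, n - k

# Synthetic data (illustrative only): y depends on x1 but not on x2.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

ones = np.ones(n)
X_big = np.column_stack([ones, x1, x2])   # Omega_k with k = 3
X_small = np.column_stack([ones, x1])     # omega_q with q = 2

F, df1, df2 = f_statistic(y, X_small, X_big)
print(F, df1, df2)
```

The p-value is the upper tail of the $F_{k-q, n-k}$ distribution at the observed value, which can be obtained for instance with `scipy.stats.f.sf(F, df1, df2)`.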

\

#### F-test in R

## ANOVA test as an F-test

\

Remember that the hypotheses of the ANOVA test are these (the intercept $\beta_0$ is not restricted):

\begin{gather*}
H_0: \beta_1=\beta_2=...=\beta_p=0 \\
H_1: \exists \ j=1,...,p , \ \beta_j \neq 0
\end{gather*}

\

Let us consider the following models:

-   $\Omega_k \ : \  \ y_i = \beta_0 + \beta_1\cdot x_{i1} +...+ \beta_{p}\cdot x_{ip} + \varepsilon_i$

-   $\omega_q \ : \ \ y_i = \beta_0 + \varepsilon_i$ (The
    Null Model)

Then, the ANOVA test is equivalent to the following:

\begin{gather*}
H_0: \ y_i = \beta_0 + \varepsilon_i \ ( \omega_q ) \\
H_1: \ y_i = \beta_0 + \beta_1\cdot x_{i1} +...+ \beta_{p}\cdot x_{ip} + \varepsilon_i \ ( \Omega_k )
\end{gather*}


And we also have the following facts:

- $k=p+1$

- $q=1$

-   $RSS_{\Omega_k} = \sum_{i=1}^n ( y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left( y_i - ( \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1} +...+ \hat{\beta}_{p}\cdot x_{ip} ) \right)^2$

-   $RSS_{\omega_q} = \sum_{i=1}^n ( y_i - \hat{y}_i)^2 = \sum_{i=1}^n ( y_i - \hat{\beta}_0 )^2$

-   Note that in the null model $\hat{\beta}_0=\overline{y}$, therefore we have $RSS_{\omega_q}=\sum_{i=1}^n ( y_i - \overline{y} )^2= TSS_{\omega_q}= TSS_{\Omega_k}=TSS$

\

Using these facts and the F-statistic we get the test statistic of the ANOVA test:


\begin{gather*}
F=\dfrac{(RSS_{\omega_q} - RSS_{\Omega_k})/(k-q)}{RSS_{\Omega_k}/(n-k)} = \dfrac{(TSS-RSS)/(k-1)}{RSS/(n-k)} \sim F_{k-1, n-k}
\end{gather*}

\

Where:

$TSS= RSS_{\omega_q}=\sum_{i=1}^n ( y_i - \overline{y} )^2$

$RSS= RSS_{\Omega_k} = \sum_{i=1}^n \left( y_i - ( \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1} +...+ \hat{\beta}_{p}\cdot x_{ip} ) \right)^2$


\

### Anova test as an F-test in R

\










\

## Significance test as an F-test

\


Remember that the hypotheses of the significance test of $\beta_j$ are these:

\begin{gather*}
H_0: \beta_j=0 \\
H_1: \beta_j \neq 0
\end{gather*}

Let us consider the following models:

-   $\omega_q \ : \  \ y_i = \beta_0 + \beta_1\cdot x_{i1} +..+\beta_{j-1} \cdot x_{i,j-1}+\beta_{j+1} \cdot x_{i,j+1}+..+ \beta_{p}\cdot x_{ip} + \varepsilon_i$

-   $\Omega_k \ : \ \ y_i = \beta_0 + \beta_1\cdot x_{i1} +..+\beta_j \cdot x_{ij}+..+ \beta_{p}\cdot x_{ip} + \varepsilon_i$
    
\

Then, the significance test of $\beta_j$ is equivalent to the following:

\begin{gather*}
H_0: y_i = \beta_0 + \beta_1\cdot x_{i1} +..+\beta_{j-1} \cdot x_{i,j-1}+\beta_{j+1} \cdot x_{i,j+1}+..+ \beta_{p}\cdot x_{ip} + \varepsilon_i \ ( \omega_q ) \\
H_1: y_i = \beta_0 + \beta_1\cdot x_{i1} +..+\beta_j \cdot x_{ij}+..+ \beta_{p}\cdot x_{ip} + \varepsilon_i \ ( \Omega_k )
\end{gather*}


And we also have the following facts:

- $k=p+1$

- $q=k-1=p$

-   $RSS_{\omega_q} = \sum_{i=1}^n ( y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left( y_i - ( \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1}  +..+ \hat{\beta}_{j-1} \cdot x_{i,j-1} +  \hat{\beta}_{j+1} \cdot x_{i,j+1}+..+ \hat{\beta}_{p}\cdot x_{ip} ) \right)^2$

-   $RSS_{\Omega_k} = \sum_{i=1}^n ( y_i - \hat{y}_i)^2 = \sum_{i=1}^n\left( y_i -  ( \hat{\beta}_0 + \hat{\beta}_1\cdot x_{i1}  +... +  \hat{\beta}_{j} \cdot x_{i,j}+...+ \hat{\beta}_{p}\cdot x_{ip} )  \right)^2$

\

So, the test statistic is obtained by applying the F-statistic formula with $k-q=1$:

\begin{gather*}
F=\dfrac{(RSS_{\omega_q} - RSS_{\Omega_k})/(k-q)}{RSS_{\Omega_k}/(n-k)} = \dfrac{RSS_{\omega_q} - RSS_{\Omega_k}}{RSS_{\Omega_k}/(n-k)} \sim F_{1, n-k}
\end{gather*}



\

For a single coefficient, this F-test is equivalent to the t-test: the F-statistic equals the square of the t-statistic, $F = t_j^2$, and both tests give the same p-value.
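The relation between the two tests can be checked numerically. The sketch below uses synthetic data and plain NumPy (all variable names are illustrative assumptions): it computes the t-statistic of one coefficient in the full model and the F-statistic from the two residual sums of squares, and the F value should match the squared t value.

```python
import numpy as np

# Synthetic data (illustrative only): intercept plus three predictors.
rng = np.random.default_rng(2)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # k = 4 columns
y = X @ np.array([1.0, 0.5, 0.0, -1.0]) + rng.normal(size=n)
k = X.shape[1]
j = 2  # index of the coefficient being tested

# Full model: estimates, residual variance and t-statistic of beta_j.
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
rss_big = float(np.sum((y - X @ beta) ** 2))
sigma2 = rss_big / (n - k)
t_j = beta[j] / np.sqrt(sigma2 * XtX_inv[j, j])

# Reduced model: the same design without column j, so q = k - 1.
X_small = np.delete(X, j, axis=1)
beta_small, *_ = np.linalg.lstsq(X_small, y, rcond=None)
rss_small = float(np.sum((y - X_small @ beta_small) ** 2))

F = (rss_small - rss_big) / (rss_big / (n - k))   # numerator df is k - q = 1
print(F, t_j ** 2)
```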

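The F-test is also the way to determine the significance of a categorical variable: compare the model without the categorical variable against the model with it. A categorical variable with several levels enters the model through several dummy coefficients, so $k-q>1$ and a single t-test is not enough.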
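A possible way to run this comparison with statsmodels is the `compare_f_test` method of a fitted OLS model, which takes the restricted fit and returns the F-statistic, its p-value and the difference in degrees of freedom. The data below are synthetic and the column names `area`, `quality` and `price` are assumptions made for the example, not the notebook's dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (illustrative only): a 4-level categorical variable.
rng = np.random.default_rng(3)
n = 300
levels = rng.integers(0, 4, size=n)
level_effect = np.array([0.0, 10.0, 20.0, 30.0])
df = pd.DataFrame({
    "area": rng.uniform(40, 200, size=n),
    "quality": pd.Categorical(levels),
})
df["price"] = 5.0 * df["area"] + level_effect[levels] + rng.normal(scale=5.0, size=n)

small = smf.ols("price ~ area", data=df).fit()            # omega_q
big = smf.ols("price ~ area + quality", data=df).fit()    # Omega_k

# Four levels give three dummy coefficients, so k - q = 3.
f_value, p_value, df_diff = big.compare_f_test(small)
print(f_value, p_value, df_diff)
```

A small p-value here says that the categorical variable as a whole improves the model, which no individual dummy t-test can tell on its own.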

### Best Subset Selection

In [27]:
import statsmodels.formula.api as smf
import statsmodels.api as sm

In [128]:
model = smf.ols(formula = 'price ~ size_in_m_2 + no_of_bedrooms + no_of_bathrooms + quality + latitude + longitude', data =data_Python)

model = model.fit()
 
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.698
Model:                            OLS   Adj. R-squared:                  0.697
Method:                 Least Squares   F-statistic:                     547.4
Date:                Wed, 20 Jul 2022   Prob (F-statistic):               0.00
Time:                        19:56:10   Log-Likelihood:                -29918.
No. Observations:                1905   AIC:                         5.985e+04
Df Residuals:                    1896   BIC:                         5.990e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept       -6.207e+07   2.99e+07     
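The F-statistic printed in this summary is the ANOVA statistic from the previous sections. Because $TSS - RSS = R^2 \cdot TSS$ and $RSS = (1-R^2)\cdot TSS$, it can be recovered from the rounded figures in the table, as in this small arithmetic check (the variable names are only illustrative):

```python
# Values read from the summary above (R-squared is rounded to 3 decimals).
r2 = 0.698
n = 1905   # No. Observations
k = 9      # Df Model (8) plus the intercept

# TSS cancels in the ratio, so only R^2, n and k are needed.
F_from_r2 = (r2 / (k - 1)) / ((1 - r2) / (n - k))
print(round(F_from_r2, 1))
```

The result is close to the reported 547.4; the small gap comes from using the rounded value of $R^2$.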

In [6]:
X=data_Python[['size_in_m_2', 'longitude', 'latitude', 'no_of_bedrooms', 'no_of_bathrooms', 'quality']]
X

Unnamed: 0,size_in_m_2,longitude,latitude,no_of_bedrooms,no_of_bathrooms,quality
0,100.242337,55.138932,25.113208,1,2,1
1,146.972546,55.151201,25.106809,2,2,1
2,181.253753,55.137728,25.063302,3,5,1
3,187.664060,55.341761,25.227295,2,3,0
4,47.101821,55.139764,25.114275,0,1,1
...,...,...,...,...,...,...
1900,100.985561,55.310712,25.176892,2,2,3
1901,70.606280,55.276684,25.166145,1,2,1
1902,179.302790,55.345056,25.206500,3,5,1
1903,68.748220,55.229844,25.073858,1,2,1


In [7]:
y = data_Python['price']
y

0       2700000
1       2850000
2       1150000
3       2850000
4       1729200
         ...   
1900    1500000
1901    1230000
1902    2900000
1903     675000
1904     760887
Name: price, Length: 1905, dtype: int64

In [9]:
import numpy as np

In [11]:
def __varcharProcessing__(X, varchar_process = "dummy_dropfirst"):
    
    dtypes = X.dtypes
    # Columns stored as Python objects (strings). The np.object alias is
    # deprecated since NumPy 1.20, so the builtin object is used instead.
    varchar_cols = dtypes[dtypes == object].index.tolist()
    if varchar_process == "drop":   
        X = X.drop(columns = varchar_cols)
        print("Character Variables (Dropped):", varchar_cols)
    elif varchar_process == "dummy":
        X = pd.get_dummies(X, drop_first=False, dtype=int)
        print("Character Variables (Dummies Generated):", varchar_cols)
    else: 
        # "dummy_dropfirst" and any unrecognised value: drop the first level
        X = pd.get_dummies(X, drop_first=True, dtype=int)
        print("Character Variables (Dummies Generated, First Dummies Dropped):", varchar_cols)
    
    # Add the constant column and move it to the first position
    X["intercept"] = 1
    cols = X.columns.tolist()
    cols = cols[-1:] + cols[:-1]
    X = X[cols]
    
    return X
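To see what the dummy encoding does to a `category` column such as `quality`, here is a small sketch on a toy frame. The values are made up for illustration; `pd.get_dummies` with `drop_first=True` drops the first level of each encoded column, which then acts as the reference level.

```python
import pandas as pd

# Toy frame (illustrative values) with one numeric and one categorical column.
toy = pd.DataFrame({
    "size_in_m_2": [100.2, 147.0, 181.3, 187.7],
    "quality": pd.Categorical([1, 0, 3, 2]),
})

# One 0/1 column per level except the first one (level 0).
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
encoded.insert(0, "intercept", 1)   # constant column in first position
print(encoded.columns.tolist())
```

Dropping one level avoids perfect collinearity between the dummies and the intercept column.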

In [134]:
__varcharProcessing__(X, varchar_process = "dummy_dropfirst")


Character Variables (Dummies Generated, First Dummies Dropped): []




Unnamed: 0,intercept,size_in_m_2,longitude,latitude,no_of_bedrooms,no_of_bathrooms,quality_1,quality_2,quality_3
0,1,100.242337,55.138932,25.113208,1,2,1,0,0
1,1,146.972546,55.151201,25.106809,2,2,1,0,0
2,1,181.253753,55.137728,25.063302,3,5,1,0,0
3,1,187.664060,55.341761,25.227295,2,3,0,0,0
4,1,47.101821,55.139764,25.114275,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...
1900,1,100.985561,55.310712,25.176892,2,2,0,0,1
1901,1,70.606280,55.276684,25.166145,1,2,1,0,0
1902,1,179.302790,55.345056,25.206500,3,5,1,0,0
1903,1,68.748220,55.229844,25.073858,1,2,1,0,0


X = pd.get_dummies(X,drop_first=True)
X

X["intercept"] = 1

X

cols = X.columns.tolist()

cols

cols[-1:]

cols[:-1]

cols = cols[-1:] + cols[:-1]

X = X[cols]

X

In [28]:
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.698
Model:                            OLS   Adj. R-squared:                  0.697
Method:                 Least Squares   F-statistic:                     547.4
Date:                Wed, 20 Jul 2022   Prob (F-statistic):               0.00
Time:                        20:52:13   Log-Likelihood:                -29918.
No. Observations:                1905   AIC:                         5.985e+04
Df Residuals:                    1896   BIC:                         5.990e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
intercept       -6.207e+07   2.99e+07     

In [29]:
model.aic

59854.027050178316

In [30]:
model.bic

59903.99718576636

In [31]:
model.rsquared_adj

0.6965926130210287

In [None]:
def __forwardSelectionRaw__(X, y, model_type ="linear",elimination_criteria = "aic", sl=0.05):

    iterations_log = ""
    cols = X.columns.tolist()
    
    def regressor(y,X, model_type=model_type):
        if model_type == "linear":
            regressor = sm.OLS(y, X).fit()
        elif model_type == "logistic":
            regressor = sm.Logit(y, X).fit()
        else:
            print("\nWrong Model Type : "+ model_type +"\nLinear model type is selected.")
            model_type = "linear"
            regressor = sm.OLS(y, X).fit()
        return regressor
    
    selected_cols = ["intercept"]
    other_cols = cols.copy()
    other_cols.remove("intercept")
    
    model = regressor(y, X[selected_cols])
    
    if elimination_criteria == "aic":
        criteria = model.aic
    elif elimination_criteria == "bic":
        criteria = model.bic
    elif elimination_criteria == "r2" and model_type =="linear":
        criteria = model.rsquared
    elif elimination_criteria == "adjr2" and model_type =="linear":
        criteria = model.rsquared_adj
    
    
    for i in range(X.shape[1]):
        # Collect the p-value of each candidate when added to the current model.
        # (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list.)
        rows = []
        for j in other_cols:
            model = regressor(y, X[selected_cols+[j]])
            rows.append([j, model.pvalues[j]])
        pvals = pd.DataFrame(rows, columns = ["Cols","Pval"])
        pvals = pvals.sort_values(by = ["Pval"]).reset_index(drop=True)
        pvals = pvals[pvals.Pval<=sl]
        if pvals.shape[0] > 0:
            
            model = regressor(y, X[selected_cols+[pvals["Cols"][0]]])
            iterations_log += str("\nEntered : "+pvals["Cols"][0] + "\n")    
            iterations_log += "\n\n"+str(model.summary())+"\nAIC: "+ str(model.aic) + "\nBIC: "+ str(model.bic)+"\n\n"
                    
        
            if  elimination_criteria == "aic":
                new_criteria = model.aic
                if new_criteria < criteria:
                    print("Entered :", pvals["Cols"][0], "\tAIC :", model.aic)
                    selected_cols.append(pvals["Cols"][0])
                    other_cols.remove(pvals["Cols"][0])
                    criteria = new_criteria
                else:
                    print("break : Criteria")
                    break
            elif  elimination_criteria == "bic":
                new_criteria = model.bic
                if new_criteria < criteria:
                    print("Entered :", pvals["Cols"][0], "\tBIC :", model.bic)
                    selected_cols.append(pvals["Cols"][0])
                    other_cols.remove(pvals["Cols"][0])
                    criteria = new_criteria
                else:
                    print("break : Criteria")
                    break        
            elif  elimination_criteria == "r2" and model_type =="linear":
                new_criteria = model.rsquared
                if new_criteria > criteria:
                    print("Entered :", pvals["Cols"][0], "\tR2 :", model.rsquared)
                    selected_cols.append(pvals["Cols"][0])
                    other_cols.remove(pvals["Cols"][0])
                    criteria = new_criteria
                else:
                    print("break : Criteria")
                    break           
            elif  elimination_criteria == "adjr2" and model_type =="linear":
                new_criteria = model.rsquared_adj
                if new_criteria > criteria:
                    print("Entered :", pvals["Cols"][0], "\tAdjR2 :", model.rsquared_adj)
                    selected_cols.append(pvals["Cols"][0])
                    other_cols.remove(pvals["Cols"][0])
                    criteria = new_criteria
                else:
                    print("Break : Criteria")
                    break
            else:
                print("Entered :", pvals["Cols"][0])
                selected_cols.append(pvals["Cols"][0])
                other_cols.remove(pvals["Cols"][0])            
                
        else:
            print("Break : Significance Level")
            break
        
    model = regressor(y, X[selected_cols])
    if elimination_criteria == "aic":
        criteria = model.aic
    elif elimination_criteria == "bic":
        criteria = model.bic
    elif elimination_criteria == "r2" and model_type =="linear":
        criteria = model.rsquared
    elif elimination_criteria == "adjr2" and model_type =="linear":
        criteria = model.rsquared_adj
    
    print(model.summary())
    print("AIC: "+str(model.aic))
    print("BIC: "+str(model.bic))
    print("Final Variables:", selected_cols)

    return selected_cols, iterations_log

### forward AIC

In [12]:
X = __varcharProcessing__(X , varchar_process = "dummy_dropfirst")

Character Variables (Dummies Generated, First Dummies Dropped): []




In [13]:
X

Unnamed: 0,intercept,size_in_m_2,longitude,latitude,no_of_bedrooms,no_of_bathrooms,quality_1,quality_2,quality_3
0,1,100.242337,55.138932,25.113208,1,2,1,0,0
1,1,146.972546,55.151201,25.106809,2,2,1,0,0
2,1,181.253753,55.137728,25.063302,3,5,1,0,0
3,1,187.664060,55.341761,25.227295,2,3,0,0,0
4,1,47.101821,55.139764,25.114275,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...
1900,1,100.985561,55.310712,25.176892,2,2,0,0,1
1901,1,70.606280,55.276684,25.166145,1,2,1,0,0
1902,1,179.302790,55.345056,25.206500,3,5,1,0,0
1903,1,68.748220,55.229844,25.073858,1,2,1,0,0


In [16]:
cols = X.columns.tolist()
cols

['intercept',
 'size_in_m_2',
 'longitude',
 'latitude',
 'no_of_bedrooms',
 'no_of_bathrooms',
 'quality_1',
 'quality_2',
 'quality_3']

In [23]:
selected_cols = ["intercept"]
selected_cols

['intercept']

In [24]:
other_cols = cols.copy()
other_cols

['intercept',
 'size_in_m_2',
 'longitude',
 'latitude',
 'no_of_bedrooms',
 'no_of_bathrooms',
 'quality_1',
 'quality_2',
 'quality_3']

In [25]:
other_cols.remove("intercept")

In [21]:
other_cols

['size_in_m_2',
 'longitude',
 'latitude',
 'no_of_bedrooms',
 'no_of_bathrooms',
 'quality_1',
 'quality_2',
 'quality_3']

In [22]:
X[selected_cols]

Unnamed: 0,intercept
0,1
1,1
2,1
3,1
4,1
...,...
1900,1
1901,1
1902,1
1903,1


In [None]:
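def forward_AIC(X, y, significance=0.05, varchar_process="dummy_dropfirst"):
    # Encode categorical columns and add the intercept column.
    X = __varcharProcessing__(X, varchar_process=varchar_process)

    iterations_log = ""
    cols = X.columns.tolist()

    # Start from the intercept-only model.
    selected_cols = ["intercept"]
    other_cols = cols.copy()
    other_cols.remove("intercept")

    model = sm.OLS(y, X[selected_cols]).fit()
    criteria = model.aic

    for i in range(X.shape[1]):
        # p-value of each candidate when added to the current model.
        rows = []
        for j in other_cols:
            model = sm.OLS(y, X[selected_cols + [j]]).fit()
            rows.append([j, model.pvalues[j]])
        pvals = pd.DataFrame(rows, columns=["Cols", "Pval"])
        pvals = pvals.sort_values(by=["Pval"]).reset_index(drop=True)
        pvals = pvals[pvals.Pval <= significance]
        if pvals.shape[0] == 0:
            print("Break : Significance Level")
            break

        # Keep the most significant candidate only if it lowers the AIC.
        best = pvals["Cols"][0]
        model = sm.OLS(y, X[selected_cols + [best]]).fit()
        if model.aic >= criteria:
            print("break : Criteria")
            break
        print("Entered :", best, "\tAIC :", model.aic)
        iterations_log += "\nEntered : " + best + "\n"
        selected_cols.append(best)
        other_cols.remove(best)
        criteria = model.aic

    return selected_cols, iterations_log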

In [32]:
pvals = pd.DataFrame(columns = ["Cols","Pval"])

In [33]:
pvals

Unnamed: 0,Cols,Pval


In [35]:
print( sm.OLS(y, X[ selected_cols+['size_in_m_2']] ).fit().summary() )

0,1,2,3
Dep. Variable:,price,R-squared:,0.654
Model:,OLS,Adj. R-squared:,0.654
Method:,Least Squares,F-statistic:,3594.0
Date:,"Wed, 20 Jul 2022",Prob (F-statistic):,0.0
Time:,21:02:40,Log-Likelihood:,-30048.0
No. Observations:,1905,AIC:,60100.0
Df Residuals:,1903,BIC:,60110.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
intercept,-1.658e+06,7.38e+04,-22.478,0.000,-1.8e+06,-1.51e+06
size_in_m_2,2.844e+04,474.409,59.952,0.000,2.75e+04,2.94e+04

0,1,2,3
Omnibus:,1048.866,Durbin-Watson:,1.704
Prob(Omnibus):,0.0,Jarque-Bera (JB):,29146.766
Skew:,2.041,Prob(JB):,0.0
Kurtosis:,21.723,Cond. No.,292.0


In [None]:
def forwardSelection(X, y, model_type ="linear",elimination_criteria = "aic", varchar_process = "dummy_dropfirst", sl=0.05):
    """
    Forward Selection is a function, based on regression models, that returns significant features and selection iterations.\n
    Required Libraries: pandas, numpy, statsmodels
    
    Parameters
    ----------
    X : Independent variables (Pandas Dataframe)\n
    y : Dependent variable (Pandas Series, Pandas Dataframe)\n
    model_type : 'linear' or 'logistic'\n
    elimination_criteria : 'aic', 'bic', 'r2', 'adjr2' or None\n
        'aic' refers to the Akaike information criterion\n
        'bic' refers to the Bayesian information criterion\n
        'r2' refers to R-squared (Only works on linear model type)\n
        'adjr2' refers to Adjusted R-squared (Only works on linear model type)\n
    varchar_process : 'drop', 'dummy' or 'dummy_dropfirst'\n
        'drop' drops varchar features\n
        'dummy' creates dummies for all levels of all varchars\n
        'dummy_dropfirst' creates dummies for all levels of all varchars, and drops first levels\n
    sl : Significance Level (default: 0.05)\n
    
    Returns
    -------
    columns(list), iteration_logs(str)\n\n
    Does not return a model
    
    Tested On
    ---------
    Python v3.6.7, Pandas v0.23.4, Numpy v1.15.04, StatsModels v0.9.0
    
    See Also
    --------
    https://en.wikipedia.org/wiki/Stepwise_regression
    """
    X = __varcharProcessing__(X,varchar_process = varchar_process)
    return __forwardSelectionRaw__(X, y, model_type = model_type,elimination_criteria = elimination_criteria , sl=sl)
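For comparison, a compact forward selection driven only by AIC can be written with NumPy. This is a simplified sketch and not the function above: it ranks candidates directly by AIC instead of by p-value, it assumes a Gaussian linear model, and the names `ols_aic` and `forward_by_aic` as well as the synthetic data are hypothetical.

```python
import numpy as np

def ols_aic(y, X):
    """AIC of a Gaussian linear model fitted by least squares."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return -2 * llf + 2 * k

def forward_by_aic(y, X, names):
    """Greedy forward selection; column 0 of X is assumed to be the intercept."""
    selected, remaining = [0], list(range(1, X.shape[1]))
    best_aic = ols_aic(y, X[:, selected])
    while remaining:
        # AIC of every one-variable extension of the current model.
        scores = [(ols_aic(y, X[:, selected + [j]]), j) for j in remaining]
        aic, j = min(scores)
        if aic >= best_aic:
            break   # no candidate improves the criterion
        selected.append(j)
        remaining.remove(j)
        best_aic = aic
    return [names[j] for j in selected], best_aic

# Synthetic data (illustrative only): y depends on x0 and x2.
rng = np.random.default_rng(4)
n = 300
Z = rng.normal(size=(n, 4))
y = 1.0 + 3.0 * Z[:, 0] - 2.0 * Z[:, 2] + rng.normal(size=n)
X = np.column_stack([np.ones(n), Z])
names = ["intercept", "x0", "x1", "x2", "x3"]

chosen, final_aic = forward_by_aic(y, X, names)
print(chosen, final_aic)
```

Ranking by AIC and stopping when it no longer decreases removes the need for a significance level, at the cost of losing the p-value filter that the function above applies before checking the criterion.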
