## <center><font color=navy>Big Data Economics</font></center>
### <center>Algorithms for Subset Selection in Multiple Linear Regression</center>
#### <center>Ali Habibnia</center>
    
<center> Assistant Professor, Department of Economics, </center>
<center> and Division of Computational Modeling & Data Analytics at Virginia Tech</center>
<center> habibnia@vt.edu </center> 

### Readings:

1. ***Chapter 6.2,*** Graham Elliott and Allan Timmermann, Economic Forecasting, Princeton University Press, 2016.
2. ***Chapter 3.3,*** Trevor Hastie, Robert Tibshirani, and Jerome Friedman, [The Elements of Statistical Learning: Data Mining, Inference, and Prediction](https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12.pdf), 2nd ed., Springer, 2009.

### Goals

> In this note, we introduce methods for identifying subsets of the independent variables to improve predictions.

- For a moderate number of independent variables, the method discussed in the previous session is preferable: evaluate all subsets.

- We compute a criterion such as adjusted $R^2$. We use the adjusted $R^2$ for all subsets to choose the best one. 

- This is only feasible if $p$ is less than about 20, because $p$ candidate variables generate $2^p$ subsets.

- Since the number of subsets for even moderate values of $p$ is very large, we need some way to examine the most promising subsets and to select from them. An intuitive metric to compare subsets is adjusted $R^2$.

- A criterion that is often used for subset selection is known as Mallows' $C_p$. This criterion assumes that the full model is unbiased, although it may contain variables that, if dropped, would improve the mean squared error (MSE).

- $C_p$ is also an estimate of the sum of MSE (standardized by dividing by $\sigma^2$) for predictions (the fitted values) at the x-values observed in the training set. Thus good models are those that have values of $C_p$ near $k$ and that have small $k$ (i.e., are of small size). $C_p$ is computed from the formula:

$$
C_p = \frac{SSR}{\hat{\sigma}^2_{Full}} + 2k - n
$$

Where:
- $SSR$ is the sum of squared residuals of the subset model being evaluated.
- $\hat{\sigma}^2_{Full}$ is an estimate of the variance associated with the full model.
- $k$ is the number of estimated coefficients in the subset model (the predictors plus the intercept).
- $n$ is the sample size.
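
As a quick consistency check on the "near $k$" rule, suppose that $\hat{\sigma}^2_{Full} = SSR_{Full} / (n - k_{Full})$, where $k_{Full}$ counts every estimated coefficient of the full model, intercept included (this counting convention is an assumption made here for the check). Evaluating the criterion at the full model itself then gives

$$
C_p^{Full} = \frac{SSR_{Full}}{SSR_{Full} / (n - k_{Full})} + 2k_{Full} - n = (n - k_{Full}) + 2k_{Full} - n = k_{Full}
$$

so the full model always sits exactly on the line $C_p = k$, and a smaller subset is attractive when its $C_p$ stays close to its own $k$.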



- It is important to remember that the usefulness of this approach depends heavily on the reliability of the estimate of ${\sigma}^2$ for the full model. This requires that the training set contains a large number of observations relative to the number of variables.
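
The all-subsets idea from this section can be sketched on simulated data. The block below is a minimal illustration, not the procedure from the previous session: the data-generating process, the helper `rss`, and the rule "pick the smallest $C_p$" are all assumptions made for the example, and $k$ is taken to count the intercept.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
# Hypothetical data-generating process: only columns 0 and 1 matter.
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)


def rss(cols):
    """Residual sum of squares of an OLS fit on `cols` plus an intercept."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)


tss = float(((y - y.mean()) ** 2).sum())
# Error variance estimated from the full model (p predictors + intercept).
sigma2_full = rss(range(p)) / (n - p - 1)

results = []
for size in range(p + 1):
    for cols in combinations(range(p), size):
        k = size + 1  # coefficients including the intercept (assumption)
        s = rss(cols)
        adj_r2 = 1 - (s / (n - k)) / (tss / (n - 1))
        cp = s / sigma2_full + 2 * k - n
        results.append((cols, adj_r2, cp))

best_adj = max(results, key=lambda r: r[1])  # largest adjusted R^2
best_cp = min(results, key=lambda r: r[2])   # smallest C_p
print("subsets evaluated:", len(results))    # 2**p
print("best by adjusted R^2:", best_adj[0])
print("best by C_p:", best_cp[0])
```

With $p = 6$ the loop fits $2^6 = 64$ models; at $p = 20$ it would already need more than a million fits, which is why the heuristics discussed later become necessary.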

### Subset Selection in Linear Regression

- A frequent problem in data mining is that of using a regression equation to predict the value of a dependent variable when we have a number of variables available to choose as independent variables in our model. 

- Given the high speed of modern algorithms for multiple linear regression calculations, it is tempting in such a situation to take a kitchen-sink approach: why bother to select a subset when we can just use all the variables in the model?

- There are several reasons why this could be undesirable.

    - It may be expensive to collect the full complement of variables for future predictions.
    - We may be able to more accurately measure fewer variables (for example in surveys).
    - Parsimony is an important property of good models. We obtain more insight into the influence of regressors in models with a few parameters.
    - Estimates of regression coefficients are likely to be unstable due to multicollinearity in models with many variables. We get better insights into the influence of regressors from models with fewer variables as the coefficients are more stable for parsimonious models.
    - It can be shown that using independent variables that are uncorrelated with the dependent variable will increase the variance of predictions.
    - It can be shown that dropping independent variables that have small (non-zero) coefficients can reduce the average error of predictions.
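
The claim that irrelevant regressors inflate prediction variance can be checked with a small simulation. This is only a sketch under assumed settings (two relevant predictors, 20 pure-noise predictors, 50 training observations, all names hypothetical); it compares the error of predicting the true signal with and without the noise columns.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, n_noise, n_reps = 50, 1000, 20, 200


def fit_predict(X_tr, y_tr, X_te):
    """OLS with intercept fitted on the training rows, evaluated on the test rows."""
    Z_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    Z_te = np.column_stack([np.ones(len(X_te)), X_te])
    beta, *_ = np.linalg.lstsq(Z_tr, y_tr, rcond=None)
    return Z_te @ beta


mse_small, mse_big = [], []
for _ in range(n_reps):
    X = rng.normal(size=(n_train + n_test, 2 + n_noise))
    signal = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]  # columns 2.. are irrelevant
    y = signal + rng.normal(size=n_train + n_test)
    tr, te = slice(0, n_train), slice(n_train, None)
    # Fit once with the two relevant columns, once with every column.
    for cols, store in ((slice(0, 2), mse_small), (slice(None), mse_big)):
        pred = fit_predict(X[tr, cols], y[tr], X[te, cols])
        store.append(np.mean((pred - signal[te]) ** 2))

print("mean error, relevant predictors only:", np.mean(mse_small))
print("mean error, all predictors:          ", np.mean(mse_big))
```

Both models are unbiased here, so the gap between the two averages comes entirely from the extra variance of estimating coefficients that are truly zero.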

### Algorithms for Subset Selection

- Selecting subsets to improve MSE is a difficult computational problem for a large number $p$ of independent variables. 

- The most common procedure for $p$ greater than about 20 is to use heuristics to select “good” subsets rather than to look for the best subset for a given criterion. 

- The heuristics most often used and available in statistics software are step-wise procedures. 

- There are three common procedures: **forward selection**, **backward elimination** and **step-wise regression**.

## Forward Selection

**Forward selection** is a stepwise approach that starts with no variables in the model (the null model) and adds them one by one. In the first step, it adds the most significant variable. At each subsequent step, it adds the most significant variable of those not in the model, until there are no variables that meet the criterion set by the researcher.

### Algorithm:

1. Start with a model containing only the constant term (intercept).
2. For each variable $i$ not in the current set $S$:
   - Compute the decrease in SSR (sum of squared residuals) that would result from adding the variable.
   - Estimate $\hat{\sigma}^2$ for the candidate model $S \cup \{i\}$.
   - Calculate the F-statistic:
     $$
     F_i = \frac{SSR(S) - SSR(S \cup \{i\})}{\hat{\sigma}^2(S \cup \{i\})}
     $$
   - Once $F_i$ has been computed for every candidate, add the variable with the largest $F_i$ if it exceeds a predefined threshold $F_{in}$ (often between 2 and 4).
3. Repeat step 2 until no more variables meet the criterion for inclusion.
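
The steps above can be sketched with plain NumPy. This is an illustrative implementation under stated assumptions, not the code used later in this note: the function name `forward_select_f`, the helper `rss`, the default `f_in=4.0`, and the simulated data are all hypothetical, and one variable (the one with the largest $F_i$) is added per pass.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on `cols` plus an intercept."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)


def forward_select_f(X, y, f_in=4.0):
    """Forward selection driven by the partial F-statistic."""
    n, p = X.shape
    selected = []
    while len(selected) < p:
        rss_current = rss(X, y, selected)
        best_f, best_j = -np.inf, None
        for j in range(p):
            if j in selected:
                continue
            cand = selected + [j]
            rss_cand = rss(X, y, cand)
            # sigma^2 is estimated from the candidate model S u {j}
            sigma2 = rss_cand / (n - len(cand) - 1)
            f_stat = (rss_current - rss_cand) / sigma2
            if f_stat > best_f:
                best_f, best_j = f_stat, j
        if best_f <= f_in:  # no candidate clears the entry threshold
            break
        selected.append(best_j)
    return selected


rng = np.random.default_rng(2)
X_sim = rng.normal(size=(300, 8))
# Hypothetical signal: only columns 0 and 3 carry information.
y_sim = 3.0 * X_sim[:, 0] - 2.0 * X_sim[:, 3] + rng.normal(size=300)
forward_order = forward_select_f(X_sim, y_sim)
print(forward_order)
```

The returned list is in order of entry, so the strongest predictor appears first; pure-noise columns can still slip in occasionally because each clears $F_{in} = 4$ by chance roughly 5% of the time.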



## Backward Elimination

**Backward elimination** is the reverse process of forward selection. We start with all candidate variables (the full model) and remove the least significant variable at each step, until no remaining variable meets the removal criterion.

### Algorithm:
1. Start with all variables in the model.
2. For each variable $i$ in the current set $S$:
   - Compute the increase in SSR that would result from removing the variable.
   - Estimate $\hat{\sigma}^2$ for the current model $S$.
   - Calculate the F-statistic:
     $$
     F_i = \frac{SSR(S - \{i\}) - SSR(S)}{\hat{\sigma}^2(S)}
     $$
   - Once $F_i$ has been computed for every variable in $S$, remove the variable with the smallest $F_i$ if it is less than a predefined threshold $F_{out}$ (often between 2 and 4).
3. Repeat step 2 until no more variables meet the criterion for removal.
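
A matching sketch of these steps, again with hypothetical names (`backward_eliminate_f`, `rss`), an assumed threshold `f_out=4.0`, and simulated data. It removes one variable per pass, the one with the smallest $F_i$.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on `cols` plus an intercept."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)


def backward_eliminate_f(X, y, f_out=4.0):
    """Backward elimination driven by the partial F-statistic."""
    n, p = X.shape
    selected = list(range(p))  # start from the full model
    while selected:
        rss_current = rss(X, y, selected)
        # sigma^2 is estimated from the current model S
        sigma2 = rss_current / (n - len(selected) - 1)
        f_stats = {
            j: (rss(X, y, [c for c in selected if c != j]) - rss_current) / sigma2
            for j in selected
        }
        worst = min(f_stats, key=f_stats.get)
        if f_stats[worst] >= f_out:  # every remaining variable is worth keeping
            break
        selected.remove(worst)
    return selected


rng = np.random.default_rng(2)
X_sim = rng.normal(size=(300, 8))
# Hypothetical signal: only columns 0 and 3 carry information.
y_sim = 3.0 * X_sim[:, 0] - 2.0 * X_sim[:, 3] + rng.normal(size=300)
kept = backward_eliminate_f(X_sim, y_sim)
print(kept)
```

Note that the first pass requires fitting the full model, so the sketch needs more observations than predictors.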

- Backward Elimination has the advantage that all variables are included in $S$ at some stage. 

- This addresses a weakness of forward selection, which will never select a variable that is better than a previously selected variable with which it is strongly correlated. 

- The disadvantage is that the full model with all variables is required at the start and this can be time-consuming and numerically unstable.

## Stepwise Regression

**Stepwise regression** alternates between forward and backward, bringing in and removing variables that meet the criteria for entry or removal, until a stable set of variables is attained. This procedure is like Forward Selection except that at each step we consider dropping variables as in Backward Elimination.


### Algorithm:
1. Start with no variables in the model or all variables, depending on the specific type of stepwise regression.
2. Perform steps from forward selection to add significant variables.
3. Perform steps from backward elimination to remove non-significant variables.
4. Repeat steps 2 and 3 until adding or removing variables does not improve the model.

- Convergence is guaranteed if the thresholds $F_{out}$ and $F_{in}$ satisfy $F_{out} < F_{in}$. It is possible, however, for a variable to enter $S$ and then leave $S$ at a subsequent step and even rejoin $S$ at a yet later step.

- These methods pick one best subset. There are straightforward variations of the methods that do identify several close to best choices for different sizes of independent variable subsets.

- Each of these methods has its own merits and can be chosen based on the specific needs of the data and the analysis being performed.
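
The alternating procedure can be sketched by combining one forward step and one backward step per iteration. Everything specific here is an assumption for illustration: the function name `stepwise_f`, the thresholds `f_in=4.0` and `f_out=3.9` (chosen so that $F_{out} < F_{in}$), the `max_iter` safeguard, and the simulated data.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on `cols` plus an intercept."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)


def stepwise_f(X, y, f_in=4.0, f_out=3.9, max_iter=100):
    """Stepwise regression: a forward step followed by a backward step."""
    n, p = X.shape
    selected = []
    for _ in range(max_iter):
        changed = False
        # Forward step: find the candidate with the largest partial F.
        rss_current = rss(X, y, selected)
        best_f, best_j = -np.inf, None
        for j in (c for c in range(p) if c not in selected):
            cand = selected + [j]
            rss_cand = rss(X, y, cand)
            f_stat = (rss_current - rss_cand) / (rss_cand / (n - len(cand) - 1))
            if f_stat > best_f:
                best_f, best_j = f_stat, j
        if best_j is not None and best_f > f_in:
            selected.append(best_j)
            changed = True
        # Backward step: drop the weakest variable if it falls below f_out.
        if selected:
            rss_current = rss(X, y, selected)
            sigma2 = rss_current / (n - len(selected) - 1)
            f_stats = {
                j: (rss(X, y, [c for c in selected if c != j]) - rss_current) / sigma2
                for j in selected
            }
            worst = min(f_stats, key=f_stats.get)
            if f_stats[worst] < f_out:
                selected.remove(worst)
                changed = True
        if not changed:  # stable set reached
            break
    return selected


rng = np.random.default_rng(2)
X_sim = rng.normal(size=(300, 8))
# Hypothetical signal: only columns 0 and 3 carry information.
y_sim = 3.0 * X_sim[:, 0] - 2.0 * X_sim[:, 3] + rng.normal(size=300)
stepwise_set = stepwise_f(X_sim, y_sim)
print(stepwise_set)
```

A variable that has just entered has the same $F$ value in the backward step as it had in the forward step, so with $F_{out} < F_{in}$ it cannot be removed in the same iteration.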


The essential problems with stepwise methods can be paraphrased as follows:
1. $R^2$ values are biased high.
2. The $F$ statistics do not have the claimed distribution.
3. The standard errors of the parameter estimates are too small.
4. Consequently, the confidence intervals around the parameter estimates are too narrow.
5. p-values are too low, due to multiple comparisons, and are difficult to correct.
6. Parameter estimates are biased away from 0.
7. Collinearity problems are exacerbated.
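
The multiple-comparisons problem in this list can be made concrete with a simulation on pure noise. The sketch below assumes 50 candidate predictors that are all unrelated to the response and uses the entry threshold $F_{in} = 4$; the function name `first_step_f` and all settings are hypothetical. It compares how often one prespecified predictor clears the threshold with how often the best of the 50 does, which is what the first forward-selection step looks at.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, n_reps, f_in = 100, 50, 200, 4.0


def first_step_f(X, y):
    """F-statistic of each single-predictor regression (with intercept)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc @ yc))
    # For one predictor, F = r^2 (n - 2) / (1 - r^2)
    return r ** 2 / (1 - r ** 2) * (len(y) - 2)


single_hits, best_hits = 0, 0
for _ in range(n_reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)  # unrelated to every column of X
    f = first_step_f(X, y)
    single_hits += int(f[0] > f_in)     # one predictor chosen in advance
    best_hits += int(f.max() > f_in)    # the predictor selection would pick

single_rate = single_hits / n_reps
best_rate = best_hits / n_reps
print("prespecified predictor exceeds F_in:", single_rate)
print("best of", p, "predictors exceeds F_in:", best_rate)
```

The second rate is far above the first, even though no predictor has any real effect: the selected variable's $F$ statistic does not follow the nominal $F$ distribution.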

### Subset Selection in Python

1. f_regression in sklearn. See [1](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html), [2](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection)
> Scikit-learn indeed does not support stepwise regression. That's because what is commonly known as 'stepwise regression' is an algorithm based on p-values of coefficients of linear regression, and scikit-learn deliberately avoids inferential approach to model learning (significance testing etc).
2. Forward Selection by adjusted $R^2$ with statsmodels. See [link](https://planspace.org/20150423-forward_selection_with_statsmodels/)
3. [stepwise-regression package](https://pypi.org/project/stepwise-regression/#description) which executes linear regression forward and backward
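
As a brief illustration of the first option, the sketch below calls `f_regression` on simulated data and, for comparison, scikit-learn's `SequentialFeatureSelector` (available from version 0.24). Note that the latter performs greedy selection by cross-validated score, not by p-values, so it is not the classical stepwise regression described above. The data and the choice of two features are assumptions for the example.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X_sim = rng.normal(size=(300, 8))
# Hypothetical signal: only columns 0 and 3 carry information.
y_sim = 3.0 * X_sim[:, 0] - 2.0 * X_sim[:, 3] + rng.normal(size=300)

# Univariate screening: one F-statistic and p-value per column.
f_stats, p_values = f_regression(X_sim, y_sim)
top_two = np.argsort(f_stats)[::-1][:2]
print("largest univariate F:", sorted(top_two.tolist()))

# Greedy forward selection scored by 5-fold cross-validation.
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
sfs.fit(X_sim, y_sim)
chosen = np.flatnonzero(sfs.get_support())
print("forward selection by CV score:", chosen.tolist())
```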

In [24]:
!pip install -q stepwise-regression

### Example: Boston Housing Dataset

The Boston Housing dataset records house prices in various parts of the Boston area. Alongside price, it provides attributes such as the per capita crime rate (CRIM), the proportion of non-retail business acres in the town (INDUS), and the proportion of owner-occupied units built before 1940 (AGE), among others.


**The objective is to predict house prices using the given features.**

- Number of Instances: 506

- Number of Attributes: 13 continuous attributes and 1 binary-valued attribute.

- Attribute Information:

    1. CRIM      per capita crime rate by town
    2. ZN        proportion of residential land zoned for lots over 25,000 sq.ft.
    3. INDUS     proportion of non-retail business acres per town
    4. CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    5. NOX       nitric oxides concentration (parts per 10 million)
    6. RM        average number of rooms per dwelling
    7. AGE       proportion of owner-occupied units built prior to 1940
    8. DIS       weighted distances to five Boston employment centres
    9. RAD       index of accessibility to radial highways
    10. TAX      full-value property-tax rate per 10,000 Dollars
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    13. LSTAT    \% lower status of the population
    14. MEDV     Median value of owner-occupied homes in 1000's Dollars
    


In [10]:
# from sklearn.datasets import fetch_california_housing
import pandas as pd
import statsmodels.api as sm

In [11]:
# Load Boston housing dataset from the provided URL
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
boston = pd.read_csv(url)

# Separate predictors and response
X = boston.drop('medv', axis=1)  # predictors
y = boston['medv']  # response

In [12]:
X.head()

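| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 |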


In [20]:
# Define the stepwise selection functions

def forward_selection(X, y, significance_level=0.01):
    """Greedy forward selection: add the candidate with the smallest p-value."""
    best_features = []
    while True:
        # Candidates not yet in the model, kept in the original column order
        remaining_features = [c for c in X.columns if c not in best_features]
        if not remaining_features:
            break
        new_pval = pd.Series(index=remaining_features, dtype=float)
        for new_column in remaining_features:
            model = sm.OLS(y, sm.add_constant(X[best_features + [new_column]])).fit()
            new_pval[new_column] = model.pvalues[new_column]
        # Stop as soon as even the best candidate is not significant
        if new_pval.min() < significance_level:
            best_features.append(new_pval.idxmin())
        else:
            break
    return best_features

def backward_elimination(X, y, significance_level=0.01):
    """Backward elimination: drop the largest p-value until all are significant."""
    features = list(X.columns)
    while len(features) > 0:
        features_with_constant = sm.add_constant(X[features])
        # Skip the first entry, which is the p-value of the intercept
        p_values = sm.OLS(y, features_with_constant).fit().pvalues.iloc[1:]
        max_p_value = p_values.max()
        if max_p_value > significance_level:
            features.remove(p_values.idxmax())
        else:
            break
    return features

def stepwise_reg(X, y, significance_level_in=0.01, significance_level_out=0.01):
    """Forward selection with a backward-elimination pass after each addition."""
    features_in = []
    features_out = list(X.columns)  # candidates that have never entered the model
    while features_out:
        # Forward step: p-value of each candidate when added to the current model
        pval_in = pd.Series(index=features_out, dtype=float)
        for new_column in features_out:
            model_in = sm.OLS(y, sm.add_constant(X[features_in + [new_column]])).fit()
            pval_in[new_column] = model_in.pvalues[new_column]
        if not (pval_in.min() < significance_level_in):
            break
        entering = pval_in.idxmin()
        features_in.append(entering)
        features_out.remove(entering)
        # Backward step: drop features that are no longer significant.
        # A dropped feature is not returned to features_out, so it cannot re-enter.
        while features_in:
            model_out = sm.OLS(y, sm.add_constant(X[features_in])).fit()
            p_values_out = model_out.pvalues.iloc[1:]  # skip the intercept
            if p_values_out.max() > significance_level_out:
                features_in.remove(p_values_out.idxmax())
            else:
                break
    return features_in

In [22]:
# Apply the stepwise selection methods
forward_selected_features = forward_selection(X, y)
backward_eliminated_features = backward_elimination(X, y)
stepwise_selected_features = stepwise_reg(X, y)

# Create a DataFrame to display the features in a table
features_df = pd.DataFrame({
    'Forward Selection': pd.Series(forward_selected_features),
    'Backward Elimination': pd.Series(backward_eliminated_features),
    'Stepwise Selection': pd.Series(stepwise_selected_features)
})

# Fill NaN values with empty strings for a cleaner presentation
features_df = features_df.fillna('')
features_df


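| | Forward Selection | Backward Elimination | Stepwise Selection |
|---|---|---|---|
| 0 | lstat | crim | lstat |
| 1 | rm | zn | rm |
| 2 | ptratio | chas | ptratio |
| 3 | dis | nox | dis |
| 4 | nox | rm | nox |
| 5 | chas | dis | chas |
| 6 | b | rad | b |
| 7 | zn | tax | zn |
| 8 | | ptratio | |
| 9 | | b | |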
