## Subset Selection - Part II

The purpose of this tutorial is to explain the following:

1. Forward-Stepwise Regression
2. Backward-Stepwise Regression
3. Forward/Backward Stepwise Regression using "mlxtend"



In [4]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import itertools

In [5]:
df = pd.read_csv("longley.csv",sep=",")
df.head()

Unnamed: 0.1,Unnamed: 0,GNP.deflator,GNP,Unemployed,Armed.Forces,Population,Year,Employed
0,1947,83.0,234.289,235.6,159.0,107.608,1947,60.323
1,1948,88.5,259.426,232.5,145.6,108.632,1948,61.122
2,1949,88.2,258.054,368.2,161.6,109.773,1949,60.171
3,1950,89.5,284.599,335.1,165.0,110.929,1950,61.187
4,1951,96.2,328.975,209.9,309.9,112.075,1951,63.221


In [6]:
y = df["Employed"]
X = df.drop("Employed",axis=1)
X.head()

Unnamed: 0.1,Unnamed: 0,GNP.deflator,GNP,Unemployed,Armed.Forces,Population,Year
0,1947,83.0,234.289,235.6,159.0,107.608,1947
1,1948,88.5,259.426,232.5,145.6,108.632,1948
2,1949,88.2,258.054,368.2,161.6,109.773,1949
3,1950,89.5,284.599,335.1,165.0,110.929,1950
4,1951,96.2,328.975,209.9,309.9,112.075,1951


In [7]:
X = X.drop(["Year","Unnamed: 0"],axis=1)
X.head()

Unnamed: 0,GNP.deflator,GNP,Unemployed,Armed.Forces,Population
0,83.0,234.289,235.6,159.0,107.608
1,88.5,259.426,232.5,145.6,108.632
2,88.2,258.054,368.2,161.6,109.773
3,89.5,284.599,335.1,165.0,110.929
4,96.2,328.975,209.9,309.9,112.075


#### Forward-Stepwise Selection 

In [8]:
def cal_RSS(Column_name):
    """Fit OLS of y on the given columns of X (no intercept) and return the fit and its RSS."""
    predictors = X[list(Column_name)]
    reg_name = sm.OLS(y, predictors).fit()
    # Residual sum of squares on the training data
    RSS = ((reg_name.predict(predictors) - y) ** 2).sum()
    return {"Model_Name": reg_name, "RSS": RSS}

In [9]:
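def forward_stepwise(df, target, sign_level=0.05):
    """Greedy forward selection: add the candidate with the smallest p-value while it is below sign_level."""
    columns = df.columns.tolist()
    selected_features = []
    while len(selected_features) < len(columns):
        # Candidates not yet in the model, kept in their original column order
        remaining_features = [c for c in columns if c not in selected_features]
        # p-values are floats, so the Series must not be created with an integer dtype
        pval = pd.Series(index=remaining_features, dtype=float)
        for col in remaining_features:
            # Refit with the current selection plus one candidate and record that candidate's p-value
            model = sm.OLS(target, sm.add_constant(df[selected_features + [col]])).fit()
            pval[col] = model.pvalues[col]
        min_pval = pval.min()
        if min_pval < sign_level:
            selected_features.append(pval.idxmin())
        else:
            break
    return selected_features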


In [10]:
features = forward_stepwise(X,y)

print(f"Forward-Stepwise {features}")


Forward-Stepwise ['GNP', 'Armed.Forces', 'Unemployed', 'GNP.deflator', 'Population']
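
The function above picks each new predictor by its p-value. The textbook formulation of forward-stepwise selection (James et al., 2013) instead adds, at each step, the predictor that gives the largest reduction in RSS, producing a nested sequence of models. The following is a minimal sketch of that variant, assuming NumPy only and synthetic data; the names `rss`, `forward_path`, `X_demo`, and `y_demo` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given column indices."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_path(X, y):
    """Greedily add the column that lowers RSS the most; return every model along the way."""
    remaining = list(range(X.shape[1]))
    selected, path = [], []
    while remaining:
        best = min(remaining, key=lambda c: rss(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
        path.append((list(selected), rss(X, y, selected)))
    return path

# Synthetic data (an assumption for illustration): y depends strongly on
# column 2, weakly on column 0, and not at all on columns 1 and 3.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 2] + 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

path = forward_path(X_demo, y_demo)
for cols, value in path:
    print(cols, round(value, 3))
```

Because RSS can only decrease as predictors are added, this procedure never stops on its own. It yields one candidate model per size, and the final choice among them needs a separate criterion such as cross-validation, BIC, or adjusted R^2.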


#### Backward-Stepwise Selection 

In [12]:
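def backward_stepwise(df, target, sign_level=0.5):
    """Backward elimination: drop the predictor with the largest p-value while it exceeds sign_level."""
    # The default threshold here is 0.5, not the 0.05 used in forward_stepwise
    columns = df.columns.tolist()
    while len(columns) > 0:
        # p-values of the predictors only; the first entry belongs to the intercept
        p_val = sm.OLS(target, sm.add_constant(df[columns])).fit().pvalues.iloc[1:]
        max_p_val = p_val.max()
        if max_p_val > sign_level:
            removed_feature = p_val.idxmax()
            print(f"Removing {removed_feature} with p-val {np.round(max_p_val,2)}")
            columns.remove(removed_feature)
        else:
            break
    return columns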

In [13]:
features = backward_stepwise(X,y)

print(f"Backward-Stepwise {features}")

Removing GNP.deflator with p-val 0.72
Backward-Stepwise ['GNP', 'Unemployed', 'Armed.Forces', 'Population']
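
The p-value rule above both orders the removals and decides when to stop. An alternative, sketched below under stated assumptions, separates the two jobs: backward elimination by RSS builds one model per size, and an information criterion then picks among them. The sketch uses NumPy and synthetic data only; `fit_rss`, `backward_path`, and `bic` are hypothetical helper names, and the BIC expression is the Gaussian form n*log(RSS/n) + d*log(n), which equals the usual BIC up to an additive constant.

```python
import numpy as np

def fit_rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given column indices."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def backward_path(X, y):
    """Start from all columns and repeatedly drop the one whose removal raises RSS the least."""
    current = list(range(X.shape[1]))
    path = [(list(current), fit_rss(X, y, current))]
    while current:
        drop = min(current, key=lambda c: fit_rss(X, y, [k for k in current if k != c]))
        current.remove(drop)
        path.append((list(current), fit_rss(X, y, current)))
    return path

def bic(rss_value, n, d):
    """Gaussian BIC up to a constant; d counts fitted coefficients including the intercept."""
    return n * np.log(rss_value / n) + d * np.log(n)

# Synthetic data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 2] + 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

n = len(y_demo)
path = backward_path(X_demo, y_demo)
best_cols, best_rss = min(path, key=lambda item: bic(item[1], n, len(item[0]) + 1))
print("Elimination order:", [cols for cols, _ in path])
print("Subset with lowest BIC:", best_cols)
```

With this split, the stopping decision no longer depends on a significance threshold, which makes the outcome easier to compare with score-based selectors.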


#### mlxtend

In [15]:
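from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Backward sequential selection scored on the training data (cv=0).
# The selector gets its own name (sfs) so the imported class is not overwritten.
sfs = SequentialFeatureSelector(LinearRegression(),
            k_features=5,   # X has five columns, so all five are kept
            forward=False,
            floating=False,
            cv=0)

sfs.fit(X,y)
print(f"Backward-Stepwise using MLExtend {sfs.k_feature_names_}")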

Backward-Stepwise using MLExtend ('GNP.deflator', 'GNP', 'Unemployed', 'Armed.Forces', 'Population')
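
To see why the requested subset size matters, the sketch below mimics a backward sequential selector with a plain NumPy loop. It is a simplified stand-in written for illustration, not mlxtend's implementation: it scores subsets by in-sample R^2 and stops once `k` columns remain. The function names and the synthetic data are assumptions.

```python
import numpy as np

def r2_ols(X, y, cols):
    """In-sample R^2 of an OLS fit (with intercept) on the given column indices."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    total = y - y.mean()
    return 1.0 - float(resid @ resid) / float(total @ total)

def sequential_backward(X, y, k):
    """Drop one column at a time, keeping the subset with the highest R^2, until k remain."""
    current = list(range(X.shape[1]))
    while len(current) > k:
        # One candidate subset per column that could be removed
        candidates = [[c for c in current if c != drop] for drop in current]
        current = max(candidates, key=lambda cols: r2_ols(X, y, cols))
    return tuple(current)

# Synthetic data (an assumption for illustration): only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 2] + 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

all_kept = sequential_backward(X_demo, y_demo, k=4)  # k equals the number of columns
two_kept = sequential_backward(X_demo, y_demo, k=2)
print(all_kept)
print(two_kept)
```

When `k` equals the number of columns, the loop body never runs and every column is returned, so a meaningful comparison with p-value-based elimination needs a target size smaller than the number of available predictors.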


#### Conclusion: 

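The result from mlxtend differs from the result of backward-stepwise selection using p-values: mlxtend returns all five predictors, whereas the p-value procedure removes GNP.deflator. The two runs are not directly comparable, however. With k_features=5 and only five columns in X, the mlxtend selector has nothing to eliminate, and by default it scores regression subsets by R^2 rather than by p-values. The discrepancy therefore shows that the selected subset depends on the criterion and the stopping rule; on its own it does not establish whether p-value-based backward-stepwise selection is justified.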

#### References 

1. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani (2013). An Introduction to Statistical Learning: with Applications in R. New York: Springer.