# Wrapper Methods

Wrapper methods evaluate multiple models using procedures that add and/or remove predictors to find the combination that maximizes model performance. [1] These procedures are usually built on greedy search. [2] A greedy algorithm is any algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage, so the subset it returns is not guaranteed to be the globally best one. [3]
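To give a rough sense of why greedy search is attractive, the sketch below counts model fits for an exhaustive search over all subsets of a fixed size against greedy forward selection. The helper names are hypothetical, and the counts assume one model fit per candidate subset with no cross-validation.

```python
from math import comb

def exhaustive_fits(p, k):
    """Number of distinct subsets of exactly k predictors out of p."""
    return comb(p, k)

def forward_fits(p, k):
    """Fits needed by greedy forward selection to reach k predictors.

    Step j (0-based) tries each of the p - j remaining predictors once.
    """
    return sum(p - j for j in range(k))

p, k = 13, 5  # 13 candidate predictors, keep 5
print(exhaustive_fits(p, k))  # 1287
print(forward_fits(p, k))     # 55
```

The gap widens quickly as the number of predictors grows, which is the usual motivation for accepting a locally optimal search.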

Generally, three directions of procedures are possible:

Forward selection: starts with the single best predictor and adds more iteratively. At each subsequent iteration, the best of the remaining predictors is added based on the performance criterion.

Backward elimination: starts with all predictors and eliminates them one at a time. One of the most popular algorithms is Recursive Feature Elimination (RFE), which repeatedly refits the model and eliminates the least important predictors based on a feature importance ranking.

Step-wise selection: bi-directional, combining forward selection and backward elimination. It is considered less greedy than the previous two procedures because it reconsiders adding back predictors that have been removed (and vice versa). Nonetheless, each decision is still based on local optimisation at the given iteration.
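The forward procedure described above can be sketched without any modelling library. This is a minimal illustration, not any library's implementation: `forward_select`, `toy_score`, and the `GAINS` values are all hypothetical, and the scoring function stands in for whatever performance criterion a real model would supply.

```python
def forward_select(candidates, score_fn, k):
    """Greedy forward selection.

    candidates: iterable of predictor names
    score_fn:   callable mapping a list of predictors to a score (higher is better)
    k:          number of predictors to keep
    Returns a list of (subset, score) pairs, one per iteration.
    """
    selected, remaining, history = [], list(candidates), []
    while remaining and len(selected) < k:
        # Locally optimal choice: the predictor whose addition scores best right now
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected = selected + [best]
        remaining.remove(best)
        history.append((tuple(selected), score_fn(selected)))
    return history


# Made-up per-predictor gains; "a" overlaps with the others, so any
# multi-predictor subset that contains "a" is penalised.
GAINS = {"a": 0.5, "b": 0.45, "c": 0.45}

def toy_score(subset):
    total = sum(GAINS[f] for f in subset)
    if "a" in subset and len(subset) > 1:
        total -= 0.3
    return total

history = forward_select(GAINS, toy_score, k=2)
for subset, score in history:
    print(subset, round(score, 2))

# The pair the greedy search never visits
print(("b", "c"), round(toy_score(["b", "c"]), 2))
```

In this toy setup the search commits to `a` first because it is the best single predictor, and so never evaluates the pair `(b, c)`, which scores higher than any pair containing `a`. That is the kind of local optimum the text refers to.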

### Forward Selection Using SFS() from mlxtend

```python
# Load the required libraries
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
import pandas as pd

# Load the Boston housing dataset
boston = load_boston()

# Build a DataFrame from the features and append the target as PRICE
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['PRICE'] = pd.Series(boston.target)

# Split the features and target data
# Select the first 13 columns as features
X = df.iloc[:, :13]
# Select the last column as the target
y = df.iloc[:, -1]

# Define Sequential Forward Selection (SFS)
# cv=0 disables cross-validation, so r2 is computed on the training data
sfs = SFS(LinearRegression(),
          k_features=5,
          forward=True,
          floating=False,
          scoring='r2',
          cv=0)
# Use SFS to select the top 5 features
sfs.fit(X, y)

# Create a DataFrame of the SFS results (one row per subset size)
df_SFS_results = pd.DataFrame(sfs.subsets_).transpose()
df_SFS_results
```

| | feature_idx | cv_scores | avg_score | feature_names |
|---|---|---|---|---|
| 1 | (12,) | [0.5441462975864797] | 0.544146 | (LSTAT,) |
| 2 | (5, 12) | [0.6385616062603403] | 0.638562 | (RM, LSTAT) |
| 3 | (5, 10, 12) | [0.678624160161311] | 0.678624 | (RM, PTRATIO, LSTAT) |
| 4 | (5, 7, 10, 12) | [0.6903077016842538] | 0.690308 | (RM, DIS, PTRATIO, LSTAT) |
| 5 | (4, 5, 7, 10, 12) | [0.7080892893529662] | 0.708089 | (NOX, RM, DIS, PTRATIO, LSTAT) |
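For readers on a scikit-learn release without `load_boston`, a similar forward selection can be sketched with scikit-learn's own `SequentialFeatureSelector` (available since version 0.24). This is an alternative to the mlxtend call above, not a reproduction of it: the bundled diabetes dataset is swapped in as an assumption, and `cv=5` is used instead of `cv=0`, so the selected features and scores will not match the table above.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Assumption: the diabetes dataset stands in for Boston housing here
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Forward selection of 5 features, scored by 5-fold cross-validated R^2
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,
    direction="forward",
    scoring="r2",
    cv=5,
)
selector.fit(X, y)

# get_support() returns a boolean mask over the input columns
selected = list(X.columns[selector.get_support()])
print(selected)
```

Unlike mlxtend's `subsets_`, this selector only exposes the final mask, so the per-step history shown in the table is not available from it.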


### Backward Elimination Using RFE() from scikit-learn

```python
# Load the required libraries
from sklearn.datasets import load_boston
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
import pandas as pd

# Load the Boston housing dataset
boston = load_boston()

# Split the features and target data
# boston.data holds the 13 feature columns
X = boston.data
# boston.target holds the target values
Y = boston.target

# Build a linear regression model
model = LinearRegression()
# Define RFE: drop one feature per iteration (step=1) until 5 remain
rfe = RFE(estimator=model, n_features_to_select=5, step=1)
# Use RFE to select the top 5 features
fit = rfe.fit(X, Y)

# Collect one row per feature: name, whether it was kept, and its ranking
df_RFE_results = []
for i in range(X.shape[1]):
    df_RFE_results.append(
        {
            'Feature_names': boston.feature_names[i],
            'Selected': rfe.support_[i],
            'RFE_ranking': rfe.ranking_[i],
        }
    )

df_RFE_results = pd.DataFrame(df_RFE_results)
df_RFE_results.index.name = 'Columns'
df_RFE_results
```


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_ho

| Columns | Feature_names | Selected | RFE_ranking |
|---|---|---|---|
| 0 | CRIM | False | 4 |
| 1 | ZN | False | 6 |
| 2 | INDUS | False | 5 |
| 3 | CHAS | True | 1 |
| 4 | NOX | True | 1 |
| 5 | RM | True | 1 |
| 6 | AGE | False | 9 |
| 7 | DIS | True | 1 |
| 8 | RAD | False | 3 |
| 9 | TAX | False | 7 |
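With a linear model, RFE ranks features by the size of their fitted coefficients, so a feature measured in small units can outrank a more useful one measured in large units. This may partly explain why RFE and forward selection disagree on this dataset, although that is not verified here. The sketch below illustrates the effect on synthetic data; the variable names, scales, and coefficients are all made up for the illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
strong = rng.normal(scale=1000.0, size=n)  # large units, drives most of y
weak = rng.normal(scale=0.001, size=n)     # tiny units, minor effect on y
X = np.column_stack([strong, weak])
y = 0.01 * strong + 50.0 * weak + rng.normal(scale=0.1, size=n)

# On raw features the weak column has the larger coefficient, so RFE keeps it
raw = RFE(LinearRegression(), n_features_to_select=1).fit(X, y)

# Standardizing first puts the coefficients on a comparable scale
scaled = make_pipeline(
    StandardScaler(),
    RFE(LinearRegression(), n_features_to_select=1),
).fit(X, y)

print(raw.support_)         # mask of the column kept without scaling
print(scaled[-1].support_)  # mask of the column kept with scaling
```

Standardizing the predictors before RFE is one common way to make the coefficient-based ranking comparable across features.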


### Step-wise Selection Using SFFS() from mlxtend

```python
# Load the required libraries
from sklearn.datasets import load_boston
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
import pandas as pd

# Load the Boston housing dataset
boston = load_boston()

# Build a DataFrame from the features and append the target as PRICE
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['PRICE'] = pd.Series(boston.target)

# Split the features and target data
# Select the first 13 columns as features (as a NumPy array)
X = df.iloc[:, :13].values
# Select the last column as the target (as a NumPy array)
y = df.iloc[:, -1].values

# Define Sequential Forward Floating Selection (SFFS)
# floating=True lets the search drop a previously added feature again
# cv=0 disables cross-validation, so r2 is computed on the training data
sffs = SFS(LinearRegression(),
           k_features=5,
           forward=True,
           floating=True,
           scoring='r2',
           cv=0)
# Use SFFS to select the top 5 features, passing the column names explicitly
feature_names = boston.feature_names
sffs.fit(X, y, custom_feature_names=feature_names)

# Create a DataFrame of the SFFS results (one row per subset size)
df_SFFS_results = pd.DataFrame(sffs.subsets_).transpose()
df_SFFS_results
```


| | feature_idx | cv_scores | avg_score | feature_names |
|---|---|---|---|---|
| 1 | (12,) | [0.5441462975864797] | 0.544146 | (LSTAT,) |
| 2 | (5, 12) | [0.6385616062603403] | 0.638562 | (RM, LSTAT) |
| 3 | (5, 10, 12) | [0.678624160161311] | 0.678624 | (RM, PTRATIO, LSTAT) |
| 4 | (5, 7, 10, 12) | [0.6903077016842538] | 0.690308 | (RM, DIS, PTRATIO, LSTAT) |
| 5 | (4, 5, 7, 10, 12) | [0.7080892893529662] | 0.708089 | (NOX, RM, DIS, PTRATIO, LSTAT) |
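On this dataset the floating search returns the same subsets as plain forward selection, so the floating step never changes the outcome. To show what that step can do when it does fire, here is a simplified sketch of forward floating selection on a hand-made score table. It is not mlxtend's implementation: `floating_forward_select`, `score`, and every value in `SCORES` are hypothetical.

```python
# Hypothetical scores for subsets of four predictors (higher is better)
SCORES = {
    "a": 0.50, "b": 0.40, "c": 0.40, "d": 0.30,
    "ab": 0.60, "ac": 0.55, "ad": 0.55, "bc": 0.80, "bd": 0.50, "cd": 0.50,
    "abc": 0.82, "abd": 0.65, "acd": 0.65, "bcd": 0.90,
}

def score(subset):
    """Look up the toy score of a subset, ignoring order."""
    return SCORES["".join(sorted(subset))]

def floating_forward_select(candidates, score_fn, k):
    """Simplified forward floating selection (assumes k <= len(candidates))."""
    selected = []
    best = {}  # subset size -> (score, subset) of the best subset seen so far

    def record(subset):
        s = score_fn(subset)
        if len(subset) not in best or s > best[len(subset)][0]:
            best[len(subset)] = (s, tuple(subset))

    while len(selected) < k:
        # Forward step: add the predictor that helps most right now
        remaining = [f for f in candidates if f not in selected]
        add = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected = selected + [add]
        record(selected)
        # Floating step: drop a predictor while the reduced subset beats
        # the best subset previously seen at that smaller size
        while len(selected) > 2:
            reduced = max(
                ([x for x in selected if x != f] for f in selected),
                key=score_fn,
            )
            if score_fn(reduced) > best[len(reduced)][0]:
                selected = reduced
                record(selected)
            else:
                break
    return best[k]

print(floating_forward_select("abcd", score, k=3))  # (0.9, ('b', 'c', 'd'))
```

With these made-up scores, plain forward selection would go `a`, `ab`, `abc` and stop at 0.82. The floating step notices that dropping `a` from `abc` gives a better pair than `ab`, backtracks to `bc`, and then reaches `bcd`.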
