# Wrapper Approach - Hill Climbing
In this notebook we implement a simple feature selection procedure that follows a wrapper approach. The search algorithm, hill climbing in this case, is wrapped around the target classification/regression algorithm: every candidate feature subset is scored by training and evaluating the target model on it.

First we import the libraries that we will need.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

from sklearn import datasets
from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

Next we load the data and generate the k-fold evaluations.

In [2]:
# NOTE: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
data = datasets.load_boston()

scaler = StandardScaler()
X = scaler.fit_transform(data["data"])
y = data["target"]


number_of_variables = X.shape[1]
input_variables = data.feature_names
target_variable = 'MEDV'

seed = 1234
np.random.seed(seed)
# the hill climber draws from Python's `random` module, so seed it as well
random.seed(seed)

# let's also create a pandas data frame (it holds the raw, unscaled features, not X)
df = pd.DataFrame(data.data, columns=data.feature_names)
df['MEDV'] = y
df.head()

kfolds = KFold(10,shuffle=True,random_state=seed)
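
To make the role of `kfolds` more concrete, here is a minimal pure-Python sketch of how a shuffled 10-fold split partitions the 506 samples of the Boston dataset. The helper `kfold_indices` is hypothetical and only illustrates the idea; it is not scikit-learn's implementation and will not reproduce the shuffling of `KFold(10, shuffle=True, random_state=seed)`.

```python
import random

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train, test) index lists for a shuffled k-fold split (illustrative only)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # the first n_samples % n_splits folds receive one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(506, 10, seed=1234))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)  # [51, 51, 51, 51, 51, 51, 50, 50, 50, 50]
```

Each sample lands in exactly one test fold, so every feature subset is scored on all the data, one held-out fold at a time.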

When applying a wrapper approach, we search for the best subset of features for a target model. In this case, we will search for the best subset of features for plain linear regression.

In [3]:
def EvaluateFeatureSubsetSingleObjective(individual):
    selected_columns = []
    for i,allele in enumerate(individual):
        if (allele==1):
            selected_columns.append(df.columns[i])

    # mean cross-validated score over the folds (R^2, the default scorer for regressors)
    model = linear_model.LinearRegression()
    scores = cross_val_score(model, df[selected_columns], y, cv=kfolds)
    return scores.mean()
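
As a small worked example of the encoding used by the evaluation function, the sketch below decodes a 0/1 mask into column names. The helper `mask_to_columns` and the constant `BOSTON_FEATURES` are hypothetical names introduced for illustration; the list is assumed to match the order of `data.feature_names`.

```python
# Assumed column order of the Boston housing features (as in data.feature_names)
BOSTON_FEATURES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                   "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

def mask_to_columns(mask, names):
    """Return the names whose position in the 0/1 mask is set to 1."""
    if len(mask) != len(names):
        raise ValueError("mask and names must have the same length")
    return [name for bit, name in zip(mask, names) if bit == 1]

example_mask = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
selected = mask_to_columns(example_mask, BOSTON_FEATURES)
print(selected)  # ['INDUS', 'CHAS', 'RAD', 'TAX']
```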

## Hill Climbing

In [6]:
def HillClimbing(number_of_variables,number_of_evaluations,evaluation_function):

    # current evaluation
    evaluations = 0
    
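    # start from a random set of features (redraw in the unlikely case that none is selected)
    current_feature_subset = [random.randint(0,1) for x in range(number_of_variables)]
    while sum(current_feature_subset)==0:
        current_feature_subset = [random.randint(0,1) for x in range(number_of_variables)]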

    # that will also provide an initial evaluation of the best performance
    best_performance = evaluation_function(current_feature_subset)
    
    print("%5d\t\t%3.2f\t%s"%(evaluations,best_performance,str(current_feature_subset)))
    
    # continue until all the evaluations have been performed
    while evaluations<number_of_evaluations:
        
        # generate a neighbor candidate by flipping each bit of the current subset with probability 0.1
        perturbation = [1-x if random.random()<0.1 else x for x in current_feature_subset]

        # evaluate only if there is at least one variable
        if (sum(perturbation)>0):
            performance = evaluation_function(perturbation)

            if (performance>best_performance):
                best_performance = performance
                current_feature_subset = perturbation

        evaluations = evaluations + 1
        print("%5d\t\t%3.2f\t%s"%(evaluations,best_performance,str(current_feature_subset)))

    print("Best Feature Subset = %s "%(str(current_feature_subset)))
    print("Performance = %3.2f"%(best_performance))
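
The search loop can also be exercised without any model. The following is a minimal sketch, assuming a toy objective that rewards agreement with a hidden mask; `hill_climb`, `toy_score`, and `HIDDEN` are hypothetical names, and unlike the notebook's function this variant returns its result instead of printing it.

```python
import random

def hill_climb(n_bits, n_evaluations, evaluate, flip_probability=0.1, rng=None):
    """Stochastic hill climbing over 0/1 masks; returns (best_mask, best_score)."""
    if rng is None:
        rng = random.Random()
    # start from a random non-empty mask
    current = [rng.randint(0, 1) for _ in range(n_bits)]
    while sum(current) == 0:
        current = [rng.randint(0, 1) for _ in range(n_bits)]
    best_score = evaluate(current)

    for _ in range(n_evaluations):
        # flip each bit independently with the given probability
        candidate = [1 - b if rng.random() < flip_probability else b for b in current]
        if sum(candidate) == 0:
            continue  # skip empty subsets; the step still counts, as in the notebook
        score = evaluate(candidate)
        if score > best_score:  # accept strict improvements only
            current, best_score = candidate, score
    return current, best_score

# Toy objective (an assumption for illustration): fraction of bits matching a hidden mask
HIDDEN = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]

def toy_score(mask):
    return sum(1 for m, h in zip(mask, HIDDEN) if m == h) / len(HIDDEN)

best_mask, best_score = hill_climb(len(HIDDEN), 200, toy_score, rng=random.Random(1234))
print(best_mask, round(best_score, 2))
```

With 13 features and a flip probability of 0.1, a candidate is identical to the current subset with probability 0.9^13, roughly 0.25, so about a quarter of the evaluation budget is spent re-scoring an unchanged subset. Because only strict improvements are accepted, the search can also stall in a local optimum.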

Let's run hill-climbing for 100 evaluations. 

In [7]:
HillClimbing(number_of_variables,100,EvaluateFeatureSubsetSingleObjective)

    0		0.26	[0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
    1		0.26	[0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
    2		0.26	[0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
    3		0.26	[0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
    4		0.26	[0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
    5		0.27	[0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
    6		0.52	[1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0]
    7		0.53	[1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0]
    8		0.60	[1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
    9		0.67	[1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1]
   10		0.67	[1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1]
   11		0.67	[1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1]
   12		0.67	[1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
   13		0.67	[1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
   14		0.67	[1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
   15		0.67	[1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
   16		0.67	[1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
   17		0.67	[1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
   18		0.67	[1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
   19		0.68	

We can repeat the same process with a different target model, for instance a k-nearest-neighbour regressor with k = 5.

In [9]:
def EvaluateFeatureSubsetKNN(individual):
    selected_columns = []
    for i,allele in enumerate(individual):
        if (allele==1):
            selected_columns.append(df.columns[i])

    model = KNeighborsRegressor(5)
    scores = cross_val_score(model, df[selected_columns], y, cv=kfolds)
    return scores.mean()
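
The two evaluation functions differ only in the model they score, so the shared part can be factored out. This is a sketch of one possible refactoring, not the notebook's code: `make_subset_evaluator` and `fake_scorer` are hypothetical names, and the stand-in scorer exists only to keep the example runnable without scikit-learn.

```python
def make_subset_evaluator(column_names, score_columns):
    """Build an evaluation function that maps a 0/1 mask to a score.

    score_columns is any callable that takes a list of column names
    and returns a float (higher is better).
    """
    def evaluate(individual):
        selected = [name for bit, name in zip(individual, column_names) if bit == 1]
        if not selected:
            raise ValueError("at least one feature must be selected")
        return score_columns(selected)
    return evaluate

# In the notebook, score_columns could be (assumption, not run here):
#   lambda cols: cross_val_score(KNeighborsRegressor(5), df[cols], y, cv=kfolds).mean()
# A stand-in scorer keeps this sketch independent of scikit-learn:
def fake_scorer(cols):
    return len(cols) / 4

evaluate_demo = make_subset_evaluator(["a", "b", "c", "d"], fake_scorer)
demo_score = evaluate_demo([1, 0, 1, 0])
print(demo_score)  # 0.5
```

The returned function has the same signature as `EvaluateFeatureSubsetKNN`, so it could be passed to `HillClimbing` in the same way.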

In [11]:
HillClimbing(number_of_variables,100,EvaluateFeatureSubsetKNN)

    0		0.31	[0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0]
    1		0.32	[0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0]
    2		0.32	[0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0]
    3		0.32	[0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0]
    4		0.54	[0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
    5		0.54	[0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
    6		0.54	[0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
    7		0.56	[0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
    8		0.56	[0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
    9		0.56	[0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
   10		0.57	[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
   11		0.57	[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
   12		0.57	[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
   13		0.57	[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
   14		0.57	[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
   15		0.57	[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
   16		0.59	[1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
   17		0.59	[1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
   18		0.59	[1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
   19		0.59	

## Discussion
Note that, with k-NN, we were able to reach a better performance with far fewer features. Can we draw any insight from this result? Also note that we used the entire dataset for feature selection. However, feature selection is, as a matter of fact, similar to searching for the hyper-parameter alpha in Lasso/Ridge regression, so we shou