# Automated Feature Selection

In this live lecture activity, we are going to consider the problem of how to write algorithms that automatically make reasonable choices about which features to include in machine learning models. There are many approaches to this problem, and we will look at just one. 

## Grab and Prepare the Titanic Data

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

In [2]:
import urllib
def retrieve_data(url):
    """
    Retrieve a file from the specified url and save it in a local file 
    called data.csv. The intended values of url are:     
    """
    
    # grab the data and parse it
    filedata = urllib.request.urlopen(url) 
    to_write = filedata.read()
    
    # write to file
    with open("data.csv", "wb") as f:
        f.write(to_write)

retrieve_data("https://philchodrow.github.io/PIC16A/datasets/titanic.csv")
titanic = pd.read_csv("data.csv")

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
titanic["Sex"] = le.fit_transform(titanic['Sex'])
titanic = titanic.drop(["Name"], axis = 1)

X = titanic.drop(['Survived'], axis = 1)
y = titanic['Survived']

In [4]:
y

0      0
1      1
2      1
3      1
4      0
      ..
882    0
883    1
884    0
885    1
886    0
Name: Survived, Length: 887, dtype: int64

## Greedy Stagewise Feature Selection

Here's what we are going to do. We will start with one randomly-chosen "active" column. Then, we will do the following a user-specified number of times: 

1. Compute the CV score of a model using only the active columns, and save it. 
2. Propose either "activating" or "deactivating" a column (i.e. adding or removing it from the list of active columns). Compute the CV score. If the CV score has improved, accept the proposal (i.e. add that column to the active set, or remove it).

## Part A: Setup

In [5]:
# import logistic regression and cross-validation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# create a logistic regression model
LR = LogisticRegression(solver = "liblinear")

In [10]:
list(X.columns).remove('Siblings/Spouses Aboard')

In [15]:
def initialize_lists():
    """
    Create an "active" list with a single random column
    from X.columns and an "inactive" list with 
    all remaining columns. 
    """
    # grab a single random column
    active = [np.random.choice(X.columns)]
    
    # make a list of all the other columns
    inactive = list(X.columns)
    inactive.remove(active[0])
    return active, inactive
    
def move(col, active, inactive, mode = "activate"):
    """
    Activate or deactivate a single column
    by moving it between the active and inactive
    lists. 
    Does not modify active or inactive -- instead 
    returns copies. 
    """
    # create copies
    new_active = active.copy()
    new_inactive = inactive.copy()
    
    if mode == "activate":
        # if we are activating a column
        new_inactive.remove(col)
        # add col to the active list
        new_active.append(col)
    
    # if we are deactivating a column
    
    if mode == "deactivate":
        new_active.remove(col)
        new_inactive.append(col)
    
    # return copies
    return new_active, new_inactive

### Illustrations

In [17]:
active, inactive = initialize_lists()
active, inactive

(['Pclass'],
 ['Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare'])

In [18]:
move("Age", active, inactive, mode = "activate")

(['Pclass', 'Age'],
 ['Sex', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare'])

## Part B: Feature Selection

In [22]:
def greedy_stagewise_feature_selection(model, X, y, n_iters = 20):
    
    # initialize with a single, randomly selected column
    active, inactive = initialize_lists()
    
    # initialize the best CV score
    best_CV = 0
    
    # main loop, n_iters times
    for i in range(n_iters):
        # alternate between activating and deactivating
        for mode in ["activate", "deactivate"]:
        
            # if mode is "activate" and there are any remaining inactive
            # columns, randomly select one. Otherwise, continue
            if (mode == "activate"):
                if len(inactive) > 0:
                    col = np.random.choice(inactive)
                else: 
                    continue
            
            # if mode is "deactivate" and if there at least 2 active
            # columns then pick a random active column
            if (mode == "deactivate") and (len(active) >= 2):
                col = np.random.choice(active)
            
            # create a new, proposed active list by moving
            # col between lists
            
            new_active, new_inactive = move(col, active, inactive, mode)
            
            # compute the CV score
            CV_score = cross_val_score(LR, X[new_active], y, cv = 7).mean()
            
            # if the CV score is an improvement, update the 
            # active and inactive column sets. 
            
            if (CV_score > best_CV) and (len(new_active) >=1):
                best_CV = CV_score
                active = new_active
                inactive = new_inactive
            
            print("Number of columns: " + str(len(active)) + ". CV score: " + str(best_CV))
    return active

In [27]:
cols = greedy_stagewise_feature_selection(LR, X, y, n_iters = 10)

Number of columns: 2. CV score: 0.7857678504472655
Number of columns: 2. CV score: 0.7857678504472655
Number of columns: 2. CV score: 0.7857678504472655
Number of columns: 2. CV score: 0.7857678504472655
Number of columns: 2. CV score: 0.7857678504472655
Number of columns: 2. CV score: 0.7857678504472655
Number of columns: 3. CV score: 0.7891602835359866
Number of columns: 3. CV score: 0.7891602835359866
Number of columns: 4. CV score: 0.7937043583837734
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score:

In [37]:
from sklearn.tree import DecisionTreeClassifier
T = DecisionTreeClassifier(max_depth = 10)

In [38]:
cols = greedy_stagewise_feature_selection(T, X, y, n_iters = 10)

Number of columns: 2. CV score: 0.614432124555859
Number of columns: 2. CV score: 0.614432124555859
Number of columns: 3. CV score: 0.6899851804238757
Number of columns: 3. CV score: 0.6899851804238757
Number of columns: 4. CV score: 0.691101112360955
Number of columns: 4. CV score: 0.691101112360955
Number of columns: 5. CV score: 0.7903387076615422
Number of columns: 5. CV score: 0.7903387076615422
Number of columns: 5. CV score: 0.7903387076615422
Number of columns: 4. CV score: 0.7925616440802042
Number of columns: 4. CV score: 0.7925616440802042
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8004178049172425
Number of columns: 3. CV score: 0.8

In [39]:
cols

['Siblings/Spouses Aboard', 'Pclass', 'Sex']