# Automated Feature Selection

In this live lecture activity, we consider how to write algorithms that automatically make reasonable choices about which features to include in a machine learning model. There are many approaches to this problem; we will look at just one.

## Grab and Prepare the Titanic Data

In [160]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

In [161]:
import urllib.request
def retrieve_data(url):
    """
    Retrieve a file from the specified url and save it in a local file 
    called data.csv. The intended values of url are:     
    """
    
    # grab the data and parse it
    filedata = urllib.request.urlopen(url) 
    to_write = filedata.read()
    
    # write to file
    with open("data.csv", "wb") as f:
        f.write(to_write)

retrieve_data("https://philchodrow.github.io/PIC16A/datasets/titanic.csv")
titanic = pd.read_csv("data.csv")

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
titanic["Sex"] = le.fit_transform(titanic['Sex'])
titanic = titanic.drop(["Name"], axis = 1)

X = titanic.drop(['Survived'], axis = 1)
y = titanic['Survived']

## Greedy Stagewise Feature Selection

Here's what we are going to do. We will start with one randomly-chosen "active" column. Then, we will do the following a user-specified number of times: 

1. Compute the CV score of a model using only the active columns, and save it. 
2. Propose either "activating" or "deactivating" a column (i.e. adding it to or removing it from the list of active columns), and compute the CV score of the proposed set. If the CV score has improved, accept the proposal; otherwise, keep the current active set.
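
The accept/reject rule in step 2 can be sketched without any machine learning at all. The snippet below is a minimal illustration under stated assumptions, not the implementation used later in this activity: `toy_score` is a hypothetical stand-in for the CV score (it rewards two made-up "useful" columns and slightly penalizes every other column), and the column names are placeholders.

```python
import random

# Hypothetical column names and scoring rule, for illustration only.
COLUMNS = ["a", "b", "c", "d", "e"]
USEFUL = {"a", "c"}

def toy_score(active):
    """Stand-in for a CV score: reward useful columns, penalize extras."""
    n_useful = len(USEFUL & set(active))
    n_extra = len(set(active) - USEFUL)
    return 0.5 + 0.2 * n_useful - 0.05 * n_extra

def greedy_sketch(n_iters=50, seed=0):
    rng = random.Random(seed)
    # start with one randomly chosen active column
    active = [rng.choice(COLUMNS)]
    best = toy_score(active)
    history = [best]
    for _ in range(n_iters):
        inactive = [c for c in COLUMNS if c not in active]
        # propose activating or deactivating one column,
        # never proposing an empty active set
        if inactive and (len(active) == 1 or rng.random() < 0.5):
            proposal = active + [rng.choice(inactive)]
        else:
            drop = rng.choice(active)
            proposal = [c for c in active if c != drop]
        # accept only if the score strictly improves
        score = toy_score(proposal)
        if score > best:
            active, best = proposal, score
        history.append(best)
    return active, best, history

active, best, history = greedy_sketch()
print(sorted(active), round(best, 2))
```

Because a proposal is accepted only when it strictly improves the score, the recorded best score can never decrease from one iteration to the next.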

## Part A: Setup

In [186]:
# import logistic regression and cross-validation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# create a logistic regression model
LR = LogisticRegression(solver = "liblinear")

In [163]:
def initialize_lists():
    """
    Create an "active" list with a single, randomly chosen
    column from X.columns and an "inactive" list with 
    all remaining columns. 
    """
    active = [np.random.choice(X.columns)]
    inactive = list(X.columns)
    inactive.remove(active[0])
    return active, inactive

def move(col, active, inactive, mode = "activate"):
    """
    Activate or deactivate a single column
    by moving it between the active and inactive
    lists. 
    Does not modify active or inactive -- instead 
    returns copies. 
    """
    # create copies
    new_active = active.copy()
    new_inactive = inactive.copy()
    
    # if we are activating a column
    if mode == "activate":
        new_active.append(col)
        new_inactive.remove(col)
    # if we are deactivating a column
    else:
        new_active.remove(col)
        new_inactive.append(col)
    
    # return copies
    return new_active, new_inactive    

### Illustrations

In [187]:
active, inactive = initialize_lists()
print(active, inactive)

['Sex'] ['Pclass', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']


In [188]:
move("Age", active, inactive, mode = "activate")

(['Sex', 'Age'],
 ['Pclass', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare'])
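
Because `move` returns copies, a rejected proposal needs no undo step: the original lists are simply kept. The sketch below shows that copy-then-modify pattern in isolation. `propose_activation` is a hypothetical helper that mirrors only the "activate" branch of `move`, and the lists are placeholder data rather than the lists printed above.

```python
def propose_activation(col, active, inactive):
    """Hypothetical helper: return new lists with col activated."""
    new_active = active.copy()      # shallow copies protect the caller's lists
    new_inactive = inactive.copy()
    new_active.append(col)
    new_inactive.remove(col)
    return new_active, new_inactive

active = ["Sex"]
inactive = ["Pclass", "Age", "Fare"]

new_active, new_inactive = propose_activation("Age", active, inactive)

print(new_active, new_inactive)   # the proposal
print(active, inactive)           # the originals, unchanged
```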

## Part B: Feature Selection

In [177]:
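def greedy_stagewise_feature_selection(model, X, y, n_iters = 20):
    """
    Greedily search for a set of columns of X with a high
    5-fold cross-validation score. Each iteration proposes 
    activating one column and deactivating one column, and 
    accepts a proposal only if it improves the best CV score 
    found so far. Returns the list of active columns. 
    """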
    # initialize with a single, randomly selected column
    active, inactive = initialize_lists()
    
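    # initialize the best CV score with the score of the 
    # starting column, so that later proposals must beat it
    best_CV = cross_val_score(model, X[active], y, cv = 5).mean()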
    
    # main loop, n_iters times
    
    for i in range(n_iters):
        # alternate between activating and deactivating
        for mode in ["activate", "deactivate"]:
            # if mode is "activate" and there are any remaining inactive
            # columns, randomly select one. Otherwise, continue
            if mode == "activate":
                if len(inactive) > 0:
                    col = np.random.choice(inactive)
                else:
                    continue
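            # if mode is "deactivate" and more than one column is 
            # active, pick a random active column. Otherwise, continue, 
            # since a model cannot be fit on zero columns
            else:
                if len(active) > 1:
                    col = np.random.choice(active)
                else:
                    continue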
            
            # create a new, proposed active set by moving
            # col between sets
            new_active, new_inactive = move(col, active, inactive, mode)
            
            # compute the CV score of the proposed active set, 
            # using the model passed to this function
            CV_score = cross_val_score(model, X[new_active], y, cv = 5).mean()
            
            # if the CV score is an improvement, update the 
            # active and inactive column sets. 
            if (CV_score > best_CV) and (len(new_active) >= 1):
                best_CV = CV_score
                active = new_active
                inactive = new_inactive
        # report progress at the end of each iteration
        print("Number of columns: " + str(len(active)) + ". CV score: " + str(best_CV))
    return active    

In [185]:
cols = greedy_stagewise_feature_selection(LR, X, y, n_iters = 20)

Number of columns: 2. CV score: 0.6629340443090206
Number of columns: 2. CV score: 0.6753380308512665
Number of columns: 3. CV score: 0.6754015108233352
Number of columns: 4. CV score: 0.7925411032819145
Number of columns: 4. CV score: 0.7925411032819145
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.79817812480