# Automated Feature Selection

In this live lecture activity, we are going to consider the problem of how to write algorithms that automatically make reasonable choices about which features to include in machine learning models. There are many approaches to this problem, and we will look at just one. 

## Grab and Prepare the Titanic Data

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

In [2]:
import urllib.request
def retrieve_data(url):
    """
    Retrieve a file from the specified url and save it in a local file 
    called data.csv.
    """
    
    # grab the raw bytes from the url
    filedata = urllib.request.urlopen(url) 
    to_write = filedata.read()
    
    # write to file
    with open("data.csv", "wb") as f:
        f.write(to_write)

retrieve_data("https://philchodrow.github.io/PIC16A/datasets/titanic.csv")
titanic = pd.read_csv("data.csv")

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
titanic["Sex"] = le.fit_transform(titanic['Sex'])
titanic = titanic.drop(["Name"], axis = 1)

X = titanic.drop(['Survived'], axis = 1)
y = titanic['Survived']

In [4]:
y

0      0
1      1
2      1
3      1
4      0
      ..
882    0
883    1
884    0
885    1
886    0
Name: Survived, Length: 887, dtype: int64

## Greedy Stagewise Feature Selection

Here's what we are going to do. We will start with one randomly chosen "active" column. Then, we will repeat the following a user-specified number of times: 

1. Compute the cross-validation (CV) score of a model using only the active columns, and save it. 
2. Propose either "activating" or "deactivating" a randomly chosen column (i.e. adding it to or removing it from the list of active columns). Compute the CV score of a model using the proposed columns. If that score is better than the saved one, accept the proposal (i.e. add that column to the active set, or remove it); otherwise, leave the active set as it was.
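
The two steps above can be sketched without any real data. In the minimal sketch below, `toy_score` is a hypothetical stand-in for the CV score (it rewards the made-up columns `"a"` and `"b"` and slightly penalizes every extra column), and `greedy_toy` is an illustrative name, not the function we write later in this activity.

```python
import random

def toy_score(columns):
    # Hypothetical stand-in for a CV score: +1 for each "useful" column,
    # minus a small penalty of 0.1 for every column used.
    useful = {"a", "b"}
    return len(useful & set(columns)) - 0.1 * len(columns)

def greedy_toy(all_columns, score, n_iters=200, seed=0):
    rng = random.Random(seed)

    # start with one randomly chosen active column
    active = [rng.choice(all_columns)]
    inactive = [c for c in all_columns if c not in active]

    # step 1: score the current active set and save it
    best = score(active)

    for _ in range(n_iters):
        can_add = len(inactive) >= 1
        can_drop = len(active) >= 2   # never deactivate the last column
        if not (can_add or can_drop):
            break

        # step 2: propose activating or deactivating one random column
        if can_add and (not can_drop or rng.random() < 0.5):
            col = rng.choice(inactive)
            new_active = active + [col]
            new_inactive = [c for c in inactive if c != col]
        else:
            col = rng.choice(active)
            new_active = [c for c in active if c != col]
            new_inactive = inactive + [col]

        # accept the proposal only if the score strictly improves
        new_score = score(new_active)
        if new_score > best:
            active, inactive, best = new_active, new_inactive, new_score

    return active, best

selected, best = greedy_toy(["a", "b", "c", "d"], toy_score)
print(selected, best)
```

Because a proposal is only accepted when it strictly improves the saved score, the saved score never decreases. With this toy score and enough iterations, the search should settle on the two useful columns.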

# Part A: Setup

In [7]:
#Let's look at the columns of X
X.columns

Index(['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard',
       'Parents/Children Aboard', 'Fare'],
      dtype='object')

In [8]:
#import logistic regression and cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

#instantiate a LR model
#the liblinear solver is a good choice for small datasets like this one
LR=LogisticRegression(solver="liblinear")

In [9]:
def initialize_lists(X):
    """
    Create lists of active and inactive columns
    from X.columns
    """
    
    #start with single active column
    active = [np.random.choice(X.columns)]
    
    #all other columns will be inactive
    inactive=list(X.columns)
    inactive.remove(active[0])
    
    return active, inactive
    

In [10]:
def move(col, active, inactive, mode="activate"):
    """
    Activate or deactivate a column
    by moving it between the active and inactive lists,
    depending on mode
    """
    
    #work on copies so that the caller's lists are not modified
    new_active=active.copy()
    new_inactive=inactive.copy()
    
    #note: we do not check that col is in the list it is removed from,
    #so list.remove raises a ValueError if it is not (e.g. an empty list)
    if mode == "activate":
        new_inactive.remove(col)
        new_active.append(col)
        
    elif mode == "deactivate":
        new_active.remove(col)
        new_inactive.append(col)
        
    return new_active, new_inactive
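
The function above assumes that `col` is present in the list it is being moved out of. One possible way to guard against that is sketched below. The name `safe_move`, and the choice to return unchanged copies when a move is not possible, are assumptions made for illustration; they are not part of the lecture code.

```python
def safe_move(col, active, inactive, mode="activate"):
    """
    Hypothetical guarded variant of move(): returns unchanged copies
    instead of raising when the requested move is not possible.
    """
    if mode not in ("activate", "deactivate"):
        raise ValueError("mode must be 'activate' or 'deactivate'")

    # work on copies so the caller's lists are never modified
    new_active = active.copy()
    new_inactive = inactive.copy()

    if mode == "activate" and col in new_inactive:
        new_inactive.remove(col)
        new_active.append(col)

    # only deactivate if at least one column would remain active
    elif mode == "deactivate" and col in new_active and len(new_active) >= 2:
        new_active.remove(col)
        new_inactive.append(col)

    return new_active, new_inactive

# nothing to activate: the lists come back unchanged
print(safe_move("Age", ["Sex"], [], mode="activate"))

# refusing to deactivate the only active column
print(safe_move("Sex", ["Sex"], ["Age"], mode="deactivate"))
```

Returning unchanged copies keeps the calling loop simple, since a proposal that changes nothing cannot improve the score and is therefore never accepted.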
        

### Illustrations

In [13]:
#If we run multiple times we get different results
active,inactive=initialize_lists(X)
active,inactive

(['Siblings/Spouses Aboard'],
 ['Pclass', 'Sex', 'Age', 'Parents/Children Aboard', 'Fare'])
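
Because the starting column is drawn at random, repeated runs give different lists. If a repeatable starting point is wanted, one option is to draw from a seeded generator. The sketch below is a standalone illustration: the function name `initialize_lists_seeded` and the `seed` argument are hypothetical, and it takes a plain list of column names rather than the data frame.

```python
import numpy as np

# column names copied from X.columns above
columns = ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard',
           'Parents/Children Aboard', 'Fare']

def initialize_lists_seeded(columns, seed):
    """
    Hypothetical seeded variant of initialize_lists():
    the same seed always gives the same starting column.
    """
    rng = np.random.default_rng(seed)   # independent, reproducible generator
    start = str(rng.choice(columns))    # convert the numpy string to a plain str
    rest = [c for c in columns if c != start]
    return [start], rest

first_run = initialize_lists_seeded(columns, seed=16)
second_run = initialize_lists_seeded(columns, seed=16)
print(first_run == second_run)
```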

In [14]:
#activate a column
new_active,new_inactive=move(inactive[0],active,inactive,mode="activate")
new_active,new_inactive

(['Siblings/Spouses Aboard', 'Pclass'],
 ['Sex', 'Age', 'Parents/Children Aboard', 'Fare'])

In [15]:
#remove a column
new_active,new_inactive=move(active[0],new_active,new_inactive,mode="deactivate")
new_active,new_inactive

(['Pclass'],
 ['Sex', 'Age', 'Parents/Children Aboard', 'Fare', 'Siblings/Spouses Aboard'])

# Part B: Feature Selection

In [16]:
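def greedy_selection(model,X,y,n_iters=20):
    """
    Greedy stagewise feature selection: repeatedly propose activating
    or deactivating a random column, and keep the change only if the
    mean CV score improves. Returns the final list of active columns.
    """
    
    active,inactive=initialize_lists(X)
    print(active,inactive)
    
    #score the initial active set so that the first proposal has a baseline to beat
    best_CV=cross_val_score(model,X[active],y,cv=7).mean()
    
    for i in range(n_iters):
        
        for mode in ['activate','deactivate']:
            
            #by default, the proposal changes nothing
            new_active=active.copy()
            new_inactive=inactive.copy()
            
            #activate only if some column is currently inactive
            if mode == "activate" and len(inactive)>=1:
                col=np.random.choice(inactive)
                new_active,new_inactive=move(col,active,inactive,mode)
                
            #deactivate only if at least one column would remain active
            elif mode == "deactivate" and len(active)>=2:
                col=np.random.choice(active)
                new_active,new_inactive=move(col,active,inactive,mode)
                
            CV_score=cross_val_score(model,X[new_active],y,cv=7).mean()
            
            #accept the proposal only if it improves on the best score so far
            if (CV_score>best_CV) and (len(new_active)>=1):
                best_CV=CV_score
                active=new_active
                inactive=new_inactive
                
            print("Number of columns: " + str(len(active)) + ". CV score: " + str(best_CV))
            
            
    return active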

In [19]:
cols=greedy_selection(LR,X,y,n_iters=20)

['Fare'] ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
Number of columns: 2. CV score: 0.6629171353580803
Number of columns: 1. CV score: 0.662926062813577
Number of columns: 1. CV score: 0.662926062813577
Number of columns: 1. CV score: 0.662926062813577
Number of columns: 2. CV score: 0.6753620083203886
Number of columns: 2. CV score: 0.6753620083203886
Number of columns: 3. CV score: 0.7880532790544039
Number of columns: 3. CV score: 0.7880532790544039
Number of columns: 3. CV score: 0.7880532790544039
Number of columns: 3. CV score: 0.7880532790544039
Number of columns: 3. CV score: 0.7880532790544039
Number of columns: 3. CV score: 0.7880532790544039
Number of columns: 4. CV score: 0.7891870659024764
Number of columns: 4. CV score: 0.7891870659024764
Number of columns: 4. CV score: 0.7891870659024764
Number of columns: 4. CV score: 0.7891870659024764
Number of columns: 5. CV score: 0.7948202903208527
Number of columns: 5. CV score: 0.79482029032085

In [20]:
cols

['Siblings/Spouses Aboard', 'Sex', 'Pclass']