# Automated Feature Selection

In this live lecture activity, we consider how to write algorithms that automatically make reasonable choices about which features to include in a machine learning model. There are many approaches to this problem; we will look at just one.

## Grab and Prepare the Titanic Data

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

In [2]:
import urllib.request  # importing urllib alone does not reliably expose urllib.request

def retrieve_data(url):
    """
    Retrieve a file from the specified url and save it in a local file
    called data.csv.
    """

    # download the raw bytes
    filedata = urllib.request.urlopen(url)
    to_write = filedata.read()
    
    # write to file
    with open("data.csv", "wb") as f:
        f.write(to_write)

retrieve_data("https://philchodrow.github.io/PIC16A/datasets/titanic.csv")
titanic = pd.read_csv("data.csv")

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
titanic["Sex"] = le.fit_transform(titanic['Sex'])
titanic = titanic.drop(["Name"], axis = 1)

X = titanic.drop(['Survived'], axis = 1)
y = titanic['Survived']

In [3]:
y

0      0
1      1
2      1
3      1
4      0
      ..
882    0
883    1
884    0
885    1
886    0
Name: Survived, Length: 887, dtype: int64

## Greedy Stagewise Feature Selection

Here's what we are going to do. We will start with one randomly chosen "active" column. Then we will repeat the following a user-specified number of times:

1. Compute the CV score of a model using only the active columns, and save it as the score to beat.
2. Propose either "activating" or "deactivating" a column (i.e. adding it to or removing it from the list of active columns), and compute the CV score of a model using the proposed columns. If that score beats the saved one, accept the proposal; otherwise, leave the active set unchanged.
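
The accept-or-reject rule above can be sketched without any machine learning at all. The following is a minimal illustration, not the implementation used later in this activity: `toy_score` is a hypothetical stand-in for a CV score, and the column names `a` through `d` are made up.

```python
import random

def toy_score(active):
    """Hypothetical stand-in for a CV score: rewards 'a' and 'b', penalizes other columns."""
    useful = {"a", "b"}
    return sum(1.0 if c in useful else -0.25 for c in active)

def greedy_sketch(columns, score, n_iters=50, seed=0):
    """Propose adding or removing one column at a time; keep only improvements."""
    rng = random.Random(seed)
    active = [rng.choice(columns)]
    best = score(active)
    history = [best]

    for _ in range(n_iters):
        inactive = [c for c in columns if c not in active]

        # propose activating a column, or deactivating one if at least two are active
        if inactive and (len(active) < 2 or rng.random() < 0.5):
            proposal = active + [rng.choice(inactive)]
        elif len(active) >= 2:
            drop = rng.choice(active)
            proposal = [c for c in active if c != drop]
        else:
            # nothing sensible to propose (only possible with a single column overall)
            break

        # accept the proposal only if it strictly improves the best score so far
        new = score(proposal)
        if new > best:
            active, best = proposal, new
        history.append(best)

    return active, history

active, history = greedy_sketch(["a", "b", "c", "d"], toy_score)
```

Because a proposal is accepted only when it strictly improves the score, `history` can never decrease. The same property holds for the CV-based version, with the caveat that a greedy search can stall at a set of columns that no single addition or removal improves.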

# Part A: Setup

In [4]:
#Let's look at the columns of X
X.columns

Index(['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard',
       'Parents/Children Aboard', 'Fare'],
      dtype='object')

In [6]:
#import logistic regression and cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

#instantiate a LR model
#the liblinear solver is a reasonable choice for small datasets like this one
LR=LogisticRegression(solver='liblinear')
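
To make the idea of "the CV score of a model using only some columns" concrete, here is a small sketch on synthetic data. The data frame `demo_X`, its column names `signal` and `noise`, and the choice of 5 folds are all assumptions made for illustration; they are not part of the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic data (hypothetical): one informative column and one pure-noise column
rng = np.random.default_rng(0)
n = 200
demo_X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo_y = (demo_X["signal"] + 0.5 * rng.normal(size=n) > 0).astype(int)

demo_model = LogisticRegression(solver="liblinear")

# indexing with a list of column names keeps a 2D data frame, which sklearn expects
scores = cross_val_score(demo_model, demo_X[["signal"]], demo_y, cv=5)
mean_score = scores.mean()
```

`cross_val_score` returns one score per fold, so `scores` has 5 entries here. Averaging them gives the single number that the selection algorithm compares between candidate sets of columns.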

In [8]:
def initialize_lists(X):
    """
    Create a list of "active" and "inactive" columns
    from X.columns
    """
    
    #single active column
    active=[np.random.choice(X.columns)]
    
    #all of the other columns are inactive
    inactive=list(X.columns)
    inactive.remove(active[0])
    
    return active, inactive
    
   
    

In [16]:
def move(col,active,inactive,mode="activate"):
    """
    Activate or deactivate a single column by moving it from inactive to active
    or from active to inactive, depending on the parameter mode
    
    Does NOT modify active or inactive -- instead it makes copies
    """
    
    #create copies
    new_active=active.copy()
    new_inactive=inactive.copy()
    
    #move col from inactive to active if mode == activate
    if mode =="activate":
        new_inactive.remove(col)
        new_active.append(col)
        
    #move col from active to inactive if mode == deactivate
    elif mode =="deactivate":
        new_active.remove(col)
        new_inactive.append(col)
    
    return new_active,new_inactive

### Illustrations

In [17]:
#If we run multiple times we get different results
active,inactive = initialize_lists(X)
active,inactive

(['Fare'],
 ['Pclass',
  'Sex',
  'Age',
  'Siblings/Spouses Aboard',
  'Parents/Children Aboard'])
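
The results differ from run to run because `np.random.choice` draws from NumPy's global random state. If a reproducible run is wanted, one option is to seed that state first. The sketch below assumes an arbitrary seed value of 16 and reuses the column names printed earlier.

```python
import numpy as np

columns = ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard',
           'Parents/Children Aboard', 'Fare']

# seeding the global state makes the next draws repeatable (the seed value is arbitrary)
np.random.seed(16)
first = np.random.choice(columns)

# re-seeding with the same value reproduces the same draw
np.random.seed(16)
second = np.random.choice(columns)
```

Here `first` and `second` are the same column. Calling `np.random.seed` once at the top of the notebook would make every later random choice repeatable as well, as long as the cells are run in the same order.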

In [18]:
#activate a column
new_active,new_inactive=move(inactive[0],active,inactive,mode="activate")
new_active,new_inactive

(['Fare', 'Pclass'],
 ['Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'])

In [19]:
#remove a column
new_active,new_inactive=move(new_active[0],new_active,new_inactive,mode="deactivate")
new_active,new_inactive

(['Pclass'],
 ['Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare'])

# Part B: Feature Selection

In [30]:
def greedy_selection(model,X,y,n_iters=10):
    """
    Greedy feature selection algorithm.
    Iteratively adds or removes columns from the set of predictor variables,
    and only keeps a change if it improves performance as measured by cross-validation.
    
    parameter model - the ML model we are using
    parameter X - predictor variables
    parameter y - target variable
    parameter n_iters - number of iterations, defaults to 10
    
    returns the list of active columns
    """
    #initialize lists and best_CV
    #best_CV starts at 0, so the first proposal is accepted whenever its score is positive
    active,inactive=initialize_lists(X)
    best_CV=0
    
    #iteratively add/remove cols
    for i in range(n_iters):
        
        #alternate between activate and deactivate
        for mode in ["activate", "deactivate"]:
            
            #only propose a move if there are enough cols:
            #need something to activate, and must never empty the active list
            if mode=="activate" and len(inactive)>=1:
                col=np.random.choice(inactive)
            elif mode=="deactivate" and len(active)>=2:
                col=np.random.choice(active)
            else:
                #no valid proposal for this mode, so skip it
                continue
                
            #build the proposed lists (move returns copies)
            new_active,new_inactive=move(col,active,inactive,mode)
            
            #determine effectiveness
            CV_score=cross_val_score(model,X[new_active],y,cv=7).mean()
            
            #accept the proposal if the new cols are better
            if CV_score>best_CV:
                best_CV=CV_score
                active=new_active
                inactive=new_inactive
                
            #display number of cols and best score after each proposal
            print("Number of cols: "+str(len(active))+". CV score: "+ str(best_CV))
            
    return active
                
            

In [31]:
cols=greedy_selection(LR,X,y,n_iters=10)

Number of cols: 2. CV score: 0.7880354241434107
Number of cols: 2. CV score: 0.7880354241434107
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425
Number of cols: 3. CV score: 0.8004178049172425


In [29]:
cols

['Age', 'Pclass', 'Sex', 'Siblings/Spouses Aboard']

In [None]:
#Could try different models

#Could have a list of models, then loop over that list and select the best model via cross-validation

#What's better than ONE good model?
#Three good models, combined into a composite model:
#train a decision tree, a logistic regression, and a support vector machine,
#then let the models vote

#Could transform the original training data to create new features - more advanced models such as neural networks do this
#e.g. X1_new = f(X0,X1)
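
The first two ideas in these notes can be sketched together: loop over several candidate models, add a voting ensemble of them, and compare everything by cross-validation. This is only an illustration under stated assumptions. The data come from `make_classification` rather than the Titanic set, and the hyperparameters (`max_depth=3`, hard voting, 5 folds) are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (hypothetical), so the sketch runs on its own
demo_X, demo_y = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)

# three individual candidate models
base_models = [
    ("logistic", LogisticRegression(solver="liblinear")),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("svc", SVC()),
]

# composite model: each base model casts one vote and the majority class wins
vote = VotingClassifier(estimators=base_models, voting="hard")

candidates = dict(base_models)
candidates["vote"] = vote

# score every candidate the same way and pick the best by mean CV score
cv_scores = {name: cross_val_score(model, demo_X, demo_y, cv=5).mean()
             for name, model in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)
```

Whether the vote beats the individual models depends on the data, so `best_name` should be read as the outcome of this comparison rather than a general ranking. Any of the candidates could also be passed as `model` to `greedy_selection` in place of `LR`.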
