# Automated Feature Selection

In this live lecture activity, we are going to consider the problem of how to write algorithms that automatically make reasonable choices about which features to include in machine learning models. There are many approaches to this problem, and we will look at just one. 

# How do we pick which columns?
(For the project, you are required to use one qualitative and two quantitative columns.)

Idea 1: Try different combos of columns which you think are interesting based on exploratory data analysis

Idea 2: Try every possible combo of columns (that meets the requirements)

Idea 2 is good because it avoids the chance of overlooking something. Idea 2 is bad because on more complex data sets there could be $N$ columns with $N \gg 1$, and the number of possible combinations of three columns would be $\binom{N}{3} = \frac{N(N-1)(N-2)}{6}$, which is on the order of $N^3$. That is massive!
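
To get a feel for how fast this grows, here is a quick sketch using Python's built-in `math.comb`. The helper name `count_triples` and the example values of $N$ are made up for illustration, and the count ignores the qualitative/quantitative requirement.

```python
from math import comb

def count_triples(n_columns):
    """Number of ways to choose 3 columns out of n_columns."""
    return comb(n_columns, 3)

# Compare the exact count with the rough N^3 / 6 estimate
for n_columns in [6, 50, 1000]:
    exact = count_triples(n_columns)
    rough = n_columns**3 / 6
    print(n_columns, exact, round(rough))
```

With only 6 columns there are just 20 combos to try, but with 1000 columns there are over 166 million, and each one would need its own CV score.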

Idea 1 has two advantages: it is faster, and the result is possibly more interpretable.



In today's lecture, we will do something in the middle. Our idea is as follows:

1) Pick one column at random and calculate a CV score.

2) Pick a new column at random, add it in, and see if the CV score improves. If so, keep it (otherwise don't keep it).

3) Pick a random `active` column and remove it. See if the CV score improves; if so, keep it gone (otherwise leave it in).

4) Repeat 2 and 3 over and over a specified number of times 

## Grab and Prepare the Titanic Data

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

In [2]:
import urllib.request
def retrieve_data(url):
    """
    Retrieve a file from the specified url and save it in a local file 
    called data.csv. The intended values of url are:     
    """
    
    # grab the data and parse it
    filedata = urllib.request.urlopen(url) 
    to_write = filedata.read()
    
    # write to file
    with open("data.csv", "wb") as f:
        f.write(to_write)

retrieve_data("https://philchodrow.github.io/PIC16A/datasets/titanic.csv")
titanic = pd.read_csv("data.csv")

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
titanic["Sex"] = le.fit_transform(titanic['Sex'])
titanic = titanic.drop(["Name"], axis = 1)

X = titanic.drop(['Survived'], axis = 1)
y = titanic['Survived']

## Greedy Stagewise Feature Selection

Here's what we are going to do. We will start with one randomly-chosen "active" column. Then, we will do the following a user-specified number of times: 

1. Compute the CV score of a model using only the active columns, and save it. 
2. Propose either "activating" or "deactivating" a column (i.e. adding or removing it from the list of active columns). Compute the CV score. If the CV score has improved, accept the proposal (i.e. add that column to the active set, or remove it).

# Part A: Setup

In [1]:
#Let's look at the columns of X
X.columns

In [6]:
#import Logistic Regression and Cross-val score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

#instantiate a LR model
LR=LogisticRegression(solver="liblinear")

In [14]:
def initialize_lists(X):

    active = [np.random.choice(X.columns)]

    inactive=list(X.columns)
    inactive.remove(active[0])

    return active,inactive
active,inactive=initialize_lists(X)
active,inactive

(['Siblings/Spouses Aboard'],
 ['Pclass', 'Sex', 'Age', 'Parents/Children Aboard', 'Fare'])

In [16]:
def move(col,active,inactive,mode="activate"):
    
    new_active=active.copy()
    new_inactive=inactive.copy()
    
    if mode=="activate":
        new_inactive.remove(col)
        new_active.append(col)
    
    if mode =="deactivate":
        new_active.remove(col)
        new_inactive.append(col)
    
    return new_active,new_inactive




### Illustrations

In [18]:
#If we run multiple times we get different results
active,inactive=initialize_lists(X)
active,inactive

(['Siblings/Spouses Aboard'],
 ['Pclass', 'Sex', 'Age', 'Parents/Children Aboard', 'Fare'])

In [19]:
#activate a column
active,inactive = move("Age",active,inactive,"activate")
active,inactive

(['Siblings/Spouses Aboard', 'Age'],
 ['Pclass', 'Sex', 'Parents/Children Aboard', 'Fare'])

In [20]:
#remove a column
active,inactive = move("Siblings/Spouses Aboard",active,inactive,"deactivate")
active,inactive

(['Age'],
 ['Pclass',
  'Sex',
  'Parents/Children Aboard',
  'Fare',
  'Siblings/Spouses Aboard'])

# Part B: Feature Selection

In [37]:
def greedy_selection(model,X,y,n_iter=20):

    active,inactive=initialize_lists(X)
    best_CV=0
    
    for i in range(n_iter):
        for mode in ["activate","deactivate"]:
            new_active=active.copy()
            new_inactive=inactive.copy()
            
            #active refers to a list, and lists are mutable objects,
            #so if you just wrote new_active=active,
            #then appending things to new_active would also append them to
            #the original list active
            
            #we don't want this, because we only want to append if the activation is an improvement
            
            #if you are still a bit confused: when dealing with a mutable object such as a list or a dataframe,
            #when in doubt, make a copy if you don't want to change your original by accident

        
            if mode=="activate" and len(inactive)>=1:
                col=np.random.choice(inactive)
                new_active,new_inactive=move(col,active,inactive,mode)
            
            elif mode=="deactivate" and len(active)>2:
                col=np.random.choice(active)
                new_active,new_inactive=move(col,active,inactive,mode)
            
            CV_score=cross_val_score(model,X[new_active],y,cv=5).mean()
        
            if (CV_score>best_CV) and (len(new_active)>=1):
                best_CV=CV_score
                active=new_active
                inactive=new_inactive
            
            print("Number of columns: " + str(len(active)) + ". CV score: " + str(best_CV))
    return active
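
The comments inside `greedy_selection` about copying lists can be checked with a tiny standalone sketch. The variable names `active_1`, `active_2`, and `alias` are made up for this illustration and are not part of the function above.

```python
# Case 1: plain assignment gives a second name for the SAME list
active_1 = ["Age", "Fare"]
alias = active_1
alias.append("Sex")
print(active_1)    # the original list changed too

# Case 2: .copy() builds a new list, so the original is left alone
active_2 = ["Age", "Fare"]
new_active = active_2.copy()
new_active.append("Sex")
print(active_2)    # unchanged
print(new_active)  # has the extra column
```

This is why the loop works on copies: a proposal should only change `active` after we have checked that the CV score improved.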

In [36]:
active=greedy_selection(LR,X,y)
active

Number of columns: 2. CV score: 0.6178251761569226
Number of columns: 2. CV score: 0.6178251761569226
Number of columns: 3. CV score: 0.7880149812734083
Number of columns: 3. CV score: 0.7880149812734083
Number of columns: 4. CV score: 0.7970418333015934
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.798178124801625
Number of columns: 3. CV score: 0.79817812480

['Siblings/Spouses Aboard', 'Sex', 'Pclass']

# Why is this not the right method for your project?

1. This method is not guaranteed to produce three columns.

2. It is also not guaranteed to select a qualitative column.
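
For comparison, here is a rough sketch of what Idea 2 could look like when the requirement is built in: list every combo of one qualitative and two quantitative columns. The split of the Titanic columns into `qualitative` and `quantitative` below is an assumption made only for illustration, and `valid_combos` is a hypothetical helper name.

```python
from itertools import combinations

# ASSUMPTION: this split into qualitative / quantitative is for illustration only
qualitative = ["Sex", "Pclass"]
quantitative = ["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"]

def valid_combos(qualitative, quantitative):
    """Return every list of one qualitative column plus two quantitative columns."""
    combos = []
    for qual in qualitative:
        for quant_pair in combinations(quantitative, 2):
            combos.append([qual] + list(quant_pair))
    return combos

combos = valid_combos(qualitative, quantitative)
print(len(combos))
for combo in combos[:3]:
    print(combo)
```

Each combo could then be scored the same way as in Part B, for example with `cross_val_score(LR, X[combo], y, cv=5).mean()`, keeping whichever combo scores best. On a small data set like this one the exhaustive search is cheap; it only becomes a problem when the number of columns is large.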