# Feature Selection

In the last few lectures, we learned how to use hold-out "test" sets and cross-validation to obtain reliable estimates of a model's performance on unseen data. There, the focus was on choosing a good "complexity" parameter, such as the depth of a decision tree. In this lecture, we'll instead show how to use cross-validation to estimate which columns in the data should or should not be included in a model. In practice, the best model very often does not use every column, and many machine learning researchers devote their careers to studying how to intelligently and automatically choose only the most relevant columns for a model. In the literature, this problem is usually called *feature selection*. In this lecture, we'll take a quick look at how feature selection can improve model performance.

For this demonstration, we'll switch from decision trees to logistic regression. Logistic regression is a form of regression modeling well-suited for predicting probabilities and class labels. 

Let's begin by running some familiar blocks of code, in which we load our core libraries, read in the data, split the data, and clean the data. 

In [19]:
#standard imports
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

#read in data
titanic=pd.read_csv("titanic.csv")


We are going to divide the data into training and testing sets and preprocess it by changing male/female to 1/0 and dropping the name column. This code is nearly identical to the code in earlier videos.

In [4]:
from sklearn.model_selection import train_test_split

np.random.seed(1111)
train, test = train_test_split(titanic, test_size = 0.2)

In [5]:
from sklearn import preprocessing
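def prep_titanic_data(data_df):
    # work on a copy so the caller's data frame is not modified
    df = data_df.copy()
    # encode Sex as integers (LabelEncoder sorts labels: female -> 0, male -> 1)
    le = preprocessing.LabelEncoder()
    df['Sex'] = le.fit_transform(df['Sex'])
    df = df.drop(['Name'], axis = 1)

    # separate the predictor columns from the target column
    X = df.drop(['Survived'], axis = 1)
    y = df['Survived']

    return X, y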

X_train, y_train = prep_titanic_data(train)
X_test,  y_test  = prep_titanic_data(test)

Using logistic regression is almost exactly the same as using the decision tree classifier. Let's go ahead and use cross-validation to estimate the predictive performance of the model. 
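The cross-validation step described here might look like the following sketch. The helper name `cv_accuracy`, the choice of 5 folds, and `max_iter=1000` (raised so the solver has room to converge on unscaled columns) are assumptions, not settings taken from the lecture.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_accuracy(X, y, folds=5):
    """Mean cross-validated accuracy of a logistic regression fit on X and y."""
    # max_iter=1000 is an assumed setting, chosen to give the solver room to converge
    LR = LogisticRegression(max_iter=1000)
    # cross_val_score refits a copy of LR on each fold and returns one score per fold
    return cross_val_score(LR, X, y, cv=folds).mean()

# In this notebook the call would be: cv_accuracy(X_train, y_train)
```

The interface mirrors the decision tree workflow: only the model object changes, while `cross_val_score` is used the same way.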

Is this the best we can do? If you've studied logistic regression before, you may know that using lots of columns doesn't always help: due to *multicollinearity*, the model's predictive performance can actually suffer. This is another aspect of *overfitting*. Adding more columns makes the model more flexible, and we've seen that extra flexibility is not always beneficial. So, a natural question is whether we can achieve the same (or better) model performance by using only a subset of the columns.


It's easy to train a model on a subset of the columns. For example:

In [1]:
#all the columns except fare
cols = ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
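Scoring a model on just these columns only requires indexing the data frame with the list before cross-validating. A minimal sketch, in which the function name `cv_accuracy_on_columns`, the 5 folds, and `max_iter=1000` are all assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_accuracy_on_columns(X, y, cols, folds=5):
    """Mean cross-validated accuracy using only the columns named in cols."""
    LR = LogisticRegression(max_iter=1000)  # max_iter is an assumed setting
    # X[cols] selects the listed columns of the data frame, in the listed order
    return cross_val_score(LR, X[cols], y, cv=folds).mean()

# In this notebook the call would be: cv_accuracy_on_columns(X_train, y_train, cols)
```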


Interesting! Excluding the last column (Fare) actually improved our CV score slightly. 

## Systematic Feature Selection

Now, let's write a function that will let us do this systematically. Our function will use cross-validation to avoid "peeking" at the test set. 

In [2]:
combos = [['Sex', 'Age', 'Fare'],
          ['Pclass', 'Sex', 'Age'],
          ['Pclass', 'Parents/Children Aboard'],
          ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'],
          ['Pclass', 'Sex', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']]
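One way to score every combination in a list like this is sketched below. The function name `score_combos`, the 5 folds, `max_iter=1000`, and the choice to return the results sorted are assumptions made for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def score_combos(X, y, combos, folds=5):
    """Return (columns, mean CV accuracy) pairs, best-scoring combination first."""
    results = []
    for cols in combos:
        LR = LogisticRegression(max_iter=1000)  # max_iter is an assumed setting
        # only the data passed in as X and y is used, so a held-out test set stays unseen
        score = cross_val_score(LR, X[cols], y, cv=folds).mean()
        results.append((cols, score))
    # sort by score, highest first
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# In this notebook:
# for cols, score in score_combos(X_train, y_train, combos):
#     print(round(score, 3), cols)
```

Because the output is sorted, the first pair is the best-scoring combination and the last pair is the worst.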


-  Which combo scored the best?
-  Which combo scored the worst?

Now, let's check on the test set.
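A test-set check fits the model once on the training data and scores it once on the held-out data. The sketch below assumes a hypothetical helper name, `holdout_accuracy`, and the same assumed `max_iter=1000` setting.

```python
from sklearn.linear_model import LogisticRegression

def holdout_accuracy(X_train, y_train, X_test, y_test, cols):
    """Fit on the training data and return accuracy on the test data, using only cols."""
    LR = LogisticRegression(max_iter=1000)  # max_iter is an assumed setting
    LR.fit(X_train[cols], y_train)
    # score() reports the fraction of test rows whose label is predicted correctly
    return LR.score(X_test[cols], y_test)

# In this notebook: holdout_accuracy(X_train, y_train, X_test, y_test, cols)
```

Calling it once with all six columns and once with every column except Fare gives the two numbers to compare.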

Indeed, we achieved a higher prediction score on the test set by ignoring the "Fare" column completely. 

There are a number of sophisticated algorithms for automated feature selection, but we won't go further into this topic in this course. 