# Feature Selection Lab

In this lab we will explore feature selection on the [Titanic Dataset](https://www.kaggle.com/c/titanic/data).

We encourage you to conduct EDA across the data before building a logistic regression to determine whether or not a given individual survived. 

You'll then experiment with various feature selection techniques to improve your performance. You'll need the sklearn documentation: http://scikit-learn.org/stable/modules/feature_selection.html

In [282]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt


## 1. Import the data and EDA

We'll be working with the Titanic dataset: go ahead and import it from the "assets" folder. While you're at it, take some time to do EDA and see what the data looks like!
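As a rough illustration of the kind of EDA meant here, the sketch below counts missing values per column and compares survival rates between groups. It runs on a tiny made-up `sample` frame that only borrows the Titanic column names; the values are hypothetical, not taken from `train.csv`.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample that reuses the Titanic column names (values are made up).
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 54.0],
    "Cabin": [None, "C85", None, "C123", None, None],
})

# Count missing values per column to see which columns need cleaning.
missing_counts = sample.isnull().sum()
print(missing_counts)

# Survival rate per group gives a first hint of which features are informative.
survival_by_sex = sample.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```

The same two calls can be applied to `train` once it is loaded.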

In [283]:
train = pd.read_csv('train.csv')

In [284]:
train.shape

(891, 12)

In [285]:
train.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891,891.0,204,889
unique,,,,891,2,,,,681,,147,3
top,,,,"Graham, Mr. George Edward",male,,,,CA. 2343,,C23 C25 C27,S
freq,,,,1,577,,,,7,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,,1.0,0.0,,31.0,,


`Cabin` is not among the features I plan to use (it is present for only 204 of the 891 passengers), so I drop that column and then clean the remaining data.

In [286]:
# Keep an untouched copy of the raw data before cleaning
dataset_org = train.copy()

In [287]:
train.drop('Cabin', axis=1, inplace=True)

In [288]:
train.head(2)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C


In [290]:
train = train.dropna()

train.shape


(712, 11)

In [291]:

new_data  = train.loc[train['Age'] > 0]

new_data.shape

(712, 11)

In [292]:
new_data  = new_data.loc[new_data['Fare'] > 0]

new_data.shape


(705, 11)

In [293]:
mask = new_data['Embarked'].isin(['S', 'C','Q'])
new_data2 = new_data[mask]

new_data2.shape

(705, 11)

In [294]:
# Filter the already-filtered frame so the Embarked check is not discarded
mask = new_data2['Sex'].isin(['male', 'female'])
new_data2 = new_data2[mask]

new_data2.shape

(705, 11)

In [295]:

# Keep chaining on new_data2 so the earlier filters still apply
mask = new_data2['Pclass'].isin([1, 2, 3])
new_data2 = new_data2[mask]

new_data2.shape

(705, 11)

In [296]:
new_data2

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,S
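The cleaning steps above can also be written as one function that combines every validity check into a single boolean mask, which avoids losing a filter between cells. This is only a sketch: `clean_passengers` and the `toy` frame are hypothetical names and values introduced for illustration.

```python
import numpy as np
import pandas as pd

def clean_passengers(df):
    """Drop Cabin, drop rows with missing values, and keep only valid rows."""
    cleaned = df.drop("Cabin", axis=1).dropna()
    # Combine every validity check into one mask so no filter is lost.
    valid = (
        (cleaned["Age"] > 0)
        & (cleaned["Fare"] > 0)
        & cleaned["Embarked"].isin(["S", "C", "Q"])
        & cleaned["Sex"].isin(["male", "female"])
        & cleaned["Pclass"].isin([1, 2, 3])
    )
    return cleaned[valid]

# Hypothetical rows: one has a missing Age, one has a zero Fare.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 40.0],
    "Fare": [7.25, 71.2833, 7.925, 0.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "S"],
})
cleaned_toy = clean_passengers(toy)
print(cleaned_toy.shape)
```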


## 2. Feature selection

Let's use the `SelectKBest` class in scikit-learn to find the top 5 features.

- What are the top 5 features for `Xt`?

=> store them in a variable called `kbest_columns`
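One way to get the names of the selected columns, rather than just the transformed array, is the selector's `get_support()` mask. The sketch below assumes synthetic data with made-up column names (`signal_a`, `noise_a`, and so on) and uses `k=2` only to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: two columns track the target closely, two are pure noise.
rng = np.random.RandomState(0)
n = 200
target = rng.randint(0, 2, size=n)
features = pd.DataFrame({
    "signal_a": target + rng.normal(scale=0.1, size=n),
    "signal_b": -target + rng.normal(scale=0.1, size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

# Fit once, then use the boolean support mask to look up the column names.
selector = SelectKBest(f_classif, k=2).fit(features, target)
kbest_names = list(features.columns[selector.get_support()])
print(kbest_names)
```

`selector.scores_` holds the per-column scores in case a full ranking is wanted.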

Create dummy variables for the categorical columns.

In [297]:
new_data2=pd.concat([new_data2, pd.get_dummies(new_data2['Sex'])], axis=1)
new_data2=pd.concat([new_data2, pd.get_dummies(new_data2['Embarked'])], axis=1)
new_data2=pd.concat([new_data2, pd.get_dummies(new_data2['Parch'])], axis=1)
new_data2=pd.concat([new_data2, pd.get_dummies(new_data2['SibSp'])], axis=1)
new_data2=pd.concat([new_data2, pd.get_dummies(new_data2['Pclass'])], axis=1)

In [298]:
new_data2.shape

(705, 32)

In [299]:
col_name = ['PassengerId', 'Survived', 'Pclass', 'Name','Sex', 'Age','SibSp','Parch','Ticket','Fare','Embarked','female',
              'male','Embarked_C','Embarked_Q','Embarked_S','Parch_0','Parch_1','Parch_2','Parch_3','Parch_4','Parch_5', 'Parch_6','SibSp_0','SibSp_1','SibSp_2','SibSp_3',
            'SibSp_4','SibSp_5','Pclass_1','Pclass_2','Pclass_3']

len(col_name)

32

In [301]:
# Reuse the list defined above instead of repeating all 32 names
new_data2.columns = col_name

In [302]:
new_data2.columns

Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age',
       u'SibSp', u'Parch', u'Ticket', u'Fare', u'Embarked', u'female', u'male',
       u'Embarked_C', u'Embarked_Q', u'Embarked_S', u'Parch_0', u'Parch_1',
       u'Parch_2', u'Parch_3', u'Parch_4', u'Parch_5', u'Parch_6', u'SibSp_0',
       u'SibSp_1', u'SibSp_2', u'SibSp_3', u'SibSp_4', u'SibSp_5', u'Pclass_1',
       u'Pclass_2', u'Pclass_3'],
      dtype='object')
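The manual renaming above can be avoided by letting pandas add the prefixes itself. A minimal sketch, assuming a small hypothetical `toy` frame: passing `columns=` to `pd.get_dummies` encodes those columns with `<column>_<value>` names. Unlike the `pd.concat` approach used in this notebook, it also removes the original categorical columns.

```python
import pandas as pd

# Hypothetical rows that reuse three of the Titanic categorical columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Pclass": [3, 1, 2],
    "Age": [22.0, 38.0, 26.0],
})

# Each listed column is replaced by prefixed indicator columns.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked", "Pclass"])
print(list(encoded.columns))
```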

These are the independent variables.

In [102]:
feature_cols = ['female', 'Age', 'Fare']
X = new_data2[feature_cols]
# store response vector in "y"
y = new_data2['Survived']
# check X's type
print(type(X))

# check y's type
print(type(y))
print(type(y.values))


<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<type 'numpy.ndarray'>


In [109]:
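from sklearn.feature_selection import SelectKBest, chi2

# chi2 needs non-negative features; female, Age and Fare all qualify
selector = SelectKBest(chi2, k=2).fit(X, y)
X_new = selector.transform(X)
# store the names of the selected columns rather than the transformed array
kbest_columns = X.columns[selector.get_support()]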

In [108]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV

# STEP 1: split X and y into training and testing sets (using random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

# STEP 2: train an L2-penalised logistic regression; C is picked from 15 candidates by 5-fold CV
logreg_cv = LogisticRegressionCV(solver='liblinear', Cs=15, cv=5, penalty='l2')
logreg_cv.fit(X_train, y_train)

# STEP 3: test the model on the testing set, and check the accuracy
y_pred_class = logreg_cv.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

0.802259887006


In [221]:
new_data2.shape

(705, 32)

In [225]:
# One dummy per categorical variable is left out (female, Embarked_C, Parch_0, SibSp_0, Pclass_1)
feature_cols = ['Age', 'Fare', 'male', 'Embarked_Q', 'Embarked_S',
                'Parch_1', 'Parch_2', 'Parch_3', 'Parch_4', 'Parch_5', 'Parch_6',
                'SibSp_1', 'SibSp_2', 'SibSp_3', 'SibSp_4', 'SibSp_5',
                'Pclass_2', 'Pclass_3']
X = new_data2[feature_cols]
# store response vector in "y"
y = new_data2['Survived']
# check X's type
print(type(X))

# check y's type
print(type(y))
print(type(y.values))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<type 'numpy.ndarray'>


In [226]:
len(feature_cols)

18

In [228]:

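from sklearn.feature_selection import SelectKBest

# Re-split so that X_train holds the 18 columns chosen above (same seed as the model cell below)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

# Fit the selector once (default score function: f_classif) and reuse it
kbest_columns_all_variable = SelectKBest(k=5).fit(X_train, y_train)
X_new = kbest_columns_all_variable.transform(X_train)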



In [229]:
kbest_columns_all_variable

SelectKBest(k=5, score_func=<function f_classif at 0x115015aa0>)

In [230]:
X_new

array([[   7.8958,    1.    ,    0.    ,    0.    ,    1.    ],
       [  15.85  ,    1.    ,    0.    ,    0.    ,    1.    ],
       [  30.    ,    1.    ,    0.    ,    0.    ,    0.    ],
       ..., 
       [  13.    ,    1.    ,    0.    ,    0.    ,    0.    ],
       [ 151.55  ,    0.    ,    0.    ,    1.    ,    0.    ],
       [  35.5   ,    1.    ,    0.    ,    0.    ,    0.    ]])

In [231]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV

# STEP 1: split X and y into training and testing sets (using random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

# STEP 2: train an L2-penalised logistic regression; C is picked from 15 candidates by 5-fold CV
logreg_cv = LogisticRegressionCV(solver='liblinear', Cs=15, cv=5, penalty='l2')
logreg_cv.fit(X_train, y_train)

# STEP 3: test the model on the testing set, and check the accuracy
y_pred_class = logreg_cv.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

0.80790960452


## 3. Recursive Feature Elimination

scikit-learn also offers recursive feature elimination with cross-validation as a class named `RFECV`. Use it in combination with a logistic regression model to see which features this method would keep.

=> store them in a variable called `rfecv_columns`
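A minimal sketch of how the kept feature names might be read off a fitted `RFECV`, assuming synthetic data from `make_classification` and hypothetical column names `f0` to `f7`. The boolean `support_` attribute marks the columns that survive elimination.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data (8 features, 3 of them informative).
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f%d" % i for i in range(8)])

# Drop one feature per step and keep the subset size with the best CV accuracy.
rfecv_demo = RFECV(estimator=LogisticRegression(solver="liblinear"),
                   step=1, cv=5, scoring="accuracy")
rfecv_demo.fit(X_demo, y_demo)

# support_ is a boolean mask over the input columns.
rfecv_names = list(X_demo.columns[rfecv_demo.support_])
print(rfecv_demo.n_features_, rfecv_names)
```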

In [232]:
from sklearn.feature_selection import RFECV

# Create the RFE object and compute a cross-validated score.
logreg_cv = LogisticRegressionCV(solver='liblinear', Cs=15, cv=5, penalty='l2')

# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=logreg_cv, step=1, cv=2, scoring='accuracy')
# fit() returns the fitted selector itself; the names of the kept columns
# are available as X_train.columns[rfecv.support_]
rfecv_columns = rfecv.fit(X_train, y_train)



In [233]:
print("Optimal number of features : %d" % rfecv.n_features_)

Optimal number of features : 17


## 4. Logistic regression coefficients

Let's see if the Logistic Regression coefficients correspond.

- Create a logistic regression model
- Perform grid search over penalty type and C strength in order to find the best parameters
- Sort the logistic regression coefficients by absolute value. Do the top 5 correspond to those above?

=> choose which ones you would keep and store them in a variable called `lr_columns`
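The three steps listed above can be sketched as follows. This is an illustration on synthetic data, not the notebook's own solution: it searches a plain `LogisticRegression` over its `C` and `penalty` parameters, and the column names `f0` to `f7`, the grid values, and the choice of keeping 3 features are all assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f%d" % i for i in range(8)])

# Grid over penalty type and regularisation strength C (liblinear supports both penalties).
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100]}
grid_demo = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
grid_demo.fit(X_demo, y_demo)

# Rank the coefficients of the best model by absolute value, largest first.
coefs = pd.Series(grid_demo.best_estimator_.coef_[0], index=X_demo.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
lr_names = list(ranked.index[:3])
print(grid_demo.best_params_)
print(ranked)
```

Coefficient magnitudes are only comparable across features when the features share a similar scale, so standardising the inputs first is worth considering.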

In [234]:
from sklearn.model_selection import GridSearchCV

In [235]:
## Load the Dataset
dataset = new_data2

In [236]:
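## Prepare the candidate settings to test
# Note: an integer passed as `Cs` to LogisticRegressionCV is the number of C values
# it tries (log-spaced between 1e-4 and 1e4), not a C value itself.
C_vals = [1, 5, 10, 100, 1000]
penalties = ['l1', 'l2']
logreg_cv = LogisticRegressionCV(solver='liblinear', cv=5)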


In [237]:
## Create and Fit a GridSearchCV Model

grid = GridSearchCV(estimator=logreg_cv, param_grid=dict(penalty=penalties, Cs=C_vals))
grid_coef = grid.fit(X_train, y_train)
print(grid)

GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='liblinear', tol=0.0001,
           verbose=0),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'Cs': [1, 5, 10, 100, 1000], 'penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)


In [220]:
grid.best_estimator_

LogisticRegressionCV(Cs=5, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
           refit=True, scoring=None, solver='liblinear', tol=0.0001,
           verbose=0)

In [219]:
## Summarize the Results of the Grid Search
print(grid.best_score_)
print(grid.best_estimator_.Cs)

0.791666666667
5


## 5. Compare feature sets

Use the `best estimator` from question 4 on the 4 different feature sets:

- `kbest_columns`
- `rfecv_columns`
- `lr_columns`
- `all_columns`

Questions:

- Which scores the highest? (use cross_val_score)
- Is the difference significant?

Discuss results.
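One possible way to approach both questions is sketched below on synthetic data. The feature-set names and columns are hypothetical, and the paired t-test from `scipy.stats.ttest_rel` is an assumption about how "significant" could be checked. Fold scores are not fully independent, so the p-value is a rough guide only.

```python
import pandas as pd
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# With shuffle=False the 3 informative features are the first 3 columns.
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                    n_redundant=0, shuffle=False, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f%d" % i for i in range(8)])

# Hypothetical feature sets standing in for kbest/rfecv/lr/all columns.
feature_sets = {
    "first_three": ["f0", "f1", "f2"],
    "all_columns": list(X_demo.columns),
}

# Same model and same 10 stratified folds for every feature set.
model = LogisticRegression(solver="liblinear")
scores = {name: cross_val_score(model, X_demo[cols], y_demo, cv=10)
          for name, cols in feature_sets.items()}
for name, fold_scores in scores.items():
    print(name, round(fold_scores.mean(), 3), round(fold_scores.std(), 3))

# Paired test on the per-fold scores of two feature sets.
t_stat, p_value = stats.ttest_rel(scores["first_three"], scores["all_columns"])
print(t_stat, p_value)
```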

In [241]:
kbest_columns_all_variable.scores_


array([  5.89483535e+00,   3.40160753e+01,   2.06901576e+02,
         1.16224747e+00,   7.71192307e+00,   8.56480157e+00,
         1.04211905e+01,   1.84197855e+00,   2.88377193e+00,
         2.15573770e+00,   7.13897937e-01,   8.17863091e+00,
         9.73036169e-01,   5.78085504e-02,   3.80068259e+00,
         2.15573770e+00,   4.07574948e+00,   6.39910668e+01])

In [246]:
rfecv_columns.grid_scores_

array([ 0.77462121,  0.76325758,  0.76136364,  0.76136364,  0.76136364,
        0.76136364,  0.76136364,  0.77083333,  0.76893939,  0.76325758,
        0.76325758,  0.76893939,  0.77083333,  0.76325758,  0.76136364,
        0.77272727,  0.78787879,  0.78787879])

In [249]:
grid.best_score_

0.79166666666666663

## Bonus 1

Use a bar chart to display the logistic regression coefficients. Start from the most negative on the left.
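A sketch of such a chart, assuming the coefficients are already in a pandas Series. The feature names and values below are hypothetical placeholders, not results from this lab. The `Agg` backend line is only there so the sketch runs outside a notebook; with `%matplotlib inline` it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical coefficients, for illustration only.
coefs = pd.Series({"feature_a": 0.8, "feature_b": -0.3, "feature_c": -1.9,
                   "feature_d": 0.1, "feature_e": 1.4})

# Ascending sort puts the most negative coefficient on the left.
ordered = coefs.sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(range(len(ordered)), ordered.values)
ax.set_xticks(range(len(ordered)))
ax.set_xticklabels(ordered.index, rotation=45, ha="right")
ax.set_ylabel("coefficient")
fig.tight_layout()
```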

## Bonus 2

Use Sebastian Raschka's [MLxtend library](http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/) to implement a feature selection tactic discussed in class: sequential forward or backward search or floating sequential forward/backward search.
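To make the idea concrete without depending on MLxtend, here is a hand-written sketch of plain sequential forward selection. It is not MLxtend's API. The helper `forward_select`, the synthetic data, and the column names are all hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(model, X, y, n_features, cv=5):
    """Greedy sequential forward selection scored by cross-validated accuracy."""
    selected, remaining = [], list(X.columns)
    while remaining and len(selected) < n_features:
        # Score every remaining column when added to the current selection.
        scored = [(cross_val_score(model, X[selected + [col]], y, cv=cv).mean(), col)
                  for col in remaining]
        best_score, best_col = max(scored)
        selected.append(best_col)
        remaining.remove(best_col)
    return selected

# Synthetic stand-in for the training data.
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f%d" % i for i in range(8)])

sfs_names = forward_select(LogisticRegression(solver="liblinear"), X_demo, y_demo, n_features=3)
print(sfs_names)
```

Backward search follows the same pattern, starting from all columns and removing the one whose removal hurts the score least.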