# Feature Selection Lab

In this lab we will explore feature selection on the Titanic dataset. First, let's load a few things:

- Standard packages
- The training set from lab 2.3
- The `union` object we saved in lab 2.3


You can load the titanic data as follows:

    psql -h dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com -p 5432 -U dsi_student titanic
    password: gastudents

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

from sqlalchemy import create_engine
engine = create_engine('postgresql://dsi_student:gastudents@dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com/titanic')

df = pd.read_sql('SELECT * FROM train', engine)
print(df)

     index  PassengerId  Survived  Pclass  \
0        0            1         0       3   
1        1            2         1       1   
2        2            3         1       3   
3        3            4         1       1   
4        4            5         0       3   
5        5            6         0       3   
6        6            7         0       1   
7        7            8         0       3   
8        8            9         1       3   
9        9           10         1       2   
10      10           11         1       3   
11      11           12         1       1   
12      12           13         0       3   
13      13           14         0       3   
14      14           15         0       3   
15      15           16         1       2   
16      16           17         0       3   
17      17           18         1       2   
18      18           19         0       3   
19      19           20         1       3   
20      20           21         0       2   
21      21

In [2]:
import gzip
import dill

# Load the union saved in lab 2.3; without this step `union` is undefined.
with gzip.open('./assets/datasets/union.dill.gz') as fin:
    union = dill.load(fin)

# Raw input columns and the target.
X = df[[u'Pclass', u'Sex', u'Age', u'SibSp', u'Parch', u'Fare', u'Embarked']]
y = df[u'Survived']

X_transf = union.fit_transform(X)
X_transf

## 1. Column names

Uh oh, we have lost the column names along the way! We need to manually add them:
- age_pipe => 'scaled_age'
- one_hot_pipe => 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S'
- gender_pipe => 'male'
- fare_pipe => 'scaled_fare'

Now we need to:

1. Create a new pandas dataframe called `Xt` with the appropriate column names and fill it with the `X_transf` data.
2. Notice that the current pipeline completely discards the columns `SibSp` and `Parch`. Append them unchanged to the new dataframe.


In [1]:
import pandas as pd

df = pd.read_csv("/Users/Debjani/Downloads/df_tr.csv")
print(df.head(3))

   Unnamed: 0  index  PassengerId  Survived  Pclass  \
0           0      0            1         0       3   
1           1      1            2         1       1   
2           2      2            3         1       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  


## 2. Feature selection

Let's use the `SelectKBest` class in scikit-learn to find the top 5 features.

- What are the top 5 features for `Xt`?

=> store them in a variable called `kbest_columns`

In [5]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Xt must be the numeric, transformed frame built in section 1. Passing the raw
# columns fails because 'Sex' and 'Embarked' contain strings.
y = df['Survived']

estimator = LogisticRegression()
selector = RFECV(estimator, step=1, cv=5)
selector = selector.fit(Xt, y)
rfecv_columns = Xt.columns[selector.support_]
rfecv_columns

In [None]:
new_cols = ['scaled_age', 'Pclass_1', 'Pclass_2', 'Pclass_3',
            'Embarked_C', 'Embarked_Q', 'Embarked_S',
            'male', 'scaled_fare']

Xt = pd.DataFrame(X_transf, columns=new_cols)
Xt = pd.concat([Xt, X[[u'SibSp', u'Parch']]], axis = 1)
Xt.head()

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=5)
selected_data = selector.fit_transform(Xt, y)
kbest_columns = Xt.columns[selector.get_support()]
Xtbest = pd.DataFrame(selected_data, columns=kbest_columns)
Xtbest.head()

## 3. Recursive Feature Elimination

scikit-learn also offers recursive feature elimination with cross-validation as a class named `RFECV`. Use it in combination with a logistic regression model to see which features this method keeps.

=> store them in a variable called `rfecv_columns`

## 4. Logistic regression coefficients

Let's see if the logistic regression coefficients point to the same features as the two methods above.

- Create a logistic regression model
- Perform a grid search over the penalty type and the inverse regularization strength `C` to find the best parameters
- Sort the logistic regression coefficients by absolute value. Do the top 5 correspond to those above?
> Answer: Not completely. This could be due to scaling: coefficient magnitudes depend on the scale of each feature.

=> choose which ones you would keep and store them in a variable called `lr_columns`
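
The steps above can be sketched as follows. This is a minimal, hedged example rather than the lab's solution: it runs on synthetic stand-in data from `make_classification`, and the names `Xt_demo`, `y_demo`, `ranked` and `lr_columns_demo` as well as the grid values are assumptions chosen for illustration. In the lab you would pass `Xt` and `y` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in data (assumption): replace with the lab's Xt and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)
Xt_demo = pd.DataFrame(X_demo, columns=['f%d' % i for i in range(8)])

# Illustrative grid; the liblinear solver supports both l1 and l2 penalties.
param_grid = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
grid.fit(Xt_demo, y_demo)

# Rank the coefficients of the best model by absolute value, largest first.
best_lr = grid.best_estimator_
coefs = pd.Series(best_lr.coef_[0], index=Xt_demo.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

# Keep the five features with the largest absolute coefficients.
lr_columns_demo = list(ranked.index[:5])

print(grid.best_params_)
print(ranked)
```

Ranking by absolute value treats strongly negative and strongly positive coefficients alike, which is usually what you want when asking which features matter most. The ranking is only comparable across features that are on a similar scale.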

## 5. Compare feature sets

Use the best estimator (`best_estimator_`) from section 4 on the 4 different feature sets:

- `kbest_columns`
- `rfecv_columns`
- `lr_columns`
- `all_columns`

Questions:

- Which scores the highest? (use `cross_val_score`)
- Is the difference significant?
> Answer: Not really
- Discuss in pairs
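
One possible way to run the comparison is sketched below. It is an illustration under stated assumptions, not the lab's answer: the data is synthetic, the column lists in `feature_sets` are hypothetical placeholders for `kbest_columns`, `rfecv_columns`, `lr_columns` and `all_columns`, and a plain `LogisticRegression` stands in for the best estimator from section 4.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data (assumption): replace with the lab's Xt and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)
Xt_demo = pd.DataFrame(X_demo, columns=['f%d' % i for i in range(8)])

# Hypothetical feature sets; in the lab these come from sections 2 to 4.
feature_sets = {
    'kbest_columns': ['f0', 'f1', 'f2', 'f3', 'f4'],
    'rfecv_columns': ['f1', 'f2', 'f3', 'f5'],
    'lr_columns': ['f0', 'f2', 'f4', 'f6', 'f7'],
    'all_columns': list(Xt_demo.columns),
}

# Stand-in for the grid search's best_estimator_ (assumption).
model = LogisticRegression(solver='liblinear')

# Score each feature set with the same model and the same 5-fold split.
results = {}
for name, cols in feature_sets.items():
    scores = cross_val_score(model, Xt_demo[cols], y_demo, cv=5)
    results[name] = (scores.mean(), scores.std())
    print('%-14s mean=%.3f std=%.3f' % (name, scores.mean(), scores.std()))
```

A rough way to judge whether a difference matters is to compare the gap between two means with the fold-to-fold standard deviation: a gap smaller than that spread is hard to distinguish from noise.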

## Bonus

Use a bar chart to display the logistic regression coefficients. Start from the most negative on the left.
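
A minimal sketch of such a chart follows, again on synthetic stand-in data with hypothetical names (`Xt_demo`, `y_demo`, `coefs`); in the lab you would use the coefficients of the best estimator and the columns of `Xt`. Sorting the series in ascending order places the most negative coefficient on the left.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in data and model (assumption): replace with Xt, y and the best estimator.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)
Xt_demo = pd.DataFrame(X_demo, columns=['f%d' % i for i in range(8)])
model = LogisticRegression(solver='liblinear').fit(Xt_demo, y_demo)

# Ascending sort: most negative coefficient first, most positive last.
coefs = pd.Series(model.coef_[0], index=Xt_demo.columns).sort_values()

# One bar per feature, with a zero line to separate negative from positive.
ax = coefs.plot(kind='bar')
ax.axhline(0, color='black', linewidth=0.8)
ax.set_ylabel('coefficient')
ax.set_title('Logistic regression coefficients')
plt.tight_layout()
```

With `%matplotlib inline` active, the figure renders when the cell finishes; in a plain script you would call `plt.show()` or save the figure instead.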