# Logistic Regression Lab

In the previous lab we constructed a preprocessing pipeline using `sklearn` for the Titanic dataset. At this point you should have a set of features ready for consumption by a logistic regression model.

In this lab we will take the preprocessing pipeline you created and combine it with a classification model.


We have imported the Titanic data into our PostgreSQL instance, which you can connect to with:

    psql -h dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com -p 5432 -U dsi_student titanic
    password: gastudents

First of all, let's load a few things:

- standard packages
- the training set from lab 2.3
- the union we saved in lab 2.3

In [37]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

from sqlalchemy import create_engine
engine = create_engine('postgresql://dsi_student:gastudents@dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com/titanic')

df = pd.read_sql('SELECT * FROM train', engine)

In [38]:
import gzip
import dill

with gzip.open('../../../2.3-lab/assets/datasets/union.dill.gz') as fin:
    union = dill.load(fin)

In [55]:
union

FeatureUnion(n_jobs=1,
       transformer_list=[('pipeline-1', Pipeline(steps=[('columnselector', ColumnSelector(columns='Age')), ('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))])), ('getdummiestransformer', Ge...r(columns='Fare')), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))]))],
       transformer_weights=None)
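For readers who do not have the saved union at hand, the sketch below rebuilds a reduced version of it on a four-row toy table. It assumes Python 3 and a recent scikit-learn (where `SimpleImputer` replaces `Imputer`). The `ColumnSelector` class is a hypothetical stand-in for the one written in the previous lab, and only the `Age` and `Fare` branches are included. The contents of the `Fare` branch are a guess, since the printed representation above is truncated.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: select a single DataFrame column as a 2-D frame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[[self.columns]]  # double brackets keep the result 2-D


# Toy rows with one missing age, made up for illustration
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0],
                    'Fare': [7.25, 71.2833, 7.925, 53.1]})

# One branch per column; the Fare branch is an assumption
age_pipe = make_pipeline(ColumnSelector('Age'),
                         SimpleImputer(strategy='mean'),
                         StandardScaler())
fare_pipe = make_pipeline(ColumnSelector('Fare'), StandardScaler())

# The union runs each branch and stacks the outputs side by side
union_demo = make_union(age_pipe, fare_pipe)
features = union_demo.fit_transform(toy)
print(features.shape)
```

Each branch contributes one column here, so the result has one row per passenger and two columns. Recent scikit-learn releases also provide `ColumnTransformer` for this kind of per-column preprocessing.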

Then, let's create the training and test sets:

In [40]:
X = df[[u'Age',u'Fare',u'Pclass', u'Sex', u'Age', u'SibSp', u'Parch', u'Fare', u'Embarked']]
y = df['Survived']

In [41]:
X

Unnamed: 0,Age,Fare,Pclass,Sex,Age.1,SibSp,Parch,Fare.1,Embarked
0,22.0,7.2500,3,male,22.0,1,0,7.2500,S
1,38.0,71.2833,1,female,38.0,1,0,71.2833,C
2,26.0,7.9250,3,female,26.0,0,0,7.9250,S
3,35.0,53.1000,1,female,35.0,1,0,53.1000,S
4,35.0,8.0500,3,male,35.0,0,0,8.0500,S
5,,8.4583,3,male,,0,0,8.4583,Q
6,54.0,51.8625,1,male,54.0,0,0,51.8625,S
7,2.0,21.0750,3,male,2.0,3,1,21.0750,S
8,27.0,11.1333,3,female,27.0,0,2,11.1333,S
9,14.0,30.0708,2,female,14.0,1,0,30.0708,C


In [46]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection (0.18+)
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
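To see what `test_size=0.33` does to the row counts, here is a small sketch on synthetic data, assuming Python 3 and a recent scikit-learn. The 891-row size is an assumption about the `train` table, and the random features are made up. The `stratify` argument is an optional extra that the lab code does not use. It keeps the class proportions similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training table (assumed 891 rows); values are random
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(891, 3))
y_demo = rng.randint(0, 2, size=891)

# Same split settings as the lab, plus optional stratification on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo)

print(X_tr.shape)
print(X_te.shape)
print(y_tr.mean())
print(y_te.mean())
```

The test set size is rounded up, so 33% of 891 rows gives 295 test rows and 596 training rows.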

In [47]:
print(X_train)
print(X_test)
print(y_train)
print(y_test)

      Age      Fare  Pclass     Sex   Age  SibSp  Parch      Fare Embarked
6    54.0   51.8625       1    male  54.0      0      0   51.8625        S
718   NaN   15.5000       3    male   NaN      0      0   15.5000        Q
685  25.0   41.5792       2    male  25.0      1      2   41.5792        C
73   26.0   14.4542       3    male  26.0      1      0   14.4542        C
882  22.0   10.5167       3  female  22.0      0      0   10.5167        S
328  31.0   20.5250       3  female  31.0      1      1   20.5250        S
453  49.0   89.1042       1    male  49.0      1      0   89.1042        C
145  19.0   36.7500       2    male  19.0      1      1   36.7500        S
234  24.0   10.5000       2    male  24.0      0      0   10.5000        S
220  16.0    8.0500       3    male  16.0      0      0    8.0500        S
370  25.0   55.4417       1    male  25.0      1      0   55.4417        C
811  39.0   24.1500       3    male  39.0      0      0   24.1500        S
132  47.0   14.5000      

In [45]:
train_test_split?

## 1. Model Pipeline

Combine the union you created in the previous lab with a `LogisticRegression` instance. Note that a `sklearn.pipeline.Pipeline` can have an arbitrary number of transformation steps, but only the last step in the chain can be an estimator, and that step is optional.

In [52]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [59]:
union_plus_lr = make_pipeline(union, LogisticRegression())
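The same pattern can be tried without the saved union. The sketch below, assuming Python 3 and a recent scikit-learn, chains two transformers and a final estimator on synthetic data. The data and the `demo_pipe` name are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature problem; the label depends linearly on the features
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)
X_demo[::10, 0] = np.nan  # knock out some values so the imputer has work to do

# Two transformation steps followed by a single estimator as the last step
demo_pipe = make_pipeline(SimpleImputer(strategy='mean'),
                          StandardScaler(),
                          LogisticRegression())
demo_pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)
print(demo_pipe.predict(X_demo[:5]))
```

Calling `fit` on the pipeline fits and applies each transformer in order and then fits the estimator. Calling `predict` applies the fitted transformers before the estimator predicts.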

## 2. Train and test

In [61]:
union_plus_lr.fit(X_train,y_train)

Pipeline(steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('pipeline-1', Pipeline(steps=[('columnselector', ColumnSelector(columns='Age')), ('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardscaler', StandardScaler(copy=True, with_mean=...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [63]:
y_pred = union_plus_lr.predict(X_test)
y_pred

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0])

## 3. Evaluate the model accuracy

1. Use the `confusion_matrix` and `classification_report` functions to assess the quality of the model.
1. Embed the results of the `confusion_matrix` in a Pandas dataframe with appropriate column names and index, so that it's easier to see what kinds of errors the model makes.
1. Are there more false positives or false negatives? (Remember we are trying to predict survival.)
1. How does that relate to what the `classification_report` is showing?

In [64]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [69]:
cm = confusion_matrix(y_test,y_pred)
print(cm)

[[154  21]
 [ 37  83]]


In [68]:
cr = classification_report(y_test,y_pred)
print(cr)

             precision    recall  f1-score   support

          0       0.81      0.88      0.84       175
          1       0.80      0.69      0.74       120

avg / total       0.80      0.80      0.80       295



In [70]:
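# confusion_matrix puts true classes on the rows and predicted classes on the columns,
# both in sorted label order (0, then 1)
cm_table = pd.DataFrame(data=cm, columns=['Predicted 0', 'Predicted 1'], index=['Actual 0', 'Actual 1'])
cm_table

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,154,21
Actual 1,37,83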
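The numbers in the `classification_report` can be recomputed from the confusion matrix by hand. The sketch below uses the matrix printed above and relies on the `confusion_matrix` convention that rows are true classes and columns are predicted classes, in sorted label order. The variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[154, 21],
               [37, 83]])

# For binary labels 0/1, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / float(tp + fp)   # of predicted survivors, how many survived
recall_1 = tp / float(tp + fn)      # of actual survivors, how many were found
accuracy = (tp + tn) / float(cm.sum())

print(fp)
print(fn)
print(round(precision_1, 2))
print(round(recall_1, 2))
print(round(accuracy, 2))
```

With 37 false negatives against 21 false positives, the model misses survivors more often than it invents them. This matches the lower recall (0.69) than precision (0.80) reported for class 1.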


## 4. Improving the model

Can we improve the accuracy of the model?

One way to do this is to tune the parameters that control it.

You can get a list of all the model parameters using `model.get_params().keys()`.

Discuss with your team which parameters you could try to change.

You can systematically probe parameter combinations with the `GridSearchCV` class. Implement a new classifier that searches for the best parameter combination.

1. How will you choose the grid granularity?
1. How can you prevent the grid from growing exponentially?
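As a starting point for the discussion, here is a minimal sketch of a grid search over a pipeline, assuming Python 3 and a recent scikit-learn. The data is synthetic, and the choice of `C` and `class_weight` as the parameters to tune is only an example, not the expected answer. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the lab you would use X_train and y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression())

# List every tunable parameter, including those of nested steps
print(sorted(demo_pipe.get_params().keys()))

# A coarse, log-spaced grid keeps the number of combinations small: 4 x 2 = 8
param_grid = {
    'logisticregression__C': np.logspace(-2, 1, 4),
    'logisticregression__class_weight': [None, 'balanced'],
}

grid = GridSearchCV(demo_pipe, param_grid, cv=5)
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print(grid.best_score_)
```

The number of fits is the product of the grid sizes times the number of folds (here 8 x 5 = 40), so each extra parameter multiplies the cost. One common approach is to start with a coarse log-spaced grid and then refine around the best value.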

## 5. Assess the tuned model

A fitted grid search model stores the best parameter combination and the best estimator as attributes (`best_params_` and `best_estimator_`).

1. Use these to generate a new prediction vector `y_pred`.
1. Use the `confusion_matrix` and `classification_report` functions to assess the accuracy of the new model.
1. How does the new model compare with the old one?
1. What else could you do to improve the accuracy?
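The steps above might look like the following sketch, assuming Python 3 and a recent scikit-learn. The data is synthetic and the names `demo_pipe`, `best_model` and `y_pred_tuned` are made up. In the lab you would fit on `X_train`, `y_train` and predict on `X_test`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression())
grid = GridSearchCV(demo_pipe, {'logisticregression__C': [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_tr, y_tr)

# Best parameter combination found by cross-validation
print(grid.best_params_)

# With the default refit=True, best_estimator_ is refit on the full training data
best_model = grid.best_estimator_
y_pred_tuned = best_model.predict(X_te)

cm_tuned = confusion_matrix(y_te, y_pred_tuned)
print(cm_tuned)
print(classification_report(y_te, y_pred_tuned))
```

Calling `grid.predict(X_te)` gives the same predictions, because the fitted search object delegates to its best estimator.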

## Bonus

What would happen if we used a different scoring function? Would our results change?
Choose one or two classification metrics from the [sklearn provided metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) and repeat the grid search. Do your results change?
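One way to run this comparison is to repeat the search with different values of the `scoring` argument of `GridSearchCV`. The sketch below assumes Python 3 and a recent scikit-learn. The imbalanced synthetic data, the parameter grid and the three metrics are all illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced stand-in data (about 70% / 30%)
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     weights=[0.7, 0.3], random_state=0)

param_grid = {
    'logisticregression__C': [0.01, 0.1, 1.0, 10.0],
    'logisticregression__class_weight': [None, 'balanced'],
}

# Run the same search once per metric and record the winning combination
best_by_metric = {}
for metric in ['accuracy', 'f1', 'roc_auc']:
    demo_pipe = make_pipeline(StandardScaler(), LogisticRegression())
    grid = GridSearchCV(demo_pipe, param_grid, scoring=metric, cv=5)
    grid.fit(X_demo, y_demo)
    best_by_metric[metric] = (grid.best_params_, grid.best_score_)
    print(metric)
    print(grid.best_params_)
    print(round(grid.best_score_, 3))
```

Different metrics can select different parameter combinations, especially when the classes are imbalanced, so compare the winning parameters as well as the scores.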