# Project Pipeline Lab 2 - fitting logistic regression model

In the previous lab we have constructed a processing pipeline using `sklearn` for the titanic dataset. At this point you should have a set of features ready for consumption by a Logistic Regression model.

In this lab we will use the pre-processing pipeline you have created and combine it with a classification model. This demonstrates how the workflow can be separated out into distinct tasks.

Again we have imported this titanic data into our PostgreSQL instance that you can find connecting here:

    psql -h dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com -p 5432 -U dsi_student titanic
    password: gastudents

First of all let's load a few things:

- standard packages
- the training set from lab 2.3
- the union we have saved in lab 2.3

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine('postgresql://dsi_student:gastudents@dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com/titanic')
df = pd.read_sql('SELECT * FROM train', engine)

In [44]:
# This is how we get back our pickled union object

import gzip
import dill

with gzip.open('union.dill.gz') as fin:
    union = dill.load(fin)

Then, let's create the training and test sets:

In [45]:
X = df[[u'Pclass', u'Sex', u'Age', u'SibSp', u'Parch', u'Fare', u'Embarked']]
y = df['Survived']

In [46]:
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## 1. Model Pipeline

Combine the union you have created in the previous lab with a LogisticRegression instance. Notice that a `sklearn.pipeline` can have an arbitrary number of transformation steps, but only one, optional, estimator step as the last one in the chain.

In [47]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
pipe = make_pipeline(union,LogisticRegression())

## 2. Train the model
Use `X_train` and `y_train` to fit the model.
Use `X_test` to generate predicted values for the target variable and save those in a new variable called `y_pred`.

In [53]:
pipe.fit(X_train,y_train)
y_pred = pipe.transform(X_test)



In [64]:
y_pred = y_pred[:,1]

## 3. Evaluate the model accuracy

1. Use the `confusion_matrix` and `classification_report` functions to assess the quality of the model.
- Embed the results of the `confusion_matrix` in a Pandas dataframe with appropriate column names and index, so that it's easier to understand what kind of error the model is incurring into.
- Are there more false positives or false negatives? (remember we are trying to predict survival)
- How does that relate to what the `classification_report` is showing?

In [66]:
from sklearn.metrics import confusion_matrix
conmat = np.array(confusion_matrix(y_test, y_pred))
confusion = pd.DataFrame(conmat, index=['Died', 'Survived'],columns=['predicted_Died', 'predicted_Survived'])
confusion

Unnamed: 0,predicted_Died,predicted_Survived
Died,151,24
Survived,36,84


In [70]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.81      0.86      0.83       175
          1       0.78      0.70      0.74       120

avg / total       0.80      0.80      0.79       295



## 4. Improving the model

Can we improve the accuracy of the model?

One way to do this is to use tune the parameters controlling it.

You can get a list of all the model parameters using `model.get_params().keys()`.

In [71]:
pipe.get_params()

{'featureunion': FeatureUnion(n_jobs=1,
        transformer_list=[('pipeline-1', Pipeline(steps=[('columnselector', ColumnSelector(columns='Age')), ('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))])), ('pipeline-2', Pipeline(step...r(columns='Fare')), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))]))],
        transformer_weights=None),
 'featureunion__n_jobs': 1,
 'featureunion__pipeline-1': Pipeline(steps=[('columnselector', ColumnSelector(columns='Age')), ('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))]),
 'featureunion__pipeline-1__columnselector': ColumnSelector(columns='Age'),
 'featureunion__pipeline-1__columnselector__columns': 'Age',
 'featureunion__pipeline-1__imputer': Imputer(axis=0, copy=True, missing_values=

In [None]:
X = 

You can systematically probe parameter combinations by using the `GridSearchCV` function. Implement a new classifier that searches the best parameter combination.

In [73]:
# You use a double underscore __ to identify the label for the hyperparameter (e.g. C) that you wish to gridsearch,
# after the name of the classifier as input to the pipeline. So for example param_grid={"LogisticRegression__C":[...]}

from sklearn.model_selection import GridSearchCV

params = dict(logisticregression__C = [.1,10,100,1000])
grid_search = GridSearchCV(pipe, param_grid=params, scoring="accuracy")
grid_search.fit(X_train, y_train)
grid_search.best_params_

{'logisticregression__C': 100}

## 5. Assess the tuned model

A tuned grid search model stores the best parameter combination and the best estimator as attributes.

1. Use these to generate a new prediction vector `y_pred`.
- Use the `confusion matrix` and `classification_report` to assess the accuracy of the new model.
- How does the new model compare with the old one?
- What else could you do to improve the accuracy?

In [83]:
pipe.set_params(logisticregression__C = 100)
pipe.fit(X_train, y_train)
y_pred = pipe.transform(X)
y_pred = y_pred[:,1]

from sklearn.metrics import classification_report
print(classification_report(y, y_pred))

from sklearn.metrics import confusion_matrix
conmat = np.array(confusion_matrix(y, y_pred))
confusion = pd.DataFrame(conmat, index=['Died', 'Survived'],columns=['predicted_Died', 'predicted_Survived'])
confusion

             precision    recall  f1-score   support

          0       0.81      0.85      0.83       549
          1       0.74      0.68      0.71       342

avg / total       0.78      0.79      0.78       891





Unnamed: 0,predicted_Died,predicted_Survived
Died,468,81
Survived,109,233


## Bonus

What would happen if we used a different scoring function? Would our results change?
Choose one or two classification metrics from the [sklearn provided metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) and repeat the grid_search. Do your result change?