# Project Pipeline Lab 1


## 1. Data loading, extracting features only (no fitting of models in this lab)
In this lab we will create pipelines for data processing on the famous [Titanic dataset](http://www.kaggle.com/c/titanic-gettingStarted/data).

The dataset is a list of passengers. The second column of the dataset is a “label” for each person indicating whether that person survived (1) or did not survive (0). You can see the Kaggle page linked above for more information and the codebook.

The data has been uploaded to a SQL table, so instead of downloading the CSV and loading it into RAM we can query the table directly. In this case we will take all of the rows and columns, so this doesn't save us much effort or memory, but in general we would pull only the subset of data we need, which keeps memory usage down. You can connect to the Titanic database with the following credentials:

    psql -h dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com -p 5432 -U dsi_student titanic
    password: gastudents
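
As a rough illustration of pulling only a subset, the sketch below queries selected columns and rows instead of `SELECT *`. It uses an in-memory SQLite table as a stand-in for the remote database so that it runs anywhere; the toy rows are made up, and only the column names follow the Kaggle codebook.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the remote "train" table (values are illustrative only)
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, None],
}).to_sql("train", conn, index=False)

# Ask the database for just the columns and rows we need,
# so only that slice ever reaches local memory
subset = pd.read_sql(
    'SELECT "PassengerId", "Age" FROM train WHERE "Pclass" = 3',
    conn,
)
print(subset)
```

The same query string should work against the Postgres engine used in this lab, since the filtering happens on the database side either way.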

In [374]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# We use sqlalchemy here. There are several ways to connect to SQL tables in Python. This might break if
# psycopg2 is not installed; if you get an error, run `conda install psycopg2` in the terminal

engine = create_engine('postgresql://dsi_student:gastudents@dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com/titanic')

all_tables_in_schema=pd.read_sql("SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'", engine)
all_tables_in_schema

Unnamed: 0,table_name
0,train
1,table1
2,account
3,account_information
4,jacques
5,howie
6,user
7,evictions_simple
8,sd_geo
9,student_id


In [375]:
# We have a list of several tables. I'm not actually sure what some of these are (e.g. jacques seems to be empty)
# but the one we want is 'train', this is the train.csv from the kaggle website. First we should return the first 5
# rows to check that it makes sense and is what we expect - can you do that? If you are stuck then ask, I know this
# was only covered yesterday. I suggest using pandas read_sql

pd.read_sql("SELECT * FROM train LIMIT 5", engine)

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [376]:
# Next we check how many rows there are. You do not want to pull in a table that has 100 million rows or you will
# be here for some time!

pd.read_sql("SELECT COUNT(*) FROM train", engine)

Unnamed: 0,count
0,891


In [377]:
# You should find there are 891 rows, which matches the Kaggle csv and is obviously no problem for memory.
# You can now be confident to pull in the whole table without crashing your computer!

titanic = pd.read_sql("SELECT * FROM train", engine)

Have a look at the data using the info method:

- Are there numerical features?
- Are there categorical features?
- Which columns have missing data?
- Which of these are important to be filled?

In [378]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
index          891 non-null int64
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(6), object(5)
memory usage: 90.6+ KB
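
The questions above can also be answered programmatically. The following is a minimal sketch on a made-up four-column frame (not the real table) showing how missing values can be counted per column and how columns can be split by dtype.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; values are invented for the example
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
    "Cabin": [None, "C85", None],
})

# Number of missing entries per column
missing_counts = toy.isnull().sum()

# Split columns into numeric and non-numeric by dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_counts)
print(numeric_cols, other_cols)
```

Keep in mind that dtype is only a hint: `Pclass` is stored as an integer in the Titanic table but is really a categorical variable.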


Ok, so what we are going to do is build a pipeline for data processing. There is more than one way to do this.
We will build a separate pipeline for each feature we are processing, then use a union to tell sklearn to run
all of the feature pipelines side by side and join their outputs. In other words, a pipeline is a list of operations
performed one after another, while a union applies several operations to the same input and concatenates the resulting columns.

Hence:
- define pipeline feature 1: operation 1 > operation 2 > operation 3
- define pipeline feature 2
- define pipeline feature 3
- define union: pipeline feature 1 + pipeline feature 2 + pipeline feature 3
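
The outline above can be sketched on a tiny numpy array. This is only an illustration of the pipeline/union idea, not the lab solution: the column-picking lambdas and the log step are stand-ins chosen for the example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Hypothetical helpers that each keep one column as a 2D array
first_col = FunctionTransformer(lambda a: a[:, [0]])
second_col = FunctionTransformer(lambda a: a[:, [1]])

# Each pipeline runs its steps one after another
pipe_a = make_pipeline(first_col, StandardScaler())
pipe_b = make_pipeline(second_col, FunctionTransformer(np.log))

# The union feeds the same input to every pipeline and
# stacks their outputs column-wise
union = make_union(pipe_a, pipe_b)
features = union.fit_transform(X)
print(features.shape)
```

Each pipeline contributes one column here, so the union output has two columns in the order the pipelines were passed in.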


## 2. Age

Several passengers are missing data points for age. Impute the missing values so that there are no “NaN” values for age as inputs to your model, i.e. we do not want to drop these rows. We seek to find a method to come up with a reasonable guess for these ages. 

In [379]:
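# First, use the Imputer class of sklearn to fill the missing values for the age column,
# i.e. call it only on that one column. This is just to check how it works.
# Note: Imputer was removed in scikit-learn 0.22; on newer versions use sklearn.impute.SimpleImputer.

from sklearn.preprocessing import Imputer
imputer = Imputer(strategy = 'mean')
# A pandas Series has no reshape method in current pandas, so reshape the underlying numpy array
X = titanic['Age'].values.reshape(-1,1)
imputer.fit(X)
titanic['Age'] = imputer.transform(X).ravel()
titanic.Age.head(10)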

0    22.000000
1    38.000000
2    26.000000
3    35.000000
4    35.000000
5    29.699118
6    54.000000
7     2.000000
8    27.000000
9    14.000000
Name: Age, dtype: float64
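
On recent scikit-learn releases the `Imputer` class no longer exists; its replacement is `SimpleImputer` in `sklearn.impute`. A minimal sketch of the same mean imputation, on a made-up four-row column rather than the real table, might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative ages with one missing value
ages = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0]})

imputer = SimpleImputer(strategy="mean")
# Double brackets keep the input two-dimensional, as sklearn expects
filled = imputer.fit_transform(ages[["Age"]])
print(filled.ravel())
```

The missing entry is replaced by the mean of the observed values, which is what produced 29.699118 for row 5 in the output above.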

### 2.b Age Transformer


Create a custom transformer that imputes only the age values. Depending on how you have decided to impute missing values, this could involve:

- Selecting one or more columns
- Filling the NAs using Imputer or a custom strategy
- Scaling the Age values

Since we want to pickle the output at the end, we will need to build custom classes for this which have both fit and transform methods.
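
To make the fit/transform contract concrete, here is a small sketch of a custom strategy. `MedianFiller` is a hypothetical class invented for this example (it is not part of sklearn or of the lab scaffold): `fit` learns the median from the training data and `transform` reuses it on any later data.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MedianFiller(BaseEstimator, TransformerMixin):
    """Hypothetical example: fill NaNs in one column with the median seen in fit."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # State learned from the training data lives on the instance
        self.median_ = X[self.column].median()
        return self

    def transform(self, X, y=None):
        filled = X[self.column].fillna(self.median_)
        # Return a single 2D column so later sklearn steps accept it
        return filled.to_numpy().reshape(-1, 1)


train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
new = pd.DataFrame({"Age": [np.nan, 50.0]})

filler = MedianFiller("Age").fit(train)
out = filler.transform(new)
print(out)
```

The point of the split is that the fill value comes from the data seen in `fit`, not from whatever frame is later passed to `transform`.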

In [380]:
# Note that if you simply call Imputer within your pipeline, you would apply it to all the columns
# of the input. That might be fine but if we want to just impute the age column, we need a custom transformer that
# selects only the age column. I have put the scaffold here for you to add to. I know it can be a bit daunting
# but try to remember the intro from last week. This way you get to see how these classes can be put into action!

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, StandardScaler, LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin

# the BaseEstimator and TransformerMixin are classes from sklearn that we are here extending
class ColumnSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self, columns):
        self.columns = columns
        
    # The asterisk allows extra inputs of arbitrary number, this helps to make it more robust to different inputs
    def transform(self, X, *_):
        if isinstance(X, pd.DataFrame):
            # Series.reshape no longer exists in pandas, so reshape the underlying numpy array
            # into a single column (this selector expects a single column name)
            return X[self.columns].values.reshape(-1, 1)
        else:
            raise TypeError("This transformer only works with Pandas Dataframes")
    
    # We don't want our column selector to do anything if a fit is called
    def fit(self, X, *_):
        return self

In [381]:
# Ok, now try creating a pipeline where you call the column selector, then impute and finally scale the resulting
# age column (you should scale all outputs). You should use the transform(), or fit_transform() method to test it
# fit_transform() will apply fit() and then transform()

pipeline = make_pipeline(ColumnSelector('Age'), Imputer(strategy = 'mean'), StandardScaler())

pipeline.fit_transform(titanic)

array([[-0.5924806 ],
       [ 0.63878901],
       [-0.2846632 ],
       [ 0.40792596],
       [ 0.40792596],
       [ 0.        ],
       [ 1.87005862],
       [-2.13156761],
       [-0.20770885],
       [-1.20811541],
       [-1.97765891],
       [ 2.17787603],
       [-0.7463893 ],
       [ 0.71574336],
       [-1.20811541],
       [ 1.94701297],
       [-2.13156761],
       [ 0.        ],
       [ 0.10010856],
       [ 0.        ],
       [ 0.40792596],
       [ 0.33097161],
       [-1.13116105],
       [-0.1307545 ],
       [-1.66984151],
       [ 0.63878901],
       [ 0.        ],
       [-0.82334365],
       [ 0.        ],
       [ 0.        ],
       [ 0.79269771],
       [ 0.        ],
       [ 0.        ],
       [ 2.79351083],
       [-0.1307545 ],
       [ 0.94660642],
       [ 0.        ],
       [-0.66943495],
       [-0.900298  ],
       [-1.20811541],
       [ 0.79269771],
       [-0.20770885],
       [ 0.        ],
       [-2.05461326],
       [-0.82334365],
       ...

## 3. Categorical Variables

`Embarked` and `Pclass` are categorical variables. Use the pandas `get_dummies` function to create dummy columns corresponding to their values.

`Embarked` has 2 missing values. Fill them with e.g. the most common port of embarkation.

Feel free to create a GetDummiesTransformer that wraps around the get_dummies function.

In [382]:
# Ok so now do something similar to the above in the following cases

class GetDummiesTransformer(BaseEstimator, TransformerMixin):

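    def __init__(self,columns):
        """Fill missing values in the selected columns, then create dummy columns.

        Columns of dtype object are filled with the most frequent value 
        in column.

        Columns of other types are filled with mean of column.

        """
        self.columns = columns
        
    def fit(self, X, y=None):
        # Learn one fill value per column of X
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        res = X[self.columns].fillna(self.fill)
        # Note: get_dummies only encodes object/category columns by default, so a numeric
        # column such as Pclass passes through unchanged unless it is cast to str first
        # or named via the columns= argument
        return pd.get_dummies(res)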

dummiesTransformer1 = GetDummiesTransformer(['Embarked'])
pipeline2 = make_pipeline(dummiesTransformer1)
pipeline2.fit_transform(titanic)

dummiesTransformer2 = GetDummiesTransformer(['Pclass'])
pipeline3 = make_pipeline(dummiesTransformer2)
pipeline3.fit_transform(titanic)


Unnamed: 0,Pclass
0,3
1,1
2,3
3,1
4,3
5,3
6,1
7,3
8,3
9,2
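
The `Pclass` output above is still a single integer column because `get_dummies` skips numeric columns unless told otherwise. A minimal sketch of the difference, on a made-up four-row frame:

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [3, 1, 3, 2]})

# Default behaviour: the integer column is left as it is
unchanged = pd.get_dummies(df)

# Naming the column explicitly forces one dummy column per class
encoded = pd.get_dummies(df, columns=["Pclass"])

print(list(unchanged.columns))
print(list(encoded.columns))
```

Passing `columns=self.columns` inside the transformer would be one way to get real dummy columns for `Pclass`; casting the column to `str` first is another.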


## 4. Boolean Columns

The `Sex` column only contains 2 values: `male` and `female`. Build a custom transformer that is initialised with one of the values and returns a boolean column with values of `True` when that value is found and `False` otherwise.

In [383]:
# Again, similar operation creating a pipeline for this operation

In [452]:
from sklearn.preprocessing import LabelBinarizer

class BooleanTransformer(BaseEstimator, TransformerMixin):

    def __init__(self,baseValue):
        """Flag values equal to baseValue.

        transform returns 1 where the input equals baseValue and 0 otherwise.

        """
        self.baseValue = baseValue
        
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return (X == self.baseValue)*1

#booleanTransformer = BooleanTransformer('female','Sex')
#labelBinarizer = LabelBinarizer()
pipeline4 = make_pipeline(ColumnSelector('Sex'),BooleanTransformer('female'))
pipeline4.fit_transform(titanic)

array([[0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       ...

## 5. Fare

The `Fare` attribute can be scaled using one of the scalers from the preprocessing module. 
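
Which scaler to pick depends on the shape of the data. The sketch below compares three scalers from `sklearn.preprocessing` on a handful of made-up, right-skewed fare values (they are illustrative, not taken from the table), so you can see how a single large fare affects each one.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Illustrative fares with one large outlier
fares = np.array([[7.25], [7.93], [8.05], [53.10], [71.28], [512.33]])

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    results[name] = scaler.fit_transform(fares)
    print(name, results[name].ravel().round(2))
```

`StandardScaler` centres on the mean, `MinMaxScaler` maps the range to [0, 1], and `RobustScaler` centres on the median and scales by the interquartile range, which makes it less sensitive to the outlier.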

In [385]:
# Pipeline for this

In [465]:
from sklearn.preprocessing import StandardScaler

pipeline5 = make_pipeline(ColumnSelector('Fare'),StandardScaler())
pipeline5.fit_transform(titanic)
#pipeline5.transform(titanic)
#scaler.fit(titanic.Fare.reshape(-1,1))
#scaler.transform(titanic.Fare.reshape(-1,1))


array([[ -5.02445171e-01],
       [  7.86845294e-01],
       [ -4.88854258e-01],
       [  4.20730236e-01],
       [ -4.86337422e-01],
       [ -4.78116429e-01],
       [  3.95813561e-01],
       [ -2.24083121e-01],
       [ -4.24256141e-01],
       [ -4.29555021e-02],
       [ -3.12172378e-01],
       [ -1.13845709e-01],
       [ -4.86337422e-01],
       [ -1.87093118e-02],
       [ -4.90279793e-01],
       [ -3.26266659e-01],
       [ -6.19988892e-02],
       [ -3.86670720e-01],
       [ -2.85997284e-01],
       [ -5.02948539e-01],
       [ -1.24919787e-01],
       [ -3.86670720e-01],
       [ -4.86756223e-01],
       [  6.63597416e-02],
       [ -2.24083121e-01],
       [ -1.64441595e-02],
       [ -5.02948539e-01],
       [  4.64700108e+00],
       [ -4.89776426e-01],
       [ -4.89442190e-01],
       [ -9.02720170e-02],
       [  2.30172882e+00],
       [ -4.92377828e-01],
       [ -4.37007438e-01],
       [  1.00606170e+00],
       [  3.98582080e-01],
       [ -5.02863973e-01],
       ...

## 6. Union

Use the `make_union` function from the `sklearn.pipeline` module to combine all the pipes you have created.

In [466]:
# The idea here is to now bring together the separate transformations that we have acting on each column separately
# so we will now have these act together. You can think of the union operation as one that operates each pipeline
# simultaneously, so rather than the pipe running one operation after another it splits off into separate pipes that
# then recombine together at the end

from sklearn.pipeline import make_union

union = make_union(pipeline,pipeline2,pipeline3,pipeline4,pipeline5)

In [468]:
array= union.fit_transform(titanic)
array

array([[-0.5924806 ,  0.        ,  0.        , ...,  3.        ,
         0.        , -0.50244517],
       [ 0.63878901,  1.        ,  0.        , ...,  1.        ,
         1.        ,  0.78684529],
       [-0.2846632 ,  0.        ,  0.        , ...,  3.        ,
         1.        , -0.48885426],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  3.        ,
         1.        , -0.17626324],
       [-0.2846632 ,  1.        ,  0.        , ...,  1.        ,
         0.        , -0.04438104],
       [ 0.17706291,  0.        ,  1.        , ...,  3.        ,
         0.        , -0.49237783]])

The union you have created is a complete pre-processing pipeline that takes the original titanic dataset and extracts a bunch of features out of it. The last step is to persist the `union` object to disk so that it can be used again later, a process called pickling. The following lines achieve this:

In [469]:
import dill
import gzip

with gzip.open('union.dill.gz', 'w') as fout:
    dill.dump(union, fout)
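
Reading the object back is the mirror image of writing it. The sketch below round-trips a small fitted `StandardScaler` through a gzip file in a temporary directory; the file name and the fallback to the standard `pickle` module (used only if `dill` is not installed) are assumptions made for this example. Plain `pickle` works here because the object is a built-in sklearn class; for classes defined in a notebook, `dill` is the safer choice.

```python
import gzip
import os
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

try:
    import dill as serializer  # same library as used in this lab
except ImportError:
    import pickle as serializer  # assumption: fallback for this example only

# A small fitted object standing in for the union
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

path = os.path.join(tempfile.mkdtemp(), "scaler.dill.gz")

# Write in binary mode, then read it back the same way
with gzip.open(path, "wb") as fout:
    serializer.dump(scaler, fout)

with gzip.open(path, "rb") as fin:
    restored = serializer.load(fin)

print(restored.transform(np.array([[2.0]])))
```

The restored object keeps its fitted state, so it can transform new data without being fitted again.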

## Bonus

Can you think of a way to engineer an additional boolean feature that keeps track of whether the person is travelling alone or with family?
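
One possible approach, sketched under the assumption that "family" means siblings, spouses, parents or children as recorded in `SibSp` and `Parch`: a passenger is alone when both counts are zero. `TravellingAlone` is a hypothetical transformer written for this example; the five rows reuse the `SibSp` and `Parch` values from the first rows shown earlier.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TravellingAlone(BaseEstimator, TransformerMixin):
    """Hypothetical example: 1 if SibSp + Parch == 0, else 0."""

    def fit(self, X, y=None):
        # Nothing to learn, the rule is fixed
        return self

    def transform(self, X, y=None):
        alone = (X["SibSp"] + X["Parch"]) == 0
        # Single 2D integer column, ready to join a union
        return alone.astype(int).to_numpy().reshape(-1, 1)


sample = pd.DataFrame({
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})

alone_flag = TravellingAlone().fit_transform(sample)
print(alone_flag.ravel())
```

Because it has both `fit` and `transform`, a transformer like this could be passed to `make_union` alongside the other pipelines.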