# Custom Classes: Pipelines
We've covered transformers and estimators. [Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) are a tool for chaining transformers and estimators together into a single object.

In [36]:
import sklearn as sk
import pandas as pd
import numpy as np

In [37]:
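# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

# hold out 25% of the labelled data for validation
X_fit, X_validation, y_fit, y_validation = train_test_split(
    train.drop('Survived', axis=1), train.Survived, test_size=0.25, random_state=42)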

As a reminder, this dataset includes:

| Variable      | Description  |  Values  |
| ------------- |:-------------:| -----:|
| survived      | Survival | (0 = No; 1 = Yes) |
| pclass     | Passenger Class     |   (1 = 1st; 2 = 2nd; 3 = 3rd) |
| name  | Name     |    String |
| sex | Sex      |    ('male' or 'female') |
| age | Age     |    Float 0-80  |
| sibsp | Number of Siblings/Spouses Aboard      |    Int |
| parch | Number of Parents/Children Aboard      |    Int |
| ticket | Ticket Number      |    String  |
| fare | Passenger Fare      |    Float |
| cabin| Cabin     |    String (e.g. C134) |
| embarked| Port of Embarkation      |    ('C' = Cherbourg; 'Q' = Queenstown; 'S' = Southampton) |


<a id='pipeBegin'></a>

## Pipelines
Pipelines allow for a clean implementation of the dataset -> transform/manipulate -> predict -> score -> iterate workflow.

Let's start with a simple example: transform the categorical columns (such as the gender tags) into dummy variables and predict using a decision tree.

First, by calling each step manually:

In [38]:
from scikitDemoHelpers import genericLevelsToDummiesTransformer

In [39]:
dummyTransformer=genericLevelsToDummiesTransformer(['Cabin','Sex', 'Pclass','Embarked'], printFlag=False)

In [40]:
dummyTransformer.fit(train)

genericLevelsToDummiesTransformer(columns=['Cabin', 'Sex', 'Pclass', 'Embarked'],
                 printFlag=False)

In [41]:
dummyTransformer.transform(test).head(3)

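| | PassengerId | Name | Age | SibSp | Parch | Ticket | Fare | Cabin_A | Cabin_B | Cabin_C | ... | Cabin_NaN | Cabin_T | Sex_female | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_nan | Embarked_C | Embarked_Q | Embarked_S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 892 | Kelly, Mr. James | 34.5 | 0 | 0 | 330911 | 7.8292 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 893 | Wilkes, Mrs. James (Ellen Needs) | 47.0 | 1 | 0 | 363272 | 7.0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 894 | Myles, Mr. Thomas Francis | 62.0 | 0 | 0 | 240276 | 9.6875 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |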
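
The source of `genericLevelsToDummiesTransformer` is not shown in this notebook, so the following is only a hypothetical sketch of how such a transformer could be written, not the `scikitDemoHelpers` implementation. The class name `LevelsToDummiesSketch` and the toy frames are made up for illustration. The idea it tries to show is why `fit` is called on `train` before `transform` is called on `test`: the levels are learned once during `fit`, so any later frame gets the same dummy columns.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class LevelsToDummiesSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in; NOT the scikitDemoHelpers implementation."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Remember the levels seen in each column at fit time.
        self.levels_ = {
            col: sorted(X[col].dropna().astype(str).unique())
            for col in self.columns
        }
        return self

    def transform(self, X):
        # Drop the original categorical columns, then add one 0/1
        # indicator column per level learned in fit.
        out = X.drop(columns=self.columns)
        for col in self.columns:
            values = X[col].astype(str)
            for level in self.levels_[col]:
                out[f"{col}_{level}"] = (values == level).astype(int)
        return out


# Toy data (made up): the second frame contains a level never seen in fit.
toy_fit = pd.DataFrame({"Sex": ["male", "female", "male"],
                        "Age": [22.0, 38.0, 26.0]})
toy_new = pd.DataFrame({"Sex": ["female", "other"],
                        "Age": [35.0, 40.0]})

sketch = LevelsToDummiesSketch(columns=["Sex"]).fit(toy_fit)
toy_dummies = sketch.transform(toy_new)
print(toy_dummies)
```

In this sketch an unseen level simply gets zeros in every indicator column, which keeps the column layout identical between fitting and prediction. The real helper may handle this differently (the output above suggests it has dedicated `NaN` columns, for example).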


Let's drop a few of those columns.

In [42]:
def dropColumns(fullColumnDF):
    reducedColumns = fullColumnDF.drop(['PassengerId', 'Name', "Ticket"], axis=1)
    
    # fill NA's while we're at it
    return reducedColumns.fillna(0)
    
dropColumns(train).head()

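| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | 0 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | 0 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | 0 | S |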


In [43]:
from sklearn import preprocessing    
columnDropper = preprocessing.FunctionTransformer(dropColumns, validate=False)
columnDropper.transform(train).head()

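| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | 0 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | 0 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | 0 | S |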


In [44]:
from sklearn import tree
treeClf = tree.DecisionTreeClassifier(random_state=42)
withDummies=dummyTransformer.transform(X_fit)
withDummiesExtraColumnsDropped=columnDropper.transform(withDummies)
withDummiesExtraColumnsDropped.head()

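| | Age | SibSp | Parch | Fare | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | ... | Cabin_NaN | Cabin_T | Sex_female | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_nan | Embarked_C | Embarked_Q | Embarked_S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0 | 0 | 30.5 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 25.0 | 0 | 0 | 7.05 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 24.0 | 0 | 2 | 14.5 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 22.0 | 0 | 0 | 7.5208 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0.92 | 1 | 2 | 151.55 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |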


In [45]:
treeClf.fit(withDummiesExtraColumnsDropped, y_fit)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best')

In [46]:
withDummiesTest=dummyTransformer.transform(X_validation)
withDummiesExtraColumnsDroppedTest=columnDropper.transform(withDummiesTest)

treeClf.predict(withDummiesExtraColumnsDroppedTest)[0:5]

array([1, 0, 0, 1, 1])

In [47]:
treeClf.score(withDummiesExtraColumnsDroppedTest, y_validation)

0.77578475336322872

Let's change a few of the parameters from the default and see how it affects the score.

In [48]:
treeClf2 = tree.DecisionTreeClassifier(min_samples_leaf=2, max_features=4, min_samples_split=15, random_state=42)
treeClf2.fit(withDummiesExtraColumnsDropped, y_fit)

treeClf2.score(withDummiesExtraColumnsDroppedTest, y_validation)

0.7982062780269058

Sweeeeet! We just got more accurate. Is this the best we can do? We should try a bunch of values.

In [49]:
treeClf3 = tree.DecisionTreeClassifier(min_samples_leaf=5, max_features=10, min_samples_split=10, random_state=42)
treeClf3.fit(withDummiesExtraColumnsDropped, y_fit)

treeClf3.score(withDummiesExtraColumnsDroppedTest, y_validation)

0.85201793721973096

As a pipeline:

In [50]:
from sklearn import pipeline

dummifier = genericLevelsToDummiesTransformer(['Cabin','Sex', 'Pclass','Embarked'], printFlag=False)
dropifier = preprocessing.FunctionTransformer(dropColumns, validate=False)
treeClfPipe = tree.DecisionTreeClassifier(random_state=42)

dummyTreePipeline = pipeline.Pipeline([('dummyMaker', dummifier), 
                                        ('columnDropper', dropifier), 
                                        ('treeClassifer', treeClfPipe)])

In [51]:
dummyTreePipeline

Pipeline(steps=[('dummyMaker', genericLevelsToDummiesTransformer(columns=['Cabin', 'Sex', 'Pclass', 'Embarked'],
                 printFlag=False)), ('columnDropper', FunctionTransformer(accept_sparse=False,
          func=<function dropColumns at 0x7fd7e1fba050>, pass_y=False,
          validate=False)), ('...plit=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best'))])

In [52]:
dummyTreePipeline.fit(X_fit, y_fit)

Pipeline(steps=[('dummyMaker', genericLevelsToDummiesTransformer(columns=['Cabin', 'Sex', 'Pclass', 'Embarked'],
                 printFlag=False)), ('columnDropper', FunctionTransformer(accept_sparse=False,
          func=<function dropColumns at 0x7fd7e1fba050>, pass_y=False,
          validate=False)), ('...plit=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best'))])

In [53]:
dummyTreePipeline.predict(X_validation)[1:10]

array([0, 0, 1, 1, 1, 1, 0, 1, 1])

In [54]:
dummyTreePipeline.score(X_validation,y_validation)

0.77578475336322872
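
The score matches the manual version because the pipeline performs the same sequence of calls: `fit` runs `fit_transform` on each transformer in order and then fits the final estimator, while `predict` and `score` run `transform` on each step before handing the data to the estimator. As a rough illustration on made-up data (the arrays and the `StandardScaler` step are assumptions chosen to keep the example self-contained, not part of the Titanic workflow):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, only for illustration.
rng = np.random.RandomState(42)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
X_a, X_b = X_toy[:150], X_toy[150:]
y_a, y_b = y_toy[:150], y_toy[150:]

# Manual chaining: fit each step, then pass its output to the next one.
manual_scaler = StandardScaler()
manual_tree = DecisionTreeClassifier(random_state=42)
manual_tree.fit(manual_scaler.fit_transform(X_a), y_a)
manual_pred = manual_tree.predict(manual_scaler.transform(X_b))

# The same steps as a pipeline: one fit call, one predict call.
equiv_pipe = Pipeline([("scale", StandardScaler()),
                       ("tree", DecisionTreeClassifier(random_state=42))])
equiv_pipe.fit(X_a, y_a)
pipe_pred = equiv_pipe.predict(X_b)

print(np.array_equal(manual_pred, pipe_pred))
```

Beyond saving keystrokes, the single object means the validation data can never accidentally be passed through a transformer that was fitted differently from the one used in training.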

Now let's change the parameters in the pipeline version.

In [55]:
dummyTreePipeline.set_params(treeClassifer__min_samples_leaf=5,
                             treeClassifer__max_features=10, 
                             treeClassifer__min_samples_split=10)

Pipeline(steps=[('dummyMaker', genericLevelsToDummiesTransformer(columns=['Cabin', 'Sex', 'Pclass', 'Embarked'],
                 printFlag=False)), ('columnDropper', FunctionTransformer(accept_sparse=False,
          func=<function dropColumns at 0x7fd7e1fba050>, pass_y=False,
          validate=False)), ('...lit=10, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best'))])

In [56]:
dummyTreePipeline.fit(X_fit, y_fit)
dummyTreePipeline.score(X_validation,y_validation)

0.85201793721973096
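
The keyword names passed to `set_params` follow the pattern `<step name>__<parameter name>`, with a double underscore, where the step name is the label given when the pipeline was built. A small sketch of how to discover and set those names, using a toy pipeline whose step names (`scale`, `tree`) are chosen only for this example:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

param_demo_pipe = Pipeline([("scale", StandardScaler()),
                            ("tree", DecisionTreeClassifier(random_state=42))])

# get_params() lists every settable name; keep the ones for the tree step.
tree_param_names = sorted(name for name in param_demo_pipe.get_params()
                          if name.startswith("tree__"))
print(tree_param_names)

# set_params() routes each value to the step named before the "__".
param_demo_pipe.set_params(tree__min_samples_leaf=5,
                           tree__min_samples_split=10)
print(param_demo_pipe.named_steps["tree"].min_samples_leaf)
```

Note that `set_params` only changes the configuration. The pipeline still has to be refitted, as in the cell above, before the new settings affect predictions.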

Well, this is sort of helpful.
We've seen that better parameters can really improve our model. Can we semi-automate the search? That would make this really powerful stuff.
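
One naive way to automate the search is a plain loop that calls `set_params`, refits, and records the score for each combination. The sketch below does this on synthetic data with a toy pipeline. The data, the step names, and the candidate values are all assumptions made for illustration, and this is not the grid search tooling covered in the next notebook.

```python
from itertools import product

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, only for illustration.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(300, 5))
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)
X_a, X_b = X_toy[:200], X_toy[200:]
y_a, y_b = y_toy[:200], y_toy[200:]

loop_pipe = Pipeline([("scale", StandardScaler()),
                      ("tree", DecisionTreeClassifier(random_state=42))])

# Try every combination of the candidate values and keep the scores.
results = []
for leaf, split in product([1, 2, 5], [2, 10, 15]):
    loop_pipe.set_params(tree__min_samples_leaf=leaf,
                         tree__min_samples_split=split)
    loop_pipe.fit(X_a, y_a)
    results.append((loop_pipe.score(X_b, y_b), leaf, split))

# max() compares tuples by their first element (the score) first.
best_score, best_leaf, best_split = max(results)
print(best_score, best_leaf, best_split)
```

Because the whole workflow is a single object with named parameters, the loop body stays three lines long no matter how many preprocessing steps the pipeline contains.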

## On to [Grid Search](https://github.com/SethPaul/scikitFlowDemo/blob/master/gridSearch.ipynb#beginGS)