# Scikit-Learn Workflow: Grid Search and Cross-Validation
[Grid Search](http://scikit-learn.org/stable/modules/grid_search.html) is a powerful tool for optimizing model parameters. We will be doing this using a slightly more complex form of [cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html).

In [1]:
import sklearn as sk
import pandas as pd
import numpy as np

In [104]:
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

# combine train and test so the dummy transform (e.g. for Cabin) sees every level
combinedSet = pd.concat([train , test], axis=0)
combinedSet = combinedSet.reset_index(drop=True)

In [105]:
from sklearn import pipeline
from sklearn import preprocessing
from sklearn import tree
from scikitDemoHelpers import genericLevelsToDummiesTransformer
from scikitDemoHelpers import dropColumns

dummyTransformer=genericLevelsToDummiesTransformer(['Cabin','Sex', 'Pclass','Embarked', 'Title'], printFlag=False)
dummyTransformer.fit_transform(combinedSet)

dropifier = preprocessing.FunctionTransformer(dropColumns, validate=False)
treeClfPipe = tree.DecisionTreeClassifier(random_state=42)

dummyTreePipeline = pipeline.Pipeline([('columnDropper', dropifier), 
                                        ('treeClassifer', treeClfPipe)])
dummyTreePipeline.set_params(treeClassifer__min_samples_leaf=5,
                             treeClassifer__max_features=10, 
                             treeClassifer__min_samples_split=10)

Pipeline(steps=[('columnDropper', FunctionTransformer(accept_sparse=False,
          func=<function dropColumns at 0x7fd1d6828410>, pass_y=False,
          validate=False)), ('treeClassifer', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=10, max_leaf_nodes=None, min_samples_leaf=5,
            min_samples_split=10, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best'))])

<a id='gridBegin'></a>

## Cross Validation
So far we have considered three sets: 
 - fit set to fit the model on (75% of the full train set in the pipeline notebook) 
 - validation set to test the model on with known labels (25% of the full train set in the pipeline notebook) 
 - test set to predict our unknown labels 

In [209]:
from sklearn import cross_validation
X_fit, X_validation, y_fit, y_validation = \
    cross_validation.train_test_split(train.drop('Survived', axis=1), 
                                      train.Survived, test_size=0.25, random_state=42)

If we want to consider more than a single train-validation pair, we can cut the train set into multiple blocks and rotate which one we validate against. This is called K-fold cross-validation, where K is the number of "folds" we split the train set into.

For example, with 100 observations and k=4:
 - fold 1: 1-25
 - fold 2: 26-50
 - fold 3: 51-75
 - fold 4: 76-100

We would then cross-validate 4 times:
 - fit on fold 2,3,4; validate with fold 1
 - fit on fold 1,3,4; validate with fold 2
 - fit on fold 1,2,4; validate with fold 3
 - fit on fold 1,2,3; validate with fold 4

We would then end up with 4 scores whose average is a better estimate of the model's ability to generalize than any single score.
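The fold rotation described above can be sketched in plain Python. This is only an illustration of the bookkeeping, not scikit-learn's implementation; the helper name `k_fold_indices` is made up for this example, and indices are 0-based, so fold 1 covers rows 0-24 rather than 1-25.

```python
def k_fold_indices(n_samples, k):
    """Yield (fit_indices, validate_indices) for k contiguous folds."""
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    # Rotate: each fold is the validation set exactly once.
    for i in range(k):
        validate = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validate


# 100 observations, k=4, as in the example above.
splits = list(k_fold_indices(100, 4))
for fit, validate in splits:
    print("validate on rows %d-%d, fit on the other %d rows"
          % (validate[0], validate[-1], len(fit)))
```

Every row is used for validation exactly once and for fitting k-1 times, which is why the averaged score uses the data more thoroughly than a single split.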

The simple example below shows that fitting and validating our model on different subsets, even ones of the same size, can produce noticeably better or worse scores.

In [107]:
from sklearn import metrics
from sklearn import cross_validation

scores =  cross_validation.cross_val_score(dummyTreePipeline, 
                                           dummyTransformer.transform(train.drop('Survived', 
                                                      axis=1)), 
                                           train.Survived,
                                          cv = 5,
                                          n_jobs=-1)
print(scores)

[ 0.7877095   0.76536313  0.79213483  0.79775281  0.82485876]
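To turn the five fold scores printed above into a single estimate, we can report their mean together with their spread. The short sketch below uses only the standard library and the numbers copied from that output; the variable names are illustrative.

```python
import statistics

# Fold scores copied from the cross_val_score output above.
fold_scores = [0.7877095, 0.76536313, 0.79213483, 0.79775281, 0.82485876]

mean_score = statistics.mean(fold_scores)
spread = statistics.pstdev(fold_scores)  # population standard deviation

print("mean accuracy: %.4f (+/- %.4f)" % (mean_score, spread))
```

The spread is a reminder that a single train-validation split could have landed anywhere between the lowest and highest fold score.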


If we want to create the folds ourselves, much as we created a single split with [```sklearn.cross_validation.train_test_split()```](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html#sklearn.cross_validation.train_test_split) earlier, we can use [```sklearn.cross_validation.KFold()```](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold) or [```sklearn.cross_validation.StratifiedKFold()```](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html#sklearn.cross_validation.StratifiedKFold). Stratified K-folds keeps the ratio of labels approximately equal across the folds.
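To see what stratification means, here is a minimal standard-library sketch that deals the rows of each class round-robin across the folds. It is an assumption-laden illustration of the idea, not the algorithm scikit-learn uses; `stratified_folds` and the toy labels are invented for this example.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Assign each index to one of k folds, keeping label ratios roughly equal."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    # Deal the indices of each class round-robin across the folds.
    for label in sorted(by_label):
        for position, idx in enumerate(by_label[label]):
            folds[position % k].append(idx)
    return folds


# Toy labels: 60 zeros and 40 ones (made-up data, not the Titanic set).
toy_labels = [0] * 60 + [1] * 40
toy_folds = stratified_folds(toy_labels, 5)
for fold in toy_folds:
    print(Counter(toy_labels[i] for i in fold))
```

Each fold ends up with the same 60/40 label mix as the full toy set, so no fold is validated on an unrepresentative share of one class.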

In [108]:
skf = cross_validation.StratifiedKFold(train.Survived, n_folds=5, shuffle=True)

In [109]:
# transform once, then reuse the result for every fold
X_train = dummyTransformer.transform(train.drop('Survived', axis=1))
for fit_indices, validate_indices in skf:
    dummyTreePipeline.fit(X_train.loc[fit_indices, :], train.Survived[fit_indices])
    print(dummyTreePipeline.score(X_train.loc[validate_indices, :],
                                  train.Survived[validate_indices]))

0.821229050279
0.765363128492
0.859550561798
0.814606741573
0.785310734463


Now that we understand the concept of k-folds, let's look at `GridSearchCV`.

## Grid Search
[Grid Search](http://scikit-learn.org/stable/modules/grid_search.html#grid-search) searches across the parameter space of a model. For the decision tree example, this could mean considering any of the parameters shown when we create the classifier:
```
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
       max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
       min_samples_split=2, min_weight_fraction_leaf=0.0,
       presort=False, random_state=42, splitter='best')
```

In [162]:
dummyTreePipeline.get_params()

{'columnDropper': FunctionTransformer(accept_sparse=False,
           func=<function dropColumns at 0x7fd1d6828410>, pass_y=False,
           validate=False),
 'columnDropper__accept_sparse': False,
 'columnDropper__func': <function scikitDemoHelpers.dropColumns>,
 'columnDropper__pass_y': False,
 'columnDropper__validate': False,
 'steps': [('columnDropper', FunctionTransformer(accept_sparse=False,
             func=<function dropColumns at 0x7fd1d6828410>, pass_y=False,
             validate=False)),
  ('treeClassifer',
   DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
               max_features=10, max_leaf_nodes=None, min_samples_leaf=5,
               min_samples_split=10, min_weight_fraction_leaf=0.0,
               presort=False, random_state=42, splitter='best'))],
 'treeClassifer': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=10, max_leaf_nodes=None, min_samples_leaf=5,
             min_samp

This search can be over a predefined parameter set:

In [151]:
paramGrid = [
    {'treeClassifer__max_features': [1, 5, 12, 25], 'treeClassifer__min_samples_split': [5, 10, 15]}
]
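A grid search tries every combination in the grid. The sketch below enumerates the grid defined above with `itertools.product` to show how quickly the work adds up; it only counts candidates and does not fit any models. The variable names are illustrative.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'treeClassifer__max_features': [1, 5, 12, 25],
    'treeClassifer__min_samples_split': [5, 10, 15],
}

names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

n_folds = 5
print("%d combinations x %d folds = %d fits"
      % (len(combinations), n_folds, len(combinations) * n_folds))
```

With 4 x 3 = 12 combinations and 5 folds, the search fits the pipeline 60 times, plus one final refit on the full data when `refit=True`. Each extra parameter multiplies that count.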

In [152]:
from sklearn import grid_search
dummyGridSearch = grid_search.GridSearchCV(dummyTreePipeline, paramGrid, cv=5)

In [153]:
dummyGridSearch.fit(dummyTransformer.transform(train.drop('Survived', axis=1)), train.Survived)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('columnDropper', FunctionTransformer(accept_sparse=False,
          func=<function dropColumns at 0x7fd1d6828410>, pass_y=False,
          validate=False)), ('treeClassifer', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=10, max_leaf_nodes=None, min_samples_leaf=5,
            min_samples_split=10, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'treeClassifer__max_features': [1, 5, 12, 25], 'treeClassifer__min_samples_split': [5, 10, 15]}],
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [154]:
dummyGridSearch.best_score_

0.81930415263748602

In [155]:
dummyGridSearch.grid_scores_

[mean: 0.69585, std: 0.01797, params: {'treeClassifer__max_features': 1, 'treeClassifer__min_samples_split': 5},
 mean: 0.69585, std: 0.01797, params: {'treeClassifer__max_features': 1, 'treeClassifer__min_samples_split': 10},
 mean: 0.70819, std: 0.01578, params: {'treeClassifer__max_features': 1, 'treeClassifer__min_samples_split': 15},
 mean: 0.78002, std: 0.03347, params: {'treeClassifer__max_features': 5, 'treeClassifer__min_samples_split': 5},
 mean: 0.78002, std: 0.03347, params: {'treeClassifer__max_features': 5, 'treeClassifer__min_samples_split': 10},
 mean: 0.78002, std: 0.02249, params: {'treeClassifer__max_features': 5, 'treeClassifer__min_samples_split': 15},
 mean: 0.81481, std: 0.02347, params: {'treeClassifer__max_features': 12, 'treeClassifer__min_samples_split': 5},
 mean: 0.81481, std: 0.02347, params: {'treeClassifer__max_features': 12, 'treeClassifer__min_samples_split': 10},
 mean: 0.80696, std: 0.04500, params: {'treeClassifer__max_features': 12, 'treeClassifer_

In [156]:
dummyGridSearch.best_params_

{'treeClassifer__max_features': 25, 'treeClassifer__min_samples_split': 5}

In [157]:
dummyGridSearch.best_estimator_

Pipeline(steps=[('columnDropper', FunctionTransformer(accept_sparse=False,
          func=<function dropColumns at 0x7fd1d6828410>, pass_y=False,
          validate=False)), ('treeClassifer', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=25, max_leaf_nodes=None, min_samples_leaf=5,
            min_samples_split=5, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best'))])

Or we can let it sample the parameter space at random with [```sklearn.grid_search.RandomizedSearchCV```](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.RandomizedSearchCV.html#sklearn.grid_search.RandomizedSearchCV):

In [179]:
from scipy.stats import randint as sp_randint
paramDists = {
 'treeClassifer__max_features': sp_randint(1,25),
 'treeClassifer__min_samples_split': sp_randint(1, 30),
}
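One detail worth knowing: `scipy.stats.randint(low, high)` draws integers from the half-open range `[low, high)`, so `sp_randint(1, 25)` can return 1 through 24 but never 25, unlike the explicit grid earlier, which included 25. The sketch below mimics that sampling with the standard library's `random.randrange`, which follows the same convention. It is a stand-in for illustration only and will not reproduce the draws scipy makes.

```python
import random

rng = random.Random(42)  # seeded so the draws are repeatable


def sample_params(n_iter):
    """Draw n_iter random parameter settings from half-open integer ranges."""
    return [
        {
            'treeClassifer__max_features': rng.randrange(1, 25),       # 1..24
            'treeClassifer__min_samples_split': rng.randrange(1, 30),  # 1..29
        }
        for _ in range(n_iter)
    ]


candidates = sample_params(10)
for params in candidates:
    print(params)
```

Unlike a grid, the number of candidates is fixed by `n_iter` no matter how many parameters are searched, which keeps the cost predictable.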

In [187]:
dummyRandomSearch = grid_search.RandomizedSearchCV(dummyTreePipeline, paramDists, cv=5, n_iter=10, random_state=42)

In [188]:
dummyRandomSearch.fit(dummyTransformer.transform(train.drop('Survived', axis=1)), train.Survived)

RandomizedSearchCV(cv=5, error_score='raise',
          estimator=Pipeline(steps=[('columnDropper', FunctionTransformer(accept_sparse=False,
          func=<function dropColumns at 0x7fd1d6828410>, pass_y=False,
          validate=False)), ('treeClassifer', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=10, max_leaf_nodes=None, min_samples_leaf=5,
            min_samples_split=10, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best'))]),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'treeClassifer__max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd1d06dc790>, 'treeClassifer__min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd1d06ad750>},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          scoring=None, verbose=0)

In [189]:
dummyRandomSearch.best_score_

0.82491582491582494

In [190]:
dummyRandomSearch.grid_scores_

[mean: 0.79349, std: 0.01912, params: {'treeClassifer__max_features': 10, 'treeClassifer__min_samples_split': 2},
 mean: 0.80808, std: 0.02640, params: {'treeClassifer__max_features': 16, 'treeClassifer__min_samples_split': 22},
 mean: 0.75982, std: 0.04621, params: {'treeClassifer__max_features': 3, 'treeClassifer__min_samples_split': 4},
 mean: 0.82492, std: 0.01880, params: {'treeClassifer__max_features': 13, 'treeClassifer__min_samples_split': 15},
 mean: 0.81145, std: 0.02384, params: {'treeClassifer__max_features': 24, 'treeClassifer__min_samples_split': 11},
 mean: 0.82043, std: 0.02673, params: {'treeClassifer__max_features': 18, 'treeClassifer__min_samples_split': 14},
 mean: 0.78227, std: 0.03129, params: {'treeClassifer__max_features': 4, 'treeClassifer__min_samples_split': 29},
 mean: 0.71268, std: 0.02279, params: {'treeClassifer__max_features': 1, 'treeClassifer__min_samples_split': 27},
 mean: 0.78900, std: 0.01977, params: {'treeClassifer__max_features': 4, 'treeClassif

In [191]:
dummyRandomSearch.best_params_

{'treeClassifer__max_features': 13, 'treeClassifer__min_samples_split': 15}

In [192]:
dummyRandomSearch.best_estimator_

Pipeline(steps=[('columnDropper', FunctionTransformer(accept_sparse=False,
          func=<function dropColumns at 0x7fd1d6828410>, pass_y=False,
          validate=False)), ('treeClassifer', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=13, max_leaf_nodes=None, min_samples_leaf=5,
            min_samples_split=15, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best'))])

## On to [Ensembles](./ensembles.ipynb#ensembleBegin)<br>
OR<br>
we look at a few more parameter combinations:

In [203]:
paramDists2 = {
 'treeClassifer__criterion': ["gini", "entropy"],
 'treeClassifer__max_depth': sp_randint(1,50),
 'treeClassifer__max_features': sp_randint(1,25),
 'treeClassifer__max_leaf_nodes': sp_randint(2,25),
 'treeClassifer__min_samples_leaf': sp_randint(1,25),
 'treeClassifer__min_samples_split': sp_randint(1,25),
 'treeClassifer__splitter': ['best', 'random']
}

In [204]:
dummyRandomSearch2 = grid_search.RandomizedSearchCV(dummyTreePipeline, paramDists2, cv=5, n_iter=1000, random_state=42)

In [205]:
dummyRandomSearch2.fit(dummyTransformer.transform(train.drop('Survived', axis=1)), train.Survived)

RandomizedSearchCV(cv=5, error_score='raise',
          estimator=Pipeline(steps=[('columnDropper', FunctionTransformer(accept_sparse=False,
          func=<function dropColumns at 0x7fd1d6828410>, pass_y=False,
          validate=False)), ('treeClassifer', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=10, max_leaf_nodes=None, min_samples_leaf=5,
            min_samples_split=10, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best'))]),
          fit_params={}, iid=True, n_iter=1000, n_jobs=1,
          param_distributions={'treeClassifer__max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd1d019e1d0>, 'treeClassifer__min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd1d01a3ed0>, 'treeClassifer__min_samples_leaf': <scipy.stats._distn_infrastructur...infrastructure.rv_frozen object at 0x7fd1d019e510>, 'treeClassifer__criterion': ['gini', 'entrop

In [206]:
print(dummyRandomSearch2.best_score_)
print(dummyRandomSearch2.best_params_)

0.832772166105
{'treeClassifer__max_features': 20, 'treeClassifer__min_samples_split': 3, 'treeClassifer__min_samples_leaf': 3, 'treeClassifer__max_leaf_nodes': 18, 'treeClassifer__splitter': 'random', 'treeClassifer__max_depth': 5, 'treeClassifer__criterion': 'entropy'}


In [214]:
dummyRandomSearch3 = grid_search.RandomizedSearchCV(dummyTreePipeline, paramDists2, cv=5, n_iter=1000, refit=True, random_state=42)

In [215]:
dummyRandomSearch3.fit(dummyTransformer.transform(X_fit), y_fit)

RandomizedSearchCV(cv=5, error_score='raise',
          estimator=Pipeline(steps=[('columnDropper', FunctionTransformer(accept_sparse=False,
          func=<function dropColumns at 0x7fd1d6828410>, pass_y=False,
          validate=False)), ('treeClassifer', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=10, max_leaf_nodes=None, min_samples_leaf=5,
            min_samples_split=10, min_weight_fraction_leaf=0.0,
            presort=False, random_state=42, splitter='best'))]),
          fit_params={}, iid=True, n_iter=1000, n_jobs=1,
          param_distributions={'treeClassifer__max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd1d019e1d0>, 'treeClassifer__min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd1d01a3ed0>, 'treeClassifer__min_samples_leaf': <scipy.stats._distn_infrastructur...infrastructure.rv_frozen object at 0x7fd1d019e510>, 'treeClassifer__criterion': ['gini', 'entrop

In [216]:
dummyRandomSearch3.best_score_

0.8293413173652695

In [217]:
dummyRandomSearch3.score(dummyTransformer.transform(X_validation), y_validation)

0.80717488789237668
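The last two cells compare the cross-validated score on the fit set with the score on the held-out validation set. The held-out score is usually the more honest estimate, because the search picked its parameters to maximize the cross-validated score. The same pattern in current scikit-learn releases, where the `cross_validation` and `grid_search` modules have been replaced by `sklearn.model_selection`, might look like the sketch below. Synthetic data from `make_classification` stands in for the Titanic data, and the parameter ranges are illustrative assumptions, not the ones used above.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: 400 rows, 20 numeric features).
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

param_dists = {
    'max_features': randint(1, 21),       # integers 1..20
    'min_samples_split': randint(2, 30),  # current releases require at least 2
}

search = RandomizedSearchCV(DecisionTreeClassifier(random_state=42),
                            param_dists, n_iter=10, cv=5, random_state=42)
search.fit(X_fit, y_fit)

cv_score = search.best_score_                # mean score across the 5 folds
holdout_score = search.score(X_val, y_val)   # score on data the search never saw
print(cv_score, holdout_score)
```

In these releases the per-candidate results are exposed as the `cv_results_` dictionary instead of `grid_scores_`.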