# Pyplearnr demo
Here I demonstrate pyplearnr, a wrapper for scikit-learn that performs model validation and selection using nested k-fold cross-validation.

# Titanic dataset example
Here I use the Titanic dataset I've cleaned and pickled in a separate tutorial.

# Prepare data
## Import data

In [1]:
import os

In [2]:
import pandas as pd
import numpy as np

df = pd.read_pickle('trimmed_titanic_data.pkl')

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 890 entries, 0 to 890
Data columns (total 9 columns):
Survived    890 non-null int64
Pclass      890 non-null int64
Sex         890 non-null object
Age         890 non-null float64
SibSp       890 non-null int64
Parch       890 non-null int64
Fare        890 non-null float64
Embarked    890 non-null object
Title       890 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 69.5+ KB


By "cleaned" I mean I've derived titles (e.g. "Mr.", "Mrs.", "Dr.", etc.) from the passenger names, imputed the missing Age values using polynomial regression with grid-searched 10-fold cross-validation, filled in the 3 missing Embarked values with the mode, and removed all fields that could be considered an ID for that individual.

Thus, no data are missing or null.

## Set categorical features as type 'category'
To one-hot encode categorical data, it's best to first set the dtype of the features that are considered categorical to 'category'. Here are the first rows of the data for reference:

In [3]:
df.head(10)

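| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Mr |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Mrs |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Miss |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | Mrs |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Mr |
| 5 | 0 | 3 | male | 35.050324 | 0 | 0 | 8.4583 | Q | Mr |
| 6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S | Mr |
| 7 | 0 | 3 | male | 2.0 | 3 | 1 | 21.075 | S | Child |
| 8 | 1 | 3 | female | 27.0 | 0 | 2 | 11.1333 | S | Mrs |
| 9 | 1 | 2 | female | 14.0 | 1 | 0 | 30.0708 | C | Mrs |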


In [4]:
simulation_df = df.copy()

categorical_features = ['Survived','Pclass','Sex','Embarked','Title']

for feature in categorical_features:
    simulation_df[feature] = simulation_df[feature].astype('category')
    
simulation_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 890 entries, 0 to 890
Data columns (total 9 columns):
Survived    890 non-null category
Pclass      890 non-null category
Sex         890 non-null category
Age         890 non-null float64
SibSp       890 non-null int64
Parch       890 non-null int64
Fare        890 non-null float64
Embarked    890 non-null category
Title       890 non-null category
dtypes: category(5), float64(2), int64(2)
memory usage: 39.9 KB


## One-hot encode categorical features

In [5]:
simulation_df = pd.get_dummies(simulation_df,drop_first=True)

simulation_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 890 entries, 0 to 890
Data columns (total 17 columns):
Age               890 non-null float64
SibSp             890 non-null int64
Parch             890 non-null int64
Fare              890 non-null float64
Survived_1        890 non-null uint8
Pclass_2          890 non-null uint8
Pclass_3          890 non-null uint8
Sex_male          890 non-null uint8
Embarked_Q        890 non-null uint8
Embarked_S        890 non-null uint8
Title_Dr          890 non-null uint8
Title_Military    890 non-null uint8
Title_Miss        890 non-null uint8
Title_Mr          890 non-null uint8
Title_Mrs         890 non-null uint8
Title_Noble       890 non-null uint8
Title_Rev         890 non-null uint8
dtypes: float64(2), int64(2), uint8(13)
memory usage: 46.1 KB


Now we have 17 columns: 16 input features plus the Survived_1 response.
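
The column count can be checked by hand: with `drop_first=True`, `get_dummies` emits one indicator column for every category level except the first. Here is a minimal sketch of that arithmetic in plain Python. The level lists are my reading of the outputs shown above (for example, `Child` is the one `Title` level with no column of its own), so treat them as an assumption rather than something queried from the data.

```python
# Category levels as inferred from the df.head() and df.info() outputs above.
levels = {
    'Survived': [0, 1],
    'Pclass': [1, 2, 3],
    'Sex': ['female', 'male'],
    'Embarked': ['C', 'Q', 'S'],
    'Title': ['Child', 'Dr', 'Military', 'Miss', 'Mr', 'Mrs', 'Noble', 'Rev'],
}
numeric_columns = ['Age', 'SibSp', 'Parch', 'Fare']

# drop_first=True keeps every level except the first (sorted) one.
dummy_columns = [
    '{}_{}'.format(feature, level)
    for feature, feature_levels in levels.items()
    for level in sorted(feature_levels)[1:]
]

all_columns = numeric_columns + dummy_columns
print(len(dummy_columns), len(all_columns))  # 13 17
```

Dropping the first level avoids carrying a redundant column: a passenger with zeros in every `Title_*` column is, by elimination, a `Child`.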

## Split into input/output data

In [6]:
# Set output feature
output_feature = 'Survived_1'

# Get all column names
column_names = list(simulation_df.columns)

# Get input features
input_features = [x for x in column_names if x != output_feature]

# Split into features and responses
X = simulation_df[input_features].copy()
y = simulation_df[output_feature].copy()

# Null model

In [7]:
simulation_df['Survived_1'].value_counts().values/float(simulation_df['Survived_1'].value_counts().values.sum())

array([ 0.61573034,  0.38426966])

Thus, the null accuracy is ~62%: that is the score we would get by always predicting death, the majority class.
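
For clarity, here is the same null-model calculation as a small standalone sketch. The class counts (548 and 342) are not printed anywhere above; I've backed them out of the two proportions and the 890 rows, so they are an inference.

```python
from collections import Counter

# Class counts implied by the proportions above: 548 of 890 died (0), 342 survived (1).
y_labels = [0] * 548 + [1] * 342

counts = Counter(y_labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class.
null_accuracy = majority_count / float(len(y_labels))

print(majority_class, round(null_accuracy, 4))  # 0 0.6157
```

Keep in mind that this baseline applies to accuracy. The runs below score with 'auc'; if that means ROC AUC, a constant prediction scores 0.5 rather than 0.62.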

# Import pyplearnr and initialize optimized pipeline collection

In [8]:
%matplotlib inline

import pyplearnr as ppl

# KNN 
Here we do a k-nearest neighbors (KNN) classification with stratified, nested 3-by-3-fold cross-validation, searching over 1 to 30 nearest neighbors and either "uniform" or "distance" weights:

In [9]:
%%time 

# Initialize nested k-fold cross-validation object
kfcv = ppl.NestedKFoldCrossValidation(outer_loop_fold_count=3, 
                                      inner_loop_fold_count=3,
                                      shuffle_seed=2369,
                                      outer_loop_split_seed=461,
                                      inner_loop_split_seeds=[284, 406, 303])

# Combinatorial pipeline schematic
pipeline_schematic = [
    {'estimator': {
            'knn': {
                'n_neighbors': range(1,31),
                'weights': ['uniform','distance']
    }}}
]

# Perform nested k-fold cross-validation
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, 
         scoring_metric='auc', score_type='median')

Outer Fold: 2 

51 Pipeline(steps=[('estimator', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=26, p=2,
           weights='distance'))])
55 Pipeline(steps=[('estimator', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=28, p=2,
           weights='distance'))])


No model was chosen because there is no clear winner. Please use the same fit method with best_inner_fold_pipeline_inds keyword argument.

Example:	kfcv.fit(X.values, y.values, pipelines)
		kfcv.fit(X.values, y.values, pipelines, 
			 best_inner_fold_pipeline_inds = {0:9, 2:3})

CPU times: user 5.76 s, sys: 47.1 ms, total: 5.81 s
Wall time: 5.92 s
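
To get a feel for the size of this search, here is a rough sketch of how a single step of the schematic expands into individual pipelines. The `expand` helper is hypothetical and is not pyplearnr's own code. Its ordering does reproduce the indices 51 and 55 printed above, but that agreement is only something I observed, not a documented guarantee.

```python
from itertools import product

def expand(step_options):
    """List every (option name, parameter dict) combination for one step."""
    combos = []
    for name, grid in step_options.items():
        keys = list(grid)
        # Cartesian product of the parameter values; the last key varies fastest.
        for values in product(*(grid[key] for key in keys)):
            combos.append((name, dict(zip(keys, values))))
    return combos

knn_step = {'knn': {'n_neighbors': range(1, 31),
                    'weights': ['uniform', 'distance']}}

pipelines = expand(knn_step)
print(len(pipelines))  # 60
print(pipelines[51])   # ('knn', {'n_neighbors': 26, 'weights': 'distance'})
print(pipelines[55])   # ('knn', {'n_neighbors': 28, 'weights': 'distance'})
```

With 3 outer folds and 3 inner folds, each of these 60 pipelines is trained and scored 9 times during the inner-fold contests.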


Pyplearnr has indicated that the contest of outer-fold 2 has resulted in a tie between two pipelines (indices 51 and 55) with the same median score over all inner-folds. We can resolve this by re-running the fit method with the best_inner_fold_pipeline_inds keyword argument, which maps an outer-fold index to the index of the pipeline we want to win it. I'll choose the simpler model, which for KNN is the one with the higher number of neighbors:

In [None]:
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, 
         best_inner_fold_pipeline_inds={2:55})

Whichever pipeline wins the most outer-fold contests wins overall.

In this case, pyplearnr has notified us that every outer-fold contest has produced a different winner, so no pipeline has a majority. We can resolve this conflict by, again, re-running the fit method, but with the best_outer_fold_pipeline keyword argument:

In [None]:
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, 
         best_outer_fold_pipeline=55)
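
The two-stage selection described above (best median inner-fold score within each outer fold, then a vote across outer folds) can be sketched as follows. Both helper functions and all of the scores are made up for illustration; this is my reading of the logic, not pyplearnr's implementation.

```python
from collections import Counter
from statistics import median

def contest_winners(scores_by_pipeline):
    """Return every pipeline index tied for the best median score."""
    medians = {ind: median(scores) for ind, scores in scores_by_pipeline.items()}
    best = max(medians.values())
    return sorted(ind for ind, score in medians.items() if score == best)

def overall_winner(outer_fold_winners):
    """Majority vote over outer folds; None signals an unresolved tie."""
    ranked = Counter(outer_fold_winners.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

# Made-up inner-fold scores for one outer fold: pipelines 51 and 55 tie.
inner_scores = {51: [0.80, 0.84, 0.86],
                55: [0.79, 0.84, 0.90],
                12: [0.70, 0.75, 0.72]}
print(contest_winners(inner_scores))          # [51, 55]

# Made-up outer-fold winners: three different pipelines means no majority.
print(overall_winner({0: 27, 1: 15, 2: 55}))  # None
print(overall_winner({0: 55, 1: 15, 2: 55}))  # 55
```

A tie in the first function corresponds to needing best_inner_fold_pipeline_inds, and a `None` from the second corresponds to needing best_outer_fold_pipeline.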

The output report lists the winning pipeline index, its validation (outer-fold test) scores and statistics, inner-fold (IF) test scores and statistics for each outer-fold (OF), a layout of the pipeline steps, and the corresponding step parameters.

Additionally, the report contains the outer- and inner-fold counts, seeds, scoring metric, and scoring type. These, along with the same data and pipelines, can be passed to the nested k-fold cross-validation object's initialization and fit method to reproduce the results of this run exactly.
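
To illustrate why recording the seeds is enough to reproduce a run, here is a simplified sketch of seeded nested k-fold splitting using only the standard library. The two helper functions are hypothetical. The sketch is not stratified and will not produce the same folds as pyplearnr even with the same seed values; it only shows that fixed seeds give identical splits on every run.

```python
import random

def k_fold_indices(indices, fold_count, seed):
    """Shuffle with a fixed seed, then deal the indices into fold_count test folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::fold_count] for i in range(fold_count)]

def nested_splits(sample_count, outer_seed, inner_seeds, fold_count=3):
    """Yield (outer fold, inner fold, train, test, validation) index lists."""
    outer_folds = k_fold_indices(range(sample_count), fold_count, outer_seed)
    for outer_ind, validation in enumerate(outer_folds):
        held_out = set(validation)
        remainder = [i for i in range(sample_count) if i not in held_out]
        # Each outer fold gets its own seed for splitting its inner folds.
        inner_folds = k_fold_indices(remainder, fold_count, inner_seeds[outer_ind])
        for inner_ind, test in enumerate(inner_folds):
            in_test = set(test)
            train = [i for i in remainder if i not in in_test]
            yield outer_ind, inner_ind, train, test, validation

first = list(nested_splits(890, 461, [284, 406, 303]))
second = list(nested_splits(890, 461, [284, 406, 303]))
print(len(first), first == second)  # 9 True
```

The validation indices of an outer fold never appear in that fold's inner train or test sets, which is what keeps the validation score honest.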

We can get a visual report of this pipeline's validation scores and the inner-fold test scores for each outer-fold:

In [None]:
kfcv.plot_best_pipeline_scores()

We can also visualize the performance of all pipelines over all folds:

In [None]:
kfcv.plot_contest(all_folds=True, markersize=3, figsize=(5,10), fontsize=8)

Finally, we can visualize the test-fold scores for each fold separately:

In [None]:
kfcv.plot_contest(markersize=2, figsize=(5,10), fontsize=8)

# KNN with different scaling
We can investigate how scaling affects the scores for this dataset by adding a scaler step, with several scaling options, to the combinatorial pipeline schematic:

In [None]:
%%time 

# Initialize nested k-fold cross-validation object
kfcv = ppl.NestedKFoldCrossValidation(outer_loop_fold_count=3, 
                                      inner_loop_fold_count=3,
                                      shuffle_seed=2369,
                                      outer_loop_split_seed=461,
                                      inner_loop_split_seeds=[284, 406, 303])

# Combinatorial pipeline schematic
pipeline_schematic = [
    {'scaler': {
            'none': {},
            'standard': {},
            'normal': {},
            'min_max': {},
            'binary': {}
        }},
    {'estimator': {
            'knn': {
                'n_neighbors': range(1,31),
                'weights': ['uniform','distance']
    }}}
]

# Perform nested k-fold cross-validation
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, 
         scoring_metric='auc', score_type='median')

In [None]:
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, 
         scoring_metric='auc', score_type='median',
         best_inner_fold_pipeline_inds={1:15})

In [None]:
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, 
         scoring_metric='auc', score_type='median',
         best_outer_fold_pipeline=27)

Pyplearnr has chosen a pipeline that scales the feature data to between 0 and 1 before putting it through a KNN classifier set to consider 14 nearest neighbors weighted by distance.

We can plot the contests again, but this time with the color_by keyword argument, to see if there are any patterns:

In [None]:
kfcv.plot_contest(color_by='scaler', all_folds=True, 
                  markersize=1, figsize=(10,20), fontsize=4)

As expected, the pipelines without scaling have the lowest scores. Additionally, the scikit-learn Normalizer scaler does worse than the others.

The MinMaxScaler and StandardScaler do the best for this dataset with the KNN classifier.
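
A quick way to see why unscaled features hurt a distance-based classifier is to look at how much each feature contributes to the squared Euclidean distance between two passengers. The sketch below uses three rows from the df.head(10) output and a hand-written min-max scaler. Both helpers are illustrative stand-ins of my own, not the scikit-learn or pyplearnr code.

```python
# (Age, Fare, Sex_male) for passengers 0, 1 and 4 from the df.head(10) output above.
passengers = [(22.0, 7.25, 1), (38.0, 71.2833, 0), (35.0, 8.05, 1)]

def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [tuple((v - lo) / float(hi - lo) for v, lo, hi in zip(row, lows, highs))
            for row in rows]

def squared_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squares = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squares)
    return [round(s / total, 3) for s in squares]

raw = squared_shares(passengers[0], passengers[1])
scaled_rows = min_max_scale(passengers)
scaled = squared_shares(scaled_rows[0], scaled_rows[1])

print(raw)     # [0.059, 0.941, 0.0]
print(scaled)  # [0.333, 0.333, 0.333]
```

On the raw data Fare accounts for about 94% of the squared distance and Sex_male for essentially none of it. In this tiny sample the two passengers happen to sit at the extremes of every column, so after scaling each feature contributes equally.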

# Different estimators
Let's say we would like to compare the performance of multiple classifiers on this dataset. Estimators that pyplearnr doesn't already support by name can be supplied directly by passing the scikit-learn class as the 'sklo' parameter of a step option:

In [None]:
%%time 

from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Initialize nested k-fold cross-validation object
kfcv = ppl.NestedKFoldCrossValidation(outer_loop_fold_count=3, 
                                      inner_loop_fold_count=3,
                                      shuffle_seed=2369,
                                      outer_loop_split_seed=461,
                                      inner_loop_split_seeds=[284, 406, 303])

# Combinatorial pipeline schematic
pipeline_schematic = [
    {'estimator': {
            'knn': {
                'n_neighbors': range(1,31),
                'weights': ['uniform','distance']
            },
            'svm': {
                    'sklo': LinearSVC,
                    'loss': ['hinge', 'squared_hinge']
                },
            'logistic_regression': {},
            'random_forest': {
                'sklo': RandomForestClassifier,
                'max_depth': range(2,6)
            },
            'gaussian': {
                'sklo': GaussianProcessClassifier,                
            },
            'adaboost': {
                'sklo': AdaBoostClassifier
            },
            'naive_bayes': {
                'sklo': GaussianNB
            },
            'qda': {
                'sklo': QuadraticDiscriminantAnalysis
            }
    }}
]

# Perform nested k-fold cross-validation
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, 
         scoring_metric='auc', score_type='median')

In [None]:
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, 
    best_inner_fold_pipeline_inds={2:60})

The best model appears to be logistic regression.

In [None]:
kfcv.plot_best_pipeline_scores()

The validation performance matches well with that in the inner-fold contests.

Let's look at all of the inner-fold pipeline contests:

In [None]:
kfcv.plot_contest(all_folds=True, color_by='estimator', 
                  color_map='jet', markersize=5, fontsize=10)

Note, I've chosen different input parameters to make the plot look better.

# PCA with feature selection and KNN
We'd like to see if there's any pattern when combining either standard or min_max scaling, PCA, selection of different numbers of the transformed features (essentially choosing how many principal components to use), and k-nearest neighbors over multiple values of k:

In [None]:
%%time 

import numpy as np

# Initialize nested k-fold cross-validation object
kfcv = ppl.NestedKFoldCrossValidation(outer_loop_fold_count=3, 
                                      inner_loop_fold_count=3,
                                      shuffle_seed=3243,
                                      outer_loop_split_seed=45,
                                      inner_loop_split_seeds=[62, 207, 516])

# Combinatorial pipeline schematic
feature_count = X.shape[1]

pipeline_schematic = [
    {'scaler': {
            'min_max': {},
            'standard': {}
        }
    },
    {'transform': {
            'pca': {
                'n_components': [feature_count]
            }
        }         
    },
    {'feature_selection': {
            'select_k_best': {
                'k': range(1, feature_count+1)
            }
        }
    },
    {'estimator': {
            'knn': {
                'n_neighbors': range(1,31)
                }
        }
    }
]

# Perform nested k-fold cross-validation
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, scoring_metric='auc')

In [None]:
kfcv.fit(X.values, y.values, 
         pipeline_schematic=pipeline_schematic, 
         best_outer_fold_pipeline=728)

Our process has resulted in the selection of a pipeline with standard scaling, PCA, selection of 9 principal components, and a KNN classifier with 9 neighbors and uniform weighting.
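
As a sanity check, the pipeline index 728 can be decoded by hand, assuming the combinations are numbered in schematic order with the last step varying fastest. That ordering is my assumption (it is not stated anywhere in this demo), but it does land on the same pipeline described above. The `decode` helper is hypothetical.

```python
# Options per step, in schematic order: scaler, PCA setting, k, n_neighbors.
scalers = ['min_max', 'standard']
pca_settings = [16]                   # n_components = feature_count = 16
k_values = list(range(1, 17))         # select_k_best k
neighbor_counts = list(range(1, 31))  # KNN n_neighbors

def decode(index):
    """Map a flat pipeline index to its options, last step varying fastest."""
    index, n_ind = divmod(index, len(neighbor_counts))
    index, k_ind = divmod(index, len(k_values))
    scaler_ind, pca_ind = divmod(index, len(pca_settings))
    return (scalers[scaler_ind], pca_settings[pca_ind],
            k_values[k_ind], neighbor_counts[n_ind])

print(decode(728))  # ('standard', 16, 9, 9)
```

In other words, 728 = 1 × 480 + 8 × 30 + 8: the second scaler, the ninth value of k, and the ninth value of n_neighbors.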

In [None]:
kfcv.plot_best_pipeline_scores()

The validation performance is in line with the best pipeline's inner-fold testing performance for each outer-fold.

We can look at the effect of parameter values by changing the color_by keyword argument to a string with the format 'step\__step_option\__parameter_name\__parameter_value'. To be clear, those are two underscores between step, step_option, parameter_name, and parameter_value. Trailing parts can be left off, as in the calls below, which stop at the parameter name.
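
Because the separator is a double underscore, names that contain single underscores (such as select_k_best or n_neighbors) stay intact when the string is split. The following is a small hypothetical parser to make the format concrete; it is not how pyplearnr handles color_by internally.

```python
def parse_color_by(color_by):
    """Split a color_by string on double underscores into its named parts."""
    names = ('step', 'step_option', 'parameter_name', 'parameter_value')
    parts = color_by.split('__')
    if len(parts) > len(names):
        raise ValueError('too many parts in {!r}'.format(color_by))
    return dict(zip(names, parts))

print(parse_color_by('scaler'))
# {'step': 'scaler'}
print(parse_color_by('feature_selection__select_k_best__k'))
# {'step': 'feature_selection', 'step_option': 'select_k_best', 'parameter_name': 'k'}
```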

Let's see if there are any patterns with regard to the number of principal components used:

In [None]:
kfcv.plot_contest(all_folds=True, markersize=1, fontsize=2, figsize=(20,60), 
                  color_by='feature_selection__select_k_best__k', color_map='hot')

I'm not sure I see much of a pattern, other than that pipelines using fewer principal components tend to predominate among the lowest scores.

Now let's look at the number of k-nearest neighbors for the classifier:

In [None]:
kfcv.plot_contest(all_folds=True, markersize=1, fontsize=2, figsize=(20,60), 
                  color_by='estimator__knn__n_neighbors', color_map='hot')

The results appear rather mixed, except that the highest scores tend to occur with 1 to about 13 nearest neighbors, although those values are represented among the lowest scores as well.

# Reducing the number of pipeline combinations
This process can become time-intensive quickly. So, in the spirit of scikit-learn's RandomizedSearchCV, I've included a random_combinations keyword argument to specify how many of the available combinations to sample, and a random_combination_seed keyword argument that, like the other seeds, can be reused to duplicate results:

In [None]:
%%time 

import numpy as np

# Initialize nested k-fold cross-validation object
kfcv = ppl.NestedKFoldCrossValidation(outer_loop_fold_count=3, 
                                      inner_loop_fold_count=3,
                                      shuffle_seed=3243,
                                      outer_loop_split_seed=45,
                                      inner_loop_split_seeds=[62, 207, 516],
                                      random_combinations=50,
                                      random_combination_seed=2374)

# Design combinatorial pipeline schematic
feature_count = X.shape[1]

pipeline_schematic = [
    {'scaler': {
            'min_max': {},
            'standard': {}
        }
    },
    {'transform': {
            'pca': {
                'n_components': [feature_count]
            }
        }         
    },
    {'feature_selection': {
            'select_k_best': {
                'k': range(1, feature_count+1)
            }
        }
    },
    {'estimator': {
            'knn': {
                'n_neighbors': range(1,31)
                }
        }
    }
]

# Perform nested k-fold cross-validation
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, scoring_metric='auc')

In [None]:
kfcv.fit(X.values, y.values, pipeline_schematic=pipeline_schematic, 
    best_inner_fold_pipeline_inds={2:681})

In [None]:
kfcv.plot_best_pipeline_scores()

The best pipeline has a slightly lower median validation score (0.7918 vs. 0.7995) than the one found by testing all pipelines, but the run takes about 1/20 of the time (7.32 s versus 2 min 36 s).
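
The speed-up is about what the counts predict. Here is a minimal sketch of the sampling idea, drawing 50 combinations from the full grid with a seeded generator from the standard library. It reuses the seed value 2374 from the cell above, but it will not reproduce the same 50 pipelines pyplearnr picked, since I'm not assuming anything about its internal sampling.

```python
import random
from itertools import product

scalers = ['min_max', 'standard']
k_values = range(1, 17)          # select_k_best k, with feature_count = 16
neighbor_counts = range(1, 31)   # KNN n_neighbors

# Full grid (the PCA step has a single setting, so it adds no combinations).
all_combinations = list(product(scalers, k_values, neighbor_counts))

# Draw 50 distinct combinations; the fixed seed makes the draw repeatable.
sampled = random.Random(2374).sample(all_combinations, 50)

print(len(all_combinations))                                  # 960
print(round(len(sampled) / float(len(all_combinations)), 3))  # 0.052
```

Sampling 50 of 960 pipelines is roughly 1 in 19, which is in line with the run taking about 1/20 of the time.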

In [None]:
kfcv.plot_contest(all_folds=True, markersize=5, fontsize=13, figsize=(30,15), 
                  color_by='estimator__knn__n_neighbors', color_map='hot', legend_loc='best')

Having fewer pipelines certainly makes it easier to make these graphs look better.

# Accessing internal pipeline parameters
The best pipeline is automatically placed in the pipeline field of the nested k-fold cross-validation object (kfcv).

This object is a custom pyplearnr.OuterFoldTrainedPipeline object whose own pipeline field contains the actual trained sklearn.pipeline.Pipeline object. That Pipeline can be used to inspect the fitted parameters of each step in the usual way. Please see scikit-learn's documentation of Pipeline objects for more information.

# Predicting survival with the optimal model
All one has to do to make a prediction is use the .predict() method.

Here's an example of predicting whether I would survive on the Titanic. I'm 33, would probably have one family member with me, might be in Pclass 1 (I'd hope), am male, and have a Ph.D. (if that's what they mean by Dr.). I'm using the median Fare for Pclass 1 and arbitrarily chose a city to have embarked from:

In [None]:
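# Feature values, in the same order as X.columns
personal_stats = np.array([33, 1, 0, df[df['Pclass']==1]['Fare'].median(), 
                  0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

# Pair each value with its column name to check the ordering
list(zip(personal_stats, X.columns))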
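
Typing 16 values in the right order by hand is easy to get wrong, so here is a sketch of building the same row by column name instead. The `feature_vector` helper is hypothetical, the column list is copied from the one-hot encoded output earlier, and the fare is a placeholder because the actual Pclass 1 median isn't printed in this demo.

```python
input_columns = ['Age', 'SibSp', 'Parch', 'Fare', 'Pclass_2', 'Pclass_3',
                 'Sex_male', 'Embarked_Q', 'Embarked_S', 'Title_Dr',
                 'Title_Military', 'Title_Miss', 'Title_Mr', 'Title_Mrs',
                 'Title_Noble', 'Title_Rev']

def feature_vector(columns, **known):
    """Build a row in column order; unspecified columns default to 0."""
    unknown = set(known) - set(columns)
    if unknown:
        raise KeyError('not an input column: {}'.format(sorted(unknown)))
    return [known.get(column, 0) for column in columns]

placeholder_fare = 60.0  # stand-in only; the notebook uses the Pclass 1 median Fare
row = feature_vector(input_columns, Age=33, SibSp=1, Fare=placeholder_fare,
                     Sex_male=1, Embarked_Q=1, Title_Dr=1)
print(row)
# [33, 1, 0, 60.0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
```

Leaving Pclass_2 and Pclass_3 at 0 encodes Pclass 1, and a misspelled column name raises an error instead of silently shifting the values.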

In [None]:
kfcv.predict(personal_stats.reshape(1,-1))

Looks like I survived!

Let's look at my predicted probability of surviving:

In [None]:
kfcv.predict_proba(personal_stats.reshape(1,-1))

I would have a 60% chance of survival.
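
For a KNN classifier, a probability like this is just the (possibly weighted) share of the nearest neighbors that survived. Below is a rough sketch with made-up neighbors, assuming the selected pipeline ends in a KNN estimator as in the last run above. The function is a hypothetical simplification and ignores edge cases such as zero distances.

```python
def knn_survival_probability(neighbors, weights='uniform'):
    """neighbors: list of (distance, label) pairs for the k nearest points."""
    if weights == 'uniform':
        votes = [(1.0, label) for _, label in neighbors]
    else:  # 'distance': closer neighbors count for more (weight = 1 / distance)
        votes = [(1.0 / dist, label) for dist, label in neighbors]
    total = sum(weight for weight, _ in votes)
    return sum(weight for weight, label in votes if label == 1) / total

# Made-up neighbors: three survivors (1) and two non-survivors (0).
neighbors = [(0.5, 0), (1.0, 1), (2.0, 1), (2.0, 1), (4.0, 0)]

uniform_probability = knn_survival_probability(neighbors)
distance_probability = knn_survival_probability(neighbors, 'distance')

print(uniform_probability)             # 0.6
print(round(distance_probability, 3))  # 0.471
```

With uniform weights, 3 of 5 surviving neighbors gives 0.6. With distance weights the closest neighbor, a non-survivor, pulls the same neighborhood below 0.5.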

# Summary
I've shown how to use pyplearnr to do model selection and validation among a diverse collection of pipelines using nested k-fold cross-validation. The pipelines are generated from a simple, intuitive, and flexible combinatorial pipeline schematic.

Also, I've shown how to visualize the best model performance and that of all models in the inner-fold contests of each outer-fold, predict survival, and check the actual predicted probability according to the optimized pipeline.

I hope this proves to be a useful tool. 

Please let me know if you have any questions or suggestions about how to improve this tool, my code, the approach I'm taking, etc.