# Data Science Library demo
Here I demonstrate the data science library I developed to quickly build scikit-learn pipelines with optional scaling, feature interaction, and data transformation (e.g. PCA, t-SNE) steps. It runs each pipeline through a grid search (over all parameter combinations or a specified number of them) with k-fold cross-validation, stratified for classification problems, and outputs the best model.

## Titanic dataset
Here I use the Titanic dataset I've cleaned and pickled in a separate tutorial.

### Import data

In [2]:
import pandas as pd

df = pd.read_pickle('trimmed_titanic_data.pkl')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    891 non-null object
Title       891 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 62.7+ KB


By "cleaned" I mean I've derived titles (e.g. "Mr.", "Mrs.", "Dr.") from the passenger names, imputed the missing Age values using polynomial regression with grid-searched 10-fold cross-validation, filled in the 3 missing Embarked values with the mode, and removed all fields that could serve as an identifier for an individual passenger.

Thus, there is no missing data.
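The cleaning code itself lives in the separate tutorial and is not shown here. As a rough sketch of two of the steps described above (deriving titles and filling Embarked with the mode), assuming the raw data has Name and Embarked columns and using a made-up three-row sample:

```python
import pandas as pd

# Made-up sample; the column names 'Name' and 'Embarked' are assumptions
raw = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina'],
    'Embarked': ['S', None, 'S'],
})

# The title is the text between the comma and the first period of the name
raw['Title'] = raw['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Fill missing ports with the most frequent value (the mode)
raw['Embarked'] = raw['Embarked'].fillna(raw['Embarked'].mode()[0])

print(raw[['Title', 'Embarked']])
```

The Title columns that appear later (e.g. Title_Military, Title_Noble) suggest the original tutorial also groups rare titles, which this sketch does not attempt.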

### One-hot encode categorical features

In [3]:
simulation_df = df.copy()

simulation_df = pd.get_dummies(simulation_df,drop_first=True)

simulation_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 16 columns):
Survived          891 non-null int64
Pclass            891 non-null int64
Age               891 non-null float64
SibSp             891 non-null int64
Parch             891 non-null int64
Fare              891 non-null float64
Sex_male          891 non-null uint8
Embarked_Q        891 non-null uint8
Embarked_S        891 non-null uint8
Title_Dr          891 non-null uint8
Title_Military    891 non-null uint8
Title_Miss        891 non-null uint8
Title_Mr          891 non-null uint8
Title_Mrs         891 non-null uint8
Title_Noble       891 non-null uint8
Title_Rev         891 non-null uint8
dtypes: float64(2), int64(4), uint8(10)
memory usage: 50.5 KB


Now we have 16 columns: the Survived response plus 15 input features.
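To see what drop_first=True does, here is a small sketch on made-up rows that reuse three of the column names. For each categorical column the first level in sorted order is dropped and becomes the implicit baseline, which is why there is no Sex_female or Embarked_C column above:

```python
import pandas as pd

# Made-up rows, not taken from the Titanic data
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

encoded = pd.get_dummies(toy, drop_first=True)

# Numeric columns pass through; each categorical column loses its first level
print(list(encoded.columns))
```

Dropping one level per variable avoids perfectly collinear dummy columns, which matters for linear models such as logistic regression.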

### Split into input/output data

In [4]:
# Set output feature
output_feature = 'Survived'

# Get all column names
column_names = list(simulation_df.columns)

# Use every column except the output as an input feature
# (drop_first=True above already removed one redundant dummy per categorical variable)
input_features = [x for x in column_names if x != output_feature]

# Split into features and responses
X = simulation_df[input_features].copy()
y = simulation_df[output_feature].copy()

### Null model

In [5]:
# Share of each class in the response
simulation_df['Survived'].value_counts(normalize=True).values

array([ 0.61616162,  0.38383838])

Thus, the null accuracy is ~62%: the score obtained by always predicting death (Survived = 0).
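The same baseline can be computed without pandas. A minimal sketch, using the class counts implied by the proportions printed above for 891 passengers (549 deaths and 342 survivors):

```python
from collections import Counter

# 549 deaths (0) and 342 survivors (1), as implied by the printed proportions
survived = [0] * 549 + [1] * 342

counts = Counter(survived)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
null_accuracy = majority_count / float(len(survived))

print(majority_class, round(null_accuracy, 4))
```

Any model worth keeping should beat this number on the test set.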

### Import data science library and initialize model collections

In [6]:
import data_science_lib as dsl

models = {}



### Basic models with no pre-processing
#### KNN
Here I fit a simple K-nearest neighbors (KNN) classifier using the default 10-fold cross-validation and the default grid search over 1 to 30 nearest neighbors with either "uniform" or "distance" weights:

In [7]:
%%time 
reload(dsl)
        
# Figure out best model
models['knn'] = dsl.train_model(X,y,
                                use_default_param_dist=True,
                                random_state=6,
                                suppress_output=False, # Can suppress print outs if desired
                                estimator='knn',) 

Grid parameters:
estimator__n_neighbors : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
estimator__weights : ['uniform', 'distance']

Training set classification accuracy:  0.731741573034

Test set classification accuracy:  0.748603351955
Confusion matrix: 

 [[96 17]
 [28 38]]

Normalized confusion matrix: 

 [[ 0.53631285  0.09497207]
 [ 0.15642458  0.2122905 ]]

Classification report: 

              precision    recall  f1-score   support

          0       0.77      0.85      0.81       113
          1       0.69      0.58      0.63        66

avg / total       0.74      0.75      0.74       179


Best parameters:

{'estimator__n_neighbors': 30, 'estimator__weights': 'distance'}

 Pipeline(steps=[('estimator', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=30, p=2,
           weights='distance'))])
CPU times: user 1.78 s, sys: 133 ms, tot

It turns out that the best settings are 30 neighbors with 'distance' weights. Note that 30 is the upper edge of the default grid, so larger values may be worth trying.

Note how I've set the random_state to 6 so that the models can be compared using the same test/train split.
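The internals of train_model() are not shown in this demo. As a rough approximation of the KNN run above in plain scikit-learn, assuming a 20% held-out test set (consistent with the 179 test rows reported out of 891) and using synthetic data in place of the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for X and y (assumption: not the Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=6)

# A fixed random_state gives every model the same train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=6)

# Single-step pipeline; the step name matches the 'estimator__' prefix above
pipeline = Pipeline(steps=[('estimator', KNeighborsClassifier())])

param_grid = {
    'estimator__n_neighbors': list(range(1, 31)),
    'estimator__weights': ['uniform', 'distance'],
}

# Stratified 10-fold cross-validation over every grid combination
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=6)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='accuracy')
search.fit(X_train, y_train)

test_accuracy = search.best_estimator_.score(X_test, y_test)
print(search.best_params_)
print(test_accuracy)
```

Whether the library shuffles the folds or stratifies the initial split is not stated, so those choices here are guesses.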

The train_model() function returns a Pipeline object that uses the optimal parameters found during the grid search and has been trained on all data.

Additional attributes have been added to the pipeline object: the type of score ('regression' or 'classification') in '.score_type', the training score (L2 norm for regression, classification accuracy for classification) in '.train_score', the corresponding test score in '.test_score', and the best grid-search parameters in '.best_parameters'.

For classification problems, the additional attributes also include the confusion matrix, '.confusion_matrix', the normalized confusion matrix, '.normalized_confusion_matrix', and the classification report, '.classification_report'.

Here are the values of these additional attributes:

In [8]:
pipeline = models['knn']

# Score type and training/test scores
print(pipeline.score_type)

print(pipeline.train_score)

print(pipeline.test_score)

# Best parameters
print(pipeline.best_parameters)

# Confusion matrix, if classification
print(pipeline.confusion_matrix) 

# Confusion matrix divided by its total sum
print(pipeline.normalized_confusion_matrix)

# Classification report
print(pipeline.classification_report)

classification
0.731741573034
0.748603351955
{'estimator__n_neighbors': 30, 'estimator__weights': 'distance'}
[[96 17]
 [28 38]]
[[ 0.53631285  0.09497207]
 [ 0.15642458  0.2122905 ]]
             precision    recall  f1-score   support

          0       0.77      0.85      0.81       113
          1       0.69      0.58      0.63        66

avg / total       0.74      0.75      0.74       179
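Every per-class number in the report can be recovered from the confusion matrix alone. A short sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm = [[96, 17],
      [28, 38]]

total = float(sum(sum(row) for row in cm))

# Normalized confusion matrix: every cell divided by the grand total
normalized = [[cell / total for cell in row] for row in cm]

# Accuracy is the share of predictions on the diagonal
accuracy = (cm[0][0] + cm[1][1]) / total

precision = {}
recall = {}
for k in (0, 1):
    predicted_k = cm[0][k] + cm[1][k]  # column sum: everything predicted as k
    true_k = sum(cm[k])                # row sum: everything that really is k
    precision[k] = cm[k][k] / float(predicted_k)
    recall[k] = cm[k][k] / float(true_k)
    print(k, round(precision[k], 2), round(recall[k], 2))

print(round(accuracy, 6))
```

The recall of 0.58 for class 1 means this KNN model misses roughly four in ten survivors, something the overall accuracy alone hides.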



The printout from the first train_model() call shows that the default grid for KNN covers n_neighbors from 1 to 30 and weights of either 'uniform' or 'distance'.

This can be changed in two different ways. The first is to overwrite individual default values by passing the param_dist keyword argument while leaving use_default_param_dist set to True:

In [10]:
%%time 
reload(dsl)

# Set custom parameters
param_dist = {
    'estimator__n_neighbors': range(30,500)
}

# Figure out best model
models['custom_overwrite_knn'] = dsl.train_model(X,y,
                                       use_default_param_dist=True,
                                       random_state=6,
                                       suppress_output=False, # Can suppress print outs if desired
                                       estimator='knn',
                                       param_dist = param_dist) 

Grid parameters:
estimator__n_neighbors : [30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 23

We were able to get a slightly better accuracy doing this.
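A wider grid costs more fits. The sketch below counts the combinations in this grid and shows how a fixed number of them could be sampled instead of trying them all. It assumes the default weights are kept when only n_neighbors is overwritten, and the sampling uses the standard library purely as an illustration, not as the library's implementation:

```python
import itertools
import random

grid = {
    'estimator__n_neighbors': list(range(30, 500)),
    'estimator__weights': ['uniform', 'distance'],
}

# Every combination of parameter values, as an exhaustive grid search would try
names = sorted(grid)
combinations = [dict(zip(names, values))
                for values in itertools.product(*(grid[name] for name in names))]

# Each combination is fitted once per cross-validation fold
n_folds = 10
n_fits = len(combinations) * n_folds
print(len(combinations), n_fits)

# Alternative: evaluate only a fixed number of randomly chosen combinations
rng = random.Random(6)
sampled = rng.sample(combinations, 20)
print(sampled[0])
```

In scikit-learn the two strategies correspond to GridSearchCV and RandomizedSearchCV.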

The second way is to pass param_dist with use_default_param_dist set to False. In that case no defaults are used, so every parameter you want in the grid must be set manually.

Here's an example:

In [11]:
%%time 
reload(dsl)

# Set custom parameters
param_dist = {
    'estimator__n_neighbors': range(30,500)
}

# Figure out best model
models['from_scratch_knn'] = dsl.train_model(X,y,
                                       use_default_param_dist=False,
                                       random_state=6,
                                       suppress_output=False, # Can suppress print outs if desired
                                       estimator='knn',
                                       param_dist = param_dist) 

Grid parameters:
estimator__n_neighbors : [30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 23

Note how the estimator__weights parameter is no longer part of the grid for the KNN estimator, since only estimator__n_neighbors was specified.
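The difference between the two modes can be pictured as a dictionary merge. The helper below is hypothetical (build_param_grid is not part of data_science_lib) and only mirrors the behaviour described above:

```python
# Default KNN grid as printed by the first train_model() call
DEFAULT_KNN_GRID = {
    'estimator__n_neighbors': list(range(1, 31)),
    'estimator__weights': ['uniform', 'distance'],
}

def build_param_grid(param_dist=None, use_default_param_dist=True):
    """Hypothetical helper: combine default and custom grid values."""
    grid = dict(DEFAULT_KNN_GRID) if use_default_param_dist else {}
    grid.update(param_dist or {})  # custom values win over defaults
    return grid

custom = {'estimator__n_neighbors': list(range(30, 500))}

overwrite = build_param_grid(custom, use_default_param_dist=True)
from_scratch = build_param_grid(custom, use_default_param_dist=False)

print(sorted(overwrite))
print(sorted(from_scratch))
```

In overwrite mode the weights entry survives from the defaults; in from-scratch mode the grid holds only what was passed in.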

#### Other models
This code currently supports K-nearest neighbors, logistic regression, support vector machines, multilayer perceptrons, random forests, and AdaBoost.

We can loop through and pick the best model like this:

In [12]:
%%time 
reload(dsl)

# Set model names to iterate over
model_names = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']        

# Cross-validate each model
for model_name in model_names:
    models[model_name] = dsl.train_model(X,y,
                                    use_default_param_dist=True,
                                    random_state=6,
                                    estimator=model_name)

Grid parameters:
estimator__n_neighbors : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
estimator__weights : ['uniform', 'distance']

Training set classification accuracy:  0.731741573034

Test set classification accuracy:  0.748603351955
Confusion matrix: 

 [[96 17]
 [28 38]]

Normalized confusion matrix: 

 [[ 0.53631285  0.09497207]
 [ 0.15642458  0.2122905 ]]

Classification report: 

              precision    recall  f1-score   support

          0       0.77      0.85      0.81       113
          1       0.69      0.58      0.63        66

avg / total       0.74      0.75      0.74       179


Best parameters:

{'estimator__n_neighbors': 30, 'estimator__weights': 'distance'}

 Pipeline(steps=[('estimator', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=30, p=2,
           weights='distance'))])
Grid parameters:
estimator__C : [  1.000

In [23]:
trained_model_names = models.keys()

model_scores = [models[model].test_score for model in trained_model_names]

max_score = max(model_scores)

best_model = trained_model_names[model_scores.index(max_score)]

print best_model,max_score, '\n\n',models[best_model].classification_report

logistic_regression 0.882681564246 

             precision    recall  f1-score   support

          0       0.88      0.95      0.91       113
          1       0.89      0.77      0.83        66

avg / total       0.88      0.88      0.88       179



It turns out that the best model is logistic regression with a classification accuracy of ~88%.
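The same selection can be written more compactly with max() and a key function, which also works on Python 3, where the result of dict.keys() cannot be indexed. A sketch using only the two test scores visible so far in this notebook (the other models' scores are cut off in the output, so they are left out):

```python
# Test scores copied from the outputs shown above
test_scores = {
    'knn': 0.748603351955,
    'logistic_regression': 0.882681564246,
}

# Name of the model with the highest test score
best_model = max(test_scores, key=test_scores.get)
best_score = test_scores[best_model]
print(best_model, best_score)

# Ties are worth checking: max() only returns the first best entry it meets
tied = sorted(name for name, score in test_scores.items() if score == best_score)
print(tied)
```

With the real pipeline objects the equivalent call would be max(models, key=lambda name: models[name].test_score).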

### Scaled data then classification
We can specify the scale_type keyword argument to scale the data before it is fed to the desired estimator. Currently only standard scaling is supported:

In [24]:
%%time 
reload(dsl)

# Set model names to iterate over
model_names = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']        

# Cross-validate each model
for model_name in model_names:
    models['scaled_%s'%(model_name)] = dsl.train_model(X,y,
                                                       use_default_param_dist=True,
                                                       random_state=6,
                                                       scale_type = 'standard',
                                                       estimator=model_name)

Grid parameters:
estimator__n_neighbors : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
estimator__weights : ['uniform', 'distance']

Training set classification accuracy:  0.832865168539

Test set classification accuracy:  0.821229050279
Confusion matrix: 

 [[100  13]
 [ 19  47]]

Normalized confusion matrix: 

 [[ 0.55865922  0.0726257 ]
 [ 0.10614525  0.26256983]]

Classification report: 

              precision    recall  f1-score   support

          0       0.84      0.88      0.86       113
          1       0.78      0.71      0.75        66

avg / total       0.82      0.82      0.82       179


Best parameters:

{'estimator__n_neighbors': 7, 'estimator__weights': 'uniform'}

 Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('estimator', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=7, p=2,
   

In [25]:
trained_model_names = models.keys()

model_scores = [models[model].test_score for model in trained_model_names]

max_score = max(model_scores)

best_model = trained_model_names[model_scores.index(max_score)]

print best_model,max_score, '\n\n',models[best_model].classification_report

logistic_regression 0.882681564246 

             precision    recall  f1-score   support

          0       0.88      0.95      0.91       113
          1       0.89      0.77      0.83        66

avg / total       0.88      0.88      0.88       179
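Standard scaling matters most for distance-based models such as KNN, whose test accuracy rose from ~75% to ~82% once the scaler was added. A minimal sketch of the transformation on made-up Age and Fare values, using the same formula as scikit-learn's StandardScaler (subtract the column mean, divide by the population standard deviation):

```python
import math

# Made-up values on very different scales
ages = [22.0, 38.0, 26.0, 35.0]
fares = [7.25, 71.28, 7.92, 53.10]

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

scaled_ages = standardize(ages)
scaled_fares = standardize(fares)

# After scaling, both columns have mean 0 and unit variance, so neither
# dominates a Euclidean distance simply because of its units
print([round(v, 2) for v in scaled_ages])
print([round(v, 2) for v in scaled_fares])
```

Tree-based models such as random forests are insensitive to this kind of rescaling, so the benefit is not expected to be uniform across estimators.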



### Feature selection, scaling, and classification
The feature_selection_type keyword argument adds a feature-selection step to the pipeline. With 'select_k_best', the number of features to keep, k, is included in the grid search:

In [26]:
%%time 
reload(dsl)

# Set model names to iterate over
model_names = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']        

# Cross-validate each model
for model_name in model_names:
    models['select_scaled_%s'%(model_name)] = dsl.train_model(X,y,
                                                       use_default_param_dist=True,
                                                       random_state=6,
                                                       scale_type = 'standard',
                                                       feature_selection_type = 'select_k_best', 
                                                       estimator=model_name)

Grid parameters:
estimator__n_neighbors : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
feature_selection__k : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
estimator__weights : ['uniform', 'distance']

Training set classification accuracy:  0.835674157303

Test set classification accuracy:  0.832402234637
Confusion matrix: 

 [[102  11]
 [ 19  47]]

Normalized confusion matrix: 

 [[ 0.5698324   0.06145251]
 [ 0.10614525  0.26256983]]

Classification report: 

              precision    recall  f1-score   support

          0       0.84      0.90      0.87       113
          1       0.81      0.71      0.76        66

avg / total       0.83      0.83      0.83       179


Best parameters:

{'estimator__n_neighbors': 7, 'feature_selection__k': 12, 'estimator__weights': 'uniform'}

 Pipeline(steps=[('feature_selection', SelectKBest(k=12, score_func=<function f_classif at 0x10e09b7d0>)), ('scaler', StandardScaler(co

In [27]:
trained_model_names = models.keys()

model_scores = [models[model].test_score for model in trained_model_names]

max_score = max(model_scores)

best_model = trained_model_names[model_scores.index(max_score)]

print best_model,max_score, '\n\n',models[best_model].classification_report

logistic_regression 0.882681564246 

             precision    recall  f1-score   support

          0       0.88      0.95      0.91       113
          1       0.89      0.77      0.83        66

avg / total       0.88      0.88      0.88       179



It turns out that no combination of scaling, feature selection, and classifier beats logistic regression without scaling, which still has the top test accuracy (~88%).
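For reference, this is roughly what a select_k_best step looks like in plain scikit-learn. The sketch uses synthetic data, mirrors the step order shown in the pipeline printout above (selection, then scaling, then the estimator), and uses a much smaller grid than the library's defaults to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, random_state=6)

pipeline = Pipeline(steps=[
    ('feature_selection', SelectKBest(score_func=f_classif)),
    ('scaler', StandardScaler()),
    ('estimator', KNeighborsClassifier()),
])

# k is tuned together with the estimator's own parameters
param_grid = {
    'feature_selection__k': [1, 2, 3, 4, 5, 6],
    'estimator__n_neighbors': [3, 7, 11],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_demo, y_demo)

best_k = search.best_params_['feature_selection__k']
kept = search.best_estimator_.named_steps['feature_selection'].get_support()
print(search.best_params_)
print(kept)
```

Because the selector sits inside the pipeline, the features are re-scored on each training fold, so the held-out fold never influences which features are kept.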

### Scaled, transformed, then classification
Setting the transform_type keyword argument adds a step that transforms the data into a new coordinate system determined by the chosen algorithm.

Currently, only principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) are supported.

#### PCA transformation

In [33]:
%%time 
reload(dsl)

# Set model names to iterate over
model_names = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']        

# Cross-validate each model
for model_name in model_names:
    models['scaled_pca_%s'%(model_name)] = dsl.train_model(X,y,use_default_param_dist=True,
                                                       random_state=6,
                                                        transform_type='pca',
                                                       scale_type = 'standard',
                                                       estimator=model_name)

Grid parameters:
estimator__n_neighbors : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
estimator__weights : ['uniform', 'distance']

Training set classification accuracy:  0.832865168539

Test set classification accuracy:  0.821229050279
Confusion matrix: 

 [[100  13]
 [ 19  47]]

Normalized confusion matrix: 

 [[ 0.55865922  0.0726257 ]
 [ 0.10614525  0.26256983]]

Classification report: 

              precision    recall  f1-score   support

          0       0.84      0.88      0.86       113
          1       0.78      0.71      0.75        66

avg / total       0.82      0.82      0.82       179


Best parameters:

{'estimator__n_neighbors': 7, 'estimator__weights': 'uniform'}

 Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('transform', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('estimator', K

In [34]:
trained_model_names = models.keys()

model_scores = [models[model].test_score for model in trained_model_names]

max_score = max(model_scores)

best_model = trained_model_names[model_scores.index(max_score)]

print best_model,max_score,'\n\n',models[best_model].classification_report

select_scaled_logistic_regression 0.882681564246 

             precision    recall  f1-score   support

          0       0.88      0.95      0.91       113
          1       0.89      0.77      0.83        66

avg / total       0.88      0.88      0.88       179



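Transformation with PCA doesn't improve our results so far: the best test accuracy is still ~88%, and the PCA KNN scores are identical to those of the scaled-only KNN run.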
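The PCA KNN run reports the same scores as the scaled-only KNN run (0.8329 train, 0.8212 test), which is expected. In the pipeline printout PCA runs with n_components=None and whiten=False, so it keeps every component and amounts to a rotation of the centred data. That leaves Euclidean distances, and therefore nearest neighbours, unchanged. A quick check on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(6)
X_demo = rng.normal(size=(50, 5))

# Keep all components, as in the pipeline printout (n_components=None)
X_rotated = PCA(n_components=None).fit_transform(X_demo)

def pairwise_distances(data):
    """Euclidean distance between every pair of rows."""
    diffs = data[:, None, :] - data[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

# Distances before and after the transformation agree up to rounding error
same = np.allclose(pairwise_distances(X_demo), pairwise_distances(X_rotated))
print(same)
```

PCA would only change a KNN result if components were dropped or whitening were enabled, for example by adding n_components to the grid.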

#### t-SNE

In [35]:
%%time 
reload(dsl)

# Set model names to iterate over
model_names = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']        

# Cross-validate each model
for model_name in model_names:
    models['scaled_t-sne_%s'%(model_name)] = dsl.train_model(X,y,use_default_param_dist=True,
                                                       random_state=6,
                                                        transform_type='t-sne',
                                                       scale_type = 'standard',
                                                       estimator=model_name)

Grid parameters:
estimator__n_neighbors : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
estimator__weights : ['uniform', 'distance']

Training set classification accuracy:  0.566011235955

Test set classification accuracy:  0.63687150838
Confusion matrix: 

 [[97 16]
 [49 17]]

Normalized confusion matrix: 

 [[ 0.54189944  0.08938547]
 [ 0.27374302  0.09497207]]

Classification report: 

              precision    recall  f1-score   support

          0       0.66      0.86      0.75       113
          1       0.52      0.26      0.34        66

avg / total       0.61      0.64      0.60       179


Best parameters:

{'estimator__n_neighbors': 4, 'estimator__weights': 'uniform'}

 Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('transform', pipeline_TSNE(angle=0.5, early_exaggeration=4.0, init='pca',
       learning_rate=1000.0, method='barnes_hut', metric='euclidean',
       min_

In [36]:
trained_model_names = models.keys()

model_scores = [models[model].test_score for model in trained_model_names]

max_score = max(model_scores)

best_model = trained_model_names[model_scores.index(max_score)]

print best_model,max_score,'\n\n',models[best_model].classification_report

select_scaled_logistic_regression 0.882681564246 

             precision    recall  f1-score   support

          0       0.88      0.95      0.91       113
          1       0.89      0.77      0.83        66

avg / total       0.88      0.88      0.88       179



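Logistic regression with scaling and feature selection is reported as the best model, but its test accuracy (0.882681564246) is identical to that of plain logistic regression, so the two are tied. t-SNE in particular does not help: the t-SNE KNN test accuracy (~64%) is barely above the ~62% null accuracy.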
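One possible reason for the weak t-SNE KNN result (~64% test accuracy), offered as a guess since the library's pipeline_TSNE wrapper is not shown: scikit-learn's TSNE only provides fit_transform() and has no transform() method, so it cannot project unseen rows into an embedding learned on the training set. PCA, by contrast, can. The check below only demonstrates that API difference:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA learns a projection that can be reused on new data
pca_can_transform = hasattr(PCA, 'transform')

# TSNE only embeds the data it was fitted on
tsne_can_fit_transform = hasattr(TSNE, 'fit_transform')
tsne_can_transform = hasattr(TSNE, 'transform')

print(pca_can_transform, tsne_can_fit_transform, tsne_can_transform)
```

Any wrapper that makes t-SNE usable inside a Pipeline has to work around this, for instance by embedding training and test rows separately, and the resulting coordinates need not be comparable.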