# Data Science Library demo
Here I demonstrate the data science library I developed to quickly build scikit-learn pipelines with optional scaling, feature interaction, and data transformation (e.g. PCA, t-SNE) steps. It runs the pipeline through a grid search (over all parameter combinations or a specified number of them) with k-fold cross-validation (stratified for classification) and outputs the best model.
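For orientation, the kind of pipeline described here can be sketched directly in scikit-learn. This is only a sketch of the general approach and not the library's actual code: the synthetic data from `make_classification`, the 20% held-out split, and the small hand-picked grid are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic set used in this demo)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=6, stratify=y_demo)

# Optional scaling step followed by the estimator
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('estimator', KNeighborsClassifier()),
])

# Parameter names are prefixed with the name of the pipeline step they belong to
param_grid = {
    'estimator__n_neighbors': [1, 5, 10, 20],
    'estimator__weights': ['uniform', 'distance'],
}

# Stratified 10-fold cross-validated grid search over all combinations
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=6)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='accuracy')
search.fit(X_train, y_train)

# Best pipeline, refit on the whole training split, scored on held-out data
best_pipeline = search.best_estimator_
test_accuracy = best_pipeline.score(X_test, y_test)
print(search.best_params_)
print(test_accuracy)
```

In scikit-learn, trying only a fixed number of combinations is what `RandomizedSearchCV` with its `n_iter` argument does; whether the library uses it internally is not shown here.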

## Titanic dataset
Here I use the Titanic dataset I've cleaned and pickled in a separate tutorial.

### Import data

In [1]:
import pandas as pd

df = pd.read_pickle('trimmed_titanic_data.pkl')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    891 non-null object
Title       891 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 62.7+ KB


By "cleaned" I mean I've derived titles (e.g. "Mr.", "Mrs.", "Dr.", etc) from the passenger names, imputed the missing Age values using polynomial regression with grid-searched 10-fold cross-validation, filled in the 3 missing Embarked values with the mode, and removed all fields that could be considered an id for that individual.

Thus, there is no missing data.

### Set categorical features to the category dtype

In [2]:
simulation_df = df.copy()

categorical_features = ['Survived','Pclass','Sex','Embarked','Title']

for feature in categorical_features:
    simulation_df[feature] = simulation_df[feature].astype('category')
    
simulation_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Survived    891 non-null category
Pclass      891 non-null category
Sex         891 non-null category
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    891 non-null category
Title       891 non-null category
dtypes: category(5), float64(2), int64(2)
memory usage: 32.4 KB


### One-hot encode categorical features

In [3]:
simulation_df = pd.get_dummies(simulation_df,drop_first=True)

simulation_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 17 columns):
Age               891 non-null float64
SibSp             891 non-null int64
Parch             891 non-null int64
Fare              891 non-null float64
Survived_1        891 non-null uint8
Pclass_2          891 non-null uint8
Pclass_3          891 non-null uint8
Sex_male          891 non-null uint8
Embarked_Q        891 non-null uint8
Embarked_S        891 non-null uint8
Title_Dr          891 non-null uint8
Title_Military    891 non-null uint8
Title_Miss        891 non-null uint8
Title_Mr          891 non-null uint8
Title_Mrs         891 non-null uint8
Title_Noble       891 non-null uint8
Title_Rev         891 non-null uint8
dtypes: float64(2), int64(2), uint8(13)
memory usage: 39.2 KB


Now we have 17 columns: 16 input features plus the Survived_1 output.
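The effect of `drop_first=True` is easier to see on a tiny frame. The values below are made up for illustration; only the `pd.get_dummies` call matches the cell above.

```python
import pandas as pd

# Toy frame with hypothetical values, only to show the encoding
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Sex': pd.Series(['male', 'female', 'female'], dtype='category'),
    'Embarked': pd.Series(['S', 'C', 'Q'], dtype='category'),
})

encoded = pd.get_dummies(toy, drop_first=True)

# Categories sort alphabetically, so 'female' and 'C' become the dropped baselines
print(list(encoded.columns))
```

Each categorical variable with n levels yields n-1 indicator columns, which is why Sex contributes only Sex_male and Embarked only Embarked_Q and Embarked_S.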

### Split into input/output data

In [4]:
# Set output feature
output_feature = 'Survived_1'

# Get all column names
column_names = list(simulation_df.columns)

# Every column except the output is an input feature (drop_first=True has already
# removed one redundant level from each categorical variable)
input_features = [x for x in column_names if x != output_feature]

# Split into features and responses
X = simulation_df[input_features].copy()
y = simulation_df[output_feature].copy()

### Null model

In [5]:
simulation_df['Survived_1'].value_counts().values/float(simulation_df['Survived_1'].value_counts().values.sum())

array([ 0.61616162,  0.38383838])

Thus, the null accuracy is ~62%, obtained by always predicting death.
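The same number can be reproduced from raw class counts. The counts used below (549 deaths, 342 survivors) are the ones implied by the proportions above for 891 passengers.

```python
from collections import Counter

# Class counts implied by the proportions above: 549 died (0), 342 survived (1)
labels = [0] * 549 + [1] * 342

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
null_accuracy = majority_count / float(len(labels))
print(majority_class)
print(round(null_accuracy, 4))
```

Any trained model has to beat this baseline to be worth keeping.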

### Import data science library and initialize model collections

In [6]:
import data_science_lib as dsl

models = {}



### Basic models with no pre-processing
#### KNN
Here I do a simple K-nearest neighbors (KNN) classification with 10-fold (default) cross-validation with a grid search over the default of 1 to 30 nearest neighbors and the use of either "uniform" or "distance" weights:

In [7]:
%%time 
reload(dsl)
        
# Figure out best model
models['knn'] = dsl.train_model(X,y,
                                use_default_param_dist=True,
                                random_state=6,
                                suppress_output=False, # Can suppress print outs if desired
                                estimator='knn',) 

Grid parameters:
estimator__n_neighbors : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
estimator__weights : ['uniform', 'distance']

Training set classification accuracy:  0.733146067416

Test set classification accuracy:  0.737430167598
Confusion matrix: 

 [[95 18]
 [29 37]]

Normalized confusion matrix: 

 [[ 0.53072626  0.10055866]
 [ 0.16201117  0.20670391]]

Classification report: 

              precision    recall  f1-score   support

          0       0.77      0.84      0.80       113
          1       0.67      0.56      0.61        66

avg / total       0.73      0.74      0.73       179


Best parameters:

{'estimator__n_neighbors': 24, 'estimator__weights': 'distance'}

 Pipeline(steps=[('estimator', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=24, p=2,
           weights='distance'))])
CPU times: user 2.1 s, sys: 116 ms, tota

It turns out that the best settings are 24 neighbors and the 'distance' weight.

Note how I've set the random_state to 6 so that the models can be compared using the same test/train split.
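Why a fixed seed yields a shared split can be seen in a small pure-Python sketch. `seeded_split` is a hypothetical helper, not part of the library or of scikit-learn, and the 20% test fraction is an assumption (it is consistent with the 179 test rows reported above).

```python
import math
import random

def seeded_split(n_rows, test_fraction, seed):
    """Return (train_idx, test_idx) from a shuffle driven only by `seed`."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(math.ceil(n_rows * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])

train_a, test_a = seeded_split(891, 0.2, seed=6)
train_b, test_b = seeded_split(891, 0.2, seed=6)
train_c, test_c = seeded_split(891, 0.2, seed=7)

print(test_a == test_b)  # same seed, same held-out rows
print(test_a == test_c)  # a different seed gives a different split
print(len(test_a))
```

Because every model sees the same held-out rows, differences in test score reflect the models rather than the luck of the split.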

The output of the train_model() method is a Pipeline object with the optimal parameters found during the grid search and trained on all data.

Additional attributes have been added to the pipeline object: the type of score ('regression' or 'classification'), '.score_type'; the training score (L2 norm for regression, classification accuracy for classification), '.train_score'; the corresponding test score, '.test_score'; and the best grid-search parameters, '.best_parameters'.

For classification problems, the additional attributes also include the confusion matrix, '.confusion_matrix', the normalized confusion matrix, '.normalized_confusion_matrix', and the classification report, '.classification_report'.

Here are the values of these additional attributes:

In [8]:
pipeline = models['knn']

# Score type and training/test scores
print(pipeline.score_type)

print(pipeline.train_score)

print(pipeline.test_score)

# Best parameters
print(pipeline.best_parameters)

# Confusion matrix, if classification
print(pipeline.confusion_matrix) 

# Confusion matrix divided by its total sum
print(pipeline.normalized_confusion_matrix)

# Classification report
print(pipeline.classification_report)

classification
0.733146067416
0.737430167598
{'estimator__n_neighbors': 24, 'estimator__weights': 'distance'}
[[95 18]
 [29 37]]
[[ 0.53072626  0.10055866]
 [ 0.16201117  0.20670391]]
             precision    recall  f1-score   support

          0       0.77      0.84      0.80       113
          1       0.67      0.56      0.61        66

avg / total       0.73      0.74      0.73       179
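The reported scores can be re-derived by hand from the confusion matrix, which is a useful sanity check. The sketch below uses the KNN matrix printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix from the KNN run above: rows are true classes, columns are predictions
confusion = [[95, 18],
             [29, 37]]

total = float(sum(sum(row) for row in confusion))

# Accuracy is the diagonal divided by the total
accuracy = (confusion[0][0] + confusion[1][1]) / total

# Normalized confusion matrix: every cell divided by the total
normalized = [[cell / total for cell in row] for row in confusion]

# Precision and recall for class 1 (survived)
precision_1 = confusion[1][1] / float(confusion[0][1] + confusion[1][1])
recall_1 = confusion[1][1] / float(confusion[1][0] + confusion[1][1])

print(round(accuracy, 4))     # 0.7374
print(round(precision_1, 2))  # 0.67
print(round(recall_1, 2))     # 0.56
```

These match the test accuracy and the class 1 row of the classification report.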



The printout from the first KNN run above shows that the default grid covers n_neighbors from 1 to 30 and the weights parameter as either 'uniform' or 'distance'.
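To make the size of that search concrete, the grid can be enumerated with the standard library. This is a sketch of what "all combinations" versus "a specific number of them" means, not the library's implementation; the sample size of 10 is arbitrary.

```python
import itertools
import random

# Default KNN grid reported in the printout above
param_grid = {
    'estimator__n_neighbors': list(range(1, 31)),
    'estimator__weights': ['uniform', 'distance'],
}

# Every combination of parameter values (exhaustive grid search)
names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in itertools.product(*(param_grid[name] for name in names))]
print(len(combinations))  # 30 * 2 = 60

# A fixed number of randomly chosen combinations (randomized search);
# the sample size of 10 is an arbitrary choice for illustration
sampled = random.Random(6).sample(combinations, 10)
print(len(sampled))
```

With 10-fold cross-validation, the exhaustive search fits 60 x 10 = 600 models before the final refit.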

This can be changed in two ways. The first is to overwrite individual default values by passing the param_dist keyword argument while keeping use_default_param_dist set to True:

In [9]:
%%time 
reload(dsl)

# Set custom parameters
param_dist = {
    'estimator__n_neighbors': range(30,500)
}

# Figure out best model
models['custom_overwrite_knn'] = dsl.train_model(X,y,
                                       use_default_param_dist=True,
                                       random_state=6,
                                       suppress_output=False, # Can suppress print outs if desired
                                       estimator='knn',
                                       param_dist = param_dist) 

Grid parameters:
estimator__n_neighbors : [30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 23

We were able to get a slightly better accuracy doing this.

The second way is to pass param_dist with use_default_param_dist set to False. The defaults are then discarded, so every parameter you want searched must be set manually.

Here's an example:

In [10]:
%%time 
reload(dsl)

# Set custom parameters
param_dist = {
    'estimator__n_neighbors': range(30,500)
}

# Figure out best model
models['from_scratch_knn'] = dsl.train_model(X,y,
                                       use_default_param_dist=False,
                                       random_state=6,
                                       suppress_output=False, # Can suppress print outs if desired
                                       estimator='knn',
                                       param_dist = param_dist) 

Grid parameters:
estimator__n_neighbors : [30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 23

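Note how the estimator__weights parameter no longer appears in the grid: with use_default_param_dist set to False, only the parameters given in param_dist are searched, and the KNN estimator keeps its own default for everything else.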
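The difference between the two modes amounts to a dictionary merge. `build_param_grid` below is a hypothetical helper written to mirror the behaviour described here; it is not the library's code.

```python
def build_param_grid(defaults, custom=None, use_default_param_dist=True):
    """Hypothetical helper mirroring the two behaviours described above."""
    custom = custom or {}
    if use_default_param_dist:
        # Start from the defaults and overwrite only the keys supplied
        merged = dict(defaults)
        merged.update(custom)
        return merged
    # Ignore the defaults entirely: only the supplied keys are searched
    return dict(custom)

knn_defaults = {
    'estimator__n_neighbors': list(range(1, 31)),
    'estimator__weights': ['uniform', 'distance'],
}
custom = {'estimator__n_neighbors': list(range(30, 500))}

overwrite = build_param_grid(knn_defaults, custom, use_default_param_dist=True)
from_scratch = build_param_grid(knn_defaults, custom, use_default_param_dist=False)

print(sorted(overwrite))     # both keys; n_neighbors comes from `custom`
print(sorted(from_scratch))  # only 'estimator__n_neighbors'
```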

#### Other models
The library currently supports K-nearest neighbors, logistic regression, support vector machines, multilayer perceptrons, random forest, and AdaBoost.

We can loop through and pick the best model like this:

In [11]:
%%time 
reload(dsl)

# Set model names to iterate over
model_names = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']        

# Cross-validate each model
for model_name in model_names:
    models[model_name] = dsl.train_model(X,y,
                                    use_default_param_dist=True,
                                    random_state=6,
                                    estimator=model_name)

Grid parameters:
estimator__n_neighbors : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
estimator__weights : ['uniform', 'distance']

Training set classification accuracy:  0.733146067416

Test set classification accuracy:  0.737430167598
Confusion matrix: 

 [[95 18]
 [29 37]]

Normalized confusion matrix: 

 [[ 0.53072626  0.10055866]
 [ 0.16201117  0.20670391]]

Classification report: 

              precision    recall  f1-score   support

          0       0.77      0.84      0.80       113
          1       0.67      0.56      0.61        66

avg / total       0.73      0.74      0.73       179


Best parameters:

{'estimator__n_neighbors': 24, 'estimator__weights': 'distance'}

 Pipeline(steps=[('estimator', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=24, p=2,
           weights='distance'))])
Grid parameters:
estimator__C : [  1.000

In [12]:
trained_model_names = models.keys()

model_scores = [models[model].test_score for model in trained_model_names]

max_score = max(model_scores)

best_model = trained_model_names[model_scores.index(max_score)]

print best_model,max_score, '\n\n',models[best_model].classification_report

logistic_regression 0.882681564246 

             precision    recall  f1-score   support

          0       0.88      0.95      0.91       113
          1       0.89      0.77      0.83        66

avg / total       0.88      0.88      0.88       179



It turns out that the best model is logistic regression with a classification accuracy of ~88%.
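The selection step can also be written with `max` and a key function, which avoids indexing into the result of `keys()`. The objects below are stand-ins that carry only a `test_score` attribute; the two scores are the ones reported above for these models.

```python
from types import SimpleNamespace

# Stand-ins for trained pipelines; only the test_score attribute matters here.
# The scores are the ones reported above for these two models.
models_demo = {
    'knn': SimpleNamespace(test_score=0.737430167598),
    'logistic_regression': SimpleNamespace(test_score=0.882681564246),
}

# Pick the key whose pipeline has the highest test score
best_name = max(models_demo, key=lambda name: models_demo[name].test_score)
print(best_name)
print(models_demo[best_name].test_score)
```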

### Scaled data then classification
We can specify the scale_type keyword argument to scale the data before being fed to the desired estimator. Currently only standard scaling is supported:

In [13]:
%%time 
reload(dsl)

# Set model names to iterate over
model_names = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']        

# Cross-validate each model
for model_name in model_names:
    models['scaled_%s'%(model_name)] = dsl.train_model(X,y,
                                                       use_default_param_dist=True,
                                                       random_state=6,
                                                       scale_type = 'standard',
                                                       estimator=model_name)

Grid parameters:
estimator__n_neighbors : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
estimator__weights : ['uniform', 'distance']

Training set classification accuracy:  0.823033707865

Test set classification accuracy:  0.821229050279
Confusion matrix: 

 [[104   9]
 [ 23  43]]

Normalized confusion matrix: 

 [[ 0.58100559  0.05027933]
 [ 0.12849162  0.24022346]]

Classification report: 

              precision    recall  f1-score   support

          0       0.82      0.92      0.87       113
          1       0.83      0.65      0.73        66

avg / total       0.82      0.82      0.82       179


Best parameters:

{'estimator__n_neighbors': 6, 'estimator__weights': 'uniform'}

 Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('estimator', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=6, p=2,
   

In [14]:
trained_model_names = models.keys()

model_scores = [models[model].test_score for model in trained_model_names]

max_score = max(model_scores)

best_model = trained_model_names[model_scores.index(max_score)]

print best_model,max_score, '\n\n',models[best_model].classification_report

logistic_regression 0.882681564246 

             precision    recall  f1-score   support

          0       0.88      0.95      0.91       113
          1       0.89      0.77      0.83        66

avg / total       0.88      0.88      0.88       179
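Standard scaling replaces each feature with its z-score, so that features on large scales do not dominate distance-based estimators. That is one plausible reason the KNN test accuracy rose from ~74% to ~82% once scaling was added. The sketch below uses made-up values and the population standard deviation, which is what scikit-learn's StandardScaler uses.

```python
import math

def standardize(values):
    """Return z-scores using the population standard deviation."""
    mean = sum(values) / float(len(values))
    variance = sum((v - mean) ** 2 for v in values) / float(len(values))
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

# Hypothetical values on very different scales, like Fare versus SibSp
fare = [7.25, 71.28, 8.05, 53.10]
sibsp = [1, 1, 0, 1]

print([round(z, 2) for z in standardize(fare)])
print([round(z, 2) for z in standardize(sibsp)])
```

After scaling, both features have mean 0 and unit variance, so each contributes comparably to a Euclidean distance.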



### Feature selection, scaling, and classification
The feature_selection_type keyword argument can be used to select the best features:

In [15]:
%%time 
reload(dsl)

# Set model names to iterate over
model_names = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']        

# Cross-validate each model
for model_name in model_names:
    models['select_scaled_%s'%(model_name)] = dsl.train_model(X,y,
                                                       use_default_param_dist=True,
                                                       random_state=6,
                                                       scale_type = 'standard',
                                                       feature_selection_type = 'select_k_best', 
                                                       estimator=model_name)

Grid parameters:
estimator__n_neighbors : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
feature_selection__k : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
estimator__weights : ['uniform', 'distance']

Training set classification accuracy:  0.827247191011

Test set classification accuracy:  0.815642458101
Confusion matrix: 

 [[103  10]
 [ 23  43]]

Normalized confusion matrix: 

 [[ 0.57541899  0.05586592]
 [ 0.12849162  0.24022346]]

Classification report: 

              precision    recall  f1-score   support

          0       0.82      0.91      0.86       113
          1       0.81      0.65      0.72        66

avg / total       0.82      0.82      0.81       179


Best parameters:

{'estimator__n_neighbors': 4, 'feature_selection__k': 13, 'estimator__weights': 'uniform'}

 Pipeline(steps=[('feature_selection', SelectKBest(k=13, score_func=<function f_classif at 0x1144d2758>)), ('scaler', StandardScale

In [16]:
trained_model_names = models.keys()

model_scores = [models[model].test_score for model in trained_model_names]

max_score = max(model_scores)

best_model = trained_model_names[model_scores.index(max_score)]

print best_model,max_score, '\n\n',models[best_model].classification_report

logistic_regression 0.882681564246 

             precision    recall  f1-score   support

          0       0.88      0.95      0.91       113
          1       0.89      0.77      0.83        66

avg / total       0.88      0.88      0.88       179



It turns out that logistic regression without scaling or feature selection still outperforms every combination of scaling, feature selection, and classifier tried so far.
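When a selection step sits in front of the estimator, the fitted coefficients refer only to the features that were kept. The sketch below shows this with scikit-learn's SelectKBest on synthetic data; the feature names, the choice of k=4, and the data are assumptions, while the step order follows the pipeline printed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical feature names
X_demo, y_demo = make_classification(n_samples=200, n_features=6, n_informative=3,
                                     random_state=6)
feature_names = np.array(['f0', 'f1', 'f2', 'f3', 'f4', 'f5'])

pipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=f_classif, k=4)),
    ('scaler', StandardScaler()),
    ('estimator', LogisticRegression()),
])
pipeline.fit(X_demo, y_demo)

# Boolean mask of the columns that survived selection
mask = pipeline.named_steps['feature_selection'].get_support()
coefficients = pipeline.named_steps['estimator'].coef_[0]

# Only the kept features have coefficients, so pair them through the mask
for name, coefficient in zip(feature_names[mask], coefficients):
    print(name, round(coefficient, 3))
```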

### Scaled, transformed, then classification
Setting the transform_type keyword argument transforms the data into a new coordinate system determined by the chosen algorithm.

Currently, only principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) are supported.

#### PCA transformation

In [17]:
%%time 
reload(dsl)

# Set model names to iterate over
model_names = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']        

# Cross-validate each model
for model_name in model_names:
    models['scaled_pca_%s'%(model_name)] = dsl.train_model(X,y,use_default_param_dist=True,
                                                       random_state=6,
                                                        transform_type='pca',
                                                       scale_type = 'standard',
                                                       estimator=model_name)

Grid parameters:
estimator__n_neighbors : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
estimator__weights : ['uniform', 'distance']

Training set classification accuracy:  0.823033707865

Test set classification accuracy:  0.821229050279
Confusion matrix: 

 [[104   9]
 [ 23  43]]

Normalized confusion matrix: 

 [[ 0.58100559  0.05027933]
 [ 0.12849162  0.24022346]]

Classification report: 

              precision    recall  f1-score   support

          0       0.82      0.92      0.87       113
          1       0.83      0.65      0.73        66

avg / total       0.82      0.82      0.82       179


Best parameters:

{'estimator__n_neighbors': 6, 'estimator__weights': 'uniform'}

 Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('transform', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('estimator', K

In [21]:
trained_model_names = models.keys()

model_scores = [models[model].test_score for model in trained_model_names]

max_score = max(model_scores)

best_model = trained_model_names[model_scores.index(max_score)]

print best_model,max_score,'\n\n',models[best_model].classification_report

logistic_regression 0.882681564246 

             precision    recall  f1-score   support

          0       0.88      0.95      0.91       113
          1       0.89      0.77      0.83        66

avg / total       0.88      0.88      0.88       179



Transformation with PCA doesn't appear to improve our results so far.
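One likely reason is visible in the printed pipeline: the PCA step has n_components=None, which in scikit-learn keeps every component. PCA is then only a rotation of the centered data, and a rotation leaves Euclidean distances unchanged, which is consistent with the scaled KNN and scaled PCA KNN runs reporting identical scores and best parameters. The sketch below checks the distance claim with NumPy on random data; it performs PCA through an SVD rather than through scikit-learn.

```python
import numpy as np

rng = np.random.RandomState(6)
X_demo = rng.normal(size=(50, 4))  # hypothetical already-scaled data

# PCA via SVD of the centered data, keeping every component
X_centered = X_demo - X_demo.mean(axis=0)
_, _, components = np.linalg.svd(X_centered, full_matrices=False)
X_rotated = X_centered.dot(components.T)

def pairwise_distances(A):
    """Euclidean distance between every pair of rows."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# The rotation leaves all pairwise distances unchanged
distances_match = np.allclose(pairwise_distances(X_centered),
                              pairwise_distances(X_rotated))
print(distances_match)  # True
```

PCA would only change the picture for distance-based estimators if some components were dropped.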

#### t-SNE transformation

In [22]:
%%time 
reload(dsl)

# Set model names to iterate over
model_names = ['knn','logistic_regression','svm','multilayer_perceptron','random_forest','adaboost']        

# Cross-validate each model
for model_name in model_names:
    models['scaled_t-sne_%s'%(model_name)] = dsl.train_model(X,y,use_default_param_dist=True,
                                                       random_state=6,
                                                        transform_type='t-sne',
                                                       scale_type = 'standard',
                                                       estimator=model_name)

Grid parameters:
estimator__n_neighbors : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
estimator__weights : ['uniform', 'distance']

Training set classification accuracy:  0.566011235955

Test set classification accuracy:  0.703910614525
Confusion matrix: 

 [[85 28]
 [25 41]]

Normalized confusion matrix: 

 [[ 0.47486034  0.15642458]
 [ 0.1396648   0.22905028]]

Classification report: 

              precision    recall  f1-score   support

          0       0.77      0.75      0.76       113
          1       0.59      0.62      0.61        66

avg / total       0.71      0.70      0.71       179


Best parameters:

{'estimator__n_neighbors': 16, 'estimator__weights': 'uniform'}

 Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('transform', pipeline_TSNE(angle=0.5, early_exaggeration=4.0, init='pca',
       learning_rate=1000.0, method='barnes_hut', metric='euclidean',
       mi

In [23]:
trained_model_names = models.keys()

model_scores = [models[model].test_score for model in trained_model_names]

max_score = max(model_scores)

best_model = trained_model_names[model_scores.index(max_score)]

print best_model,max_score,'\n\n',models[best_model].classification_report

logistic_regression 0.882681564246 

             precision    recall  f1-score   support

          0       0.88      0.95      0.91       113
          1       0.89      0.77      0.83        66

avg / total       0.88      0.88      0.88       179



Plain logistic regression, with no scaling, feature selection, or transformation, still outperforms all other scenarios so far.

Let's look at the best model's properties:

In [24]:
print 'Best model:','logistic_regression','\n'
print 'Training score:\t\t', models['logistic_regression'].train_score
print 'Test score:\t\t', models['logistic_regression'].test_score,'\n'
print models['logistic_regression'].classification_report

print 'Confusion matrix:\n\n',models['logistic_regression'].confusion_matrix,'\n'
print 'Normalized confusion matrix:\n\n',models['logistic_regression'].normalized_confusion_matrix,'\n'
print 'Best parameters:\n', models['logistic_regression'].best_parameters

Best model: logistic_regression 

Training score:		0.818820224719
Test score:		0.882681564246 

             precision    recall  f1-score   support

          0       0.88      0.95      0.91       113
          1       0.89      0.77      0.83        66

avg / total       0.88      0.88      0.88       179

Confusion matrix:

[[107   6]
 [ 15  51]] 

Normalized confusion matrix:

[[ 0.59776536  0.03351955]
 [ 0.08379888  0.2849162 ]] 

Best parameters:
{'estimator__C': 100000.0}


In [90]:
dir(models['logistic_regression'].steps[0][1])

['C',
 '__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__getstate__',
 '__hash__',
 '__init__',
 '__module__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_estimator_type',
 '_get_param_names',
 '_predict_proba_lr',
 'class_weight',
 'classes_',
 'coef_',
 'decision_function',
 'densify',
 'dual',
 'fit',
 'fit_intercept',
 'fit_transform',
 'get_params',
 'intercept_',
 'intercept_scaling',
 'max_iter',
 'multi_class',
 'n_iter_',
 'n_jobs',
 'penalty',
 'predict',
 'predict_log_proba',
 'predict_proba',
 'random_state',
 'score',
 'set_params',
 'solver',
 'sparsify',
 'tol',
 'transform',
 'verbose',
 'warm_start']

In [26]:
import numpy as np
np.logspace(-10,10,100)

array([  1.00000000e-10,   1.59228279e-10,   2.53536449e-10,
         4.03701726e-10,   6.42807312e-10,   1.02353102e-09,
         1.62975083e-09,   2.59502421e-09,   4.13201240e-09,
         6.57933225e-09,   1.04761575e-08,   1.66810054e-08,
         2.65608778e-08,   4.22924287e-08,   6.73415066e-08,
         1.07226722e-07,   1.70735265e-07,   2.71858824e-07,
         4.32876128e-07,   6.89261210e-07,   1.09749877e-06,
         1.74752840e-06,   2.78255940e-06,   4.43062146e-06,
         7.05480231e-06,   1.12332403e-05,   1.78864953e-05,
         2.84803587e-05,   4.53487851e-05,   7.22080902e-05,
         1.14975700e-04,   1.83073828e-04,   2.91505306e-04,
         4.64158883e-04,   7.39072203e-04,   1.17681195e-03,
         1.87381742e-03,   2.98364724e-03,   4.75081016e-03,
         7.56463328e-03,   1.20450354e-02,   1.91791026e-02,
         3.05385551e-02,   4.86260158e-02,   7.74263683e-02,
         1.23284674e-01,   1.96304065e-01,   3.12571585e-01,
         4.97702356e-01,

In [27]:
%%time 
reload(dsl)
        
param_dist = {
    'estimator__C': np.logspace(-10,10,100),
    'estimator__penalty': ['l2','l1']
}
    
# Figure out best model
models['custom_logistic_regression'] = dsl.train_model(X,y,
                                use_default_param_dist=True,
                                random_state=6,
                                suppress_output=False, # Can suppress print outs if desired
                                estimator='logistic_regression',
                               param_dist=param_dist) 

Grid parameters:
estimator__penalty : ['l2', 'l1']
estimator__C : [  1.00000000e-10   1.59228279e-10   2.53536449e-10   4.03701726e-10
   6.42807312e-10   1.02353102e-09   1.62975083e-09   2.59502421e-09
   4.13201240e-09   6.57933225e-09   1.04761575e-08   1.66810054e-08
   2.65608778e-08   4.22924287e-08   6.73415066e-08   1.07226722e-07
   1.70735265e-07   2.71858824e-07   4.32876128e-07   6.89261210e-07
   1.09749877e-06   1.74752840e-06   2.78255940e-06   4.43062146e-06
   7.05480231e-06   1.12332403e-05   1.78864953e-05   2.84803587e-05
   4.53487851e-05   7.22080902e-05   1.14975700e-04   1.83073828e-04
   2.91505306e-04   4.64158883e-04   7.39072203e-04   1.17681195e-03
   1.87381742e-03   2.98364724e-03   4.75081016e-03   7.56463328e-03
   1.20450354e-02   1.91791026e-02   3.05385551e-02   4.86260158e-02
   7.74263683e-02   1.23284674e-01   1.96304065e-01   3.12571585e-01
   4.97702356e-01   7.92482898e-01   1.26185688e+00   2.00923300e+00
   3.19926714e+00   5.09413801e+00   

In [69]:
model_coefficients = models['custom_logistic_regression'].steps[0][1].coef_[0]

feature_ind = 0
indices = []
for feature in simulation_df.columns:
    if feature != 'Survived_1':
        print feature, '\t',model_coefficients[feature_ind]
        
        feature_ind += 1
        
model_df = pd.Series(model_coefficients,index=[feature for feature in simulation_df.columns if feature != 'Survived_1'])

Age 	-0.0393259488071
SibSp 	-0.574788721716
Parch 	-0.356123712301
Fare 	0.00342610855198
Pclass_2 	-1.16243960759
Pclass_3 	-2.32359915558
Sex_male 	-2.08805882672
Embarked_Q 	0.0140621138536
Embarked_S 	-0.358487135498
Title_Dr 	-2.58436225926
Title_Military 	-2.20878843244
Title_Miss 	-2.35635331024
Title_Mr 	-3.05281011203
Title_Mrs 	-1.35803888823
Title_Noble 	-2.70897267991
Title_Rev 	-4.92165404335


In [87]:
models.keys()

['select_scaled_random_forest',
 'select_scaled_logistic_regression',
 'scaled_t-sne_logistic_regression',
 'scaled_pca_adaboost',
 'logistic_regression',
 'custom_logistic_regression',
 'scaled_t-sne_multilayer_perceptron',
 'scaled_random_forest',
 'scaled_knn',
 'scaled_pca_svm',
 'scaled_pca_random_forest',
 'scaled_t-sne_knn',
 'custom_overwrite_knn',
 'multilayer_perceptron',
 'scaled_t-sne_svm',
 'scaled_adaboost',
 'scaled_pca_logistic_regression',
 'select_scaled_adaboost',
 'select_scaled_svm',
 'select_scaled_knn',
 'scaled_t-sne_adaboost',
 'scaled_pca_multilayer_perceptron',
 'scaled_logistic_regression',
 'knn',
 'svm',
 'from_scratch_knn',
 'scaled_svm',
 'select_scaled_multilayer_perceptron',
 'scaled_t-sne_random_forest',
 'adaboost',
 'random_forest',
 'scaled_multilayer_perceptron',
 'scaled_pca_knn']

In [70]:
model_df.sort_values()

Title_Rev        -4.921654
Title_Mr         -3.052810
Title_Noble      -2.708973
Title_Dr         -2.584362
Title_Miss       -2.356353
Pclass_3         -2.323599
Title_Military   -2.208788
Sex_male         -2.088059
Title_Mrs        -1.358039
Pclass_2         -1.162440
SibSp            -0.574789
Embarked_S       -0.358487
Parch            -0.356124
Age              -0.039326
Fare              0.003426
Embarked_Q        0.014062
dtype: float64

Looks like being a Reverend had a very negative effect on survival.

In [86]:
df['Survived'][df['Title']=='Dr'].value_counts()

0    4
1    3
Name: Survived, dtype: int64

Of the seven passengers with the title "Dr", four died and three survived. By comparison, there were six reverends and all died.

The next most negative coefficient belongs to the title "Mr" (-3.05), which is larger in magnitude than the coefficient for simply being male (-2.09).
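Logistic regression coefficients are easier to compare as odds ratios. Exponentiating a coefficient gives the factor by which the odds of survival are multiplied for a one-unit increase in that feature, with the others held fixed. The sketch below uses a few of the coefficients from the sorted output above.

```python
import math

# Coefficients copied from the sorted output above
coefficients = {
    'Title_Rev': -4.921654,
    'Title_Mr': -3.052810,
    'Sex_male': -2.088059,
    'Fare': 0.003426,
}

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda item: item[1]):
    print('{0}\t{1:.4f}'.format(name, ratio))
```

An odds ratio below 1 lowers the odds of survival and one above 1 raises them, so Fare is the only feature in this subset that pushes towards survival.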

In [78]:
df[df['Title']=='Noble']

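|  | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|
| 30 | 0 | 1 | male | 40.0 | 0 | 0 | 27.7208 | C | Noble |
| 556 | 1 | 1 | female | 48.0 | 1 | 0 | 39.6 | C | Noble |
| 599 | 1 | 1 | male | 49.0 | 1 | 0 | 56.9292 | C | Noble |
| 759 | 1 | 1 | female | 33.0 | 0 | 0 | 86.5 | S | Noble |
| 822 | 0 | 1 | male | 38.0 | 0 | 0 | 0.0 | S | Noble |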


In [76]:
df[(df['Sex']=='male')&(~df['Title'].isin(['Mr','Dr','Rev']))]

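|  | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|
| 7 | 0 | 3 | male | 2.0 | 3 | 1 | 21.075 | S | Child |
| 16 | 0 | 3 | male | 2.0 | 4 | 1 | 29.125 | Q | Child |
| 30 | 0 | 1 | male | 40.0 | 0 | 0 | 27.7208 | C | Noble |
| 50 | 0 | 3 | male | 7.0 | 4 | 1 | 39.6875 | S | Child |
| 59 | 0 | 3 | male | 11.0 | 5 | 2 | 46.9 | S | Child |
| 63 | 0 | 3 | male | 4.0 | 3 | 2 | 27.9 | S | Child |
| 65 | 1 | 3 | male | 1.248543 | 1 | 1 | 15.2458 | C | Child |
| 78 | 1 | 2 | male | 0.83 | 0 | 2 | 29.0 | S | Child |
| 125 | 1 | 3 | male | 12.0 | 1 | 0 | 11.2417 | C | Child |
| 159 | 0 | 3 | male | -10.035641 | 8 | 2 | 69.55 | S | Child |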


In [66]:
simulation_df[simulation_df['Title_Military']==1]

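|  | Age | SibSp | Parch | Fare | Survived_1 | Pclass_2 | Pclass_3 | Sex_male | Embarked_Q | Embarked_S | Title_Dr | Title_Military | Title_Miss | Title_Mr | Title_Mrs | Title_Noble | Title_Rev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 449 | 52.0 | 0 | 0 | 30.5 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 536 | 45.0 | 0 | 0 | 26.55 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 647 | 56.0 | 0 | 0 | 35.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 694 | 60.0 | 0 | 0 | 26.55 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 745 | 70.0 | 1 | 1 | 71.0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |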


In [37]:
print models['select_scaled_logistic_regression'].steps[2][1].coef_[0].shape
print X.shape

(14,)
(891, 16)


In [55]:
feature_mask = models['select_scaled_logistic_regression'].steps[0][1].get_support()
input_features = [feature for feature in simulation_df.columns if feature != 'Survived_1']

model_coefficients = models['select_scaled_logistic_regression'].steps[2][1].coef_[0]

feature_ind = 0
for feature in np.array(input_features)[feature_mask]:
    print feature, '\t',model_coefficients[feature_ind]
        
    feature_ind += 1

Age 	-0.614482601163
SibSp 	-0.612587885195
Parch 	-0.283176325042
Fare 	0.168724602965
Pclass_2 	-0.461866528278
Pclass_3 	-1.15687512926
Sex_male 	-4.11242073907
Embarked_S 	-0.173196484322
Title_Dr 	-0.209834891899
Title_Miss 	-3.48708462491
Title_Mr 	-1.32195791696
Title_Mrs 	-2.6056826261
Title_Noble 	-0.208062438608
Title_Rev 	-0.957725925523


In [28]:
for x in models['select_scaled_logistic_regression'].steps[2][1].coef_[0]:
    print x

print simulation_df.columns

-0.614482601163
-0.612587885195
-0.283176325042
0.168724602965
-0.461866528278
-1.15687512926
-4.11242073907
-0.173196484322
-0.209834891899
-3.48708462491
-1.32195791696
-2.6056826261
-0.208062438608
-0.957725925523
Index([u'Age', u'SibSp', u'Parch', u'Fare', u'Survived_1', u'Pclass_2',
       u'Pclass_3', u'Sex_male', u'Embarked_Q', u'Embarked_S', u'Title_Dr',
       u'Title_Military', u'Title_Miss', u'Title_Mr', u'Title_Mrs',
       u'Title_Noble', u'Title_Rev'],
      dtype='object')


In [31]:
model_coefficients = models['select_scaled_logistic_regression'].steps[2][1].coef_[0]

feature_ind = 0
for feature in simulation_df.columns:
    if feature != 'Survived_1': 
        print feature, '\t',model_coefficients[feature_ind]
        
        feature_ind += 1
        
print model_coefficients[feature_ind]

Age 	-0.614482601163
SibSp 	-0.612587885195
Parch 	-0.283176325042
Fare 	0.168724602965
Pclass_2 	-0.461866528278
Pclass_3 	-1.15687512926
Sex_male 	-4.11242073907
Embarked_Q 	-0.173196484322
Embarked_S 	-0.209834891899
Title_Dr 	-3.48708462491
Title_Military 	-1.32195791696
Title_Miss 	-2.6056826261
Title_Mr 	-0.208062438608
Title_Mrs 	-0.957725925523
Title_Noble 	

IndexError: index 14 is out of bounds for axis 0 with size 14

This cell fails because SelectKBest kept only 14 of the 16 input features, so pairing the coefficients with every column both mislabels them (from Embarked_Q onward) and runs past the end of the coefficient array. The In [55] cell above pairs them correctly by using the selector's get_support() mask.

In [32]:
simulation_df[simulation_df['Title_Dr']==1]

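|  | Age | SibSp | Parch | Fare | Survived_1 | Pclass_2 | Pclass_3 | Sex_male | Embarked_Q | Embarked_S | Title_Dr | Title_Military | Title_Miss | Title_Mr | Title_Mrs | Title_Noble | Title_Rev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 245 | 44.0 | 2 | 0 | 90.0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 317 | 54.0 | 0 | 0 | 14.0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 398 | 23.0 | 0 | 0 | 10.5 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 632 | 32.0 | 0 | 0 | 30.5 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 660 | 50.0 | 2 | 0 | 133.65 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 766 | 45.062712 | 0 | 0 | 39.6 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 796 | 49.0 | 0 | 0 | 25.9292 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |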
