In [3]:
import pandas as pd
import numpy as np
import os
from modules.data_preparator import DataPreparator
from modules.model_pipeline import MLModelsPipeline

# Data

Read in the raw data and make some quick preprocessing changes. We need to convert the dependent variable to numeric and replace some anomalous values in the categorical columns (i.e., the ? symbols).

These sections are commented out. A hard copy of the pre-processed and pre-split datasets is provided in the package. This allows anyone to reproduce the results from this tutorial without the randomness from splitting the data.

In [4]:
# data = pd.read_csv(os.path.join("data", "census", "adult.csv"))
# # replace the question mark values with missing
# data.replace(to_replace='?', value=np.nan, inplace=True)
# data.replace({'income': {'<=50K': 0, '>50K': 1}}, inplace=True)
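
To make the two commented-out steps concrete, here is a minimal sketch on a made-up three-row frame (the rows are invented for illustration and are not taken from adult.csv). It marks the ? symbols as missing and maps the income labels to 0/1. It uses `Series.map` for the label conversion, which is a swapped-in alternative to the nested `replace` call above.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic the raw census file (values are made up for illustration).
toy = pd.DataFrame({
    'workclass': ['Private', '?', 'State-gov'],
    'occupation': ['Sales', 'Tech-support', '?'],
    'income': ['<=50K', '>50K', '<=50K'],
})

# Treat '?' as missing, then convert the dependent variable to 0/1.
toy = toy.replace('?', np.nan)
toy['income'] = toy['income'].map({'<=50K': 0, '>50K': 1})

print(toy.isna().sum().to_dict())  # missing values per column
print(toy['income'].tolist())      # numeric dependent variable
```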

## Dataset Parameters

In [6]:
# ---- Dataset parameters ---- #
features = ['workclass', 'fnlwgt', 'education', 'educational-num',
            'marital-status', 'occupation', 'relationship',
            'capital-gain', 'capital-loss', 'hours-per-week']
dep_var = 'income'
demo_vars = ['race', 'gender']

Below is the code for creating the initial DataPreparator object, which contains several new features. It can now conduct pairwise deletion of missing data, impute values for missing data, take in demographic variables, and encode categorical variables as numeric values!

In [7]:
# ---- We use this first run to split the data for us, then we save those outputs so that we have fully
# ---- reproducible datasets. We can comment out this code, because from now on we rely on the pre-saved
# ---- datasets with missing data already removed
# data_prep = DataPreparator(data=data, features=features, dep_var=dep_var, demo_vars=demo_vars, max_miss=None)
# data_prep.split_data(val_set=False, test_size=0.30, random_state=456)
# data_prep.encode_categorical(strategy='TargetEncoder')
# data_prep.x_train.columns
# data_prep.features
# data_prep.impute_missing(strategy='knn', n_neighbors=15)
# data_prep.data.to_csv(os.path.join("data", "census", "data_pre_processed.csv"))
# pd.concat([data_prep.x_train, data_prep.y_train, data_prep.d_train], axis=1).to_csv(
#     os.path.join("data", "census", "train.csv"))
# pd.concat([data_prep.x_test, data_prep.y_test, data_prep.d_test], axis=1).to_csv(
#     os.path.join("data", "census", "test.csv"))
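
For readers who want to see what a seeded split and KNN imputation look like outside of DataPreparator, here is a rough sketch using plain scikit-learn on synthetic data. It is not the DataPreparator implementation (its internals are not shown in this tutorial). The column names `a`, `b`, `c` are placeholders, and fitting the imputer on the training rows only is my assumption about good practice, not a statement about what `impute_missing` does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; column names are placeholders.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
y = pd.Series(rng.integers(0, 2, size=100), name='income')
X.iloc[::10, 0] = np.nan  # knock out every tenth value in column 'a'

# The same random_state gives the same test rows on every run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=456)
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.30, random_state=456)
same_split = X_test.index.equals(X_test2.index)
print("Identical split on re-run:", same_split)

# Fit the imputer on the training rows only, then apply it to both splits.
imputer = KNNImputer(n_neighbors=15)
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X.columns, index=X_test.index)
print("Missing after imputation:", int(X_train_imp.isna().sum().sum() + X_test_imp.isna().sum().sum()))
```

Saving the split to disk, as the commented code above does, removes even the dependence on the random seed and library version.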

The code below will read in the pre-saved split and pre-processed datasets

In [8]:
data_path = os.path.join("data", "census", "data_pre_processed.csv")
train_path = os.path.join("data", "census", "train.csv")
test_path = os.path.join("data", "census", "test.csv")
data_prep = DataPreparator(data=data_path, train_data=train_path, test_data=test_path, features=features,
                           dep_var=dep_var, demo_vars=demo_vars)

# Cross Validation

There is a new class, MLModelsPipeline, that handles the pipeline for training several ML models. These custom methods inherit from sklearn's BaseEstimator and operate like sklearn models. Let's start by running cross-validation to get the best hyper-parameter settings for our models.

Caution: I've set some hyper-parameter values arbitrarily. This is not an exhaustive grid search and likely won't result in the best-fitting models!

In [9]:
model_grid_params = {
        'LGR': {'penalty': ['l1', 'l2'], 'C': [0.5, 1.0, 2.0], 'solver': ['liblinear']},
        'SVC': {'C': [0.5, 1.0], 'kernel': ['linear', 'poly']},
        'KNNC': {'n_neighbors': [5, 50], 'leaf_size': [10, 30], 'p': [1, 2]},
        'RFC': {'n_estimators': [100, 200], 'max_depth': [50, 100]},
        'MLPC': {'hidden_layer_sizes': [(100,), (20, 50, 20)], 'activation': ['relu', 'logistic']}
    }
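
As a quick sanity check on the grid above, the number of candidates per model is the product of the list lengths in its grid. A minimal sketch in plain Python (no pipeline involved):

```python
from math import prod

model_grid_params = {
    'LGR': {'penalty': ['l1', 'l2'], 'C': [0.5, 1.0, 2.0], 'solver': ['liblinear']},
    'SVC': {'C': [0.5, 1.0], 'kernel': ['linear', 'poly']},
    'KNNC': {'n_neighbors': [5, 50], 'leaf_size': [10, 30], 'p': [1, 2]},
    'RFC': {'n_estimators': [100, 200], 'max_depth': [50, 100]},
    'MLPC': {'hidden_layer_sizes': [(100,), (20, 50, 20)], 'activation': ['relu', 'logistic']}
}

# One candidate per combination of values, so multiply the list lengths.
n_candidates = {name: prod(len(values) for values in grid.values())
                for name, grid in model_grid_params.items()}
print(n_candidates)
# {'LGR': 6, 'SVC': 4, 'KNNC': 8, 'RFC': 4, 'MLPC': 4}
```

Each candidate is fit once per fold, so with 2 folds the logistic regression grid needs 6 x 2 = 12 fits.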

Let's set some evaluation metrics for training the models. These should be scoring names that sklearn.metrics recognizes.

In [10]:
model_evaluation_metrics = ['accuracy', 'precision', 'recall']
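
One way to check that metric names like these are valid is to look them up with scikit-learn's `get_scorer`. The sketch below does that on synthetic data with a plain logistic regression. How MLModelsPipeline resolves these names internally is not shown here, so treat this only as an illustration of the scikit-learn side.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer

model_evaluation_metrics = ['accuracy', 'precision', 'recall']

# Synthetic data and a throwaway model, just to have something to score.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# get_scorer raises a ValueError for an unknown name, so this doubles as validation.
scores = {name: get_scorer(name)(model, X, y) for name in model_evaluation_metrics}
print(scores)
```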

First, we instantiate the MLModelsPipeline class with our desired parameters

In [11]:
pipeline = MLModelsPipeline(data_preparator=data_prep, models=model_grid_params.keys())

Next, we call the cross_validate_models method within the class to perform k-fold cross-validation on our models and get the best hyper-parameters from the ones we tested.

In [12]:
pipeline.cross_validate_models(scorer='accuracy', models_search_params=model_grid_params,
                                cv=2, return_train_score=True, refit='accuracy')

Fitting 2 folds for each of 6 candidates, totalling 12 fits


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Cross-validation for LGR has completed!
Fitting 2 folds for each of 4 candidates, totalling 8 fits
Cross-validation for SVC has completed!
Fitting 2 folds for each of 8 candidates, totalling 16 fits
Cross-validation for KNNC has completed!
Fitting 2 folds for each of 4 candidates, totalling 8 fits
Cross-validation for RFC has completed!
Fitting 2 folds for each of 4 candidates, totalling 8 fits
Cross-validation for MLPC has completed!


The warnings above may be due to our hyper-parameter settings, to unscaled features, or to the solver not running enough iterations (max_iter). I'm not going to fix them, because the training time would become too long for the purpose of this demonstration.
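
If you do want to address convergence warnings like these in your own work, the warning text itself suggests scaling the data or raising max_iter. Below is a hedged sketch of both, using scikit-learn's GridSearchCV directly on synthetic data rather than MLModelsPipeline (whose internals are not shown). The inflated first column only mimics a large-valued feature such as fnlwgt, and `max_iter=1000` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one column on a much larger scale than the rest.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[:, 0] *= 1e5

# Scale inside the pipeline so each CV fold is scaled using its own training rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear', max_iter=1000))
grid = {
    'logisticregression__penalty': ['l1', 'l2'],
    'logisticregression__C': [0.5, 1.0, 2.0],
}
search = GridSearchCV(pipe, grid, scoring='accuracy', cv=2, return_train_score=True, refit=True)
search.fit(X, y)
print(search.best_params_)
```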

Below are the results for the logistic regression model as an example

In [14]:
pd.DataFrame(pipeline.cv_results['LGR'])

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_penalty,param_solver,params,split0_test_score,split1_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,mean_train_score,std_train_score
0,0.085289,0.01271,0.001001,1.192093e-07,0.5,l1,liblinear,"{'C': 0.5, 'penalty': 'l1', 'solver': 'libline...",0.792328,0.799432,0.79588,0.003552,1,0.799432,0.792328,0.79588,0.003552
1,0.066484,0.000485,0.0015,0.0004998446,0.5,l2,liblinear,"{'C': 0.5, 'penalty': 'l2', 'solver': 'libline...",0.792328,0.799432,0.79588,0.003552,1,0.799432,0.792328,0.79588,0.003552
2,0.06533,0.000329,0.001,1.192093e-07,1.0,l1,liblinear,"{'C': 1.0, 'penalty': 'l1', 'solver': 'libline...",0.792328,0.799432,0.79588,0.003552,1,0.799432,0.792328,0.79588,0.003552
3,0.066954,4.5e-05,0.001,5.960464e-07,1.0,l2,liblinear,"{'C': 1.0, 'penalty': 'l2', 'solver': 'libline...",0.792328,0.799432,0.79588,0.003552,1,0.799432,0.792328,0.79588,0.003552
4,0.067206,0.002794,0.002,3.576279e-07,2.0,l1,liblinear,"{'C': 2.0, 'penalty': 'l1', 'solver': 'libline...",0.792328,0.799432,0.79588,0.003552,1,0.799432,0.792328,0.79588,0.003552
5,0.0665,0.0005,0.002,3.576279e-07,2.0,l2,liblinear,"{'C': 2.0, 'penalty': 'l2', 'solver': 'libline...",0.792328,0.799432,0.79588,0.003552,1,0.799432,0.792328,0.79588,0.003552


Below are the results for the multi-layer perceptron (neural network) model as an example

In [15]:
pd.DataFrame(pipeline.cv_results['MLPC'])

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_activation,param_hidden_layer_sizes,params,split0_test_score,split1_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,mean_train_score,std_train_score
0,0.3445,0.0315,0.0065,0.0004999638,relu,"(100,)","{'activation': 'relu', 'hidden_layer_sizes': (...",0.779041,0.767424,0.773232,0.005808,3,0.784473,0.763664,0.774068,0.010404
1,0.7425,0.2585,0.0055,0.0005002022,relu,"(20, 50, 20)","{'activation': 'relu', 'hidden_layer_sizes': (...",0.763079,0.785308,0.774194,0.011115,1,0.766672,0.780378,0.773525,0.006853
2,0.463,0.013,0.005881,0.000118494,logistic,"(100,)","{'activation': 'logistic', 'hidden_layer_sizes...",0.753552,0.794334,0.773943,0.020391,2,0.763914,0.790072,0.776993,0.013079
3,0.465968,0.108994,0.005,1.192093e-07,logistic,"(20, 50, 20)","{'activation': 'logistic', 'hidden_layer_sizes...",0.681347,0.795337,0.738342,0.056995,4,0.692378,0.788985,0.740682,0.048304


We can call the best_params attribute to get a dictionary of our best fitting hyper-parameters. We will need this next to retrain our models.

In [16]:
print(pipeline.best_params)

{'LGR': {'C': 0.5, 'penalty': 'l1', 'solver': 'liblinear'}, 'SVC': {'C': 0.5, 'kernel': 'linear'}, 'KNNC': {'leaf_size': 10, 'n_neighbors': 5, 'p': 1}, 'RFC': {'max_depth': 50, 'n_estimators': 200}, 'MLPC': {'activation': 'relu', 'hidden_layer_sizes': (20, 50, 20)}}


# Train Models

Next, we call the train_models method to train all of our models on the full training dataset. We pass in the best_params dictionary so that the pipeline knows what to set our hyper-parameters to.

In [17]:
pipeline.train_models(model_specs=pipeline.best_params)

Training for model LGR is complete!
Training for model SVC is complete!
Training for model KNNC is complete!
Training for model RFC is complete!
Training for model MLPC is complete!
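
Conceptually, retraining with a best-parameters dictionary amounts to unpacking each entry into the matching estimator and fitting it. The sketch below shows that idea with three of the five models on synthetic data. The `MODEL_CLASSES` mapping from short names to scikit-learn classes is hypothetical. The pipeline's real mapping and its train_models internals are not shown in this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical name-to-class mapping (an assumption for illustration).
MODEL_CLASSES = {
    'LGR': LogisticRegression,
    'KNNC': KNeighborsClassifier,
    'RFC': RandomForestClassifier,
}

# Subset of the best_params dictionary printed earlier in the notebook.
best_params = {
    'LGR': {'C': 0.5, 'penalty': 'l1', 'solver': 'liblinear'},
    'KNNC': {'leaf_size': 10, 'n_neighbors': 5, 'p': 1},
    'RFC': {'max_depth': 50, 'n_estimators': 200},
}

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Unpack each parameter dictionary into its estimator, then fit on the full training data.
fitted = {name: MODEL_CLASSES[name](**params).fit(X, y)
          for name, params in best_params.items()}
for name in fitted:
    print(f"Training for model {name} is complete!")
```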


We just trained 5 machine learning models with a single line of code :)

# Evaluate Models Performance

Now, let's evaluate how our models performed 

In [18]:
perf_results = pipeline.evaluate_performance(scorer=['accuracy_score', 'recall_score'], ensemble='classifier')
print(perf_results)

Evaluation for model LGR is complete!
Evaluation for model SVC is complete!
Evaluation for model KNNC is complete!
Evaluation for model RFC is complete!
Evaluation for model MLPC is complete!
                accuracy_score  recall_score
LGR      train        0.844518      0.549347
         test         0.849519      0.555815
SVC      train        0.792286      0.277682
         test         0.797857      0.271058
KNNC     train        0.832609      0.441541
         test         0.777247      0.301078
RFC      train        0.999708      0.999312
         test         0.838668      0.609443
MLPC     train        0.784431      0.131018
         test         0.787962      0.114835
ensemble train        0.843390      0.373281
         test         0.825974      0.299038
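
Notice that several models have decent accuracy but low recall. A tiny hand-made example (the labels below are invented) shows how this happens when the positive class is the minority: a model can be right most of the time while still missing most of the positives.

```python
from sklearn.metrics import accuracy_score, recall_score

# 7 negatives and 3 positives; the model finds only one of the positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)  # 8 of 10 predictions are correct
rec = recall_score(y_true, y_pred)    # 1 of 3 true positives is recovered
print(f"accuracy={acc:.3f}, recall={rec:.3f}")
```

The same pattern shows up in the table above, for example MLPC with test accuracy near 0.79 but test recall near 0.11.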


# Evaluate Fairness

We can also use the pipeline to evaluate the fairness of our models. It's best to use the term "fairness" here, because "bias" already has a specific meaning when training ML models. In the bias-variance trade-off, bias is the error that comes from a model being too simplistic (underfitting), while variance is the error that comes from a model being too sensitive to its training data (overfitting). It's somewhat similar to the validity and reliability that we talk about in psychometrics.

In [19]:
comparison_dict = {
        'race': {'White': ['Black', 'Other']},
        'gender': {'Male': ['Female']}
    }

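We use the dictionary above to state which demographic variables we are testing and the pairwise comparisons between groups for each variable. Each key is paired with every group in its list, e.g., White vs. Black and White vs. Other for race.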

In [20]:
fairness_results = pipeline.evaluate_fairness(scorer='disparate_impact', comparison_dict=comparison_dict)
print(fairness_results)


      White_Black  White_Other  Male_Female
LGR      0.425532     0.375307     0.296637
SVC      0.470609     0.359962     0.513184
KNNC     0.634699     0.524219     0.483059
RFC      0.481022     0.368568     0.369087
MLPC     0.604112     0.375982     0.535538
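
For intuition, disparate impact is commonly defined as the ratio of the positive-prediction rate in one group to the rate in a reference group, and a ratio below 0.8 is often flagged under the "four-fifths" rule of thumb. The sketch below implements that common definition on invented predictions. Whether the pipeline's 'disparate_impact' scorer uses exactly this formula or this direction of the ratio is an assumption on my part, and the `disparate_impact` helper here is hypothetical.

```python
import pandas as pd

def disparate_impact(y_pred, groups, reference, comparison):
    """Positive-prediction rate of `comparison` divided by that of `reference`."""
    y_pred = pd.Series(y_pred).reset_index(drop=True)
    groups = pd.Series(groups).reset_index(drop=True)
    rate_reference = y_pred[groups == reference].mean()
    rate_comparison = y_pred[groups == comparison].mean()
    return rate_comparison / rate_reference

# Invented predictions: 3 of 4 males and 1 of 4 females are predicted positive.
preds = [1, 1, 0, 1, 0, 0, 1, 0]
gender = ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female']

di = disparate_impact(preds, gender, reference='Male', comparison='Female')
print(f"Male_Female disparate impact: {di:.3f}")
```

Under this definition a value of 1.0 means both groups receive positive predictions at the same rate.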
