# Capstone 1: Machine Learning Algorithms - Under-Sampling

<a id='TOC'></a>
<strong>Table of Contents</strong>
<ol>
    <li>Preliminaries</li>
    <ol>
        <li><a href=#Sec01A>Import EMS Incident Data</a></li>
        <li><a href=#Sec01B>Preprocess Dataset</a></li>
        <li><a href=#Sec01C>Segment &amp; Encode Feature Variables</a></li>
        <li><a href=#Sec01D>Inspect Target Variable</a></li>
    </ol>
    <li>Prepare Training Data</li>
    <ol>
        <li><a href=#Sec02A>Create Train and Test Sets</a></li>
        <li><a href=#Sec02B>Instantiate Baseline Classifiers</a></li>
    </ol>
    <li>Random Under-Sampling (RUS)</li>
    <ol>
        <li><a href=#Sec03A>Instantiate Training Pipelines</a></li>
        <li><a href=#Sec03B>Parameter Tuning - LRCV Pipeline</a></li>
        <li><a href=#Sec03C>Parameter Tuning - RF Pipeline</a></li>
        <li><a href=#Sec03D>Evaluate Tuned Classifiers</a></li>
    </ol>
    <li>Near Miss (NM)</li>
    <ol>
        <li><a href=#Sec04A>Instantiate Training Pipelines</a></li>
        <li><a href=#Sec04B>Parameter Tuning - LRCV Pipeline</a></li>
        <li><a href=#Sec04C>Parameter Tuning - RF Pipeline</a></li>
        <li><a href=#Sec04D>Evaluate Tuned Classifiers</a></li>
    </ol>
    <li>Tomek's Links (TL)</li>
    <ol>
        <li><a href=#Sec05A>Instantiate Training Pipelines</a></li>
        <li><a href=#Sec05B>Parameter Tuning - LRCV Pipeline</a></li>
        <li><a href=#Sec05C>Parameter Tuning - RF Pipeline</a></li>
        <li><a href=#Sec05D>Evaluate Tuned Classifiers</a></li>
    </ol>
    <li>Evaluation of Classifiers</li>
    <ol>
        <li><a href=#Sec06A>Comparison of Baseline Models</a></li>
        <li><a href=#Sec06B>Comparison of Tuned Models</a></li>
        <li><a href=#Sec06C>Summary of Analyses</a></li>
    </ol>
</ol>

<p>The goal of this project is to develop machine learning models that predict whether an EMS incident will result in a fatality. This is a supervised, binary classification problem. Analyses are performed on nearly 8 million documented incident records spanning the six-year period from January 2013 through December 2018. The dataset contains feature variables of mixed data types that describe both the attributes of each incident and the responsive action taken by the FDNY, all of which affect an individual’s survivability once a response is initiated.</p>

The results from the <a href="https://github.com/jdwill917/SB-DSCT-Repo/blob/master/Capstones/Capstone%201/code/CP1-04a_MLA.ipynb" target="_blank">baseline MLAs</a> illustrated that the underlying dataset is highly imbalanced on the target variable, `fatality`. This notebook examines several models generated by combining different under-sampling algorithms with parametric and non-parametric classifiers. The primary evaluation metric is <em>recall</em> (the fraction of actual positives, here fatalities, that the model correctly identifies), but other metrics tailored to imbalanced datasets are also taken into consideration.
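To make the primary metric concrete, the following minimal sketch computes recall (and precision, for contrast) by hand from confusion-matrix counts. The labels are toy values invented for illustration and are not drawn from the EMS data.

```python
# Toy labels (hypothetical, for illustration only): 1 = fatality, 0 = survival
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Tally the confusion-matrix cells for the positive class
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

recall = tp / (tp + fn)      # share of actual positives that were caught
precision = tp / (tp + fp)   # share of predicted positives that were correct

print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Recall ignores false positives entirely, which is why it is read alongside precision-based scores such as F1 later in the notebook.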

<p>The three sampling methods to be explored in this analysis are:</p>
<ul>
    <li>Random Under-Sampling</li>
    <li>NearMiss Sampling</li>
    <li>Tomek's Link Sampling</li>
</ul>
<p>Model scoring will be evaluated for both baseline and tuned classifiers for each method. A similar analysis will be performed for <a href="https://github.com/jdwill917/SB-DSCT-Repo/blob/master/Capstones/Capstone%201/code/CP1-04c_MLA.ipynb" target="_blank">over-sampling methods</a> in a separate notebook.</p>    
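Of the three methods, Tomek's links is the least self-explanatory: a link is a pair of observations from opposite classes that are each other's nearest neighbour, and under-sampling removes the majority-class member of each pair. The sketch below illustrates that rule on a handful of hypothetical 2-D points using only the standard library; it is not the `imblearn` implementation, and the helper `nearest` is introduced here purely for illustration.

```python
import math

# Hypothetical 2-D points with labels: 1 = minority, 0 = majority
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (3.0, 3.0)]
labels = [1, 0, 0, 0, 1]

def nearest(i):
    """Index of the closest other point to points[i] (illustrative helper)."""
    others = [j for j in range(len(points)) if j != i]
    return min(others, key=lambda j: math.dist(points[i], points[j]))

# A Tomek link: opposite-class points that are mutual nearest neighbours
links = []
for i in range(len(points)):
    j = nearest(i)
    if i < j and nearest(j) == i and labels[i] != labels[j]:
        links.append((i, j))

# Under-sampling drops the majority-class member of each link
drop = {i if labels[i] == 0 else j for i, j in links}
kept = [k for k in range(len(points)) if k not in drop]

print("links:", links)
print("kept indices:", kept)
```

Because only boundary pairs are removed, this method typically discards far fewer majority observations than random under-sampling or NearMiss.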

<h2 style="text-transform: uppercase;">1. Preliminaries</h2>

<a id='Sec01A'></a>
<h4>1A: Import EMS Incident Data</h4>

In [38]:
# Import packages and modules
import pandas as pd
import numpy as np
import category_encoders as ce

import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
#sb.set(style='whitegrid',color_codes=True)

# Model evaluation tools and metrics
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from pprint import pprint
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score, f1_score, auc,
                             confusion_matrix, plot_confusion_matrix,
                             roc_curve, precision_recall_curve)
from imblearn.metrics import classification_report_imbalanced

# Prospective classifiers
from imblearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Assign file path
file_path = '../data/clean_EMS_data.csv'

# Read CSV data into a Pandas DataFrame
datetime_cols = ['incident_datetime',
                 'first_assignment_datetime',
                 'first_activation_datetime',
                 'first_on_scene_datetime',
                 'first_to_hosp_datetime',
                 'first_hosp_arrival_datetime',
                 'incident_close_datetime']

df = pd.read_csv(file_path,compression='gzip',
                 parse_dates=datetime_cols,
                 index_col=['incident_datetime'])

In [2]:
#df.info(verbose=True,memory_usage='deep')

<a href=#TOC>TOC</a>

<a id="Sec01B"></a>
<h4>1B: Preprocess Dataset</h4>

In [4]:
# Change dtypes
df['borough'] = df.borough.astype('category')
df['zipcode'] = df.zipcode.astype('category')

In [5]:
# Remove immaterial columns
list_of_cols = ['latitude','longitude',
                'aland_sqmi','awater_sqmi','held_indicator',
                'initial_call_type','initial_severity_level',
                'first_assignment_datetime','incident_dispatch_area',
                'dispatch_time','travel_time',
                'first_activation_datetime','first_on_scene_datetime',
                'first_to_hosp_datetime','first_hosp_arrival_datetime',
                'incident_close_datetime','incident_disposition_code']
df.drop(list_of_cols,axis=1,inplace=True)

In [1]:
#df.info(verbose=True,memory_usage='deep')

In [7]:
df.head()

incident_datetime,year,month,day,hour,weekday,borough,zipcode,final_call_type,final_severity_level,response_time,life_threatening,fatality
2013-01-01 00:00:04,2013,1,1,0,2,BRONX,10472.0,RESPIR,4,797.0,False,False
2013-01-01 00:05:52,2013,1,1,0,2,BRONX,10472.0,EDP,7,534.0,False,False
2013-01-01 00:20:37,2013,1,1,0,2,BRONX,10472.0,SICK,6,697.0,False,False
2013-01-01 01:53:11,2013,1,1,1,2,BRONX,10472.0,INJURY,4,223.0,False,False
2013-01-01 01:54:28,2013,1,1,1,2,BRONX,10472.0,SICK,4,298.0,False,False


<a href=#TOC>TOC</a>

<a id="Sec01C"></a>
<h4>1C: Segment &amp; Encode Feature Variables</h4>

In [8]:
# Nominal feature variables
var_names_nom = ['borough','zipcode','final_call_type']
vars_nom = df[var_names_nom]

# Binary encode dataframe
enc_binary = ce.BinaryEncoder(cols=var_names_nom)
df_bin = enc_binary.fit_transform(df)

# Inspect encoded dataframe
print(df_bin.shape)
df_bin.head()

(7988028, 31)


incident_datetime,year,month,day,hour,weekday,borough_0,borough_1,borough_2,borough_3,zipcode_0,...,final_call_type_3,final_call_type_4,final_call_type_5,final_call_type_6,final_call_type_7,final_call_type_8,final_severity_level,response_time,life_threatening,fatality
2013-01-01 00:00:04,2013,1,1,0,2,0,0,0,1,0,...,0,0,0,0,0,1,4,797.0,False,False
2013-01-01 00:05:52,2013,1,1,0,2,0,0,0,1,0,...,0,0,0,0,1,0,7,534.0,False,False
2013-01-01 00:20:37,2013,1,1,0,2,0,0,0,1,0,...,0,0,0,0,1,1,6,697.0,False,False
2013-01-01 01:53:11,2013,1,1,1,2,0,0,0,1,0,...,0,0,0,1,0,0,4,223.0,False,False
2013-01-01 01:54:28,2013,1,1,1,2,0,0,0,1,0,...,0,0,0,0,1,1,4,298.0,False,False


In [11]:
# Target variable
y_bin = df_bin['fatality'].values
print(f"y\nType:  {type(y_bin)}\nShape: {y_bin.shape}")

print()

# All feature variables
X_bin = df_bin.iloc[:,:-1].values
print(f"X (Binary-Encoded)\nType:  {type(X_bin)}\nShape: {X_bin.shape}")

y
Type:  <class 'numpy.ndarray'>
Shape: (7988028,)

X (Binary-Encoded)
Type:  <class 'numpy.ndarray'>
Shape: (7988028, 30)


<p>Applying binary encoding increases the number of feature variables in the clean dataset from 11 to 30. <a href=#TOC>TOC</a></p>
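The idea behind binary encoding can be sketched without the library: each category receives an integer code, and that code is written in base 2 with one column per bit. The function `binary_encode` and the category labels below are hypothetical, and the exact number of columns that `category_encoders` emits can differ by version (the output above shows four `borough_*` columns), so the bit width here is illustrative only.

```python
def binary_encode(values):
    """Map each category to an ordinal code (starting at 1), then to bits."""
    categories = sorted(set(values))
    codes = {cat: i + 1 for i, cat in enumerate(categories)}
    width = max(codes.values()).bit_length()   # bits needed for the largest code
    return {cat: [int(b) for b in format(code, f"0{width}b")]
            for cat, code in codes.items()}

# Hypothetical nominal feature with five distinct categories
encoded = binary_encode(["A", "B", "C", "D", "E", "A", "C"])
for cat, bits in encoded.items():
    print(cat, bits)
```

Five categories fit in three bit columns here, whereas one-hot encoding would need five. That compactness matters for a high-cardinality feature such as `zipcode`.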

<a id="Sec01D"></a>
<h4>1D: Inspect Target Variable</h4>

In [12]:
# Subsets for two classes of target
fatalities = df_bin[df_bin.fatality == True] # positives
survivals = df_bin[df_bin.fatality == False] # negatives

# Calculate frequency and proportion of classes
n_pos = len(fatalities.fatality)
n_neg = len(survivals.fatality)
pct_pos = (n_pos/len(df_bin['fatality'])) * 100
pct_neg = (n_neg/len(df_bin['fatality'])) * 100

# Output results
print("Fatalities: {0:7} ({1:5.4}%)".format(n_pos,pct_pos))
print("Survivals:  {0:7} ({1:5.4}%)".format(n_neg,pct_neg))

Fatalities:  338684 ( 4.24%)
Survivals:  7649344 (95.76%)


The target variable is segmented into two <em>imbalanced</em> classes: <strong>fatalities</strong> (`fatality == True`) and <strong>survivals</strong> (`fatality == False`). Based on the frequency values provided above, <strong>fatalities</strong> represent the <em>minority</em> class (4.24%) whereas <strong>survivals</strong> represent the <em>majority</em> class (95.76%). Various sampling techniques will be utilized in the sections to follow in order to develop effective models for analyses. <a href=#TOC>TOC</a>

<hr>

<h2 style="text-transform: uppercase;">2. Prepare Training Data</h2>

<a id="Sec02A"></a>
<h4>2A: Create Train and Test Sets</h4>

In [14]:
# Initialize parameters
RANDOM_STATE = 917
TEST_SIZE = 0.20

# Split and stratify the binary-encoded data
X_train, X_test, y_train, y_test = train_test_split(X_bin, y_bin, stratify=y_bin,
                                                    test_size=TEST_SIZE, random_state=RANDOM_STATE)

In [15]:
# Define function to calculate target class counts
def get_class_counts(arr):
    class_f = int(np.sum(arr))          # True values count as 1
    class_s = len(arr) - class_f
    return {'Fatalities': class_f, 'Survivals': class_s}

# Define function to calculate target class proportions
def get_class_proportions(arr):
    class_f = round(float(np.mean(arr)), 4)
    class_s = round(1 - class_f, 4)
    return {'Fatalities': class_f, 'Survivals': class_s}

# Output results
print(f"TRAINING DATA = {len(y_train)} observations")
print(f"Class Counts: {get_class_counts(y_train)}")
print(f"Proportions:  {get_class_proportions(y_train)}")
print()
print(f"TEST DATA     = {len(y_test)} observations")
print(f"Class Counts: {get_class_counts(y_test)}")
print(f"Proportions:  {get_class_proportions(y_test)}")

TRAINING DATA = 6390422 observations
Class Counts: {'Fatalities': 270947, 'Survivals': 6119475}
Proportions:  {'Fatalities': 0.0424, 'Survivals': 0.9576}

TEST DATA     = 1597606 observations
Class Counts: {'Fatalities': 67737, 'Survivals': 1529869}
Proportions:  {'Fatalities': 0.0424, 'Survivals': 0.9576}


<p>Stratification is necessary to preserve the class distribution in both the train and test sets, given the high imbalance within the target classes of the original dataset. <a href=#TOC>TOC</a></p>
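The effect of stratification can be illustrated with a small standard-library sketch that splits each class separately in the same ratio. The helpers `stratified_split` and `positive_share` and the toy target are hypothetical; scikit-learn's `train_test_split(..., stratify=...)` is implemented differently, so this only conveys the principle.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with every class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

def positive_share(labels, idx):
    """Fraction of positive labels among the selected rows."""
    return sum(labels[i] for i in idx) / len(idx)

# Toy target with 5% positives (hypothetical data)
y = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(y, test_size=0.20, seed=917)

print(f"train: {positive_share(y, train_idx):.3f}")
print(f"test:  {positive_share(y, test_idx):.3f}")
```

Both subsets keep the 5% positive rate of the toy target, mirroring the identical 0.0424 proportions reported for the train and test sets above.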

<a id="Sec02B"></a>
<h4>2B: Instantiate Baseline Classifiers</h4>

In [42]:
# Compute target class weights
classes = np.unique(y_bin)
cw = compute_class_weight('balanced',classes=classes,y=y_bin)  # keyword arguments are required by newer scikit-learn releases
cw_dict = {0:cw[0],1:cw[1]}

# Output results
print(f"Class Weights: {cw_dict}")

{0: 0.5221381075292209, 1: 11.792744859515064}
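The weights printed above follow from scikit-learn's `balanced` heuristic, `n_samples / (n_classes * class_count)`. As a quick check, the sketch below reproduces them from the class counts reported in Section 1D using plain arithmetic.

```python
# Class counts reported in Section 1D
n_neg, n_pos = 7_649_344, 338_684
n_samples, n_classes = n_neg + n_pos, 2

# 'balanced' heuristic: n_samples / (n_classes * class_count)
weights = {0: n_samples / (n_classes * n_neg),
           1: n_samples / (n_classes * n_pos)}

print(weights)
print(f"weight ratio (positive / negative): {weights[1] / weights[0]:.2f}")
```

The ratio of the two weights equals the ratio of the class counts, so each fatality counts roughly as much as 22 to 23 survivals when these weights are applied.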


In [44]:
clf_lrcv = LogisticRegressionCV(random_state=RANDOM_STATE,
                                class_weight=cw_dict,max_iter=10000,scoring='recall_weighted')

In [43]:
clf_rf = RandomForestClassifier(random_state=RANDOM_STATE,
                                class_weight=cw_dict,warm_start=True)

<p>This analysis will compare the usage of a parametric classifier (LogisticRegressionCV) and a non-parametric classifier (RandomForest), and their respective impacts on the selected model scoring metrics. <a href=#TOC>TOC</a></p>

<hr>

<h2 style="text-transform: uppercase;">3. Random Under-Sampling (RUS)</h2>

<a id="Sec03A"></a>
<h4>3A: Instantiate Training Pipelines</h4>

In [45]:
from imblearn.under_sampling import RandomUnderSampler

In [46]:
# Make pipeline for sampling method and LogisticRegressionCV (LRCV)
clf_rus_lrcv = make_pipeline(RandomUnderSampler(random_state=RANDOM_STATE),clf_lrcv)

# Make pipeline for sampling method and Random Forest (RF)
clf_rus_rf = make_pipeline(RandomUnderSampler(random_state=RANDOM_STATE),clf_rf)
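As a rough illustration of what the sampler step does with its default settings (balance the classes, sample without replacement), the sketch below keeps every minority observation and draws an equal-sized random subset of the majority class. The function `random_under_sample` and the toy target are hypothetical and use only the standard library; `RandomUnderSampler` itself offers more options than are shown here.

```python
import random

def random_under_sample(labels, seed):
    """Return sorted row indices: all minority rows plus an equal-sized
    random subset (without replacement) of the majority rows."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return sorted(minority + rng.sample(majority, len(minority)))

# Toy target (hypothetical): 4 positives, 96 negatives
y = [1] * 4 + [0] * 96
kept = random_under_sample(y, seed=917)

print(f"rows kept: {len(kept)}, positives kept: {sum(y[i] for i in kept)}")
```

Within an `imblearn` pipeline the sampler is applied only while fitting, so predictions on the test set are still made on the original, imbalanced data.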

In [47]:
# Get parameters for LRCV pipeline
pprint(clf_rus_lrcv.get_params())

{'logisticregressioncv': LogisticRegressionCV(Cs=10,
                     class_weight={0: 0.5221381075292209,
                                   1: 11.792744859515064},
                     cv=None, dual=False, fit_intercept=True,
                     intercept_scaling=1.0, l1_ratios=None, max_iter=10000,
                     multi_class='auto', n_jobs=None, penalty='l2',
                     random_state=917, refit=True, scoring='recall_weighted',
                     solver='lbfgs', tol=0.0001, verbose=0),
 'logisticregressioncv__Cs': 10,
 'logisticregressioncv__class_weight': {0: 0.5221381075292209,
                                        1: 11.792744859515064},
 'logisticregressioncv__cv': None,
 'logisticregressioncv__dual': False,
 'logisticregressioncv__fit_intercept': True,
 'logisticregressioncv__intercept_scaling': 1.0,
 'logisticregressioncv__l1_ratios': None,
 'logisticregressioncv__max_iter': 10000,
 'logisticregressioncv__multi_class': 'auto',
 'logisticregressioncv__n_j

In [48]:
# Get parameters for RF pipeline
pprint(clf_rus_rf.get_params())

{'memory': None,
 'randomforestclassifier': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                       class_weight={0: 0.5221381075292209,
                                     1: 11.792744859515064},
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=917,
                       verbose=0, warm_start=True),
 'randomforestclassifier__bootstrap': True,
 'randomforestclassifier__ccp_alpha': 0.0,
 'randomforestclassifier__class_weight': {0: 0.5221381075292209,
                                          1: 11.792744859515064},
 'randomforestclassifier__criterion': 'gini',
 'randomforestclassifier__max_depth': None,


<a href=#TOC>TOC</a>

<a id="Sec03B"></a>
<h4>3B: Parameter Tuning - LRCV Pipeline</h4>

In [58]:
# Create parameter ranges for LRCV pipeline
params_rus_lrcv ={
    'randomundersampler__replacement': [False,True],
    'logisticregressioncv__Cs': [1,10],
    'logisticregressioncv__cv': [3,5],
    'logisticregressioncv__solver': ['lbfgs','sag']
}

In [61]:
scores = ['balanced_accuracy','recall_weighted']

# Instantiate randomized search across finite combinations
rnd_rus_lrcv = RandomizedSearchCV(estimator = clf_rus_lrcv, 
                                  param_distributions = params_rus_lrcv, 
                                  n_iter = 4,
                                  scoring=scores,
                                  refit='recall_weighted',
                                  error_score=0,
                                  random_state = RANDOM_STATE)

# Instantiate exhaustive search across all combinations
grd_rus_lrcv = GridSearchCV(estimator = clf_rus_lrcv, 
                            param_grid = params_rus_lrcv,
                            scoring=scores,
                            refit='recall_weighted',
                            error_score=0)

In [None]:
# Fit base pipeline
clf_rus_lrcv.fit(X_train,y_train)

In [None]:
# Fit the random search model
rnd_rus_lrcv.fit(X_train,y_train)

In [None]:
# Fit the grid search model
grd_rus_lrcv.fit(X_train,y_train)

<a href=#TOC>TOC</a>

<a id="Sec03C"></a>
<h4>3C: Parameter Tuning - RF Pipeline</h4>

In [62]:
# Create parameter ranges for RF
params_rus_rf ={
    'randomundersampler__replacement': [False,True],
    'randomforestclassifier__bootstrap': [False,True],
    'randomforestclassifier__max_features': ['auto', 'sqrt'],
    'randomforestclassifier__n_estimators': [50,100,150],
    'randomforestclassifier__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'randomforestclassifier__min_samples_leaf': [1, 2, 4],
    'randomforestclassifier__min_samples_split': [2, 5, 10]
}

In [63]:
scores = ['balanced_accuracy','recall_weighted']

# Instantiate randomized search across combinations
rnd_rus_rf = RandomizedSearchCV(estimator = clf_rus_rf, 
                                param_distributions = params_rus_rf, 
                                n_iter = 20,
                                scoring=scores,
                                refit='recall_weighted',
                                error_score=0,
                                random_state = RANDOM_STATE)

# Instantiate exhaustive search across all combinations
grd_rus_rf = GridSearchCV(estimator = clf_rus_rf, 
                          param_grid = params_rus_rf,
                          scoring=scores,
                          refit='recall_weighted',
                          error_score=0)
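The difference in cost between the two searches defined above can be estimated by counting the grid points. The sketch below repeats the `params_rus_rf` grid and multiplies the option counts; the fold count of 5 is an assumption based on scikit-learn's default when `cv` is not set.

```python
import math

# Same grid as params_rus_rf above
params_rus_rf = {
    'randomundersampler__replacement': [False, True],
    'randomforestclassifier__bootstrap': [False, True],
    'randomforestclassifier__max_features': ['auto', 'sqrt'],
    'randomforestclassifier__n_estimators': [50, 100, 150],
    'randomforestclassifier__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'randomforestclassifier__min_samples_leaf': [1, 2, 4],
    'randomforestclassifier__min_samples_split': [2, 5, 10],
}

n_folds = 5    # assumption: default number of CV folds
n_iter = 20    # value passed to RandomizedSearchCV above

n_combos = math.prod(len(options) for options in params_rus_rf.values())
grid_fits = n_combos * n_folds
random_fits = n_iter * n_folds

print(f"grid search:       {n_combos} combinations, {grid_fits} fits")
print(f"randomized search: {n_iter} combinations, {random_fits} fits")
```

With more than six million training rows, the exhaustive search is two orders of magnitude more expensive than the randomized one, which is worth weighing before running both.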

In [None]:
# Fit base pipeline
clf_rus_rf.fit(X_train,y_train)

In [None]:
# Fit the random search model
rnd_rus_rf.fit(X_train,y_train)

In [None]:
# Fit the grid search model
grd_rus_rf.fit(X_train,y_train)

<a href=#TOC>TOC</a>

<a id="Sec03D"></a>
<h4>3D: Evaluate Tuned Classifiers</h4>

In [62]:
# Get best parameters for LRCV pipeline
pprint(grd_rus_lrcv.best_params_)

# Get best parameters for RF pipeline
pprint(grd_rus_rf.best_params_)

{'logisticregressioncv__cv': 5}

In [65]:
# Instantiate tuned classifiers
tuned_rus_lrcv = make_pipeline(RandomUnderSampler(random_state=RANDOM_STATE),
                               LogisticRegressionCV(random_state=RANDOM_STATE,cv=5,
                                                    class_weight=cw_dict,
                                                    max_iter=10000,
                                                    scoring='recall_weighted'))

tuned_rus_rf = make_pipeline(RandomUnderSampler(random_state=RANDOM_STATE),
                             RandomForestClassifier(random_state=RANDOM_STATE,
                                                    class_weight=cw_dict,warm_start=True))

In [67]:
# Fit tuned models
tuned_rus_lrcv.fit(X_train,y_train)
tuned_rus_rf.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('randomundersampler',
                 RandomUnderSampler(random_state=917, replacement=False,
                                    sampling_strategy='auto')),
                ('logisticregressioncv',
                 LogisticRegressionCV(Cs=10, class_weight='balanced', cv=5,
                                      dual=False, fit_intercept=True,
                                      intercept_scaling=1.0, l1_ratios=None,
                                      max_iter=10000, multi_class='auto',
                                      n_jobs=None, penalty='l2',
                                      random_state=917, refit=True,
                                      scoring='recall_weighted', solver='lbfgs',
                                      tol=0.0001, verbose=0))],
         verbose=False)

In [69]:
# Make prediction with the classifier
ypred_test_rus_lrcv = tuned_rus_lrcv.predict(X_test)

# Predict class probabilities and obtain scores
probs_rus_lrcv = tuned_rus_lrcv.predict_proba(X_test)
probs_rus_lrcv = probs_rus_lrcv[:, 1]
precision_rus_lrcv, recall_rus_lrcv, _ = precision_recall_curve(y_test, probs_rus_lrcv)
f1_rus_lrcv, auc_rus_lrcv = f1_score(y_test, ypred_test_rus_lrcv), auc(recall_rus_lrcv, precision_rus_lrcv)

# Make prediction with the classifier
ypred_test_rus_rf = tuned_rus_rf.predict(X_test)

# Predict class probabilities and obtain scores
probs_rus_rf = tuned_rus_rf.predict_proba(X_test)
probs_rus_rf = probs_rus_rf[:, 1]
precision_rus_rf, recall_rus_rf, _ = precision_recall_curve(y_test, probs_rus_rf)
f1_rus_rf, auc_rus_rf = f1_score(y_test, ypred_test_rus_rf), auc(recall_rus_rf, precision_rus_rf)

In [70]:
# Evaluation metrics for Logistic Regression classifier
print('CLASSIFIER: LOGISTIC REGRESSION w/ Random Under-Sampling')
print()

# Compute and print key scores
print(f"Accuracy score:          {accuracy_score(y_test, ypred_test_rus_lrcv):.4f} (y_test, ypred_test)")
print(f"Balanced accuracy score: {balanced_accuracy_score(y_test, ypred_test_rus_lrcv):.4f} (y_test, ypred_test)")
print(f"Recall score:            {recall_score(y_test, ypred_test_rus_lrcv):.4f} (y_test, ypred_test)")
print(f"F1 score:                {f1_rus_lrcv:.4f} (y_test, ypred_test)")
print(f"AUC score:               {auc_rus_lrcv:.4f} (recall, precision)")
print()

# Show the classification report
print(classification_report_imbalanced(y_test, ypred_test_rus_lrcv))

print()
print()

# Evaluation metrics for Random Forest classifier
print('CLASSIFIER: RANDOM FOREST w/ Random Under-Sampling')
print()

# Compute and print key scores
print(f"Accuracy score:          {accuracy_score(y_test, ypred_test_rus_rf):.4f} (y_test, ypred_test)")
print(f"Balanced accuracy score: {balanced_accuracy_score(y_test, ypred_test_rus_rf):.4f} (y_test, ypred_test)")
print(f"Recall score:            {recall_score(y_test, ypred_test_rus_rf):.4f} (y_test, ypred_test)")
print(f"F1 score:                {f1_rus_rf:.4f} (y_test, ypred_test)")
print(f"AUC score:               {auc_rus_rf:.4f} (recall, precision)")

# Show the classification report
print(classification_report_imbalanced(y_test, ypred_test_rus_rf))

CLASSIFIER: LOGISTIC REGRESSION w/ Random Under-Sampling

Accuracy score:          0.6486 (y_test, ypred_test_rus_lrcv)
Balanced accuracy score: 0.6658 (y_test, ypred_test_rus_lrcv)
Recall score:            0.6845 (y_test, ypred_test_rus_lrcv)
F1 score:                0.1418 (y_test, ypred_test_rus_lrcv)
AUC score:               0.0942 (recall_rus_lrcv, precision_rus_lrcv)

                   pre       rec       spe        f1       geo       iba       sup

      False       0.98      0.65      0.68      0.78      0.67      0.44   1912336
       True       0.08      0.68      0.65      0.14      0.67      0.44     84671

avg / total       0.94      0.65      0.68      0.75      0.67      0.44   1997007
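The scores above are tied together by simple identities: balanced accuracy is the mean of sensitivity (recall on fatalities) and specificity (recall on survivals), and the `geo` column is their geometric mean. The sketch below backs out the implied specificity from the two reported scores. It is a consistency check on rounded figures, not a recomputation from predictions.

```python
import math

# Scores reported above for the LRCV pipeline
recall_pos = 0.6845          # recall on the fatality class (sensitivity)
balanced_accuracy = 0.6658   # mean of the two per-class recalls

# balanced accuracy = (sensitivity + specificity) / 2, solved for specificity
specificity = 2 * balanced_accuracy - recall_pos
geo_mean = math.sqrt(recall_pos * specificity)

print(f"implied specificity: {specificity:.4f}")
print(f"geometric mean:      {geo_mean:.2f}")
```

The implied specificity of about 0.65 and geometric mean of 0.67 agree with the `rec` value for the `False` row and the `geo` column of the report.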



In [None]:
# Normalized confusion matrix plot for test data
y_class_names = ['Survivals','Fatalities']
fig3D1, (ax3D1a,ax3D1b) = plt.subplots(1,2,figsize=(16,8))

# Logistic Regression
disp = plot_confusion_matrix(tuned_rus_lrcv,X_test,y_test,normalize='all',
                             ax=ax3D1a,display_labels=y_class_names,cmap='Reds')
ax3D1a.set_title('Test Data: Logistic Regression\nRandom Under-Sampling',size=16)
ax3D1a.set_xlabel('Predicted Label',size=14)
ax3D1a.set_ylabel('True Label',size=14)

# Random Forest
disp = plot_confusion_matrix(tuned_rus_rf,X_test,y_test,normalize='all',
                             ax=ax3D1b,display_labels=y_class_names,cmap='Reds')
ax3D1b.set_title('Test Data: Random Forest\nRandom Under-Sampling',size=16)
ax3D1b.set_xlabel('Predicted Label',size=14)
ax3D1b.set_ylabel('True Label',size=14)

#plt.savefig('../graphics/CP1-04b_fig03D1.png') # Export confusion matrix plot to PNG file
plt.show()

In [None]:
# Plot the precision-recall curves
no_skill = len(y_test[y_test==1]) / len(y_test)
fig3D2, (ax3D2a,ax3D2b) = plt.subplots(1,2,figsize=(16,8))

# Logistic Regression
ax3D2a.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
ax3D2a.plot(recall_rus_lrcv, precision_rus_lrcv, marker='.', label='Logistic')
ax3D2a.set_title('Precision-Recall Curve: Logistic Regression\nRandom Under-Sampling',size=16)
ax3D2a.set_xlabel('Recall',size=14)
ax3D2a.set_ylabel('Precision',size=14)
ax3D2a.legend()

# Random Forest
ax3D2b.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
ax3D2b.plot(recall_rus_rf, precision_rus_rf, marker='.', label='Random Forest')
ax3D2b.set_title('Precision-Recall Curve: Random Forest\nRandom Under-Sampling',size=16)
ax3D2b.set_xlabel('Recall',size=14)
ax3D2b.set_ylabel('Precision',size=14)
ax3D2b.legend()

#plt.savefig('../graphics/CP1-04b_fig03D2.png') # Export precision-recall curves to PNG file
plt.show()

<a href=#TOC>TOC</a>

<hr>

<h2 style="text-transform: uppercase;">4. Near Miss (NM)</h2>

<a id="Sec04A"></a>
<h4>4A: Instantiate Training Pipelines</h4>

In [64]:
from imblearn.under_sampling import NearMiss

In [65]:
# Make pipeline for sampling method and LogisticRegressionCV (LRCV)
clf_nm_lrcv = make_pipeline(NearMiss(),clf_lrcv)

# Make pipeline for sampling method and Random Forest (RF)
clf_nm_rf = make_pipeline(NearMiss(),clf_rf)
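For intuition, the NearMiss-1 rule (the `version=1` default) keeps the majority observations whose average distance to their nearest minority neighbours is smallest. The sketch below applies that rule to a few hypothetical 2-D points with the standard library; the function `near_miss_1` is illustrative and omits the options of the real `NearMiss` sampler.

```python
import math

def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points whose mean distance to their
    k nearest minority points is smallest (NearMiss-1 rule)."""
    def mean_knn_dist(point):
        nearest = sorted(math.dist(point, q) for q in minority)[:k]
        return sum(nearest) / len(nearest)
    return sorted(majority, key=mean_knn_dist)[:n_keep]

# Hypothetical 2-D data
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (2.0, 2.0), (5.0, 5.0), (0.2, 0.1), (9.0, 9.0)]

# Balance the classes by keeping as many majority points as minority points
kept = near_miss_1(majority, minority, n_keep=len(minority), k=3)
print(kept)
```

Unlike random under-sampling, the selection is deterministic and distance-based, which is why `NearMiss()` takes no `random_state` in the pipelines above.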

In [66]:
# Get parameters for LRCV pipeline
pprint(clf_nm_lrcv.get_params())

{'logisticregressioncv': LogisticRegressionCV(Cs=10,
                     class_weight={0: 0.5221381075292209,
                                   1: 11.792744859515064},
                     cv=None, dual=False, fit_intercept=True,
                     intercept_scaling=1.0, l1_ratios=None, max_iter=10000,
                     multi_class='auto', n_jobs=None, penalty='l2',
                     random_state=917, refit=True, scoring='recall_weighted',
                     solver='lbfgs', tol=0.0001, verbose=0),
 'logisticregressioncv__Cs': 10,
 'logisticregressioncv__class_weight': {0: 0.5221381075292209,
                                        1: 11.792744859515064},
 'logisticregressioncv__cv': None,
 'logisticregressioncv__dual': False,
 'logisticregressioncv__fit_intercept': True,
 'logisticregressioncv__intercept_scaling': 1.0,
 'logisticregressioncv__l1_ratios': None,
 'logisticregressioncv__max_iter': 10000,
 'logisticregressioncv__multi_class': 'auto',
 'logisticregressioncv__n_j

In [67]:
# Get parameters for RF pipeline
pprint(clf_nm_rf.get_params())

{'memory': None,
 'nearmiss': NearMiss(n_jobs=None, n_neighbors=3, n_neighbors_ver3=3,
         sampling_strategy='auto', version=1),
 'nearmiss__n_jobs': None,
 'nearmiss__n_neighbors': 3,
 'nearmiss__n_neighbors_ver3': 3,
 'nearmiss__sampling_strategy': 'auto',
 'nearmiss__version': 1,
 'randomforestclassifier': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                       class_weight={0: 0.5221381075292209,
                                     1: 11.792744859515064},
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=917,
                       verbose=0, warm_start=True),
 'randomforestclassifier__bootstrap'

<a href=#TOC>TOC</a>

<a id="Sec04B"></a>
<h4>4B: Parameter Tuning - LRCV Pipeline</h4>

In [68]:
# Create parameter ranges for LRCV pipeline
params_nm_lrcv ={
    'nearmiss__version': [1,2,3],
    'logisticregressioncv__Cs': [1,10],
    'logisticregressioncv__cv': [3,5],
    'logisticregressioncv__solver': ['lbfgs','sag']
}

In [69]:
scores = ['balanced_accuracy','recall_weighted']

# Instantiate randomized search across finite combinations
rnd_nm_lrcv = RandomizedSearchCV(estimator = clf_nm_lrcv,
                                 param_distributions = params_nm_lrcv,
                                 n_iter = 4,
                                 scoring=scores,
                                 refit='recall_weighted',
                                 error_score=0,
                                 random_state = RANDOM_STATE)

# Instantiate exhaustive search across all combinations
grd_nm_lrcv = GridSearchCV(estimator = clf_nm_lrcv,
                           param_grid = params_nm_lrcv,
                           scoring=scores,
                           refit='recall_weighted',
                           error_score=0)

In [None]:
# Fit base pipeline
clf_nm_lrcv.fit(X_train,y_train)

In [None]:
# Fit the random search model
rnd_nm_lrcv.fit(X_train,y_train)

In [None]:
# Fit the grid search model
grd_nm_lrcv.fit(X_train,y_train)

<a href=#TOC>TOC</a>

<a id="Sec04C"></a>
<h4>4C: Parameter Tuning - RF Pipeline</h4>

In [70]:
# Create parameter ranges for RF
params_nm_rf ={
    'nearmiss__version': [1,2,3],
    'randomforestclassifier__bootstrap': [False,True],
    'randomforestclassifier__max_features': ['auto', 'sqrt'],
    'randomforestclassifier__n_estimators': [50,100,150],
    'randomforestclassifier__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'randomforestclassifier__min_samples_leaf': [1, 2, 4],
    'randomforestclassifier__min_samples_split': [2, 5, 10]
}

In [71]:
scores = ['balanced_accuracy','recall_weighted']

# Instantiate randomized search across combinations
rnd_nm_rf = RandomizedSearchCV(estimator = clf_nm_rf,
                               param_distributions = params_nm_rf,
                               n_iter = 20,
                               scoring=scores,
                               refit='recall_weighted',
                               error_score=0,
                               random_state = RANDOM_STATE)

# Instantiate exhaustive search across all combinations
grd_nm_rf = GridSearchCV(estimator = clf_nm_rf,
                         param_grid = params_nm_rf,
                         scoring=scores,
                         refit='recall_weighted',
                         error_score=0)

In [None]:
# Fit base pipeline
clf_nm_rf.fit(X_train,y_train)

In [None]:
# Fit the random search model
rnd_nm_rf.fit(X_train,y_train)

In [None]:
# Fit the grid search model
grd_nm_rf.fit(X_train,y_train)

<a href=#TOC>TOC</a>

<a id="Sec04D"></a>
<h4>4D: Evaluate Tuned Classifiers</h4>

In [None]:
# Get best parameters for LRCV pipeline
pprint(grd_nm_lrcv.best_params_)

# Get best parameters for RF pipeline
pprint(grd_nm_rf.best_params_)

In [65]:
# Instantiate tuned classifiers
tuned_nm_lrcv = make_pipeline(NearMiss(),
                              LogisticRegressionCV(random_state=RANDOM_STATE,cv=5,
                                                   class_weight=cw_dict,
                                                   max_iter=10000,
                                                   scoring='recall_weighted'))

tuned_nm_rf = make_pipeline(NearMiss(),
                            RandomForestClassifier(random_state=RANDOM_STATE,
                                                   class_weight=cw_dict,warm_start=True))

In [None]:
# Fit tuned models
tuned_nm_lrcv.fit(X_train,y_train)
tuned_nm_rf.fit(X_train,y_train)

In [69]:
# Make prediction with the classifier
ypred_test_nm_lrcv = tuned_nm_lrcv.predict(X_test)

# Predict class probabilities and obtain scores
probs_nm_lrcv = tuned_nm_lrcv.predict_proba(X_test)
probs_nm_lrcv = probs_nm_lrcv[:, 1]
precision_nm_lrcv, recall_nm_lrcv, _ = precision_recall_curve(y_test, probs_nm_lrcv)
f1_nm_lrcv, auc_nm_lrcv = f1_score(y_test, ypred_test_nm_lrcv), auc(recall_nm_lrcv, precision_nm_lrcv)

# Make prediction with the classifier
ypred_test_nm_rf = tuned_nm_rf.predict(X_test)

# Predict class probabilities and obtain scores
probs_nm_rf = tuned_nm_rf.predict_proba(X_test)
probs_nm_rf = probs_nm_rf[:, 1]
precision_nm_rf, recall_nm_rf, _ = precision_recall_curve(y_test, probs_nm_rf)
f1_nm_rf, auc_nm_rf = f1_score(y_test, ypred_test_nm_rf), auc(recall_nm_rf, precision_nm_rf)

In [None]:
# Evaluation metrics for Logistic Regression classifier
print('CLASSIFIER: LOGISTIC REGRESSION w/ NearMiss Sampling')
print()

# Compute and print key scores
print(f"Accuracy score:          {accuracy_score(y_test, ypred_test_nm_lrcv):.4f} (y_test, ypred_test)")
print(f"Balanced accuracy score: {balanced_accuracy_score(y_test, ypred_test_nm_lrcv):.4f} (y_test, ypred_test)")
print(f"Recall score:            {recall_score(y_test, ypred_test_nm_lrcv):.4f} (y_test, ypred_test)")
print(f"F1 score:                {f1_nm_lrcv:.4f} (y_test, ypred_test)")
print(f"AUC score:               {auc_nm_lrcv:.4f} (recall, precision)")
print()

# Show the classification report
print(classification_report_imbalanced(y_test, ypred_test_nm_lrcv))

print()
print()

# Evaluation metrics for Random Forest classifier
print('CLASSIFIER: RANDOM FOREST w/ NearMiss Sampling')
print()

# Compute and print key scores
print(f"Accuracy score:          {accuracy_score(y_test, ypred_test_nm_rf):.4f} (y_test, ypred_test)")
print(f"Balanced accuracy score: {balanced_accuracy_score(y_test, ypred_test_nm_rf):.4f} (y_test, ypred_test)")
print(f"Recall score:            {recall_score(y_test, ypred_test_nm_rf):.4f} (y_test, ypred_test)")
print(f"F1 score:                {f1_nm_rf:.4f} (y_test, ypred_test)")
print(f"AUC score:               {auc_nm_rf:.4f} (recall, precision)")

# Show the classification report
print(classification_report_imbalanced(y_test, ypred_test_nm_rf))

In [None]:
# Normalized confusion matrix plot for test data
y_class_names = ['Survivals','Fatalities']
fig4D1, (ax4D1a,ax4D1b) = plt.subplots(1,2,figsize=(16,8))

# Logistic Regression
disp = plot_confusion_matrix(tuned_nm_lrcv,X_test,y_test,normalize='all',
                             ax=ax4D1a,display_labels=y_class_names,cmap='Reds')
ax4D1a.set_title('Test Data: Logistic Regression\nNearMiss Sampling',size=16)
ax4D1a.set_xlabel('Predicted Label',size=14)
ax4D1a.set_ylabel('True Label',size=14)

# Random Forest
disp = plot_confusion_matrix(tuned_nm_rf,X_test,y_test,normalize='all',
                             ax=ax4D1b,display_labels=y_class_names,cmap='Reds')
ax4D1b.set_title('Test Data: Random Forest\nNearMiss Sampling',size=16)
ax4D1b.set_xlabel('Predicted Label',size=14)
ax4D1b.set_ylabel('True Label',size=14)

#plt.savefig('../graphics/CP1-04b_fig04D1.png') # Export confusion matrix plot to PNG file
plt.show()

In [None]:
# Plot the precision-recall curves
no_skill = len(y_test[y_test==1]) / len(y_test)
fig4D2, (ax4D2a,ax4D2b) = plt.subplots(1,2,figsize=(16,8))

# Logistic Regression
ax4D2a.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
ax4D2a.plot(recall_nm_lrcv, precision_nm_lrcv, marker='.', label='Logistic')
ax4D2a.set_title('Precision-Recall Curve: Logistic Regression\nNearMiss Sampling',size=16)
ax4D2a.set_xlabel('Recall',size=14)
ax4D2a.set_ylabel('Precision',size=14)
ax4D2a.legend()

# Random Forest
ax4D2b.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
ax4D2b.plot(recall_nm_rf, precision_nm_rf, marker='.', label='Random Forest')
ax4D2b.set_title('Precision-Recall Curve: Random Forest\nNearMiss Sampling',size=16)
ax4D2b.set_xlabel('Recall',size=14)
ax4D2b.set_ylabel('Precision',size=14)
ax4D2b.legend()

#plt.savefig('../graphics/CP1-04b_fig04D2.png') # Export precision-recall curves to PNG file
plt.show()

<a href=#TOC>TOC</a>

<hr>

<h2 style="text-transform: uppercase;">5. Tomek's Links (TL)</h2>

<a id="Sec05A"></a>
<h4>5A: Instantiate Training Pipelines</h4>

In [72]:
from imblearn.under_sampling import TomekLinks

In [73]:
# Make pipeline for sampling method and LogisticRegressionCV (LRCV)
clf_tl_lrcv = make_pipeline(TomekLinks(),clf_lrcv)

# Make pipeline for sampling method and Random Forest (RF)
clf_tl_rf = make_pipeline(TomekLinks(),clf_rf)
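
For intuition about what the `TomekLinks()` step removes: a Tomek link is a pair of samples from opposite classes that are each other's nearest neighbours. The NumPy sketch below finds such pairs on a toy one-dimensional dataset and drops the majority-class member of each link. It is a simplified illustration, not imbalanced-learn's implementation; the function `find_tomek_links` and the arrays `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances, with self-distances masked out
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((int(i), int(j)))
    return links

# Toy data: class 0 is the majority, class 1 the minority
X_toy = np.array([[0.0], [0.7], [2.0], [2.4], [5.0], [6.0], [0.3]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 0])

links = find_tomek_links(X_toy, y_toy)

# Drop only the majority-class member of each link
majority_idx = [i if y_toy[i] == 0 else j for i, j in links]
X_kept = np.delete(X_toy, majority_idx, axis=0)
y_kept = np.delete(y_toy, majority_idx)

print("Tomek links:", links)
print("Samples kept:", len(y_kept), "of", len(y_toy))
```

Because only samples sitting on the class boundary are removed, this kind of cleaning typically shrinks the majority class far less than random under-sampling or NearMiss does.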

In [74]:
# Get parameters for LRCV pipeline
pprint(clf_tl_lrcv.get_params())

{'logisticregressioncv': LogisticRegressionCV(Cs=10,
                     class_weight={0: 0.5221381075292209,
                                   1: 11.792744859515064},
                     cv=None, dual=False, fit_intercept=True,
                     intercept_scaling=1.0, l1_ratios=None, max_iter=10000,
                     multi_class='auto', n_jobs=None, penalty='l2',
                     random_state=917, refit=True, scoring='recall_weighted',
                     solver='lbfgs', tol=0.0001, verbose=0),
 'logisticregressioncv__Cs': 10,
 'logisticregressioncv__class_weight': {0: 0.5221381075292209,
                                        1: 11.792744859515064},
 'logisticregressioncv__cv': None,
 'logisticregressioncv__dual': False,
 'logisticregressioncv__fit_intercept': True,
 'logisticregressioncv__intercept_scaling': 1.0,
 'logisticregressioncv__l1_ratios': None,
 'logisticregressioncv__max_iter': 10000,
 'logisticregressioncv__multi_class': 'auto',
 'logisticregressioncv__n_j

In [75]:
# Get parameters for RF pipeline
pprint(clf_tl_rf.get_params())

{'memory': None,
 'randomforestclassifier': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                       class_weight={0: 0.5221381075292209,
                                     1: 11.792744859515064},
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=917,
                       verbose=0, warm_start=True),
 'randomforestclassifier__bootstrap': True,
 'randomforestclassifier__ccp_alpha': 0.0,
 'randomforestclassifier__class_weight': {0: 0.5221381075292209,
                                          1: 11.792744859515064},
 'randomforestclassifier__criterion': 'gini',
 'randomforestclassifier__max_depth': None,


<a href=#TOC>TOC</a>

<a id="Sec05B"></a>
<h4>5B: Parameter Tuning - LRCV Pipeline</h4>

In [76]:
# Create parameter ranges for LRCV pipeline
params_tl_lrcv ={
    'logisticregressioncv__Cs': [1,10],
    'logisticregressioncv__cv': [3,5],
    'logisticregressioncv__solver': ['lbfgs','sag']
}

In [77]:
scores = ['balanced_accuracy','recall_weighted']

# Instantiate randomized search across finite combinations
rnd_tl_lrcv = RandomizedSearchCV(estimator = clf_tl_lrcv, 
                                  param_distributions = params_tl_lrcv, 
                                  n_iter = 4,
                                  scoring=scores,
                                  refit='recall_weighted',
                                  error_score=0,
                                  random_state = RANDOM_STATE)

# Instantiate exhaustive search across all combinations
grd_tl_lrcv = GridSearchCV(estimator = clf_tl_lrcv, 
                            param_grid = params_tl_lrcv,
                            scoring=scores,
                            refit='recall_weighted',
                            error_score=0)
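
Both searches score every candidate on two metrics, and `refit='recall_weighted'` decides which of them selects `best_params_`. The sketch below shows, on a small synthetic dataset, where the per-metric results end up in `cv_results_`. The data and the names `X_mm`, `y_mm`, and `mm_search` are hypothetical and only illustrate the mechanism.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced toy data (assumption: stands in for X_train, y_train)
X_mm, y_mm = make_classification(n_samples=300, n_features=6,
                                 weights=[0.9, 0.1], random_state=0)

mm_search = GridSearchCV(estimator=LogisticRegression(max_iter=1000),
                         param_grid={'C': [0.01, 1.0, 100.0]},
                         scoring=['balanced_accuracy', 'recall_weighted'],
                         refit='recall_weighted',
                         cv=3)
mm_search.fit(X_mm, y_mm)

# Each metric gets its own mean_test_<name> and rank_test_<name> entries
results = mm_search.cv_results_
for params, bal_acc, rec_w in zip(results['params'],
                                  results['mean_test_balanced_accuracy'],
                                  results['mean_test_recall_weighted']):
    print(params, round(bal_acc, 4), round(rec_w, 4))

# best_params_ follows the metric named in `refit`
print(mm_search.best_params_)
```

Comparing the two `mean_test_*` columns side by side is a quick way to check whether the candidate chosen for weighted recall also holds up on balanced accuracy.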

In [None]:
# Fit base pipeline
clf_tl_lrcv.fit(X_train,y_train)

In [None]:
# Fit the random search model
rnd_tl_lrcv.fit(X_train,y_train)

In [None]:
# Fit the grid search model
grd_tl_lrcv.fit(X_train,y_train)

<a href=#TOC>TOC</a>

<a id="Sec05C"></a>
<h4>5C: Parameter Tuning - RF Pipeline</h4>

In [78]:
# Create parameter ranges for RF
params_tl_rf ={
    'randomforestclassifier__bootstrap': [False,True],
    'randomforestclassifier__max_features': ['auto', 'sqrt'],  # for classifiers in this scikit-learn version, 'auto' behaves like 'sqrt'
    'randomforestclassifier__n_estimators': [50,100,150],
    'randomforestclassifier__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'randomforestclassifier__min_samples_leaf': [1, 2, 4],
    'randomforestclassifier__min_samples_split': [2, 5, 10]
}

In [79]:
scores = ['balanced_accuracy','recall_weighted']

# Instantiate randomized search across combinations
rnd_tl_rf = RandomizedSearchCV(estimator = clf_tl_rf, 
                                param_distributions = params_tl_rf, 
                                n_iter = 20,
                                scoring=scores,
                                refit='recall_weighted',
                                error_score=0,
                                random_state = RANDOM_STATE)

# Instantiate exhaustive search across all combinations
grd_tl_rf = GridSearchCV(estimator = clf_tl_rf, 
                          param_grid = params_tl_rf,
                          scoring=scores,
                          refit='recall_weighted',
                          error_score=0)

In [None]:
# Fit base pipeline
clf_tl_rf.fit(X_train,y_train)

In [None]:
# Fit the random search model
rnd_tl_rf.fit(X_train,y_train)

In [None]:
# Fit the grid search model
grd_tl_rf.fit(X_train,y_train)

<a href=#TOC>TOC</a>

<a id="Sec05D"></a>
<h4>5D: Evaluate Tuned Classifiers</h4>

In [None]:
# Get best parameters for LRCV pipeline
# (printed explicitly: a notebook cell only echoes its last expression)
print(grd_tl_lrcv.best_params_)

# Get best parameters for RF pipeline
print(grd_tl_rf.best_params_)

In [65]:
# Instantiate tuned classifiers
tuned_tl_lrcv = make_pipeline(TomekLinks(),
                               LogisticRegressionCV(random_state=RANDOM_STATE,cv=5,
                                                    class_weight=cw_dict,
                                                    max_iter=10000,
                                                    scoring='recall_weighted'))

tuned_tl_rf = make_pipeline(TomekLinks(),
                             RandomForestClassifier(random_state=RANDOM_STATE,
                                                    class_weight=cw_dict,warm_start=True))

In [None]:
# Fit tuned models
tuned_tl_lrcv.fit(X_train,y_train)
tuned_tl_rf.fit(X_train,y_train)

In [69]:
# Make prediction with the classifier
ypred_test_tl_lrcv = tuned_tl_lrcv.predict(X_test)

# Predict class probabilities and obtain scores
probs_tl_lrcv = tuned_tl_lrcv.predict_proba(X_test)
probs_tl_lrcv = probs_tl_lrcv[:, 1]
precision_tl_lrcv, recall_tl_lrcv, _ = precision_recall_curve(y_test, probs_tl_lrcv)
f1_tl_lrcv, auc_tl_lrcv = f1_score(y_test, ypred_test_tl_lrcv), auc(recall_tl_lrcv, precision_tl_lrcv)

# Make prediction with the classifier
ypred_test_tl_rf = tuned_tl_rf.predict(X_test)

# Predict class probabilities and obtain scores
probs_tl_rf = tuned_tl_rf.predict_proba(X_test)
probs_tl_rf = probs_tl_rf[:, 1]
precision_tl_rf, recall_tl_rf, _ = precision_recall_curve(y_test, probs_tl_rf)
f1_tl_rf, auc_tl_rf = f1_score(y_test, ypred_test_tl_rf), auc(recall_tl_rf, precision_tl_rf)

In [None]:
# Evaluation metrics for Logistic Regression classifier
print('CLASSIFIER: LOGISTIC REGRESSION w/ Tomek\'s Link Sampling')
print()

# Compute and print key scores
print(f"Accuracy score:          {accuracy_score(y_test, ypred_test_tl_lrcv):.4f} (y_test, ypred_test)")
print(f"Balanced accuracy score: {balanced_accuracy_score(y_test, ypred_test_tl_lrcv):.4f} (y_test, ypred_test)")
print(f"Recall score:            {recall_score(y_test, ypred_test_tl_lrcv):.4f} (y_test, ypred_test)")
print(f"F1 score:                {f1_tl_lrcv:.4f} (y_test, ypred_test)")
print(f"AUC score:               {auc_tl_lrcv:.4f} (recall, precision)")
print()

# Show the classification report
print(classification_report_imbalanced(y_test, ypred_test_tl_lrcv))

print()
print()

# Evaluation metrics for Random Forest classifier
print('CLASSIFIER: RANDOM FOREST w/ Tomek\'s Link Sampling')
print()

# Compute and print key scores
print(f"Accuracy score:          {accuracy_score(y_test, ypred_test_tl_rf):.4f} (y_test, ypred_test)")
print(f"Balanced accuracy score: {balanced_accuracy_score(y_test, ypred_test_tl_rf):.4f} (y_test, ypred_test)")
print(f"Recall score:            {recall_score(y_test, ypred_test_tl_rf):.4f} (y_test, ypred_test)")
print(f"F1 score:                {f1_tl_rf:.4f} (y_test, ypred_test)")
print(f"AUC score:               {auc_tl_rf:.4f} (recall, precision)")

# Show the classification report
print(classification_report_imbalanced(y_test, ypred_test_tl_rf))

In [None]:
# Normalized confusion matrix plot for test data
y_class_names = ['Survivals','Fatalities']
fig5D1, (ax5D1a,ax5D1b) = plt.subplots(1,2,figsize=(16,8))

# Logistic Regression
disp = plot_confusion_matrix(tuned_tl_lrcv,X_test,y_test,normalize='all',
                             ax=ax5D1a,display_labels=y_class_names,cmap='Reds')
ax5D1a.set_title('Test Data: Logistic Regression\nTomek\'s Link Sampling',size=16)
ax5D1a.set_xlabel('Predicted Label',size=14)
ax5D1a.set_ylabel('True Label',size=14)

# Random Forest
disp = plot_confusion_matrix(tuned_tl_rf,X_test,y_test,normalize='all',
                             ax=ax5D1b,display_labels=y_class_names,cmap='Reds')
ax5D1b.set_title('Test Data: Random Forest\nTomek\'s Link Sampling',size=16)
ax5D1b.set_xlabel('Predicted Label',size=14)
ax5D1b.set_ylabel('True Label',size=14)

#plt.savefig('../graphics/CP1-04b_fig05D1.png') # Export confusion matrix plot to PNG file
plt.show()

In [None]:
# Plot the precision-recall curves
no_skill = len(y_test[y_test==1]) / len(y_test)
fig5D2, (ax5D2a,ax5D2b) = plt.subplots(1,2,figsize=(16,8))

# Logistic Regression
ax5D2a.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
ax5D2a.plot(recall_tl_lrcv, precision_tl_lrcv, marker='.', label='Logistic')
ax5D2a.set_title('Precision-Recall Curve: Logistic Regression\nTomek\'s Link Sampling',size=16)
ax5D2a.set_xlabel('Recall',size=14)
ax5D2a.set_ylabel('Precision',size=14)
ax5D2a.legend()

# Random Forest
ax5D2b.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
ax5D2b.plot(recall_tl_rf, precision_tl_rf, marker='.', label='Random Forest')
ax5D2b.set_title('Precision-Recall Curve: Random Forest\nTomek\'s Link Sampling',size=16)
ax5D2b.set_xlabel('Recall',size=14)
ax5D2b.set_ylabel('Precision',size=14)
ax5D2b.legend()

#plt.savefig('../graphics/CP1-04b_fig05D2.png') # Export precision-recall curves to PNG file
plt.show()

<a href=#TOC>TOC</a>

<hr>

<h2 style="text-transform: uppercase;">6. Evaluation of Classifiers</h2>

<a id="Sec06A"></a>
<h4>6A: Comparison of Baseline Models</h4>

In [80]:
# Define function to compare classifier metrics
def eval_clf(clf, df_scores, clf_name=None):
    from sklearn.pipeline import Pipeline
    if clf_name is None:
        if isinstance(clf, Pipeline):
            clf_name = clf[-1].__class__.__name__
        else:
            clf_name = clf.__class__.__name__
    acc = clf.fit(X_train, y_train).score(X_test, y_test)
    y_pred = clf.predict(X_test)
    bal_acc = balanced_accuracy_score(y_test, y_pred)
    prec_score = precision_score(y_test, y_pred) # precision score
    rec_score = recall_score(y_test, y_pred) # recall score
    
    probs = clf.predict_proba(X_test)
    probs = probs[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, probs)
    f1, auc_val = f1_score(y_test, y_pred), auc(recall, precision) #F1 score and AUC
   
    clf_score = pd.DataFrame(
        {clf_name: [acc, bal_acc, prec_score, rec_score, f1, auc_val]},
        index=['Accuracy', 'Balanced Accuracy',
               'Precision Score','Recall Score',
               'F1 Score','P-R AUC']
    )
    df_scores = pd.concat([df_scores, clf_score], axis=1).round(decimals=4)
    return df_scores
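
The "P-R AUC" row produced by `eval_clf` is the trapezoidal area under the precision-recall curve. On imbalanced data it is best read against the no-skill baseline, which equals the fraction of positive samples. The sketch below reproduces both quantities on a tiny hand-made example; the arrays `y_true_demo` and `scores_demo` are hypothetical values chosen only to make the calculation easy to follow.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

# Hypothetical labels and predicted probabilities for the positive class
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores_demo = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90])

# Precision and recall at every score threshold
precision_demo, recall_demo, _ = precision_recall_curve(y_true_demo, scores_demo)

# Trapezoidal area under the curve, the same call used in eval_clf
pr_auc_demo = auc(recall_demo, precision_demo)

# A classifier with no skill has precision equal to the positive-class rate
no_skill_demo = y_true_demo.mean()

print(f"P-R AUC:  {pr_auc_demo:.4f}")
print(f"No skill: {no_skill_demo:.4f}")
```

A P-R AUC close to the no-skill value suggests the classifier ranks fatalities barely better than chance, even when its accuracy looks high.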

In [81]:
# Instantiate dataframe to contain scores for baseline classifiers
df_base_scores = pd.DataFrame()

In [83]:
df_base_scores = eval_clf(clf_rus_lrcv, df_base_scores, "LR w/ RUS")

In [None]:
df_base_scores = eval_clf(clf_nm_lrcv, df_base_scores, "LR w/ NearMiss")

In [None]:
df_base_scores = eval_clf(clf_tl_lrcv, df_base_scores, "LR w/ Tomek's Links")

In [None]:
df_base_scores = eval_clf(clf_rus_rf, df_base_scores, "RF w/ RUS")

In [None]:
df_base_scores = eval_clf(clf_nm_rf, df_base_scores, "RF w/ NearMiss")

In [None]:
df_base_scores = eval_clf(clf_tl_rf, df_base_scores, "RF w/ Tomek's Links")

In [None]:
print('BASELINE CLASSIFIERS')
df_base_scores

<a href=#TOC>TOC</a>

<a id="Sec06B"></a>
<h4>6B: Comparison of Tuned Models</h4>

In [None]:
# Instantiate dataframe to contain scores for tuned classifiers
df_tuned_scores = pd.DataFrame()

In [None]:
df_tuned_scores = eval_clf(tuned_rus_lrcv, df_tuned_scores, "LR w/ RUS")

In [None]:
df_tuned_scores = eval_clf(tuned_nm_lrcv, df_tuned_scores, "LR w/ NearMiss")

In [None]:
df_tuned_scores = eval_clf(tuned_tl_lrcv, df_tuned_scores, "LR w/ Tomek's Links")

In [None]:
df_tuned_scores = eval_clf(tuned_rus_rf, df_tuned_scores, "RF w/ RUS")

In [None]:
df_tuned_scores = eval_clf(tuned_nm_rf, df_tuned_scores, "RF w/ NearMiss")

In [None]:
df_tuned_scores = eval_clf(tuned_tl_rf, df_tuned_scores, "RF w/ Tomek's Links")

In [None]:
print('TUNED CLASSIFIERS')
df_tuned_scores

<a href=#TOC>TOC</a>

<a id="Sec06C"></a>
<h4>6C: Summary of Analyses</h4>

<a href=#TOC>TOC</a>