# Imbalanced Supervised Machine Learning

In this notebook, we train the following supervised ML models: 

- Logistic Regression

- Support Vector Machine classifier

- Random Forest classifier

- XGBoost classifier

- SMOTE with XGBoost classifier

The class distribution is 99.84 $\%$ majority class and 0.16 $\%$ minority class. The cross-validated performance metrics on the training set show that none of these models correctly predicted the minority class, even though their overall accuracy is above 99 $\%$. This shows that accuracy is not a good performance measure for imbalanced datasets. 
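
To make the point about accuracy concrete, the sketch below scores a hypothetical classifier that always predicts the majority class, using the class counts of this dataset (89,846 negatives and 144 positives). The variable names are illustrative and not part of the notebook's helper code.

```python
# Baseline: a classifier that always predicts the majority class (label 0).
# Counts are taken from the class distribution of this dataset.
n_neg, n_pos = 89846, 144

true_neg = n_neg   # every negative is predicted correctly
false_neg = n_pos  # every positive is missed
true_pos = 0       # no positive is ever predicted

accuracy = (true_pos + true_neg) / (n_neg + n_pos)
recall = true_pos / (true_pos + false_neg)

print(f"accuracy = {accuracy:.4%}, minority recall = {recall:.1%}")
```

A model can therefore reach the accuracy reported in this notebook without learning anything about the minority class.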

Optimizing the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC) did not improve the predictive power of the models, so we optimized the recall.  

SMOTE in combination with the XGBoost classifier was applied to the training set, but it did not improve performance on the imbalanced test set.

In [1]:
%matplotlib inline

# Suppress all warnings (including deprecation warnings)
import warnings
warnings.filterwarnings("ignore")

# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set font scale and style
plt.rcParams.update({'font.size': 15})

# Resampling
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Machine learning models
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Grid search and model selection
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split

# Model performance metrics
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, auc, recall_score, precision_recall_curve
from sklearn.metrics import roc_curve, roc_auc_score, average_precision_score

# Pickle
import joblib

In [2]:
# Import custom class
%run -i '../src/helper/transfxn.py'
%run -i '../src/helper/ml.py'
%run -i '../src/helper/imputer.py'

In [3]:
# Instantiate the classes
transfxn = TransformationPipeline()
imputer = DataFrameImputer()
model = SupervisedModels()

# Load data

In [4]:
df = pd.read_csv('../data/feat_engr_data.csv') # Load cleaned data

df = df.sample(frac =1).reset_index(drop = True) # shuffle

print('Data size:', df.shape) # data size
df.head()

Data size: (89990, 31)


| | x | y | environment | light | surface_condition | traffic_control | traffic_control_condition | class | impact_type | no_of_pedestrians | ... | hr_per_day | impact_per_hour | impact_per_day | envmt_per_hour | surcond_per_hour | enviro_ind | impact_ind | traf_cont_ind | sur_cont_ind | light_ind |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 385914.8054 | 5038118.0 | 01 - Clear | 05 - Dusk | 05 - Packed snow | 10 - No control | | 0 | 06 - SMV unattended vehicle | 0 | ... | 0.438425 | 1.563514 | 0.685484 | 15.23047 | 0.520883 | N | N | N | N | N |
| 1 | 372695.25 | 5023408.0 | 02 - Rain | 07 - Dark | 03 - Loose snow | 10 - No control | | 0 | 06 - SMV unattended vehicle | 0 | ... | 0.192006 | 2.422058 | 0.465049 | 2.730808 | 1.965806 | N | N | N | N | Y |
| 2 | 356891.0625 | 5006390.0 | 01 - Clear | 01 - Daylight | 01 - Dry | 01 - Traffic signal | 01 - Functioning | 0 | 04 - Sideswipe | 0 | ... | 0.331986 | 2.948368 | 0.978816 | 17.14077 | 14.264247 | N | N | N | N | N |
| 3 | 359589.46875 | 5019488.0 | 01 - Clear | 01 - Daylight | 01 - Dry | 10 - No control | | 0 | 03 - Rear end | 0 | ... | 0.445636 | 6.483926 | 2.889469 | 14.984032 | 12.469449 | N | N | N | N | N |
| 4 | 378187.65625 | 5014852.0 | 01 - Clear | 07 - Dark | 01 - Dry | 10 - No control | | 0 | 03 - Rear end | 0 | ... | 0.25429 | 7.966257 | 2.025742 | 18.409626 | 15.320167 | N | N | N | N | Y |


# Class distribution

In [5]:
label_pct = df['class'].value_counts(normalize = True)*100
label_ct =  df['class'].value_counts()
pd.DataFrame({'labels': label_pct.index, 'count': label_ct.values, 'percentage': label_pct.values})

| | labels | count | percentage |
|---|---|---|---|
| 0 | 0 | 89846 | 99.839982 |
| 1 | 1 | 144 | 0.160018 |


# Prepare the Data for Machine Learning

In [6]:
# Feature matrix and class label
X, y = df.drop('class', axis = 1), df['class']

# Create a test set
We now split the data set into an 80 $\%$ training set and a 20 $\%$ test set in a stratified fashion, so that both sets keep the same class proportions.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 42, stratify = y)

In [8]:
print('Training set size:', X_train.shape, y_train.shape)
print('Test set size:', X_test.shape, y_test.shape)

Training set size: (71992, 30) (71992,)
Test set size: (17998, 30) (17998,)


In [9]:
print('Training set class distribution:\n', (y_train.value_counts()/X_train.shape[0])*100)
print('--' * 15)
print('Test set class distribution:\n', (y_test.value_counts()/X_test.shape[0])*100)

Training set class distribution:
 0    99.84026
1     0.15974
Name: class, dtype: float64
------------------------------
Test set class distribution:
 0    99.838871
1     0.161129
Name: class, dtype: float64
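
As a quick sanity check on the stratified split, the arithmetic below reproduces the printed class shares from the set sizes. The minority counts of 115 (training) and 29 (test) are inferred from the percentages above and the total of 144 positives; they are an assumption in the sense that the notebook does not print them at this point.

```python
# Set sizes reported by the notebook after the 80/20 split
n_train, n_test = 71992, 17998
# Minority counts inferred from the printed percentages (assumption)
pos_train, pos_test = 115, 29

pct_train = 100 * pos_train / n_train
pct_test = 100 * pos_test / n_test

print(f"train minority share: {pct_train:.5f}%")
print(f"test minority share:  {pct_test:.6f}%")
# Stratification keeps the two shares within a few thousandths of a percentage point
print(f"absolute gap: {abs(pct_train - pct_test):.6f} percentage points")
```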


# Transformation pipeline

# 1. Impute missing values

In [10]:
# Fit transform the training set
X_train_imputed = imputer.fit_transform(X_train)

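# Transform the test set with the imputer fitted on the training set
# (assumes DataFrameImputer exposes a scikit-learn style transform method)
X_test_imputed = imputer.transform(X_test)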
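
The reason the test set should only be transformed, not refit, is that imputation statistics must come from the training data alone. The sketch below illustrates this with plain NumPy mean imputation. It is a stand-in, since the internals of the custom `DataFrameImputer` are not shown in this notebook, and the toy arrays are invented for illustration.

```python
import numpy as np

# Toy data (hypothetical): one feature with missing values
X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test = np.array([[np.nan], [100.0]])

# "Fit": learn the fill value from the training set only
train_mean = np.nanmean(X_train, axis=0)

# "Transform": apply the same training statistic to both sets
X_train_imp = np.where(np.isnan(X_train), train_mean, X_train)
X_test_imp = np.where(np.isnan(X_test), train_mean, X_test)

# Refitting on the test set would instead use the test mean,
# leaking test-set information into the preprocessing step
leaky_mean = np.nanmean(X_test, axis=0)

print("train-fitted fill value:", train_mean[0])
print("test-fitted fill value: ", leaky_mean[0])
```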

# 2. Preprocessing

In [11]:
# Transform and scale data
X_train_scaled, X_test_scaled, feat_names = transfxn.preprocessing(X_train_imputed, X_test_imputed)

In [12]:
# Size of the data after pre-processing
print('Training set size after pre-processing:', X_train_scaled.shape)
print('Test set size after pre-processing:', X_test_scaled.shape)

Training set size after pre-processing: (71992, 101)
Test set size after pre-processing: (17998, 101)


In [13]:
# Convert the class labels to arrays
y_train, y_test = y_train.values,  y_test.values

# A. Model Selection by Cross-Validation

## A-1. Logistic Regression
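Logistic regression does not correctly identify a single minority-class sample (recall of 0 for class 1; its only two positive predictions are false positives), even though the accuracy is 99.8 $\%$ and the AUROC is 0.84. The AUPRC of 0.011 reflects this failure far better than the accuracy does.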

In [14]:
log_clf = LogisticRegression()   
model.eval_metrics_cv(log_clf, X_train_scaled, y_train, cv_fold = 5, scoring = 'accuracy',
                      model_nm = "Logistic Regression")

5-Fold cross-validation results for Logistic Regression
------------------------------------------------------------
Accuracy (std): 0.998375 (0.000034)
AUROC: 0.837079
AUPRC: 0.011036
Predicted classes: [0 1]
Confusion matrix:
 [[71875     2]
 [  115     0]]
Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     71877
           1       0.00      0.00      0.00       115

    accuracy                           1.00     71992
   macro avg       0.50      0.50      0.50     71992
weighted avg       1.00      1.00      1.00     71992

------------------------------------------------------------
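
The metrics in the report above can be recomputed by hand from the confusion matrix, which makes the gap between accuracy and minority-class performance explicit. The snippet also compares the reported AUPRC with the minority prevalence, which is the average precision a random ranking would be expected to achieve. Variable names are illustrative.

```python
# Confusion matrix reported above: rows = true class, columns = predicted class
tn, fp = 71875, 2
fn, tp = 115, 0

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)                           # minority-class recall
precision = tp / (tp + fp) if (tp + fp) else 0.0  # minority-class precision

# Prevalence of the minority class: the AUPRC expected from a random ranking
prevalence = (tp + fn) / total
reported_auprc = 0.011036

print(f"accuracy   = {accuracy:.6f}")
print(f"recall     = {recall:.2f}, precision = {precision:.2f}")
print(f"prevalence = {prevalence:.6f}")
print(f"AUPRC / prevalence = {reported_auprc / prevalence:.1f}")
```

In average-precision terms the model ranks positives several times better than chance, yet at the default decision threshold it recovers none of them.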


In [15]:
# Class ratio of the negative class 
# to the positive class
neg = y_train == 0
pos = y_train == 1
class_ratio = np.sum(neg)/np.sum(pos)
class_ratio

625.0173913043478
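
For context on the `class_weight` options tried in the search that follows: scikit-learn's `'balanced'` mode sets each class weight to `n_samples / (n_classes * class_count)`. The sketch below evaluates that formula with the training counts from the cross-validation report above (71,877 negatives, 115 positives) and compares the resulting weight ratio with `class_ratio`.

```python
# Training-set class counts (from the cross-validation report above)
n_neg, n_pos = 71877, 115
n_samples, n_classes = n_neg + n_pos, 2

# scikit-learn's 'balanced' heuristic: n_samples / (n_classes * class_count)
w_neg = n_samples / (n_classes * n_neg)
w_pos = n_samples / (n_classes * n_pos)

class_ratio = n_neg / n_pos

print(f"balanced weights: {{0: {w_neg:.4f}, 1: {w_pos:.2f}}}")
print(f"weight ratio w_pos / w_neg = {w_pos / w_neg:.4f}")
print(f"class_ratio               = {class_ratio:.4f}")
```

So `'balanced'` and `{0: 1, 1: class_ratio}` weight the two classes in the same proportion. They differ only in overall scale, which interacts with the regularization strength `C`.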

In [16]:
# Range of hyperparameters
params = {'C': [2**x for x in range(-2,9,2)], 
          'class_weight': ['balanced', {0:1, 1:3}, {0:1, 1:class_ratio}]
          }
                             
# Randomized search (n_iter exceeds the 18 possible combinations, so every candidate is evaluated)
gsearch_log = RandomizedSearchCV(estimator = log_clf, param_distributions = params, 
                                scoring = 'roc_auc', cv = 5, n_jobs = -1, 
                                 n_iter = 200,random_state = 42, verbose = 1)

# Fit the  training set
gsearch_log.fit(X_train_scaled, y_train)

# Pickle trained model
joblib.dump(gsearch_log.best_estimator_, '../src/model/log_clf.pkl')

# Print results
print('Grid search best AUC score:', gsearch_log.best_score_)
print('Grid search best parameters:', gsearch_log.best_params_)
print()
model.eval_metrics_cv(gsearch_log.best_estimator_, X_train_scaled, y_train, cv_fold = 5,
 scoring = 'accuracy', model_nm = "Logistic Regression with best hyperparameters")

Fitting 5 folds for each of 18 candidates, totalling 90 fits
Grid search best AUC score: 0.8387711007949609
Grid search best parameters: {'class_weight': {0: 1, 1: 3}, 'C': 0.25}

5-Fold cross-validation results for Logistic Regression with best hyperparameters
------------------------------------------------------------
Accuracy (std): 0.998361 (0.000056)
AUROC: 0.837162
AUPRC: 0.011555
Predicted classes: [0 1]
Confusion matrix:
 [[71874     3]
 [  115     0]]
Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     71877
           1       0.00      0.00      0.00       115

    accuracy                           1.00     71992
   macro avg       0.50      0.50      0.50     71992
weighted avg       1.00      1.00      1.00     71992

------------------------------------------------------------


## A-2. Support Vector Machine

In [17]:
svm_clf = SVC(probability = True, kernel = 'rbf')   
model.eval_metrics_cv(svm_clf, X_train_scaled, y_train, cv_fold = 5, scoring = 'accuracy', 
                      model_nm = "SVM Classifier")

5-Fold cross-validation results for SVM Classifier
------------------------------------------------------------
Accuracy (std): 0.998403 (0.000000)
AUROC: 0.691234
AUPRC: 0.009473
Predicted classes: [0]
Confusion matrix:
 [[71877     0]
 [  115     0]]
Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     71877
           1       0.00      0.00      0.00       115

    accuracy                           1.00     71992
   macro avg       0.50      0.50      0.50     71992
weighted avg       1.00      1.00      1.00     71992

------------------------------------------------------------


In [None]:
# Range of hyperparameters
params = {'C': [2**x for x in range(-2,11,2)], 
          'gamma': [2**x for x in range(-11,1,2)],
          'class_weight': [None, 'balanced',{0:1, 1:class_ratio}]
         } 
                                                              
# Randomized search for SVM
svm_clf = SVC(probability = True, kernel = 'rbf')
rsearch_svm = RandomizedSearchCV(svm_clf, param_distributions = params, cv = 5,
                                 scoring = 'roc_auc', n_iter =200,
                                 n_jobs = -1,random_state = 42, verbose = 1) 
# Fit the training set
rsearch_svm.fit(X_train_scaled, y_train)

# Pickle trained model
joblib.dump(rsearch_svm.best_estimator_, '../src/model/svm_clf.pkl')

print('Best AUC score: ', rsearch_svm.best_score_)
print('Best parameters: ', rsearch_svm.best_params_)
print()
model.eval_metrics_cv(rsearch_svm.best_estimator_, X_train_scaled, y_train, cv_fold = 5,
 scoring = 'accuracy', model_nm = "SVM Classifier with Best Hyperparameters")

Fitting 5 folds for each of 126 candidates, totalling 630 fits
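
The candidate counts reported by the two searches follow from the size of each grid. When every parameter is given as a list, `RandomizedSearchCV` samples without replacement, so an `n_iter` of 200 is capped at the number of distinct combinations. A small check, with the grids copied from the cells above and the dictionary-valued class weights replaced by placeholder strings:

```python
from itertools import product

# Grids copied from the logistic regression and SVM searches;
# "1:3" and "ratio" are placeholders for the dictionary-valued class weights
log_grid = {
    "C": [2**x for x in range(-2, 9, 2)],
    "class_weight": ["balanced", "1:3", "ratio"],
}
svm_grid = {
    "C": [2**x for x in range(-2, 11, 2)],
    "gamma": [2**x for x in range(-11, 1, 2)],
    "class_weight": [None, "balanced", "ratio"],
}

def n_candidates(grid):
    """Count the distinct parameter combinations in a grid of lists."""
    return len(list(product(*grid.values())))

n_iter, cv = 200, 5
for name, grid in [("logistic regression", log_grid), ("SVM", svm_grid)]:
    n = min(n_iter, n_candidates(grid))
    print(f"{name}: {n} candidates, {n * cv} fits")
```

With grids this small, the randomized search is effectively an exhaustive grid search.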


## A-3. Random Forest

In [None]:
rf_clf = RandomForestClassifier(random_state = 42)                         
model.eval_metrics_cv(rf_clf, X_train_scaled, y_train, cv_fold = 5, scoring = 'accuracy', 
                      model_nm = "Random Forest Classifier")

In [None]:
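# Fit on the full training set so that feature_importances_ is available
# (needed if eval_metrics_cv only fits clones of the estimator)
rf_clf.fit(X_train_scaled, y_train)

# Compute feature importances
importances_df = pd.DataFrame({'Features': feat_names, 
                                'Importances': rf_clf.feature_importances_
                              })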
# Bar plot
importances_df.sort_values('Importances', ascending = True, inplace = True)
importances_df.set_index('Features', inplace = True)
importances_df.tail(20).plot(kind='barh', figsize = (18,10))
plt.title('Top 20 Feature Importances for Random Forest Classifier')

## A-4. XGBoost

In [None]:
param_dist = {'objective':'binary:logistic', 'eval_metric':'logloss', 'random_state':42}

xgb_clf = XGBClassifier(**param_dist) 
model.eval_metrics_cv(xgb_clf, X_train_scaled, y_train, cv_fold = 5, scoring = 'accuracy', 
                      model_nm = "XGBoost Classifier")

In [None]:
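# Fit on the full training set so that feature_importances_ is available
# (needed if eval_metrics_cv only fits clones of the estimator)
xgb_clf.fit(X_train_scaled, y_train)

# Compute feature importances
importances_df = pd.DataFrame({'Features': feat_names,
                                 'Importances': xgb_clf.feature_importances_
                              })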
# Bar plot
importances_df.sort_values('Importances', ascending = True, inplace = True)
importances_df.set_index('Features', inplace = True)
importances_df.tail(20).plot(kind='barh', figsize = (18,10))
plt.title('Top 20 Feature Importances for XGBoost Classifier')

# B. Hyperparameter tuning - XGBoost
Based on the results above, we select XGBoost and tune its hyperparameters.

In [None]:
# Class ratio of the negative class to the positive class
neg = y_train == 0
pos = y_train == 1
class_ratio = np.sum(neg)/np.sum(pos)

# Range of hyperparameters
params = {'subsample':[i/10 for i in range(5,9)],
            'colsample_bytree': [i/10 for i in range(5,9)]
            }

# Randomized search
param_dist = {'objective':'binary:logistic', 'eval_metric':'logloss', 
                'n_estimators':1000,'scale_pos_weight':class_ratio, 
                'learning_rate':0.1, 'min_child_weight':5, 
                'max_depth':9,'random_state':42
             }
              
xgb_clf = XGBClassifier(**param_dist)
rsearch_xgb = RandomizedSearchCV(estimator = xgb_clf, param_distributions = params, 
                                  scoring = 'roc_auc', cv = 5, n_jobs = -1, n_iter = 200, 
                                  random_state = 42, verbose = 1)   
# Fit the  training set                                                            
rsearch_xgb.fit(X_train_scaled, y_train)

# Pickle trained model
joblib.dump(rsearch_xgb.best_estimator_, '../src/model/xgb_clf.pkl')

# Print results
print('Randomized search best AUC score:', rsearch_xgb.best_score_) 
print('Randomized search best hyperparameters:', rsearch_xgb.best_params_) 
print()
model.eval_metrics_cv(rsearch_xgb.best_estimator_, X_train_scaled, y_train, cv_fold = 5,
 scoring = 'accuracy', model_nm = "XGBoost Classifier with best hyperparameters")

# C. Resampling Methods
In this section, we apply a resampling technique to the training set to balance the classes. The final prediction is still made on the imbalanced test set. The idea of resampling is to train the classifier on a balanced dataset so that it does not simply favor the majority class.  

## C-1. SMOTE combined with XGBoost Classifier
In the Synthetic Minority Over-sampling Technique (SMOTE), we generate synthetic observations of the minority class by interpolating between existing minority samples and their nearest minority-class neighbors.
SMOTE oversamples the minority class in the training set until both classes have an equal number of samples. 
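
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: the function name `smote_sketch`, the brute-force neighbour search, and the toy data are all assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class (brute force)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)         # random base samples
    nn = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    lam = rng.random((n_new, 1))                           # position on the segment
    return X_min[base] + lam * (X_min[nn] - X_min[base])

# Toy minority class (hypothetical): 10 points in 2 dimensions
rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))
X_syn = smote_sketch(X_min, n_new=90)

print("original minority samples:", X_min.shape[0])
print("synthetic samples added:  ", X_syn.shape[0])
```

Every synthetic point lies on a line segment between two existing minority samples, so the oversampled set stays inside the region the minority class already occupies.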

In [None]:
# Over sample the minority class
sm = SMOTE(random_state = 42)
X_train_scaled_ovsm, y_train_ovsm = sm.fit_resample(X_train_scaled, y_train)

In [None]:
print('SMOTE training data size:', X_train_scaled_ovsm.shape, y_train_ovsm.shape)

In [None]:
print('Imbalanced training set class distribution:', np.bincount(y_train))
print('SMOTE resampled training set class distribution:', np.bincount(y_train_ovsm))

In [None]:
# XGBoost cross-validation on the SMOTE dataset
param_dist = {'objective':'binary:logistic', 'eval_metric':'logloss', 'learning_rate':0.1,
            'random_state':42
            }
            
xgb_ovsm  = XGBClassifier(**param_dist)
model.eval_metrics_cv(xgb_ovsm, X_train_scaled_ovsm, y_train_ovsm, cv_fold = 5, 
                      scoring = 'accuracy', model_nm = "SMOTE with XGBoost Classifier")

In [None]:
# Load trained model
xgb_clf = joblib.load('../src/model/xgb_clf.pkl') 

# D. Prediction on the Imbalanced Test Set 
In this section, we make the final predictions on the imbalanced test set, using the model trained on the original imbalanced training set and the model trained on the SMOTE-resampled training set.

## D-1. Normal Imbalanced Dataset

In [None]:
model.test_pred(xgb_clf, X_train_scaled, y_train, X_test_scaled, y_test, model_nm = "XGBoost Classifier")

## D-2. SMOTE Dataset

In [None]:
model.test_pred(xgb_ovsm, X_train_scaled_ovsm, y_train_ovsm, X_test_scaled, y_test, model_nm = "SMOTE with XGBoost Classifier")

# E. ROC and PR Curves


In [None]:
plt.figure(figsize = (20,15))

# Normal imbalanced distribution
model.plot_roc_pr_curves(xgb_clf, X_train_scaled, y_train, X_test_scaled, y_test, cv_fold = 5,
                       color= 'b', label = 'Normal (AUC= %0.2f)')
                     
# SMOTE distribution
model.plot_roc_pr_curves(xgb_ovsm, X_train_scaled_ovsm, y_train_ovsm, X_test_scaled, y_test, 
                         cv_fold = 5, color= 'r', label = 'SMOTE (AUC= %0.2f)') 