## Class Balancing techniques application

In supervised learning, a common strategy for handling class imbalance is to resample the original training dataset so that the classes are less skewed. Resampling oversamples the minority (positive) class, undersamples the majority (negative) class, or does both, until the classes are approximately equally represented.

All resampling operations are applied to the training dataset only. If oversampling is done before the data is split into training and validation sets, copies of the same observation (or synthetic points derived from it) can end up in both sets. The model can then predict those validation observations almost perfectly, which inflates accuracy and recall.
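
The leakage described above can be reproduced on toy data. The following is a minimal pure-Python sketch, not the notebook's pipeline: the helper names (`random_oversample`, `split`, `shared_rows`) and the 100-row toy dataset are assumptions made for illustration only.

```python
import random

def random_oversample(rows, labels, rng):
    """Duplicate random minority rows until both classes have equal counts."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(rows))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]

def split(rows, labels, rng, test_fraction=0.3):
    """Shuffle the rows and split them into train and validation parts."""
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    train, valid = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in valid], [labels[i] for i in valid])

def shared_rows(train_rows, valid_rows):
    """Count validation rows that also appear in the training set."""
    seen = set(train_rows)
    return sum(1 for r in valid_rows if r in seen)

rng = random.Random(0)
# Hypothetical toy data: 100 unique rows (id, feature), 20 of them positive.
rows = [(i, rng.random()) for i in range(100)]
labels = [1 if i < 20 else 0 for i in range(100)]

# Wrong order: oversample first, then split. Duplicates straddle the split.
r_os, l_os = random_oversample(rows, labels, rng)
tr, _, va, _ = split(r_os, l_os, rng)
leaky = shared_rows(tr, va)

# Right order: split first, then oversample only the training part.
tr, tr_y, va, va_y = split(rows, labels, rng)
tr, tr_y = random_oversample(tr, tr_y, rng)
clean = shared_rows(tr, va)

print("validation rows also in train (oversample first):", leaky)
print("validation rows also in train (split first):", clean)
```

When the split comes first, the validation set stays disjoint from the training set no matter how many minority rows are duplicated afterwards.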

* Input : Transformed datasets from Notebook3 - x_train_outfix.csv, x_test_outfix.csv, y_train_c3.csv and y_test_c3.csv
* Outcome : SMOTE+Tomek link class-balanced datasets - x_train_smtom.csv, y_train_smtom.csv, x_test_c4.csv and y_test_c4.csv

In [1]:
from __future__ import print_function 
import time

# Import libraries
import pandas as pd
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
import random
import sklearn
import scipy

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

#Import Data balancing libraries
from imblearn.under_sampling import RandomUnderSampler, NearMiss
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.combine import SMOTETomek, SMOTEENN
from collections import Counter

# Import models from sklearn
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Import evaluation metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score,roc_curve, auc
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_validate
from sklearn.metrics import cohen_kappa_score
from sklearn import model_selection

In [2]:
x_train=pd.read_csv('x_train_outfix.csv')
x_test=pd.read_csv('x_test_outfix.csv')

y_train=pd.read_csv('y_train_c3.csv')
y_test=pd.read_csv('y_test_c3.csv')

print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)

(21000, 29) (9000, 29) (21000, 1) (9000, 1)


### Functions for Model performance comparison

In [3]:
classifier = [
    ensemble.AdaBoostClassifier(), ensemble.BaggingClassifier(), XGBClassifier(),
    ensemble.GradientBoostingClassifier(), ensemble.RandomForestClassifier(), tree.DecisionTreeClassifier(),
    linear_model.LogisticRegressionCV(), naive_bayes.GaussianNB(), neighbors.KNeighborsClassifier(),svm.SVC(probability=True)
    ]

In [3]:
def models_comparison(x_train, y_train, x_test, y_test, folds):
    """Fit each model in the global `classifier` list and tabulate its metrics."""
    time_start = time.time()
    classifier_columns = []
    classifier_compare = pd.DataFrame(columns = classifier_columns)

    for row_index, alg in enumerate(classifier):
        # Fit on the (possibly resampled) training data; predict hard 0/1 labels for the test set
        pred = alg.fit(x_train, y_train).predict(x_test)
        classifier_name = alg.__class__.__name__

        classifier_compare.loc[row_index, 'ML Algorithm'] = classifier_name
        # Cross-validated accuracy on the training data
        classifier_compare.loc[row_index, 'Train Accuracy'] = model_selection.cross_val_score(alg,x_train,y_train,cv=folds,scoring='accuracy').mean()
        # NOTE: the next four scores are cross-validated on (x_test, y_test) alone, so each fold
        # refits the model on test data and the values do not depend on x_train / y_train
        classifier_compare.loc[row_index, 'Test Accuracy'] = model_selection.cross_val_score(alg,x_test,y_test,cv=folds,scoring='accuracy').mean()
        classifier_compare.loc[row_index, 'Precision'] = model_selection.cross_val_score(alg,x_test,y_test,cv=folds,scoring='precision').mean()
        classifier_compare.loc[row_index, 'Recall'] = model_selection.cross_val_score(alg,x_test,y_test,cv=folds,scoring='recall').mean()
        classifier_compare.loc[row_index, 'F1 score'] = model_selection.cross_val_score(alg,x_test,y_test,cv=folds,scoring='f1').mean()
        # ROC AUC from hard predictions (not probabilities): the curve has a single operating point
        fpr, tpr, _ = roc_curve(y_test, pred)
        roc_auc = auc(fpr, tpr)
        classifier_compare.loc[row_index, 'ROC AUC'] = roc_auc
        classifier_compare.loc[row_index, 'Kappa'] = cohen_kappa_score(y_test, pred)
        classifier_compare.loc[row_index, 'GINI'] = (2 * roc_auc) - 1
        # Type II error = number of false negatives (positives predicted as negatives)
        tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0,1]).ravel()
        classifier_compare.loc[row_index, 'Type II error'] = fn

    classifier_compare.sort_values(by = ['Test Accuracy'], ascending = False, inplace = True)
    print('Time elapsed: {} seconds'.format(time.time()-time_start))
    return classifier_compare
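
The Kappa, ROC AUC, GINI and Type II error columns all follow from the test-set confusion matrix, because `models_comparison` passes hard 0/1 predictions to `roc_curve`. The sketch below works this out by hand under that assumption. The function names and the confusion-matrix counts are hypothetical and are not taken from the notebook's results.

```python
def kappa_from_confusion(tn, fp, fn, tp):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tn + fp + fn + tp
    observed = (tn + tp) / n
    # Chance agreement: product of actual and predicted marginals, per class
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / (n * n)
    return (observed - expected) / (1 - expected)

def auc_from_hard_labels(tn, fp, fn, tp):
    """Area under a ROC curve built from 0/1 predictions.

    Such a curve has one operating point, so the area reduces to
    (sensitivity + specificity) / 2.
    """
    tpr = tp / (tp + fn)   # sensitivity (recall)
    tnr = tn / (tn + fp)   # specificity
    return (tpr + tnr) / 2

# Hypothetical counts for illustration only
tn, fp, fn, tp = 80, 10, 5, 5

kappa = kappa_from_confusion(tn, fp, fn, tp)
roc_auc = auc_from_hard_labels(tn, fp, fn, tp)
gini = 2 * roc_auc - 1     # same formula as the GINI column
type_ii_error = fn         # same definition as the Type II error column

print(round(kappa, 3), round(roc_auc, 3), round(gini, 3), type_ii_error)
```

An AUC computed from predicted probabilities (for example via `predict_proba`) would generally differ from this single-point value.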

### Under Sampling:

#### 1. Random undersampling

The simplest form of undersampling removes random records from the majority class. imblearn's implementation can draw the retained majority samples with or without replacement. The biggest drawback of this approach is the loss of information.
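
As a rough illustration of the with/without replacement choice, here is a minimal pure-Python sketch. It is not imblearn's implementation; the function name `random_undersample` and the toy labels are assumptions.

```python
import random
from collections import Counter

def random_undersample(labels, replacement=False, seed=0):
    """Return row indices that keep every class at the minority class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    n_min = min(counts.values())
    kept = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        if len(idx) == n_min:
            kept.extend(idx)                          # minority class: keep all rows
        elif replacement:
            kept.extend(rng.choices(idx, k=n_min))    # the same row may be drawn twice
        else:
            kept.extend(rng.sample(idx, n_min))       # every kept row is distinct
    return sorted(kept)

# Hypothetical toy labels: 8 negatives, 2 positives
labels = [0] * 8 + [1] * 2
kept = random_undersample(labels)
print(kept, Counter(labels[i] for i in kept))
```

Either way, six of the eight majority rows in this toy example are discarded, which is the information loss mentioned above.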

In [11]:
rus = RandomUnderSampler()
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
x_train_rus, y_train_rus = rus.fit_resample(x_train, y_train)
print(x_train_rus.shape,y_train_rus.shape)
print('\n',y_train_rus['DEFAULT'].value_counts())

(9392, 29) (9392, 1)

 1    4696
0    4696
Name: DEFAULT, dtype: int64


#### 2. Near Miss

In [12]:
nm = NearMiss()
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
x_train_nm, y_train_nm = nm.fit_resample(x_train, y_train)
print(x_train_nm.shape,y_train_nm.shape)
print('\n',y_train_nm['DEFAULT'].value_counts())

(9392, 29) (9392, 1)

 1    4696
0    4696
Name: DEFAULT, dtype: int64


### Over Sampling:

#### 3. Random oversampling

The simplest form of oversampling duplicates random records from the minority class. Because the duplicates are exact copies, this can cause overfitting.

In [13]:
ros = RandomOverSampler()
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
x_train_ros, y_train_ros = ros.fit_resample(x_train, y_train)
print(x_train_ros.shape,y_train_ros.shape)
print('\n',y_train_ros['DEFAULT'].value_counts())

(32608, 29) (32608, 1)

 1    16304
0    16304
Name: DEFAULT, dtype: int64


#### 4. SMOTE (Synthetic Minority Over-Sampling Technique)

SMOTE synthesizes new examples by interpolating between existing observations. For each minority class instance, SMOTE finds its k nearest minority-class neighbors, picks one of them at random, and constructs a new instance at a random point on the line segment between the two observations. The greatest limitation of SMOTE is that it can only construct examples within the body of existing observations, never outside it.
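
The interpolation step can be sketched in a few lines of pure Python. This is a simplified illustration, not imblearn's `SMOTE`: the function name `smote_sketch`, the random choice of base instance, and the toy points are assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic

# Hypothetical minority points on the corners of the unit square
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote_sketch(minority, n_new=6, k=3)
for point in synthetic:
    print(point)
```

Every synthetic point is a convex combination of two existing minority points, so it always falls inside the region they span. This is the limitation noted above.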

In [14]:
# Smote oversampling
sm = SMOTE()
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
x_train_smote, y_train_smote = sm.fit_resample(x_train, y_train)
print(x_train_smote.shape,y_train_smote.shape)
print('\n',y_train_smote['DEFAULT'].value_counts())

(32608, 29) (32608, 1)

 1    16304
0    16304
Name: DEFAULT, dtype: int64


#### 5. ADASYN (Adaptive Synthetic Sampling)

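ADASYN adaptively generates more synthetic samples next to minority observations that are harder to learn, that is, those with many majority-class observations among their k nearest neighbors (the ones a KNN classifier would tend to misclassify). Whereas SMOTE treats all minority observations alike, ADASYN concentrates new samples near the class boundary and near existing outliers.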
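
The adaptive part of ADASYN is the allocation of synthetic samples across minority observations. The sketch below shows only that allocation step and assumes the usual formulation (share of majority neighbors, normalized to sum to one). The function name `adasyn_allocation` and the toy data are hypothetical.

```python
import math

def adasyn_allocation(points, labels, n_new, k=3):
    """Decide how many synthetic samples each minority point should receive."""
    minority_idx = [i for i, y in enumerate(labels) if y == 1]
    ratios = []
    for i in minority_idx:
        others = [j for j in range(len(points)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]
        # Share of majority-class points among the k nearest neighbors
        ratios.append(sum(labels[j] == 0 for j in nearest) / k)
    total = sum(ratios)
    if total == 0:
        return {i: 0 for i in minority_idx}
    # Normalize so the allocations add up to roughly n_new
    return {i: round(n_new * r / total) for i, r in zip(minority_idx, ratios)}

# Hypothetical data: four minority points clustered together (indices 0-3)
# and one minority point (index 4) surrounded by majority points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (4.9, 5.0), (5.0, 4.9)]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0]

allocation = adasyn_allocation(points, labels, n_new=10)
print(allocation)
```

In this toy case the isolated minority point receives the whole budget, while the points deep inside the minority cluster receive none. SMOTE would spread new samples without regard to neighborhood difficulty.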

In [15]:
adasyn = ADASYN()
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
x_train_adasyn, y_train_adasyn = adasyn.fit_resample(x_train, y_train)
print(x_train_adasyn.shape,y_train_adasyn.shape)
print('\n',y_train_adasyn['DEFAULT'].value_counts())

(32124, 29) (32124, 1)

 0    16304
1    15820
Name: DEFAULT, dtype: int64


### Hybrid Sampling:

#### 6. SMOTE+ENN

SMOTEENN combines SMOTE with Edited Nearest Neighbours (ENN). After SMOTE has oversampled the minority class, ENN removes any example whose class label differs from the class label of at least two of its three nearest neighbors. ENN tends to remove more examples than Tomek links do.
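
The ENN rule stated above (drop an example when at least two of its three nearest neighbors carry a different label) can be sketched as follows. This illustrates the rule as described in the text and is not imblearn's `EditedNearestNeighbours`, whose default selection criterion may differ. The function name and toy data are assumptions.

```python
import math
from collections import Counter

def enn_keep_mask(points, labels, k=3):
    """Return True for each point whose label matches the majority of its k neighbors."""
    keep = []
    for i, (p, y) in enumerate(zip(points, labels)):
        others = [j for j in range(len(points)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(p, points[j]))[:k]
        votes = Counter(labels[j] for j in nearest)
        majority_label = votes.most_common(1)[0][0]
        keep.append(majority_label == y)
    return keep

# Hypothetical data: one class-1 point sits inside a class-0 cluster
points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

keep = enn_keep_mask(points, labels)
print(keep)
```

Only the class-1 point at (0.5, 0.5) is flagged for removal, since all three of its nearest neighbors belong to class 0.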

In [18]:
smtenn = SMOTEENN()
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
x_train_smtenn, y_train_smtenn = smtenn.fit_resample(x_train, y_train)
print(x_train_smtenn.shape,y_train_smtenn.shape)
print('\n',y_train_smtenn['DEFAULT'].value_counts())

(18829, 29) (18829, 1)

 1    11764
0     7065
Name: DEFAULT, dtype: int64


#### 7. SMOTE+Tomek link

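SMOTETomek combines SMOTE, which oversamples the minority class, with Tomek links, which serve as the undersampling (cleaning) step. A Tomek link is a pair of observations from different classes that are each other's nearest neighbor; removing such pairs clears overlap along the class boundary. In the output below both classes end up with 15828 observations, which suggests that both members of each link are removed.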
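
A Tomek link is a pair of observations from different classes that are each other's nearest neighbor. The following minimal sketch detects such pairs on toy one-dimensional data. The function name `tomek_links` and the points are assumptions, and the brute-force search is only meant for small examples.

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

# Hypothetical 1-D data: points 1 and 2 sit close together across the class boundary
points = [(0.0,), (1.0,), (1.2,), (3.0,), (3.1,)]
labels = [0, 0, 1, 1, 1]

links = tomek_links(points, labels)
print(links)
```

Points 3 and 4 are also mutual nearest neighbors, but they share a label and therefore do not form a Tomek link.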

In [19]:
smtom = SMOTETomek()
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
x_train_smtom, y_train_smtom = smtom.fit_resample(x_train, y_train)
print(x_train_smtom.shape,y_train_smtom.shape)
print('\n',y_train_smtom['DEFAULT'].value_counts())

(31656, 29) (31656, 1)

 1    15828
0    15828
Name: DEFAULT, dtype: int64


#### 8. Cost-sensitive learning

Cost-sensitive learning for imbalanced classification is focused on first assigning different costs to the types of misclassification errors that can be made, then using specialized methods to take those costs into account.
Rather than resampling to balance the skewed class distribution, the approach used here leaves the training data unchanged and weights the errors on each class so that training reflects the cost matrix.

In [17]:
# The class_weight='balanced' option is passed directly to the algorithms in the model comparison section below.
# Cost-sensitive learning is less widely used than resampling because not every learning algorithm
# has a cost-sensitive implementation.
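
For reference, the weights implied by `class_weight='balanced'` can be worked out by hand. The sketch below assumes scikit-learn's documented heuristic `n_samples / (n_classes * class_count)` and uses the training class counts shown in the resampling outputs above (16304 negatives, 4696 positives). The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(counts):
    """Weights from the 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Training class counts taken from the resampling outputs in this notebook
counts = {0: 16304, 1: 4696}

weights = balanced_class_weights(counts)
# Negative-to-positive ratio, a common starting point for XGBoost's scale_pos_weight
neg_pos_ratio = counts[0] / counts[1]

print({cls: round(w, 2) for cls, w in weights.items()})
print(round(neg_pos_ratio, 2))
```

The ratio between the two class weights equals the negative-to-positive ratio, so both settings express the same relative cost of missing a positive.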

### Selection of Sampling method based on ML model performance

Ten machine learning algorithms are compared using 10-fold cross-validation.

In [31]:
#1 No Sampling 

models_comparison(x_train, y_train, x_test, y_test, 10)

Time elapsed: 3849.278704404831 seconds


|   | ML Algorithm | Train Accuracy | Test Accuracy | Precision | Recall | F1 score | ROC AUC | Kappa | GINI | Type II error |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | GradientBoostingClassifier | 0.82 | 0.83 | 0.68 | 0.36 | 0.47 | 0.66 | 0.38 | 0.31 | 1238.0 |
| 0 | AdaBoostClassifier | 0.82 | 0.82 | 0.67 | 0.34 | 0.46 | 0.64 | 0.34 | 0.28 | 1327.0 |
| 4 | RandomForestClassifier | 0.81 | 0.82 | 0.64 | 0.35 | 0.45 | 0.65 | 0.35 | 0.3 | 1251.0 |
| 2 | XGBClassifier | 0.81 | 0.81 | 0.62 | 0.36 | 0.45 | 0.65 | 0.35 | 0.3 | 1245.0 |
| 1 | BaggingClassifier | 0.8 | 0.81 | 0.6 | 0.34 | 0.41 | 0.64 | 0.32 | 0.27 | 1297.0 |
| 6 | LogisticRegressionCV | 0.78 | 0.78 | 0.48 | 0.0 | 0.01 | 0.5 | 0.0 | 0.0 | 1937.0 |
| 9 | SVC | 0.78 | 0.78 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 1940.0 |
| 8 | KNeighborsClassifier | 0.75 | 0.76 | 0.36 | 0.17 | 0.23 | 0.56 | 0.14 | 0.11 | 1561.0 |
| 5 | DecisionTreeClassifier | 0.73 | 0.73 | 0.37 | 0.4 | 0.38 | 0.62 | 0.23 | 0.24 | 1114.0 |
| 7 | GaussianNB | 0.57 | 0.56 | 0.28 | 0.65 | 0.39 | 0.6 | 0.13 | 0.19 | 673.0 |


In [37]:
#2 Random UnderSampling

models_comparison(x_train_rus, y_train_rus, x_test, y_test, 10)

Time elapsed: 2461.4590957164764 seconds


|   | ML Algorithm | Train Accuracy | Test Accuracy | Precision | Recall | F1 score | ROC AUC | Kappa | GINI | Type II error |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | GradientBoostingClassifier | 0.71 | 0.83 | 0.68 | 0.36 | 0.47 | 0.71 | 0.36 | 0.42 | 707.0 |
| 0 | AdaBoostClassifier | 0.71 | 0.82 | 0.67 | 0.34 | 0.46 | 0.7 | 0.36 | 0.4 | 779.0 |
| 4 | RandomForestClassifier | 0.7 | 0.82 | 0.64 | 0.36 | 0.45 | 0.7 | 0.35 | 0.41 | 728.0 |
| 2 | XGBClassifier | 0.69 | 0.81 | 0.62 | 0.36 | 0.45 | 0.69 | 0.3 | 0.37 | 668.0 |
| 1 | BaggingClassifier | 0.68 | 0.81 | 0.58 | 0.33 | 0.41 | 0.68 | 0.3 | 0.35 | 798.0 |
| 6 | LogisticRegressionCV | 0.63 | 0.78 | 0.48 | 0.0 | 0.01 | 0.61 | 0.15 | 0.22 | 625.0 |
| 9 | SVC | 0.61 | 0.78 | 0.0 | 0.0 | 0.0 | 0.61 | 0.15 | 0.22 | 634.0 |
| 8 | KNeighborsClassifier | 0.58 | 0.76 | 0.36 | 0.17 | 0.23 | 0.58 | 0.12 | 0.17 | 787.0 |
| 5 | DecisionTreeClassifier | 0.63 | 0.73 | 0.37 | 0.4 | 0.4 | 0.62 | 0.17 | 0.23 | 752.0 |
| 7 | GaussianNB | 0.59 | 0.56 | 0.28 | 0.65 | 0.39 | 0.58 | 0.1 | 0.17 | 465.0 |


In [38]:
#3 NearMiss UnderSampling

models_comparison(x_train_nm, y_train_nm, x_test, y_test, 10)

Time elapsed: 1645.0514953136444 seconds


|   | ML Algorithm | Train Accuracy | Test Accuracy | Precision | Recall | F1 score | ROC AUC | Kappa | GINI | Type II error |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | GradientBoostingClassifier | 0.73 | 0.83 | 0.68 | 0.36 | 0.47 | 0.52 | 0.02 | 0.03 | 626.0 |
| 0 | AdaBoostClassifier | 0.73 | 0.82 | 0.67 | 0.34 | 0.46 | 0.52 | 0.02 | 0.04 | 652.0 |
| 4 | RandomForestClassifier | 0.71 | 0.82 | 0.63 | 0.35 | 0.45 | 0.51 | 0.01 | 0.02 | 570.0 |
| 2 | XGBClassifier | 0.69 | 0.81 | 0.62 | 0.36 | 0.45 | 0.51 | 0.01 | 0.02 | 580.0 |
| 1 | BaggingClassifier | 0.68 | 0.81 | 0.59 | 0.32 | 0.42 | 0.5 | 0.0 | 0.0 | 646.0 |
| 6 | LogisticRegressionCV | 0.62 | 0.78 | 0.48 | 0.0 | 0.01 | 0.48 | -0.02 | -0.04 | 554.0 |
| 9 | SVC | 0.65 | 0.78 | 0.0 | 0.0 | 0.0 | 0.42 | -0.1 | -0.16 | 1140.0 |
| 8 | KNeighborsClassifier | 0.59 | 0.76 | 0.36 | 0.17 | 0.23 | 0.45 | -0.06 | -0.1 | 894.0 |
| 5 | DecisionTreeClassifier | 0.63 | 0.72 | 0.37 | 0.4 | 0.39 | 0.49 | -0.01 | -0.01 | 607.0 |
| 7 | GaussianNB | 0.66 | 0.56 | 0.28 | 0.65 | 0.39 | 0.42 | -0.11 | -0.17 | 1245.0 |


In [36]:
#4 Random OverSampling

models_comparison(x_train_ros, y_train_ros, x_test, y_test, 10)

Time elapsed: 9730.347610235214 seconds


|   | ML Algorithm | Train Accuracy | Test Accuracy | Precision | Recall | F1 score | ROC AUC | Kappa | GINI | Type II error |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | GradientBoostingClassifier | 0.73 | 0.83 | 0.68 | 0.36 | 0.47 | 0.71 | 0.37 | 0.42 | 730.0 |
| 0 | AdaBoostClassifier | 0.71 | 0.82 | 0.67 | 0.34 | 0.46 | 0.7 | 0.37 | 0.41 | 769.0 |
| 4 | RandomForestClassifier | 0.95 | 0.82 | 0.64 | 0.35 | 0.46 | 0.67 | 0.38 | 0.34 | 1129.0 |
| 2 | XGBClassifier | 0.84 | 0.81 | 0.62 | 0.36 | 0.45 | 0.69 | 0.35 | 0.38 | 870.0 |
| 1 | BaggingClassifier | 0.94 | 0.81 | 0.59 | 0.34 | 0.42 | 0.65 | 0.34 | 0.3 | 1180.0 |
| 6 | LogisticRegressionCV | 0.63 | 0.78 | 0.48 | 0.0 | 0.01 | 0.61 | 0.15 | 0.22 | 596.0 |
| 9 | SVC | 0.62 | 0.78 | 0.0 | 0.0 | 0.0 | 0.61 | 0.16 | 0.23 | 635.0 |
| 8 | KNeighborsClassifier | 0.72 | 0.76 | 0.36 | 0.17 | 0.23 | 0.56 | 0.1 | 0.13 | 947.0 |
| 5 | DecisionTreeClassifier | 0.9 | 0.72 | 0.37 | 0.4 | 0.38 | 0.61 | 0.21 | 0.21 | 1187.0 |
| 7 | GaussianNB | 0.59 | 0.56 | 0.28 | 0.65 | 0.39 | 0.58 | 0.1 | 0.17 | 465.0 |


In [32]:
#5 SMOTE

models_comparison(x_train_smote, y_train_smote, x_test, y_test, 10)

Time elapsed: 6234.433809518814 seconds


|   | ML Algorithm | Train Accuracy | Test Accuracy | Precision | Recall | F1 score | ROC AUC | Kappa | GINI | Type II error |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | GradientBoostingClassifier | 0.85 | 0.83 | 0.68 | 0.36 | 0.47 | 0.67 | 0.38 | 0.33 | 1174.0 |
| 0 | AdaBoostClassifier | 0.85 | 0.82 | 0.67 | 0.34 | 0.46 | 0.65 | 0.36 | 0.31 | 1224.0 |
| 4 | RandomForestClassifier | 0.87 | 0.82 | 0.65 | 0.36 | 0.46 | 0.66 | 0.35 | 0.31 | 1171.0 |
| 2 | XGBClassifier | 0.86 | 0.81 | 0.62 | 0.36 | 0.45 | 0.65 | 0.35 | 0.3 | 1212.0 |
| 1 | BaggingClassifier | 0.85 | 0.81 | 0.59 | 0.34 | 0.42 | 0.64 | 0.31 | 0.27 | 1244.0 |
| 6 | LogisticRegressionCV | 0.64 | 0.78 | 0.48 | 0.0 | 0.01 | 0.61 | 0.15 | 0.21 | 657.0 |
| 9 | SVC | 0.63 | 0.78 | 0.0 | 0.0 | 0.0 | 0.62 | 0.15 | 0.23 | 544.0 |
| 8 | KNeighborsClassifier | 0.76 | 0.76 | 0.36 | 0.17 | 0.23 | 0.57 | 0.11 | 0.15 | 926.0 |
| 5 | DecisionTreeClassifier | 0.8 | 0.73 | 0.37 | 0.39 | 0.39 | 0.61 | 0.22 | 0.23 | 1111.0 |
| 7 | GaussianNB | 0.6 | 0.56 | 0.28 | 0.65 | 0.39 | 0.58 | 0.09 | 0.16 | 458.0 |


In [33]:
#6 ADASYN

models_comparison(x_train_adasyn, y_train_adasyn, x_test, y_test, 10)

Time elapsed: 5203.879971027374 seconds


|   | ML Algorithm | Train Accuracy | Test Accuracy | Precision | Recall | F1 score | ROC AUC | Kappa | GINI | Type II error |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | GradientBoostingClassifier | 0.85 | 0.83 | 0.68 | 0.36 | 0.47 | 0.67 | 0.39 | 0.34 | 1157.0 |
| 0 | AdaBoostClassifier | 0.84 | 0.82 | 0.67 | 0.34 | 0.46 | 0.66 | 0.36 | 0.31 | 1214.0 |
| 4 | RandomForestClassifier | 0.87 | 0.82 | 0.64 | 0.35 | 0.45 | 0.66 | 0.36 | 0.33 | 1145.0 |
| 2 | XGBClassifier | 0.85 | 0.81 | 0.62 | 0.36 | 0.45 | 0.66 | 0.36 | 0.32 | 1189.0 |
| 1 | BaggingClassifier | 0.85 | 0.8 | 0.59 | 0.32 | 0.42 | 0.64 | 0.31 | 0.27 | 1243.0 |
| 6 | LogisticRegressionCV | 0.62 | 0.78 | 0.48 | 0.0 | 0.01 | 0.6 | 0.14 | 0.2 | 730.0 |
| 9 | SVC | 0.61 | 0.78 | 0.0 | 0.0 | 0.0 | 0.62 | 0.15 | 0.24 | 549.0 |
| 8 | KNeighborsClassifier | 0.72 | 0.76 | 0.36 | 0.17 | 0.23 | 0.57 | 0.11 | 0.15 | 874.0 |
| 5 | DecisionTreeClassifier | 0.79 | 0.72 | 0.37 | 0.4 | 0.38 | 0.61 | 0.21 | 0.22 | 1118.0 |
| 7 | GaussianNB | 0.58 | 0.56 | 0.28 | 0.65 | 0.39 | 0.58 | 0.09 | 0.15 | 444.0 |


In [35]:
#7 SMOTE + ENN

models_comparison(x_train_smtenn, y_train_smtenn, x_test, y_test, 10)

Time elapsed: 1717.5548222064972 seconds


|   | ML Algorithm | Train Accuracy | Test Accuracy | Precision | Recall | F1 score | ROC AUC | Kappa | GINI | Type II error |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | GradientBoostingClassifier | 0.89 | 0.83 | 0.68 | 0.36 | 0.47 | 0.7 | 0.37 | 0.4 | 809.0 |
| 0 | AdaBoostClassifier | 0.87 | 0.82 | 0.67 | 0.34 | 0.46 | 0.69 | 0.34 | 0.38 | 814.0 |
| 4 | RandomForestClassifier | 0.92 | 0.82 | 0.63 | 0.36 | 0.45 | 0.7 | 0.35 | 0.39 | 804.0 |
| 2 | XGBClassifier | 0.9 | 0.81 | 0.62 | 0.36 | 0.45 | 0.68 | 0.32 | 0.36 | 834.0 |
| 1 | BaggingClassifier | 0.9 | 0.8 | 0.59 | 0.33 | 0.42 | 0.68 | 0.33 | 0.35 | 898.0 |
| 6 | LogisticRegressionCV | 0.74 | 0.78 | 0.48 | 0.0 | 0.01 | 0.6 | 0.11 | 0.19 | 305.0 |
| 9 | SVC | 0.75 | 0.78 | 0.0 | 0.0 | 0.0 | 0.6 | 0.11 | 0.2 | 198.0 |
| 8 | KNeighborsClassifier | 0.91 | 0.76 | 0.36 | 0.17 | 0.23 | 0.58 | 0.1 | 0.17 | 580.0 |
| 5 | DecisionTreeClassifier | 0.86 | 0.72 | 0.37 | 0.39 | 0.39 | 0.63 | 0.21 | 0.25 | 892.0 |
| 7 | GaussianNB | 0.71 | 0.56 | 0.28 | 0.65 | 0.39 | 0.57 | 0.08 | 0.15 | 435.0 |


In [34]:
#8 SMOTE + Tomeklink

models_comparison(x_train_smtom, y_train_smtom, x_test, y_test, 10)

Time elapsed: 4968.085988998413 seconds


|   | ML Algorithm | Train Accuracy | Test Accuracy | Precision | Recall | F1 score | ROC AUC | Kappa | GINI | Type II error |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | GradientBoostingClassifier | 0.86 | 0.83 | 0.68 | 0.36 | 0.47 | 0.67 | 0.4 | 0.34 | 1147.0 |
| 0 | AdaBoostClassifier | 0.85 | 0.82 | 0.67 | 0.34 | 0.46 | 0.65 | 0.36 | 0.3 | 1228.0 |
| 4 | RandomForestClassifier | 0.88 | 0.82 | 0.65 | 0.35 | 0.45 | 0.67 | 0.37 | 0.33 | 1139.0 |
| 2 | XGBClassifier | 0.86 | 0.81 | 0.62 | 0.36 | 0.45 | 0.66 | 0.37 | 0.32 | 1182.0 |
| 1 | BaggingClassifier | 0.86 | 0.81 | 0.59 | 0.33 | 0.41 | 0.65 | 0.33 | 0.3 | 1192.0 |
| 6 | LogisticRegressionCV | 0.64 | 0.78 | 0.48 | 0.0 | 0.01 | 0.6 | 0.14 | 0.21 | 636.0 |
| 9 | SVC | 0.63 | 0.78 | 0.0 | 0.0 | 0.0 | 0.62 | 0.15 | 0.23 | 575.0 |
| 8 | KNeighborsClassifier | 0.76 | 0.76 | 0.36 | 0.17 | 0.23 | 0.57 | 0.1 | 0.14 | 932.0 |
| 5 | DecisionTreeClassifier | 0.81 | 0.73 | 0.37 | 0.4 | 0.39 | 0.6 | 0.19 | 0.2 | 1145.0 |
| 7 | GaussianNB | 0.6 | 0.56 | 0.28 | 0.65 | 0.39 | 0.58 | 0.09 | 0.16 | 455.0 |


In [7]:
#9 Cost sensitive learning

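# Class weights correct for class imbalance by reweighting errors, as an alternative to over- or undersampling.
# Only the algorithms below are run with class weighting. This reassigns the global `classifier` list
# that models_comparison iterates over.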

classifier = [
    XGBClassifier(scale_pos_weight=4), ensemble.RandomForestClassifier(class_weight='balanced'), 
    DecisionTreeClassifier(class_weight='balanced'), LogisticRegression(solver='lbfgs', class_weight='balanced')
    ]

In [5]:
models_comparison(x_train, y_train, x_test, y_test, 10)

Time elapsed: 172.29226803779602 seconds


|   | ML Algorithm | Train Accuracy | Test Accuracy | Precision | Recall | F1 score | ROC AUC | Kappa | GINI | Type II error |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | RandomForestClassifier | 0.812286 | 0.817444 | 0.648718 | 0.321134 | 0.433462 | 0.640116 | 0.341044 | 0.280231 | 1301.0 |
| 0 | XGBClassifier | 0.748762 | 0.770556 | 0.472006 | 0.517526 | 0.493378 | 0.69722 | 0.347827 | 0.394441 | 767.0 |
| 2 | DecisionTreeClassifier | 0.729905 | 0.729556 | 0.379844 | 0.39433 | 0.385633 | 0.614597 | 0.22586 | 0.229195 | 1156.0 |
| 3 | LogisticRegression | 0.688238 | 0.688222 | 0.282692 | 0.288144 | 0.284907 | 0.558524 | 0.109305 | 0.117048 | 1262.0 |


### Conclusion

In [1]:
## Choosing the best sampling technique and why:

# SMOTE, ADASYN and SMOTE+Tomek perform better than all other techniques.
# Train accuracy, test accuracy, precision, recall and F1 score are similar among SMOTE, ADASYN and SMOTE+Tomek.
# AUC, Kappa, GINI and Type II error, which are more informative for imbalanced datasets, are slightly
# better for the SMOTE+Tomek link method, and this method is also the least explored in research. Hence this method is finalized.

In [41]:
# Save the processed CSV files for parts 5 and 6 - Feature Selection and Dimensionality Reduction

x_train_smtom.to_csv("x_train_smtom.csv", index=None)
y_train_smtom.to_csv("y_train_smtom.csv", index=None)

x_test.to_csv("x_test_c4.csv", index=None)
y_test.to_csv("y_test_c4.csv", index=None)