In [1]:
import sys
sys.path.append("functions/")

In [2]:
from datastore import DataStore
from searchgrid import SearchGrid
from crossvalidate import CrossValidate
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sampleddatastore import SampledDataStore

First, we will load up our predefined functions for loading the data and running the model.

Due to the high class imbalance, F1 score is a much better metric to use than just accuracy (since 99% of the data belongs to class 0). We will also have ROC-AUC for comparison.

We fetch the true positives, false positives and false negatives to calculate the f1 score across all folds rather than using the builtin functionality. This is because the averaged f1 score returned by Sklearn is slighly biased for imbalanced class problems (for cross validation). This doesn't matter when evaluating the test set. All the relevant functions are in their respective python files (same folder as the notebook).

Reference: https://www.hpl.hp.com/techreports/2009/HPL-2009-359.pdf

In [3]:
#Load object for CrossValidation
crossvalidate = CrossValidate()

#Load object for GridSearchCV
GridSpace = SearchGrid()

Let's establish a baseline model that simply predicts the minority class:

In [4]:
classifier = DummyClassifier(strategy="constant", constant=1)
crossvalidate.setClassifier(classifier)
crossvalidate.run()
f1, roc = crossvalidate.getMetrics().getScores()
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")

F1 score is: 0.0038545073871373366
ROC-AUC is: 0.5


A good model is one that can perform better than the baseline, in terms of F1 Score. Anything below is worse than a model that simply predicts minority class.

Note that 0.5 ROC-AUC score indicates that it's a random classifier.

In [5]:
classifier = LogisticRegression()
crossvalidate.setClassifier(classifier)
crossvalidate.run()
f1, roc = crossvalidate.getMetrics().getScores()
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")

F1 score is: 0.15532286212914484
ROC-AUC is: 0.5462118916903937


Looks like it's slightly better than a random classifier; this means that our model is learning some relationships for the underlying data, albeit small.

The low score is to expected, especially given the class imbalance. Let's try using the class weight functionality that assigns weights to each class based on their frequency.

In [6]:
classifier = LogisticRegression(class_weight='balanced')
crossvalidate.setClassifier(classifier)
crossvalidate.run()
f1, roc = crossvalidate.getMetrics().getScores()
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")

F1 score is: 0.14040146390151936
ROC-AUC is: 0.8218717660393189


Looks like the balanced class weight performs worse in terms of f1 score (probably because it results in a lot more false positives).

Let's test different parameters using GridSearchCV. We will be using our custom objects.

In [7]:
parameters = {'class_weight':[{0:1,1:1}, {0:1,1:10}, {0:1,1:100}, {0:10,1:1}]}
GridSpace.setGridParameters(parameters)
GridSpace.setClassifier(LogisticRegression())
GridSpace.run()
parameters, scores = GridSpace.getMetrics().getBestResults()
f1 = scores[0]
roc = scores[1]
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")
print(f"Best Parameters: {parameters}")

F1 score is: 0.28378378378378377
ROC-AUC is: 0.6413425931373595
Best Parameters: {'class_weight': {0: 1, 1: 10}}


We are making progress, but can we do even better?

Adjusting the weights were not enough, we will have to try different sampling techniques. Imbalanced-learn library will come in handy here.

We will start with RandomOverSampler to duplicates records from the minority class. We will use a sampling ratio of 0.1 (i.e. increasing gilded class to ~10%).

Read more: https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#a-practical-guide

In [8]:
SampledDataStore = SampledDataStore()
SampledDataStore.initializeSamplers()

Resampling and Saving Data: RandomSample
Resampling and Saving Data: SMOTE
Resampling and Saving Data: ADASYN
Resampling and Saving Data: SMOTETomek
Resampling and Saving Data: SMOTEEnn
Loading Sampling Data...


In [10]:
#Using RandomOverSampler to duplicate records belonging to class 1 (gilded)
random = SampledDataStore.getRandomSampled

X_resampled, y_resampled = random()
classifier = LogisticRegression(class_weight={0: 1, 1: 10})
crossvalidate.getDataStore().setxTrain(X_resampled)
crossvalidate.getDataStore().setyTrain(y_resampled)
crossvalidate.setClassifier(classifier)
crossvalidate.run()
f1, roc = crossvalidate.getMetrics().getScores()
print("Random Over Sampling:")
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")

crossvalidate.getDataStore().revertToOriginal()

Random Over Sampling:
F1 score is: 0.7297813462213041
ROC-AUC is: 0.8224302218282359


We can also generate new samples with SMOTE and ADASYN based on existing samples. We will keep the sampling ratio the same for comparison.

In [11]:
smote = SampledDataStore.getSMOTESampled
ada = SampledDataStore.getADASYNSampled
samplers = [smote, ada]
sampler_names = ["SMOTE", "ADASYN"]

for i in range(len(samplers)):
    X_resampled, y_resampled = samplers[i]()
    crossvalidate.getDataStore().setxTrain(X_resampled)
    crossvalidate.getDataStore().setyTrain(y_resampled)
    classifier = LogisticRegression(class_weight={0: 1, 1: 10})
    crossvalidate.setClassifier(classifier)
    crossvalidate.run()
    f1, roc = crossvalidate.getMetrics().getScores()
    print(f"{sampler_names[i]}: ")
    print(f"F1 score is: {f1}")
    print(f"ROC-AUC is: {roc}")
    print("\n")
        
crossvalidate.getDataStore().revertToOriginal()

SMOTE: 
F1 score is: 0.7275005987957244
ROC-AUC is: 0.8211875770559016


ADASYN: 
F1 score is: 0.7039175165101756
ROC-AUC is: 0.8081044664836933




Imbalanced learn also recommends combining oversampling with undersampling the majority class.

Ref: https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/combine/plot_comparison_combine.html

SMOTE can generate noisy samples (ex: when classes cannot be well separated), undersampling allows to clean the noisy data.

In [12]:
smote_tomek = SampledDataStore.getSMOTETOMEKSampled
smote_enn = SampledDataStore.getSMOTEENNSampled
samplers = [smote_tomek, smote_enn]
sampler_names = ["SMOTE TOMEK", "SMOTE ENN"]

for i in range(len(samplers)):
    X_resampled, y_resampled = samplers[i]()
    crossvalidate.getDataStore().setxTrain(X_resampled)
    crossvalidate.getDataStore().setyTrain(y_resampled)
    classifier = LogisticRegression(class_weight={0: 1, 1: 10})
    crossvalidate.setClassifier(classifier)
    crossvalidate.run()
    f1, roc = crossvalidate.getMetrics().getScores()
    print(f"{sampler_names[i]}: ")
    print(f"F1 score is: {f1}")
    print(f"ROC-AUC is: {roc}")
    print("\n")
        
crossvalidate.getDataStore().revertToOriginal()

SMOTE TOMEK: 
F1 score is: 0.7318982387475538
ROC-AUC is: 0.8243926252502568


SMOTE ENN: 
F1 score is: 0.774218448475492
ROC-AUC is: 0.8593981574703357




SMOTE, SMOTEENN and RandomOverSampler produces the best results so far. Let's evaluate those them on our test set.

In [13]:
random = SampledDataStore.getRandomSampled
smote = SampledDataStore.getSMOTESampled
smote_enn = SampledDataStore.getSMOTEENNSampled
samplers = [random, smote, smote_enn]
sampler_names = ["Random OverSampler", "SMOTE", "SMOTE ENN"]

classifier = LogisticRegression()

for i in range(len(samplers)):
    parameters = {'class_weight':[{0:1,1:10}]}
    X_resampled, y_resampled = samplers[i]()
    GridSpace.getDataStore().setxTrain(X_resampled)
    GridSpace.getDataStore().setyTrain(y_resampled) 
    GridSpace.setGridParameters(parameters)
    GridSpace.setClassifier(classifier)
    grid = GridSpace.run()
    y_preds = grid.predict(GridSpace.getDataStore().getxTest())
    print(f"{sampler_names[i]} on test set:")
    print(f"F1 score: {f1_score(GridSpace.getDataStore().getyTest(), y_preds)}")
    print(f"ROC_AUC score: {roc_auc_score(GridSpace.getDataStore().getyTest(), y_preds)}")
    print(f"Balanced accuracy score: {balanced_accuracy_score(GridSpace.getDataStore().getyTest(), y_preds)}")
    print("\n")
    
GridSpace.getDataStore().revertToOriginal()

Random OverSampler on test set:
F1 score: 0.14326107445805844
ROC_AUC score: 0.8097009156137511
Balanced accuracy score: 0.8097009156137511


SMOTE on test set:
F1 score: 0.14212248714352502
ROC_AUC score: 0.8096324660369305
Balanced accuracy score: 0.8096324660369305


SMOTE ENN on test set:
F1 score: 0.10788643533123028
ROC_AUC score: 0.8451410363265931
Balanced accuracy score: 0.8451410363265931




Logistic regression predicts the class probabilities for each sample and decides class based on a threshold (default: 0.5). We can also check if a different threshold value produces better results.

Ref: https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/

Let's define the relevant functions first.

In [14]:
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegressionCV

def trainAndgetProbabilities(xTrain, yTrain, xTest):
    rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=42)
    model = LogisticRegressionCV(cv=rskf, class_weight=[{0: 1, 1: 10}])
    model.fit(xTrain, yTrain)
    return model.predict_proba(xTest)[:,1]

def convert_probs(probs, threshold):
    return (probs >= threshold).astype('int')

In [15]:
from datastore import DataStore

Data = DataStore()
random = SampledDataStore.getRandomSampled
smote = SampledDataStore.getSMOTESampled
smote_enn = SampledDataStore.getSMOTEENNSampled
samplers = [random, smote, smote_enn]
sampler_names = ["Random Oversampling", "SMOTE", "SMOTE ENN"]
thresholds = np.arange(0, 1, 0.001)

for i in range(len(samplers)):
    X_resampled, y_resampled = samplers[i]()
    probs = trainAndgetProbabilities(X_resampled, y_resampled, Data.getxTest())
    f1_scores = [f1_score(Data.getyTest(), convert_probs(probs, t)) for t in thresholds]
    roc_scores = [roc_auc_score(Data.getyTest(), convert_probs(probs, t)) for t in thresholds]
    maxf1Index = np.argmax(f1_scores)
    maxrocIndex = np.argmax(roc_scores)
    print(f"\n{sampler_names[i]} on test set:")
    print("Maxiziming F1 Score:")
    print(f"Threshold: {thresholds[maxf1Index]}, F1 Score: {f1_scores[maxf1Index]}, ROC AUC: {roc_scores[maxf1Index]}")
    print("Maxiziming ROC-AUC Score:")
    print(f"Threshold: {thresholds[maxrocIndex]}, F1 Score: {f1_scores[maxrocIndex]}, ROC AUC: {roc_scores[maxrocIndex]}")


Random Oversampling on test set:
Maxiziming F1 Score:
Threshold: 0.996, F1 Score: 0.30316742081447967, ROC AUC: 0.6390397631644643
Maxiziming ROC-AUC Score:
Threshold: 0.048, F1 Score: 0.030894055234826023, ROC AUC: 0.8626526022918528

SMOTE on test set:
Maxiziming F1 Score:
Threshold: 0.996, F1 Score: 0.30316742081447967, ROC AUC: 0.6390397631644643
Maxiziming ROC-AUC Score:
Threshold: 0.056, F1 Score: 0.08689424683695392, ROC AUC: 0.8639990457323702

SMOTE ENN on test set:
Maxiziming F1 Score:
Threshold: 0.993, F1 Score: 0.27439886845827444, ROC AUC: 0.7005935484260625
Maxiziming ROC-AUC Score:
Threshold: 0.036000000000000004, F1 Score: 0.042553191489361694, ROC AUC: 0.8805897132365377


Better, but not ideal. The difference in ROC_AUC score points to the problem; The higher threshold value causes the model to predict smaller number of samples to be positive (true positive or false positive), resulting in lower ROC AUC and a higher F1 score.

Overall, our results are better than the baseline model, but not ideal. Perhaps, we can achieve better results with a more complex (non-linear) model. Let's try SVM next.