In [1]:
import sys
sys.path.append("functions/")

In [2]:
from datastore import DataStore
from searchgrid import SearchGrid
from crossvalidate import CrossValidate
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sampleddatastore import SampledDataStore

First, we import our predefined helper classes for loading the data and running the models.

Because of the high class imbalance (99% of the data belongs to class 0), the F1 score is a much better metric than plain accuracy. We will also report ROC-AUC for comparison.

We fetch the true positives, false positives and false negatives from every fold and calculate the F1 score across all folds, rather than using the built-in functionality. This is because the averaged F1 score returned by scikit-learn is slightly biased for imbalanced class problems under cross-validation. This doesn't matter when evaluating the test set. All the relevant functions are in their respective Python files (same folder as the notebook).

Reference: https://www.hpl.hp.com/techreports/2009/HPL-2009-359.pdf
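
As a small illustration of the difference, the sketch below compares the mean of per-fold F1 scores with a single F1 computed from the true positives, false positives and false negatives summed over all folds. The fold counts are made up for the example, and `f1_from_counts` is a hypothetical helper, not the code in our Python files.

```python
def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in terms of confusion-matrix counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical (tp, fp, fn) counts for three cross-validation folds.
fold_counts = [(1, 1, 0), (0, 2, 1), (3, 1, 2)]

# Option 1: compute F1 per fold, then average the scores.
per_fold_f1 = [f1_from_counts(*counts) for counts in fold_counts]
averaged_f1 = sum(per_fold_f1) / len(per_fold_f1)

# Option 2: sum the counts over all folds, then compute F1 once.
total_tp, total_fp, total_fn = (sum(column) for column in zip(*fold_counts))
pooled_f1 = f1_from_counts(total_tp, total_fp, total_fn)

print(f"Averaged per-fold F1: {averaged_f1:.4f}")
print(f"Pooled F1:            {pooled_f1:.4f}")
```

The second fold has no true positives, so its F1 of 0 pulls the average down. With very few positives per fold this happens easily, which is why we pool the counts instead.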

In [3]:
#Load object for CrossValidation
crossvalidate = CrossValidate()

#Load object for GridSearchCV
GridSpace = SearchGrid()

Let's establish a baseline model that simply predicts the minority class:

In [4]:
classifier = DummyClassifier(strategy="constant", constant=1)
crossvalidate.setClassifier(classifier)
crossvalidate.run()
f1, roc = crossvalidate.getMetrics().getScores()
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")

F1 score is: 0.0038792557287250116
ROC-AUC is: 0.5


A good model is one that performs better than this baseline in terms of F1 score. Anything below it is worse than a model that simply predicts the minority class.

Note that a ROC-AUC score of 0.5 means the classifier is no better than random guessing.
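
To see where the baseline number comes from: a classifier that always predicts class 1 has a recall of 1 and a precision equal to the share of positives p, so F1 = 2p / (1 + p). The sketch below inverts that formula to recover the implied share of positives, assuming the reported F1 is computed from counts pooled over all folds. The function names are illustrative only.

```python
def constant_positive_f1(p):
    """F1 of a classifier that labels every sample positive, given positive share p."""
    precision, recall = p, 1.0
    return 2 * precision * recall / (precision + recall)

def implied_positive_share(f1):
    """Invert F1 = 2p / (1 + p), which gives p = F1 / (2 - F1)."""
    return f1 / (2 - f1)

baseline_f1 = 0.0038792557287250116  # baseline F1 reported above
p = implied_positive_share(baseline_f1)

print(f"Implied share of positives: {p:.4%}")
print(f"F1 recomputed from that share: {constant_positive_f1(p)}")
```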

In [4]:
classifier = LogisticRegression()
crossvalidate.setClassifier(classifier)
crossvalidate.run()
f1, roc = crossvalidate.getMetrics().getScores()
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")

F1 score is: 0.14389101745423585
ROC-AUC is: 0.5423767409646068


It looks slightly better than a random classifier, which means that our model is learning some relationships in the underlying data, albeit weak ones.

The low score is to be expected, especially given the class imbalance. Let's try the class weight functionality, which assigns a weight to each class based on its frequency.

In [5]:
classifier = LogisticRegression(class_weight='balanced')
crossvalidate.setClassifier(classifier)
crossvalidate.run()
f1, roc = crossvalidate.getMetrics().getScores()
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")

F1 score is: 0.07312321528648821
ROC-AUC is: 0.8275665348137995


Looks like the balanced class weight performs worse in terms of f1 score (probably because it results in a lot more false positives).
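
For reference, scikit-learn documents the 'balanced' mode as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies it to a made-up label vector with 1% positives (not our data) to show how heavily the minority class ends up being weighted.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 990 samples of class 0 and 10 samples of class 1.
y = np.array([0] * 990 + [1] * 10)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
weight_ratio = weights[1] / weights[0]

print(f"Weight for class 0: {weights[0]:.4f}")
print(f"Weight for class 1: {weights[1]:.4f}")
print(f"Class 1 is weighted {weight_ratio:.0f}x as much as class 0")
```

With a weight ratio that mirrors the imbalance, a missed positive costs far more than a false alarm, which is consistent with more samples being pushed into class 1.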

Let's test different parameters using GridSearchCV. We will be using our custom objects.

In [9]:
parameters = {'class_weight':[{0:1,1:1}, {0:1,1:10}, {0:1,1:100}, {0:10,1:1}]}
GridSpace.setGridParameters(parameters)
GridSpace.setClassifier(LogisticRegression())
GridSpace.run()
parameters, scores = GridSpace.getMetrics().getBestResults()
f1 = scores[0]
roc = scores[1]
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")
print(f"Best Parameters: {parameters}")

F1 score is: 0.29769820971867006
ROC-AUC is: 0.6456231724605754
Best Parameters: {'class_weight': {0: 1, 1: 10}}


We are making progress, but can we do even better?

Adjusting the weights was not enough, so we will have to try different sampling techniques. The imbalanced-learn library will come in handy here.

We will start with RandomOverSampler, which duplicates records from the minority class. We will use a sampling ratio of 0.1; if this is passed as imbalanced-learn's sampling_strategy, the gilded (minority) class is resampled to about 10% of the size of the majority class.

Read more: https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html#a-practical-guide
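
The idea behind random oversampling can be sketched in plain NumPy. This is an illustration only, not imbalanced-learn's implementation and not what our SampledDataStore does internally; `random_oversample` is a hypothetical helper, and `ratio` is assumed to mean minority count divided by majority count after resampling.

```python
import numpy as np

def random_oversample(X, y, ratio, minority_label=1, seed=0):
    """Duplicate random minority rows until minority / majority is about `ratio`."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    n_majority = int(np.sum(y != minority_label))
    n_target = int(round(ratio * n_majority))
    n_extra = max(n_target - minority_idx.size, 0)
    # Sample with replacement: the same minority row can be copied several times.
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(y.size), extra_idx])
    return X[keep], y[keep]

# Made-up data: 1000 rows, 10 of them in the minority class.
X = np.arange(2000, dtype=float).reshape(1000, 2)
y = np.zeros(1000, dtype=int)
y[:10] = 1

X_res, y_res = random_oversample(X, y, ratio=0.1)
print(f"Minority before: {y.sum()}, after: {y_res.sum()}, total rows: {len(y_res)}")
```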

In [4]:
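# Note: this rebinds the name SampledDataStore from the class to an instance;
# the cells below call methods on this instance.
SampledDataStore = SampledDataStore()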
SampledDataStore.initializeSamplers()

In [6]:
#Using RandomOverSampler to duplicate records belonging to class 1 (gilded)
random = SampledDataStore.getRandomSampled

X_resampled, y_resampled = random()
classifier = LogisticRegression(class_weight={0: 1, 1: 10})
crossvalidate.getDataStore().setxTrain(X_resampled)
crossvalidate.getDataStore().setyTrain(y_resampled)
crossvalidate.setClassifier(classifier)
crossvalidate.run()
f1, roc = crossvalidate.getMetrics().getScores()
print("Random Over Sampling:")
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")

crossvalidate.getDataStore().revertToOriginal()

Random Over Sampling:
F1 score is: 0.6781609747429389
ROC-AUC is: 0.828094671658085


<datastore.DataStore at 0x2aaaf0907fd0>

We can also generate new samples with SMOTE and ADASYN based on existing samples. We will keep the sampling ratio the same for comparison.
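
Roughly speaking, SMOTE creates a new point by interpolating between a minority sample and one of its nearest minority neighbours, while ADASYN uses the same kind of interpolation but generates more points around minority samples that are harder to learn. The sketch below shows only the interpolation step in NumPy; it is a simplified illustration with a hypothetical helper name, not the library's implementation.

```python
import numpy as np

def smote_like_sample(X_minority, k=2, seed=0):
    """Create one synthetic point between a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    i = rng.integers(len(X_minority))
    distances = np.linalg.norm(X_minority - X_minority[i], axis=1)
    distances[i] = np.inf  # never pick the point itself as its own neighbour
    neighbours = np.argsort(distances)[:k]
    j = rng.choice(neighbours)
    lam = rng.random()  # position along the segment, in [0, 1)
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])

# Made-up minority samples on the corners of the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(X_minority)
print(f"Synthetic sample: {synthetic}")
```

Because the new point is a convex combination of two existing minority samples, it always lies on the segment between them.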

In [7]:
smote = SampledDataStore.getSMOTESampled
ada = SampledDataStore.getADASYNSampled
samplers = [smote, ada]
sampler_names = ["SMOTE", "ADASYN"]

for i in range(len(samplers)):
    X_resampled, y_resampled = samplers[i]()
    crossvalidate.getDataStore().setxTrain(X_resampled)
    crossvalidate.getDataStore().setyTrain(y_resampled)
    classifier = LogisticRegression(class_weight={0: 1, 1: 10})
    crossvalidate.setClassifier(classifier)
    crossvalidate.run()
    f1, roc = crossvalidate.getMetrics().getScores()
    print(f"{sampler_names[i]}: ")
    print(f"F1 score is: {f1}")
    print(f"ROC-AUC is: {roc}")
    print("\n")
        
crossvalidate.getDataStore().revertToOriginal()

SMOTE: 
F1 score is: 0.6686993132649306
ROC-AUC is: 0.8241454838082516


ADASYN: 
F1 score is: 0.5863412863698032
ROC-AUC is: 0.802431136593601




imbalanced-learn also recommends combining oversampling with undersampling of the majority class.

Ref: https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/combine/plot_comparison_combine.html

SMOTE can generate noisy samples (e.g. when the classes cannot be well separated); undersampling helps clean up this noisy data.
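
One of the cleaning steps is based on Tomek links: pairs of samples from different classes that are each other's nearest neighbour. Cleaning methods then remove one or both samples of each link. The sketch below finds such pairs with a brute-force distance matrix on made-up one-dimensional data; `tomek_links` is a hypothetical helper for illustration, not the library's implementation.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(distances, np.inf)  # a point is not its own neighbour
    nearest = distances.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        j = int(j)
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, j))
    return links

# Made-up data: the samples at 1.0 (class 0) and 1.1 (class 1) sit right on the class boundary.
X = np.array([[0.0], [1.0], [1.1], [3.0]])
y = np.array([0, 0, 1, 1])

links = tomek_links(X, y)
print(f"Tomek links: {links}")
```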

In [8]:
smote_tomek = SampledDataStore.getSMOTETOMEKSampled
smote_enn = SampledDataStore.getSMOTEENNSampled
samplers = [smote_tomek, smote_enn]
sampler_names = ["SMOTE TOMEK", "SMOTE ENN"]

for i in range(len(samplers)):
    X_resampled, y_resampled = samplers[i]()
    crossvalidate.getDataStore().setxTrain(X_resampled)
    crossvalidate.getDataStore().setyTrain(y_resampled)
    classifier = LogisticRegression(class_weight={0: 1, 1: 10})
    crossvalidate.setClassifier(classifier)
    crossvalidate.run()
    f1, roc = crossvalidate.getMetrics().getScores()
    print(f"{sampler_names[i]}: ")
    print(f"F1 score is: {f1}")
    print(f"ROC-AUC is: {roc}")
    print("\n")
        
crossvalidate.getDataStore().revertToOriginal()

SMOTE TOMEK: 
F1 score is: 0.6715279378987011
ROC-AUC is: 0.8249707966279305


SMOTE ENN: 
F1 score is: 0.7544018221916917
ROC-AUC is: 0.8578015461438296




SMOTE, SMOTEENN and RandomOverSampler produce the best results so far. Let's evaluate them on our test set.

In [9]:
random = SampledDataStore.getRandomSampled
smote = SampledDataStore.getSMOTESampled
smote_enn = SampledDataStore.getSMOTEENNSampled
samplers = [random, smote, smote_enn]
sampler_names = ["Random OverSampler", "SMOTE", "SMOTE ENN"]

classifier = LogisticRegression()

for i in range(len(samplers)):
    parameters = {'class_weight':[{0:1,1:10}]}
    X_resampled, y_resampled = samplers[i]()
    GridSpace.getDataStore().setxTrain(X_resampled)
    GridSpace.getDataStore().setyTrain(y_resampled) 
    GridSpace.setGridParameters(parameters)
    GridSpace.setClassifier(classifier)
    grid = GridSpace.run()
    y_preds = grid.predict(GridSpace.getDataStore().getxTest())
    print(f"{sampler_names[i]} on test set:")
    print(f"F1 score: {f1_score(GridSpace.getDataStore().getyTest(), y_preds)}")
    print(f"ROC_AUC score: {roc_auc_score(GridSpace.getDataStore().getyTest(), y_preds)}")
    print(f"Balanced accuracy score: {balanced_accuracy_score(GridSpace.getDataStore().getyTest(), y_preds)}")
    print("\n")
    
GridSpace.getDataStore().revertToOriginal()

Random OverSampler on test set:
F1 score: 0.0638904734740445
ROC_AUC score: 0.8183981729232407
Balanced accuracy score: 0.8183981729232408


SMOTE on test set:
F1 score: 0.060296191819464044
ROC_AUC score: 0.8228175600742684
Balanced accuracy score: 0.8228175600742684


SMOTE ENN on test set:
F1 score: 0.08897775556110973
ROC_AUC score: 0.8434413510604898
Balanced accuracy score: 0.8434413510604898
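
The ROC_AUC and balanced accuracy values above are identical for each sampler. That is expected when roc_auc_score receives hard 0/1 predictions instead of probabilities: the ROC curve then has a single operating point, and the area under it equals the mean of the true positive rate and the true negative rate. The sketch below shows this on made-up labels and scores.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Made-up ground truth and predicted probabilities for class 1.
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.9])
y_pred = (scores >= 0.5).astype(int)  # hard labels at the default threshold

auc_from_labels = roc_auc_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
auc_from_scores = roc_auc_score(y_true, scores)

print(f"ROC-AUC from hard labels:   {auc_from_labels}")
print(f"Balanced accuracy:          {bal_acc}")
print(f"ROC-AUC from probabilities: {auc_from_scores}")
```

Passing the class-1 probabilities (for example from predict_proba) instead of the predicted labels would give a ROC-AUC that does not depend on the threshold.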




Logistic regression predicts class probabilities for each sample and assigns the class based on a threshold (default: 0.5). We can also check whether a different threshold value produces better results.

Ref: https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/

Let's define the relevant functions first.

In [11]:
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegressionCV

def trainAndgetProbabilities(xTrain, yTrain, xTest):
    rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=42)
    # class_weight must be a dict (or "balanced"), not a list containing a dict
    model = LogisticRegressionCV(cv=rskf, class_weight={0: 1, 1: 10})
    model.fit(xTrain, yTrain)
    return model.predict_proba(xTest)[:,1]

def convert_probs(probs, threshold):
    return (probs >= threshold).astype('int')

In [12]:
from datastore import DataStore

Data = DataStore()
random = SampledDataStore.getRandomSampled
smote = SampledDataStore.getSMOTESampled
smote_enn = SampledDataStore.getSMOTEENNSampled
samplers = [random, smote, smote_enn]
sampler_names = ["Random Oversampling", "SMOTE", "SMOTE ENN"]
thresholds = np.arange(0, 1, 0.001)

for i in range(len(samplers)):
    X_resampled, y_resampled = samplers[i]()
    probs = trainAndgetProbabilities(X_resampled, y_resampled, Data.getxTest())
    f1_scores = [f1_score(Data.getyTest(), convert_probs(probs, t)) for t in thresholds]
    roc_scores = [roc_auc_score(Data.getyTest(), convert_probs(probs, t)) for t in thresholds]
    maximize_f1 = np.argmax(f1_scores)
    maximize_roc = np.argmax(roc_scores)
    print(f"\n{sampler_names[i]} on test set:")
    print("Maximizing F1 Score:")
    print(f"Threshold: {thresholds[maximize_f1]}, F1 Score: {f1_scores[maximize_f1]}, ROC AUC: {roc_scores[maximize_f1]}")
    print("Maximizing ROC-AUC Score:")
    print(f"Threshold: {thresholds[maximize_roc]}, F1 Score: {f1_scores[maximize_roc]}, ROC AUC: {roc_scores[maximize_roc]}")


Random Oversampling on test set:
Maximizing F1 Score:
Threshold: 0.9470000000000001, F1 Score: 0.3143350604490501, ROC AUC: 0.681795495628806
Maximizing ROC-AUC Score:
Threshold: 0.114, F1 Score: 0.06988058381247236, ROC AUC: 0.8011632750856418

SMOTE on test set:
Maximizing F1 Score:
Threshold: 0.9470000000000001, F1 Score: 0.31762652705061084, ROC AUC: 0.6818189791785794
Maximizing ROC-AUC Score:
Threshold: 0.115, F1 Score: 0.07159142726858185, ROC AUC: 0.7996836228270289

SMOTE ENN on test set:
Maximizing F1 Score:
Threshold: 0.998, F1 Score: 0.3019431988041854, ROC AUC: 0.7015627029169681
Maximizing ROC-AUC Score:
Threshold: 0.107, F1 Score: 0.08415217939027463, ROC AUC: 0.8214351900710419


Better, but not ideal. The gap between the two ROC-AUC scores (computed here on the thresholded predictions) points to the trade-off: a higher threshold makes the model label fewer samples as positive (both true and false positives), which raises the F1 score but lowers ROC-AUC.
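
The same trade-off can be reproduced on a toy example. The labels and probabilities below are made up; the sketch only shows that raising the threshold removes false positives (helping F1) while also dropping a true positive (hurting the balanced score).

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

# Made-up data: 10 negatives followed by 3 positives, with predicted probabilities for class 1.
y_true = np.array([0] * 10 + [1] * 3)
probs = np.array([0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.55, 0.6, 0.7,
                  0.65, 0.8, 0.95])

results = {}
for threshold in (0.5, 0.75):
    y_pred = (probs >= threshold).astype(int)
    results[threshold] = (
        int(y_pred.sum()),                        # number of predicted positives
        f1_score(y_true, y_pred),
        balanced_accuracy_score(y_true, y_pred),
    )
    n_pos, f1, bal_acc = results[threshold]
    print(f"Threshold {threshold}: {n_pos} predicted positives, "
          f"F1 = {f1:.4f}, balanced accuracy = {bal_acc:.4f}")
```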

Overall, our results beat the baseline model but are still not ideal. Perhaps we can achieve better results with a more complex (non-linear) model. Let's try an SVM next.