In [1]:
import sys
sys.path.append("functions/")

In [2]:
from datastore import DataStore
from searchgrid import SearchGrid
from crossvalidate import CrossValidate
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import balanced_accuracy_score
from sampleddatastore import SampledDataStore as sds

In [3]:
# Create the object used for cross-validation
crossvalidate = CrossValidate()

# Create the object used for GridSearchCV
GridSpace = SearchGrid()

A standard SVM solver doesn't scale well to our particular problem, given the size of the dataset and the limited resources available. So, we will use Stochastic Gradient Descent with hinge loss instead, which is equivalent to training a linear SVM. Let's try the default parameters first:

In [4]:
classifier = SGDClassifier(loss="hinge",random_state=42)
crossvalidate.setClassifier(classifier)
crossvalidate.run()
f1, roc = crossvalidate.getMetrics().getScores()
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")

F1 score is: 0.08413351623228166
ROC-AUC is: 0.5238336749995037
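
As background on why SGD with hinge loss stands in for a linear SVM, here is a minimal pure-Python sketch of one subgradient step on the L2-regularized hinge objective. The function names, the fixed learning rate and the toy sample are illustrative assumptions; SGDClassifier's real implementation (learning-rate schedule, intercept handling, stopping criteria) is more involved.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss for one sample; y must be -1 or +1."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


def sgd_step(w, b, x, y, alpha=0.0001, lr=0.1, weight=1.0):
    """One subgradient step on alpha/2 * ||w||^2 + weight * hinge."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    # The L2 penalty shrinks every weight a little on each step
    w = [wi - lr * alpha * wi for wi in w]
    if margin < 1.0:
        # Sample violates the margin: move the boundary toward classifying it
        w = [wi + lr * weight * y * xi for wi, xi in zip(w, x)]
        b = b + lr * weight * y
    return w, b


# Hypothetical toy sample (not taken from the dataset)
w, b = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1
print(hinge_loss(w, b, x, y))   # loss before the update
w, b = sgd_step(w, b, x, y)
print(hinge_loss(w, b, x, y))   # loss after the update
```

The `weight` argument stands in for a per-class weight: a larger value for the positive class makes every margin violation on a gilded sample count for more in the update.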


That's the baseline. Let's try assigning class weights. As with logistic regression, we will start with balanced class weights.

In [5]:
classifier = SGDClassifier(loss="hinge", random_state=42, class_weight='balanced')
crossvalidate.setClassifier(classifier)
crossvalidate.run()
f1, roc = crossvalidate.getMetrics().getScores()
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")

F1 score is: 0.02984576721635957
ROC-AUC is: 0.8055003393516408
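
For context on what `class_weight='balanced'` does: scikit-learn documents the heuristic as n_samples / (n_classes * count per class). The sketch below reproduces that formula on hypothetical label counts (990 non-gilded, 10 gilded), which are not the real class counts of this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights following the documented 'balanced' heuristic:
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical labels: 990 non-gilded (0) and 10 gilded (1)
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
print(weights[1] / weights[0])  # relative weight of the minority class
```

If the gilded class is rare enough that this ratio is much larger than 1:10, the model is pushed to predict many more positives. That would be consistent with the pattern above: ROC-AUC rises, while precision and therefore F1 fall.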


Considering our results with Logistic Regression, we can hypothesize that a 1:10 non-gilded to gilded weight ratio produces the best result, but let's test it:

In [6]:
parameters = {'class_weight':[{0:1,1:1}, {0:1,1:10}, {0:1,1:100}, {0:10,1:1}]}
classifier = SGDClassifier(loss="hinge", random_state=42, max_iter=2000)
GridSpace.setGridParameters(parameters)
GridSpace.setClassifier(classifier)
GridSpace.run()
parameters, scores = GridSpace.getMetrics().getBestResults()
f1 = scores[0]
roc = scores[1]
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")
print(f"Best Parameters: {parameters}")

F1 score is: 0.24144226380648107
ROC-AUC is: 0.636675772913657
Best Parameters: {'class_weight': {0: 1, 1: 10}}


The F1 score is worse than our result from weighted Logistic Regression (0.24 vs 0.30, with the same best weights), but the ROC-AUC score is comparable (0.64 vs 0.65).

Note that max_iter is set to 2000 here: the grid search in the next block did not converge otherwise, and the higher limit also lets us check whether we can get better results.

This may suggest that SVM is not a good model for our problem. However, before drawing any conclusions, let's fit the model with several other hyperparameter values (such as an l1 or elastic net penalty). We will also experiment with different values of alpha (regularization strength).

Note: l1_ratio is only used when the penalty is elasticnet.

In [7]:
parameters = {'penalty':['l2', 'l1', 'elasticnet'], 'l1_ratio': [0.15, 0.30], 
              'alpha' : [0.0001, 0.001, 0.01, 0.1] , 'class_weight':[{0:1,1:10}]}
classifier = SGDClassifier(loss="hinge", random_state=42, max_iter=2000)
GridSpace.setGridParameters(parameters)
GridSpace.setClassifier(classifier)
GridSpace.run()
parameters, scores = GridSpace.getMetrics().getBestResults()
f1 = scores[0]
roc = scores[1]
print(f"F1 score is: {f1}")
print(f"ROC-AUC is: {roc}")
print(f"Best Parameters: {parameters}")

F1 score is: 0.28295656591313184
ROC-AUC is: 0.6441571799070308
Best Parameters: {'alpha': 0.01, 'class_weight': {0: 1, 1: 10}, 'l1_ratio': 0.15, 'penalty': 'l2'}
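
Because l1_ratio only matters for the elasticnet penalty, part of the grid above is redundant, and the 'l1_ratio': 0.15 reported next to 'penalty': 'l2' has no effect on the fitted model. The sketch below counts how many of the candidates actually differ; `effective_config` is a hypothetical helper written for this illustration.

```python
from itertools import product

grid = {
    "penalty": ["l2", "l1", "elasticnet"],
    "l1_ratio": [0.15, 0.30],
    "alpha": [0.0001, 0.001, 0.01, 0.1],
}


def effective_config(penalty, l1_ratio, alpha):
    """Drop l1_ratio when the penalty ignores it."""
    return (penalty, l1_ratio if penalty == "elasticnet" else None, alpha)


all_combos = list(product(grid["penalty"], grid["l1_ratio"], grid["alpha"]))
distinct = {effective_config(*combo) for combo in all_combos}

print(len(all_combos))  # candidates the flat grid fits
print(len(distinct))    # candidates that actually differ
```

scikit-learn's GridSearchCV accepts a list of dictionaries as its parameter grid, so one dictionary for l2/l1 and another for elasticnet would avoid the duplicate fits. Whether the SearchGrid wrapper used here passes such a list through is an assumption that would need checking.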


We can reuse these hyperparameter values with the sampling methods; we only need to set the class weight and alpha, since SGDClassifier uses the l2 penalty by default. For comparison, we will also run the models with a smaller alpha of 0.001 (scikit-learn's default alpha is 0.0001, which is not used in these runs).

As with Logistic Regression, we will experiment with different oversampling techniques. Let's try RandomOverSampler, SMOTE and ADASYN first (0.1 sampling rate; augmenting the gilded class to be roughly 10% of the total).

RandomOverSampler duplicates samples belonging to the minority class (gilded), while SMOTE and ADASYN create synthetic samples that are similar to the true ones.
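
To illustrate the difference between duplicating and synthesizing, here is a rough pure-Python sketch of both ideas on four toy points. It is not the imbalanced-learn implementation: the function names and the choice of k are assumptions, and ADASYN (which additionally generates more samples near hard-to-learn minority points) is not shown.

```python
import random


def random_oversample(minority, n_new, rng):
    """Duplicate existing minority samples (sampling with replacement)."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote_like(minority, n_new, rng, k=2):
    """Interpolate between a minority sample and one of its k nearest
    minority neighbours: new = x + u * (neighbour - x), u in [0, 1)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: sq_dist(x, p))[:k]
        nb = rng.choice(neighbours)
        u = rng.random()
        synthetic.append([xi + u * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


rng = random.Random(42)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy points
duplicates = random_oversample(minority, 3, rng)
synthetic = smote_like(minority, 3, rng)
print(duplicates)  # exact copies of existing points
print(synthetic)   # new points on segments between neighbours
```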

We can load the sampled data using SampledDataStore.

In [8]:
SampledDataStore = sds()
SampledDataStore.initializeSamplers()

Loading Sampling Data...


In [9]:
random = SampledDataStore.getRandomSampled
smote = SampledDataStore.getSMOTESampled
ada = SampledDataStore.getADASYNSampled
samplers = [random, smote, ada]
sampler_names = ["Random OverSampler", "SMOTE", "ADASYN"]
alpha_vals = [0.001, 0.01]

for name, sampler in zip(sampler_names, samplers):
    X_resampled, y_resampled = sampler()
    assert len(X_resampled) == len(y_resampled)
    # Swap the resampled data in as the training set
    crossvalidate.getDataStore().setxTrain(X_resampled)
    crossvalidate.getDataStore().setyTrain(y_resampled)
    for alpha in alpha_vals:
        classifier = SGDClassifier(loss="hinge", class_weight={0: 1, 1: 10}, alpha=alpha, random_state=42)
        crossvalidate.setClassifier(classifier)
        crossvalidate.run()
        f1, roc = crossvalidate.getMetrics().getScores()
        print(f"{name}, {alpha}:")
        print(f"F1 score is: {f1}")
        print(f"ROC-AUC is: {roc}")
        print("\n")

crossvalidate.getDataStore().revertToOriginal()

Random OverSampler, 0.001:
F1 score is: 0.7306697040739594
ROC-AUC is: 0.8247817029378792


Random OverSampler, 0.01:
F1 score is: 0.729177897574124
ROC-AUC is: 0.8198285905218491


SMOTE, 0.001:
F1 score is: 0.7290083922261484
ROC-AUC is: 0.8243498900813335


SMOTE, 0.01:
F1 score is: 0.7275133907288482
ROC-AUC is: 0.8193328643619701


ADASYN, 0.001:
F1 score is: 0.7014310832164875
ROC-AUC is: 0.809260044221294


ADASYN, 0.01:
F1 score is: 0.7001766996620404
ROC-AUC is: 0.8016497166444345




Changing the regularization strength between 0.001 and 0.01 makes little difference.

SMOTE can generate noisy samples (e.g., when the classes cannot be well separated). In such cases, imbalanced-learn recommends combining oversampling with undersampling. This can be done through SMOTETomek and SMOTEENN.

Ref: https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/combine/plot_comparison_combine.html
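
As a rough illustration of the cleaning idea behind SMOTETomek: a Tomek link is a pair of samples from different classes that are each other's nearest neighbour, which marks them as sitting on a noisy class boundary. The pure-Python sketch below finds such pairs on toy 1-D data; the helper names and data are hypothetical, and the imbalanced-learn implementation (including which member of a pair is removed) differs. SMOTEENN uses a related rule that drops samples whose label disagrees with their nearest neighbours.

```python
def nearest(idx, points):
    """Index of the nearest other point (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    others = (j for j in range(len(points)) if j != idx)
    return min(others, key=lambda j: sq_dist(points[idx], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j) that are mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# Toy 1-D data: the points at 2.0 and 2.2 sit on the class boundary
points = [[0.0], [0.5], [2.0], [2.2], [4.0], [4.5]]
labels = [0, 0, 0, 1, 1, 1]
print(tomek_links(points, labels))
```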

In [10]:
smote_tomek = SampledDataStore.getSMOTETOMEKSampled
smote_enn = SampledDataStore.getSMOTEENNSampled
samplers = [smote_tomek, smote_enn]
sampler_names = ["SMOTE TOMEK", "SMOTE ENN"]
alpha_vals = [0.001, 0.01]      

for name, sampler in zip(sampler_names, samplers):
    X_resampled, y_resampled = sampler()
    assert len(X_resampled) == len(y_resampled)
    # Swap the resampled data in as the training set
    crossvalidate.getDataStore().setxTrain(X_resampled)
    crossvalidate.getDataStore().setyTrain(y_resampled)
    for alpha in alpha_vals:
        classifier = SGDClassifier(loss="hinge", class_weight={0: 1, 1: 10}, alpha=alpha, random_state=42)
        crossvalidate.setClassifier(classifier)
        crossvalidate.run()
        f1, roc = crossvalidate.getMetrics().getScores()
        print(f"{name}, {alpha}:")
        print(f"F1 score is: {f1}")
        print(f"ROC-AUC is: {roc}")
        print("\n")

crossvalidate.getDataStore().revertToOriginal()

SMOTE TOMEK, 0.001:
F1 score is: 0.733375342251864
ROC-AUC is: 0.8282549602693855


SMOTE TOMEK, 0.01:
F1 score is: 0.7320765885754617
ROC-AUC is: 0.8219595320608377


SMOTE ENN, 0.001:
F1 score is: 0.7721603956898074
ROC-AUC is: 0.8609611811303036


SMOTE ENN, 0.01:
F1 score is: 0.7771496874766541
ROC-AUC is: 0.8463373912084468




As with Logistic Regression, SMOTEENN and RandomOverSampler produce the best results (though the results are much closer with SGD; note that the SMOTE and SMOTETomek results are very close as well). Let's evaluate on our test set (for both F1 score and balanced accuracy). Note that the classifier in the next cell does not set alpha, so scikit-learn's default regularization strength of 0.0001 applies rather than 0.001.

In [11]:
random = SampledDataStore.getRandomSampled
smote = SampledDataStore.getSMOTESampled
ada = SampledDataStore.getADASYNSampled
smote_tomek = SampledDataStore.getSMOTETOMEKSampled
smote_enn = SampledDataStore.getSMOTEENNSampled
samplers = [random, smote, ada, smote_tomek, smote_enn]
sampler_names = ["Random OverSampler", "SMOTE", "ADASYN", "SMOTE TOMEK", "SMOTE ENN"]

# alpha is not set here, so scikit-learn's default (0.0001) is used
classifier = SGDClassifier(loss="hinge", random_state=42, max_iter=2000)
parameters = {'class_weight': [{0: 1, 1: 10}]}

for name, sampler in zip(sampler_names, samplers):
    X_resampled, y_resampled = sampler()
    GridSpace.getDataStore().setxTrain(X_resampled)
    GridSpace.getDataStore().setyTrain(y_resampled)
    GridSpace.setGridParameters(parameters)
    GridSpace.setClassifier(classifier)
    grid = GridSpace.run()
    # predict() returns hard 0/1 labels, which all three metrics below receive
    y_test = GridSpace.getDataStore().getyTest()
    y_preds = grid.predict(GridSpace.getDataStore().getxTest())
    print(f"{name} on test set:")
    print(f"F1 score: {f1_score(y_test, y_preds)}")
    print(f"ROC_AUC score: {roc_auc_score(y_test, y_preds)}")
    print(f"Balanced accuracy score: {balanced_accuracy_score(y_test, y_preds)}")
    print("\n")
    
GridSpace.getDataStore().revertToOriginal()

Random OverSampler on test set:
F1 score: 0.12569610182975338
ROC_AUC score: 0.8206467075753549
Balanced accuracy score: 0.820646707575355


SMOTE on test set:
F1 score: 0.1167608286252354
ROC_AUC score: 0.8138168993952279
Balanced accuracy score: 0.8138168993952279


ADASYN on test set:
F1 score: 0.13013420089467262
ROC_AUC score: 0.8250428816466552
Balanced accuracy score: 0.8250428816466553


SMOTE TOMEK on test set:
F1 score: 0.11666666666666665
ROC_AUC score: 0.8259182812713906
Balanced accuracy score: 0.8259182812713904


SMOTE ENN on test set:
F1 score: 0.09666848716548333
ROC_AUC score: 0.8556841837186643
Balanced accuracy score: 0.8556841837186642
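
The test-set F1 scores (roughly 0.10 to 0.13) are far below the cross-validation F1 scores on the resampled data (roughly 0.70 to 0.78), while ROC-AUC stays above 0.8 in both. One plausible explanation, assuming the validation folds are drawn from the resampled data while the test set keeps the original class ratio, is that precision (and therefore F1) depends on how common the positive class is, whereas the true and false positive rates do not. The rates and prevalences in this sketch are illustrative, not measured from the dataset.

```python
def f1_at_prevalence(tpr, fpr, prevalence):
    """F1 for a classifier with fixed TPR/FPR at a given positive rate."""
    tp = tpr * prevalence
    fp = fpr * (1.0 - prevalence)
    fn = (1.0 - tpr) * prevalence
    return 2 * tp / (2 * tp + fp + fn)


tpr, fpr = 0.8, 0.2  # hypothetical operating point
for prevalence in (0.5, 0.1, 0.01):
    balanced_acc = (tpr + (1.0 - fpr)) / 2  # does not depend on prevalence
    print(prevalence,
          round(f1_at_prevalence(tpr, fpr, prevalence), 3),
          round(balanced_acc, 3))
```

With the operating point held fixed, F1 shrinks as positives become rarer while balanced accuracy is unchanged, which matches the qualitative pattern in the outputs above.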




With the updated feature set, SVM achieved slightly worse results than Logistic Regression (e.g., an F1 score of 0.13 vs 0.14 with RandomOverSampler; default thresholding).

A better F1 score was achieved with modified thresholding; however, that is neither suitable nor a fair comparison. This is reflected in the roc_auc value: we can hypothesize that, with a larger threshold, the model simply predicted fewer samples to be positive (gilded), resulting in fewer true and false positives and, as a result, a smaller roc_auc value.
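
One detail helps explain why the roc_auc value moves with the threshold at all: in the test-set cell, roc_auc_score receives the hard 0/1 labels from predict(), and ROC-AUC computed from hard labels reduces to balanced accuracy, which is why the two columns in the output match to the last digits. The pure-Python sketch below shows this on toy data; the labels and scores are hypothetical.

```python
def auc_from_scores(y_true, scores):
    """ROC-AUC as the probability that a random positive outranks a
    random negative (ties count as one half)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the positive and on the negative class."""
    tpr = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1) / y_true.count(1)
    tnr = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0) / y_true.count(0)
    return (tpr + tnr) / 2


# Toy labels and continuous scores (hypothetical)
y_true = [0, 0, 0, 0, 1, 1]
scores = [-2.0, -1.0, 0.5, -0.5, 1.5, -0.2]
y_pred = [1 if s > 0 else 0 for s in scores]

print(auc_from_scores(y_true, y_pred))    # AUC of hard labels
print(balanced_accuracy(y_true, y_pred))  # same value
print(auc_from_scores(y_true, scores))    # AUC of continuous scores
```

A threshold-independent ROC-AUC would need continuous scores instead, for example the output of decision_function, assuming the object returned by GridSpace.run() exposes that method.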

We will try Decision Trees next.