In [1]:
from collections import Counter
from imblearn.datasets import fetch_datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
import numpy as np

Using TensorFlow backend.


# The importance of balancing an imbalanced dataset

When implementing classification algorithms, the distribution of your dataset is important. The prediction performance for each possible output class relies on the number of observations available for that class.


For example, let's say the majority class (the one with the most observations) clearly dominates the other classes, so the decision boundaries for the other classes are barely recognizable. As we keep adding observations to the underrepresented classes, the decision boundaries change dramatically. Thus, when making predictions about a minority class, we need to find ways to avoid being misled by the majority class.


Let's define a simple helper to keep track of the results.


In [2]:
def print_results(headline, true_value, pred):
    print(headline)
    print("accuracy: {}".format(accuracy_score(true_value, pred)))
    print("precision: {}".format(precision_score(true_value, pred)))
    print("recall: {}".format(recall_score(true_value, pred)))
    print("f1: {}".format(f1_score(true_value, pred)))

For this example we will use the wine quality dataset, which is provided by the imblearn package in its datasets module. Let's explore the dataset.

In [3]:
data = fetch_datasets()['wine_quality']

In [6]:
data

{'data': array([[ 7.  ,  0.27,  0.36, ...,  3.  ,  0.45,  8.8 ],
        [ 6.3 ,  0.3 ,  0.34, ...,  3.3 ,  0.49,  9.5 ],
        [ 8.1 ,  0.28,  0.4 , ...,  3.26,  0.44, 10.1 ],
        ...,
        [ 6.5 ,  0.24,  0.19, ...,  2.99,  0.46,  9.4 ],
        [ 5.5 ,  0.29,  0.3 , ...,  3.34,  0.38, 12.8 ],
        [ 6.  ,  0.21,  0.38, ...,  3.26,  0.32, 11.8 ]]),
 'target': array([-1, -1, -1, ..., -1, -1, -1], dtype=int64),
 'DESCR': 'wine_quality'}

In [11]:
unique, counts = np.unique(data.target, return_counts=True)

In [13]:
print(np.asarray((unique, counts)).T)

[[  -1 4715]
 [   1  183]]


A simple count shows that the majority class (-1) has 4715 observations, as opposed to 183 for the minority class (1). This means the dataset is imbalanced, and any decision boundary learned from it may give misleading prediction performance.
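
To make the imbalance concrete, the short sketch below computes the imbalance ratio and the accuracy that a trivial "always predict the majority class" model would reach. The counts are copied from the output above; the variable names are only illustrative.

```python
# Class counts copied from the np.unique output above.
counts = {-1: 4715, 1: 183}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Number of majority observations per minority observation.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Accuracy of a "model" that always predicts the majority label.
baseline_accuracy = counts[majority_label] / total

print("imbalance ratio: {:.1f} : 1".format(imbalance_ratio))
print("majority-only baseline accuracy: {:.3f}".format(baseline_accuracy))
```

A model can therefore score very high on accuracy without ever identifying a single minority sample, which is why accuracy alone is not a reliable metric here.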

Classification: we will now use a random forest classifier, first on its own and then combined with over-sampling and under-sampling steps.

In [14]:
classifier = RandomForestClassifier

In [15]:
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'], random_state=2)

# SMOTE and NEARMISS



SMOTE is an over-sampling method. It creates synthetic (not duplicate) samples of the minority class until the minority class is as large as the majority class. SMOTE does this by taking a minority record, selecting one of its nearest minority-class neighbours, and generating a new record that lies a random fraction of the way between the two.
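
The way a synthetic sample is generated can be sketched in a few lines. This is a simplified illustration, not imblearn's implementation: it assumes the neighbour is already known (the real algorithm finds it with a k-nearest-neighbours search) and that one random factor is shared across all features. The function name `smote_like_sample` is hypothetical.

```python
import random


def smote_like_sample(sample, neighbour, rng):
    """Return a synthetic point on the segment between sample and neighbour."""
    gap = rng.random()  # one random factor in [0, 1), shared by all features
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority_sample = [1.0, 2.0]
minority_neighbour = [3.0, 6.0]

synthetic = smote_like_sample(minority_sample, minority_neighbour, rng)
print(synthetic)
```

Because the new point always lies between two existing minority records, it is similar to them without being an exact copy of either.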


NearMiss, in contrast, is an under-sampling technique. Instead of resampling the minority class, it uses a distance measure to decide which majority-class samples to keep, shrinking the majority class until it is as small as the minority class.
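
The distance-based selection can be illustrated roughly as follows. The sketch mimics the idea behind the first NearMiss variant (keep the majority samples whose mean distance to their k closest minority samples is smallest); it is an assumption-laden toy version, not imblearn's code, and the function name `nearmiss1_like` and the sample points are made up for illustration.

```python
import math


def nearmiss1_like(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points closest, on average, to their
    k nearest minority points."""

    def mean_dist_to_nearest_minority(point):
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:k]
        return sum(nearest) / len(nearest)

    ranked = sorted(majority, key=mean_dist_to_nearest_minority)
    return ranked[:n_keep]


# Toy data: four majority points, two minority points.
majority = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
minority = [(1.0, 0.0), (0.0, 1.0)]

# Shrink the majority class to the size of the minority class.
kept = nearmiss1_like(majority, minority, n_keep=len(minority), k=2)
print(kept)
```

The majority points far from any minority sample are discarded, so the retained data concentrates around the region where the two classes meet.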

In [16]:
# build normal model
pipeline = make_pipeline(classifier(random_state=42))
model = pipeline.fit(X_train, y_train)
prediction = model.predict(X_test)

# build model with SMOTE imblearn
smote_pipeline = make_pipeline_imb(SMOTE(random_state=4), classifier(random_state=42))
smote_model = smote_pipeline.fit(X_train, y_train)
smote_prediction = smote_model.predict(X_test)

# build model with undersampling
# (NearMiss is deterministic; recent imblearn releases no longer accept random_state)
nearmiss_pipeline = make_pipeline_imb(NearMiss(), classifier(random_state=42))
nearmiss_model = nearmiss_pipeline.fit(X_train, y_train)
nearmiss_prediction = nearmiss_model.predict(X_test)



Now that we have created the models, we'd like to check their validation metrics. This means evaluating each model with metrics that show the distribution of the data before and after resampling, as well as the classification report.

In [19]:
# print information about all three models
print()
print("normal data distribution: {}".format(Counter(data['target'])))
# fit_resample replaces the older fit_sample, which newer imblearn releases removed
X_smote, y_smote = SMOTE().fit_resample(data['data'], data['target'])
print("SMOTE data distribution: {}".format(Counter(y_smote)))
X_nearmiss, y_nearmiss = NearMiss().fit_resample(data['data'], data['target'])
print("NearMiss data distribution: {}".format(Counter(y_nearmiss)))

# classification report
print(classification_report(y_test, prediction))
print(classification_report_imbalanced(y_test, smote_prediction))
print(classification_report_imbalanced(y_test, nearmiss_prediction))

print()
print('normal Pipeline Score {}'.format(pipeline.score(X_test, y_test)))
print('SMOTE Pipeline Score {}'.format(smote_pipeline.score(X_test, y_test)))
print('NearMiss Pipeline Score {}'.format(nearmiss_pipeline.score(X_test, y_test)))


print()
print_results("normal classification", y_test, prediction)
print()
print_results("SMOTE classification", y_test, smote_prediction)
print()
print_results("NearMiss classification", y_test, nearmiss_prediction)


normal data distribution: Counter({-1: 4715, 1: 183})
SMOTE data distribution: Counter({-1: 4715, 1: 4715})
NearMiss data distribution: Counter({-1: 183, 1: 183})
              precision    recall  f1-score   support

          -1       0.97      0.99      0.98      1182
           1       0.46      0.14      0.21        43

    accuracy                           0.96      1225
   macro avg       0.72      0.57      0.60      1225
weighted avg       0.95      0.96      0.95      1225

                   pre       rec       spe        f1       geo       iba       sup

         -1       0.98      0.97      0.33      0.97      0.56      0.34      1182
          1       0.31      0.33      0.97      0.32      0.56      0.30        43

avg / total       0.95      0.95      0.35      0.95      0.56      0.34      1225

                   pre       rec       spe        f1       geo       iba       sup

         -1       0.96      0.35      0.65      0.51      0.47      0.22      1182
       

# Conclusion

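The normal classifier, without any over- or under-sampling, reaches a high accuracy (0.96), but that figure is dominated by the majority class: its recall on the minority class is only 0.14. If we want to predict the minority class, accuracy is therefore a poor guide, and the best metric is a trade-off between precision and recall. With SMOTE, minority-class recall rises to 0.33 while precision drops from 0.46 to 0.31.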

Recall measures how many of the samples we are interested in (the actual positives) the model manages to find. Precision, in contrast, is the measure to watch when the cost of a false positive is high. Take email spam detection: a false positive means that a non-spam email (actual negative) has been flagged as spam (predicted positive).
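
As a worked example of the trade-off, the sketch below computes precision, recall, F1 and accuracy from confusion-matrix counts. The counts are illustrative assumptions, chosen so that they round to the figures in the normal classifier's report above; they are not read from the notebook's actual confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive (minority) class."""
    precision = tp / (tp + fp)  # of everything flagged positive, how much is right
    recall = tp / (tp + fn)     # of all actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts (assumed, not taken from the notebook output).
tp, fp, fn, tn = 6, 7, 37, 1175

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print("accuracy:  {:.2f}".format(accuracy))
print("precision: {:.2f}".format(precision))
print("recall:    {:.2f}".format(recall))
print("f1:        {:.2f}".format(f1))
```

Accuracy stays high because the true negatives dominate the total, while recall and F1 expose how few minority samples are actually found.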