# Unbalanced Data

Unbalanced data is one of the most common stumbling blocks for new data science practitioners. This is because, ultimately, it is sometimes about accepting failure. On some problems you can get okay class separation, but most of the time, you are going to fail on imbalanced problems. There are tools in few-shot and zero-shot learning that can help, but only on some classes of problems and only under certain constraints.

After going through undergrad and likely a master's or maybe even two, the feeling for most new practitioners is that the suite of machine learning tools available today can solve any problem! With the kernel tricks of SVMs, the highly non-linear universal approximators from tree-based algorithms and from the mighty neural network, all problems must assuredly fall!

And yet, it turns out all these genius algorithms aren't always all that smart.  Sometimes, all the tricks in the world won't get you a good looking confusion matrix.  There are some things we can do, to be sure.  But even they may fail in some cases.

Let's begin with a somewhat realistic data split, roughly 90/10 (ten majority samples for every minority sample):

In [16]:
from imblearn.datasets import make_imbalance
from sklearn.datasets import make_moons
import pandas as pd
from collections import Counter
import warnings
warnings.filterwarnings("ignore")

def ratio_func(y, multiplier, minority_class):
    target_stats = Counter(y)
    return {minority_class: int(multiplier * target_stats[minority_class])}

X, y = make_moons(n_samples=4000, shuffle=True, noise=0.5, random_state=10)
X = pd.DataFrame(X, columns=["feature 1", "feature 2"])

multiplier = 0.1
X_resampled, y_resampled = make_imbalance(
    X,
    y,
    sampling_strategy=ratio_func,
    **{"multiplier": multiplier, "minority_class": 1},
)

In [17]:
pd.Series(y_resampled).value_counts()

0    2000
1     200
dtype: int64

In [18]:
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled)
svc = SVC()
svc.fit(X_train, y_train)
pred = svc.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.94      0.99      0.96       505
           1       0.61      0.24      0.35        45

    accuracy                           0.93       550
   macro avg       0.77      0.62      0.65       550
weighted avg       0.91      0.93      0.91       550



Ofph!  Not looking great.  That class 1 recall _sucks_.  Let's see if we can do better with a Linear SVC:

In [9]:
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled)
l_svc = LinearSVC()
l_svc.fit(X_train, y_train)
pred = l_svc.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.92      0.98      0.95       496
           1       0.58      0.20      0.30        54

    accuracy                           0.91       550
   macro avg       0.75      0.59      0.63       550
weighted avg       0.89      0.91      0.89       550





Not really any better! Okay, let's try some hyperparameter tuning. Because we are using cross-validation, we can stick with just a train and test split: each fold of the training data takes a turn as the validation set. If we didn't use CV then we'd need an explicit validation set!
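To make the cross-validation point concrete, here is a minimal sketch of what happens under the hood, written out by hand. The synthetic data from `make_classification` and names like `X_demo` and `fold_scores` are illustrative assumptions, not part of the notebook above. The test split is set aside and never touched while the folds are scored.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Synthetic 90/10 data (an assumption, standing in for the resampled moons)
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    weights=[0.9, 0.1],
    random_state=0,
)

# The test split is held back and not used anywhere in the loop below
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for fit_idx, val_idx in cv.split(X_tr, y_tr):
    # Each fold takes one turn as the validation set
    model = SVC(C=1.0)
    model.fit(X_tr[fit_idx], y_tr[fit_idx])
    val_pred = model.predict(X_tr[val_idx])
    # F1 on the minority class; zero_division=0 avoids a warning
    # when a fold predicts no positives at all
    fold_scores.append(f1_score(y_tr[val_idx], val_pred, zero_division=0))

print(fold_scores)
```

The spread across the five fold scores gives a rough feel for how noisy a single minority-class number can be on data this small.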

In [12]:
from sklearn.model_selection import GridSearchCV

tuned_parameters = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]
svc = SVC()
grid_search = GridSearchCV(svc, tuned_parameters)
grid_search.fit(X_train, y_train)
svc = SVC(**grid_search.best_params_)
svc.fit(X_train, y_train)
pred = svc.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.90      1.00      0.95       496
           1       0.00      0.00      0.00        54

    accuracy                           0.90       550
   macro avg       0.45      0.50      0.47       550
weighted avg       0.81      0.90      0.86       550





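Well, that was somehow worse! Ofph. The grid search scores candidates by accuracy by default, so it settled on a model that predicts class 0 for everything: 90% accuracy, zero recall on class 1. Let's see what happens if we try a stratified train/test split: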

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, stratify=y_resampled)
l_svc = LinearSVC()
l_svc.fit(X_train, y_train)
pred = l_svc.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.93      0.99      0.96       500
           1       0.71      0.24      0.36        50

    accuracy                           0.92       550
   macro avg       0.82      0.61      0.66       550
weighted avg       0.91      0.92      0.90       550





Okay! Finally some slight improvement, although with splits this small it could easily be noise. Still, it looks like paying attention to the class balance in the data _may_ be what we need! Time to bring in imbalanced-learn to see if we can do any better.
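For reference, the effect of `stratify` is easiest to see by counting minority samples in the test split across a few seeds. This is a small sketch on toy labels with the same 2000/200 class counts as above; the names `y_toy`, `plain_counts` and `strat_counts` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the same class counts as the resampled data above
y_toy = np.array([0] * 2000 + [1] * 200)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this check

plain_counts = []
strat_counts = []
for seed in range(5):
    # Plain random split: the minority count in the test set drifts with the seed
    _, _, _, y_plain = train_test_split(X_toy, y_toy, random_state=seed)
    # Stratified split: class proportions are preserved in both splits
    _, _, _, y_strat = train_test_split(
        X_toy, y_toy, stratify=y_toy, random_state=seed
    )
    plain_counts.append(int(y_plain.sum()))
    strat_counts.append(int(y_strat.sum()))

print("plain:     ", plain_counts)
print("stratified:", strat_counts)
```

Stratifying does not change the imbalance itself. It only keeps each split at the same class ratio as the full dataset, which makes the reported class 1 metrics less dependent on a lucky or unlucky split.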

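Rebalancing means either oversampling the minority class or undersampling the majority class to make the class distribution more balanced. Take heed! You can only rebalance the training data. The test set has to keep the original class distribution, otherwise the scores won't reflect what the model will see in practice.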
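As a rough illustration of those two ideas, here is a hand-rolled sketch using only NumPy indexing. It is an approximation of what random under- and oversampling do, not how imbalanced-learn implements them, and the toy data and variable names are assumptions for the example. Note that only the training split is resampled.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy 90/10 labels with two random features (illustrative data only)
y_toy = np.array([0] * 900 + [1] * 100)
X_toy = rng.normal(size=(1000, 2))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0
)

minority_idx = np.flatnonzero(y_tr == 1)
majority_idx = np.flatnonzero(y_tr == 0)

# Undersampling: keep a random subset of the majority class,
# the same size as the minority class
under_idx = np.concatenate([
    rng.choice(majority_idx, size=len(minority_idx), replace=False),
    minority_idx,
])
X_under, y_under = X_tr[under_idx], y_tr[under_idx]

# Oversampling: draw minority rows with replacement until
# they match the majority class in number
over_idx = np.concatenate([
    majority_idx,
    rng.choice(minority_idx, size=len(majority_idx), replace=True),
])
X_over, y_over = X_tr[over_idx], y_tr[over_idx]

print("train:       ", np.bincount(y_tr))
print("undersampled:", np.bincount(y_under))
print("oversampled: ", np.bincount(y_over))
print("test:        ", np.bincount(y_te))  # left at the original ratio
```

Undersampling throws information away and oversampling repeats the same few minority points, which is part of why neither is a free lunch.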

In [19]:
from imblearn.under_sampling import (
    RandomUnderSampler,
    ClusterCentroids,
    TomekLinks,
    EditedNearestNeighbours,
    OneSidedSelection,
    NeighbourhoodCleaningRule,
    InstanceHardnessThreshold,
    NearMiss,
    RepeatedEditedNearestNeighbours,
)

def rebalance_learn(X_train, y_train, random_state: int = 123):
    rus = RandomUnderSampler(random_state=random_state)
    X_train_ru, y_train_ru = rus.fit_resample(X_train, y_train)

    tl = TomekLinks()
    X_train_tl, y_train_tl = tl.fit_resample(X_train, y_train)

    enn = EditedNearestNeighbours()
    X_train_enn, y_train_enn = enn.fit_resample(X_train, y_train)

    r_enn = RepeatedEditedNearestNeighbours()
    X_train_r_enn, y_train_r_enn = r_enn.fit_resample(X_train, y_train)

    oss = OneSidedSelection(random_state=random_state)
    X_train_oss, y_train_oss = oss.fit_resample(X_train, y_train)

    ncr = NeighbourhoodCleaningRule()
    X_train_ncr, y_train_ncr = ncr.fit_resample(X_train, y_train)

    iht = InstanceHardnessThreshold(random_state=random_state)
    X_train_iht, y_train_iht = iht.fit_resample(X_train, y_train)

    nm = NearMiss()
    X_train_nm, y_train_nm = nm.fit_resample(X_train, y_train)

    return [
        (X_train_ru, y_train_ru),
        (X_train_tl, y_train_tl),
        (X_train_enn, y_train_enn),
        (X_train_r_enn, y_train_r_enn),
        (X_train_oss, y_train_oss),
        (X_train_ncr, y_train_ncr),
        (X_train_iht, y_train_iht),
        (X_train_nm, y_train_nm),
    ]

samplers = [
    "Random Under Sampler",
    "Tomek Links",
    "Edited Nearest Neighbors",
    "Repeated Edited Nearest Neighbors",
    "One Sided Selection",
    "Neighbourhood Cleaning Rule",
    "Instance Hardness Threshold",
    "Near Miss"
]

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, stratify=y_resampled)
rebalanced_datasets = rebalance_learn(X_train, y_train, random_state = 123)
for method, (X_train_rb, y_train_rb) in zip(samplers, rebalanced_datasets):
    # Fit on the rebalanced training data, evaluate on the untouched test set
    l_svc = LinearSVC()
    l_svc.fit(X_train_rb, y_train_rb)
    pred = l_svc.predict(X_test)
    print("method", method)
    print(classification_report(y_test, pred))

method Random Under Sampler
              precision    recall  f1-score   support

           0       0.97      0.76      0.85       500
           1       0.25      0.80      0.38        50

    accuracy                           0.76       550
   macro avg       0.61      0.78      0.61       550
weighted avg       0.91      0.76      0.81       550

method Tomek Links
              precision    recall  f1-score   support

           0       0.93      0.99      0.96       500
           1       0.68      0.30      0.42        50

    accuracy                           0.92       550
   macro avg       0.81      0.64      0.69       550
weighted avg       0.91      0.92      0.91       550

method Edited Nearest Neighbors
              precision    recall  f1-score   support

           0       0.95      0.95      0.95       500
           1       0.47      0.46      0.46        50

    accuracy                           0.90       550
   macro avg       0.71      0.70      0.71      

In summary, we get better class 1 recall than before with rebalancing, though often at the cost of precision! But still not great. And that's basically the point. Rebalancing can get you part of the way, but it can't get you all the way if your data is severely unbalanced. We could still try more hyperparameter tuning and more model classes. Additionally, we can try anomaly detection methods:

* https://scikit-learn.org/stable/auto_examples/covariance/plot_mahalanobis_distances.html - Mahalanobis distances
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html - isolation forests
* https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html - one-class SVM
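To give a feel for the anomaly detection framing, here is a minimal sketch with `IsolationForest`, treating the minority class as the "anomalies". The data is an assumption made up for the example: a dense cluster plus a small, well-separated group, which is a much easier case than the noisy moons above, so don't expect results like this to carry over.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Illustrative data: 950 "normal" points and 50 rare points far away
X_normal = rng.normal(loc=0.0, scale=1.0, size=(950, 2))
X_rare = rng.normal(loc=6.0, scale=1.0, size=(50, 2))
X_all = np.vstack([X_normal, X_rare])
y_all = np.array([0] * 950 + [1] * 50)

# contamination is the expected fraction of anomalies; here we assume
# we roughly know the minority share (5%)
iso = IsolationForest(contamination=0.05, random_state=0)

# fit_predict never sees the labels: it returns +1 for inliers, -1 for outliers
raw_pred = iso.fit_predict(X_all)

# Map outliers (-1) to class 1 so the output lines up with y_all
pred = (raw_pred == -1).astype(int)
print(classification_report(y_all, pred, zero_division=0))
```

The design difference from the classifiers above is that the model is fit without labels, so the labels are only used for scoring. That can help when minority examples are too few to learn from directly, but it only works if the minority class really does look unusual in feature space.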

If none of that works, then it's on to few-shot and zero-shot learning methods.