<img src = 'over_under_sampling.jpg'>
<img src = 'over_under_sampling_1.jpg'>

In [14]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline

import seaborn as sns
import pandas as pd
import numpy as np

from collections import Counter

from numpy import mean

## Random OverSampling

In [15]:
## Define an imbalanced dataset: a simple synthetic binary classification
## problem with a roughly 1:100 class imbalance (99% majority, 1% minority).
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)

print("X-type : ",type(X))
print("y-type : ",type(y))

print("X-Dimension : ",X.ndim)
print("Y-Dimension : ",y.ndim)

X-type :  <class 'numpy.ndarray'>
y-type :  <class 'numpy.ndarray'>
X-Dimension :  2
Y-Dimension :  1


In [16]:
print(Counter(y))
np.array(np.unique(y, return_counts=True))

Counter({0: 9900, 1: 100})


array([[   0,    1],
       [9900,  100]], dtype=int64)

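Setting sampling_strategy='minority' oversamples the minority class until it matches the majority class.
For example, if the majority class had 1,000 examples and the minority class had 100, 
this strategy would oversample the minority class so that it has 1,000 examples.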
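
To make the mechanics concrete, here is a minimal pure-Python sketch of random oversampling for a binary problem. It is not how imblearn implements RandomOverSampler; the helper name oversample_minority and the toy labels are hypothetical and only illustrate the idea of drawing minority rows with replacement until the classes match.

```python
import random
from collections import Counter

def oversample_minority(labels, seed=0):
    """Return row indices that balance a binary label list (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    # For a binary problem, sorting by count gives [minority, majority].
    minority, majority = sorted(counts, key=counts.get)
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority]
    n_extra = counts[majority] - counts[minority]
    # Draw the extra minority rows with replacement (duplicates are expected).
    extra = rng.choices(minority_idx, k=n_extra)
    return list(range(len(labels))) + extra

# Toy labels matching the example in the text: 1,000 majority, 100 minority.
labels = [0] * 1000 + [1] * 100
idx = oversample_minority(labels)
balanced = Counter(labels[i] for i in idx)
print(balanced)  # both classes now have 1,000 examples
```

No new information is created: the 900 added rows are copies of the original 100 minority rows.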


In [17]:
# define oversampling strategy: duplicate minority-class examples
# until the minority class is as large as the majority class
oversample = RandomOverSampler(sampling_strategy='minority')
type(oversample)

imblearn.over_sampling._random_over_sampler.RandomOverSampler

In [18]:
oversample = RandomOverSampler(sampling_strategy=0.5)


A floating-point value can be specified to indicate the ratio of minority-class to majority-class 
examples in the transformed dataset.

This would ensure that the minority class was oversampled to have half the number of examples as the majority class, for binary classification problems.

This means that if the majority class had 1,000 examples and the minority class had 100, 
the transformed dataset would have 500 examples of the minority class.
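
The arithmetic behind the float setting can be sketched as follows. The helper minority_target is hypothetical (it is not part of imblearn) and assumes a binary problem where the ratio is defined as minority count divided by majority count.

```python
def minority_target(n_majority, ratio):
    """Minority-class size after oversampling to ratio = minority / majority."""
    return int(n_majority * ratio)

# The example from the text: 1,000 majority examples, ratio 0.5.
print(minority_target(1000, 0.5))        # 500
# The synthetic dataset in this notebook: 9,900 majority examples.
print(minority_target(9900, 0.5))        # 4950
# Number of duplicated rows that must be added to 100 original minority rows.
print(minority_target(1000, 0.5) - 100)  # 400
```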


In [19]:
# fit and apply the transform
X_over, y_over = oversample.fit_resample(X, y)

In [20]:
print(Counter(y_over))
np.array(np.unique(y_over, return_counts=True))

Counter({0: 9900, 1: 4950})


array([[   0,    1],
       [9900, 4950]], dtype=int64)

In [21]:
# define pipeline
steps = [('over', RandomOverSampler()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)

# evaluate pipeline
cv     = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='f1_micro', cv=cv, n_jobs=-1)
score  = mean(scores)
print('F1 Score: %.3f' % score)

F1 Score: 0.997


This example evaluates a decision tree on the imbalanced dataset with a 1:100 class distribution.

The model is evaluated using repeated 10-fold cross-validation with three repeats, 
and the oversampling is performed on the training dataset within each fold separately, 
ensuring that there is no data leakage as might occur if the oversampling was performed prior 
to the cross-validation.
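
The leakage argument can be illustrated with a small pure-Python sketch. The row ids, split sizes, and variable names below are hypothetical toy values, not taken from the notebook's dataset. The point is only that duplicating rows before splitting can place copies of the same row on both sides of the split, while splitting first cannot.

```python
import random

rng = random.Random(0)

# Hypothetical toy data: row ids 0..89 are majority, 90..99 are minority.
majority_rows = list(range(90))
minority_rows = list(range(90, 100))

# (a) Oversample first, then split: copies of one row may land in both parts.
oversampled = majority_rows + minority_rows + rng.choices(minority_rows, k=80)
rng.shuffle(oversampled)
n_test = len(oversampled) // 5
test_early, train_early = oversampled[:n_test], oversampled[n_test:]
leaked_early = set(test_early) & set(train_early)

# (b) Split first (stratified by hand), then oversample only the training part.
test_late = majority_rows[:18] + minority_rows[:2]
train_majority = majority_rows[18:]
train_minority = minority_rows[2:]
extra = rng.choices(train_minority, k=len(train_majority) - len(train_minority))
train_late = train_majority + train_minority + extra
leaked_late = set(test_late) & set(train_late)

print("shared row ids, oversample then split:", len(leaked_early))
print("shared row ids, split then oversample:", len(leaked_late))
```

In case (b) the count is always zero, because the duplicates are drawn only from rows already assigned to the training part. Placing the sampler inside an imblearn Pipeline gives the same guarantee for every cross-validation fold.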

Running the example evaluates the decision tree model on the imbalanced dataset with oversampling.

The chosen model and resampling configuration are arbitrary, designed to provide a template that
you can use to test oversampling with your dataset and learning algorithm, 
rather than optimally solve the synthetic dataset.

## Random UnderSampling

In [22]:
# define undersampling strategy: randomly drop majority-class examples
# until the majority class is as small as the minority class
undersample = RandomUnderSampler(sampling_strategy='majority')
type(undersample)

# alternatively, undersample until the minority/majority ratio is 0.5
undersample = RandomUnderSampler(sampling_strategy=0.5)

In [23]:
print(Counter(y))
np.array(np.unique(y, return_counts=True))

Counter({0: 9900, 1: 100})


array([[   0,    1],
       [9900,  100]], dtype=int64)

In [24]:
# fit and apply the transform
X_under, y_under = undersample.fit_resample(X, y)

In [25]:
print(Counter(y_under))
np.array(np.unique(y_under, return_counts=True))

Counter({0: 200, 1: 100})


array([[  0,   1],
       [200, 100]], dtype=int64)
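
The counts above follow from the ratio: with 100 minority examples and a minority/majority ratio of 0.5, the majority class is cut to 100 / 0.5 = 200 examples. A minimal pure-Python sketch of the idea is shown below. The helper undersample_majority is hypothetical, assumes a binary problem, and samples without replacement; it is not imblearn's implementation.

```python
import random
from collections import Counter

def undersample_majority(labels, ratio=1.0, seed=0):
    """Return row indices after dropping majority rows (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    # For a binary problem, sorting by count gives [minority, majority].
    minority, majority = sorted(counts, key=counts.get)
    n_keep = int(counts[minority] / ratio)
    majority_idx = [i for i, lab in enumerate(labels) if lab == majority]
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority]
    # Keep a random subset of majority rows, drawn without replacement.
    kept = rng.sample(majority_idx, k=n_keep)
    return sorted(kept + minority_idx)

# Same class counts as the synthetic dataset in this notebook.
labels = [0] * 9900 + [1] * 100
idx = undersample_majority(labels, ratio=0.5)
resampled = Counter(labels[i] for i in idx)
print(resampled)  # 200 majority rows and 100 minority rows remain
```

Unlike oversampling, this discards data: 9,700 of the 9,900 majority rows are never seen by the model.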

In [26]:
# define pipeline: undersample the training folds, then fit the model
steps = [('under', RandomUnderSampler()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)

# evaluate pipeline
cv     = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='f1_micro', cv=cv, n_jobs=-1)
score  = mean(scores)
print('F1 Score: %.3f' % score)
