### Exploring methods to reduce the problems caused by imbalanced datasets in ML

Imbalanced datasets can affect both the learning phase and the subsequent predictions of a model. The decision function of a classifier will usually favour the majority class.

Common remedies:
- Undersample the majority class
- Oversample the minority class
- Generate synthetic data for the minority class
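
The three remedies listed above can be sketched in plain Python. This is a minimal illustration on a made-up toy dataset, not the imbalanced-learn implementation: the helper names `random_undersample`, `random_oversample` and `interpolate` are hypothetical, and `interpolate` only shows the interpolation idea that SMOTE-style methods build on (it does not do any nearest-neighbour search).

```python
import random
from collections import Counter


def random_undersample(X, y, rng):
    """Drop samples at random until every class is as small as the rarest one."""
    n_min = min(Counter(y).values())
    keep = []
    for label in sorted(set(y)):
        idx = [i for i, t in enumerate(y) if t == label]
        keep.extend(rng.sample(idx, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, rng):
    """Duplicate samples at random until every class matches the largest one."""
    n_max = max(Counter(y).values())
    X_out, y_out = list(X), list(y)
    for label in sorted(set(y)):
        idx = [i for i, t in enumerate(y) if t == label]
        extra = [rng.choice(idx) for _ in range(n_max - len(idx))]
        X_out.extend(X[i] for i in extra)
        y_out.extend(y[i] for i in extra)
    return X_out, y_out


def interpolate(a, b, rng):
    """Return a synthetic point on the segment between two minority samples."""
    t = rng.random()
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]


rng = random.Random(0)
# Toy data (hypothetical): 9 samples of class 0 and 3 samples of class 1
X_toy = [[float(i), float(i % 3)] for i in range(12)]
y_toy = [0] * 9 + [1] * 3

X_under, y_under = random_undersample(X_toy, y_toy, rng)
X_over, y_over = random_oversample(X_toy, y_toy, rng)
synthetic = interpolate(X_toy[9], X_toy[10], rng)

print(Counter(y_under))
print(Counter(y_over))
print(synthetic)
```

Undersampling throws information away and oversampling repeats it, which is why synthetic generation is often tried as a middle ground.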

How should we rebalance the dataset?
- Equal proportions for the classes?
- Majority class stays the most represented?

There is no straightforward answer, because resampling changes the class distribution the model sees, so the training data no longer reflects reality.
Use with caution.
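
To make the two options concrete, here is a small sketch that turns a desired minority-to-majority ratio into per-class target counts. The helper `oversampling_targets` and its `ratio` parameter are hypothetical names introduced for illustration; the resulting dict has the same shape as the `sampling_strategy` dict passed to `make_imbalance` in this notebook.

```python
from collections import Counter


def oversampling_targets(counts, ratio):
    """Per-class counts after oversampling so that every class has
    at least `ratio` times as many samples as the majority class."""
    n_major = max(counts.values())
    floor = int(round(ratio * n_major))
    return {label: max(n, floor) for label, n in counts.items()}


counts = Counter({0: 24, 1: 48, 2: 50})

# ratio=1.0: all classes end up equally represented
equal = oversampling_targets(counts, 1.0)
# ratio=0.8: minority classes are lifted, but the majority stays the largest
partial = oversampling_targets(counts, 0.8)

print(equal)
print(partial)
```

Which ratio is appropriate depends on the problem; this only shows how each choice translates into sample counts.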

Evaluation methods

Accuracy-based: misleading on imbalanced data, because a classifier that always predicts the majority class still scores a high accuracy.
Cost-based: 
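
The accuracy point above can be checked with a small sketch. The 95/5 class split is a made-up example, and balanced accuracy (the mean of the per-class recalls) is used here only as one possible alternative metric, not as the notebook's chosen one.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class weighs the same."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced labels: 95 samples of class 0, 5 of class 1
y_true = [0] * 95 + [1] * 5

# A "classifier" that ignores its input and always predicts the majority class
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(acc, bal)
```

The trivial predictor reaches 95% accuracy while never detecting the minority class, which the per-class average exposes.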

In [2]:
# Load datasets and helpers
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from imblearn.datasets import make_imbalance

In [3]:
iris_df = load_iris()

In [19]:
x = iris_df.data

In [20]:
y = iris_df.target

In [21]:
Counter(y)

Counter({0: 50, 1: 50, 2: 50})

In [22]:
X, Y = make_imbalance(x, y, sampling_strategy={0: 24, 1: 48, 2: 50}, random_state=42)

In [None]:
# Illustrate the imbalanced dataset

In [17]:
# Plotting
import matplotlib.pyplot as plt
import numpy as np

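def plot_decision_function(X, y, clf, ax):
    """Plot the decision regions of a fitted two-feature classifier and the samples."""
    plot_step = 0.02
    # Grid spanning both features, with a margin of 1 on each side
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))

    # Predict a class for every grid point, then draw regions and samples
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.4)
    ax.scatter(X[:, 0], X[:, 1], alpha=0.8, c=y, edgecolor='k')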

In [10]:
fig, ax1 = plt.subplots(1,2, figsize=(15, 7))

In [11]:
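# Fit on the first two features only, so the classifier matches the 2-D plotting grid
clf = LinearSVC().fit(X[:, :2], Y)

In [18]:
# ax1 is an array of two axes; draw on the first one
plot_decision_function(X[:, :2], Y, clf, ax1[0])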

In [65]:
from sklearn.datasets import make_classification
def create_dataset(n_samples=1000, weights=(0.01, 0.01, 0.98), n_classes=3,
                   class_sep=0.8, n_clusters=1):
    return make_classification(n_samples=n_samples, n_features=2,
                               n_informative=2, n_redundant=0, n_repeated=0,
                               n_classes=n_classes,
                               n_clusters_per_class=n_clusters,
                               weights=list(weights),
                               class_sep=class_sep, random_state=0)

In [66]:
X1, y1 = create_dataset(n_samples=5000, weights=(0.01, 0.05, 0.94),
                      class_sep=0.8)

In [67]:
X = X1
y = y1

In [59]:
clf = LinearSVC().fit(X, y)

In [70]:
xx

array([[-3.25680222, -3.23680222, -3.21680222, ...,  4.20319778,
         4.22319778,  4.24319778],
       [-3.25680222, -3.23680222, -3.21680222, ...,  4.20319778,
         4.22319778,  4.24319778],
       [-3.25680222, -3.23680222, -3.21680222, ...,  4.20319778,
         4.22319778,  4.24319778],
       ...,
       [-3.25680222, -3.23680222, -3.21680222, ...,  4.20319778,
         4.22319778,  4.24319778],
       [-3.25680222, -3.23680222, -3.21680222, ...,  4.20319778,
         4.22319778,  4.24319778],
       [-3.25680222, -3.23680222, -3.21680222, ...,  4.20319778,
         4.22319778,  4.24319778]])

In [71]:
yy

array([[-6.58515661, -6.58515661, -6.58515661, ..., -6.58515661,
        -6.58515661, -6.58515661],
       [-6.56515661, -6.56515661, -6.56515661, ..., -6.56515661,
        -6.56515661, -6.56515661],
       [-6.54515661, -6.54515661, -6.54515661, ..., -6.54515661,
        -6.54515661, -6.54515661],
       ...,
       [ 4.39484339,  4.39484339,  4.39484339, ...,  4.39484339,
         4.39484339,  4.39484339],
       [ 4.41484339,  4.41484339,  4.41484339, ...,  4.41484339,
         4.41484339,  4.41484339],
       [ 4.43484339,  4.43484339,  4.43484339, ...,  4.43484339,
         4.43484339,  4.43484339]])

In [72]:
np.c_[xx.ravel(), yy.ravel()]

array([[-3.25680222, -6.58515661],
       [-3.23680222, -6.58515661],
       [-3.21680222, -6.58515661],
       ...,
       [ 4.20319778,  4.43484339],
       [ 4.22319778,  4.43484339],
       [ 4.24319778,  4.43484339]])
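
The three dumps above are easier to read on a tiny grid. The sketch below uses made-up coordinates (three x values, two y values) to show how `np.meshgrid` and `np.c_` turn two axes into one row per grid point, which is the input shape `clf.predict` expects.

```python
import numpy as np

# Hypothetical small axes instead of the real feature ranges
xs = np.arange(0.0, 3.0, 1.0)    # 0, 1, 2
ys = np.arange(10.0, 12.0, 1.0)  # 10, 11

# xx repeats the x axis along rows, yy repeats the y axis along columns
xx, yy = np.meshgrid(xs, ys)

# Flatten both and pair them column-wise: one (x, y) row per grid point
grid = np.c_[xx.ravel(), yy.ravel()]

print(xx.shape, yy.shape, grid.shape)
print(grid)
```

Reshaping the predictions back with `Z.reshape(xx.shape)` then restores the 2-D layout that `contourf` needs.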

In [69]:
# First half of plot_decision_function, run at top level to inspect xx and yy
plot_step = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

In [68]:
# Second half of plot_decision_function, run at top level on its own axes
fig, ax = plt.subplots(figsize=(8, 7))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, alpha=0.4)
ax.scatter(X[:, 0], X[:, 1], alpha=0.8, c=y, edgecolor='k')

In [61]:
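# Refit on the current two-feature data so the classifier matches the plotting grid
clf = LinearSVC().fit(X, y)
# ax1 is an array of two axes; draw on the second one
plot_decision_function(X, y, clf, ax1[1])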

In [15]:
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

from imblearn.datasets import make_imbalance
from imblearn.under_sampling import NearMiss
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced

RANDOM_STATE = 42

# Load iris and subsample class 0 to create an imbalanced dataset
iris = load_iris()
X, y = make_imbalance(iris.data, iris.target,
                      sampling_strategy={0: 25, 1: 50, 2: 50},
                      random_state=RANDOM_STATE)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=RANDOM_STATE)

print('Training target statistics: {}'.format(Counter(y_train)))
print('Testing target statistics: {}'.format(Counter(y_test)))

# Create a pipeline: undersample with NearMiss (version 2), then fit a linear SVM
pipeline = make_pipeline(NearMiss(version=2),
                         LinearSVC(random_state=RANDOM_STATE))
pipeline.fit(X_train, y_train)

# Classify and report the results
print(classification_report_imbalanced(y_test, pipeline.predict(X_test)))

Training target statistics: Counter({1: 38, 2: 38, 0: 17})
Testing target statistics: Counter({1: 12, 2: 12, 0: 8})
                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      1.00      1.00      1.00      1.00      1.00         8
          1       1.00      0.83      1.00      0.91      0.91      0.82        12
          2       0.86      1.00      0.90      0.92      0.95      0.91        12

avg / total       0.95      0.94      0.96      0.94      0.95      0.90        32
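
As a sanity check on the report columns, the `geo` values above are consistent with the geometric mean of recall (`rec`) and specificity (`spe`). The sketch below assumes that definition and recomputes `geo` from the rounded `rec` and `spe` values printed for classes 1 and 2; the helper name `geometric_mean` is introduced here for illustration.

```python
import math


def geometric_mean(recall, specificity):
    """Geometric mean of recall and specificity for one class."""
    return math.sqrt(recall * specificity)


# (rec, spe) pairs copied from the report above for classes 1 and 2
report_rows = {1: (0.83, 1.00), 2: (1.00, 0.90)}

geo = {label: round(geometric_mean(rec, spe), 2)
       for label, (rec, spe) in report_rows.items()}
print(geo)
```

A class only scores well on this measure if it is both detected (high recall) and not over-predicted (high specificity), which makes it more informative than accuracy on imbalanced data.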



