```
From: https://github.com/ksatola
Version: 0.0.1

TODOs
1. SMOTE, ADASYN
2. imbalanced-learn library

```

# Imbalanced Classes
If you are classifying data and the classes are not roughly balanced in size, the bias toward the larger classes can carry over into your model. For example, if you have 1 positive case and 99 negative cases, you can get 99% accuracy simply by classifying everything as negative.
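
The 1-versus-99 example can be checked with a small sketch. It assumes scikit-learn is installed and uses `DummyClassifier` as a stand-in for a model that always predicts the majority class; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 1 positive and 99 negative cases, as in the example above
y = np.array([1] + [0] * 99)
X = np.zeros((100, 1))  # features are irrelevant for this baseline

# Always predict the most frequent class (here: negative)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)  # 99 of 100 labels match
recall = recall_score(y, y_pred)      # the single positive case is missed
print(accuracy, recall)
```

Accuracy looks excellent here even though the model never finds the positive case, which is why recall drops to zero.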

In [2]:
# Connect with underlying Python code
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0, '../src')

In [3]:
from datasets import (
    get_dataset
)

In [4]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

In [5]:
# This is the titanic3 dataset with some extra columns for cabin information
df = get_dataset('titanic3')

## Use a Different Metric (than Accuracy)
One hint is to use a measure other than accuracy (`AUC` is a good choice) for evaluating models. `Precision` and `recall` are also better options when the class sizes are different. However, there are other options to consider as well.
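
As a rough sketch of how these metrics can be computed with scikit-learn, the snippet below trains a logistic regression on synthetic data. The 90/10 class split, the model choice, and all variable names are assumptions made for illustration; they are not part of the Titanic workflow in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 90/10 class split (illustrative only)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)                # hard labels for precision/recall
y_score = model.predict_proba(X_test)[:, 1]   # positive-class scores for AUC

auc = roc_auc_score(y_test, y_score)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
print(f"AUC={auc:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Note that AUC is computed from predicted scores, while precision and recall are computed from predicted labels.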

## Tree-based Algorithms and Ensembles
Tree-based models may perform better, depending on how the minority class is distributed. If the minority samples tend to be clustered, they are easier to classify.

Ensemble methods can further aid in pulling out the minority classes. `Bagging` and `boosting` are options found in tree models like random forests and `Extreme Gradient Boosting (XGBoost)`.

## Penalize Models
Many scikit-learn classification models support the `class_weight` parameter. Setting this to 'balanced' weights each class inversely proportional to its frequency, which penalizes mistakes on minority classes more heavily and incentivizes the model to classify them correctly. Alternatively, you can grid search over `class_weight` options by passing in a dictionary mapping class to weight (give higher weight to smaller classes).
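
A minimal sketch of both forms of `class_weight`, assuming scikit-learn and a toy 90/10 target invented for illustration. `compute_class_weight` is used only to show which weights 'balanced' resolves to.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy target with 90 negatives and 10 positives (illustrative data only)
y = np.array([0] * 90 + [1] * 10)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3)) + y[:, None]  # shift positives so classes differ

# 'balanced' uses n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(weights)

# Let scikit-learn compute the weights ...
clf_balanced = LogisticRegression(class_weight="balanced").fit(X, y)
# ... or pass an explicit mapping from class label to weight
clf_manual = LogisticRegression(class_weight={0: 1, 1: 5}).fit(X, y)
```

With 90 negatives and 10 positives, the formula gives a weight of about 0.56 for the majority class and 5.0 for the minority class.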

The XGBoost library has the `max_delta_step` parameter, which can be set from 1 to 10 to make the update step more conservative. It also has the `scale_pos_weight` parameter, which balances positive and negative weights for binary classes and is typically set to the ratio of negative to positive samples. Also, consider setting `eval_metric` to 'auc' rather than relying on the default for classification ('error' in older releases, 'logloss' since XGBoost 1.3).
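
The parameters above could be assembled as follows. This is only a sketch: the target array mimics the 809/500 split seen later in this notebook, the value chosen for `max_delta_step` is arbitrary, and the snippet deliberately stops at building the parameter dictionary so that it runs without XGBoost installed.

```python
import numpy as np

# Target with 809 negatives and 500 positives (mirrors the Titanic counts)
y = np.array([0] * 809 + [1] * 500)

n_neg = int((y == 0).sum())
n_pos = int((y == 1).sum())

params = {
    "objective": "binary:logistic",
    "scale_pos_weight": n_neg / n_pos,  # ratio of negative to positive samples
    "max_delta_step": 1,                # arbitrary pick from the 1-10 range
    "eval_metric": "auc",
}
print(params)

# In recent XGBoost versions these could be passed to the estimator,
# for example xgboost.XGBClassifier(**params).
```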

The KNN model has a `weights` parameter that can give closer neighbors more influence on the prediction. If the minority class samples are close together, setting this parameter to 'distance' may improve performance.
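
One way to check whether distance weighting helps is to compare both settings with cross-validation. The sketch below assumes scikit-learn and synthetic data with a 90/10 split; whether 'distance' actually wins depends on the dataset, so no result is implied.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data (illustrative only)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

scores = {}
for weights in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
    # Mean ROC AUC over 5 folds for each weighting scheme
    scores[weights] = cross_val_score(knn, X, y, cv=5, scoring="roc_auc").mean()

print(scores)
```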

## Upsampling Minority
You can upsample the minority class in a couple of ways.

In [6]:
from sklearn.utils import resample

mask = df.survived == 1
mask

0        True
1        True
2       False
3       False
4       False
        ...  
1304    False
1305    False
1306    False
1307    False
1308    False
Name: survived, Length: 1309, dtype: bool

In [9]:
df_surv = df[mask]
df_surv.shape

(500, 14)

In [10]:
df_death = df[~mask]
df_death.shape

(809, 14)

In [11]:
# Sample survivors with replacement until they match the number of deaths
df_upsample = resample(df_surv, replace=True, n_samples=len(df_death), random_state=42)
df_upsample.shape

(809, 14)

In [13]:
df2 = pd.concat([df_death, df_upsample])
df2.survived.value_counts()

1    809
0    809
Name: survived, dtype: int64

In [14]:
df2.shape

(1618, 14)

We can also use the imbalanced-learn library to randomly sample with replacement:

In [None]:
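from imblearn.over_sampling import (
    RandomOverSampler,
)

# Separate the features from the target
X = df.drop(columns='survived')
y = df.survived

ros = RandomOverSampler(random_state=42)
# fit_resample replaces fit_sample, which was removed from recent imbalanced-learn releases
X_ros, y_ros = ros.fit_resample(X, y)
pd.Series(y_ros).value_counts()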

## Generate Minority Data
The imbalanced-learn library can also generate new samples of minority classes with both the `Synthetic Minority Over-sampling Technique (SMOTE)` and `Adaptive Synthetic (ADASYN)` sampling approach algorithms.
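
In imbalanced-learn these algorithms are exposed as `SMOTE` and `ADASYN` in `imblearn.over_sampling`, both with a `fit_resample` method. To illustrate the core idea behind SMOTE without depending on that library, here is a simplified sketch that interpolates between a minority sample and one of its nearest minority neighbors. The function `smote_like` and its parameters are hypothetical names for this illustration; it is not imbalanced-learn's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like(X_min, n_new, k=5, random_state=42):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(random_state)
    # Ask for k + 1 neighbors: with distinct points, each row's nearest
    # neighbor is the row itself, which is skipped below
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)            # random minority rows
    neighbor = idx[base, rng.integers(1, k + 1, size=n_new)]  # one of their k neighbors
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[neighbor] - X_min[base])


# Made-up minority samples with two numeric features
X_min = np.random.default_rng(0).normal(size=(20, 2))
X_new = smote_like(X_min, n_new=30)
print(X_new.shape)
```

Because every synthetic row lies on a line segment between two existing minority rows, the new samples stay inside the region the minority class already occupies.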

## Downsampling Majority
Another method to balance classes is to downsample majority classes.

In [15]:
from sklearn.utils import resample

mask = df.survived == 1
mask

0        True
1        True
2       False
3       False
4       False
        ...  
1304    False
1305    False
1306    False
1307    False
1308    False
Name: survived, Length: 1309, dtype: bool

In [16]:
df_surv = df[mask]
df_surv.shape

(500, 14)

In [17]:
df_death = df[~mask]
df_death.shape

(809, 14)

In [18]:
# Don’t use replacement when downsampling
df_downsample = resample(df_death, replace=False, n_samples=len(df_surv), random_state=42)
df_downsample.shape

(500, 14)

In [19]:
df3 = pd.concat([df_surv, df_downsample])
df3.survived.value_counts()

1    500
0    500
Name: survived, dtype: int64

In [20]:
df3.shape

(1000, 14)

## Upsampling Then Downsampling
The imbalanced-learn library implements SMOTEENN and SMOTETomek, which both upsample and then apply downsampling to clean up the data.
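
The cleaning step in SMOTETomek is based on Tomek links: pairs of samples from different classes that are each other's nearest neighbor. The sketch below only illustrates how such pairs can be detected with scikit-learn; `tomek_link_mask` is a hypothetical helper written for this example, not imbalanced-learn's implementation, and the tiny dataset is made up.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def tomek_link_mask(X, y):
    """Return a boolean mask marking rows that belong to a Tomek link."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]  # column 0 is the row itself when points are distinct
    rows = np.arange(len(X))
    mutual = nearest[nearest] == rows  # each is the other's nearest neighbor
    return mutual & (y != y[nearest])  # and the two labels differ


# One feature, five samples: rows 0 and 1 are close but have different labels
X = np.array([[0.0], [0.1], [1.0], [1.1], [5.0]])
y = np.array([0, 1, 0, 0, 1])

mask = tomek_link_mask(X, y)
X_clean, y_clean = X[~mask], y[~mask]  # drop both members of each link
print(mask)
```

Rows 2 and 3 are also mutual nearest neighbors, but they share a label, so they are kept.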