# Imbalanced Classes

If you are classifying data and the classes are not relatively balanced in size, the bias toward the more common classes can carry over into your model. For example, if you have 1 positive case and 99 negative cases, you can get 99% accuracy by simply classifying everything as negative. There are various options for dealing with _imbalanced classes_.
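The 1-versus-99 example can be worked through in a few lines of plain Python. This is only an illustrative sketch; the "model" is a hypothetical one that always predicts the negative class.

```python
# 1 positive case and 99 negative cases, as in the example above
y_true = [1] + [0] * 99

# A hypothetical "model" that ignores its input and always predicts negative
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the positive class: fraction of real positives that were found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy looks excellent even though the single positive case is never detected.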

## Use a different metric

One hint is to use a measure other than accuracy when evaluating and tuning models (AUC is a good choice). Precision and recall are also better options when the target classes differ in size. However, there are other options to consider as well.
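As a rough sketch, these metrics can be computed side by side with scikit-learn. The synthetic dataset (about 5% positives) and the logistic regression model are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: roughly 95% negatives and 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard class labels
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "auc": roc_auc_score(y_test, y_score),   # AUC uses scores, not labels
}
for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")
```

Note that AUC is computed from the predicted probabilities, while precision and recall are computed from the predicted labels.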

## Tree-based Algorithms and Ensembles

Tree-based models may perform better depending on the distribution of the smaller class. If the minority samples tend to be clustered, they can be classified more easily.

Ensemble methods can further aid in pulling out the minority classes. Bagging and boosting are options found in tree-based models like random forest and Extreme Gradient Boosting (XGBoost).
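A minimal sketch comparing a bagging ensemble and a boosting ensemble on imbalanced data follows. It assumes synthetic data and uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so that the example needs no extra dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives (an assumption for illustration)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

models = {
    # Random forest: bagging of decision trees
    "random forest (bagging)": RandomForestClassifier(
        n_estimators=100, random_state=42),
    # Gradient boosting: trees fit sequentially on the previous errors
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}

auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, scores)
    print(f"{name}: AUC = {auc_scores[name]:.3f}")
```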

## Penalize Models

Many scikit-learn classification models support the `class_weight` parameter. Setting this to `'balanced'` weights each class inversely proportional to its frequency, which penalizes mistakes on minority classes more heavily and incentivizes the model to classify them correctly. Alternatively, you can grid search over weight options by passing in a dictionary mapping class to weight (give higher weight to smaller classes).
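The sketch below shows what the balanced weights look like for the 99-to-1 example and how a grid search over `class_weight` might be set up. The candidate weight dictionaries and the synthetic data are hypothetical values picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.utils.class_weight import compute_class_weight

# Weights that class_weight='balanced' would use for 99 negatives, 1 positive:
# n_samples / (n_classes * count_of_class)
y_toy = np.array([0] * 99 + [1])
classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
balanced = dict(zip(classes.tolist(), weights.tolist()))
print(balanced)  # class 0 gets about 0.505, class 1 gets 50.0

# Grid search over a few hypothetical weightings, scored by AUC
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
param_grid = {"class_weight": ["balanced", {0: 1, 1: 5}, {0: 1, 1: 10}]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_)
```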

The XGBoost library has the `max_delta_step` parameter, which can be set from 1 to 10 to make the update step more conservative. It also has the `scale_pos_weight` parameter that sets the ratio of negative to positive samples (for binary classes). Also, the `eval_metric` should be set to `'auc'` rather than the default value of `'error'` for classification.
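As a hedged sketch, the three parameters could be wired together as below. The toy labels and feature matrix are made up, `max_delta_step=1` is just one value from the suggested range, and the `xgboost` import is guarded so the ratio calculation still runs when the library is not installed.

```python
import numpy as np

# Toy binary labels: 90 negatives, 10 positives (made-up data)
y = np.array([0] * 90 + [1] * 10)

# scale_pos_weight: ratio of negative to positive samples
n_neg = int((y == 0).sum())
n_pos = int((y == 1).sum())
scale_pos_weight = n_neg / n_pos  # 9.0 for this toy data

params = {
    "max_delta_step": 1,                   # more conservative update step
    "scale_pos_weight": scale_pos_weight,  # upweight the positive class
    "eval_metric": "auc",                  # evaluate with AUC, not error rate
}

try:
    from xgboost import XGBClassifier
except ImportError:  # xgboost is optional for this sketch
    XGBClassifier = None

if XGBClassifier is not None:
    rng = np.random.default_rng(42)
    # Made-up features: positives are shifted by one unit in every column
    X = rng.normal(size=(len(y), 3)) + y[:, None]
    clf = XGBClassifier(n_estimators=20, **params)
    clf.fit(X, y)
    print(clf.predict_proba(X[:3]))
```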

The KNN model has a `weights` parameter that can give more influence to neighbors that are closer. If the minority class samples are close together, setting this parameter to `'distance'` may improve performance.
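One way to check whether distance weighting helps is to compare minority-class recall for both settings with cross-validation. This sketch assumes synthetic data and `n_neighbors=5`; whether `'distance'` actually wins depends on the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with roughly 10% positives (an assumption for illustration)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=42)

recall_by_weighting = {}
for weighting in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=5, weights=weighting)
    # scoring="recall" measures recall on the positive (minority) class
    scores = cross_val_score(knn, X, y, cv=5, scoring="recall")
    recall_by_weighting[weighting] = scores.mean()
    print(f"weights={weighting!r}: mean recall = {scores.mean():.3f}")
```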

## Upsampling Minority

You can upsample the minority class in a couple of ways. Here is an sklearn implementation:

In [1]:
from sklearn.utils import resample

In [2]:
import pandas as pd 
import numpy as np

In [4]:
df = pd.read_excel('df.xls')

In [7]:
df.drop(columns='Unnamed: 0', inplace = True)

In [8]:
mask = df.survived == 1
surv_df = df[mask]
# Use ~ to invert a boolean mask; unary minus fails on boolean Series in recent pandas
death_df = df[~mask]

In [9]:
df_upsample = resample(surv_df, replace=True, n_samples=len(death_df), random_state=42)

In [10]:
df2 = pd.concat([death_df, df_upsample])

In [11]:
df2.survived.value_counts()

0    809
1    809
Name: survived, dtype: int64

We can also use the imbalanced-learn library to randomly sample with replacement: 

In [12]:
#! pip install imblearn

In [13]:
from imblearn.over_sampling import RandomOverSampler

In [14]:
ros = RandomOverSampler(random_state = 42)

In [15]:
X = pd.read_excel('X.xls')
y = pd.read_excel('y.xls')
X.drop(columns = 'Unnamed: 0', inplace = True)
y.drop(columns = 'Unnamed: 0', inplace = True)

In [27]:
X_ros, y_ros = ros.fit_resample(X,y)
y_ros.value_counts()

survived
0           809
1           809
dtype: int64

## Generate Minority Data

The imbalanced-learn library can also generate new samples of minority classes with both the Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic (ADASYN) sampling algorithms. SMOTE works by taking a minority sample, choosing one of its k nearest minority-class neighbors, drawing a line between the two, and choosing a point along that line. ADASYN is similar to SMOTE, but generates more samples from those that are harder to learn. The classes in imbalanced-learn are named `over_sampling.SMOTE` and `over_sampling.ADASYN`.
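To make the interpolation idea concrete, here is a from-scratch sketch of the SMOTE step using NumPy and scikit-learn's `NearestNeighbors`. It is not imbalanced-learn's implementation; the function name `smote_sketch` and its parameters are hypothetical, and it assumes the minority rows contain no exact duplicates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Ask for k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    # Pick a random starting sample and one of its k true neighbors
    base = rng.integers(0, len(X_minority), size=n_new)
    pick = rng.integers(1, k + 1, size=n_new)  # column 0 is the point itself
    neighbor = neighbor_idx[base, pick]

    # Choose a random position along the line from the sample to its neighbor
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])


# Made-up minority samples: 20 points in 2 dimensions
rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=30)
print(X_new.shape)  # (30, 2)
```

Because every synthetic point lies on a segment between two real minority samples, the new data never leaves the region the minority class already occupies.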

## Downsampling Majority

Another method to balance classes is to downsample majority classes. Here is an sklearn example:

In [28]:
from sklearn.utils import resample

In [29]:
mask = df.survived == 1
surv_df = df[mask]
# Use ~ to invert a boolean mask; unary minus fails on boolean Series in recent pandas
death_df = df[~mask]
df_downsample = resample(death_df, replace=False, n_samples=len(surv_df), random_state=42)

In [30]:
df3 = pd.concat([surv_df, df_downsample])

In [34]:
df3.survived.value_counts()

1    500
0    500
Name: survived, dtype: int64

__TIP__ : Do not use replacement when downsampling.

The imbalanced-learn library also implements various downsampling algorithms: 

* `ClusterCentroids` -> This class uses K-means to synthesize data with the centroids
* `RandomUnderSampler` -> This class randomly selects samples
* `NearMiss` -> This class uses nearest neighbors to downsample
* `TomekLinks` -> This class downsamples by removing samples that are close to each other
* `EditedNearestNeighbours` -> This class removes samples whose class disagrees with the majority of their nearest neighbors, or with any of them, depending on the setting.
* `RepeatedEditedNearestNeighbours` -> This class repeatedly calls `EditedNearestNeighbours`
* `AllKNN` -> This class is similar but increases the number of nearest neighbors during the iteration of downsampling.
* `CondensedNearestNeighbour` -> This class picks one sample of the class to be downsampled, then iterates through the other samples of the class, and if KNN misclassifies a sample, it adds that sample.
* `OneSidedSelection` -> This class removes noisy samples. 
* `NeighbourhoodCleaningRule` -> This class uses `EditedNearestNeighbours` results and applies KNN to it. 
* `InstanceHardnessThreshold` -> This class trains a model, then removes samples with low probabilities.

All of these classes support the `.fit_resample` method.