Problem: You have a target vector with highly imbalanced classes.

Solution: Collect more data. If that isn’t possible, change the metrics used to evaluate your
model. If that doesn’t work, consider using a model’s built-in class weight
parameters (if available), downsampling, or upsampling. We cover evaluation
metrics in a later chapter, so for now let us focus on class weight parameters,
downsampling, and upsampling.
To demonstrate our solutions, we need to create some data with imbalanced
classes. Fisher’s Iris dataset contains three balanced classes of 50 observations,
each indicating the species of flower (Iris setosa, Iris virginica, and Iris
versicolor). To unbalance the dataset, we remove 40 of the 50 Iris setosa
observations and then merge the Iris virginica and Iris versicolor classes. The
end result is a binary target vector indicating whether an observation is an Iris
setosa flower: 10 observations of Iris setosa (class 0) and 100 observations of
not Iris setosa (class 1):

In [19]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris


In [20]:
# Load iris data
iris = load_iris()
# Create feature matrix
features = iris.data
# Create target vector
target = iris.target


In [21]:
# Remove first 40 observations
features = features[40:,:]
target = target[40:]

In [22]:
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [23]:
# Create binary target vector indicating if class 0
target = np.where((target == 0), 0, 1)


In [24]:
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
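
One quick way to confirm the imbalance is to count the observations per class. The sketch below rebuilds the binary target from scratch so that it runs on its own; the name class_counts is ours, not part of the recipe. It should report 10 observations of class 0 and 100 of class 1.

```python
import numpy as np
from sklearn.datasets import load_iris

# Rebuild the imbalanced binary target used in this recipe
iris = load_iris()
target = iris.target[40:]
target = np.where(target == 0, 0, 1)

# Count observations per class (array index = class label)
class_counts = np.bincount(target)
print(class_counts)
```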

Many algorithms in scikit-learn offer a parameter to weight classes during
training to counteract the effect of their imbalance. While we have not covered it
yet, RandomForestClassifier is a popular classification algorithm and
includes a class_weight parameter. You can pass an argument specifying the
desired class weights explicitly:

In [25]:
# Create weights
weights = {0: 0.9, 1: 0.1}

In [26]:
RandomForestClassifier(class_weight=weights)

Or you can pass "balanced", which automatically creates weights inversely
proportional to class frequencies:

In [27]:
# Create a random forest classifier with balanced class weights
RandomForestClassifier(class_weight="balanced")
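
To see roughly what weights "balanced" implies for a 10-versus-100 split, we can compute them directly. This is a minimal sketch that assumes scikit-learn's compute_class_weight helper and its documented heuristic, n_samples / (n_classes * np.bincount(y)); the toy target and the variable names are ours.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced target: 10 observations of class 0, 100 of class 1
target = np.array([0] * 10 + [1] * 100)

# Weights produced by the "balanced" heuristic
balanced_weights = compute_class_weight(
    "balanced", classes=np.array([0, 1]), y=target
)

# The same values by hand: n_samples / (n_classes * count per class)
manual_weights = len(target) / (2 * np.bincount(target))

print(balanced_weights, manual_weights)
```

Under these assumptions the minority class receives a weight ten times larger than the majority class, mirroring the 1:10 class ratio.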

Alternatively, we can downsample the majority class or upsample the minority
class. In downsampling, we randomly sample without replacement from the
majority class (i.e., the class with more observations) to create a new subset of
observations equal in size to the minority class. For example, if the minority
class has 10 observations, we will randomly select 10 observations from the
majority class and use those 20 observations as our data. Here we do exactly that
using our unbalanced Iris data:

In [33]:
# Indices of each class's observations
# (np.where returns a tuple, so take its first element)
i_class0 = np.where(target == 0)[0]
i_class1 = np.where(target == 1)[0]
i_class0


array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

In [34]:
i_class1

array([ 10,  11,  12,  13,  14,  15,  16,  17,  18,  19,  20,  21,  22,
        23,  24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,
        36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,
        49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,
        62,  63,  64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,
        75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,
        88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100,
       101, 102, 103, 104, 105, 106, 107, 108, 109], dtype=int64)

In [37]:
# Number of observations in each class
n_class0 = len(i_class0)
n_class1 = len(i_class1)
# For every observation of class 0, randomly sample
# from class 1 without replacement
i_class1_downsampled = np.random.choice(i_class1, size=n_class0, replace=False)




In [None]:
# Join together class 0's target vector with the
# downsampled class 1's target vector
np.hstack((target[i_class0], target[i_class1_downsampled]))

# Join together class 0's feature matrix with the
# downsampled class 1's feature matrix
np.vstack((features[i_class0,:], features[i_class1_downsampled,:]))[0:5]
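
For reference, the downsampling steps above can be collected into one self-contained sketch. It assumes NumPy's default_rng generator with an arbitrary seed for reproducibility; the names rng, i_balanced, features_down, and target_down are ours.

```python
import numpy as np
from sklearn.datasets import load_iris

# Rebuild the imbalanced data: 10 observations of class 0, 100 of class 1
iris = load_iris()
features = iris.data[40:, :]
target = np.where(iris.target[40:] == 0, 0, 1)

# Seeded generator so the sample is reproducible (seed chosen arbitrarily)
rng = np.random.default_rng(0)

# Indices of each class's observations
i_class0 = np.where(target == 0)[0]
i_class1 = np.where(target == 1)[0]

# Sample as many majority indices as there are minority observations,
# without replacement
i_class1_downsampled = rng.choice(i_class1, size=len(i_class0), replace=False)

# Combine the indices and subset both arrays the same way
i_balanced = np.concatenate((i_class0, i_class1_downsampled))
features_down = features[i_balanced]
target_down = target[i_balanced]

print(features_down.shape, np.bincount(target_down))
```

Indexing both arrays with the same index vector keeps each feature row aligned with its label.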

Our other option is to upsample the minority class. In upsampling, for every
observation in the majority class, we randomly select an observation from the
minority class with replacement. The end result is the same number of
observations from the minority and majority classes. Upsampling is
implemented very similarly to downsampling, just in reverse:

In [31]:
# For every observation in class 1, randomly sample from class 0 with replacement
i_class0_upsampled = np.random.choice(i_class0, size=n_class1, replace=True)
# Join together class 0's upsampled target vector with class 1's target vector
np.concatenate((target[i_class0_upsampled], target[i_class1]))

# Join together class 0's upsampled feature matrix with class 1's feature matrix
np.vstack((features[i_class0_upsampled,:], features[i_class1,:]))[0:5]

array([[4.5, 2.3, 1.3, 0.3],
       [5.3, 3.7, 1.5, 0.2],
       [5. , 3.5, 1.6, 0.6],
       [5. , 3.3, 1.4, 0.2],
       [4.8, 3. , 1.4, 0.3]])
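
As an alternative to indexing by hand, scikit-learn's resample utility can draw the upsampled minority rows. This is a sketch of a swapped-in technique, not the recipe's own method; the boolean mask and the output names are ours, and random_state=0 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils import resample

# Rebuild the imbalanced data: 10 observations of class 0, 100 of class 1
iris = load_iris()
features = iris.data[40:, :]
target = np.where(iris.target[40:] == 0, 0, 1)

# Boolean mask selecting the minority class (class 0)
minority = target == 0
n_majority = int((~minority).sum())

# Sample minority rows with replacement until they match the majority count;
# passing both arrays keeps features and labels aligned
features_min_up, target_min_up = resample(
    features[minority],
    target[minority],
    replace=True,
    n_samples=n_majority,
    random_state=0,
)

# Stack the upsampled minority class on top of the untouched majority class
features_up = np.vstack((features_min_up, features[~minority]))
target_up = np.concatenate((target_min_up, target[~minority]))

print(features_up.shape, np.bincount(target_up))
```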

The first strategy is simply to collect more observations, especially
observations from the minority class. However, this is often not possible, so
we have to resort to other options.
A second strategy is to use a model evaluation metric better suited to imbalanced
classes. Accuracy is often used as a metric for evaluating the performance of a
model, but when imbalanced classes are present, accuracy can be ill-suited. For
example, if only 0.5% of observations have some rare cancer, then even a naive
model that predicts nobody has cancer will be 99.5% accurate. Clearly this is not
ideal. Some better metrics we discuss in later chapters are confusion matrices,
precision, recall, F1 scores, and ROC curves.
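
The cancer example can be reproduced with a few lines of code. The data below is hypothetical (1,000 people, 5 of whom have the disease), and the sketch assumes scikit-learn's accuracy_score and recall_score functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical data: 1,000 people, 5 of whom (0.5%) have the disease
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1

# Naive model: predict that nobody has the disease
y_pred = np.zeros(1000, dtype=int)

# Accuracy looks excellent, but recall shows that no case was found
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(accuracy, recall)
```

Accuracy comes out at 0.995 while recall is 0.0, because the model never identifies a single true case.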
A third strategy is to use the class weighting parameters included in
implementations of some models. This allows us to have the algorithm adjust for
imbalanced classes. Fortunately, many scikit-learn classifiers have a
class_weight parameter, making it a good option.
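
As a minimal sketch of this third strategy, the following fits a RandomForestClassifier with class_weight="balanced" on the imbalanced Iris data from this recipe. The random_state value and the variable names model and predictions are our assumptions, and predicting on the training data here only demonstrates the API, not model quality.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Rebuild the imbalanced data: 10 observations of class 0, 100 of class 1
iris = load_iris()
features = iris.data[40:, :]
target = np.where(iris.target[40:] == 0, 0, 1)

# Create the classifier with weights inversely proportional to class frequency
model = RandomForestClassifier(class_weight="balanced", random_state=0)

# Fit on the imbalanced data; no resampling is needed
model.fit(features, target)

# Predict labels for the same observations (API demonstration only)
predictions = model.predict(features)
print(model.classes_, predictions.shape)
```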
The fourth and fifth strategies are related: downsampling and upsampling. In
downsampling we create a random subset of the majority class of equal size to
the minority class. In upsampling we repeatedly sample with replacement from
the minority class to make it equal in size to the majority class. The decision
between using downsampling and upsampling is context-specific, and in general
we should try both to see which produces better results.