# 5. Imbalanced data

Imbalanced data is a problem that is present in almost any dataset.
In some datasets, the problem is barely noticeable; in others, the imbalance is the only thing you can see.
In either case, re-balancing the data can improve results.

In this notebook, we explore different techniques to undersample or oversample imbalanced datasets.

To re-balance datasets, we won't use `scikit-learn`, but `imbalanced-learn`, another Python package based on `scikit-learn`.
You can install `imbalanced-learn` either through Conda or by running the following cell.

In [1]:
%pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


As usual, we start by importing some of the packages we'll use throughout this notebook.

In [2]:
import imblearn # the imbalanced-learn package
import pandas as pd

## Undersampling

Undersampling is only really feasible in very large datasets.
To save some processing time, we focus on a relatively small dataset that has already been pre-processed for us: part of the Enron spam email dataset.

The next cell loads the dataset and removes the `Email No.` column, which is a unique identifier.
Finally, the cell counts the number of emails labelled 0 (not spam) and 1 (spam).
As you can see, the dataset has a large imbalance, with around 70% being non-spam emails.

> Note: When dropping the `Email No.` column, we specify `axis=1`.
> Axis 0 refers to the rows, and axis 1 to the columns.
> By specifying `axis=1`, we're telling pandas to look for `Email No.` among the column names.

In [3]:
df = pd.read_csv('data/emails.csv') # read the dataset from file and into a pandas DataFrame
df = df.drop("Email No.", axis=1) # remove the `Email No.` column since it's a unique identifier
df.groupby(by='Prediction').size() # group the emails by their prediction and count the number of rows

Prediction
0    3672
1    1500
dtype: int64
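
To put a number on the imbalance, the counts printed above can be turned into proportions. The following is a small sketch that re-enters the two counts by hand instead of reading the CSV file; the variable names are illustrative.

```python
import pandas as pd

# class counts copied from the output above (0 = not spam, 1 = spam)
counts = pd.Series({0: 3672, 1: 1500}, name="Prediction")

proportions = counts / counts.sum()             # share of each class
imbalance_ratio = counts.max() / counts.min()   # majority-to-minority ratio

print(proportions.round(3))
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

Roughly 71% of the emails are not spam, which means there are about 2.4 non-spam emails for every spam email.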

`imbalanced-learn` is based on `scikit-learn`, and it shows.
As with `scikit-learn`, you'll normally follow three steps to undersample or oversample a dataset:

1. Import the module and class
2. Instantiate the class with any parameters
3. Fit and resample (instead of fit and predict or fit and transform) the dataset

For undersampling, we'll use a random undersampler, aptly called [`RandomUnderSampler`](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html).
This undersampler is very simple and randomly removes extra instances from the majority class.
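
To make the idea concrete, random undersampling can be sketched by hand with pandas alone. This is only an illustration on a made-up toy DataFrame (`toy`, `feature` and `label` are assumed names), not how `imbalanced-learn` implements it internally.

```python
import pandas as pd

# toy dataset: 6 rows of class 0 (majority) and 2 rows of class 1 (minority)
toy = pd.DataFrame({
    "feature": range(8),
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})

# size of the smallest class: every class is cut down to this many rows
n_minority = toy["label"].value_counts().min()

# draw n_minority rows at random from each class, without replacement
toy_under = toy.groupby("label").sample(n=n_minority, random_state=0)

print(toy_under["label"].value_counts().sort_index())
```

Every row that survives is an original row; the extra majority rows are simply discarded.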

When undersampling or oversampling with `imbalanced-learn`, we have to provide the features and the label separately.
The label allows the class to identify the majority class (when undersampling) or the minority class (when oversampling) and tweak the dataset accordingly.

In [4]:
# step 1: import the random undersampler
from imblearn.under_sampling import RandomUnderSampler

# step 2: create the undersampler
undersample = RandomUnderSampler()

# step 3: fit and resample the data
features = df.columns[ ~df.columns.isin([ 'Prediction' ]) ] # the words, or the features in their emails, but excluding the label
X, y = df.loc[:, features], df.Prediction # extract the features (X) and the labels (y)
X_under, y_under = undersample.fit_resample(X, y) # finally, resample!

The `fit_resample` method returns two objects: `X_under` and `y_under`.
`X_under` holds the resampled features, with more than half of the original majority-class rows removed.
`y_under` holds the corresponding labels.

The next code cell loads this re-balanced data into a new DataFrame.

In [5]:
# create a new DataFrame with the column names taken from above and the re-balanced data
_df = pd.DataFrame(columns=features, data=X_under)
_df[ 'Prediction' ] = y_under # copy back the labels
_df.groupby(by='Prediction').size() # group the emails by their prediction and count the number of rows

Prediction
0    1500
1    1500
dtype: int64

As you can see, the dataset is now balanced, with 1500 emails marked as not spam and 1500 marked as spam.

## Oversampling

The process for oversampling is almost identical.
This time, we use the Titanic dataset, where undersampling is not an option because the dataset is so small.

In [6]:
df = pd.read_csv('data/titanic-train.csv') # read the Titanic dataset into a DataFrame
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


The dataset is again imbalanced: only about 38% of the Titanic passengers in it survived.
Since the dataset is so small, undersampling would not make sense, as we would have to remove more than 200 passengers who did not survive, or more than 20% of the dataset.

In [7]:
df.groupby(by='Survived').size() # group the passengers by their survival flag and count the rows in each group

Survived
0    549
1    342
dtype: int64
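
The counts above make the cost of undersampling concrete. The short calculation below re-enters them by hand (the variable names are illustrative) and works out how many rows would have to be discarded to balance the classes.

```python
# class counts copied from the output above
not_survived, survived = 549, 342
total = not_survived + survived

# rows that undersampling would have to discard to balance the classes
removed = not_survived - survived

print(removed)                      # 207
print(f"{removed / total:.1%}")     # 23.2%
print(f"{survived / total:.1%}")    # 38.4%
```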

Oversampling with `imbalanced-learn` is almost identical to undersampling.
There are three steps again, this time importing and instantiating the [`RandomOverSampler`](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html) instead of the `RandomUnderSampler` class.

Again, note that we pass the features and the label separately so that the `RandomOverSampler` knows which class to oversample.
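
Conceptually, random oversampling duplicates existing minority rows until the classes are the same size. The sketch below reproduces that idea with pandas alone on a made-up toy DataFrame (`toy`, `feature` and `label` are assumed names); it is an illustration of the technique, not `imbalanced-learn`'s internal code.

```python
import pandas as pd

# toy dataset: 6 rows of class 0 (majority) and 2 rows of class 1 (minority)
toy = pd.DataFrame({
    "feature": range(8),
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})

counts = toy["label"].value_counts()
minority_label = counts.idxmin()
deficit = counts.max() - counts.min()   # how many extra minority rows are needed

# draw existing minority rows at random, with replacement, to fill the gap
minority_rows = toy[toy["label"] == minority_label]
extra = minority_rows.sample(n=deficit, replace=True, random_state=0)

toy_over = pd.concat([toy, extra], ignore_index=True)
print(toy_over["label"].value_counts().sort_index())
```

Because the extra rows are exact copies, the resampled data contains duplicates, which is worth keeping in mind when splitting it into training and test sets.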

In [8]:
# step 1: import the random oversampler
from imblearn.over_sampling import RandomOverSampler

# step 2: create the oversampler
oversample = RandomOverSampler()

# step 3: fit and resample the data
X, y = df.loc[:,  ~df.columns.isin([ 'Survived' ]) ], df.Survived # oversample all features except the label
X_over, y_over = oversample.fit_resample(X, y) # oversample the features and the label

Creating the DataFrame anew, note how the minority class (when the `Survived` label is 1) now has as many rows as the majority class.

In [9]:
# create a new DataFrame with the column names taken from the original DataFrame and the re-balanced data
_df = pd.DataFrame(columns=df.columns, data=X_over)
_df[ 'Survived' ] = y_over # copy back the labels
_df.groupby(by='Survived').size() # group the passengers by their survival flag and count the rows in each group

Survived
0    549
1    549
dtype: int64

## SMOTE

Oversampling with SMOTE is a little bit different.
First, we have to decide what version of SMOTE is most appropriate for our data.
`imbalanced-learn` provides different implementations, including:

- [`SMOTE`](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html): the basic implementation of SMOTE, which assumes that all the features are continuous
- [`SMOTEN`](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTEN.html): an implementation of SMOTE that's used when the data is entirely categorical
- [`SMOTENC`](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html): an implementation of SMOTE that's used when some of the data, but not all of it, is categorical
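
For continuous features, the core idea behind SMOTE is interpolation: a synthetic sample is placed somewhere on the line between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that single step using NumPy; the helper `smote_like_sample`, the toy data and the choice of `k=2` are assumptions made for illustration, and this is not `imbalanced-learn`'s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy minority-class samples with two continuous features
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def smote_like_sample(X, k=2):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(X))                       # pick a random minority sample
    distances = np.linalg.norm(X - X[i], axis=1)   # Euclidean distance to every sample
    neighbours = np.argsort(distances)[1:k + 1]    # k closest samples, skipping the sample itself
    j = rng.choice(neighbours)                     # pick one neighbour at random
    gap = rng.random()                             # interpolation factor in [0, 1)
    return X[i] + gap * (X[j] - X[i]), X[i], X[j]

synthetic, base, neighbour = smote_like_sample(minority)
print(synthetic)
```

The synthetic point is new, so unlike random oversampling it is not an exact copy of an existing row.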

The Titanic dataset is a mix of continuous and categorical data, so we can only use SMOTENC.
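
The choice between the three variants can be summarised as a simple rule. The helper below is hypothetical (it is not part of `imbalanced-learn`) and assumes you already know which columns are categorical; it only returns the name of the matching class.

```python
import pandas as pd

def suggest_smote_variant(X, categorical_columns):
    """Hypothetical helper: name the SMOTE variant that matches the mix of feature types."""
    n_categorical = sum(column in categorical_columns for column in X.columns)
    if n_categorical == 0:
        return "SMOTE"      # all features are continuous
    if n_categorical == len(X.columns):
        return "SMOTEN"     # all features are categorical
    return "SMOTENC"        # a mix of both

toy = pd.DataFrame({"Age": [22.0, 38.0], "Fare": [7.25, 71.28], "Sex": ["male", "female"]})
print(suggest_smote_variant(toy, categorical_columns=["Sex"]))  # SMOTENC
```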

SMOTE and its variations complain about missing values.
The next code cell prints a summary of the dataset, which reveals the features with missing data: `Age`, `Cabin` and `Embarked`.
We'll ignore the `PassengerId`, `Name`, `Ticket` and `Cabin` features, which leaves `Age` and `Embarked` to deal with.

In [10]:
_df = df.copy() # create a copy of the dataframe so we don't overwrite the original data
_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
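
If you only want the number of missing values per column, `isna().sum()` gives a more compact view than `info()`. The sketch below uses a small made-up DataFrame (`toy`) rather than the Titanic file, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# toy frame that mimics the pattern above: some columns complete, others with gaps
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

missing = toy.isna().sum()          # number of missing values per column
missing = missing[missing > 0]      # keep only the columns that have gaps
print(missing)
```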


In this notebook, we'll take a very simple approach to imputation.
Since there are only 2 missing `Embarked` values, we drop all rows that do not have a value.
We'll fill in the missing `Age` values with the average.
Check out the imputation notebook for more sophisticated techniques.

In [11]:
_df = _df.dropna(subset=[ 'Embarked' ]) # drop all rows with missing values for the `Embarked` feature
_df.Age = _df.Age.fillna(_df.Age.mean()) # fill in missing `Age` values with the average

The next code cell filters out unnecessary columns, and we focus instead on 7 features.
Note that there is no need for feature engineering at this stage: `imbalanced-learn` automatically notices that the `Pclass` feature (the passenger class) is an integer, so it should not take a floating-point value.
Similarly, the `Sex` and `Embarked` columns will only take valid values.
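
Categorical columns stay valid because they are not interpolated: in the SMOTE-NC approach, a synthetic row takes the most frequent category among the nearest neighbours. The sketch below illustrates only that voting step, with a hypothetical helper and hand-picked neighbour values; it is not `imbalanced-learn`'s code.

```python
from collections import Counter

def most_frequent_category(neighbour_values):
    """Hypothetical helper: return the most common category among the neighbours."""
    return Counter(neighbour_values).most_common(1)[0][0]

# `Embarked` values of five hand-picked neighbours of some passenger
neighbour_embarked = ["S", "C", "S", "Q", "S"]
print(most_frequent_category(neighbour_embarked))  # S
```

The result is always one of the values that already appears in the data, so no invalid category can be produced.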

In [12]:
features = [ 'Age', 'Fare', 'Pclass', 'SibSp', 'Parch', 'Sex', 'Embarked' ] # the list of features we want to focus on
_df = _df[ features + [ 'Survived' ] ] # keep only the selected features and the label

Now, finally, we can over-sample using SMOTE.
Since we're using SMOTENC, we need to tell the class which features are categorical upon instantiation.
This parameter is a list of booleans (`True` or `False`) indicating which columns are categorical.
Aside from this extra step, everything is the same.
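
To see what the boolean list looks like for our seven features, the sketch below builds it in plain Python. It also derives the equivalent column positions, which, to the best of our knowledge of the `imbalanced-learn` documentation, `categorical_features` accepts as an alternative; the names `categorical_names`, `mask` and `indices` are illustrative.

```python
features = ["Age", "Fare", "Pclass", "SibSp", "Parch", "Sex", "Embarked"]
categorical_names = {"Sex", "Embarked"}

# boolean mask: one entry per feature, True where the feature is categorical
mask = [feature in categorical_names for feature in features]

# equivalent positions of the categorical features within `features`
indices = [position for position, is_categorical in enumerate(mask) if is_categorical]

print(mask)     # [False, False, False, False, False, True, True]
print(indices)  # [5, 6]
```

The mask must follow the same column order as the features passed to `fit_resample`, otherwise the wrong columns are treated as categorical.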

In [13]:
# step 1: import SMOTENC
from imblearn.over_sampling import SMOTENC

# step 2: create the SMOTENC instance
categorical = [ feature in [ 'Embarked', 'Sex' ] for feature in features ] # mark the categorical features with True
smote_nc = SMOTENC(categorical_features=categorical)

# step 3: fit and resample the data
X, y = _df[ features ], _df.Survived  # oversample all features except the label
X_over, y_over = smote_nc.fit_resample(X, y) # oversample the features and the label

If we insert the data into a new DataFrame, we can see that the number of passengers who survived now matches the number who did not.

In [14]:
# create a new DataFrame with the column names taken from the original DataFrame and the re-balanced data
__df = pd.DataFrame(columns=_df.columns, data=X_over)
__df[ 'Survived' ] = y_over # copy back the labels
__df.groupby(by='Survived').size() # group the passengers by their survival flag and count the rows in each group

Survived
0    549
1    549
dtype: int64

The next code cell shows the last 10 rows created by SMOTENC.
Do you notice something?

In [15]:
__df.tail(10)

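| | Age | Fare | Pclass | SibSp | Parch | Sex | Embarked | Survived |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1088 | 49.864726 | 26.55 | 1 | 0 | 0 | female | S | 1 |
| 1089 | 27.069175 | 53.0712 | 1 | 1 | 0 | female | S | 1 |
| 1090 | 29.642093 | 34.862949 | 1 | 0 | 0 | male | S | 1 |
| 1091 | 54.305373 | 78.314201 | 1 | 1 | 0 | female | C | 1 |
| 1092 | 36.0 | 120.0 | 1 | 1 | 2 | female | C | 1 |
| 1093 | 33.198225 | 53.072952 | 1 | 1 | 0 | female | S | 1 |
| 1094 | 29.642093 | 7.75 | 3 | 0 | 0 | female | Q | 1 |
| 1095 | 41.916801 | 52.5542 | 1 | 1 | 0 | female | S | 1 |
| 1096 | 23.156523 | 14.246915 | 2 | 0 | 0 | female | S | 1 |
| 1097 | 22.32164 | 55.047356 | 1 | 0 | 0 | male | S | 1 |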


## Resources

If you want to learn more about the `imbalanced-learn` library, you can visit the documentation site [here](https://imbalanced-learn.org/stable/).