# Effects of Class Imbalance
## CSC2034
### Cameron Trotter (c.trotter2@ncl.ac.uk)


In previous notebooks, we have worked with balanced synthetic data. In reality, the data you work with will often be imbalanced, with some classes containing more examples than others. In this practical, we will examine the effect that class imbalance can have on the performance of the models you create.

### Google Colab Setup

All of the notebooks you will be running in these lab sessions are designed to be run using [Google Colab](https://colab.research.google.com/). For setup instructions, see this repo's README.

To make things work on Colab, we need to clone this repo and then (in another cell, because Colab dictates this...) move into the repo directory.


In [None]:
!git clone https://github.com/Trotts/csc2034-ds-demos.git

In [None]:
import os
os.chdir('csc2034-ds-demos')

### Generate an unbalanced dataset

Whilst we can use `sklearn` to create datasets, these will be balanced. Here we want a synthetic dataset that is unbalanced. To achieve this, we can manually flip a set percentage of data point classes. First, let's recap making a dataset.

Task: Using the previous notebooks as a guide, create a balanced, linearly separable dataset. I have provided hyperparameters for you.

In [None]:
n_samples = 1000
n_classes = 2
n_features = 2
n_clusters_per_class = 2
n_redundant = 0
n_informative = 2
random_state = 5
flip_y = 0.1

data, labels = #...

Now that we have a dataset, let's see how many examples of each class we have.

In [None]:
import numpy as np
from helpers import show_scatterplot

unique, counts = np.unique(labels, return_counts=True)
print(f"Number of class 0 examples: {counts[0]}")
print(f"Number of class 1 examples: {counts[1]}")

show_scatterplot(data, labels, "Balanced synthetic dataset")

Run the checks below. If any return False, take another look at the code you have written before continuing.

In [None]:
print(f"Class balance check, number of class 0 examples: {counts[0] == 511}")
print(f"Class balance check, number of class 1 examples: {counts[1] == 489}")

As you can see, our data is mostly balanced. There are only 22 more examples of class 0 than class 1. To really create a class imbalance, I have produced some code (found in `helpers.py`) which will skew this further.
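The implementation in `helpers.py` is not reproduced here. As a rough idea of how labels might be flipped to skew a dataset, here is a minimal sketch. The function name `flip_to_imbalance`, its parameters, and the choice of class 1 as the majority class are assumptions made for illustration, not the actual helper.

```python
import numpy as np

def flip_to_imbalance(labels, majority_fraction, majority_class=1, seed=5):
    """Hypothetical sketch: relabel random minority examples until
    `majority_class` makes up roughly `majority_fraction` of the labels."""
    rng = np.random.default_rng(seed)
    flipped = np.array(labels, copy=True)  # leave the input untouched

    target = int(round(majority_fraction * len(flipped)))
    current = int(np.sum(flipped == majority_class))
    n_to_flip = max(target - current, 0)

    # Choose which non-majority examples to relabel, without repeats
    minority_idx = np.flatnonzero(flipped != majority_class)
    chosen = rng.choice(minority_idx, size=n_to_flip, replace=False)
    flipped[chosen] = majority_class
    return flipped

toy_labels = np.array([0] * 50 + [1] * 50)
skewed_labels = flip_to_imbalance(toy_labels, majority_fraction=0.9)
print(np.bincount(skewed_labels))  # counts of class 0 and class 1
```

Note that only the labels change in a sketch like this; the data points stay where they are, so some relabelled points will sit inside the other class's region.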

In [None]:
from helpers import create_imbalance

percentage_imbalance = 0.9
imbalanced_labels = create_imbalance(labels, percentage_imbalance)

unique, counts = np.unique(imbalanced_labels, return_counts=True)
print(f"Number of class 0 examples: {counts[0]}")
print(f"Number of class 1 examples: {counts[1]}")

show_scatterplot(data, imbalanced_labels, "Imbalanced synthetic dataset")

### Dealing with class imbalance


Now that our dataset is heavily skewed towards class 1 examples, let's take a look at the different ways we can deal with this.

#### Approach 1: Downsampling

Downsampling is one method for dealing with imbalanced data. With this approach, the majority class is reduced to closely match the minority class by simply removing examples of the majority class from the dataset. Code to achieve this is provided in `helpers.py`. Spend some time reading and understanding this code.

For any type of resampling or augmentation, it is important that we apply it only to the training set. The test set should be left as is, so that our classifier's evaluation metrics are not biased by any changes made to the test set.

In [None]:
from helpers import downsample
from sklearn.model_selection import train_test_split

data_train, data_test, labels_train, labels_test = train_test_split(data,
                                                                    imbalanced_labels,
                                                                    test_size = 0.33,
                                                                    random_state = 5)


downsampled_data, downsampled_labels = downsample(data_train, labels_train)

show_scatterplot(downsampled_data, downsampled_labels, "Downsampled synthetic training set")

On the plus side, we now have a more balanced dataset. On the downside, we now have far fewer data points to train a model on.
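The downsampling idea described above can be sketched roughly as follows. This is a hypothetical illustration using NumPy, not the `downsample` function from `helpers.py`; the name `random_downsample` and the `seed` parameter are assumptions.

```python
import numpy as np

def random_downsample(data, labels, seed=5):
    """Hypothetical sketch: keep a random subset of each class, equal in
    size to the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()

    # Sample n_keep row indices from every class, without replacement
    keep_idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep_idx)  # avoid leaving the rows grouped by class
    return data[keep_idx], labels[keep_idx]

demo_data = np.arange(40, dtype=float).reshape(20, 2)
demo_labels = np.array([0] * 4 + [1] * 16)
small_data, small_labels = random_downsample(demo_data, demo_labels)
print(np.bincount(small_labels))  # counts of class 0 and class 1
```

In this toy case, 12 of the 16 majority examples are thrown away, which shows how much training data downsampling can cost when the imbalance is severe.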

#### Approach 2: SMOTE

In some cases, such as the one above, we may not want to downsample our dataset as this would leave us with too few data points to train a model on. What if we could do the opposite - that is, oversample the data to create more datapoints?

SMOTE, or Synthetic Minority Over-Sampling Technique, allows you to do just that! Proposed by Chawla *et al.* in 2002, the method generates fake data for the minority class which mirrors the qualities of the real data. This can be very powerful!
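To build some intuition for how the fake data is generated, the sketch below illustrates the core interpolation idea: pick a minority point, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. This is a simplified illustration, not the imbalanced-learn implementation; the function name `smote_like_samples` and its parameters are assumptions.

```python
import numpy as np

def smote_like_samples(minority_points, n_new, k=3, seed=5):
    """Simplified illustration of SMOTE-style interpolation.
    Assumes there are more than k minority points."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority_points, dtype=float)

    # Pairwise distances between all minority points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X))                # random minority point
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()                         # position along the segment, in [0, 1)
        synthetic[i] = X[base] + gap * (X[neighbour] - X[base])
    return synthetic

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies between two real minority points, the new data stays within the region the minority class already occupies.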

If you want to learn more about SMOTE, you can read the paper [here](https://www.jair.org/index.php/jair/article/view/10302). SMOTE itself has been implemented as part of the [imbalanced-learn](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html?highlight=smote#imblearn.over_sampling.SMOTE) Python package.

Task: Using the package documentation linked above, utilise SMOTE to oversample the imbalanced training data we created. Ensure the `random_state` below is used. It is fine to use default parameters for all other arguments. 

In [None]:
from imblearn.over_sampling import SMOTE

random_state = 5

smote_train_data, smote_train_labels = #...

SMOTE will attempt to create an equal number of examples for each class. Let's see if that has happened...

In [None]:
unique, counts = np.unique(smote_train_labels, return_counts=True)

print(f"After SMOTE, number of class 0 examples: {counts[0]}")
print(f"After SMOTE, number of class 1 examples: {counts[1]}")


Run the checks below. If any return False, take another look at the code you have written before continuing.

In [None]:
print(f"SMOTE check, number of class 0 examples: {counts[0] >= 599 and counts[0] <= 615}")
print(f"SMOTE check, number of class 1 examples: {counts[1] >= 599 and counts[1] <= 615}")

Let's plot the distribution of data we have after SMOTE.

In [None]:
show_scatterplot(smote_train_data, smote_train_labels, "Synthetic dataset after SMOTE")

As you can see, SMOTE has used the existing minority-class datapoints to create synthetic data that closely mimics them. We can then use the data created by SMOTE in the same way as before to train models.

Task: Train a model of your choice on the SMOTE data and generate evaluation metrics.
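When choosing evaluation metrics for the task above, bear in mind that accuracy alone can be misleading on an imbalanced test set. The sketch below uses hypothetical toy labels (not the notebook's data) and a "model" that always predicts the majority class, then compares accuracy with per-class recall.

```python
import numpy as np

# Hypothetical toy test set: 90 examples of class 1, 10 of class 0
true_labels = np.array([1] * 90 + [0] * 10)
predictions = np.ones_like(true_labels)  # always predict class 1

# Fraction of all examples predicted correctly
accuracy = np.mean(predictions == true_labels)

# Recall for each class: fraction of that class predicted correctly
recall_per_class = np.array([
    np.mean(predictions[true_labels == c] == c) for c in (0, 1)
])

# Mean of the per-class recalls, which is the quantity that
# sklearn.metrics.balanced_accuracy_score reports
balanced_accuracy = recall_per_class.mean()

print(f"Accuracy: {accuracy}")
print(f"Recall for class 0 and class 1: {recall_per_class}")
print(f"Balanced accuracy: {balanced_accuracy}")
```

The accuracy looks high even though no class 0 example is ever identified, which is why per-class metrics are worth reporting alongside it.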

In [None]:
#...