# Exercise: Imbalanced data model bias

In this exercise, we will take a closer look at *imbalanced datasets*, the effect they have on predictions, and how to address it.

We will also use *confusion matrices* to evaluate the changes we make to the models.

## Data visualization

We will still use a dataset that represents different classes of objects found in snow:

In [15]:
import pandas

#Import the data from the .csv file
dataset = pandas.read_csv('Data/snow_objects.csv', delimiter="\t")

#Let's have a look at the data
dataset

Unnamed: 0,size,roughness,color,motion,label
0,50.959361,1.318226,green,0.054290,tree
1,60.008521,0.554291,brown,0.000000,tree
2,20.530772,1.097752,white,1.380464,tree
3,28.092138,0.966482,grey,0.650528,tree
4,48.344211,0.799093,grey,0.000000,tree
...,...,...,...,...,...
2195,1.918175,1.182234,white,0.000000,animal
2196,1.000694,1.332152,black,4.041097,animal
2197,2.331485,0.734561,brown,0.961486,animal
2198,1.786560,0.707935,black,0.000000,animal


Recall that we have an *imbalanced dataset*. Some classes are much more frequent than others:

In [16]:
import graphing # custom graphing code. See our GitHub repo for details

# Plot a histogram with counts for each label
graphing.multiple_histogram(dataset, label_x="label", label_group="label", title="Label distribution")
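
The histogram shows the imbalance visually; the same information can also be read as numbers. The following is a minimal sketch using pandas `value_counts` on a small made-up `label` column. The values in `toy` are invented for illustration and are not the contents of `snow_objects.csv`:

```python
import pandas

# Hypothetical toy labels, NOT the real snow_objects data
toy = pandas.DataFrame({"label": ["tree"] * 6 + ["rock"] * 2 + ["hiker"] + ["animal"]})

# Absolute number of samples per class
counts = toy["label"].value_counts()

# Share of the dataset taken by each class (the shares sum to 1)
proportions = toy["label"].value_counts(normalize=True)

print(counts)
print(proportions)
```

Running the same two calls on `dataset["label"]` should give the exact counts and shares behind the histogram.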

## Using binary classification

For this exercise we will build a *binary classification model*. We want to predict if objects in the snow are hikers or not.

To do that, we first need to add another column to our dataset, and set it to `True` where the original label is `hiker`, and `False` for everything else:


In [17]:
# Add a new label with true/false values to our dataset
dataset["is_hiker"] = dataset.label == "hiker"

# Plot frequency for new label
graphing.multiple_histogram(dataset, label_x="is_hiker", label_group="is_hiker", title="Distribution for new binary label 'is_hiker'")

We now have two classes to be used as labels in our dataset, but we have made it even more imbalanced.

Let's train the random forest model using `is_hiker` as the target variable, then measure its accuracy on both *train* and *test* sets:

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Custom function that measures accuracy on different models and datasets
# We will use this in different parts of the exercise
def assess_accuracy(model, dataset, label):
    """
    Assesses model accuracy on the given dataset.
    Uses the global `features` list, defined further down in this cell.
    """
    actual = dataset[label]
    predictions = model.predict(dataset[features])
    acc = accuracy_score(actual, predictions)
    return acc

# Split the dataset in a 70/30 train/test ratio.
train, test = train_test_split(dataset, test_size=0.3, random_state=1, shuffle=True)

# Define a random forest model
model = RandomForestClassifier(n_estimators=1, random_state=1, verbose=False)

# Define which features are to be used (leave color out for now)
features = ["size", "roughness", "motion"]

# Train the model using the binary label
model.fit(train[features], train.is_hiker)

print("Train accuracy:", assess_accuracy(model,train, "is_hiker"))
print("Test accuracy:", assess_accuracy(model,test, "is_hiker"))

Train accuracy: 0.9532467532467532
Test accuracy: 0.906060606060606


Accuracy looks good for both *train* and *test* sets, but remember this metric is not an absolute measure of success.
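
To see why accuracy alone can mislead on imbalanced data, consider a baseline that never predicts the minority class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels (90 non-hikers and 10 hikers, invented for illustration and not taken from the exercise data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 90 non-hikers and 10 hikers (invented for illustration)
y = np.array([False] * 90 + [True] * 10)

# The baseline ignores the features, so a placeholder column is enough
X = np.zeros((len(y), 1))

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
predictions = baseline.predict(X)

baseline_accuracy = accuracy_score(y, predictions)

# Count how many of the actual hikers were predicted as hikers
hikers_found = int(predictions[y].sum())

print("Baseline accuracy:", baseline_accuracy)
print("Hikers correctly identified:", hikers_found)
```

On these toy labels, the baseline is right 90% of the time without identifying a single hiker, which is why a high accuracy score needs to be checked against the per-class results.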

We can build a confusion matrix to see how the model is actually doing:

In [19]:
# sklearn has a very convenient utility to build confusion matrices
from sklearn.metrics import confusion_matrix
import plotly.figure_factory as ff

# Get the model's predictions on the TEST set
actual = test.is_hiker
predictions = model.predict(test[features])

# Build our confusion matrix, using the actual values and predictions
# from the test set, calculated above
cm = confusion_matrix(actual, predictions, normalize=None)

# Create the list of unique labels in the test set, to use in our plot
# I.e., [False, True]
unique_targets = sorted(list(test["is_hiker"].unique()))

# Convert values to lower case so the plot code can count the outcomes
x = y = [str(s).lower() for s in unique_targets]

# Plot the matrix above as a heatmap with annotations (values) in its cells
fig = ff.create_annotated_heatmap(cm, x, y)

# Set titles and ordering
fig.update_layout(  title_text="<b>Confusion matrix</b>", 
                    yaxis = dict(categoryorder = "category descending")
                    )

fig.add_annotation(dict(font=dict(color="black",size=14),
                        x=0.5,
                        y=-0.15,
                        showarrow=False,
                        text="Predicted label",
                        xref="paper",
                        yref="paper"))

fig.add_annotation(dict(font=dict(color="black",size=14),
                        x=-0.15,
                        y=0.5,
                        showarrow=False,
                        text="Actual label",
                        textangle=-90,
                        xref="paper",
                        yref="paper"))

# We need margins so the titles fit
fig.update_layout(margin=dict(t=80, r=20, l=120, b=50))
fig['data'][0]['showscale'] = True
fig.show()


The confusion matrix shows us that, despite the reported metrics, the model is not very accurate.

Out of the 660 samples present in the *test* set (30% of the total samples), it produced `29` *false negatives* and `33` *false positives*.
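
The four outcome counts can also be read directly from the matrix instead of from the plot. The sketch below uses short made-up lists for `actual` and `predictions` (they are not the exercise's test set) and relies on scikit-learn's layout for binary labels, where rows are actual values and columns are predicted values:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual labels and predictions (invented for illustration)
actual      = [True, True, True, False, False, False, False, False]
predictions = [True, False, False, False, False, False, True, False]

# Labels are sorted as [False, True], so ravel() yields
# TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(actual, predictions).ravel()

# Recall: share of actual hikers that the model found
recall = tp / (tp + fn)

# Precision: share of "hiker" predictions that were correct
precision = tp / (tp + fp)

print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("Recall:", recall, "Precision:", precision)
```

Recall and precision describe the minority class specifically, so they tend to expose problems that overall accuracy hides.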

What happens if we use this model to make predictions on balanced sets?

Let's load a dataset with an equal number of outcomes for hikers and non-hikers, then use that data to make predictions:

In [20]:
# Load and print the balanced set
# Import the data from the .csv file
balanced_dataset = pandas.read_csv('Data/snow_objects_balanced_for_hikers.csv', delimiter="\t")

#Let's have a look at the data
balanced_dataset

Unnamed: 0,size,roughness,color,motion,label
0,32.513125,0.895530,brown,0.000000,tree
1,24.002765,0.753326,grey,0.335366,tree
2,33.549935,0.619093,black,0.210880,tree
3,18.416812,0.540221,green,0.362724,tree
4,67.211581,1.333211,white,0.000000,tree
...,...,...,...,...,...
1195,2.686824,0.906440,black,4.773764,animal
1196,1.503541,1.317080,brown,3.388021,animal
1197,1.802189,0.714832,white,7.076913,animal
1198,1.767165,0.602013,black,3.395977,animal


In [21]:

# Commented out on purpose because this might only make things confusing and the exercise longer
# Just jump straight into the histogram with the "is_hiker" label
# graphing.multiple_histogram(balanced_dataset, label_x="label", label_group="label", title="Label distribution in balanced dataset")

In [22]:

# Add a new label with true/false values to our dataset
balanced_dataset["is_hiker"] = balanced_dataset.label == "hiker"

# Plot frequency for "is_hiker" labels
graphing.multiple_histogram(balanced_dataset, label_x="is_hiker", label_group="is_hiker", title="Label distribution in balanced dataset")

As you can see, the `is_hiker` label has the same frequency for both classes. We are now using a *class balanced dataset*.

Let's run predictions on this set using the previously trained model:

In [23]:
# Test the model using a balanced dataset
actual = balanced_dataset.is_hiker
predictions = model.predict(balanced_dataset[features])

# Build and print our confusion matrix, using the actual values and predictions
# for the balanced set, calculated above
cm = confusion_matrix(actual, predictions, normalize=None)
print(cm)

# Print accuracy using this set
print("Balanced set accuracy:", assess_accuracy(model,balanced_dataset, "is_hiker"))

[[485 115]
 [188 412]]
Balanced set accuracy: 0.7475


As expected, we see a noticeable drop in accuracy using a different set.
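
The printed matrix already tells us where the errors are concentrated. As a small worked example, the sketch below re-enters the printed values by hand and, assuming scikit-learn's layout (rows are actual values, columns are predicted values, labels sorted as `False` then `True`), derives the accuracy and the per-class rates:

```python
import numpy as np

# Values copied from the confusion matrix printed above
# (row 0: actual non-hikers, row 1: actual hikers)
cm = np.array([[485, 115],
               [188, 412]])

# Overall accuracy: correct predictions (the diagonal) over all samples
accuracy = np.trace(cm) / cm.sum()

# Divide each row by its total to get per-class rates
rates = cm / cm.sum(axis=1, keepdims=True)

print("Accuracy:", accuracy)
print(rates.round(3))
```

Under that layout, about 69% of the actual hikers are recognized, compared with about 81% of the non-hikers, so the model makes more mistakes on the class it saw less often during training.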

Again, let's visually analyze its performance:

In [24]:
# Plot new confusion matrix
# Create the list of unique labels in the balanced set, to use in our plot
# I.e., [False, True]
unique_targets = sorted(list(balanced_dataset["is_hiker"].unique()))

# Convert values to lower case so the plot code can count the outcomes
x = y = [str(s).lower() for s in unique_targets]

# Plot the matrix above as a heatmap with annotations (values) in its cells
fig = ff.create_annotated_heatmap(cm, x, y)

# Set titles and ordering
fig.update_layout(  title_text="<b>Confusion matrix</b>", 
                    yaxis = dict(categoryorder = "category descending")
                    )

fig.add_annotation(dict(font=dict(color="black",size=14),
                        x=0.5,
                        y=-0.15,
                        showarrow=False,
                        text="Predicted label",
                        xref="paper",
                        yref="paper"))

fig.add_annotation(dict(font=dict(color="black",size=14),
                        x=-0.15,
                        y=0.5,
                        showarrow=False,
                        text="Actual label",
                        textangle=-90,
                        xref="paper",
                        yref="paper"))

# We need margins so the titles fit
fig.update_layout(margin=dict(t=80, r=20, l=120, b=50))
fig['data'][0]['showscale'] = True
fig.show()

The confusion matrix confirms the poor accuracy using this dataset, but why is this happening when we had such excellent metrics in the earlier *train* and *test* sets?

Recall that the first model was trained on a dataset that was heavily imbalanced against the "hiker" class (which made up only 22% of the labels).

When such an imbalance happens, classification models don't have enough data to "learn" the patterns for the __minority__ class and, as a consequence, become biased towards the __majority__ class!

Imbalanced sets can be addressed in a number of ways:

- Improving data selection
- Resampling the dataset
- Using weighted classes

For this exercise, we will focus on the last option.
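
For comparison, the resampling option can be sketched briefly. The example below is only an illustration and is not part of this exercise: it upsamples the minority class of a made-up `toy` DataFrame (values invented here) with scikit-learn's `resample` utility, drawing minority rows with replacement until both classes have the same size:

```python
import pandas
from sklearn.utils import resample

# Hypothetical toy data: 8 non-hikers and 2 hikers (invented for illustration)
toy = pandas.DataFrame({
    "size": [30, 25, 40, 35, 28, 33, 45, 38, 2, 3],
    "is_hiker": [False] * 8 + [True] * 2,
})

majority = toy[~toy["is_hiker"]]
minority = toy[toy["is_hiker"]]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=1,
)

# Combine and shuffle so the classes are not grouped together
balanced_toy = pandas.concat([majority, minority_upsampled]).sample(frac=1, random_state=1)

print(balanced_toy["is_hiker"].value_counts())
```

If this approach were used, it would normally be applied to the *train* set only, so that the *test* set keeps reflecting the data the model will actually encounter.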

## Using class weights to balance the dataset

We can assign different *weights* to the majority and minority classes, according to their distribution, and modify our training algorithm so that it takes that information into account during the training phase.

It will then penalize errors more heavily when the minority class is misclassified (at the expense of errors against the majority class), in essence "forcing" the model to better learn its features and patterns.
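
To make the idea of weights "according to their distribution" concrete, the sketch below computes them by hand on made-up labels (8 non-hikers and 2 hikers, invented for illustration). It assumes the "balanced" heuristic documented by scikit-learn, `n_samples / (n_classes * samples_in_class)`, and compares the result with `compute_class_weight`:

```python
import numpy as np
from sklearn.utils import class_weight

# Hypothetical labels: 8 non-hikers and 2 hikers (invented for illustration)
y = np.array([False] * 8 + [True] * 2)

# Sorted unique classes: [False, True]
classes = np.unique(y)

# Number of samples in each class, in the same order as `classes`
counts = np.array([(y == c).sum() for c in classes])

# "balanced" heuristic: n_samples / (n_classes * samples_in_class)
manual_weights = len(y) / (len(classes) * counts)

# The same calculation as performed by scikit-learn
sklearn_weights = class_weight.compute_class_weight("balanced", classes=classes, y=y)

print(dict(zip(classes.tolist(), manual_weights.tolist())))
print(sklearn_weights)
```

The rarer a class is, the larger its weight, so each misclassified minority sample counts for more during training.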

To use weighted classes, we have to retrain our model using the original *train* set, but this time telling the algorithm to use weights when calculating errors:


In [25]:
# Import function used in calculating weights
from sklearn.utils import class_weight


# Leaving the code below, in case we want to show how to calculate weights
# If that is preferred, pass class_weight=weight_dict to the model constructor
# Using class_weight="balanced" produces slightly better results

# # Calculate class weights
# classes = train["is_hiker"].unique()
# class_weights = class_weight.compute_class_weight('balanced',
#                                                     classes=classes,
#                                                     y=dataset["is_hiker"].to_numpy())

# # We have to build a dict with the weights so it can be used by the classifier algorithm
# # so it looks something like {False: 0.61, True: 2.75}
# # zip keeps each weight paired with its class, whatever order unique() returns
# weight_dict = dict(zip(classes, class_weights))


# Retrain model using class weights
# Using class_weight="balanced" tells the algorithm to automatically calculate weights for us
weighted_model = RandomForestClassifier(n_estimators=1, random_state=1, verbose=False, class_weight="balanced")
# Train the weighted_model using binary label
weighted_model.fit(train[features], train.is_hiker)

print("Train accuracy:", assess_accuracy(weighted_model,train, "is_hiker"))
print("Test accuracy:", assess_accuracy(weighted_model,test, "is_hiker"))

Train accuracy: 0.9525974025974026
Test accuracy: 0.9166666666666666


After using the weighted classes, the *train* accuracy remained almost the same, while the *test* accuracy showed a small improvement (roughly 1 percentage point).

Let's see if results are improved at all using the __balanced__ set for predictions again:


In [26]:
print("Balanced set accuracy:", assess_accuracy(weighted_model,balanced_dataset, "is_hiker"))

# Test the weighted_model using a balanced dataset
actual = balanced_dataset.is_hiker
predictions = weighted_model.predict(balanced_dataset[features])

# Build our confusion matrix, using the actual values and predictions
# for the balanced set, calculated above
cm = confusion_matrix(actual, predictions, normalize=None)


Balanced set accuracy: 0.7875


The accuracy for the balanced set increased by roughly 4 percentage points, but we need to visualize the results to understand if this is a significant improvement.

## Final confusion matrix

We can now plot a final confusion matrix, representing predictions for a *balanced dataset*, using a model trained with *class weights*:

In [27]:
# Plot the matrix above as a heatmap with annotations (values) in its cells
fig = ff.create_annotated_heatmap(cm, x, y)

# Set titles and ordering
fig.update_layout(  title_text="<b>Confusion matrix</b>", 
                    yaxis = dict(categoryorder = "category descending")
                    )

fig.add_annotation(dict(font=dict(color="black",size=14),
                        x=0.5,
                        y=-0.15,
                        showarrow=False,
                        text="Predicted label",
                        xref="paper",
                        yref="paper"))

fig.add_annotation(dict(font=dict(color="black",size=14),
                        x=-0.15,
                        y=0.5,
                        showarrow=False,
                        text="Actual label",
                        textangle=-90,
                        xref="paper",
                        yref="paper"))

# We need margins so the titles fit
fig.update_layout(margin=dict(t=80, r=20, l=120, b=50))
fig['data'][0]['showscale'] = True
fig.show()

While the results might look a bit disappointing, we now have 21% wrong predictions (FNs + FPs), against 25% from the previous experiment.

Correct predictions (TPs + TNs) went from 74.7% to 78.7%.
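
One way to put these figures in perspective is to compare the absolute change in accuracy with the relative change in errors. The short calculation below uses only the two balanced-set accuracies reported above; the variable names are chosen here for illustration:

```python
# Balanced-set accuracies reported above
unweighted_accuracy = 0.7475
weighted_accuracy = 0.7875

unweighted_error = 1 - unweighted_accuracy
weighted_error = 1 - weighted_accuracy

# Absolute change, in percentage points
absolute_gain = (weighted_accuracy - unweighted_accuracy) * 100

# Relative change: share of the original errors that disappeared
relative_error_reduction = (unweighted_error - weighted_error) / unweighted_error

print(f"Error rate: {unweighted_error:.2%} -> {weighted_error:.2%}")
print(f"Absolute gain: {absolute_gain:.1f} percentage points")
print(f"Relative error reduction: {relative_error_reduction:.1%}")
```

Seen this way, a gain of 4 percentage points removes roughly 16% of the original mistakes; on the 1,200 samples of the balanced set, that is 48 fewer misclassified objects (from 303 to 255).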

Is an all-around improvement of 4 percentage points significant or not?

Remember that we had relatively little data to train the model, and the features we have available may still be so similar for different samples (for example, hikers and animals tend to be small, non-rough and move a lot), that despite our efforts, the model still has some difficulty making correct predictions.

We only had to change a single line of code to get better results, so it seems worth the effort!



## Summary

This was a long exercise, where we covered the following topics:

- Creating new label fields so we can perform *binary classification* using a dataset with multiple classes.
- How training on *imbalanced sets* can have a negative effect on performance, especially when using unseen data from *balanced datasets*.
- Evaluating results of *binary classification* models using a confusion matrix.
- Using weighted classes to address class imbalances when training a model and evaluating the results.

