# Exploring Class Imbalance and Overlap in Multi-Class Classification

## Tutorial Session – Data Science Africa Conference

In real-world machine learning applications, from disease diagnosis to fraud detection, data rarely arrives balanced. Some classes dominate while others are underrepresented. This imbalance can bias a classifier toward the majority classes, and the effect is worse when class boundaries also overlap.

This hands-on tutorial explores the effects of **class imbalance** and **class overlap** in **multi-class classification** using synthetic datasets. Participants will interactively test models under three conditions:

1. **Balanced Dataset** – Well-separated, equally distributed classes (ideal scenario)
2. **Imbalanced & Overlapping Dataset** – Reflects many real-world challenges
3. **SMOTE-Enhanced Dataset** – Demonstrates the impact of synthetic oversampling
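
To build intuition for the third condition, the sketch below shows the core idea behind SMOTE: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration for a single class, not the `imbalanced-learn` implementation; the function name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE-style oversampling for ONE minority class (illustration only).

    Assumes len(X_min) > k so every point has k distinct neighbours.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances among the minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each point.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base point and one of its neighbours for every new sample.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate: x_new = x_base + u * (x_neighbour - x_base), with u in [0, 1).
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Five made-up minority points in 2-D.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_samples(X_minority, n_new=20, k=3, seed=0)
print(synthetic.shape)
```

Because every synthetic point is an interpolation between two existing minority points, oversampling fills in the region the minority class already occupies. When that region overlaps another class, the new points land in the overlap as well.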

### Key Objectives:
- Understand how imbalance and overlap affect classifier learning
- Apply `SMOTE` to mitigate imbalance
- Evaluate models using robust metrics such as **balanced accuracy** and **geometric mean**
- Compare performance across different data conditions using interactive widgets
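
As a small worked example of why these metrics matter, the sketch below computes plain accuracy, balanced accuracy, and a geometric mean by hand from a made-up three-class confusion matrix. It assumes the G-mean definition as the n-th root of the product of per-class recalls, which is one common multi-class form; library implementations offer several averaging variants, so their values can differ from this hand computation.

```python
from math import prod

# Rows are true classes, columns are predicted classes (made-up counts).
confusion = [
    [90, 0, 0],  # class 0: 90 samples, all predicted correctly
    [4, 4, 0],   # class 1: 8 samples, half predicted correctly
    [2, 0, 0],   # class 2: 2 samples, none predicted correctly
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))

# Recall of class i = correct predictions for i / number of true samples of i.
recalls = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]

accuracy = correct / total
balanced_accuracy = sum(recalls) / n_classes      # arithmetic mean of recalls
g_mean = prod(recalls) ** (1 / n_classes)         # geometric mean of recalls

print(accuracy, balanced_accuracy, g_mean)  # 0.94 0.5 0.0
```

Plain accuracy looks strong because the majority class dominates the count. Balanced accuracy weights each class equally, and the geometric mean drops to zero as soon as one class is never predicted correctly.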

> **Tools Used**: `scikit-learn`, `imbalanced-learn`, `RandomForestClassifier`, `SMOTE`, `ipywidgets`

This tutorial emphasizes experimentation and interpretability over algorithmic complexity. It is designed for early-career practitioners, researchers, and students who want to build intuition for imbalanced data and make their classification workflows fairer to minority classes.

Let's begin by exploring the data...


# Universal Multi-class Classification Exploration

This interactive notebook lets you explore classification behavior across:
- **Balanced dataset** (well-separated classes)
- **Imbalanced dataset** (skewed class sizes with overlapping classes)
- **SMOTE-enhanced dataset** (minority classes oversampled with SMOTE)

You can switch between datasets and compare classifier performance live.


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, balanced_accuracy_score, ConfusionMatrixDisplay
from imblearn.metrics import geometric_mean_score
from imblearn.over_sampling import SMOTE
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from IPython.display import display, clear_output

# Load datasets
df_balanced = pd.read_csv("mock_balanced_no_overlap.csv")
df_imbalanced = pd.read_csv("mock_multiclass_overlap_imbalance.csv")

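# Create third version (SMOTE-enhanced).
# SMOTE is fit on the training split only, so the held-out 30% of the
# imbalanced data contributes nothing to the synthetic samples.
X_imb = df_imbalanced.drop("label", axis=1)
y_imb = df_imbalanced["label"]
X_train, _, y_train, _ = train_test_split(X_imb, y_imb, stratify=y_imb, test_size=0.3, random_state=42)
# 'not majority' oversamples every class except the largest one
X_sm, y_sm = SMOTE(random_state=42, sampling_strategy='not majority').fit_resample(X_train, y_train)
df_smote = pd.DataFrame(X_sm, columns=X_imb.columns)
df_smote['label'] = y_sm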
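
If the two CSV files are not available locally, stand-in datasets with a similar shape can be generated. The sketch below uses `make_classification` from scikit-learn; the helper `make_mock_frame`, the feature names, the class weights, and the `class_sep` values are all assumptions chosen for illustration and are not the settings used to build the original files. The resulting frames can be assigned to `df_balanced` and `df_imbalanced` in place of the `pd.read_csv` calls.

```python
import pandas as pd
from sklearn.datasets import make_classification

def make_mock_frame(weights, class_sep, n_samples=900, seed=0):
    """Return a DataFrame with numeric feature columns and a 'label' column."""
    X, y = make_classification(
        n_samples=n_samples,
        n_features=5,
        n_informative=3,
        n_redundant=0,
        n_classes=3,
        n_clusters_per_class=1,
        weights=weights,      # None means equal class proportions
        class_sep=class_sep,  # larger values push the classes further apart
        flip_y=0.0,           # no random label noise
        random_state=seed,
    )
    df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
    df["label"] = y
    return df

# Well-separated, equally sized classes.
mock_balanced = make_mock_frame(weights=None, class_sep=3.0)

# Skewed class sizes with classes placed close together so they overlap.
mock_imbalanced = make_mock_frame(
    weights=[0.80, 0.15, 0.05], class_sep=0.5, n_samples=1000
)

print(mock_balanced["label"].value_counts().sort_index().to_dict())
print(mock_imbalanced["label"].value_counts().sort_index().to_dict())
```

Printing the class counts before training is a quick way to confirm that the imbalance is what you intended.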



In [None]:
dataset_dropdown = widgets.Dropdown(
    options=["Balanced", "Imbalanced", "SMOTE-enhanced"],
    description="Dataset",
    layout=widgets.Layout(width='60%')
)

output = widgets.Output()

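def evaluate(dataset_name):
    clear_output(wait=True)

    if dataset_name == "SMOTE-enhanced":
        # df_smote holds the oversampled training split of the imbalanced data.
        # Test on the untouched 30% of the original data: repeating the split with
        # the same arguments recovers exactly those rows. Splitting df_smote itself
        # would put synthetic samples in the test set and can inflate the scores.
        X_train, y_train = df_smote.drop("label", axis=1), df_smote["label"]
        _, X_test, _, y_test = train_test_split(
            X_imb, y_imb, stratify=y_imb, test_size=0.3, random_state=42
        )
    else:
        df = df_balanced if dataset_name == "Balanced" else df_imbalanced
        X = df.drop("label", axis=1)
        y = df["label"]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, stratify=y, test_size=0.3, random_state=42
        )

    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(f"=== {dataset_name} Dataset ===")
    print("Balanced Accuracy:", round(balanced_accuracy_score(y_test, y_pred), 3))
    print("Geometric Mean:", round(geometric_mean_score(y_test, y_pred, average='macro'), 3))
    print(classification_report(y_test, y_pred))

    ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, cmap="YlOrRd")
    plt.title(f"Confusion Matrix - {dataset_name}")
    plt.grid(False)
    plt.show()

# The trailing semicolon keeps Jupyter from echoing the function object
widgets.interact(evaluate, dataset_name=dataset_dropdown);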

