# Imbalanced Medical Classification with MLPs: Predicting Type 2 Diabetes

In this tutorial-style notebook, we build a **Multilayer Perceptron (MLP)** to predict whether a person has type 2 diabetes based on clinical and lifestyle features (age, BMI, HbA1c, blood glucose, smoking history, etc.).

Our focus is **not just building a neural network**, but understanding how to handle **class imbalance** and how to choose a good **decision threshold** in a medical diagnosis context.

Specifically, we will:
- Explore the dataset and visualise important relationships.
- Show that the diabetes label is **imbalanced** (many more non-diabetic than diabetic cases).
- Build a preprocessing pipeline (scaling + one-hot encoding) using `ColumnTransformer` and `Pipeline`.
- Train a **baseline MLP** with default threshold 0.5 and discuss why accuracy can be misleading.
- Train an **MLP with class weights** to handle imbalance and compare metrics.
- Perform **threshold tuning** using precision–recall curves and metric-vs-threshold plots.
- Visualise confusion matrices at different thresholds to understand trade-offs.

This notebook is meant as a complete, commented example suitable for use as a teaching tutorial.

## 1. Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, learning_curve
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
    roc_curve,
    roc_auc_score,
    precision_recall_curve,
    average_precision_score
)

sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (8, 5)

print('Libraries imported.')

In [None]:
# Load the diabetes prediction dataset
# Make sure the CSV file is in the same directory as this notebook.
file_path = 'diabetes_prediction_dataset.csv'
df = pd.read_csv(file_path)

print('Shape:', df.shape)
df.head()

In [None]:
df.info()

## 2. Exploratory Data Analysis (EDA)

In [None]:
# 2.1 Class distribution (0 = no diabetes, 1 = diabetes)
diabetes_counts = df['diabetes'].value_counts().sort_index()
print(diabetes_counts)
print('\nClass proportions:')
print((diabetes_counts / len(df)).round(3))

plt.figure()
sns.barplot(x=diabetes_counts.index.astype(str), y=diabetes_counts.values)
plt.title('Diabetes Label Distribution')
plt.xlabel('Diabetes (0 = no, 1 = yes)')
plt.ylabel('Count')
plt.show()
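
To see why this imbalance matters before training anything, here is a minimal sketch of a "model" that always predicts the majority class. The labels are synthetic (1,000 patients with an assumed 9% positive rate, not the dataset's actual figure), and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels for illustration only: 90 positives, 910 negatives.
y_true = np.array([1] * 90 + [0] * 910)

# A trivial classifier that always predicts "no diabetes".
y_pred_majority = np.zeros_like(y_true)

majority_accuracy = accuracy_score(y_true, y_pred_majority)
majority_recall = recall_score(y_true, y_pred_majority, pos_label=1)

print(f'Accuracy: {majority_accuracy:.3f}')
print(f'Recall (positive class): {majority_recall:.3f}')
```

The accuracy looks strong even though no diabetic patient is detected, which is why the rest of the notebook reports precision, recall and F1 for the positive class.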

In [None]:
# 2.2 Histograms of key numeric features, coloured by diabetes status
numeric_cols_to_plot = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']

for col in numeric_cols_to_plot:
    plt.figure()
    sns.histplot(data=df, x=col, hue='diabetes', bins=30, kde=True, stat='density', common_norm=False)
    plt.title(f'Distribution of {col} by Diabetes Status')
    plt.show()

In [None]:
# 2.3 Categorical variables vs diabetes
plt.figure()
sns.countplot(data=df, x='gender', hue='diabetes')
plt.title('Gender vs Diabetes')
plt.show()

plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='smoking_history', hue='diabetes')
plt.title('Smoking History vs Diabetes')
plt.xticks(rotation=30)
plt.show()

In [None]:
# 2.4 Correlation heatmap for numeric features
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
corr = df[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Heatmap (Numeric Features)')
plt.show()

## 3. Preprocessing and Train–Test Split

In [None]:
# Define features and target
target_col = 'diabetes'

# We will use all columns except the target as features.
feature_cols = [c for c in df.columns if c != target_col]

X = df[feature_cols].copy()
y = df[target_col].astype(int).copy()

# Identify categorical and numeric features explicitly
categorical_features = ['gender', 'smoking_history']
numeric_features = [c for c in feature_cols if c not in categorical_features]

print('Numeric features:', numeric_features)
print('Categorical features:', categorical_features)

# Stratified split so that train and test keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print('Train shape:', X_train.shape)
print('Test shape:', X_test.shape)
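
The effect of `stratify=y` can be checked on a small synthetic example. The sketch below assumes 1,000 rows with 10% positives (made-up data, hypothetical variable names) and compares a stratified split with a plain random one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 1,000 rows, 100 of them positive.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# Stratified split: both parts keep roughly the original positive rate.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Plain random split: the positive rate can drift between the parts.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print('Stratified positive rate (train, test):', y_tr_strat.mean(), y_te_strat.mean())
print('Unstratified positive rate (train, test):', y_tr_plain.mean(), y_te_plain.mean())
```

With a rare positive class, even a small drift changes how many positives the test set contains, which makes recall estimates noisier.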

In [None]:
# Build preprocessing pipeline: scale numeric features, one-hot encode categoricals.
# We ask OneHotEncoder for dense output so that every later step receives a plain NumPy array.

try:
    cat_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
except TypeError:
    # For older versions of scikit-learn that do not support sparse_output
    cat_transformer = OneHotEncoder(handle_unknown='ignore', sparse=False)

num_transformer = StandardScaler()

preprocess = ColumnTransformer(
    transformers=[
        ('num', num_transformer, numeric_features),
        ('cat', cat_transformer, categorical_features)
    ]
)

print('Preprocessing pipeline created.')
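
To make the transformer's behaviour concrete, here is a small sketch on a toy frame. The values are hypothetical and only the column names mirror the dataset. It shows the output layout (scaled numeric columns first, then one-hot columns) and what `handle_unknown='ignore'` does with an unseen category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with made-up values.
toy = pd.DataFrame({
    'age': [25.0, 40.0, 60.0, 75.0],
    'bmi': [21.5, 27.0, 31.2, 24.8],
    'gender': ['Female', 'Male', 'Female', 'Male'],
})

toy_preprocess = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'bmi']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender']),
])

def to_dense(matrix):
    """Return a dense NumPy array whether or not the input is sparse."""
    return matrix.toarray() if hasattr(matrix, 'toarray') else np.asarray(matrix)

toy_dense = to_dense(toy_preprocess.fit_transform(toy))
print('Transformed shape:', toy_dense.shape)  # 2 scaled + 2 one-hot columns

# A category not seen during fit is encoded as all zeros instead of raising.
unseen = pd.DataFrame({'age': [50.0], 'bmi': [26.0], 'gender': ['Other']})
unseen_dense = to_dense(toy_preprocess.transform(unseen))
print('One-hot part for unseen category:', unseen_dense[0, 2:])
```

Because the scaler and encoder are fitted only on the data passed to `fit`, wrapping them in a `Pipeline` with the model keeps test-set statistics out of training.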

## 4. Baseline MLP (No Class Weighting, Default Threshold 0.5)

In [None]:
# We first train a simple MLP without any special handling for class imbalance.
# This will likely achieve high accuracy (because most patients are non-diabetic),
# but may perform poorly on the positive (diabetes=1) class.

baseline_mlp = MLPClassifier(
    hidden_layer_sizes=(64, 64),
    activation='relu',
    solver='adam',
    learning_rate_init=0.001,
    max_iter=300,
    random_state=42
)

baseline_model = Pipeline(steps=[
    ('preprocess', preprocess),
    ('mlp', baseline_mlp)
])

baseline_model.fit(X_train, y_train)

y_pred_baseline = baseline_model.predict(X_test)
y_proba_baseline = baseline_model.predict_proba(X_test)[:, 1]

acc = accuracy_score(y_test, y_pred_baseline)
prec_pos = precision_score(y_test, y_pred_baseline, pos_label=1)
rec_pos = recall_score(y_test, y_pred_baseline, pos_label=1)
f1_pos = f1_score(y_test, y_pred_baseline, pos_label=1)
roc_auc = roc_auc_score(y_test, y_proba_baseline)
ap = average_precision_score(y_test, y_proba_baseline)

print(f'Baseline MLP Accuracy: {acc:.4f}')
print(f'Baseline MLP Precision (positive class): {prec_pos:.4f}')
print(f'Baseline MLP Recall (positive class): {rec_pos:.4f}')
print(f'Baseline MLP F1 (positive class): {f1_pos:.4f}')
print(f'Baseline MLP ROC AUC: {roc_auc:.4f}')
print(f'Baseline MLP PR AUC (Average Precision): {ap:.4f}')

print('\nClassification report (Baseline):')
print(classification_report(y_test, y_pred_baseline))

In [None]:
cm_baseline = confusion_matrix(y_test, y_pred_baseline)
disp_baseline = ConfusionMatrixDisplay(confusion_matrix=cm_baseline, display_labels=[0, 1])

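# ConfusionMatrixDisplay.plot() creates its own figure unless an axis is passed,
# so we pass one explicitly to avoid an extra empty figure.
fig, ax = plt.subplots()
disp_baseline.plot(ax=ax, values_format='d', cmap='Blues')
ax.set_title('Confusion Matrix - Baseline MLP (Threshold 0.5)')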
plt.show()
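
The positive-class metrics reported above can be read directly off a confusion matrix. As a sketch, using hypothetical labels and predictions chosen so the counts are easy to verify by hand:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels and predictions (not taken from the dataset).
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels {0, 1}, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)   # of predicted positives, how many are real
recall_manual = tp / (tp + fn)      # of real positives, how many are found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print('TN, FP, FN, TP:', tn, fp, fn, tp)
print(f'Precision: {precision_manual:.3f}, Recall: {recall_manual:.3f}, F1: {f1_manual:.3f}')
```

In a diagnostic setting the `fn` cell counts the missed diabetic patients, so it is the first number to check when comparing models or thresholds.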

In [None]:
# ROC curve
fpr, tpr, roc_thresholds = roc_curve(y_test, y_proba_baseline)

plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('ROC Curve - Baseline MLP')
plt.legend()
plt.show()

# Precision-Recall curve
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_proba_baseline)

plt.figure()
plt.plot(recall, precision, label=f'PR curve (AP = {ap:.3f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve - Baseline MLP')
plt.legend()
plt.show()
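
When reading the two curves, it helps to know what an uninformative classifier would score. The sketch below uses synthetic labels with an assumed positive rate of about 10% and scores that are unrelated to the labels: ROC AUC lands near 0.5, while average precision lands near the positive rate. All names and numbers here are illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.RandomState(0)
n = 20000

# Synthetic labels (about 10% positives) and scores independent of the labels.
y_rand = (rng.rand(n) < 0.1).astype(int)
scores_rand = rng.rand(n)

prevalence = y_rand.mean()
ap_random = average_precision_score(y_rand, scores_rand)
auc_random = roc_auc_score(y_rand, scores_rand)

print(f'Positive rate:      {prevalence:.3f}')
print(f'AP of random model: {ap_random:.3f}')
print(f'ROC AUC of random:  {auc_random:.3f}')
```

This is why average precision should be compared against the positive rate of the test set rather than against 0.5.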

## 5. MLP with Class Weights (Handling Imbalance)

In [None]:
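# MLPClassifier has no class_weight parameter (passing one raises a TypeError), so we
# emulate 'balanced' weighting by oversampling each class to the majority size in fit().
class BalancedMLPClassifier(MLPClassifier):
    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.RandomState(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        # Keep every row and add resampled copies until each class matches the majority count.
        idx = np.concatenate([
            np.r_[members, rng.choice(members, size=counts.max() - len(members))]
            for members in (np.flatnonzero(y == c) for c in classes)
        ])
        rng.shuffle(idx)
        return super().fit(X[idx], y[idx])

weighted_mlp = BalancedMLPClassifier(
    hidden_layer_sizes=(64, 64),
    activation='relu',
    solver='adam',
    learning_rate_init=0.001,
    max_iter=300,
    random_state=42
)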

weighted_model = Pipeline(steps=[
    ('preprocess', preprocess),
    ('mlp', weighted_mlp)
])

weighted_model.fit(X_train, y_train)

y_pred_weighted = weighted_model.predict(X_test)
y_proba_weighted = weighted_model.predict_proba(X_test)[:, 1]

acc_w = accuracy_score(y_test, y_pred_weighted)
prec_pos_w = precision_score(y_test, y_pred_weighted, pos_label=1)
rec_pos_w = recall_score(y_test, y_pred_weighted, pos_label=1)
f1_pos_w = f1_score(y_test, y_pred_weighted, pos_label=1)
roc_auc_w = roc_auc_score(y_test, y_proba_weighted)
ap_w = average_precision_score(y_test, y_proba_weighted)

print(f'Weighted MLP Accuracy: {acc_w:.4f}')
print(f'Weighted MLP Precision (positive class): {prec_pos_w:.4f}')
print(f'Weighted MLP Recall (positive class): {rec_pos_w:.4f}')
print(f'Weighted MLP F1 (positive class): {f1_pos_w:.4f}')
print(f'Weighted MLP ROC AUC: {roc_auc_w:.4f}')
print(f'Weighted MLP PR AUC (Average Precision): {ap_w:.4f}')

print('\nClassification report (Weighted MLP):')
print(classification_report(y_test, y_pred_weighted))
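
For reference, the `'balanced'` heuristic that scikit-learn applies in estimators accepting `class_weight` is `n_samples / (n_classes * count_per_class)`. The sketch below reproduces it on synthetic labels with an assumed 90/10 split, using `compute_class_weight` from `sklearn.utils.class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels for illustration: 90 negatives, 10 positives.
y_demo_w = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_demo_w)

# Weights as computed by scikit-learn's 'balanced' heuristic.
balanced = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo_w)

# The same formula written out by hand.
manual = len(y_demo_w) / (len(classes) * np.bincount(y_demo_w))

print('Balanced weights:', dict(zip(classes.tolist(), balanced.round(3).tolist())))
```

Each positive example counts nine times as much as a negative one here, so both classes contribute the same total weight to the loss.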

In [None]:
cm_weighted = confusion_matrix(y_test, y_pred_weighted)
disp_weighted = ConfusionMatrixDisplay(confusion_matrix=cm_weighted, display_labels=[0, 1])

# Pass an explicit axis so that plot() does not open a second, empty figure.
fig, ax = plt.subplots()
disp_weighted.plot(ax=ax, values_format='d', cmap='Blues')
ax.set_title('Confusion Matrix - Weighted MLP (Threshold 0.5)')
plt.show()

In [None]:
# Compare F1 scores for the positive class between baseline and weighted models
models = ['Baseline', 'Weighted']
f1_scores = [f1_pos, f1_pos_w]

plt.figure()
sns.barplot(x=models, y=f1_scores)
plt.title('F1 Score (Positive Class) - Baseline vs Weighted MLP')
plt.ylabel('F1 (diabetes=1)')
plt.show()

## 6. Threshold Tuning for the Weighted MLP

In [None]:
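# The default decision threshold is 0.5. In screening, a missed case (false negative) is usually
# costlier than a false alarm, so a lower threshold that favours recall can be preferable.
# Here we sweep a range of thresholds and compute precision, recall and F1 for the positive class.
# Note: this sweep uses the test set for illustration; pick a deployment threshold on a separate validation split.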

thresholds = np.linspace(0.1, 0.9, 17)  # 0.1, 0.15, ..., 0.9
prec_list = []
rec_list = []
f1_list = []

for th in thresholds:
    y_pred_th = (y_proba_weighted >= th).astype(int)
    prec = precision_score(y_test, y_pred_th, pos_label=1, zero_division=0)
    rec = recall_score(y_test, y_pred_th, pos_label=1, zero_division=0)
    f1 = f1_score(y_test, y_pred_th, pos_label=1, zero_division=0)
    prec_list.append(prec)
    rec_list.append(rec)
    f1_list.append(f1)

threshold_df = pd.DataFrame({
    'threshold': thresholds,
    'precision_pos': prec_list,
    'recall_pos': rec_list,
    'f1_pos': f1_list
})

threshold_df

In [None]:
# Plot precision, recall and F1 as a function of the threshold
plt.figure(figsize=(8, 5))
plt.plot(thresholds, prec_list, marker='o', label='Precision (pos)')
plt.plot(thresholds, rec_list, marker='o', label='Recall (pos)')
plt.plot(thresholds, f1_list, marker='o', label='F1 (pos)')
plt.xlabel('Decision Threshold')
plt.ylabel('Score')
plt.title('Precision, Recall and F1 vs Decision Threshold (Weighted MLP)')
plt.legend()
plt.show()

In [None]:
# Visualise confusion matrices for three representative thresholds
interesting_thresholds = [0.3, 0.5, 0.7]

for th in interesting_thresholds:
    y_pred_th = (y_proba_weighted >= th).astype(int)
    cm_th = confusion_matrix(y_test, y_pred_th)
    disp_th = ConfusionMatrixDisplay(confusion_matrix=cm_th, display_labels=[0, 1])
    # Pass an explicit axis so that plot() does not open a second, empty figure.
    fig, ax = plt.subplots()
    disp_th.plot(ax=ax, values_format='d', cmap='Blues')
    ax.set_title(f'Confusion Matrix - Weighted MLP (Threshold = {th:.2f})')
    plt.show()
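
Instead of reading a threshold off the plots, one option is to fix a minimum recall and take the highest threshold that still meets it. The helper below, `pick_threshold_for_recall`, is a hypothetical sketch built on `precision_recall_curve`. It is meant to be applied to held-out validation probabilities, and the four-point example uses made-up scores.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold_for_recall(y_true, y_score, min_recall):
    """Return the highest threshold whose recall is at least min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall carry one extra trailing entry that has no threshold.
    ok = np.flatnonzero(recall[:-1] >= min_recall)
    if ok.size == 0:
        raise ValueError('No threshold reaches the requested recall.')
    # thresholds are sorted in increasing order, so the last match is the highest.
    return thresholds[ok[-1]]

# Made-up validation labels and predicted probabilities.
y_val_demo = np.array([0, 0, 1, 1])
p_val_demo = np.array([0.1, 0.4, 0.35, 0.8])

th_full = pick_threshold_for_recall(y_val_demo, p_val_demo, min_recall=1.0)
th_half = pick_threshold_for_recall(y_val_demo, p_val_demo, min_recall=0.5)

print('Threshold for recall >= 1.0:', th_full)
print('Threshold for recall >= 0.5:', th_half)
```

Predictions are then made with `(proba >= threshold).astype(int)`, matching the rule used in the sweep above. Choosing the highest qualifying threshold keeps precision as high as the recall constraint allows.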

## 7. Learning Curve for the Weighted MLP

In [None]:
# To understand how performance scales with more data, we plot a learning curve.

train_sizes, train_scores, val_scores = learning_curve(
    weighted_model,
    X_train,
    y_train,
    cv=5,  # an integer cv uses stratified folds for classifiers
    scoring='f1',  # F1 of the positive class (diabetes=1)
    n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 5),
    shuffle=True,
    random_state=42
)

train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

plt.figure()
plt.plot(train_sizes, train_mean, marker='o', label='Training F1')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.2)

plt.plot(train_sizes, val_mean, marker='o', label='Validation F1')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.2)

plt.title('Learning Curve - Weighted MLP (F1 Score)')
plt.xlabel('Number of training samples')
plt.ylabel('F1 score')
plt.legend()
plt.show()

## 8. Conclusion

In this notebook we:
- Confirmed that the diabetes label is **highly imbalanced**, with far fewer positive cases.
- Built a preprocessing pipeline for numeric and categorical features.
- Trained a **baseline MLP** and examined why accuracy alone can hide weak recall on the positive class.
- Trained a **class-balanced MLP**, which aims to raise recall for diabetic patients, usually at some cost in precision.
- Used **ROC and precision–recall curves** to evaluate the classifier beyond accuracy.
- Performed **threshold tuning** and visualised how precision, recall and F1 change with the decision threshold.
- Plotted confusion matrices at different thresholds to make the trade-offs concrete.
- Added a **learning curve** to see how performance scales with additional training data.

This demonstrates how Multilayer Perceptrons can be applied responsibly to imbalanced medical classification problems, with an emphasis on evaluation, threshold selection and performance on the minority (positive) class.