# Baseline Models for Readmission Prediction

This notebook trains and evaluates four baseline models for predicting 30-day hospital readmissions:

**Unsupervised Anomaly Detectors:**
- Isolation Forest
- Autoencoder

**Supervised Baselines:**
- Decision Tree
- Random Forest

All models are evaluated on the same held-out test set, drawn from a single stratified train/test split, so their metrics are directly comparable.
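
As a rough illustration of what a stratified split guarantees, the sketch below uses scikit-learn's `train_test_split` on synthetic data. It assumes `train_test_split_stratified` behaves similarly, which this notebook does not confirm; the names `X_demo` and `y_demo` and the 11% positive rate are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): 1,000 rows, roughly 11% positives
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.11).astype(int)

# stratify=y_demo keeps the class ratio nearly identical in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print(f"Train positive rate: {y_tr.mean():.4f}")
print(f"Test positive rate:  {y_te.mean():.4f}")
```

With a rare positive class, stratification matters because an unstratified split can leave the test set with noticeably fewer or more positives than the training set, which distorts PR-AUC in particular.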

## Setup

In [None]:
from pathlib import Path
import sys

CWD = Path.cwd().resolve()
if CWD.name == "notebooks":
    PROJECT_ROOT = CWD.parent
else:
    PROJECT_ROOT = CWD

sys.path.insert(0, str(PROJECT_ROOT))

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from src.config import GLOBAL_CONFIG
from src.preprocessing import load_raw_data, build_feature_matrix, train_test_split_stratified
from src.models import (
    IsolationForestDetector,
    AutoencoderDetector,
    DecisionTreeDetector,
    RandomForestDetector,
)
from src.evaluation import compute_classification_metrics, plot_roc_pr_curves

# Create results directory
results_dir = PROJECT_ROOT / 'results'
results_dir.mkdir(exist_ok=True)

## Data Preprocessing

In [None]:
# Load raw data
data_path = PROJECT_ROOT / 'data' / 'raw' / 'diabetic_data.csv'
df = load_raw_data(str(data_path))
print(f"Loaded {len(df):,} records")

In [None]:
# Build feature matrix
X, y, preprocessor = build_feature_matrix(df)
print(f"Feature matrix: {X.shape}")
print(f"Target distribution: {y.value_counts().to_dict()}")
print(f"Positive class (readmitted<30): {y.sum()} ({y.mean()*100:.2f}%)")

In [None]:
# Train/test split using config parameters
cfg = GLOBAL_CONFIG
X_train, X_test, y_train, y_test = train_test_split_stratified(
    X,
    y,
    test_size=cfg.data.test_size,
    random_state=cfg.data.random_seeds[0],
)

print(f"Train set: {X_train.shape[0]:,} samples")
print(f"Test set:  {X_test.shape[0]:,} samples")
print(f"Train positive rate: {y_train.mean():.4f}")
print(f"Test positive rate:  {y_test.mean():.4f}")

---
## Unsupervised Baselines: Isolation Forest & Autoencoder

These models treat readmission prediction as an **anomaly detection** problem:
- Trained only on **normal samples** (y=0, i.e., not readmitted within 30 days)
- Anomaly scores identify patterns that deviate from the normal training distribution
- Higher anomaly scores indicate higher risk of readmission
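
The normal-only training idea can be sketched with scikit-learn's `IsolationForest` on synthetic data. This is an illustration under assumptions, not the implementation of `IsolationForestDetector`: the data, the 4-sigma shift, and the sign flip on `score_samples` are choices made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical data: "normal" rows around 0, "anomalous" rows shifted to 4
X_normal_demo = rng.normal(0.0, 1.0, size=(500, 4))
X_shifted_demo = rng.normal(4.0, 1.0, size=(50, 4))

# Fit on normal rows only; the shifted rows are never seen during training
forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(X_normal_demo)

# score_samples returns lower values for more abnormal rows,
# so negate it to get "higher = more anomalous"
scores_normal = -forest.score_samples(X_normal_demo)
scores_shifted = -forest.score_samples(X_shifted_demo)

print(f"Mean score, normal rows:  {scores_normal.mean():.4f}")
print(f"Mean score, shifted rows: {scores_shifted.mean():.4f}")
```

Real readmission data is unlikely to separate this cleanly; the sketch only shows the direction of the scores.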

### Model 1: Isolation Forest

In [None]:
# Filter normal samples for training
normal_mask_if = (y_train == 0)
X_train_normal = X_train[normal_mask_if]

print(f"Training Isolation Forest on {X_train_normal.shape[0]:,} normal samples...")

if_detector = IsolationForestDetector(
    n_estimators=cfg.isolation_forest.n_estimators,
    contamination=float(y_train.mean()),  # Use actual positive rate
    random_state=cfg.isolation_forest.random_state,
)
if_detector.fit(X_train_normal)

print("Training complete!")

In [None]:
# Compute anomaly scores on test set
if_scores_test = if_detector.predict_scores(X_test)

print(f"Anomaly scores range: [{if_scores_test.min():.4f}, {if_scores_test.max():.4f}]")
print(f"Mean score: {if_scores_test.mean():.4f}")

In [None]:
# Evaluate Isolation Forest
if_metrics = compute_classification_metrics(y_test, if_scores_test, model_name="IsolationForest")

print("\nIsolation Forest - Evaluation Results")
print("="*50)
print(f"ROC-AUC: {if_metrics['roc_auc']:.4f}")
print(f"PR-AUC:  {if_metrics['pr_auc']:.4f}")
print("="*50)

In [None]:
# Plot ROC and PR curves
plot_roc_pr_curves(
    y_test, 
    if_scores_test, 
    title="Isolation Forest", 
    save_path=str(results_dir / 'nb_if_roc_pr.png'),
    show=True
)
print(f"Plot saved to: {results_dir / 'nb_if_roc_pr.png'}")

### Model 2: Autoencoder

In [None]:
print(f"Training Autoencoder on {X_train_normal.shape[0]:,} normal samples...")

ae_detector = AutoencoderDetector(
    input_dim=X_train.shape[1],
    hidden_dims=list(cfg.autoencoder.hidden_dims),
    epochs=cfg.autoencoder.epochs,
    batch_size=cfg.autoencoder.batch_size,
    learning_rate=cfg.autoencoder.learning_rate,
)

ae_detector.fit(X_train_normal)

print("Training complete!")

In [None]:
# Compute reconstruction errors on test set
ae_scores_test = ae_detector.predict_scores(X_test)

print(f"Reconstruction error range: [{ae_scores_test.min():.6f}, {ae_scores_test.max():.6f}]")
print(f"Mean reconstruction error: {ae_scores_test.mean():.6f}")

In [None]:
# Evaluate Autoencoder
ae_metrics = compute_classification_metrics(y_test, ae_scores_test, model_name="Autoencoder")

print("\nAutoencoder - Evaluation Results")
print("="*50)
print(f"ROC-AUC: {ae_metrics['roc_auc']:.4f}")
print(f"PR-AUC:  {ae_metrics['pr_auc']:.4f}")
print("="*50)

In [None]:
# Plot ROC and PR curves
plot_roc_pr_curves(
    y_test, 
    ae_scores_test, 
    title="Autoencoder", 
    save_path=str(results_dir / 'nb_ae_roc_pr.png'),
    show=True
)
print(f"Plot saved to: {results_dir / 'nb_ae_roc_pr.png'}")

#### Isolation Forest vs Autoencoder

**Comparison**:
- **Isolation Forest**: Tree-based anomaly detection, fast training, works well with tabular data
- **Autoencoder**: Neural network-based, learns complex feature interactions through reconstruction

Both models are fit on normal-only data, so labels are used only to select the training rows (y=0). Neither model sees a readmitted example during training; readmissions are flagged as anomalies at test time.
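
To make "reconstruction error as an anomaly score" concrete, the sketch below swaps the neural autoencoder for PCA, which acts as a linear encoder/decoder pair. It is a stand-in chosen to keep the example short and deterministic, not the architecture of `AutoencoderDetector`; the synthetic data and the `reconstruction_error` helper are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical "normal" data lying close to a 2-D subspace of a 6-D space
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X_regular = latent @ mixing + 0.05 * rng.normal(size=(500, 6))

# Hypothetical "anomalous" data that does not follow that subspace
X_odd = 2.0 * rng.normal(size=(50, 6))

# Fit the linear encoder/decoder on normal rows only
pca = PCA(n_components=2).fit(X_regular)


def reconstruction_error(model, X):
    """Per-row mean squared error between X and its reconstruction."""
    X_hat = model.inverse_transform(model.transform(X))
    return np.mean((X - X_hat) ** 2, axis=1)


err_regular = reconstruction_error(pca, X_regular)
err_odd = reconstruction_error(pca, X_odd)

print(f"Mean error, normal rows:    {err_regular.mean():.6f}")
print(f"Mean error, anomalous rows: {err_odd.mean():.6f}")
```

A nonlinear autoencoder follows the same recipe: compress, reconstruct, and score each row by how poorly it is reconstructed.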

---
## Supervised Baselines: Decision Tree & Random Forest

These models use **supervised learning** with true readmission labels:
- Trained on the full training set with both normal and positive samples
- Use class weighting to handle imbalance (`class_weight="balanced"` for the Decision Tree, `class_weight="balanced_subsample"` for the Random Forest)
- Predict probability scores for the positive class (readmitted <30)
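
For reference, scikit-learn's `"balanced"` mode weights each class by `n_samples / (n_classes * count(class))`. The sketch below checks that formula on a toy label vector; `y_toy` and its 90/10 split are invented for the example, and it assumes the detector classes pass `class_weight` through to scikit-learn unchanged.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (hypothetical): 90 negatives, 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
classes = np.array([0, 1])

# Weights as scikit-learn computes them for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights from the formula n_samples / (n_classes * class_count)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))
```

The `"balanced_subsample"` mode applies the same formula, but recomputes it on the bootstrap sample drawn for each tree.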

### Model 3: Decision Tree

In [None]:
print(f"Training Decision Tree on {X_train.shape[0]:,} samples (supervised)...")

dt_detector = DecisionTreeDetector(
    max_depth=8,
    min_samples_leaf=50,
    random_state=42,
    class_weight="balanced",
)
dt_detector.fit(X_train, y_train)

print("Training complete!")

In [None]:
# Compute prediction scores
dt_scores_test = dt_detector.predict_scores(X_test)

print(f"Prediction scores range: [{dt_scores_test.min():.4f}, {dt_scores_test.max():.4f}]")
print(f"Mean score: {dt_scores_test.mean():.4f}")

In [None]:
# Evaluate Decision Tree
dt_metrics = compute_classification_metrics(y_test, dt_scores_test, model_name="DecisionTree")

print("\nDecision Tree - Evaluation Results")
print("="*50)
print(f"ROC-AUC: {dt_metrics['roc_auc']:.4f}")
print(f"PR-AUC:  {dt_metrics['pr_auc']:.4f}")
print("="*50)

In [None]:
# Plot ROC and PR curves
plot_roc_pr_curves(
    y_test, 
    dt_scores_test, 
    title="Decision Tree (supervised)", 
    save_path=str(results_dir / 'nb_dt_roc_pr.png'),
    show=True
)
print(f"Plot saved to: {results_dir / 'nb_dt_roc_pr.png'}")

### Model 4: Random Forest

In [None]:
print(f"Training Random Forest on {X_train.shape[0]:,} samples (supervised)...")

rf_detector = RandomForestDetector(
    n_estimators=300,
    max_depth=None,
    min_samples_leaf=50,
    random_state=42,
    class_weight="balanced_subsample",
    n_jobs=-1,
)
rf_detector.fit(X_train, y_train)

print("Training complete!")

In [None]:
# Compute prediction scores
rf_scores_test = rf_detector.predict_scores(X_test)

print(f"Prediction scores range: [{rf_scores_test.min():.4f}, {rf_scores_test.max():.4f}]")
print(f"Mean score: {rf_scores_test.mean():.4f}")

In [None]:
# Evaluate Random Forest
rf_metrics = compute_classification_metrics(y_test, rf_scores_test, model_name="RandomForest")

print("\nRandom Forest - Evaluation Results")
print("="*50)
print(f"ROC-AUC: {rf_metrics['roc_auc']:.4f}")
print(f"PR-AUC:  {rf_metrics['pr_auc']:.4f}")
print("="*50)

In [None]:
# Plot ROC and PR curves
plot_roc_pr_curves(
    y_test, 
    rf_scores_test, 
    title="Random Forest (supervised)", 
    save_path=str(results_dir / 'nb_rf_roc_pr.png'),
    show=True
)
print(f"Plot saved to: {results_dir / 'nb_rf_roc_pr.png'}")

---
## Model Comparison: Supervised vs Unsupervised

**Summary Table**:

In [None]:
# Create comparison DataFrame
comparison_df = pd.DataFrame([
    {
        'Model': 'Isolation Forest',
        'Type': 'Unsupervised',
        'ROC-AUC': if_metrics['roc_auc'],
        'PR-AUC': if_metrics['pr_auc'],
    },
    {
        'Model': 'Autoencoder',
        'Type': 'Unsupervised',
        'ROC-AUC': ae_metrics['roc_auc'],
        'PR-AUC': ae_metrics['pr_auc'],
    },
    {
        'Model': 'Decision Tree',
        'Type': 'Supervised',
        'ROC-AUC': dt_metrics['roc_auc'],
        'PR-AUC': dt_metrics['pr_auc'],
    },
    {
        'Model': 'Random Forest',
        'Type': 'Supervised',
        'ROC-AUC': rf_metrics['roc_auc'],
        'PR-AUC': rf_metrics['pr_auc'],
    },
])

print("\n" + "="*70)
print("MODEL COMPARISON")
print("="*70)
print(comparison_df.to_string(index=False))
print("="*70)

**Key Insights**:

1. **Supervised models (DT, RF)** typically achieve **higher ROC-AUC and PR-AUC** because they use readmission labels during training
   - They learn direct patterns associated with readmission
   - Require labeled data for training

2. **Unsupervised models (IF, AE)** provide an alternative that needs **no positive examples**:
   - Train only on normal patient data (labels are used only to select these rows)
   - Identify readmissions as deviations from normal patterns
   - Useful when positive labels are scarce or unreliable

3. **Next step**: The ontology layer can enhance unsupervised models by injecting **clinical domain knowledge** without requiring additional labels, bridging the gap with supervised methods.
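
When reading the ROC-AUC and PR-AUC columns above, it helps to know each metric's chance level: about 0.5 for ROC-AUC, and about the positive rate for PR-AUC. The sketch below shows this on synthetic scores. It assumes PR-AUC is computed as average precision, which may differ from what `compute_classification_metrics` does; the labels, scores, and 10% positive rate are made up.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels (hypothetical) with roughly 10% positives
y_true = (rng.random(5000) < 0.1).astype(int)

# Scores with no signal, and scores where positives are shifted upward
random_scores = rng.random(5000)
informative_scores = random_scores + 0.5 * y_true

print(f"Positive rate: {y_true.mean():.4f}")
for name, scores in [("random", random_scores), ("informative", informative_scores)]:
    roc = roc_auc_score(y_true, scores)
    ap = average_precision_score(y_true, scores)
    print(f"{name:>12}: ROC-AUC={roc:.4f}  PR-AUC={ap:.4f}")
```

On an imbalanced target, a PR-AUC that looks low in absolute terms can still be a clear improvement over the positive-rate baseline.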