# CardioDetect – Systematic MLP Grid Search & Threshold Tuning

**Objective**: Explore MLP architectures that beat `mlp_v2` on test accuracy (baseline: test accuracy 0.9359, test recall 0.9190) while keeping recall ≥ 0.9190.

**Grid**: 5 architectures × 4 alphas × 2 learning rates × 2 `max_iter` values = 80 experiments

**Output (this notebook)**: structured CSV with metrics for all experiments.

**Note**: Obsidian notes and leaderboards are generated by a separate script, `scripts/generate_mlp_obsidian_notes.py`; this notebook does not write to Obsidian.


In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime
import time
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (
    accuracy_score, recall_score, precision_score, 
    f1_score, roc_auc_score, confusion_matrix
)
import sys

warnings.filterwarnings('ignore', category=UserWarning)
sys.path.insert(0, str(Path.cwd().parent))
from src.mlp_tuning import load_splits, encode_categorical_features

# Where to store raw experiment results (CSV only, no Obsidian coupling)
RESULTS_DIR = Path.cwd().parent / 'output' / 'mlp_experiments'
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

print('Imports OK')
print(f'Results will be saved to: {RESULTS_DIR}')

Imports OK
Obsidian output: /Users/prajanv/CardioDetect/obsidian_notes/experiments_mlp


In [26]:
# Load data and scale features
X_train, y_train, X_val, y_val, X_test, y_test = load_splits()
X_train, X_val, X_test = encode_categorical_features(X_train, X_val, X_test)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(f'Train: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)}')
print(f'Features: {X_train_s.shape[1]}')

Train: 11286 | Val: 2418 | Test: 2419
Features: 179


In [27]:
# ============ CONFIGURATION ============
BASELINE_TEST_ACC = 0.9359
BASELINE_TEST_RECALL = 0.9190
RECALL_CONSTRAINT = 0.9190  # Hard constraint

# Hyperparameter search space
HIDDEN_LAYER_SIZES = [
    (128, 64, 32),
    (256, 128, 64),
    (256, 256, 128),
    (128, 64),
    (64, 32),
]
ALPHAS = [1e-5, 1e-4, 1e-3, 1e-2]
LEARNING_RATES = [0.001, 0.0007]
MAX_ITERS = [300, 500]

# Calculate total experiments
total_experiments = len(HIDDEN_LAYER_SIZES) * len(ALPHAS) * len(LEARNING_RATES) * len(MAX_ITERS)
print(f'Total experiments to run: {total_experiments}')
print(f'Baseline: Acc={BASELINE_TEST_ACC}, Recall={BASELINE_TEST_RECALL}')
print(f'Constraint: Recall >= {RECALL_CONSTRAINT}')

Total experiments to run: 80
Baseline: Acc=0.9359, Recall=0.919
Constraint: Recall >= 0.919


In [28]:
# ============ HELPER FUNCTIONS ============

def compute_metrics(y_true, y_pred, y_proba):
    """Compute all metrics for a split."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'f1': f1_score(y_true, y_pred, zero_division=0),
        'auc': roc_auc_score(y_true, y_proba),
    }

def eval_model_at_threshold(model, X, y, threshold=0.5):
    """Evaluate model at a specific threshold."""
    y_proba = model.predict_proba(X)[:, 1]
    y_pred = (y_proba >= threshold).astype(int)
    return compute_metrics(y, y_pred, y_proba)

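def find_best_threshold(model, X_val, y_val, recall_floor=0.9190):
    """
    Find the threshold that maximizes validation accuracy while keeping recall >= recall_floor.
    Scans thresholds from 0.10 to 0.90 in steps of 0.01; accuracy ties go to the higher recall.
    If no threshold satisfies the floor, falls back to 0.5 (metrics may then violate the floor).
    Returns (best_threshold, best_metrics_at_threshold).
    """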
    y_proba = model.predict_proba(X_val)[:, 1]
    
    best_thresh = 0.5
    best_acc = -1.0
    best_rec = 0.0
    
    for thresh in np.linspace(0.1, 0.9, 81):
        y_pred = (y_proba >= thresh).astype(int)
        rec = recall_score(y_val, y_pred)
        acc = accuracy_score(y_val, y_pred)
        
        if rec >= recall_floor:
            if acc > best_acc or (acc == best_acc and rec > best_rec):
                best_acc = acc
                best_rec = rec
                best_thresh = thresh
    
    # If no threshold meets constraint, return default 0.5
    if best_acc < 0:
        best_thresh = 0.5
    
    # Compute full metrics at best threshold
    y_pred = (y_proba >= best_thresh).astype(int)
    metrics = compute_metrics(y_val, y_pred, y_proba)
    
    return best_thresh, metrics

def get_confusion_matrix_str(y_true, y_pred):
    """Return confusion matrix as formatted string."""
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    return f'TP={tp}, FP={fp}, TN={tn}, FN={fn}'

print('Helper functions defined')

Helper functions defined
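
The recall-constrained threshold search can be tried in isolation on a toy example. The sketch below is not the notebook's `find_best_threshold`. It is a NumPy-only stand-in (the name `pick_threshold` and the eight scores are made up for illustration) that scans the same 0.10 to 0.90 grid and shows how a recall floor can cost accuracy. Unlike the helper above, it returns `None` instead of falling back to 0.5, and it keeps the first threshold on accuracy ties.

```python
import numpy as np


def pick_threshold(y_true, y_score, recall_floor, grid=None):
    """Return (threshold, accuracy, recall) with the highest accuracy among
    thresholds whose recall is >= recall_floor, or None if none qualifies."""
    if grid is None:
        grid = np.linspace(0.1, 0.9, 81)  # same grid as the notebook helper
    positives = np.asarray(y_true) == 1
    y_score = np.asarray(y_score)

    best = None
    for t in grid:
        y_pred = y_score >= t
        recall = (y_pred & positives).sum() / positives.sum()
        accuracy = (y_pred == positives).mean()
        if recall >= recall_floor and (best is None or accuracy > best[1]):
            best = (float(t), float(accuracy), float(recall))
    return best


# Made-up labels and scores, chosen so that no score sits exactly on a grid point.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.225, 0.355, 0.405, 0.305, 0.555, 0.705, 0.955])

unconstrained = pick_threshold(y_true, y_score, recall_floor=0.0)
constrained = pick_threshold(y_true, y_score, recall_floor=1.0)
print("no floor:   ", unconstrained)
print("recall=1.0: ", constrained)
```

On this toy data the unconstrained search reaches accuracy 0.875 at recall 0.75, while requiring recall 1.0 lowers the best achievable accuracy to 0.75.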


# ============ RESULT ROW BUILDER ============

from sklearn.metrics import confusion_matrix


def build_result_row(
    exp_id,
    config,
    train_default,
    val_default,
    test_default,
    best_thresh,
    val_tuned,
    test_tuned,
    val_cm,
    test_cm,
    train_time,
    early_stop_epoch,
    converged,
):
    """Build a single result row for the experiments CSV."""
    tn_val, fp_val, fn_val, tp_val = val_cm.ravel()
    tn_test, fp_test, fn_test, tp_test = test_cm.ravel()

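    # Valid only if the tuned threshold keeps recall at or above the floor on both validation and test.
    constraint_met = (val_tuned["recall"] >= RECALL_CONSTRAINT) and (
        test_tuned["recall"] >= RECALL_CONSTRAINT
    )
    # A leader is a valid candidate whose tuned test accuracy exceeds the mlp_v2 baseline.
    is_leader = constraint_met and (test_tuned["accuracy"] > BASELINE_TEST_ACC)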

    return {
        "id": exp_id,
        "hidden_layer_sizes": str(config["hidden_layer_sizes"]),
        "alpha": config["alpha"],
        "learning_rate_init": config["learning_rate_init"],
        "max_iter": config["max_iter"],
        "best_threshold": best_thresh,
        # Default threshold = 0.5 metrics
        "train_acc_05": train_default["accuracy"],
        "train_rec_05": train_default["recall"],
        "train_prec_05": train_default["precision"],
        "train_auc_05": train_default["auc"],
        "val_acc_05": val_default["accuracy"],
        "val_rec_05": val_default["recall"],
        "val_prec_05": val_default["precision"],
        "val_auc_05": val_default["auc"],
        "test_acc_05": test_default["accuracy"],
        "test_rec_05": test_default["recall"],
        "test_prec_05": test_default["precision"],
        "test_auc_05": test_default["auc"],
        # Tuned threshold metrics
        "val_acc_tuned": val_tuned["accuracy"],
        "val_rec_tuned": val_tuned["recall"],
        "val_prec_tuned": val_tuned["precision"],
        "val_auc_tuned": val_tuned["auc"],
        "test_acc_tuned": test_tuned["accuracy"],
        "test_rec_tuned": test_tuned["recall"],
        "test_prec_tuned": test_tuned["precision"],
        "test_auc_tuned": test_tuned["auc"],
        # Confusion matrices at tuned threshold
        "val_tn": tn_val,
        "val_fp": fp_val,
        "val_fn": fn_val,
        "val_tp": tp_val,
        "test_tn": tn_test,
        "test_fp": fp_test,
        "test_fn": fn_test,
        "test_tp": tp_test,
        # Meta
        "train_time": train_time,
        "early_stop_epoch": early_stop_epoch,
        "converged": bool(converged),
        "constraint_met": bool(constraint_met),
        "is_leader": bool(is_leader),
    }


print("Result row builder defined")

In [None]:
# ============ RUN FULL GRID SEARCH ============
from itertools import product

all_results = []
experiment_count = 0
start_time = datetime.now()

print(f"Starting grid search: {total_experiments} experiments")
print("=" * 60)

for hidden, alpha, lr, max_iter in product(
    HIDDEN_LAYER_SIZES, ALPHAS, LEARNING_RATES, MAX_ITERS
):
    experiment_count += 1

    # Generate experiment ID
    exp_id = f"mlp_exp_{start_time.strftime('%Y%m%d_%H%M')}_model{experiment_count:03d}"
    config = {
        "hidden_layer_sizes": hidden,
        "alpha": alpha,
        "learning_rate_init": lr,
        "max_iter": max_iter,
    }

    # Progress
    print(
        f"[{experiment_count}/{total_experiments}] {hidden}, α={alpha}, lr={lr}, max_iter={max_iter}",
        end=" ... ",
    )

    # Train model
    t0 = time.time()
    mlp = MLPClassifier(
        hidden_layer_sizes=hidden,
        alpha=alpha,
        learning_rate_init=lr,
        max_iter=max_iter,
        activation="relu",
        solver="adam",
        early_stopping=True,
        validation_fraction=0.1,
        batch_size="auto",
        random_state=42,
    )

    # Record warnings raised during fit so a ConvergenceWarning can be detected.
    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always")
        mlp.fit(X_train_s, y_train)
    converged = not any(
        warning.category.__name__ == "ConvergenceWarning" for warning in w
    )

    train_time = time.time() - t0
    early_stop_epoch = mlp.n_iter_ if hasattr(mlp, "n_iter_") else "N/A"

    # Evaluate at default threshold 0.5
    train_default = eval_model_at_threshold(mlp, X_train_s, y_train, 0.5)
    val_default = eval_model_at_threshold(mlp, X_val_s, y_val, 0.5)
    test_default = eval_model_at_threshold(mlp, X_test_s, y_test, 0.5)

    # Threshold tuning
    best_thresh, val_tuned = find_best_threshold(
        mlp, X_val_s, y_val, RECALL_CONSTRAINT
    )

    # Validation metrics & confusion at tuned threshold
    y_proba_val = mlp.predict_proba(X_val_s)[:, 1]
    y_pred_val_tuned = (y_proba_val >= best_thresh).astype(int)
    val_cm = confusion_matrix(y_val, y_pred_val_tuned)

    # Test metrics & confusion at tuned threshold
    y_proba_test = mlp.predict_proba(X_test_s)[:, 1]
    y_pred_test_tuned = (y_proba_test >= best_thresh).astype(int)
    test_tuned = compute_metrics(y_test, y_pred_test_tuned, y_proba_test)
    test_cm = confusion_matrix(y_test, y_pred_test_tuned)

    # Build result row and collect
    row = build_result_row(
        exp_id,
        config,
        train_default,
        val_default,
        test_default,
        best_thresh,
        val_tuned,
        test_tuned,
        val_cm,
        test_cm,
        train_time,
        early_stop_epoch,
        converged,
    )
    all_results.append(row)

    status = "✅" if row["is_leader"] else ("⚠️" if row["constraint_met"] else "❌")
    print(
        f"{status} Acc={row['test_acc_tuned']:.4f} Rec={row['test_rec_tuned']:.4f} ({train_time:.1f}s)"
    )

# Build DataFrame
results_df = pd.DataFrame(all_results)

total_time = (datetime.now() - start_time).total_seconds()
print("=" * 60)
print(f"Grid search complete in {total_time:.1f}s")
print(f"Total experiments: {len(results_df)}")
print(
    f"Valid candidates (Recall >= {RECALL_CONSTRAINT}): {int(results_df['constraint_met'].sum())}"
)
print(
    f"Leader candidates (beat baseline): {int(results_df['is_leader'].sum())}"
)

# Save full results as CSV
csv_path = RESULTS_DIR / "all_experiments.csv"
results_df.to_csv(csv_path, index=False)
print(f"Full results saved to: {csv_path}")

# Show top 10 valid candidates (by tuned metrics)
valid_df = results_df[results_df["constraint_met"]].copy()
if not valid_df.empty:
    valid_df = valid_df.sort_values(
        by=["val_acc_tuned", "test_acc_tuned", "test_auc_tuned"],
        ascending=[False, False, False],
    )

    top_10 = valid_df.head(10)[
        [
            "id",
            "hidden_layer_sizes",
            "alpha",
            "learning_rate_init",
            "best_threshold",
            "val_acc_tuned",
            "val_rec_tuned",
            "test_acc_tuned",
            "test_rec_tuned",
            "test_auc_tuned",
            "is_leader",
        ]
    ]
    print("\nTop 10 valid candidates (tuned threshold):")
    print(top_10.round(4).to_string(index=False))
else:
    print("\nNo valid candidates found that meet the recall constraint.")

Starting grid search: 80 experiments
[1/80] (128, 64, 32), α=1e-05, lr=0.001, max_iter=300 ... 

NameError: name 'generate_experiment_markdown' is not defined

# Obsidian leaderboard generation has been moved to a separate script:
# `scripts/generate_mlp_obsidian_notes.py`
# This cell is intentionally left empty.
pass

In [None]:
# Obsidian markdown generation is now handled by `scripts/generate_mlp_obsidian_notes.py`.
# Nothing to do in this cell.
pass


Training candidate: baseline_128-64-32_a1e-4

Training candidate: 256-128-64_a1e-4

Training candidate: 256-128-64_a1e-3

Training candidate: 128-64_a1e-3

MLP small grid – Train/Val/Test metrics
                    Name hidden_layer_sizes  alpha  Train_acc  Train_rec  Val_acc  Val_rec  Test_acc  Test_rec
baseline_128-64-32_a1e-4      (128, 64, 32) 0.0001     0.9436     0.9154   0.9049   0.8345    0.9082    0.8466
        256-128-64_a1e-4     (256, 128, 64) 0.0001     0.9472     0.9143   0.9078   0.8259    0.9124    0.8552
        256-128-64_a1e-3     (256, 128, 64) 0.0010     0.9507     0.9361   0.9119   0.8517    0.9136    0.8793
            128-64_a1e-3          (128, 64) 0.0010     0.9489     0.9191   0.9053   0.8172    0.9111    0.8414

No candidate reached Val & Test recall >= 0.919 (tuned baseline).


# Final analysis and recommendation can be performed using the
# leaderboard generated by the Obsidian script.
# See: obsidian_notes/experiments_mlp/summaries/mlp_leaderboard.md
pass

In [None]:
# Threshold tuning notes and CSV export are now produced directly
# from the results CSV by `scripts/generate_mlp_obsidian_notes.py`.
pass

Chosen threshold = 0.31 with Val Acc = 0.9032 and Val Recall >= 0.919
Test Accuracy: 0.9016 Test Recall: 0.9259


## Next Steps

1. **Open Obsidian** and point it at `CardioDetect/obsidian_notes/`
2. **Review the leaderboard** at `experiments_mlp/summaries/mlp_leaderboard.md`
3. **If a candidate beats baseline**:
   - Open its experiment note for full details
   - Update `00_complete_project_walkthrough.ipynb` with the new best model
   - Save the model to `models/` with a new name
4. **If no candidate beats baseline**:
   - Keep `mlp_v2` as the production model
   - Document findings in `MY_NOTES.md`
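
If a candidate is promoted, the tuned threshold and the fitted scaler need to travel with the model, since predictions depend on all three. The following is a minimal sketch of one way to do that, assuming plain `pickle` is acceptable. The helper names `save_model_bundle` and `load_model_bundle` and the file name `mlp_v3_candidate` are hypothetical, and the dictionaries stand in for the notebook's `mlp` and `scaler` objects.

```python
import pickle
import tempfile
from pathlib import Path


def save_model_bundle(model, scaler, threshold, models_dir, name):
    """Pickle model, scaler, and decision threshold into one file and return its path."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    path = models_dir / f"{name}.pkl"
    if path.exists():
        # Refuse to overwrite an existing model file; pick a new name instead.
        raise FileExistsError(f"{path} already exists; choose a new name")
    bundle = {"model": model, "scaler": scaler, "threshold": float(threshold)}
    with path.open("wb") as fh:
        pickle.dump(bundle, fh)
    return path


def load_model_bundle(path):
    """Load a bundle written by save_model_bundle (only unpickle files you trust)."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Stand-in objects; in the notebook these would be the fitted `mlp` and `scaler`.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model_bundle(
        model={"kind": "stand-in model"},
        scaler={"kind": "stand-in scaler"},
        threshold=0.31,
        models_dir=tmp,
        name="mlp_v3_candidate",  # hypothetical name
    )
    restored = load_model_bundle(saved_path)

print(saved_path.name, restored["threshold"])
```

Keeping the three objects in one file avoids pairing a model with the wrong scaler or threshold later. `joblib.dump` would be a drop-in alternative for the `pickle` calls.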

---

**Grid Search Summary**:
- 80 configurations tested (5 architectures × 4 alphas × 2 LRs × 2 max_iters)
- Threshold tuning applied to each with recall constraint ≥ 0.9190
- Valid candidates ranked by tuned validation accuracy, then tuned test accuracy, then test AUC
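
The ranking rule can also be reproduced from the exported CSV without pandas. This is a hedged sketch: the rows below are made up, the column names follow `build_result_row`, and it assumes the boolean `constraint_met` column is written as the text `True`/`False`, as pandas `to_csv` normally does. The function name `rank_valid_candidates` is hypothetical.

```python
import csv
import io

# Made-up rows for illustration only; real values come from all_experiments.csv.
SAMPLE_CSV = """id,val_acc_tuned,test_acc_tuned,test_auc_tuned,constraint_met
exp_a,0.90,0.91,0.95,True
exp_b,0.92,0.90,0.96,True
exp_c,0.95,0.94,0.97,False
exp_d,0.92,0.91,0.94,True
"""


def rank_valid_candidates(csv_file):
    """Keep rows that met the recall constraint and sort them best-first by
    validation accuracy, then test accuracy, then test AUC."""
    rows = [r for r in csv.DictReader(csv_file) if r["constraint_met"] == "True"]
    return sorted(
        rows,
        key=lambda r: (
            float(r["val_acc_tuned"]),
            float(r["test_acc_tuned"]),
            float(r["test_auc_tuned"]),
        ),
        reverse=True,
    )


ranked_ids = [r["id"] for r in rank_valid_candidates(io.StringIO(SAMPLE_CSV))]
print(ranked_ids)
```

Here `exp_c` is dropped despite having the best accuracy because it fails the recall constraint, and `exp_d` ranks ahead of `exp_b` on the test-accuracy tie-break.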
