# Capstone Project: Model Comparison (GridSearchCV + Cross-Validation)
**Goal:** Compare Logistic Regression vs. XGBoost for the primary capstone classification task (**30-day readmission**).  
This notebook adds **cross-validation** and **GridSearchCV** to satisfy the rubric requirement for hyperparameter search.

**Primary dataset:** UCI *Diabetes 130-US hospitals (1999–2008)* (processed file used in this repo).  
**Target:** `readmitted_binary` (or `readmitted` if your processed file still uses that name).

---
## What this notebook produces
- ROC curves (AUC)
- Confusion matrices
- Classification reports
- A small model comparison table (CSV)
- Figures saved to `../output/readmitted_binary/` (created if missing)


In [None]:

# =========================
# 1) Imports & Setup
# =========================
import os
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

from xgboost import XGBClassifier

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

OUT_DIR = os.path.join("..", "output", "readmitted_binary")
os.makedirs(OUT_DIR, exist_ok=True)

print("Output directory:", os.path.abspath(OUT_DIR))


---
## 2) Load Data
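This notebook looks for the processed dataset in the following locations, in order, and uses the first one found:
- `../Processed Data/processed_diabetes_data.csv` (preferred)
- `../Processed data/processed_diabetes_data.csv` (if the folder name differs in case)
- `../processed_diabetes_data.csv` (parent-directory fallback)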

If your processed file uses a different target column name, update `TARGET_CANDIDATES` below.


In [None]:

# =========================
# 2) Load processed data
# =========================
candidate_paths = [
    os.path.join("..", "Processed Data", "processed_diabetes_data.csv"),
    os.path.join("..", "Processed data", "processed_diabetes_data.csv"),
    os.path.join("..", "processed_diabetes_data.csv"),
]

data_path = None
for p in candidate_paths:
    if os.path.exists(p):
        data_path = p
        break

if data_path is None:
    raise FileNotFoundError(
        "Could not find processed_diabetes_data.csv. Checked:\n" + "\n".join(candidate_paths)
    )

df = pd.read_csv(data_path)
print("Loaded:", data_path)
print("Shape:", df.shape)
df.head()


---
## 3) Define Target and Features
We try common target names automatically. If your processed dataset uses a different name, add it to `TARGET_CANDIDATES`.
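
As a minimal sketch of the binarization applied in the next cell, assuming the raw UCI labels `NO`, `>30`, and `<30`: only a readmission within 30 days counts as the positive class. The toy Series and the names `raw`, `label_map`, and `y_toy` are made up for illustration and are not read from the real file.

```python
import pandas as pd

# Hypothetical toy labels in the raw UCI format (not taken from the real dataset)
raw = pd.Series(["NO", "<30", ">30", "NO", "<30"])

# Only a readmission within 30 days is treated as the positive class (1)
label_map = {"NO": 0, "<30": 1, ">30": 0}
y_toy = raw.map(label_map)

print(y_toy.tolist())
print("Positive rate:", y_toy.mean())
```

If the processed file already stores 0/1 values, this mapping is skipped and the column is used as-is.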


In [None]:

# =========================
# 3) Define target + features
# =========================
TARGET_CANDIDATES = ["readmitted_binary", "readmitted", "Readmitted", "target"]

target_col = None
for c in TARGET_CANDIDATES:
    if c in df.columns:
        target_col = c
        break

if target_col is None:
    raise ValueError(
        "Target column not found. Add your target name to TARGET_CANDIDATES. "
        f"Available columns (first 30): {list(df.columns)[:30]}"
    )

y = df[target_col].copy()

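# Ensure y is binary {0, 1}: only a readmission within 30 days is the positive class
if not pd.api.types.is_numeric_dtype(y):
    y = y.map({"NO": 0, "<30": 1, ">30": 0}).fillna(y)

y = pd.to_numeric(y, errors="coerce")

# Fail early: NaN labels would otherwise break the stratified split and model fitting
if y.isna().any():
    raise ValueError(f"{int(y.isna().sum())} target values could not be converted to numbers.")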

print("Target column:", target_col)
print("Unique target values:", sorted(pd.Series(y).dropna().unique().tolist()))

X = df.drop(columns=[target_col]).copy()
X = X.dropna(axis=1, how="all").fillna(0)

print("X shape:", X.shape)


---
## 4) Train/Test Split
We use an 80/20 **stratified** split so that the training and test sets keep the same positive (readmitted) rate as the full dataset.
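
The effect of `stratify` can be checked on a toy example. This is only a sketch: the labels below are hypothetical (10% positives), and `X_demo` / `y_demo` are illustrative names, not the capstone data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 100 positives, 900 negatives (illustrative only)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,  # keep the class ratio the same in both parts
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

Without `stratify`, the two rates can drift apart by chance, which matters more when the positive class is rare.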


In [None]:

# =========================
# 4) Train / test split
# =========================
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y
)

print("Train:", X_train.shape, " Test:", X_test.shape)
print("Train positive rate:", float(pd.Series(y_train).mean()))
print("Test positive rate:", float(pd.Series(y_test).mean()))


---
## 5) GridSearchCV + Cross-Validation (Rubric Requirement)
To satisfy the rubric:
- **Multiple models:** Logistic Regression + XGBoost  
- **Cross-validation:** 5-fold `StratifiedKFold`  
- **GridSearchCV:** hyperparameter tuning for each model  
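- **Metric:** ROC-AUC (threshold-independent; it scores how well the model ranks readmitted patients above non-readmitted ones, so unlike accuracy it is not inflated by always predicting the majority class)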
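
As a rough cost check before running the search, the number of model fits can be counted with scikit-learn's `ParameterGrid`. The grids below are copied from the next cell; `N_SPLITS` and `fit_counts` are illustrative names, and the `+ 1` assumes `GridSearchCV`'s default `refit=True`.

```python
from sklearn.model_selection import ParameterGrid

N_SPLITS = 5  # matches the 5-fold StratifiedKFold used in this notebook

# Grids copied from the GridSearchCV cell
lr_param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__penalty": ["l2"],
    "clf__solver": ["lbfgs"],
}
xgb_param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "reg_lambda": [1.0, 5.0],
}

fit_counts = {}
for name, grid in [("Logistic Regression", lr_param_grid), ("XGBoost", xgb_param_grid)]:
    n_candidates = len(ParameterGrid(grid))
    # Every candidate is fit once per fold, plus one final refit on all training data
    fit_counts[name] = n_candidates * N_SPLITS + 1
    print(f"{name}: {n_candidates} candidates -> {fit_counts[name]} fits")
```

The XGBoost grid dominates the runtime, so it is the first place to trim if the search is too slow.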


In [None]:

# =========================
# 5) GridSearchCV with Cross-Validation
# =========================
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scoring = "roc_auc"

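# Logistic Regression pipeline. with_mean=False scales each feature to unit variance
# without centering, which also works if the features are passed as a sparse matrix.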
lr_pipe = Pipeline([
    ("scaler", StandardScaler(with_mean=False)),
    ("clf", LogisticRegression(max_iter=2000, random_state=RANDOM_STATE))
])

lr_param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__penalty": ["l2"],
    "clf__solver": ["lbfgs"]
}

lr_grid = GridSearchCV(
    estimator=lr_pipe,
    param_grid=lr_param_grid,
    scoring=scoring,
    cv=cv,
    n_jobs=-1
)

# XGBoost classifier
xgb = XGBClassifier(
    random_state=RANDOM_STATE,
    eval_metric="logloss",
    tree_method="hist",
)

xgb_param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "reg_lambda": [1.0, 5.0],
}

xgb_grid = GridSearchCV(
    estimator=xgb,
    param_grid=xgb_param_grid,
    scoring=scoring,
    cv=cv,
    n_jobs=-1
)

print("Fitting Logistic Regression GridSearch...")
lr_grid.fit(X_train, y_train)
print("Best LR params:", lr_grid.best_params_)
print("Best LR CV ROC-AUC:", round(lr_grid.best_score_, 4))

print("\nFitting XGBoost GridSearch...")
xgb_grid.fit(X_train, y_train)
print("Best XGB params:", xgb_grid.best_params_)
print("Best XGB CV ROC-AUC:", round(xgb_grid.best_score_, 4))


---
## 6) Evaluate Best Models on the Held-Out Test Set
We report ROC-AUC, confusion matrices, and classification reports.  
Artifacts are saved into `../output/readmitted_binary/`.
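
The reported metrics differ in one important way: ROC-AUC is computed from the predicted probabilities, while the confusion matrix, precision, and recall depend on the cutoff used to turn probabilities into labels. The sketch below illustrates this on hypothetical scores (the arrays and the `summarize` helper are made up for illustration); it assumes that `predict` corresponds to a 0.5 probability cutoff for both models.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90])

# ROC-AUC uses the scores directly, so no threshold is involved
auc = roc_auc_score(y_true, y_prob)
print("ROC-AUC:", round(auc, 4))

def summarize(threshold):
    """Confusion-matrix counts plus precision/recall at a given cutoff."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp),
            "precision": float(precision), "recall": float(recall)}

at_half = summarize(0.5)   # cutoff assumed to match predict()
at_lower = summarize(0.3)  # a lower cutoff flags more patients
print("threshold 0.5:", at_half)
print("threshold 0.3:", at_lower)
```

Lowering the cutoff trades precision for recall, while the ROC-AUC stays the same because the ranking of patients is unchanged.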


In [None]:

# =========================
# 6) Evaluation helpers
# =========================
def save_confusion_matrix(cm, title, filename):
    plt.figure(figsize=(6, 5))
    # Light-to-dark colormap so the white/black cell text chosen below stays readable
    plt.imshow(cm, interpolation="nearest", cmap="Blues")
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(2)
    plt.xticks(tick_marks, ["0", "1"])
    plt.yticks(tick_marks, ["0", "1"])
    thresh = cm.max() / 2.0
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, format(int(cm[i, j]), "d"),
                     ha="center", va="center",
                     color="white" if cm[i, j] > thresh else "black")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.tight_layout()
    plt.savefig(filename, dpi=150)
    plt.close()

def save_roc_curve(y_true, y_prob, title, filename):
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    auc_val = roc_auc_score(y_true, y_prob)
    plt.figure(figsize=(7, 6))
    plt.plot(fpr, tpr, label=f"AUC = {auc_val:.4f}")
    plt.plot([0, 1], [0, 1], linestyle="--")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title(title)
    plt.legend(loc="lower right")
    plt.tight_layout()
    plt.savefig(filename, dpi=150)
    plt.close()
    return auc_val

def evaluate_and_save(name, model, X_test, y_test):
    y_pred = model.predict(X_test)
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
    else:
        y_prob = model.decision_function(X_test)
        y_prob = (y_prob - y_prob.min()) / (y_prob.max() - y_prob.min() + 1e-9)

    auc_val = roc_auc_score(y_test, y_prob)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)

    print(f"\n=== {name} (Test Set) ===")
    print(f"ROC-AUC: {auc_val:.4f}")
    print(f"Accuracy: {acc:.4f} | Precision: {prec:.4f} | Recall: {rec:.4f} | F1: {f1:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, digits=4))

    cm = confusion_matrix(y_test, y_pred)
    cm_path = os.path.join(OUT_DIR, f"confusion_matrix_{name}.png")
    save_confusion_matrix(cm, f"Confusion Matrix - {name}", cm_path)

    roc_path = os.path.join(OUT_DIR, f"roc_{name}.png")
    save_roc_curve(y_test, y_prob, f"ROC Curve - {name}", roc_path)

    return {"Model": name, "ROC_AUC": auc_val, "Accuracy": acc, "Precision": prec, "Recall": rec, "F1": f1}

best_lr = lr_grid.best_estimator_
best_xgb = xgb_grid.best_estimator_

results = []
results.append(evaluate_and_save("LogReg_GridSearch", best_lr, X_test, y_test))
results.append(evaluate_and_save("XGB_GridSearch", best_xgb, X_test, y_test))

results_df = pd.DataFrame(results).sort_values("ROC_AUC", ascending=False)
results_df


---
## 7) Save Summary Table + Optional Feature Importance (XGBoost)


In [None]:

# =========================
# 7) Save summary + feature importance
# =========================
summary_path = os.path.join(OUT_DIR, "model_comparison_summary_gridsearch.csv")
results_df.to_csv(summary_path, index=False)
print("Saved:", os.path.abspath(summary_path))

# Feature importance (Top 15) for XGBoost
if hasattr(best_xgb, "feature_importances_"):
    importances = pd.Series(best_xgb.feature_importances_, index=X.columns).sort_values(ascending=False).head(15)
    fi_csv = os.path.join(OUT_DIR, "feature_importance_XGB_GridSearch_top15.csv")
    importances.to_csv(fi_csv, header=["importance"])

    plt.figure(figsize=(10, 6))
    importances.sort_values().plot(kind="barh")
    plt.title("Top 15 Feature Importances - XGBoost (GridSearch Best Model)")
    plt.xlabel("Importance")
    plt.tight_layout()
    fi_png = os.path.join(OUT_DIR, "feature_importance_XGB_GridSearch_top15.png")
    plt.savefig(fi_png, dpi=150)
    plt.close()

    print("Saved:", os.path.abspath(fi_csv))
    print("Saved:", os.path.abspath(fi_png))


---
## 8) Interpretation (Non-Technical Takeaway)
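- **ROC-AUC** measures how well the model ranks high-risk patients above low-risk ones across all decision thresholds, which is helpful when classes are imbalanced.
- **GridSearchCV + 5-fold stratified CV** reduces the risk of tuning hyperparameters to the quirks of a single validation split; the held-out test set is used only for the final comparison.
- **XGBoost feature importance** shows which features the best model relies on most. It does not indicate the direction of an effect or causation, so treat it as a starting point for discharge-planning discussions rather than a prescription.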
