# Phase 4: Model Tuning & Explainability — Predicting 30-Day Hospital Readmission

**Goal:**  
Improve baseline model performance from Phase 3 through hyperparameter tuning, probability calibration, and model explainability.

**Key steps in this notebook:**
1. **Load cleaned dataset** from Phase 2.
2. **Reuse preprocessing pipeline** from Phase 3 (encoding + scaling).
3. **Hyperparameter tuning**:
   - Logistic Regression (`GridSearchCV`, scoring by Precision-Recall AUC)
   - Random Forest (`GridSearchCV`, scoring by Precision-Recall AUC)
4. **Evaluate tuned models**:
   - ROC AUC
   - Precision-Recall AUC
   - Best decision threshold (F1-optimized)
5. **Probability calibration** (isotonic) for trustworthy risk scores.
6. **Explainability with SHAP**:
   - Global feature importance
   - Individual prediction explanations
7. **Save final model and configuration** for deployment.
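
The evaluation in step 4 can be sketched as follows. This is a minimal illustration, not the notebook's own evaluation code: the labels and scores are made up, `best_f1_threshold` is a hypothetical helper, and `average_precision_score` is used as scikit-learn's step-wise summary of the precision-recall curve, assumed here to stand in for "Precision-Recall AUC".

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
)


def best_f1_threshold(y_true, y_score):
    """Hypothetical helper: return (threshold, f1) maximizing F1 on the PR curve."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall have one more entry than thresholds; drop the final
    # (precision=1, recall=0) point, which has no associated threshold.
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    # Guard against 0/0 where both precision and recall are zero.
    f1 = np.divide(
        2 * precision * recall, denom, out=np.zeros_like(denom), where=denom > 0
    )
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])


# Toy labels and scores, invented for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])

roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)
threshold, f1 = best_f1_threshold(y_true, y_score)
print(f"ROC AUC={roc_auc:.3f}  PR AUC={pr_auc:.3f}  threshold={threshold:.2f}  F1={f1:.3f}")
```

In practice the threshold would be chosen on validation data rather than on the test set, so that the reported test metrics stay unbiased.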

**Why this matters:**  
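Tuned, calibrated, and interpretable models are essential for healthcare use cases.  
Tuning and threshold selection aim to improve recall of high-risk patients, calibration makes the predicted probabilities usable as risk scores, and SHAP explanations show clinicians *why* the model predicts a readmission, which supports trust and adoption.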


In [None]:
import pandas as pd
from pathlib import Path
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


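# Load the cleaned dataset produced in Phase 2.
df_clean = pd.read_csv(Path("../data/diabetic_data_clean.csv"))
TARGET = "readmit_30"

# Drop identifiers, the raw `readmitted` label, and the target from the features.
# errors="ignore" skips any of these columns that are not present.
X = df_clean.drop(
    columns=["encounter_id", "patient_nbr", "readmitted", TARGET], errors="ignore"
)
y = df_clean[TARGET]

# Treat object-dtype columns as categorical and everything else as numeric.
cat_cols = X.select_dtypes(include="object").columns.tolist()
num_cols = X.select_dtypes(exclude="object").columns.tolist()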

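# Same preprocessing as Phase 3: one-hot encode categoricals, scale numerics.
preprocess = ColumnTransformer(
    [
        # Unseen categories at predict time are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), cat_cols),
        # with_mean=False scales to unit variance without centering.
        ("num", StandardScaler(with_mean=False), num_cols),
    ],
    remainder="drop",      # discard any column not listed above
    sparse_threshold=1.0,  # keep the stacked output sparse
)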
