# Modeling — Predicting 30-Day Hospital Readmission

**Goal:**  
Build and evaluate machine learning models to predict whether a patient will be readmitted to the hospital within 30 days of discharge, using the cleaned dataset.

**Key steps in this notebook:**
1. **Load cleaned dataset**.
2. **Separate features & target** (`readmit_30`).
3. **Remove identifiers and leakage columns**.
4. **Preprocess features**:
   - One-Hot Encode categorical variables.
   - Scale numeric features for linear models.
5. **Split data** with stratification so the train and test sets keep the same readmission rate.
6. **Baseline models**:
   - Logistic Regression (`class_weight='balanced'`)
   - Random Forest (`class_weight='balanced_subsample'`)
7. **Evaluate models** using ROC AUC and Precision-Recall AUC; the latter is more informative when positives are rare (about 11% of encounters here).
8. **Tune decision threshold** to balance recall and precision.
9. **Save trained models** for future use (e.g., dashboard or app).
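
Steps 4 to 8 above can be sketched end to end as follows. This is a minimal illustration and not the notebook's actual pipeline. It runs on a small synthetic frame whose column names (`time_in_hospital`, `num_medications`, `age_group`) and values are assumptions made for the example. Choosing the threshold by maximum F1 is likewise just one possible way to balance recall and precision.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score,
                             precision_recall_curve, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "time_in_hospital": rng.integers(1, 15, size=n),
    "num_medications": rng.integers(1, 40, size=n),
    "age_group": rng.choice(["<40", "40-70", ">70"], size=n),
})
# Rare positive class whose rate rises with length of stay.
p = 0.03 + 0.015 * demo["time_in_hospital"]
demo_y = pd.Series((rng.random(n) < p).astype(int), name="readmit_30")

# Step 4: scale numeric columns, one-hot encode categorical ones.
num_cols = demo.select_dtypes(include="number").columns.tolist()
cat_cols = demo.select_dtypes(exclude="number").columns.tolist()
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Step 5: stratified split keeps the positive rate equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo, demo_y, test_size=0.2, stratify=demo_y, random_state=42
)

# Step 6: baseline logistic regression with balanced class weights.
clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Step 7: ROC AUC and average precision (a common PR AUC summary).
roc = roc_auc_score(y_te, proba)
pr_auc = average_precision_score(y_te, proba)

# Step 8: pick the threshold with the highest F1 on the PR curve.
# precision/recall have one more entry than thresholds, hence [:-1].
precision, recall, thresholds = precision_recall_curve(y_te, proba)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

print(f"ROC AUC={roc:.3f}  PR AUC={pr_auc:.3f}  threshold={best_threshold:.3f}")
```

Wrapping the preprocessing and the model in one `Pipeline` means the scaler and encoder are fitted on the training split only, which avoids leaking test-set statistics into the model.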

**Why this matters:**
Accurately identifying patients at high risk of 30-day readmission can help hospitals target interventions, reduce costs, and improve patient outcomes.


import pandas as pd
from pathlib import Path

# Load the cleaned dataset 
clean_path = Path("../data/diabetic_data_clean.csv")
df_clean = pd.read_csv(clean_path)

# Work on a modeling copy so df_clean stays pristine
df_model = df_clean.copy()

print(df_model.shape)
df_model.head(3)


TARGET = 'readmit_30'

# Identifier and leakage columns to drop from the features:
# - encounter_id and patient_nbr are identifiers, not clinical features
#   (patient_nbr can repeat when a patient has several encounters)
# - readmitted is the raw multi-class label that readmit_30 was derived from
id_or_leak_cols = [c for c in ['encounter_id','patient_nbr','readmitted'] if c in df_model.columns]

X = df_model.drop(columns=id_or_leak_cols + [TARGET])
y = df_model[TARGET]

print("X shape:", X.shape, "y positive rate:", y.mean().round(4))


(101766, 48)
X shape: (101766, 44) y positive rate: 0.1116
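
The drop from 48 to 44 columns matches the three identifier/label columns plus the target. A small guard like the one below can make that check explicit. It is a sketch on a toy frame with made-up values. It also assumes the target was derived as `readmitted == '<30'`, which is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for df_model; values are made up for illustration.
toy = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4, 5],
    "patient_nbr": [10, 10, 11, 12, 12],
    "time_in_hospital": [3, 7, 2, 5, 9],
    "readmitted": ["NO", "<30", ">30", "NO", "<30"],
})
# Assumption: the target was built as "raw label equals '<30'".
toy["readmit_30"] = (toy["readmitted"] == "<30").astype(int)

TARGET = "readmit_30"
leak_cols = ["encounter_id", "patient_nbr", "readmitted"]

X_toy = toy.drop(columns=leak_cols + [TARGET])
y_toy = toy[TARGET]

# None of the identifier, raw-label, or target columns should remain in X.
leftover = set(leak_cols + [TARGET]) & set(X_toy.columns)
assert not leftover, f"leakage columns still present: {leftover}"

# patient_nbr repeats across encounters, so it identifies patients, not rows.
repeat_patients = int((toy["patient_nbr"].value_counts() > 1).sum())

print("features:", list(X_toy.columns))
print("positive rate:", round(float(y_toy.mean()), 4))
print("patients with more than one encounter:", repeat_patients)
```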
