# Modeling — Predicting 30-Day Hospital Readmission

**Goal:**  
Build and evaluate machine learning models to predict whether a patient will be readmitted to the hospital within 30 days of discharge, using the cleaned dataset.

**Key steps in this notebook:**
1. **Load cleaned dataset**.
2. **Separate features & target** (`readmit_30`).
3. **Remove identifiers and leakage columns**.
4. **Preprocess features**:
   - One-Hot Encode categorical variables.
   - Scale numeric features for linear models.
5. **Split data** with stratification so the train and test sets keep the same readmission rate.
6. **Baseline models**:
   - Logistic Regression (`class_weight='balanced'`)
   - Random Forest (`class_weight='balanced_subsample'`)
7. **Evaluate models** using ROC AUC and Precision-Recall AUC (reported as average precision).
8. **Tune decision threshold** to balance recall and precision.
9. **Save trained models** for future use (e.g., dashboard or app).

**Why this matters:**
Accurately identifying patients at high risk of 30-day readmission can help hospitals target interventions, reduce costs, and improve patient outcomes.
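
Step 6 lists a Random Forest baseline with `class_weight='balanced_subsample'`. A minimal sketch of such a pipeline follows. It runs on synthetic stand-in data, not the real dataset: the feature columns `age_band`, `n_inpatient`, and `time_in_hospital`, and the names `demo`, `prep_rf`, and `model_rf`, are assumptions made for illustration, as are `n_estimators=200` and the simulated readmission rate.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (hypothetical columns, NOT the real dataset)
rng = np.random.default_rng(42)
n = 2000
demo = pd.DataFrame({
    "age_band": rng.choice(["young", "middle", "older"], size=n),
    "n_inpatient": rng.poisson(1.0, size=n),
    "time_in_hospital": rng.integers(1, 15, size=n),
})
# Rare positive class, loosely tied to prior inpatient visits
logit = -3.0 + 0.6 * demo["n_inpatient"]
demo["readmit_30"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_demo = demo.drop(columns="readmit_30")
y_demo = demo["readmit_30"]

# Stratified split, mirroring the split used for the logistic regression
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Tree ensembles do not depend on feature scale, so numeric columns pass through unscaled
prep_rf = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["age_band"])],
    remainder="passthrough",
)

model_rf = Pipeline(steps=[
    ("prep", prep_rf),
    ("clf", RandomForestClassifier(
        n_estimators=200,                     # illustrative value, not tuned
        class_weight="balanced_subsample",    # reweights classes within each bootstrap sample
        random_state=42,
        n_jobs=-1,
    )),
])
model_rf.fit(X_tr, y_tr)

proba_rf = model_rf.predict_proba(X_te)[:, 1]
rf_roc_auc = roc_auc_score(y_te, proba_rf)
rf_pr_auc = average_precision_score(y_te, proba_rf)
print(f"RF ROC AUC: {rf_roc_auc:.4f}")
print(f"RF PR AUC (AP): {rf_pr_auc:.4f}")
```

On the real data, the same pipeline shape would presumably reuse the notebook's own categorical and numeric column lists rather than the hypothetical columns above.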


In [None]:
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score


# Load the cleaned dataset 
clean_path = Path("../data/diabetic_data_clean.csv")
df_clean = pd.read_csv(clean_path)

# Work on a modeling copy so df_clean stays pristine
df_model = df_clean.copy()

print(df_model.shape)
df_model.head(3)


TARGET = 'readmit_30'

# IDs / high-risk leakage columns to drop from features:
# - encounter_id & patient_nbr are identifiers, not clinical features
#   (patient_nbr can repeat when a patient has several encounters)
# - readmitted is the raw multi-class label the target was derived from
id_or_leak_cols = [c for c in ['encounter_id','patient_nbr','readmitted'] if c in df_model.columns]

X = df_model.drop(columns=id_or_leak_cols + [TARGET])
y = df_model[TARGET]

print("X shape:", X.shape, "y positive rate:", y.mean().round(4))

# Column typing & encoders

# Identify column types
cat_cols = X.select_dtypes(include='object').columns.tolist()
num_cols = X.select_dtypes(exclude='object').columns.tolist()

print("Categorical:", len(cat_cols), "| Numeric:", len(num_cols))

# Build preprocessors
categorical = OneHotEncoder(handle_unknown='ignore', sparse_output=True)
numeric = StandardScaler(with_mean=False)  # with_mean=False because we may use sparse matrices

preprocess = ColumnTransformer(
    transformers=[
        ('cat', categorical, cat_cols),
        ('num', numeric, num_cols),
    ],
    remainder='drop',
    sparse_threshold=1.0
)

# Stratified split because of imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

X_train.shape, X_test.shape, y_train.mean().round(4), y_test.mean().round(4)

model_lr = Pipeline(steps=[
    ('prep', preprocess),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

model_lr.fit(X_train, y_train)

proba_lr = model_lr.predict_proba(X_test)[:,1]
# f-string formatting works whether the metric is a Python float or a NumPy scalar
print(f"LR ROC AUC: {roc_auc_score(y_test, proba_lr):.4f}")
print(f"LR PR AUC (AP): {average_precision_score(y_test, proba_lr):.4f}")


(101766, 48)
X shape: (101766, 44) y positive rate: 0.1116
Categorical: 33 | Numeric: 11


((81412, 44), (20354, 44), np.float64(0.1116), np.float64(0.1116))
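
Step 8 (tuning the decision threshold) does not appear in the code above. One possible approach, sketched here under stated assumptions, is to pick the threshold with the highest precision among those that still reach a minimum recall. The helper `pick_threshold`, the `min_recall=0.6` floor, and the synthetic scores are all hypothetical and only illustrate the mechanics. They are not results from the notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, proba, min_recall=0.6):
    """Return (threshold, precision, recall) with the best precision
    among thresholds whose recall is at least min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, proba)
    # precision/recall have one more entry than thresholds; drop the
    # final (precision=1, recall=0) point so the arrays line up
    precision, recall = precision[:-1], recall[:-1]
    ok = recall >= min_recall
    if not ok.any():
        raise ValueError("no threshold reaches the requested recall")
    candidates = np.flatnonzero(ok)
    best = candidates[np.argmax(precision[ok])]
    return float(thresholds[best]), float(precision[best]), float(recall[best])


# Synthetic scores for illustration only (roughly 11% positives)
rng = np.random.default_rng(0)
y_demo = (rng.random(5000) < 0.11).astype(int)
proba_demo = np.clip(0.3 + 0.2 * y_demo + rng.normal(0, 0.15, size=5000), 0, 1)

thr, prec, rec = pick_threshold(y_demo, proba_demo, min_recall=0.6)
print(f"threshold={thr:.3f} precision={prec:.3f} recall={rec:.3f}")

# Apply the chosen threshold to get hard predictions
pred_demo = (proba_demo >= thr).astype(int)
```

In the notebook, the same helper could be called with `y_test` and `proba_lr`. Choosing the threshold on a separate validation split, rather than the test set, would keep the reported test metrics from being optimistic.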