## Predicting Hospital Readmissions Using Integrated Patient, Clinical, and Socioeconomic Data
 
1.2.1 🎯 Project Objective:
To develop a predictive model for 30-day hospital readmission risk by merging and cleaning patient demographics, clinical encounter data, and socioeconomic data. The goal is to help hospitals reduce readmissions, improve patient outcomes, and reduce costs.
### J. Casey Brookshier
### 7/21/2025

## "Hospital Quality Forecasting: Data-Driven Insights into Readmission Penalties"
Recommended Workflow: Clean First, Then Integrate
## In short: Clean → Standardize → Aggregate → Integrate → Analyze
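
As a rough illustration of the clean, standardize, aggregate, integrate sequence named above, the following sketch runs each step on two tiny made-up tables. Every column name (`measure`, `excess_ratio`, `adi_national_rank`), every value, and the helper functions are assumptions for the example and are not taken from the project's actual source files.

```python
import pandas as pd

# Hypothetical raw extracts; column names and values are illustrative only.
readmissions = pd.DataFrame({
    "Facility ID": [" 010001", "010001", "010005 ", "010005", "010005"],
    "measure": ["AMI", "HF", "AMI", "HF", "HF"],
    "excess_ratio": ["1.05", "0.98", "Not Available", "1.10", "1.10"],
})
deprivation = pd.DataFrame({
    "facility_id": ["010001", "010005"],
    "adi_national_rank": [62, 85],
})


def clean(df):
    """Clean: trim keys, coerce numeric text, drop duplicates and unusable rows."""
    out = df.copy()
    out["Facility ID"] = out["Facility ID"].str.strip()
    out["excess_ratio"] = pd.to_numeric(out["excess_ratio"], errors="coerce")
    return out.drop_duplicates().dropna(subset=["excess_ratio"])


def standardize(df):
    """Standardize: use one naming convention so join keys match across sources."""
    return df.rename(columns={"Facility ID": "facility_id"})


def aggregate(df):
    """Aggregate: collapse measure-level rows to one row per facility."""
    return df.groupby("facility_id", as_index=False).agg(
        mean_excess_ratio=("excess_ratio", "mean"),
        n_measures=("measure", "nunique"),
    )


def integrate(left, right):
    """Integrate: join per-facility tables; validate= guards against row fan-out."""
    return left.merge(right, on="facility_id", how="left", validate="one_to_one")


# Analyze: the integrated table is what a model or summary would consume.
analytic = integrate(aggregate(standardize(clean(readmissions))), deprivation)
print(analytic)
```

Cleaning each source before the join keeps stray whitespace in keys and placeholder strings such as "Not Available" from silently dropping or duplicating facilities at the integration step.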


In [None]:
# Hospital Readmission Risk Forecasting

## Objective
Predict hospital-level 30-day readmission risk using publicly available
CMS readmission metrics, healthcare-associated infection indicators,
and socioeconomic deprivation (ADI).

## Business Value
• Identify facilities at risk of CMS readmission penalties  
• Support targeted quality improvement initiatives  
• Enable data-informed policy and administrative decisions


In [None]:
hospital_readmission_forecasting/
│
├── data/
│   └── hospital_readmissions_analytic_table.csv   ← created earlier
│
├── artifacts/
│   ├── random_forest_model.pkl
│   ├── feature_names.pkl
│   └── imputer.pkl
│
├── src/
│   └── train_model.py   ← this code
│
└── README.md
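
The files under `artifacts/` (the training script also writes an `imputer.pkl`) are meant to be reused after training. The sketch below shows one way they might be loaded and applied to new rows. The helpers `load_artifacts` and `score`, the synthetic data, and the column names `adi`, `hai_1`, `hai_2` are all hypothetical; the demo writes stand-in pickles to a temporary directory so it runs without the real files.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer


def load_artifacts(artifact_dir):
    """Load the model, feature list, and imputer pickles from artifact_dir."""
    loaded = {}
    for name in ("random_forest_model", "feature_names", "imputer"):
        with open(Path(artifact_dir) / f"{name}.pkl", "rb") as f:
            loaded[name] = pickle.load(f)
    return loaded


def score(new_df, artifacts):
    """Align columns to the training order, impute, and predict."""
    features = artifacts["feature_names"]
    X_new = new_df.reindex(columns=features)  # absent columns become NaN
    X_new = pd.DataFrame(
        artifacts["imputer"].transform(X_new), columns=features, index=new_df.index
    )
    return artifacts["random_forest_model"].predict(X_new)


# Demo with synthetic data (assumed names) so the sketch is self-contained.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(50, 3)), columns=["adi", "hai_1", "hai_2"])
target = train["adi"] * 2 + rng.normal(scale=0.1, size=50)

with tempfile.TemporaryDirectory() as tmp:
    imputer = SimpleImputer(strategy="mean").fit(train)
    model = RandomForestRegressor(n_estimators=10, random_state=0).fit(train, target)
    for name, obj in [("random_forest_model", model),
                      ("feature_names", list(train.columns)),
                      ("imputer", imputer)]:
        with open(Path(tmp) / f"{name}.pkl", "wb") as f:
            pickle.dump(obj, f)

    artifacts = load_artifacts(tmp)
    # New rows may arrive with columns missing or in a different order.
    new_rows = pd.DataFrame({"hai_1": [0.1, np.nan], "adi": [1.0, -1.0]})
    preds = score(new_rows, artifacts)

print(preds)
```

Reindexing against the saved feature list keeps the column order identical to training, which matters because the model and the imputer both expect the columns in the order they were fitted on. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.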



In [None]:
# ============================================================
# Hospital Readmission Forecasting – Model Training Script
# ============================================================
# Author: J. Casey Brookshier
# Purpose: Train and evaluate readmission risk models
# Inputs: Pre-built analytic dataset
# Outputs: Trained model artifacts
# ============================================================

import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
import pickle

# ============================================================
# CONFIG (RELATIVE PATHS)
# ============================================================

PROJECT_ROOT = Path(__file__).resolve().parents[1]

DATA_DIR     = PROJECT_ROOT / "data"
ARTIFACT_DIR = PROJECT_ROOT / "artifacts"

ANALYTIC_DATA = DATA_DIR / "hospital_readmissions_analytic_table.csv"

# Substrings identifying columns that would leak the target into the features
LEAKAGE_KEYWORDS = [
    "predicted_readmission_rate",
    "expected_readmission_rate",
]

# ============================================================
# LOAD DATA
# ============================================================

df = pd.read_csv(ANALYTIC_DATA)

print(f"✅ Loaded analytic dataset: {df.shape}")

# ============================================================
# MODEL PREPARATION
# ============================================================

TARGET = "composite_readmission_score"

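# Exclude identifiers, the target itself, and any column whose name
# contains one of the LEAKAGE_KEYWORDS substrings
drop_cols = (
    ["Facility ID", "Facility Name", "State", TARGET]
    + [c for c in df.columns if any(k in c for k in LEAKAGE_KEYWORDS)]
)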

X = df.drop(columns=drop_cols)
y = df[TARGET]

# Drop all-null columns (sparse infection metrics)
X = X.dropna(axis=1, how="all")

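# Impute remaining missing values with column means.
# Note: the imputer is fit on all rows before the split, so test-set values
# influence the imputed means; fitting on the training rows only would give
# a stricter evaluation.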
imputer = SimpleImputer(strategy="mean")
X_imputed = pd.DataFrame(
    imputer.fit_transform(X),
    columns=X.columns,
    index=X.index,
)

print(f"✅ Modeling matrix: {X_imputed.shape}")

# ============================================================
# TRAIN / TEST SPLIT
# ============================================================

X_train, X_test, y_train, y_test = train_test_split(
    X_imputed,
    y,
    test_size=0.2,
    random_state=42,
)

# ============================================================
# TRAIN MODELS
# ============================================================

lr = LinearRegression()
lr.fit(X_train, y_train)

rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1,
)
rf.fit(X_train, y_train)

# ============================================================
# EVALUATION
# ============================================================

def evaluate(model, X_test, y_test):
    """Return RMSE and R2 for a fitted model on the held-out test set."""
    preds = model.predict(X_test)
    return {
        "RMSE": np.sqrt(mean_squared_error(y_test, preds)),
        "R2": r2_score(y_test, preds),
    }

print("\n📊 Model Performance")
print("Linear Regression:", evaluate(lr, X_test, y_test))
print("Random Forest:", evaluate(rf, X_test, y_test))

# Cross-validation (Random Forest)
cv_rmse = np.sqrt(
    -cross_val_score(
        rf,
        X_imputed,
        y,
        cv=5,
        scoring="neg_mean_squared_error",
    )
)

print("\n📈 Random Forest CV RMSE")
print("Mean:", round(cv_rmse.mean(), 4), "Std:", round(cv_rmse.std(), 4))

# ============================================================
# SAVE ARTIFACTS
# ============================================================

# Create the artifact directory (and any missing parents) if needed
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)

with open(ARTIFACT_DIR / "random_forest_model.pkl", "wb") as f:
    pickle.dump(rf, f)

with open(ARTIFACT_DIR / "feature_names.pkl", "wb") as f:
    pickle.dump(list(X_imputed.columns), f)

with open(ARTIFACT_DIR / "imputer.pkl", "wb") as f:
    pickle.dump(imputer, f)

print(f"\n✅ Model artifacts saved to {ARTIFACT_DIR}")
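
The script above imputes once on the full table and then cross-validates on the imputed matrix. One possible variant, sketched below, wraps the imputer and the random forest in a scikit-learn `Pipeline` so that the imputation means are re-learned inside each training fold. This is not the author's method, only an alternative under stated assumptions: the data is synthetic, and the names `X_demo`, `y_demo`, `adi`, and `hai_*` are placeholders rather than columns from the analytic table.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the analytic table; names and values are illustrative.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(200, 4)), columns=["adi", "hai_1", "hai_2", "hai_3"]
)
y_demo = 1.5 * X_demo["adi"] - 0.5 * X_demo["hai_1"] + rng.normal(scale=0.2, size=200)
X_demo = X_demo.mask(rng.random(X_demo.shape) < 0.1)  # inject ~10% missing values

# Imputer and model are fit together, so each CV fold learns its own means
# from that fold's training rows only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# Shuffled folds avoid depending on the row order of the input file.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    pipe, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
cv_rmse_demo = -scores  # scorer returns negated RMSE

print("Mean:", round(cv_rmse_demo.mean(), 4), "Std:", round(cv_rmse_demo.std(), 4))
```

A fitted pipeline can also be pickled as a single object, which would replace the separate model and imputer artifacts with one file if that layout is preferred.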
