## Predicting Hospital Readmissions Using Integrated Patient, Clinical, and Socioeconomic Data
 
1.2.1	🎯 Project Objective:
To develop a predictive model for 30-day hospital readmission risk by merging and cleaning patient demographics, clinical encounter data, and socioeconomic data. The goal is to help hospitals reduce readmissions, improve patient outcomes, and reduce costs.
### J. Casey Brookshier
### 7/21/2025

## "Hospital Quality Forecasting: Data-Driven Insights into Readmission Penalties"
Recommended Workflow: Clean First, Then Integrate
## In short: Clean → Standardize → Aggregate → Integrate → Analyze


In [None]:
# Hospital Readmission Risk Forecasting

## Objective
Predict hospital-level 30-day readmission risk using publicly available
CMS readmission metrics, healthcare-associated infection indicators,
and socioeconomic deprivation (ADI).

## Business Value
• Identify facilities at risk of CMS readmission penalties  
• Support targeted quality improvement initiatives  
• Enable data-informed policy and administrative decisions


In [None]:
hospital-readmission-forecasting/
│
├── data/                      # (GitHub-safe: small metadata only)
│   └── README.md              # explains where to download CMS / ADI data
│
├── notebooks/
│   └── Hospital_Readmission_Forecasting.ipynb
│
├── models/
│   ├── random_forest_model.pkl
│   └── feature_names.pkl
│
├── requirements.txt
├── README.md
└── .gitignore


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

import matplotlib.pyplot as plt
import seaborn as sns
import pickle


In [None]:
# Relative paths (work on GitHub)
DATA_DIR = "../data/"
MODEL_DIR = "../models/"

READMISSIONS_PATH = DATA_DIR + "FY_2025_Hospital_Readmissions_Reduction_Program_Hospital.csv"
INFECTIONS_PATH   = DATA_DIR + "Healthcare_Associated_Infections-Hospital.csv"
ADI_PATH          = DATA_DIR + "CO_2023_ADI_9_Digit_Zip.csv"
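
Because the raw CSVs are not stored in the repository, a quick existence check before loading can give a clearer error than a failed `pd.read_csv`. The sketch below is only an illustration; `missing_inputs` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the entries of `paths` that are not existing files."""
    return [p for p in paths if not Path(p).is_file()]


# Hypothetical usage with the constants defined above:
# missing = missing_inputs([READMISSIONS_PATH, INFECTIONS_PATH, ADI_PATH])
# if missing:
#     raise FileNotFoundError(f"Download these files first: {missing}")
```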


In [None]:
# Create Data Cleaning Functions

def clean_readmissions(df):
    cols = [
        "Facility ID", "Facility Name", "State",
        "Measure Name",
        "Excess Readmission Ratio",
        "Predicted Readmission Rate",
        "Expected Readmission Rate",
        "Number of Readmissions"
    ]
    df = df[cols].copy()

    numeric_cols = cols[4:]
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
    df = df.dropna(subset=["Excess Readmission Ratio"])

    pivot = df.pivot_table(
        index=["Facility ID", "Facility Name", "State"],
        columns="Measure Name",
        values=numeric_cols
    ).reset_index()

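    # After reset_index() the columns are (value, measure) pairs; the index
    # columns carry an empty measure, e.g. ("Facility ID", "") -> "Facility_ID_"
    pivot.columns = [
        f"{m.replace(' ', '_')}_{c.replace(' ', '').replace('-', '')}"
        if isinstance(m, str) else m
        for m, c in pivot.columns
    ]

    # Restore the merge key name (the flattened name is "Facility_ID_")
    pivot.rename(columns={"Facility_ID_": "Facility ID"}, inplace=True)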
    pivot["Facility ID"] = pivot["Facility ID"].astype(str)
    return pivot


def clean_infections(df):
    df = df[["Facility ID", "Measure Name", "Score"]].copy()
    df["Score"] = pd.to_numeric(df["Score"], errors="coerce")
    df = df.dropna()

    pivot = df.pivot_table(
        index="Facility ID",
        columns="Measure Name",
        values="Score",
        aggfunc="mean"
    ).reset_index()

    pivot.columns = [
        f"Infection_{c.replace(' ', '_').replace('-', '')}"
        if c != "Facility ID" else c
        for c in pivot.columns
    ]

    pivot["Facility ID"] = pivot["Facility ID"].astype(str)
    return pivot



def clean_adi(df):
    df = df.rename(columns={
        "ZIP_4": "ZIP",
        "ADI_NATRANK": "ADI_National_Rank",
        "ADI_STATERNK": "ADI_State_Rank"
    })

    # Keep only the key and the rank columns so that .mean() never
    # sees any text columns the file may contain
    rank_cols = ["ADI_National_Rank", "ADI_State_Rank"]
    df = df[["ZIP"] + rank_cols].copy()

    df["ZIP"] = df["ZIP"].astype(str).str.zfill(5)
    # Non-numeric rank entries are coerced to NaN and dropped below
    df[rank_cols] = df[rank_cols].apply(pd.to_numeric, errors="coerce")

    return (
        df.dropna(subset=rank_cols)
          .groupby("ZIP", as_index=False)[rank_cols]
          .mean()
    )
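
The readmission and infection cleaners both reshape long tables (one row per facility and measure) into wide tables (one row per facility). The toy example below, with made-up values and illustrative measure names, sketches that reshaping and the flattening of the resulting two-level column index; it is not the notebook's own code.

```python
import pandas as pd

# Toy long-format table (made-up values): one row per facility and measure
long_df = pd.DataFrame({
    "Facility ID": ["010001", "010001", "010005"],
    "Measure Name": ["READM-30-AMI-HRRP", "READM-30-HF-HRRP", "READM-30-AMI-HRRP"],
    "Excess Readmission Ratio": [1.02, 0.98, 1.10],
})

# Passing `values` as a list produces two-level (value, measure) columns
wide = long_df.pivot_table(
    index="Facility ID",
    columns="Measure Name",
    values=["Excess Readmission Ratio"],
).reset_index()

# Flatten each (value, measure) pair; index columns have an empty measure
wide.columns = [
    f"{value.replace(' ', '_')}_{measure.replace('-', '')}" if measure else value
    for value, measure in wide.columns
]

print(wide.columns.tolist())
```

A facility that lacks a measure (here `010005` has no HF row) ends up with `NaN` in that column rather than being dropped.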


In [None]:
# Load and Merge Data

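# Read Facility ID as text so leading zeros survive and both files share one key format
readmissions_df = pd.read_csv(READMISSIONS_PATH, dtype={"Facility ID": str})
infections_df   = pd.read_csv(INFECTIONS_PATH, dtype={"Facility ID": str})
# adi_df is loaded but not merged below: clean_adi() keys on ZIP,
# which the cleaned hospital tables do not carry
adi_df          = pd.read_csv(ADI_PATH)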

readmissions = clean_readmissions(readmissions_df)
infections   = clean_infections(infections_df)

merged_df = readmissions.merge(
    infections,
    on="Facility ID",
    how="left"
)
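
With a left join, facilities missing from the infections table silently receive `NaN` infection features. One way to quantify that, sketched here on made-up frames, is to use the `validate` and `indicator` options of `DataFrame.merge`; the frame and column names are illustrative only.

```python
import pandas as pd

# Made-up stand-ins for the cleaned readmissions and infections tables
left = pd.DataFrame({
    "Facility ID": ["010001", "010005", "010006"],
    "ratio": [1.0, 1.1, 0.9],
})
right = pd.DataFrame({
    "Facility ID": ["010001", "010006"],
    "Infection_X": [0.5, 0.7],
})

checked = left.merge(
    right,
    on="Facility ID",
    how="left",
    validate="one_to_one",  # raises MergeError if either side repeats a key
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Share of left-hand facilities that found a match
match_rate = (checked["_merge"] == "both").mean()
print(f"{match_rate:.0%} of facilities matched")
```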


In [None]:
# Find Composite Readmission Score

excess_cols = [c for c in merged_df.columns if c.startswith("Excess_Readmission_Ratio_")]

merged_df["Composite_Readmission_Score"] = merged_df[excess_cols].mean(axis=1)
merged_df = merged_df.dropna(subset=["Composite_Readmission_Score"])
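
The row-wise mean skips missing values, so the composite averages over whichever measures a facility actually reports. The toy frame below (made-up values, shortened column names) illustrates this and adds a hypothetical `n_measures` count that could help flag composites built from very few measures.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Excess_Readmission_Ratio_A": [1.0, 1.2, np.nan],
    "Excess_Readmission_Ratio_B": [1.1, np.nan, np.nan],
})
cols = ["Excess_Readmission_Ratio_A", "Excess_Readmission_Ratio_B"]

# mean(axis=1) uses skipna=True by default: NaN only when every measure is missing
toy["Composite"] = toy[cols].mean(axis=1)

# Number of measures that contributed to each composite value
toy["n_measures"] = toy[cols].notna().sum(axis=1)

print(toy)
```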


In [None]:
# Feature Selection, Leakage Prevention

leak_cols = [c for c in merged_df.columns if "Predicted_Readmission_Rate" in c
            or "Expected_Readmission_Rate" in c]

id_cols = ["Facility ID", "Facility_Name_", "State_", "Composite_Readmission_Score"]
count_cols = [c for c in merged_df.columns if c.startswith("Number_of_Readmissions")]

X = merged_df.drop(columns=leak_cols + id_cols + count_cols)
y = merged_df["Composite_Readmission_Score"]

X = X.fillna(X.mean())
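
Note that the `Excess_Readmission_Ratio_*` columns remain in `X` even though the target is their row-wise mean. A stricter variant, which is not what the notebook does and would change the reported scores, also removes those columns. `drop_target_components` is a hypothetical helper shown only as a sketch.

```python
import pandas as pd


def drop_target_components(X, prefix="Excess_Readmission_Ratio_"):
    """Return X without the columns that were averaged into the target."""
    keep = [c for c in X.columns if not c.startswith(prefix)]
    return X[keep]


# Toy example with made-up values
X_demo = pd.DataFrame({
    "Excess_Readmission_Ratio_A": [1.0, 1.1],
    "Infection_CLABSI": [0.4, 0.6],
})
X_strict = drop_target_components(X_demo)
print(X_strict.columns.tolist())
```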


In [None]:
# Modeling & Evaluation

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression()
rf = RandomForestRegressor(n_estimators=100, random_state=42)

lr.fit(X_train, y_train)
rf.fit(X_train, y_train)

def evaluate(y_true, y_pred):
    return (
        np.sqrt(mean_squared_error(y_true, y_pred)),
        r2_score(y_true, y_pred)
    )

lr_rmse, lr_r2 = evaluate(y_test, lr.predict(X_test))
rf_rmse, rf_r2 = evaluate(y_test, rf.predict(X_test))

pd.DataFrame({
    "Model": ["Linear Regression", "Random Forest"],
    "RMSE": [lr_rmse, rf_rmse],
    "R²": [lr_r2, rf_r2]
})
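
The feature cell fills missing values with column means computed over all rows, before the train/test split. One possible alternative, sketched on synthetic data, is to put the imputer inside a scikit-learn pipeline so the means are learned from the training rows only. The data and variable names below are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data: the target depends on columns "a" and "b" only
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_toy = 2 * X_toy["a"] - X_toy["b"] + rng.normal(scale=0.1, size=200)
X_toy.loc[::10, "c"] = np.nan  # inject missing values into one feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# The imputer's means are fit on X_tr only, then applied to X_te
pipe = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
pipe.fit(X_tr, y_tr)
r2 = pipe.score(X_te, y_te)
```

The same pipeline object can be passed to `cross_val_score`, in which case the imputation is refit inside every fold.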


In [None]:
# Cross Validation

cv_rmse = np.sqrt(
    -cross_val_score(
        rf, X, y,
        cv=5,
        scoring="neg_mean_squared_error"
    )
)

cv_rmse.mean(), cv_rmse.std()
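
With `cv=5` and a regressor, scikit-learn uses `KFold` without shuffling, so each fold is a contiguous block of rows. Since `pivot_table` returns rows ordered by Facility ID, a shuffled splitter may give a fairer estimate. The sketch below uses synthetic data and a small forest purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (made up)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy[:, 0] + 0.1 * rng.normal(size=120)

# Shuffle rows before splitting; fix the seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    RandomForestRegressor(n_estimators=20, random_state=42),
    X_toy, y_toy,
    cv=cv,
    scoring="neg_root_mean_squared_error",  # negated RMSE, one value per fold
)
cv_rmse_toy = -scores
print(cv_rmse_toy.mean(), cv_rmse_toy.std())
```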


In [None]:
# Feature Importance

importance_df = pd.DataFrame({
    "Feature": X.columns,
    "Importance": rf.feature_importances_
}).sort_values("Importance", ascending=False)

importance_df.head(10)
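
`feature_importances_` is impurity-based and derived from the training data. As a cross-check, permutation importance on held-out rows measures how much the score drops when one feature is shuffled. This is a sketch on synthetic data, not a result from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only column 0 drives the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = 3 * X_toy[:, 0] + 0.1 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test rows and record the score drop
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)

# Feature indices ordered from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)
```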


In [None]:
# Save Model Artifacts
import os

os.makedirs(MODEL_DIR, exist_ok=True)  # create the models folder if it is missing

with open(MODEL_DIR + "random_forest_model.pkl", "wb") as f:
    pickle.dump(rf, f)

with open(MODEL_DIR + "feature_names.pkl", "wb") as f:
    pickle.dump(list(X.columns), f)
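
Saving the feature names alongside the model makes it possible to rebuild the exact column order at prediction time. A minimal sketch of the loading side follows; `load_artifacts` and `align_features` are hypothetical helpers, and pickle files should only be loaded from a trusted source.

```python
import pickle
from pathlib import Path

import pandas as pd


def load_artifacts(model_dir):
    """Load the pickled model and the list of training feature names."""
    model_dir = Path(model_dir)
    with open(model_dir / "random_forest_model.pkl", "rb") as f:
        model = pickle.load(f)
    with open(model_dir / "feature_names.pkl", "rb") as f:
        feature_names = pickle.load(f)
    return model, feature_names


def align_features(df, feature_names):
    """Reorder columns to the training order; absent columns become NaN."""
    return df.reindex(columns=feature_names)


# Hypothetical usage:
# model, feature_names = load_artifacts(MODEL_DIR)
# X_new = align_features(new_df, feature_names)  # impute NaN before predicting
# preds = model.predict(X_new)
```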


In [None]:
## Key Insights

• Random Forest achieved R² ≈ 0.90 on held-out data  
• Excess readmission ratios were the dominant drivers, which is expected because the composite score is their average  
• Infection control metrics meaningfully influenced readmission risk  

## Practical Use
This model can support:
• CMS penalty risk forecasting  
• Quality improvement targeting  
• Operational dashboard integration (Streamlit-ready)
