## Data Source & Scope
**Source:** CMS Hospital Provider Cost Report  
**Unit of analysis:** Hospital fiscal-year observation  
**Key identifier:** Provider CCN

This dataset contains detailed financial, utilization, and staffing information reported annually
by hospitals to CMS. Fiscal years vary by hospital and do not necessarily align with calendar years.
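
Because reporting periods differ across hospitals, it can help to derive the length of each period and a year label before comparing rows. The following is a minimal sketch, not part of the original notebook. It assumes the `Fiscal Year Begin Date` and `Fiscal Year End Date` columns are stored as `MM/DD/YYYY` strings; the toy rows and the `fy_days` and `fy_end_year` names are illustrative.

```python
import pandas as pd

# Toy rows standing in for the cost report (values are made up for illustration).
demo = pd.DataFrame({
    "Provider CCN": ["010001", "140010"],
    "Fiscal Year Begin Date": ["10/01/2021", "01/01/2022"],
    "Fiscal Year End Date": ["09/30/2022", "12/31/2022"],
})

# Assumed date format; adjust `format` if the real file differs.
begin = pd.to_datetime(demo["Fiscal Year Begin Date"], format="%m/%d/%Y", errors="coerce")
end = pd.to_datetime(demo["Fiscal Year End Date"], format="%m/%d/%Y", errors="coerce")

# Inclusive length of the reporting period, and the calendar year in which it ends.
demo["fy_days"] = (end - begin).dt.days + 1
demo["fy_end_year"] = end.dt.year
```

Both toy hospitals report a full 365-day period ending in 2022, even though one runs October to September and the other January to December.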


# 01_CMS_CostReport_EDA
## Overview
This notebook explores and prepares CMS Hospital Provider Cost Report data for use in analyzing
the relationship between hospital staffing capacity and quality outcomes.

The primary objective is to construct hospital-level staffing intensity measures that can be
linked to CMS quality ratings and regional nursing workforce data in later stages of the project.

## 1) Load and Clean 
This section loads the raw CMS cost report data and performs basic cleaning steps required for
consistent identification and numeric analysis.

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option("display.max_columns", 200)

# Project paths
ROOT = Path.cwd()          # /voc/work in the run shown below
SRC = ROOT / "src"

RAW_DIR = SRC / "data" / "raw"
CLEAN_DIR = SRC / "data" / "clean"

RAW_DIR.mkdir(parents=True, exist_ok=True)
CLEAN_DIR.mkdir(parents=True, exist_ok=True)

COST_RAW_PATH = RAW_DIR / "CMS_HospitalProviderCostReport.csv"
COST_CLEAN_PATH = CLEAN_DIR / "cost_report_clean.csv"

print("Reading:", COST_RAW_PATH)
print("Writing:", COST_CLEAN_PATH)
assert COST_RAW_PATH.exists(), f"Missing file: {COST_RAW_PATH}"

Reading: /voc/work/src/data/raw/CMS_HospitalProviderCostReport.csv
Writing: /voc/work/src/data/clean/cost_report_clean.csv


In [3]:
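# Read the CCN as text so leading zeros survive and missing values stay missing
cost = pd.read_csv(COST_RAW_PATH, low_memory=False, dtype={"Provider CCN": str})

# Ensure CCN formatting: trim whitespace and left-pad to 6 characters
cost["Provider CCN"] = cost["Provider CCN"].str.strip().str.zfill(6)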

# Convert numeric columns
num_cols = [
    "FTE - Employees on Payroll",
    "Total Days (V + XVIII + XIX + Unknown)",
    "Number of Beds",
]
for c in num_cols:
    cost[c] = pd.to_numeric(cost[c], errors="coerce")
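
The two cleaning steps above have quiet failure modes that are worth checking: an identifier column read as float picks up a `.0` suffix and turns missing values into the string `"nan"`, and `errors="coerce"` silently converts unparseable entries to `NaN`. The sketch below illustrates both on toy data; `clean_ccn` is a hypothetical helper and the sample values are invented.

```python
import numpy as np
import pandas as pd

def clean_ccn(s: pd.Series) -> pd.Series:
    """Hypothetical helper: 6-character CCN strings, missing values kept missing."""
    out = (
        s.astype(str)
        .str.strip()
        .str.replace(r"\.0$", "", regex=True)  # drop float residue such as "10001.0"
        .str.zfill(6)
    )
    return out.where(s.notna())  # undo the "nan" strings created by astype(str)

# A numeric column containing a NaN is stored as float.
raw_ccn = pd.Series([10001, 140010, np.nan])
ccn = clean_ccn(raw_ccn)

# Count how many non-missing entries were lost to coercion.
raw_fte = pd.Series(["1,234", "56", "n/a"])
fte = pd.to_numeric(raw_fte, errors="coerce")
n_coerced = int(fte.isna().sum() - raw_fte.isna().sum())
```

Here `"1,234"` is coerced to `NaN` along with `"n/a"`, because `pd.to_numeric` does not parse thousands separators. Comparing missing counts before and after conversion shows whether a column needs extra cleaning first.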

## 2) Compute staffing metrics

### Interpretation of Staffing Metrics
Two staffing intensity measures are constructed:

- **FTE per 1,000 patient days**: captures staffing relative to inpatient utilization
- **FTE per bed**: captures staffing relative to hospital capacity

These measures use total employees on payroll (not nursing-specific), serving as a proxy for
overall staffing intensity. Extreme values are expected for very small hospitals or facilities
with limited inpatient volume.
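
Since extreme ratios are expected, one option is to flag them rather than drop them. The sketch below marks values outside the 1st and 99th percentiles. The `flag_extremes` helper, the percentile cutoffs, and the toy series are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd

def flag_extremes(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Hypothetical helper: True where a value lies outside the given quantiles."""
    lo, hi = s.quantile([lower, upper])
    # Comparisons with NaN evaluate to False, so missing ratios are never flagged.
    return (s < lo) | (s > hi)

# Toy ratio series: 99 typical hospitals and one with almost no patient days.
ratio = pd.Series([30.0] * 99 + [44720.0])
is_extreme = flag_extremes(ratio)
```

Keeping the flag as a separate column leaves the raw ratios intact, so later stages can choose between excluding, winsorizing, or simply inspecting the flagged rows.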


In [4]:
fte_col = "FTE - Employees on Payroll"
days_col = "Total Days (V + XVIII + XIX + Unknown)"
beds_col = "Number of Beds"

cost["fte_per_1000_days"] = np.where(
    cost[days_col] > 0,
    (cost[fte_col] / cost[days_col]) * 1000,
    np.nan,
)

cost["fte_per_bed"] = np.where(
    cost[beds_col] > 0,
    cost[fte_col] / cost[beds_col],
    np.nan,
)

cost[["fte_per_1000_days", "fte_per_bed"]].describe(percentiles=[.01, .05, .5, .95, .99])

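| statistic | fte_per_1000_days | fte_per_bed |
|-----------|-------------------|-------------|
| count     | 5935.0            | 5941.0      |
| mean      | 93.895991         | 10.444594   |
| std       | 833.752042        | 120.908313  |
| min       | 0.109481          | 0.0045      |
| 1%        | 4.91782           | 0.817261    |
| 5%        | 7.049028          | 1.573333    |
| 50%       | 30.475509         | 5.182       |
| 95%       | 164.329907        | 15.322778   |
| 99%       | 556.66243         | 26.900541   |
| max       | 44720.0           | 6997.2      |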


## 3) Missingness, Data Quality, and Duplicates

### Notes on Missingness
High missingness is expected for financial and utilization subcomponents that are not applicable
to all hospital types. Staffing and bed counts exhibit substantially lower missingness, supporting
their use in downstream analysis.
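
To compare missingness across specific columns, a small summary table with counts and shares is easier to read than a single sorted series. This is a sketch on toy data; `Some Financial Subcomponent` is a hypothetical column name and the values are invented.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "FTE - Employees on Payroll": [100.0, np.nan, 250.0, 80.0],
    "Number of Beds": [25.0, 40.0, 120.0, 16.0],
    "Some Financial Subcomponent": [np.nan, np.nan, np.nan, 1.0],  # hypothetical column
})

# One row per column: number and share of missing values, worst first.
missing = pd.DataFrame({
    "n_missing": demo.isna().sum(),
    "share_missing": demo.isna().mean(),
}).sort_values("share_missing", ascending=False)
```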

In [5]:
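# Share of missing values per column (top 15).
# Note: a notebook cell renders only its last expression, so this result is
# not shown unless it is wrapped in display() or print().
cost.isna().mean().sort_values(ascending=False).head(15)

# Share of rows whose CCN already appeared in an earlier row. Repeats are
# possible because the unit of analysis is a hospital fiscal year, not a hospital.
cost["Provider CCN"].duplicated().mean()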

np.float64(0.01032279206947403)
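
A repeated CCN is not necessarily a duplicate record, because the unit of analysis is a hospital fiscal year. The sketch below separates the two cases by also checking the CCN together with the fiscal-year end date. The toy rows are invented, and using `Fiscal Year End Date` as the second key is an assumption.

```python
import pandas as pd

demo = pd.DataFrame({
    "Provider CCN": ["010001", "010001", "140010", "140010"],
    "Fiscal Year End Date": ["09/30/2022", "09/30/2023", "12/31/2022", "12/31/2022"],
})

# Repeated CCNs: includes hospitals that legitimately report more than one period.
ccn_dup_share = demo["Provider CCN"].duplicated().mean()

# Repeated (CCN, fiscal-year end) pairs: candidates for true duplicate records.
key = ["Provider CCN", "Fiscal Year End Date"]
key_dup_share = demo.duplicated(subset=key).mean()

# keep=False returns every row in a duplicated group, which helps manual review.
dup_rows = demo[demo.duplicated(subset=key, keep=False)]
```

In the toy data half of the rows repeat a CCN, but only one row repeats a full key, so only hospital `140010` would need a closer look.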

In [6]:
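# Normalize CBSA codes to clean strings; strip first so a trailing ".0"
# followed by whitespace is still removed.
cost["cbsa"] = (
    cost["Medicare CBSA Number"]
    .astype(str)
    .str.strip()
    .str.replace(r"\.0$", "", regex=True)
    .replace({"nan": np.nan, "None": np.nan, "": np.nan})
)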
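
The CBSA normalization can be checked on a few awkward inputs before it is trusted for merging. This sketch wraps the same chain of steps in a hypothetical `normalize_cbsa` function, with whitespace stripped before the `.0` suffix is removed. The sample values are invented.

```python
import numpy as np
import pandas as pd

def normalize_cbsa(s: pd.Series) -> pd.Series:
    """Hypothetical helper: CBSA codes as clean strings, missing values as NaN."""
    return (
        s.astype(str)
        .str.strip()                           # trim first so "35614.0 " is handled
        .str.replace(r"\.0$", "", regex=True)  # drop float residue
        .replace({"nan": np.nan, "None": np.nan, "": np.nan})
    )

# Mixed inputs: a float, a NaN, padded text, None, and a float string with a trailing space.
raw = pd.Series([12060.0, np.nan, " 16974 ", None, "35614.0 "], dtype=object)
cbsa = normalize_cbsa(raw)
```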

## 4) Final Dataset for Merging

### Output Design
The cleaned dataset retains only identifiers, staffing inputs, and derived staffing metrics
required for merging with CMS quality ratings and BLS workforce data. This minimizes file size
and reduces downstream complexity.


In [7]:
cols_out = [
    "Provider CCN",
    "Hospital Name",
    "State Code",
    "Medicare CBSA Number",
    "cbsa",
    "Fiscal Year Begin Date",
    "Fiscal Year End Date",
    "FTE - Employees on Payroll",
    "Total Days (V + XVIII + XIX + Unknown)",
    "Number of Beds",
    "fte_per_1000_days",
    "fte_per_bed",
]

cost_clean = cost[cols_out].copy()

cost_clean.to_csv(COST_CLEAN_PATH, index=False)

print(cost_clean["State Code"].head(10).tolist())
print(cost_clean["State Code"].nunique())
print("Saved:", COST_CLEAN_PATH)


['GA', 'IL', 'AR', 'TX', 'IL', 'SC', 'DE', 'AZ', 'NJ', 'TX']
55
Saved: /voc/work/src/data/clean/cost_report_clean.csv
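
One caveat when the cleaned CSV is read back for merging: CSV stores no type information, so identifier columns with leading zeros are parsed as integers unless a dtype is given. The sketch below shows the effect using an in-memory buffer in place of the saved file; the toy rows are invented.

```python
import io

import pandas as pd

demo = pd.DataFrame({
    "Provider CCN": ["010001", "140010"],
    "cbsa": ["12060", "16974"],
    "fte_per_bed": [4.8, 3.0],
})

# Round-trip through CSV text, standing in for the file written above.
buf = io.StringIO()
demo.to_csv(buf, index=False)

buf.seek(0)
naive = pd.read_csv(buf)  # CCN is parsed as an integer and loses its leading zero

buf.seek(0)
typed = pd.read_csv(buf, dtype={"Provider CCN": str, "cbsa": str})  # identifiers stay text
```

Passing the same `dtype` mapping wherever the file is loaded keeps the merge keys comparable with other CMS files that store CCNs as text.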
