# 02_CMS_Ratings_EDA

## Data Source & Scope
**Source:** CMS Hospital Quality Ratings (Care Compare)
**Unit of analysis:** Hospital (facility-level record)
**Key identifier:** Facility ID (CMS Certification Number / CCN)

This dataset summarizes hospital quality performance using an overall star rating (1–5) and
measure-group summaries for five groups: mortality, safety, readmissions, patient experience, and
timely and effective care. Some hospitals receive no overall rating because they do not meet CMS
reporting and eligibility requirements, so missingness and footnote fields must be reviewed before
merging.


## 1. Load and Standardize
This section loads the raw CMS hospital ratings file and performs basic standardization steps
required for reliable analysis and downstream merging (e.g., trimming column names and confirming
expected row/column counts).

In [26]:
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option("display.max_columns", 250)
pd.set_option("display.width", 160)

# --- Project paths (01-style, but robust) ---
ROOT = Path.cwd().resolve()

# If you're already inside /src (e.g., /work/src/notebooks), climb to /work
if ROOT.name == "notebooks" and ROOT.parent.name == "src":
    ROOT = ROOT.parents[1]          # /work
elif ROOT.name == "src":
    ROOT = ROOT.parent              # /work

SRC = ROOT / "src"
RAW_DIR = SRC / "data" / "raw"
CLEAN_DIR = SRC / "data" / "clean"

RAW_DIR.mkdir(parents=True, exist_ok=True)
CLEAN_DIR.mkdir(parents=True, exist_ok=True)

RATINGS_RAW_PATH = RAW_DIR / "CMS_HospitalRatings.csv"
RATINGS_CLEAN_PATH = CLEAN_DIR / "ratings_clean.csv"

print("CWD:", Path.cwd())
print("Resolved ROOT:", ROOT)
print("Reading:", RATINGS_RAW_PATH)
print("Writing:", RATINGS_CLEAN_PATH)

assert RATINGS_RAW_PATH.exists(), f"Missing file: {RATINGS_RAW_PATH}"


CWD: /voc/work/src/notebooks
Resolved ROOT: /voc/work
Reading: /voc/work/src/data/raw/CMS_HospitalRatings.csv
Writing: /voc/work/src/data/clean/ratings_clean.csv


In [27]:
ratings = pd.read_csv(RATINGS_RAW_PATH, low_memory=False)
ratings.columns = ratings.columns.str.strip()

ratings.shape  # not echoed: a notebook displays only the last expression in a cell
ratings.columns.tolist()[:40]

['Facility ID',
 'Facility Name',
 'Address',
 'City/Town',
 'State',
 'ZIP Code',
 'County/Parish',
 'Telephone Number',
 'Hospital Type',
 'Hospital Ownership',
 'Emergency Services',
 'Meets criteria for birthing friendly designation',
 'Hospital overall rating',
 'Hospital overall rating footnote',
 'MORT Group Measure Count',
 'Count of Facility MORT Measures',
 'Count of MORT Measures Better',
 'Count of MORT Measures No Different',
 'Count of MORT Measures Worse',
 'MORT Group Footnote',
 'Safety Group Measure Count',
 'Count of Facility Safety Measures',
 'Count of Safety Measures Better',
 'Count of Safety Measures No Different',
 'Count of Safety Measures Worse',
 'Safety Group Footnote',
 'READM Group Measure Count',
 'Count of Facility READM Measures',
 'Count of READM Measures Better',
 'Count of READM Measures No Different',
 'Count of READM Measures Worse',
 'READM Group Footnote',
 'Pt Exp Group Measure Count',
 'Count of Facility Pt Exp Measures',
 'Pt Exp Group Footno
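
The load step above trims header whitespace but does not assert that the columns used later are
actually present. A minimal sketch of such a check follows, run on a toy frame rather than the real
file. The helper `check_required_columns` and the `REQUIRED_COLS` list are illustrative assumptions,
not part of the original notebook.

```python
import pandas as pd

# Columns that later sections rely on (names taken from the column listing above).
REQUIRED_COLS = [
    "Facility ID",
    "Hospital overall rating",
    "Hospital overall rating footnote",
]

def check_required_columns(df: pd.DataFrame, required: list[str]) -> None:
    """Raise KeyError if any required column is absent (hypothetical helper)."""
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")

# Toy frame with padded headers, standing in for the raw CMS file.
demo = pd.DataFrame(
    columns=[" Facility ID", "Hospital overall rating ", "Hospital overall rating footnote"]
)
demo.columns = demo.columns.str.strip()

check_required_columns(demo, REQUIRED_COLS)  # passes silently once headers are trimmed
print("Shape:", demo.shape)
```

Failing early here gives a clearer error than a `KeyError` raised several cells later.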

## 2. Clean Key Fields and Overall Star Rating
This section standardizes the facility identifier (Facility ID / CCN) and converts the overall
star rating to a numeric field. Values such as “Not Available” are treated as missing to preserve
CMS reporting meaning and avoid introducing invalid numeric values.

In [28]:
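FAC_ID = "Facility ID"
RATING = "Hospital overall rating"

# CCNs are six characters; left-pad in case leading zeros were lost when the CSV was parsed
ratings[FAC_ID] = ratings[FAC_ID].astype(str).str.strip().str.zfill(6)

# Map CMS placeholder text to NaN, then coerce any remaining non-numeric values to NaN
ratings[RATING] = ratings[RATING].replace({"Not Available": np.nan, "": np.nan})
ratings[RATING] = pd.to_numeric(ratings[RATING], errors="coerce")

# Distribution of ratings, including the missing bucket
ratings[RATING].value_counts(dropna=False).sort_index()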

Hospital overall rating
1.0     229
2.0     649
3.0     937
4.0     765
5.0     289
NaN    2552
Name: count, dtype: int64
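
One caveat with the identifier cleanup above: in many pandas versions, `astype(str)` turns a
missing value into the literal string "nan", which `zfill(6)` then pads to "000nan", so a later
`dropna` on the ID would not catch it. The sketch below shows one possible alternative that keeps
missing IDs missing. The helper name `normalize_ccn` and the toy values are assumptions made for
illustration.

```python
import pandas as pd

def normalize_ccn(ids: pd.Series, width: int = 6) -> pd.Series:
    """Strip and left-pad IDs while keeping missing values as <NA> (hypothetical helper)."""
    s = ids.astype("string").str.strip()   # nullable string dtype preserves missing values
    s = s.replace({"": pd.NA})             # treat empty strings as missing as well
    return s.str.zfill(width)

# Toy IDs: a short ID, a padded ID, a missing value, and an empty string.
demo_ids = pd.Series(["10001", " 450358 ", None, ""])
clean_ids = normalize_ccn(demo_ids)

print(clean_ids.tolist())
print("Missing IDs:", int(clean_ids.isna().sum()))
```

This only matters if the raw file contains blank identifiers; the row and unique-facility counts
in this notebook do not suggest that it does.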

## 3. Missingness and Rating Coverage
This section quantifies how many hospitals have a valid overall star rating and summarizes
missingness patterns. Understanding rating coverage is important because hospitals without a rating
may differ systematically (e.g., reporting exclusions), which can affect interpretation and sample size.

In [29]:
print("Rows:", len(ratings))
print("Unique facilities:", ratings[FAC_ID].nunique())
print(f"Missing overall rating: {ratings[RATING].isna().mean():.1%}")

Rows: 5421
Unique facilities: 5421
Missing overall rating: 47.1%
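
To probe whether unrated hospitals differ systematically, one option is to break the missing share
down by a hospital characteristic. The sketch below does this by `Hospital Type` on a small toy
frame. The rows and category labels are made up for illustration and are not figures from the CMS
file.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the ratings frame (values are illustrative only).
toy = pd.DataFrame({
    "Hospital Type": ["Acute Care Hospitals"] * 3 + ["Critical Access Hospitals"] * 2,
    "Hospital overall rating": [3.0, np.nan, 4.0, np.nan, np.nan],
})

# Count hospitals, missing ratings, and the missing share within each type.
missing_by_type = (
    toy.assign(rating_missing=toy["Hospital overall rating"].isna())
       .groupby("Hospital Type")["rating_missing"]
       .agg(n="size", n_missing="sum", share_missing="mean")
       .sort_values("share_missing", ascending=False)
)
print(missing_by_type)
```

On the real data the same pattern would apply to `ratings` directly, and any categorical column
(for example `Hospital Ownership`) could replace the grouping key.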


## 4. Footnote Fields and Eligibility Indicators
This section reviews CMS footnote fields that explain missing ratings or measure-group exclusions.
These indicators are retained to support sensitivity checks and to document why hospitals may be
excluded from CMS rating calculations.

In [30]:
footnote_cols = [c for c in ratings.columns if "footnote" in c.lower()]
footnote_cols

['Hospital overall rating footnote',
 'MORT Group Footnote',
 'Safety Group Footnote',
 'READM Group Footnote',
 'Pt Exp Group Footnote',
 'TE Group Footnote']

In [31]:
# Ten most common values in each footnote column
for c in footnote_cols:
    print("\n", c)
    print(ratings[c].value_counts(dropna=False).head(10))


 Hospital overall rating footnote
Hospital overall rating footnote
NaN       2857
16        1676
19         795
5           47
22          32
17           7
23           5
16, 23       2
Name: count, dtype: int64

 MORT Group Footnote
MORT Group Footnote
NaN     3643
5.0      950
19.0     795
22.0      32
23.0       1
Name: count, dtype: int64

 Safety Group Footnote
Safety Group Footnote
NaN     3350
5.0     1238
19.0     795
22.0      32
23.0       6
Name: count, dtype: int64

 READM Group Footnote
READM Group Footnote
NaN     4271
19.0     795
5.0      323
22.0      32
Name: count, dtype: int64

 Pt Exp Group Footnote
Pt Exp Group Footnote
NaN     3154
5.0     1440
19.0     795
22.0      32
Name: count, dtype: int64

 TE Group Footnote
TE Group Footnote
NaN     4493
19.0     795
5.0      101
22.0      32
Name: count, dtype: int64
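
The output above shows two quirks worth handling before footnotes are used as indicators: the
overall footnote column holds text, including combined entries such as "16, 23", while the group
footnote columns were parsed as floats (5.0, 19.0). A minimal sketch of one way to normalize both
into one code per row follows. The helper `split_footnotes` and the toy series are assumptions, and
the sketch deliberately attaches no meaning to any code.

```python
import pandas as pd

def split_footnotes(col: pd.Series) -> pd.Series:
    """Return one row per (hospital, footnote code) as clean strings (hypothetical helper)."""
    return (
        col.dropna()
           .astype(str)
           .str.split(",")                              # "16, 23" -> ["16", " 23"]
           .explode()                                   # one code per row, original index kept
           .str.strip()
           .str.replace(r"\.0$", "", regex=True)        # "5.0" -> "5"
    )

# Toy footnote column mixing text, a combined entry, a missing value, and a float.
toy_footnotes = pd.Series(["16", "19", None, "16, 23", 5.0])
codes = split_footnotes(toy_footnotes)
print(codes.value_counts().sort_index())

# Per-hospital indicator for a single code, among hospitals with any footnote.
flag_code = "16"
has_code = codes.eq(flag_code).groupby(level=0).any()
print(has_code)
```

Because `explode` keeps the original index, the indicator can be joined back to the hospital-level
frame by index.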


## 5. Final Field Selection for Merging
This section keeps only the variables required for merging and interpretation: Facility ID, overall
rating and its footnote, hospital characteristics, and the per-group "Group Measure Count" columns.
The resulting dataset is intentionally minimal to reduce noise and keep merges deterministic.

In [32]:
keep_cols = [
    "Facility ID",
    "Facility Name",
    "State",
    "ZIP Code",
    "Hospital Type",
    "Hospital Ownership",
    "Emergency Services",
    "Meets criteria for birthing friendly designation",
    "Hospital overall rating",
    "Hospital overall rating footnote",
]

# Add group count columns if present
group_cols = [c for c in ratings.columns if "Group Measure Count" in c]
keep_cols += group_cols

# Keep only columns that actually exist
keep_cols = [c for c in keep_cols if c in ratings.columns]

ratings_clean = ratings[keep_cols].copy()
ratings_clean = ratings_clean.dropna(subset=[FAC_ID])

ratings_clean.shape


(5421, 15)
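
A deterministic merge depends on the key being unique on each side. The sketch below shows one way
to check that and to let pandas enforce it, using the `validate` and `indicator` arguments of
`DataFrame.merge`. Both frames are toy examples, and `other_demo` with its `some_metric` column is
a hypothetical stand-in for whatever table the merge notebook joins.

```python
import pandas as pd

# Toy stand-ins for the cleaned ratings and a second facility-level table.
ratings_demo = pd.DataFrame({
    "Facility ID": ["010001", "010005"],
    "Hospital overall rating": [3.0, None],
})
other_demo = pd.DataFrame({
    "Facility ID": ["010001", "010007"],
    "some_metric": [1.2, 3.4],  # hypothetical column
})

# Fail fast if the key is duplicated in the ratings table.
assert ratings_demo["Facility ID"].is_unique, "Duplicate Facility ID values found"

merged = ratings_demo.merge(
    other_demo,
    on="Facility ID",
    how="left",
    validate="one_to_one",  # raises MergeError if either side repeats a key
    indicator=True,         # adds a _merge column: both / left_only / right_only
)
print(merged["_merge"].value_counts())
```

The `_merge` column makes unmatched facilities easy to count before deciding how to handle them.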

## 6. Save Clean Output for Downstream Merging
This section writes the cleaned hospital ratings dataset to disk for use in the merge notebook.
Output: `ratings_clean.csv`

In [33]:
ratings_clean.to_csv(RATINGS_CLEAN_PATH, index=False)
print("Saved:", RATINGS_CLEAN_PATH)

Saved: /voc/work/src/data/clean/ratings_clean.csv
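
When the saved file is read back, pandas will infer `Facility ID` as an integer and drop leading
zeros unless the dtype is given explicitly. The round trip below illustrates this with an in-memory
buffer and toy IDs instead of the real file.

```python
import io
import pandas as pd

# Toy frame with zero-padded IDs, written to an in-memory CSV.
demo = pd.DataFrame({
    "Facility ID": ["010001", "450358"],
    "Hospital overall rating": [3.0, None],
})
buf = io.StringIO()
demo.to_csv(buf, index=False)

# Default read: the ID column is inferred as integers, so padding is lost.
buf.seek(0)
naive = pd.read_csv(buf)

# Explicit dtype: IDs stay as six-character strings.
buf.seek(0)
safe = pd.read_csv(buf, dtype={"Facility ID": str})

print(naive["Facility ID"].tolist())
print(safe["Facility ID"].tolist())
```

Passing the same `dtype` mapping when loading `ratings_clean.csv` downstream would avoid repeating
the `zfill` step.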
