# 01 - Data Collection

Download and assemble all raw datasets for the Pennsylvania healthcare access analysis.

**Data sources:**
- CMS Provider of Services (facility locations)
- HRSA Health Center data
- US Census American Community Survey (ACS) via API
- CDC Social Vulnerability Index (SVI)
- Census TIGER/Line shapefiles (tracts & roads)

In [None]:
import sys

# Add the parent directory to the import path so the `src` package resolves
sys.path.insert(0, "..")

import pandas as pd

from src.config import DATA_RAW, STATE_FIPS, CENSUS_API_KEY
from src.data_collection import (
    download_cms_data,
    download_hrsa_data,
    fetch_acs_data,
    download_svi_data,
    download_tiger_shapefiles,
)

## 1.1 Healthcare Facility Data (CMS)

In [None]:
cms_df = download_cms_data(DATA_RAW)
cms_df.head(), cms_df.shape
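
The internals of `download_cms_data` live in `src.data_collection` and are not shown in this notebook. If the helper re-downloads on every run, a cache check like the one below could avoid repeated requests. This is a minimal sketch: `cached_fetch` and its `fetch` callback are hypothetical names used for illustration, not part of the project.

```python
from pathlib import Path
from typing import Callable


def cached_fetch(path: Path, fetch: Callable[[], bytes]) -> Path:
    """Return `path`, calling `fetch` to create the file only if it is missing.

    Hypothetical helper: `fetch` stands in for whatever performs the download.
    """
    path = Path(path)
    if not path.exists():
        # Create the target directory on first use, then write the payload
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch())
    return path
```

A second call with the same path returns immediately without invoking `fetch`, so re-running the notebook does not hit the remote server again.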

## 1.2 HRSA Health Centers

In [None]:
hrsa_df = download_hrsa_data(DATA_RAW)
hrsa_df.head(), hrsa_df.shape

## 1.3 Census ACS Demographics

In [None]:
if not CENSUS_API_KEY:
    raise ValueError("CENSUS_API_KEY is missing. Add it to .env or src/config.py")

acs_df = fetch_acs_data(state_fips=STATE_FIPS, year=2022, output_dir=DATA_RAW)
acs_df.head(), acs_df.shape
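
For orientation, a tract-level request to the Census ACS 5-year endpoint might be built roughly as follows. This is a sketch under assumptions: `build_acs_url` is a hypothetical helper, `fetch_acs_data` may construct its request differently, and `B01003_001E` (total population) is only an example variable. The state code `"42"` is Pennsylvania's FIPS code.

```python
from urllib.parse import urlencode


def build_acs_url(year, state_fips, variables, api_key=None):
    """Build a tract-level ACS 5-year query URL (hypothetical helper)."""
    params = {
        "get": ",".join(["NAME", *variables]),
        "for": "tract:*",               # every tract ...
        "in": f"state:{state_fips}",    # ... within the given state
    }
    if api_key:
        params["key"] = api_key
    # Keep ':', '*' and ',' readable instead of percent-encoding them
    query = urlencode(params, safe=":*,")
    return f"https://api.census.gov/data/{year}/acs/acs5?{query}"


url = build_acs_url(2022, "42", ["B01003_001E"])
print(url)
```

Building the URL separately from the request makes it easy to inspect or log the exact query before any network call is made.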

## 1.4 CDC Social Vulnerability Index

In [None]:
svi_df = download_svi_data(state_fips=STATE_FIPS, output_dir=DATA_RAW)
svi_df.head(), svi_df.shape

## 1.5 TIGER/Line Shapefiles (Tracts & Roads)

In [None]:
tiger_data = download_tiger_shapefiles(state_fips=STATE_FIPS, output_dir=DATA_RAW)
tracts_gdf = tiger_data["tracts"]
roads_gdf = tiger_data["roads"]

tracts_gdf.head(2), roads_gdf.head(2), (tracts_gdf.shape, roads_gdf.shape)
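
A quick sanity check on tract identifiers can catch a wrong state or truncated IDs before any joins. Census tract GEOIDs are 11 digits: 2 for the state, 3 for the county, 6 for the tract. The sketch below assumes the identifiers are available as an iterable of values; `invalid_geoids` is a hypothetical helper, and the column name in `tracts_gdf` (for example `GEOID`) is an assumption that may vary by TIGER/Line vintage.

```python
def invalid_geoids(geoids, state_fips):
    """Return the GEOIDs that are not 11-digit strings starting with `state_fips`."""
    bad = []
    for g in geoids:
        s = str(g)
        if not (len(s) == 11 and s.isdigit() and s.startswith(state_fips)):
            bad.append(s)
    return bad


# Possible usage (assumed column name):
# invalid_geoids(tracts_gdf["GEOID"], STATE_FIPS)
sample = ["42003010300", "4200301030", "36061000100"]
print(invalid_geoids(sample, "42"))
```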

## 1.6 Initial Data Quality Report

In [None]:
quality_report = pd.DataFrame(
    [
        {"dataset": "CMS", "rows": len(cms_df), "cols": cms_df.shape[1]},
        {"dataset": "HRSA", "rows": len(hrsa_df), "cols": hrsa_df.shape[1]},
        {"dataset": "ACS", "rows": len(acs_df), "cols": acs_df.shape[1]},
        {"dataset": "SVI", "rows": len(svi_df), "cols": svi_df.shape[1]},
        {"dataset": "Tracts", "rows": len(tracts_gdf), "cols": tracts_gdf.shape[1]},
        {"dataset": "Roads", "rows": len(roads_gdf), "cols": roads_gdf.shape[1]},
    ]
)
quality_report
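
Row and column counts alone say little about data quality. One possible extension adds the share of missing cells and the number of duplicate rows per dataset. This is a sketch on toy data; `summarize` is a hypothetical helper, not part of `src`.

```python
import pandas as pd


def summarize(name, df):
    """Return size, missing-cell percentage, and duplicate-row count for `df`."""
    n_cells = df.shape[0] * df.shape[1]
    missing = df.isna().sum().sum()
    return {
        "dataset": name,
        "rows": len(df),
        "cols": df.shape[1],
        "missing_pct": round(100 * missing / n_cells, 1) if n_cells else 0.0,
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy frame: one missing cell out of eight, one exact duplicate row
toy = pd.DataFrame({"id": [1, 2, 2, 4], "value": [10.0, 20.0, 20.0, None]})
report = pd.DataFrame([summarize("toy", toy)])
print(report)
```

For the GeoDataFrames, it may be safer to drop the geometry column before calling `duplicated()`, since comparing geometry objects can be slow or unsupported depending on the installed library versions.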