- Source dataset: HHS Medicaid Provider Spending
| Name | Type | Description |
|---|---|---|
| `BILLING_PROVIDER_NPI_NUM` | string | National Provider Identifier of the billing provider |
| `SERVICING_PROVIDER_NPI_NUM` | string | National Provider Identifier of the servicing provider |
| `HCPCS_CODE` | string | Healthcare Common Procedure Coding System code for the service |
| `CLAIM_FROM_MONTH` | date | Month for which claims are aggregated (YYYY-MM-01 format) |
| `TOTAL_UNIQUE_BENEFICIARIES` | integer | Count of unique beneficiaries for this provider/procedure/month |
| `TOTAL_CLAIMS` | integer | Total number of claims for this provider/procedure/month |
| `TOTAL_PAID` | float | Total amount paid by Medicaid (in USD) |
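As a quick orientation to this schema, here is a minimal sketch that aggregates total Medicaid payments per procedure code. It assumes the source parquet has already been downloaded to the path used elsewhere in this README (`data/raw/medicaid/medicaid-provider-spending.parquet`) and that pandas with a parquet engine is installed.

```python
import pandas as pd

# Load the source spending table (path per the directory layout below).
spending = pd.read_parquet("data/raw/medicaid/medicaid-provider-spending.parquet")

# Total paid and claim volume per procedure code, highest spend first.
by_code = (
    spending.groupby("HCPCS_CODE", as_index=False)[["TOTAL_PAID", "TOTAL_CLAIMS"]]
    .sum()
    .sort_values("TOTAL_PAID", ascending=False)
)
print(by_code.head(10))
```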
Directory layout:

```
data/
  unresolved_identifiers.csv
  README.md                   # detailed provenance and usage notes for data folders
  raw/
    medicaid/                 # downloaded source dataset
    nppes/
      monthly/                # extracted monthly NPPES bundle(s)
      weekly/                 # extracted weekly incremental NPPES bundle(s)
    cpt/
      archives/               # optional CPT/HCPCS source zips downloaded via --cpt-zip-url
      cpt_hcpcs_fallback.csv  # derived local fallback used for unresolved HCPCS/CPT lookups
  manually_cleaned/           # version-controlled manual CSV/MD corrections used as overlays
  cache/
    npi/npi_provider_cache.sqlite
    hcpcs/hcpcs_code_cache.sqlite
  mappings/
    npi/npi_provider_mapping.csv
    hcpcs/hcpcs_code_mapping.csv
  output/
    npi.parquet
    hcpcs.parquet
```
```bash
./download.sh
```

Default Medicaid dataset URL:
https://stopendataprod.blob.core.windows.net/datasets/medicaid-provider-spending/2026-02-09/medicaid-provider-spending.parquet
`download.sh` will:

- create organized subfolders under `data/`
- download the Medicaid source file into `data/raw/medicaid/`
- download the latest NPPES monthly V2 and weekly V2 bundles into `data/raw/nppes/`
- extract the NPPES zips and keep the raw bundles
- recursively extract any zip archives under `data/raw/cpt/` (including nested zips)
- normalize extracted CPT folder names when source archives include trailing spaces
- apply manual overlays from `data/manually_cleaned/PFREV26AR_nonQP/{csv,md}` into the extracted CPT folders
- build `data/raw/cpt/cpt_hcpcs_fallback.csv` from the extracted CPT/HCPCS files when they are parsable
- reuse existing downloaded archives/files when present, downloading only missing or empty targets (see the sketch after this list)
- optionally run the Rust pipeline if `--run-rust` is passed
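The reuse rule in that list can be stated precisely. Below is a minimal Python sketch of the same "download only missing or empty targets" behavior; the real logic lives in `download.sh`, and `fetch_if_missing` is an illustrative name, not part of the scripts.

```python
from pathlib import Path
import urllib.request

def fetch_if_missing(url: str, dest: Path) -> None:
    """Download url to dest, but reuse any existing non-empty file."""
    if dest.exists() and dest.stat().st_size > 0:
        return  # target already present and non-empty: skip, as download.sh does
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
```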
Useful options:

- `--skip-nppes` (skip monthly/weekly NPPES downloads)
- `--nppes-monthly-url <url>` / `--nppes-weekly-url <url>` (pin exact bundle URLs)
- `--cpt-index-url <url>` (override the CPT page used for fallback zip auto-discovery)
- `--cpt-zip-url <url>` (repeatable; downloads CPT/HCPCS source archives into `data/raw/cpt/archives/`)
If `data/raw/cpt/` has no local CPT/HCPCS source data and no `--cpt-zip-url` is provided, `download.sh` auto-discovers a default CMS CPT archive URL from the page configured via `--cpt-index-url` (see the sketch below).
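One way such auto-discovery might look is sketched here: scan the configured index page for the first linked zip archive. This illustrates the idea only; the exact matching rules in `download.sh` may differ, and relative links would still need to be resolved against the index URL.

```python
import re
import urllib.request

def discover_cpt_zip_url(index_url: str) -> str | None:
    # Fetch the CPT index page and return the first .zip link found, if any.
    html = urllib.request.urlopen(index_url).read().decode("utf-8", "replace")
    match = re.search(r'href="([^"]+\.zip)"', html, re.IGNORECASE)
    return match.group(1) if match else None
```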
```bash
./build_datasets.sh
```

Default outputs:

- NPI mapping CSV: `data/mappings/npi/npi_provider_mapping.csv`
- NPI lookup cache DB: `data/cache/npi/npi_provider_cache.sqlite`
- HCPCS mapping CSV: `data/mappings/hcpcs/hcpcs_code_mapping.csv`
- HCPCS lookup cache DB: `data/cache/hcpcs/hcpcs_code_cache.sqlite`
- NPI resolved identifier dataset (bulk + API): `data/output/npi.parquet`
- HCPCS resolved identifier dataset (cache + fallback + API): `data/output/hcpcs.parquet`
- Unresolved identifiers report: `data/unresolved_identifiers.csv`
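The cache DBs are ordinary SQLite files, so they can be inspected directly. The internal table names are not documented in this README, so this sketch just lists whatever tables exist:

```python
import sqlite3

con = sqlite3.connect("data/cache/npi/npi_provider_cache.sqlite")
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
print(tables)
con.close()
```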
Behavior:

- each dataset build step is skipped only when its outputs exist and cache coverage is complete (no unresolved IDs/codes for the current input)
- use `--rebuild-map` to rebuild using the existing cache/API
- use `--reset-map` to delete mappings and caches and start fresh
- interrupted runs resume from the cache state
- NPI mapping preloads from the newest local NPPES monthly + weekly CSVs first, then uses the API only for unresolved NPIs
- pass `--skip-nppes-bulk` to disable the local NPPES preload
- HCPCS API lookups are batched with OR queries (`q=code:(... OR ...)`) to reduce the request count (see the sketch under "HCPCS API batching notes" below)
- tune the HCPCS batch size with `--hcpcs-batch-size` (default `100`)
- on an HCPCS batch request failure, the pipeline falls back to single-code HCPCS requests for that batch
- local CPT/HCPCS fallback rows are loaded from `data/raw/cpt/cpt_hcpcs_fallback.csv` (or `--hcpcs-fallback-csv`) and seeded into the cache before API lookup
- if an HCPCS API lookup returns `not_found`, the resolver checks local CPT fallback rows before writing `not_found`
- local fallback logic applies to HCPCS/CPT only; NPI resolution still uses NPPES bulk + the NPI API
- NPI and HCPCS map-building run in parallel, with one live progress bar per API
- progress bars include elapsed time, throughput, and ETA during API lookups
- pressing Ctrl-C triggers a graceful stop: current in-flight work finishes, caches/maps are saved, then the process exits
- resolved identifier datasets capture the full API payload when a request occurred, and otherwise a synthetic payload derived from bulk/fallback sources (plus URL/params/errors); they are written as deduped one-row-per-identifier tables
- resolved identifier dataset columns are ordered for readability: primary fields first, metadata/raw-payload fields last
- the end-of-run unresolved report lists unresolved NPIs/HCPCS codes with a status (`not_found`, `error`, `missing_cache`) and the last fetch timestamp (see the sketch after this list)
- override the unresolved report path with `--unresolved-report-csv`
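A minimal sketch for inspecting that end-of-run report follows. The file path matches the default output above; the `status` column name is inferred from the status values just listed and may differ from the actual CSV header.

```python
import pandas as pd

report = pd.read_csv("data/unresolved_identifiers.csv")
# Expected status values per the behavior notes: not_found, error, missing_cache.
print(report["status"].value_counts())
```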
Optional log file output:

```bash
./build_datasets.sh \
  --log-file ./logs/build_map.log
```

Note:

- in an interactive terminal, `--log-file` preserves live progress bars while also writing logs
Two additional datasets are produced during map building (and are safe to re-export from the cache DB without re-querying the APIs):

- `data/output/npi.parquet` - one row per NPI API lookup (deduped)
  - includes nested response structures as JSON-string columns (`addresses`, `taxonomies`, etc.)
  - includes `url` and the full payload in `response_json`
  - column order: `npi`, `basic`, `addresses`, `practice_locations`, `taxonomies`, `identifiers`, `other_names`, `endpoints`, `url`, `error_message`, `api_run_id`, `requested_at_utc`, `request_params`, `results`, `response_json`
- `data/output/hcpcs.parquet` - one row per HCPCS API lookup (deduped)
  - includes columns derived from `response_json` plus `url`
  - includes the full payload in `response_json`
  - column order: `hcpcs_code`, `ef_short_desc`, `ef_long_desc`, `ef_add_dt`, `ef_act_eff_dt`, `ef_term_dt`, `ef_obsolete`, `ef_is_noc`, `response_codes`, `response_display`, `response_extra_fields`, `url`, `error_message`, `api_run_id`, `requested_at_utc`, `request_params`, `response_json`
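Because the nested response structures are stored as JSON strings, they need an explicit parse step on read. A minimal sketch, assuming the column names listed above:

```python
import json

import pandas as pd

npi = pd.read_parquet("data/output/npi.parquet")
# Parse one JSON-string column into Python objects (e.g. lists of taxonomy dicts).
taxonomies = npi["taxonomies"].dropna().map(json.loads)
print(taxonomies.head())
```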
Note:

- if `--skip-api` is set, these response datasets will not gain new rows (existing cached rows are still exported)
```bash
./download.sh --run-rust
```

Forward extra args to Rust after `--`:

```bash
./download.sh --run-rust -- \
  --reset-map \
  --requests-per-second 2 \
  --concurrency 2 \
  --hcpcs-batch-size 100
```

Upload is opt-in in Rust and only happens when upload flags are provided:
```bash
./build_datasets.sh \
  --hf-token "hf_..." \
  --hf-repo-id "mkieffer/Medicaid-Provider-Spending" \
  --hf-upload-mapping \
  --hf-upload-hcpcs-mapping \
  --hf-upload-npi \
  --hf-upload-hcpcs
```

Optional destination overrides for the API response datasets:

- `--hf-npi-path-in-repo`
- `--hf-hcpcs-path-in-repo`
You can still use the standalone upload helper:

```bash
./upload_medicaid_to_hf.sh \
  --token "hf_..." \
  --repo-id "mkieffer/Medicaid-Provider-Spending" \
  --file "./data/raw/medicaid/medicaid-provider-spending.parquet" \
  --path-in-repo "data/spending.parquet"
```

Or with the split/config dataset upload mode:

```bash
./upload_medicaid_to_hf.sh \
  --token "hf_..." \
  --repo-id "mkieffer/Medicaid-Provider-Spending" \
  --upload-split-dataset
```

To update just the API response splits and the dataset card without re-uploading the large spending parquet:

```bash
./upload_medicaid_to_hf.sh \
  --token "hf_..." \
  --repo-id "mkieffer/Medicaid-Provider-Spending" \
  --upload-split-dataset \
  --skip-spending-upload
```

Hugging Face config usage:
```python
from datasets import load_dataset

spending = load_dataset("mkieffer/Medicaid-Provider-Spending", "spending")["spending"]
npi = load_dataset("mkieffer/Medicaid-Provider-Spending", "npi")["npi"]
hcpcs = load_dataset("mkieffer/Medicaid-Provider-Spending", "hcpcs")["hcpcs"]
```
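Building on that config usage, here is a sketch of joining the spending config with the `npi` config to attach an organization name to each billing provider. The `organization_name` key inside the JSON-string `basic` column is an NPI API payload detail assumed here, not a documented column of this dataset, and note that the spending config is large, so this is illustrative rather than efficient.

```python
import json

from datasets import load_dataset

npi = load_dataset("mkieffer/Medicaid-Provider-Spending", "npi")["npi"].to_pandas()
spending = load_dataset("mkieffer/Medicaid-Provider-Spending", "spending")["spending"].to_pandas()

# `basic` holds the NPI API "basic" object as a JSON string.
npi["org_name"] = npi["basic"].map(
    lambda s: json.loads(s).get("organization_name") if s else None
)

merged = spending.merge(
    npi[["npi", "org_name"]],
    left_on="BILLING_PROVIDER_NPI_NUM",
    right_on="npi",
    how="left",
)
```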
Dataset and API reference URLs:

- NPPES registry homepage (notice about hourly query limits): https://npiregistry.cms.hhs.gov/
- NPPES API documentation: https://npiregistry.cms.hhs.gov/api-page
- CMS NPPES Data Dissemination overview: https://www.cms.gov/medicare/regulations-guidance/administrative-simplification/data-dissemination
- CMS NPPES downloadable files index: http://download.cms.gov/nppes/NPI_Files.html
- HCPCS API documentation (Clinical Tables): https://clinicaltables.nlm.nih.gov/apidoc/hcpcs/v3/doc.html
- Clinical Tables FAQ (rate limits and usage guidance): https://clinicaltables.nlm.nih.gov/faq.html
The API does not publish a numeric per-hour cap, so the defaults are conservative:

- `--requests-per-second 2`
- `--concurrency 2`
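As one way to picture those defaults, here is a minimal thread-safe throttle with `--requests-per-second` semantics. It is a sketch of the idea only, not the Rust pipeline's actual implementation:

```python
import threading
import time

class RateLimiter:
    """Space calls at least 1/rps seconds apart across threads."""

    def __init__(self, rps: float) -> None:
        self.interval = 1.0 / rps
        self._lock = threading.Lock()
        self._next_at = 0.0

    def wait(self) -> None:
        # Reserve the next available slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_at)
            self._next_at = start + self.interval
        time.sleep(max(0.0, start - now))
```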
HCPCS API batching notes:

- uses `count=500` (the documented maximum per request)
- the default `--hcpcs-batch-size` is `100`, to keep query URLs manageable while still significantly reducing the request count (see the sketch after this list)
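Here is a sketch of one such batched lookup against the Clinical Tables HCPCS API. The `q=code:(... OR ...)` form and `count=500` follow the notes above; the `ef` field names are inferred from the `ef_*` columns of `hcpcs.parquet`, the four-element response shape (total, codes, extra-field hash, display strings) is the Clinical Tables convention, and passing an empty `terms` alongside `q` is an assumption of this sketch.

```python
import requests

HCPCS_SEARCH_URL = "https://clinicaltables.nlm.nih.gov/api/hcpcs/v3/search"

def lookup_hcpcs_batch(codes: list[str]) -> dict[str, str]:
    """Resolve one batch of HCPCS codes with a single OR query."""
    params = {
        "terms": "",
        "q": "code:(" + " OR ".join(codes) + ")",
        "count": 500,  # documented per-request maximum
        "ef": "short_desc,long_desc",  # extra fields, inferred from the ef_* columns
    }
    resp = requests.get(HCPCS_SEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    total, found_codes, extra_fields, _display = resp.json()
    # Map each resolved code to its short description.
    return dict(zip(found_codes, extra_fields.get("short_desc", [])))
```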
For large bulk workloads, CMS recommends NPPES dissemination files instead of high-volume NPI API querying.