# CLEAR

Chargemaster Location-based Exploration for Affordability & Reform

This notebook is primarily responsible for pre-processing data, attaching important information, and generating the database files for the GitHub page. Below you can find all information about how data is processed from the downloaded `.csv` files found on most hospital sites. This is an exploratory project focused on creating interactive visualizations and tools to better inform people about their healthcare. The repo can be kept current by downloading the most recent year's data for a specific hospital and running it through the scripts. Note that this is NOT a comprehensive list, but with enough time it could be scaled into a full standalone site.

All pre-processing code is written in python. See the `.html` files for how the D3 visualizations work. 

## How it works (Copied from README)

Hospitals that have been added to this 'web-app' are stored in a `.csv` file for quick lookup and ease of access. Each row points to the location of its Charge Master `.json` file, which is then queried for the specific procedure. Hospitals are gathered from the CSV list based on a radius lookup provided by the user. If a hospital in the radius does not offer the service, it is left out of the price comparison with the other hospitals in the radius.
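
The radius lookup can be sketched in Python as below. This is only an illustration of the logic, not the code the page runs (the front end lives in the `.html` files). It assumes each hospital row carries the `lat` and `lon` values written to `hospitals.csv`; the function names, hospital names, and coordinates are placeholders.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def hospitals_within(hospitals, lat, lon, radius_miles):
    """Return (distance, hospital) pairs inside the radius, nearest first."""
    hits = []
    for h in hospitals:
        d = haversine_miles(lat, lon, h["lat"], h["lon"])
        if d <= radius_miles:
            hits.append((d, h))
    return sorted(hits, key=lambda pair: pair[0])

# Illustrative rows only; coordinates are rough placeholders, not taken from hospitals.csv.
hospitals = [
    {"name": "Hospital A", "lat": 36.00, "lon": -78.94},
    {"name": "Hospital B", "lat": 35.82, "lon": -78.70},
    {"name": "Hospital C", "lat": 32.78, "lon": -79.95},
]
nearby = hospitals_within(hospitals, lat=35.90, lon=-78.80, radius_miles=25)
print([h["name"] for _, h in nearby])
```

The user's zip code would first be turned into a coordinate (for example via `zip_centroids.csv`), and only the hospitals returned here would have their price files queried.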

Currently limited to 500 procedures due to file size limits and because I don't want to set up a server/database for this. Parquet only works server-side, so I can't do iterative testing before publishing to Pages, and Pages deployments can take a while. I will consider moving to a Parquet-based system once the front end is stable and working as envisioned.

## List of Hospitals

These are the hospitals for which data has been gathered and processed thus far:

| State    | Hospital Name                     | Zipcode     | Date                 | File Size    | Link                                                            |
|----------|--------------------------------|-------------|-------------------|-------------|------------------------------------------------|
| NC        | Duke University Hospital     |     27710    |      09/2025      |   3.32 GB   |    [Link](https://www.dukehealth.org/paying-for-care/what-duke-charges-services) |
| NC        | AdventHealth (Hendersonville)   |     28792   |   09/2025    |      1.48 GB           |                                                                 |
| NC | UNC Rex Hospital | 27607 | 09/2025 | 121 MB | [Link](https://www.unchealth.org/records-insurance/standard-charges) |
| NC | WakeMed North Hospital | 27614 | 09/2025 | 56.1 MB | [Link](https://www.wakemed.org/sites/default/files/PricingTransparency/566017737_wakemed-raleigh-campus-and-north-hospital_standardcharges.csv) |
| SC        | MUSC Health-University Medical Center (Charleston) |   29425   | 09/2025 | 11.8 MB |  [Link](https://muschealth.org/patients-visitors/billing/price-transparency) |
| VA | Inova Fairfax Hospital (Falls Church) | 22042 | 09/2025 | 11.3 MB | [Link](https://www.inova.org/patient-and-visitor-information/hospital-charges) |

#### Top Hospitals in Every State 

A list of hospitals that should be added at a later date.

- Alaska: Providence Alaska Medical Center (Anchorage) and Fairbanks Memorial Hospital
- Alabama: University of Alabama at Birmingham Hospital        
- Arizona: Mayo Clinic-Phoenix        
- Arkansas: Washington Regional Medical Center (Fayetteville)        
- California: Cedars-Sinai Medical Center (Los Angeles), UCLA Medical Center (Los Angeles), Stanford Health Care-Stanford Hospital (Palo Alto), UC San Diego Health-La Jolla and Hillcrest Hospitals, and UCSF Health-UCSF Medical Center (San Francisco)
- Colorado: UCHealth University of Colorado Hospital (Aurora)        
- Connecticut: Yale New Haven Hospital        
- Delaware: ChristianaCare Hospitals (Newark)        
- Florida: Mayo Clinic-Jacksonville        
- Georgia: Emory University Hospital (Atlanta)        
- Hawaii: Queen’s Medical Center (Honolulu)        
- Idaho: St. Luke’s Regional Medical Center (Boise)        
- Illinois: Northwestern Medicine-Northwestern Memorial Hospital (Chicago) and Rush University Hospital (Chicago)         
- Indiana: Indiana University Health Medical Center (Indianapolis)        
- Iowa: University of Iowa Hospitals and Clinics (Iowa City)        
- Kansas: University of Kansas Hospital (Kansas City)        
- Kentucky: University of Kentucky Albert B. Chandler Hospital (Lexington) 
- Louisiana: Ochsner Medical Center (New Orleans)        
- Maine: Maine Medical Center (Portland)        
- Maryland: Johns Hopkins Hospital (Baltimore)        
- Massachusetts: Massachusetts General Hospital (Boston) and Brigham and Women’s Hospital (Boston)       
- Michigan: University of Michigan Health-Ann Arbor        
- Minnesota: Mayo Clinic (Rochester)        
- Mississippi: Mississippi Baptist Medical Center (Jackson)        
- Missouri: Barnes-Jewish Hospital (St. Louis)        
- Montana: Billings Clinic        
- Nebraska: Nebraska Medicine-Nebraska Medical Center (Omaha)        
- Nevada: Renown Regional Medical Center (Reno)        
- New Hampshire: Dartmouth Hitchcock Medical Center (Lebanon)        
- New Jersey: Hackensack University Medical Center at Hackensack University Health      
- New Mexico: Presbyterian Hospital (Albuquerque)        
- New York: NYU Langone Hospitals (New York City), New York-Presbyterian Hospital-Columbia and Cornell (New York City), Mount Sinai Hospital (New York City), and North Shore University Hospital at Northwell Health (Manhasset)
- ~~North Carolina: Duke University Hospital (Durham)~~       
- North Dakota: Sanford Medical Center Fargo        
- Ohio: Cleveland Clinic        
- Oklahoma: St. Francis Hospital-Tulsa        
- Oregon: OHSU Hospital (Portland)        
- Pennsylvania: Hospitals of the University of Pennsylvania-Penn Presbyterian (Philadelphia)        
- Rhode Island: Miriam Hospital (Providence)        
- ~~South Carolina: MUSC Health-University Medical Center (Charleston)~~        
- South Dakota: Sanford USD Medical Center (Sioux Falls)        
- Tennessee: Vanderbilt University Medical Center (Nashville)        
- Texas: Houston Methodist Hospital and UT Southwestern Medical Center (Dallas)
- Utah: University of Utah Hospital (Salt Lake City)        
- Vermont: University of Vermont Medical Center (Burlington)        
- ~~Virginia: Inova Fairfax Hospital (Falls Church)~~        
- Washington: UW Medicine-University of Washington Medical Center (Seattle)        
- West Virginia: West Virginia University Hospitals (Morgantown)        
- Wisconsin: UW Health University Hospital (Madison)

Source: Becker's Hospital Review, https://www.beckershospitalreview.com/rankings-and-ratings/us-news-top-hospitals-by-state-for-2023-24/

## Outside Sources Used

- `zip_centroids.csv` courtesy of SimpleMaps data: https://simplemaps.com/data/us-zips
- CMS.gov data
    - Top 200 HCPCS and CPT codes billed for 2024 and top 100 lab codes. [Link](https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-fee-for-service-parts-a-b/medicare-utilization-part-b)
    - PFALL25
    - PFS (Physician Fee Schedule) for mapping HCPCS/CPT codes to Medicare rates [Link](https://www.cms.gov/medicare/payment/fee-schedules/physician/national-payment-amount-file)
    - ASC rates for mapping HCPCS/CPT codes to ambulatory rates [Link](https://www.cms.gov/medicare/payment/prospective-payment-systems/ambulatory-surgical-center-asc/asc-payment-rates-addenda)
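
As a rough sketch of what "mapping codes to rates" means here, the snippet below builds a code-to-rate lookup and attaches it to price records. The column names (`hcpcs`, `non_facility_rate`) and the dollar amounts are placeholders I made up for illustration; the real CMS files use their own layouts.

```python
import csv
import io

# Hypothetical extract of a fee-schedule file; the column names and dollar
# amounts are placeholders, not values from the CMS files.
PFS_SAMPLE = """hcpcs,non_facility_rate
99213,90.00
36415,9.00
"""

def load_rates(text):
    """Build a {code: rate} lookup from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["hcpcs"].strip(): float(row["non_facility_rate"]) for row in reader}

def attach_medicare_rate(records, rates):
    """Return copies of the price records with a 'medicare_rate' field added.

    Codes without a rate get None so they can be spotted later.
    """
    return [dict(rec, medicare_rate=rates.get(str(rec["code"]).strip())) for rec in records]

rates = load_rates(PFS_SAMPLE)
prices = [
    {"code": "99213", "estimated_amount": 180.0},
    {"code": "00000", "estimated_amount": 50.0},
]
enriched = attach_medicare_rate(prices, rates)
```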




***

## MAJOR CHANGES

- Moved all HCPCS/CPT top 200 and lab codes into a single file
- Removed codes deprecated between 2024 and 2025, since we're working with 2025 CMs
- Pulled CPT code descriptions from CMS RVU data for early 2025 (January)
- Validated 2024 codes against 2025 to ensure all are being caught properly
- Updated the bundler to reflect this

Data processing for hospitals will need to be updated to reflect these changes.
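
The 2024-vs-2025 validation boils down to a set comparison. A minimal sketch, assuming the 2025 codes are available as a plain set; the function name is hypothetical and `X0001` is a made-up code standing in for a retired one:

```python
def split_by_validity(codes_2024, valid_2025):
    """Partition the 2024 top-code list into codes still valid in 2025 and retired ones."""
    valid = {c.strip().upper() for c in valid_2025}
    kept, retired = [], []
    for code in codes_2024:
        normalized = code.strip().upper()
        (kept if normalized in valid else retired).append(normalized)
    return kept, retired

# Placeholder codes for illustration only; "X0001" is a made-up retired code.
top_2024 = ["99213", "36415", "X0001", " g0439 "]
valid_2025 = {"99213", "36415", "G0439"}
kept, retired = split_by_validity(top_2024, valid_2025)
```

Normalizing whitespace and case before comparing avoids false "retired" hits caused by formatting differences between the two source files.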

## Data Processing

CSV files are too large to store on GitHub, so they are downloaded locally, converted to the necessary format, then uploaded. If you want to perform conversions yourself, you will need to find the specific hospital chargemaster and document it in the notebook accordingly.
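
For the multi-gigabyte files, one option is to stream the CSV row by row and keep only the codes of interest, so the whole file never sits in memory. This is a sketch using the standard `csv` module rather than the pandas calls used in the cells below; `filter_chargemaster` and the `code` column name are assumptions for illustration.

```python
import csv
import io

def filter_chargemaster(csv_stream, wanted_codes, code_column="code"):
    """Yield only rows whose code is in wanted_codes, reading one row at a time."""
    for row in csv.DictReader(csv_stream):
        # Short rows leave missing fields as None, so guard before stripping.
        if (row.get(code_column) or "").strip() in wanted_codes:
            yield row

# Stand-in for a large download; a real run would pass an open file handle instead.
sample = io.StringIO(
    "description,code,estimated_amount\n"
    "Office visit,99213,180\n"
    "Something else,12345,10\n"
)
kept_rows = list(filter_chargemaster(sample, {"99213"}))
```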

Not all Charge Masters (CMs) are formatted the same. To keep this notebook from growing too large, custom Python scripts will be made for unique CMs. This matters because some hospitals are regional or statewide 'chains' but can vary prices between locations. For example,

**AdventHealth**
- AdventHealth Orlando
- AdventHealth Tampa
- AdventHealth Hendersonville

all are AdventHealth hospitals, but their prices and available procedures vary per location. However, the same script to clean and process their CMs works because the file structure doesn't change from location to location. Normally CM structure only changes from one hospital brand to another, but I haven't looked at the majority of US hospitals, so this statement might need to be amended.

Think of this file as more of a "**Controller**" for the cleaning, while the cleaning itself is performed by imported functions. Subsections from here on are labeled by state; be sure to check which hospitals are in each subsection before uploading data.
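
One way to picture the controller idea is a small registry that maps a hospital brand to the cleaner shared by all of its locations. Everything in this sketch (the cleaner functions, the `CLEANERS` dict, `process_hospital`) is hypothetical and only illustrates the pattern; the actual cleaners live in the imported scripts.

```python
def clean_long_format(rows):
    """Placeholder cleaner for chargemasters with one row per code/payer/plan."""
    return [r for r in rows if r.get("estimated_amount") not in (None, "")]

def clean_wide_format(rows):
    """Placeholder cleaner for chargemasters with one row per code and many payer columns."""
    return rows  # a real cleaner would reshape to long format first

# Hypothetical registry: brand name -> cleaning function shared by all its locations.
CLEANERS = {
    "AdventHealth": clean_long_format,
    "UNC Rex Hospital": clean_wide_format,
}

def process_hospital(brand, rows):
    """Look up the brand's cleaner and apply it; fail loudly for unknown brands."""
    try:
        cleaner = CLEANERS[brand]
    except KeyError:
        raise ValueError(f"No cleaner registered for brand '{brand}'") from None
    return cleaner(rows)

cleaned = process_hospital("AdventHealth", [
    {"code": "99213", "estimated_amount": "180"},
    {"code": "99213", "estimated_amount": ""},
])
```

With this shape, adding a new location of an existing brand needs no new code, while a new brand only needs one new entry in the registry.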



***
## Payer & Plan Names

Naming conventions for payers/plans differ across hospitals, making this a pain. This is meant to be an exhaustive regex section that simplifies the names so that the functionality of the `.html` page remains intact.

I don't know where this fits in yet; I'll add it to the documentation later.

Use this to create a comprehensive list of all current payer names.
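
A minimal sketch of regex-based payer normalization is shown below. It is not the `standardize_payer_name` function imported from `scripts.cleaners`; the rules, canonical names, and the `normalize_payer` helper are assumptions chosen to illustrate the approach.

```python
import re

# Hypothetical rules, checked in order; a real list would be far longer.
PAYER_RULES = [
    (re.compile(r"\b(bcbs|blue\s*cross)\b", re.IGNORECASE), "Blue Cross Blue Shield"),
    (re.compile(r"\baetna\b", re.IGNORECASE), "Aetna"),
    (re.compile(r"\b(uhc|united\s*health\s*care)\b", re.IGNORECASE), "UnitedHealthcare"),
]

def normalize_payer(raw):
    """Map a raw payer string to a canonical name; unknown names are only whitespace-tidied."""
    cleaned = re.sub(r"\s+", " ", str(raw)).strip()
    for pattern, canonical in PAYER_RULES:
        if pattern.search(cleaned):
            return canonical
    return cleaned

raw_names = ["AETNA", "BCBS of NC", "Blue  Cross NC", "united healthcare", "  cigna  ppo "]
payer_list = sorted({normalize_payer(name) for name in raw_names})
```

Collecting the normalized names into a sorted set gives the comprehensive payer list, and anything that falls through unchanged shows which rules are still missing.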

***
## Data Preloading Tasks

In [None]:
# hospitals.csv updater/editor
import hashlib
import requests
import json
from geopy.geocoders import Nominatim
import pandas as pd
import time
import os
import nbformat
from scripts.cleaners import apply_payer_standardization_to_json, standardize_payer_name
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

geolocator = Nominatim(user_agent="CLEAR-geoapi-2025")
csv_file = 'docs/data/hospitals.csv'
df = pd.read_csv(csv_file)

# construct address for geocoding only (don't modify original data)
def construct_geocoding_address(row):
    # Build clean address from original components
    address = f"{row['address']}, {row['city']}, {row['state']} {row['zip']}"
    return address

# get lat/lon from address with increased timeout and retry/delay
def get_lat_lon(address, max_retries=3, delay=2):
    for attempt in range(max_retries):
        try:
            location = geolocator.geocode(address, timeout=5)
            if location:
                return location.latitude, location.longitude
            else:
                return None, None
        except Exception as e:
            print(f"Error geocoding {address} (attempt {attempt+1}): {e}")
            time.sleep(delay)
    return None, None

# generate short unique ID based on row['name'] + full composite address (base36, 8 chars)
def generate_short_id(row):
    full_address = construct_geocoding_address(row)
    unique_string = f"{row['name']}_{full_address}"
    hash_int = int(hashlib.md5(unique_string.encode()).hexdigest(), 16)
    short_id = base36encode(hash_int)[:8]
    return short_id

# base36 encoding for shorter IDs
def base36encode(number):
    chars = '0123456789abcdefghijklmnopqrstuvwxyz'
    if number == 0:
        return '0'
    result = ''
    while number > 0:
        number, i = divmod(number, 36)
        result = chars[i] + result
    return result

# Add lat/lon and short_id to dataframe, set json_path to 'docs/data/prices/<state>/<id>.json'
def update_dataframe(df):
    
    # Don't modify the address column - just use it for geocoding
    def lat_lon_with_delay(row):
        geocoding_address = construct_geocoding_address(row)
        lat, lon = get_lat_lon(geocoding_address)
        time.sleep(1)  # 1 second delay per request
        return pd.Series([lat, lon])
    
    df[['lat', 'lon']] = df.apply(lat_lon_with_delay, axis=1)
    df['id'] = df.apply(generate_short_id, axis=1)
    df['json_path'] = df.apply(lambda row: f"docs/data/prices/{row['state']}/{row['id']}.json", axis=1)
    df.to_csv(csv_file, index=False)
    
    return

update_dataframe(df)

In [None]:
# now we need to create comparison df's for the top 200 HCPCS and CPT codes billed for 2024 & top 100 lab codes
# first load the codes from the .csv files
hcpcs_codes = pd.read_csv('docs/data/hcpcs_lvl2_top_200_codes_2024.csv')
lab_codes = pd.read_csv('docs/data/lab_top_100_codes_2024.csv')
cpt_codes = pd.read_csv('docs/data/cpt_lvl1_top_200_codes_2024.csv')

# new master file for all codes
all_codes = pd.read_csv('docs/data/top_codes_master_dictionary_v4.csv')

In [None]:
# RUN TO LOAD HOSPITALS CSV
import os

# csv's are stored locally outside of CLEAR repo
# go up one folder, then into 'ChargeMaster_Project/csv_files/'
# get path to csv_files folder outside CLEAR repo
workspace_root = os.path.dirname(os.path.abspath('CLEAR.ipynb'))
csv_folder = os.path.join(workspace_root, '..', 'ChargeMaster_Project', 'csv_files')
csv_folder = os.path.abspath(csv_folder)

# define path to hospitals.csv
hospitals_csv = os.path.join(workspace_root, 'docs', 'data', 'hospitals.csv')
hospitals_csv = os.path.abspath(hospitals_csv)

# read hospitals.csv to get list of hospitals and their file paths
hospitals_df = pd.read_csv(hospitals_csv)

In [None]:

# finish implementation later and move
def update_hospital_table():
    # Update the first markdown cell's hospital table with new hospitals from hospitals.csv
    notebook_path = 'CLEAR.ipynb'
    with open(notebook_path, 'r', encoding='utf-8') as f:
        nb = nbformat.read(f, as_version=4)

    # Find the first markdown cell with the hospital table
    table_md_header = "| State    | Hospital Name"
    for cell in nb.cells:
        if cell.cell_type == 'markdown' and table_md_header in cell.source:
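            lines = cell.source.splitlines()
            # Find start and end of the table
            table_start = next(i for i, l in enumerate(lines) if l.strip().startswith(table_md_header))
            body_start = table_start + 2  # skip header and separator
            # The table ends at the first line that is not a table row
            table_end = next(
                (i for i in range(body_start, len(lines)) if not lines[i].strip().startswith('|')),
                len(lines),
            )
            table_rows = lines[body_start:table_end]
            existing_names = set()
            for row in table_rows:
                parts = row.split('|')
                if len(parts) > 2:
                    existing_names.add(parts[2].strip())
            # Add new hospitals not already in the table
            new_rows = []
            for _, row in hospitals_df.iterrows():
                if row['name'] not in existing_names:
                    # six cells to match the header: State, Hospital Name, Zipcode, Date, File Size, Link
                    new_rows.append(f"| {row['state']} | {row['name']} | {row['zip']} |  |  |  |")
            # Insert new rows directly after the last table row, keeping any text that follows the table
            updated_lines = lines[:table_end] + new_rows + lines[table_end:]
            cell.source = "\n".join(updated_lines)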
            break

    # Save the updated notebook
    with open(notebook_path, 'w', encoding='utf-8') as f:
        nbformat.write(nb, f)


***
## North Carolina Hospitals

In [None]:

# ======================================================================
# --------------- DUKE HOSPITAL TESTING ----------------
# ======================================================================

# Grab row for Duke Hospital in Durham, NC
hos_name = 'Duke University Hospital'
matching_hospitals = hospitals_df[hospitals_df['name'] == hos_name]
if matching_hospitals.empty:
    # fail early: everything below needs this row's json_path
    raise ValueError(f"Hospital '{hos_name}' not found in the dataset")
duke_row = matching_hospitals.iloc[0]

# grab json path for Duke Hospital
duke_json_path = duke_row['json_path']

# load a single csv file from csv_folder for testing
test_csv_path = os.path.join(csv_folder, 'DukeHospital_Durham.csv')
duke_df = pd.read_csv(test_csv_path)

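# Change all mixed type columns to string to avoid dtype issues, keeping missing values as NaN
# (a plain astype(str) turns NaN into the string 'nan', which defeats the notna() checks below)
for col in duke_df.columns:
    if duke_df[col].dtype == 'object':
        duke_df[col] = duke_df[col].astype(str).where(duke_df[col].notna())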


# remove duke_df Hospital, City, State, Address columns before converting to parquet
duke_df = duke_df.drop(columns=['Hospital', 'City', 'State', 'Address'])


#code_cols = ['code_1', 'code_2', 'code_3', 'code_4']
# Check matches for each code column against hcpcs_codes, cpt_codes, and lab_codes, iteratively
# for col in code_cols:
#     print(f"Checking matches for column: {col}")
#     hcpcs_matches = duke_df[duke_df[col].isin(hcpcs_codes['HCPCS Code'])]
#     cpt_matches = duke_df[duke_df[col].isin(cpt_codes['HCPCS Code'])]
#     lab_matches = duke_df[duke_df[col].isin(lab_codes['HCPCS Code'])]
#     print(f"  HCPCS matches: {len(hcpcs_matches)}")
#     print(f"  CPT matches: {len(cpt_matches)}")
#     print(f"  Lab matches: {len(lab_matches)}")

"""

    This actually shows that code_2 contains HCPCS codes and code_3 contains CPT codes
    Checking matches for column: code_1
        HCPCS matches: 0
        CPT matches: 0
        Lab matches: 0
    Checking matches for column: code_2
        HCPCS matches: 76966
        CPT matches: 0
        Lab matches: 19987
    Checking matches for column: code_3
        HCPCS matches: 0
        CPT matches: 772
        Lab matches: 0
    Checking matches for column: code_4
        HCPCS matches: 0
        CPT matches: 0
        Lab matches: 0

"""

# Duke Hospital CM Structure
# code_2/code_3 [columns 3, 5 --> 4, 6 contain type] contain HCPCS and CPT codes, so we use those for comparison against the top 200 lists
# Columns 13-24 contain payer, plan, and pricing info, so we want all of those as well as column 0 which is the 
# description of the code [used for regex matching on the front end]
# final columns to keep: 0, 3-6, 13-24
duke_df = duke_df.iloc[:, [0] + list(range(3, 7)) + list(range(13, 25))]

# actually lets go ahead and drop some columns to conserve space
duke_df = duke_df.drop(columns=['standard_charge_algorithm', 'additional_generic_notes'])

# now we can search duke_df['code_2'] and duke_df['code_2_type'] against hcpcs_codes , cpt_codes, and lab_codes
# first search hcpcs_codes
hcpcs_matches = duke_df[duke_df['code_2'].isin(hcpcs_codes['HCPCS Code'])]
cpt_matches = duke_df[duke_df['code_3'].isin(cpt_codes['HCPCS Code'])]
lab_matches = duke_df[duke_df['code_2'].isin(lab_codes['HCPCS Code'])]

# Combine all matches into one dataframe, drop duplicates
match_dfs = [df for df in [hcpcs_matches, cpt_matches, lab_matches] if not df.empty]

# Apply standardization to all_matches if not empty
if match_dfs:
    all_matches = pd.concat(match_dfs, ignore_index=True).drop_duplicates()
    all_matches['payer_name'] = all_matches['payer_name'].apply(standardize_payer_name)

    # redrop possible duplicates
    all_matches = all_matches.drop_duplicates()
else:
    # Create empty DataFrame with same structure as duke_df if no matches
    all_matches = pd.DataFrame(columns=duke_df.columns)

# There are some duplicate issues, mainly rows where no est. price is given, so let's remove entries that don't have est. prices
all_matches = all_matches[all_matches['estimated_amount'].notna() & (all_matches['estimated_amount'] != '')]

# Now we need to combine the columns, code_2 is generally more important so we save that over 3 if they're both present
# Combine code_2/code_2_type and code_3/code_3_type into 'code' and 'type'
def select_code(row):
    if pd.notna(row['code_2']) and row['code_2'] != '':
        return pd.Series({'code': row['code_2'], 'type': row['code_2_type']})
    elif pd.notna(row['code_3']) and row['code_3'] != '':
        return pd.Series({'code': row['code_3'], 'type': row['code_3_type']})
    else:
        return pd.Series({'code': None, 'type': None})

all_matches[['code', 'type']] = all_matches.apply(select_code, axis=1)
all_matches = all_matches.drop(columns=['code_2', 'code_2_type', 'code_3', 'code_3_type'])
all_matches = all_matches.drop_duplicates()

# Save output data to json file for Duke json path
all_matches.to_json(duke_json_path, orient='records', lines=True)

# drop file/df from memory to save space
del duke_df
del duke_row
del test_csv_path
del all_matches

print("Duke Hospital test processing complete.")
# ======================================================================




Duke Hospital test processing complete.
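
Because the cell above saves with `orient='records', lines=True`, the price file holds one JSON object per line rather than a single JSON array. A small sketch of reading that format back and pulling the prices for one procedure; the helper names and the sample values are placeholders, not real Duke data.

```python
import io
import json

def read_json_lines(stream):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]

def prices_for_code(records, code):
    """Collect (payer, plan, amount) tuples for one procedure code."""
    return [
        (r.get("payer_name"), r.get("plan_name"), r.get("estimated_amount"))
        for r in records
        if str(r.get("code")) == str(code)
    ]

# Stand-in for a hospital's price file; values are placeholders.
sample = io.StringIO(
    '{"code": "99213", "payer_name": "Aetna", "plan_name": "PPO", "estimated_amount": 180.0}\n'
    '{"code": "36415", "payer_name": "Aetna", "plan_name": "PPO", "estimated_amount": 9.0}\n'
)
records = read_json_lines(sample)
visit_prices = prices_for_code(records, "99213")
```

Anything consuming these files has to parse them line by line; feeding the whole file to a regular JSON parser fails as soon as there is more than one record.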


In [None]:
# ======================================================================
# --------------- ADVENTHEALTH HOSPITAL  ----------------
# ======================================================================

# Load AdventHealth Hendersonville, NC paths
hos_name = 'AdventHealth'
city_name = 'Hendersonville'
state_name = 'NC'
matching_hospitals = hospitals_df[
    (hospitals_df['name'] == hos_name) &
    (hospitals_df['state'] == state_name) &
    (hospitals_df['city'] == city_name)
]
if matching_hospitals.empty:
    # fail early: everything below needs this row's json_path
    raise ValueError(f"Hospital '{hos_name}' not found in the dataset")
adv_nc = matching_hospitals.iloc[0]

# grab json path for AdventHealth Hendersonville, NC
adv_nc_json_path = adv_nc['json_path']

# load a single csv file from csv_folder for testing
adv_nc_csv_path = os.path.join(csv_folder, 'AdventHealth_Hendersonville_CM.csv')

# load AdventHealth Hendersonville, NC csv
adv_nc_df = pd.read_csv(adv_nc_csv_path)

# AdventHealth CM Structure
# ['description', 'drug_information', 'code', 'type',
#    'standard_charge_min', 'standard_charge_max', 'gross_charge',
#    'discounted_cash', 'setting', 'payer_name', 'plan_name',
#    'standard_charge_dollar', 'standard_charge_percentage',
#    'estimated_amount', 'methodology', 'standard_charge_algorithm',
#    'Hospital', 'City', 'State', 'Address']

# Check matches for code column against hcpcs_codes, cpt_codes, and lab_codes
# Output: 
# HCPCS matches: 20845
#   CPT matches: 2884
#   Lab matches: 13555

# Let's drop unneeded columns and rename some before grabbing the matches and saving to json
# NOTE: common naming convention needs to be added before renaming cols
cols_to_drop = ['methodology', 'drug_information', 'standard_charge_algorithm', 'Hospital', 'City', 'State', 'Address']
adv_nc_df = adv_nc_df.drop(columns=cols_to_drop)

# now grab matches
hcpcs_matches = adv_nc_df[adv_nc_df['code'].isin(hcpcs_codes['HCPCS Code'])]
cpt_matches = adv_nc_df[adv_nc_df['code'].isin(cpt_codes['HCPCS Code'])]
lab_matches = adv_nc_df[adv_nc_df['code'].isin(lab_codes['HCPCS Code'])]

# Combine all matches into one dataframe, drop duplicates
match_dfs = [df for df in [hcpcs_matches, cpt_matches, lab_matches] if not df.empty]

if match_dfs:
    all_matches = pd.concat(match_dfs, ignore_index=True).drop_duplicates()
    all_matches['payer_name'] = all_matches['payer_name'].apply(standardize_payer_name)

    # redrop possible duplicates
    all_matches = all_matches.drop_duplicates()

else:
    # Create empty DataFrame with same structure as adv_nc_df if no matches
    # (duke_df was deleted at the end of the Duke cell, so it can't be used here)
    all_matches = pd.DataFrame(columns=adv_nc_df.columns)

# There are some duplicate issues, mainly rows where no est. price is given, so let's remove entries that don't have est. prices
all_matches = all_matches[all_matches['estimated_amount'].notna() & (all_matches['estimated_amount'] != '')]

# Save output data to json file for AdventHealth Hendersonville, NC json path
all_matches.to_json(adv_nc_json_path, orient='records', lines=True)

# drop file/df from memory to save space
del adv_nc_df
del adv_nc
del adv_nc_csv_path
del all_matches

print("AdventHealth Hendersonville, NC test processing complete.")
# ======================================================================



AdventHealth Hendersonville, NC test processing complete.


In [None]:
# ======================================================================
# ---------------  UNC REX HOSPITAL  ----------------
# ======================================================================
# Lets start adding info to the CSV in python rather than manual edits each time
# Load UNC Rex Hospital in Raleigh, NC paths
from scripts.cleaners import transform_wide_to_long_format
from scripts.bundle_validation import ValidateJSON

hos_name = 'UNC Rex Hospital'

# add hospital info for UNC Rex to hospitals.csv
address = '4420 Lake Boone Trail'
city_name = 'Raleigh'
state_name = 'NC'
zip_code = '27607'

# update hospitals_df with new entry if it doesn't already exist
# move this to be a function later
if hospitals_df[
    (hospitals_df['name'] == hos_name) &
    (hospitals_df['state'] == state_name) &
    (hospitals_df['city'] == city_name)
].empty:
    new_entry = {
        'name': hos_name,
        'address': address,
        'city': city_name,
        'state': state_name,
        'zip': zip_code
    }
    hospitals_df = pd.concat([hospitals_df, pd.DataFrame([new_entry])], ignore_index=True)
    hospitals_df.to_csv(hospitals_csv, index=False)  # Save updated CSV
    print(f"Added new hospital entry for '{hos_name}' to hospitals.csv")

    # now run the update_dataframe function to add lat/lon, id, and json_path
    update_dataframe(hospitals_df)

# now lets grab all the paths
unc_rex_json_path = hospitals_df[
    (hospitals_df['name'] == hos_name) & 
    (hospitals_df['state'] == state_name) & 
    (hospitals_df['city'] == city_name)
].iloc[0]['json_path']
unc_rex_csv_path = os.path.join(csv_folder, 'UNCREX_CM.csv')

# load UNC Rex Hospital csv
unc_rex_df = pd.read_csv(unc_rex_csv_path)

# UNC Rex CM Structure: COLUMNS -->
# description, code|1, code|1|type, code|2, code|2|type, code|3, code|3|type, billing_class, setting,
# drug_unit_of_measurement, drug_type_of_measurement, modifiers, standard_charge|gross,
# standard_charge|discounted_cash, standard_charge|min, standard_charge|max, additional_generic_notes,
# standard_charge|AETNA|CHOICE POS|negotiated_dollar, standard_charge|AETNA|CHOICE
# POS|negotiated_percentage, standard_charge|AETNA|CHOICE POS|negotiated_algorithm,
# standard_charge|AETNA|CHOICE POS|methodology, estimated_amount|AETNA|CHOICE POS,
# additional_payer_notes|AETNA|CHOICE POS, ... etc for other payers/plans

# here you can see we have a new type of CM structure: rather than many row entries for each code/payer/plan,
# we have one row per code with multiple columns for each payer/plan combination, so we need to alter our approach to
# processing the data

# Apply the transformation
print("Transforming UNC Rex data from wide to long format...")
unc_rex_transformed = transform_wide_to_long_format(unc_rex_df)

print(f"Original shape: {unc_rex_df.shape}")
print(f"Transformed shape: {unc_rex_transformed.shape}")

# Great now we can start searching the codes against our top 200 lists
# now grab matches

"""
Column code_1: HCPCS matches: 0, CPT matches: 128, Lab matches: 0
Column code_2: HCPCS matches: 2746, CPT matches: 0, Lab matches: 3719
Column code_3: HCPCS matches: 0, CPT matches: 0, Lab matches: 0
"""
unc_rex_transformed.drop(['code_3', 'code_3_type'], axis=1, inplace=True)  # drop unused code_3 columns

# lets rename the code being grabbed to be code, and type accordingly during the match process
# HCPCS matches: filter on code_2, set code='code_2', type='code_2_type'
hcpcs_matches = unc_rex_transformed[unc_rex_transformed['code_2'].isin(hcpcs_codes['HCPCS Code'])].copy()
hcpcs_matches['code'] = hcpcs_matches['code_2']
hcpcs_matches['type'] = hcpcs_matches['code_2_type']
hcpcs_matches = hcpcs_matches.drop(columns=['code_1', 'code_1_type', 'code_2', 'code_2_type'])

# CPT matches: filter on code_1, set code='code_1', type='code_1_type'
cpt_matches = unc_rex_transformed[unc_rex_transformed['code_1'].isin(cpt_codes['HCPCS Code'])].copy()
cpt_matches['code'] = cpt_matches['code_1']
cpt_matches['type'] = cpt_matches['code_1_type']
cpt_matches = cpt_matches.drop(columns=['code_1', 'code_1_type', 'code_2', 'code_2_type'])

# Lab matches: filter on code_2, set code='code_2', type='code_2_type'
lab_matches = unc_rex_transformed[unc_rex_transformed['code_2'].isin(lab_codes['HCPCS Code'])].copy()
lab_matches['code'] = lab_matches['code_2']
lab_matches['type'] = lab_matches['code_2_type']
lab_matches = lab_matches.drop(columns=['code_1', 'code_1_type', 'code_2', 'code_2_type'])

# Combine all matches into one dataframe, drop duplicates
match_dfs = [df for df in [hcpcs_matches, cpt_matches, lab_matches] if not df.empty]

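if match_dfs:
    all_matches = pd.concat(match_dfs, ignore_index=True).drop_duplicates()
    all_matches['payer_name'] = all_matches['payer_name'].apply(standardize_payer_name)

    # redrop possible duplicates
    all_matches = all_matches.drop_duplicates()
else:
    # no matches: keep an empty frame with the final columns so the steps below still run
    all_matches = hcpcs_matches.iloc[0:0].copy()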

# now lets drop some more columns to save space
cols_to_drop = ['billing_class', 'drug_unit_of_measurement', 'drug_type_of_measurement', 
                'modifiers', 'standard_charge_algorithm', 'additional_generic_notes', 'methodology']
all_matches = all_matches.drop(columns=cols_to_drop, errors='ignore')


# Now we can save to json
all_matches.to_json(unc_rex_json_path, orient='records', lines=True)

# Lets validate the json file 
validator = ValidateJSON(unc_rex_json_path)

# drop file/df from memory to save space
del unc_rex_df
del unc_rex_transformed
del all_matches
del validator
del unc_rex_csv_path
del unc_rex_json_path

print("UNC Rex Hospital test processing complete.")

Transforming UNC Rex data from wide to long format...
Original shape: (160860, 215)
Transformed shape: (1057313, 24)
Matched 8312 / 8503 rows.
service_config.json NOT updated (no misses requiring changes).
Skipped 191 non-plausible codes (unique 2). Examples: 731.0, 790.0
UNC Rex Hospital test processing complete.
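
To make the wide-to-long idea concrete, here is a pure-Python sketch that expands a single wide row into one record per payer/plan, based on the `standard_charge|PAYER|PLAN|field` and `estimated_amount|PAYER|PLAN` column patterns listed in the cell above. It is not the `transform_wide_to_long_format` function from `scripts.cleaners`; the helper name and sample values are placeholders.

```python
def wide_row_to_long(row, base_columns):
    """Expand one wide chargemaster row into one record per payer/plan."""
    base = {col: row.get(col) for col in base_columns}
    by_plan = {}
    for column, value in row.items():
        parts = column.split("|")
        # standard_charge|PAYER|PLAN|field  ->  field for (PAYER, PLAN)
        if len(parts) == 4 and parts[0] == "standard_charge":
            _, payer, plan, field = parts
        # estimated_amount|PAYER|PLAN  ->  estimated_amount for (PAYER, PLAN)
        elif len(parts) == 3 and parts[0] == "estimated_amount":
            _, payer, plan = parts
            field = "estimated_amount"
        else:
            continue  # base columns and anything unrecognized are skipped here
        by_plan.setdefault((payer, plan), {})[field] = value
    return [
        {**base, "payer_name": payer, "plan_name": plan, **fields}
        for (payer, plan), fields in by_plan.items()
    ]

# Placeholder row; the amounts are made up.
wide = {
    "description": "Office visit",
    "code|1": "99213",
    "standard_charge|gross": "250",
    "standard_charge|AETNA|CHOICE POS|negotiated_dollar": "180",
    "estimated_amount|AETNA|CHOICE POS": "175",
    "standard_charge|CIGNA|PPO|negotiated_dollar": "190",
}
long_rows = wide_row_to_long(wide, base_columns=["description", "code|1", "standard_charge|gross"])
```

Each wide row turns into as many long rows as it has payer/plan combinations, which is why the row count grows so much after the transformation.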


In [None]:
# ======================================================================
# ---------------  WAKE MED NORTH HOSPITAL  ----------------
# ======================================================================

# New added helper functions should reduce code bloat and need to rerun the initial cells
import os
import pandas as pd
from scripts.cleaners import transform_wide_to_long_format
from scripts.helpers import add_hospital_entry
from scripts.code_matcher import get_matches
from scripts.cleaners import standardize_payer_name

workspace_root = os.path.dirname(os.path.abspath('CLEAR.ipynb'))
csv_folder = os.path.join(workspace_root, '..', 'ChargeMaster_Project', 'csv_files')
csv_folder = os.path.abspath(csv_folder)

new_hos_dict = {
    'hospital_name': 'WakeMed North Hospital',
    'city_name': 'Raleigh',
    'state_name': 'NC',
    'address': '10000 Falls of Neuse Rd',
    'zip_code': '27614',
}
# add hospital info for WakeMed North Hospital to hospitals.csv if it doesn't already exist
hospitals_df = add_hospital_entry(new_hos_dict)

# now lets grab all the paths
wake_med_json_path = hospitals_df[hospitals_df['name'] == new_hos_dict['hospital_name']]['json_path'].values[0]
wake_med_csv_path = os.path.join(csv_folder, 'WAKEMED_NORTH_CM.csv')

# load WakeMed North Hospital csv
wake_med_df = pd.read_csv(wake_med_csv_path)

# WakeMed North CM Structure is in wide format, need to transform to long format first
wake_med_long_df = transform_wide_to_long_format(wake_med_df, verbose=True)

# Now grab matches using the new get_matches function
# NOTE: WakeMed North has next to no estimated_amount data
all_matches = get_matches(wake_med_long_df, verbose=True)

# Now apply the payer standardization
all_matches['payer_name'] = all_matches['payer_name'].apply(standardize_payer_name)

# Now we can save to json
all_matches.to_json(wake_med_json_path, orient='records', lines=True)

# drop file/df from memory to save space
# wake_med_long_df is kept because the check cells below still use it
del wake_med_df
del all_matches
del wake_med_csv_path
del wake_med_json_path
del new_hos_dict
del hospitals_df

print("WakeMed North Hospital test processing complete.")

Original DataFrame shape: 104112 rows × 211 columns
Available base columns: 19 out of 20 possible
Found 160 payer-specific columns
Found 32 unique payer/plan combinations
Columns being dropped: 181
Expected columns in result: 30
Final DataFrame shape: 681680 rows × 26 columns
Row multiplication factor: 6.55x (from 104112 to 681680)
Actual columns dropped: 185
Data efficiency: 0.81 (final data points / original data points)
Original total data points: 21,967,632
Final total data points: 17,723,680
Loaded the following code sets:
HCPCS Codes: 200 entries
Lab Codes: 100 entries
CPT Codes: 200 entries
All Codes: 469 entries
Identified code columns: ['code_1', 'code_2', 'code_3', 'code_4']
Column 'code_1': Found 1773 HCPCS matches, 940 Lab matches, 770 CPT matches.
Reassigning code_1 to 'code' and 'code_1_type' to 'type' for non-empty matches DataFrames.
Column 'code_1': Combined matches shape: (3483, 28)
Column 'code_1': Unique matches after dropping duplicates: 3154
Column 'code_2': Found

In [None]:
# test for matches
codes = ['code_1', 'code_2', 'code_3', 'code_4']

for col in codes:
    print(f"Checking matches for column: {col}")
    hcpcs_matches = wake_med_long_df[wake_med_long_df[col].isin(hcpcs_codes['HCPCS Code'])]
    cpt_matches = wake_med_long_df[wake_med_long_df[col].isin(cpt_codes['HCPCS Code'])]
    lab_matches = wake_med_long_df[wake_med_long_df[col].isin(lab_codes['HCPCS Code'])]
    print(f"  HCPCS matches: {len(hcpcs_matches)}")
    print(f"  CPT matches: {len(cpt_matches)}")
    print(f"  Lab matches: {len(lab_matches)}")

Checking matches for column: code_1
  HCPCS matches: 1773
  CPT matches: 770
  Lab matches: 940
Checking matches for column: code_2
  HCPCS matches: 9384
  CPT matches: 0
  Lab matches: 3248
Checking matches for column: code_3
  HCPCS matches: 0
  CPT matches: 32
  Lab matches: 0
Checking matches for column: code_4
  HCPCS matches: 0
  CPT matches: 0
  Lab matches: 0


In [None]:
wake_med_long_df['estimated_amount'].notna().sum()

39044

***
## South Carolina Hospitals

**OH BOY AM I PISSED ALREADY**

#### What’s Happening

* **APC codes (`code|1` with type = APC):**
  These rows have payer names, plan names, and estimated amounts. That’s why they’re the only rows showing price info. APC = Ambulatory Payment Classification, a CMS grouping for outpatient procedures.

* **HCPCS/CPT codes (in `code|2`, `code|3`, etc. with type = CPT/HCPCS):**
  These rows often have no payer, plan, or estimated amount attached. Instead, they are mapped *into* the APC buckets, which then carry the pricing/plan info.

* In other words: the hospital publishes payer-specific negotiated rates only at the APC level, while keeping CPT/HCPCS rows as “mappings” without dollar amounts.

#### How We Need to Work Around It

1. **Build a crosswalk (mapping):**

   * Use the `description` and `code|n` columns to connect CPT/HCPCS rows to their parent APC row (same description or grouping).
   * Then join those CPT/HCPCS codes to the APC rows that actually carry pricing.
     → This gives you a lookup where searching by CPT/HCPCS leads you to the APC (and thus the estimated amounts and plan names).

2. **Validate mapping:**

   * In practice, hospitals often list the CPT/HCPCS that roll up into each APC.
   * You’ll need to check whether identical descriptions (e.g., “Inj, aflibercept hd, 1 mg”) appear across APC-coded and CPT-coded rows, and merge them.

3. **Practical solution in analysis:**

   * Search by CPT -> Find matching description -> Get its APC -> Pull plan names and estimated amounts from that APC row.
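
The three steps above can be sketched with a single pandas merge. This is only a rough illustration on a toy frame: the column names (`code`, `code_type`, `description`, ...) and every code and price in it are made-up placeholders, and `build_crosswalk` is a hypothetical helper, not something that exists in `scripts/`.

```python
import pandas as pd

# Toy stand-in for a chargemaster; all codes and amounts are placeholders
cm = pd.DataFrame({
    "description": ["Inj, aflibercept hd, 1 mg", "Inj, aflibercept hd, 1 mg",
                    "Inj, aflibercept hd, 1 mg", "Level 1 imaging"],
    "code": ["9999", "9999", "J0000", "5521"],
    "code_type": ["APC", "APC", "HCPCS", "APC"],
    "payer_name": ["Payer A", "Payer B", None, "Payer A"],
    "plan_name": ["PPO", "HMO", None, "PPO"],
    "estimated_amount": [100.0, 120.0, None, 80.0],
})

def build_crosswalk(cm: pd.DataFrame) -> pd.DataFrame:
    """Attach APC-level prices to CPT/HCPCS rows that share a description."""
    # normalise descriptions so case/whitespace differences still join
    cm = cm.assign(desc_key=cm["description"].str.strip().str.lower())

    # rows that carry pricing (APC) vs rows that only carry the billing code
    apc = cm[cm["code_type"] == "APC"].rename(columns={"code": "apc_code"})
    cpt = cm[cm["code_type"].isin(["CPT", "HCPCS"])]

    # step 1: one row per CPT/HCPCS code and description
    crosswalk = cpt[["code", "code_type", "desc_key"]].drop_duplicates()

    # step 2: pull payer, plan and estimated amount from the matching APC rows
    priced = crosswalk.merge(
        apc[["desc_key", "apc_code", "payer_name", "plan_name", "estimated_amount"]],
        on="desc_key",
        how="left",  # keep unmapped codes so they can be reviewed
    )
    return priced.drop(columns="desc_key")

priced = build_crosswalk(cm)
print(priced[priced["code"] == "J0000"])
```

The left join keeps CPT/HCPCS codes that found no APC row, so `priced["apc_code"].isna()` gives a quick list of mappings that still need manual validation.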

#### On Legality


Not necessarily illegal. CMS’s **price transparency rule (2021–)** requires hospitals to publish these standard charges:

* Gross charges
* Discounted cash prices
* Payer-specific negotiated charges
* De-identified min/max negotiated charges

The rule also requires a consumer-friendly display of at least 300 shoppable services (including CPT/HCPCS).

Many hospitals comply only at the APC level (grouping multiple CPTs). This practice has been criticized as undermining the intent of transparency, but hospitals often argue it’s compliant because APCs are “billing codes.” Enforcement has been light, though CMS has fined some hospitals for noncompliance.

## Addendums

Link to the CMS quarterly addenda used for crosswalking: https://www.cms.gov/medicare/payment/prospective-payment-systems/hospital-outpatient-pps/quarterly-addenda-updates

In [None]:
# ======================================================================
# --------------- MUSC HEALTH  ---------------- ADDENDUM B NEEDED
# ======================================================================

import re
from scripts.cleaners import standardize_payer_name
from scripts.cleaners import apply_payer_standardization_to_json
from scripts.bundle_validation import ValidateJSON
from scripts.merge_cpt_to_apc import map_prices_to_hcpcs, load_addendum_b

# Grab row for MUSC Health in Charleston, SC
hos_name = 'MUSC Health'
matching_hospitals = hospitals_df[hospitals_df['name'] == hos_name]
if matching_hospitals.empty:
    # fail early, everything below needs the json_path from this row
    raise ValueError(f"Hospital '{hos_name}' not found in the dataset")
musc_row = matching_hospitals.iloc[0]

# grab json path for MUSC Health
musc_json_path = musc_row['json_path']

# load a single csv file from csv_folder for testing
musc_csv_path = os.path.join(csv_folder, 'MUSC_Health_Medical_Center_CM.csv')
musc_df = pd.read_csv(musc_csv_path)

# Change all mixed type columns to string to avoid dtype issues (focus codes columns only)
code_cols = [c for c in musc_df.columns if c.startswith("code|") or c.startswith("code_")]
musc_df[code_cols] = musc_df[code_cols].astype(str)

# ======================================================
# ADDENDUM B LOADING
# ======================================================

# load addendum b for mapping, save folder as csv files are stored locally outside of CLEAR repo
addendum_b_path = os.path.join(csv_folder, '2025_Web_Addendum_B.csv')

# load addendum b
addendum_b = load_addendum_b(addendum_b_path)

# Map prices to hcpcs codes in musc_df
musc_df = map_prices_to_hcpcs(musc_df, addendum_b, expand=True)

# replace column name instances with | to _, code|1 becomes code_1, etc
musc_df.columns = [re.sub(r'\|', '_', col) for col in musc_df.columns]

# Code structure is similar to Duke with there being multiple code columns
"""
Checking matches for column: code_1
  HCPCS matches: 0
  CPT matches: 0
  Lab matches: 0
Checking matches for column: code_2
  HCPCS matches: 0
  CPT matches: 10
  Lab matches: 0
Checking matches for column: code_3
  HCPCS matches: 8
  CPT matches: 0
  Lab matches: 323
Checking matches for column: code_4
  HCPCS matches: 0
  CPT matches: 0
  Lab matches: 0

Unique values in code_1_type: ['APC' 'CDM' 'MS-DRG' 'NDC']
Unique values in code_2_type: ['nan' 'RC']
Unique values in code_3_type: ['nan' 'HCPCS']
Unique values in code_4_type: ['nan' 'NDC']

"""

# now grab matches
hcpcs_matches = musc_df[musc_df['code_3'].isin(hcpcs_codes['HCPCS Code'])]
cpt_matches = musc_df[musc_df['code_2'].isin(cpt_codes['HCPCS Code'])]
lab_matches = musc_df[musc_df['code_3'].isin(lab_codes['HCPCS Code'])]

match_dfs = [df for df in [hcpcs_matches, cpt_matches, lab_matches] if not df.empty]  

all_matches = pd.concat(match_dfs, ignore_index=True)

# drop all rows where payer_name is null/empty (don't worry about est. price yet), then drop duplicates
all_matches = all_matches[(all_matches['payer_name'].notna() & (all_matches['payer_name'] != ''))]
all_matches = all_matches.drop_duplicates()

# Because of the mapping, code types are trickier now, especially for keeping records unique.
# For now use code_3/code_3_type if present, else code_2/code_2_type for the code/type columns.
# NOTE: astype(str) above turned missing codes into the string 'nan', so treat that as empty too.
def select_code(row):
    for col in ('code_3', 'code_2'):
        value = row[col]
        if pd.notna(value) and value not in ('', 'nan'):
            return pd.Series({'code': value, 'type': row[f'{col}_type']})
    return pd.Series({'code': None, 'type': None})
    
if not all_matches.empty:
    all_matches[['code', 'type']] = all_matches.apply(select_code, axis=1)
    all_matches = all_matches.drop(columns=['code_2', 'code_2_type', 'code_3', 'code_3_type'])

# now lets apply payer standardization
all_matches['payer_name'] = all_matches['payer_name'].apply(standardize_payer_name)

# Now let's remove entries that don't have est. prices
all_matches = all_matches[all_matches['estimated_amount'].notna() & (all_matches['estimated_amount'] != '')]

# finally export to json
all_matches.to_json(musc_json_path, orient='records', lines=True)

# Validate the final json
validator = ValidateJSON(musc_json_path)

# drop file/df from memory to save space
del musc_df
del musc_row
del musc_csv_path
del all_matches

print("MUSC Health test processing complete.")

# ======================================================================

Matched 1711 / 2015 rows.
service_config.json UPDATED (1 bundle(s) changed). Backup saved.
MUSC Health test processing complete.


***
## Virginia

This will be the first time testing the all-in-one code book, where we crosswalked with the RVU file to get info about CPT codes, then updated their descriptions and compiled everything into a single doc. Here is the output from testing:

#### Original Method:
```python
# column 1 matches
hcpcs_matches = inova_transformed[inova_transformed['code_1'].isin(hcpcs_codes['HCPCS Code'])]
cpt_matches = inova_transformed[inova_transformed['code_1'].isin(cpt_codes['HCPCS Code'])]
lab_matches = inova_transformed[inova_transformed['code_1'].isin(lab_codes['HCPCS Code'])]

# column 2 matches
hcpcs_matches_2 = inova_transformed[inova_transformed['code_2'].isin(hcpcs_codes['HCPCS Code'])]
cpt_matches_2 = inova_transformed[inova_transformed['code_2'].isin(cpt_codes['HCPCS Code'])]
lab_matches_2 = inova_transformed[inova_transformed['code_2'].isin(lab_codes['HCPCS Code'])]

# print out match counts for each
print(f"Column code_1: HCPCS matches: {len(hcpcs_matches)}, CPT matches: {len(cpt_matches)}, Lab matches: {len(lab_matches)}")
print(f"Column code_2: HCPCS matches: {len(hcpcs_matches_2)}, CPT matches: {len(cpt_matches_2)}, Lab matches: {len(lab_matches_2)}")
```
- Column code_1: HCPCS matches: 0, CPT matches: 156, Lab matches: 0
- Column code_2: HCPCS matches: 8054, CPT matches: 4275, Lab matches: 327

#### New Spreadsheet Method:

```python
# split the combined code book by source
hcpcs_filtered_codes = all_codes[all_codes['source'] == 'HCPCS']
cpt_filtered_codes = all_codes[all_codes['source'] == 'CPT']
lab_filtered_codes = all_codes[all_codes['source'] == 'Lab']

# column 1 matches
hcpcs_matches = inova_transformed[inova_transformed['code_1'].isin(hcpcs_filtered_codes['code'])]
cpt_matches = inova_transformed[inova_transformed['code_1'].isin(cpt_filtered_codes['code'])]
lab_matches = inova_transformed[inova_transformed['code_1'].isin(lab_filtered_codes['code'])]

# column 2 matches
hcpcs_matches_2 = inova_transformed[inova_transformed['code_2'].isin(hcpcs_filtered_codes['code'])]
cpt_matches_2 = inova_transformed[inova_transformed['code_2'].isin(cpt_filtered_codes['code'])]
lab_matches_2 = inova_transformed[inova_transformed['code_2'].isin(lab_filtered_codes['code'])]

# print out match counts for each
print(f"Column code_1: HCPCS matches: {len(hcpcs_matches)}, CPT matches: {len(cpt_matches)}, Lab matches: {len(lab_matches)}")
print(f"Column code_2: HCPCS matches: {len(hcpcs_matches_2)}, CPT matches: {len(cpt_matches_2)}, Lab matches: {len(lab_matches_2)}")
```
- Column code_1: HCPCS matches: 0, CPT matches: 0, Lab matches: 0
- Column code_2: HCPCS matches: 8054, CPT matches: 2345, Lab matches: 288

We can see that the original method catches more CPT and Lab matches, but there is a chance that it's catching codes that are deprecated. This needs to be investigated further but I already have enough on my plate.
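
A possible starting point for that investigation is to list the codes that the original lists contain but the combined code book does not. The sketch below assumes the same column names used above (`'HCPCS Code'` in the old lists, `'code'` and `'source'` in `all_codes`); the toy frames and the helper `codes_missing_from_codebook` are hypothetical and only there to make it runnable.

```python
import pandas as pd

# Toy stand-ins for the real code lists; the values are made up
cpt_codes = pd.DataFrame({"HCPCS Code": ["00001", "00002", "00003"]})
all_codes = pd.DataFrame({
    "code": ["00001", "00003", "X0001"],
    "source": ["CPT", "CPT", "HCPCS"],
})

def codes_missing_from_codebook(old_list, codebook, source):
    """Return codes present in the old list but absent from the code book for one source."""
    old = set(old_list["HCPCS Code"].astype(str).str.strip())
    new = set(codebook.loc[codebook["source"] == source, "code"].astype(str).str.strip())
    return sorted(old - new)

missing_cpt = codes_missing_from_codebook(cpt_codes, all_codes, "CPT")
print(missing_cpt)
```

Filtering `inova_transformed` down to the codes in `missing_cpt` would show which rows account for the gap. Whether those codes are actually deprecated still has to be checked against a current code list.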

In [None]:
# =====================================================================
#  Inova Fairfax Hospital 
# =====================================================================

from scripts.cleaners import transform_wide_to_long_format

hos_name = 'Inova Fairfax Hospital'
address = '3300 Gallows Rd'
city_name = 'Falls Church'
state_name = 'VA'
zip_code = '22042'

# update hospitals_df with new entry if it doesn't already exist
# move this to be a function later
if hospitals_df[
    (hospitals_df['name'] == hos_name) &
    (hospitals_df['state'] == state_name) &
    (hospitals_df['city'] == city_name)
].empty:
    new_entry = {
        'name': hos_name,
        'address': address,
        'city': city_name,
        'state': state_name,
        'zip': zip_code
    }
    hospitals_df = pd.concat([hospitals_df, pd.DataFrame([new_entry])], ignore_index=True)
    hospitals_df.to_csv(hospitals_csv, index=False)  # Save updated CSV
    print(f"Added new hospital entry for '{hos_name}' to hospitals.csv")

    # now run the update_dataframe function to add lat/lon, id, and json_path
    update_dataframe(hospitals_df)

# now lets grab all the paths
inova_json_path = hospitals_df[hospitals_df['name'] == hos_name]['json_path'].values[0]
inova_csv_path = os.path.join(csv_folder, 'INOVA_FAIRFAX_CM.csv')

# load Inova Fairfax Hospital csv
inova_df = pd.read_csv(inova_csv_path)

# Inova is another wide format CM structure, so we need to transform it
# Apply the transformation
print("Transforming Inova Fairfax data from wide to long format...")
inova_transformed = transform_wide_to_long_format(inova_df)

# Original shape: (19293, 248)
# Transformed shape: (145308, 16)

# Now to grab matches
# The bulk of the matches are in code_2/code_2_type, so we will just use those for now due to limited space
# HCPCS matches
hcpcs_matches = inova_transformed[inova_transformed['code_2'].isin(hcpcs_codes['HCPCS Code'])].copy()
hcpcs_matches['code'] = hcpcs_matches['code_2']
hcpcs_matches['type'] = hcpcs_matches['code_2_type']
hcpcs_matches = hcpcs_matches.drop(columns=['code_1', 'code_1_type', 'code_2', 'code_2_type'])

# CPT matches (filter on code_2 too, since code/type are taken from code_2 below)
cpt_matches = inova_transformed[inova_transformed['code_2'].isin(cpt_codes['HCPCS Code'])].copy()
cpt_matches['code'] = cpt_matches['code_2']
cpt_matches['type'] = cpt_matches['code_2_type']
cpt_matches = cpt_matches.drop(columns=['code_1', 'code_1_type', 'code_2', 'code_2_type'])

# Lab matches
lab_matches = inova_transformed[inova_transformed['code_2'].isin(lab_codes['HCPCS Code'])].copy()
lab_matches['code'] = lab_matches['code_2']
lab_matches['type'] = lab_matches['code_2_type']
lab_matches = lab_matches.drop(columns=['code_1', 'code_1_type', 'code_2', 'code_2_type'])

# Combine all matches into one dataframe, drop duplicates
match_dfs = [df for df in [hcpcs_matches, cpt_matches, lab_matches] if not df.empty]
if match_dfs:
    all_matches = pd.concat(match_dfs, ignore_index=True).drop_duplicates()
    all_matches['payer_name'] = all_matches['payer_name'].apply(standardize_payer_name)

    # redrop possible duplicates
    all_matches = all_matches.drop_duplicates()

# now lets drop some more columns to save space
cols_to_drop = ['billing_class', 'drug_unit_of_measurement', 'drug_type_of_measurement', 
                'modifiers', 'standard_charge_algorithm', 'additional_generic_notes', 'methodology']
all_matches = all_matches.drop(columns=cols_to_drop, errors='ignore')

# Now we can save to json
all_matches.to_json(inova_json_path, orient='records', lines=True)

# drop file/df from memory to save space
del inova_df
del inova_transformed
del all_matches
del inova_csv_path
del inova_json_path

print("Inova Fairfax Hospital test processing complete.")

Transforming Inova Fairfax data from wide to long format...
Inova Fairfax Hospital test processing complete.


***
## Medicare Pricing Integration

Testing the new enhanced pricing reader that can automatically parse CMS pricing files and match codes from our top 200 lists to Medicare rates. This includes ASC, ASP, CLFS, DMEPOS, PFALL pricing sources plus calculated anesthesia rates.
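
The matching idea can be sketched roughly as below. This is not the code in `scripts.enhanced_pricing_reader`: it assumes each pricing file has already been parsed into a `{code: price}` dict, and it assumes a fixed priority order between sources, which may not be how the real reader resolves overlaps. All codes and prices in the toy tables are placeholders.

```python
import pandas as pd

# Hypothetical parsed pricing tables: {source name: {code: price}}
pricing_tables = {
    "PFALL": {"A0001": 31.50, "A0002": 12.00},
    "ASC": {"A0002": 240.00, "A0003": 75.25},
    "CLFS": {"A0004": 70.20},
}

# Assumed lookup order; the first source that lists a code wins
SOURCE_PRIORITY = ["PFALL", "ASC", "CLFS"]

def match_codes(codes, tables, priority):
    """Return (matched DataFrame, list of unmatched codes)."""
    matched, unmatched = [], []
    # de-duplicate the incoming codes while keeping their order
    for code in dict.fromkeys(str(c).strip() for c in codes):
        for source in priority:
            price = tables.get(source, {}).get(code)
            if price is not None:
                matched.append({"code": code, "price": price, "source": source})
                break
        else:
            # no source listed this code
            unmatched.append(code)
    return pd.DataFrame(matched, columns=["code", "price", "source"]), unmatched

matched, unmatched = match_codes(
    ["A0002", "A0004", "Z9999", "A0002"], pricing_tables, SOURCE_PRIORITY
)
print(matched.to_string(index=False))
print(f"Match rate: {len(matched) / (len(matched) + len(unmatched)):.1%}")
```

Keeping the unmatched codes as a separate list makes it easy to write them to their own file for review.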

In [None]:
# Test the enhanced pricing reader functionality
import importlib
import sys

# Reload the module to pick up changes
if 'scripts.enhanced_pricing_reader' in sys.modules:
    importlib.reload(sys.modules['scripts.enhanced_pricing_reader'])

from scripts.enhanced_pricing_reader import match_codes_to_pricing, test_pricing_reader

# First, let's test the individual pricing file readers
print("Testing pricing file readers...")
pricing_folder_path = os.path.join(workspace_root, '..', 'ChargeMaster_Project', 'pricing_info')
pricing_folder_path = os.path.abspath(pricing_folder_path)

# Run the test function
test_pricing_reader(pricing_folder_path)

Testing pricing file readers...
Testing individual parsers...
ASC test: 4723 records loaded
  Sample: 0101T = $240.17
ASP test: 871 records loaded
  Sample: 90653 = $83.49
CLFS test: 1926 records loaded
  Sample: 0001U = $720.0
DMEPOS test: 2059 records loaded
  Sample: A4216 = $0.97
PFALL test: 7481 records loaded
  Sample: G0011 = $31.56
Anesthesia test: 5 base units + 68 minutes = $208.62


In [None]:
# Now let's match our loaded codes to Medicare pricing
print("\n" + "="*60)
print("MATCHING CODES TO MEDICARE PRICING")
print("="*60)

# Prepare the list of code dataframes
code_dataframes = [hcpcs_codes, lab_codes, cpt_codes, all_codes]

# Match codes to pricing and create unified output
output_path = os.path.join(workspace_root, 'docs', 'data', 'medicare_pricing_matched.csv')
matched_pricing = match_codes_to_pricing(
    code_dataframes=code_dataframes,
    pricing_folder=pricing_folder_path,
    output_file=output_path,
    include_anesthesia=True
)

print(f"\nMatched pricing data shape: {matched_pricing.shape}")
if not matched_pricing.empty:
    print(f"Price range: ${matched_pricing['price'].min():.2f} - ${matched_pricing['price'].max():.2f}")
    print(f"Average price: ${matched_pricing['price'].mean():.2f}")
    
    # Show sample of results
    print(f"\nSample matched results:")
    print(matched_pricing.head(10)[['code', 'price', 'source']].to_string(index=False))


MATCHING CODES TO MEDICARE PRICING
Starting code matching process...
Parsing all pricing files...
ASC: 4723 records
ASP: 871 records
CLFS: 1926 records
DMEPOS: 2059 records
PFALL: 7481 records
Total unique codes to match: 478

Matching Results:
Total codes processed: 478
Successfully matched: 450
Unmatched codes: 28
Match rate: 94.1%

Pricing sources used:
  PFALL: 189 codes
  ASC: 96 codes
  CLFS: 79 codes
  ASP: 48 codes
  DMEPOS: 38 codes

Results saved to: c:\Users\jcing\OneDrive\Desktop\MADS\DATA 760\Project\CLEAR\docs\data\medicare_pricing_matched.csv
Unmatched codes saved to: c:\Users\jcing\OneDrive\Desktop\MADS\DATA 760\Project\CLEAR\docs\data\medicare_pricing_matched_unmatched.csv

Matched pricing data shape: (450, 4)
Price range: $0.01 - $14070.52
Average price: $457.20

Sample matched results:
 code   price source
87801  70.200   CLFS
Q5101   0.360    ASC
J0461   0.107    ASP
J7209   1.240    ASC
A6216   0.090 DMEPOS
A506