# CLEAR

Chargemaster Location-based Exploration for Affordability & Reform

This notebook is primarily responsible for pre-processing data, attaching important information, and generating the database files for the GitHub page. Below you can find all information about how data is processed from the downloaded `.csv` files found on most hospital sites. This is an exploratory project focused on creating interactive visualizations and tools to better inform people about their healthcare. The repo can be kept up to date by downloading the most current year's data for a specific hospital and running it through the scripts. Note that this is NOT a comprehensive list of hospitals, but it could potentially be scaled to a full standalone site with enough time.

All pre-processing code is written in Python. See the `.html` files for how the D3 visualizations work.

## How it works (Copied from README)

Hospitals that have been added to this 'web-app' are stored in a `.csv` file for quick look-up and ease of access. Each row points to the location of the hospital's Charge Master `.json` file, which is then queried for the specific procedure. Hospitals are gathered from the CSV list based on a radius look-up provided by the user. If a hospital in the radius does not offer the service, its price point is not displayed alongside the others in the radius.
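The radius look-up can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the site's actual front-end code: it assumes the hospital table has `lat` and `lon` columns (the preloading step in this notebook adds them), and the names `haversine_miles` and `hospitals_within_radius` are hypothetical. The demo rows and coordinates are made up for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; lat2/lon2 may be scalars or Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

def hospitals_within_radius(hospitals, user_lat, user_lon, radius_miles):
    """Return the rows of `hospitals` within `radius_miles`, nearest first."""
    dist = haversine_miles(user_lat, user_lon, hospitals['lat'], hospitals['lon'])
    out = hospitals.assign(distance_miles=dist)
    return out[out['distance_miles'] <= radius_miles].sort_values('distance_miles')

# Illustrative rows only (made-up coordinates, not taken from hospitals.csv)
demo = pd.DataFrame({
    'name': ['Hospital A', 'Hospital B'],
    'lat': [36.00, 35.60],
    'lon': [-78.94, -82.55],
})
nearby = hospitals_within_radius(demo, 35.99, -78.90, radius_miles=25)
print(nearby[['name', 'distance_miles']])
```

Only hospitals that survive this filter would then have their `.json` file queried for the requested procedure.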

Currently limited to 500 procedures due to file size limits and because I did not want to set up a database for this. Parquet only works server side, so I can't do iterative testing before publishing to Pages, and Pages deployments can take a while.

## List of Hospitals

These are the hospitals for which data has been gathered and processed thus far:

| State | Hospital Name            | Zipcode | Date    | File Size | Link |
|-------|--------------------------|---------|---------|-----------|------|
| NC    | Duke University Hospital | 27710   | 09/2025 | 3.32 GB   |      |
| NC    | Wake Med                 |         |         |           |      |
| NC    | REX UNC                  |         |         |           |      |

## Outside Sources Used

- zip_centroids.csv courtesy of SimpleMaps data https://simplemaps.com/data/us-zips.
- CMS.gov data for top 200 HCPCS and CPT codes billed for 2024 & top 100 lab codes. [Link](https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-fee-for-service-parts-a-b/medicare-utilization-part-b)



***

## Data Processing

CSV files are too large to store on GitHub, so they are downloaded locally, converted to the necessary format, then uploaded. If you want to perform conversions yourself, you will need to find the specific hospital's chargemaster and document it in the notebook accordingly.
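Because a single chargemaster can run to several gigabytes, one way to keep memory bounded during conversion is to stream the CSV in chunks and keep only the rows whose code is on the target list. The sketch below is an assumption about how this could be done, not what the cells in this notebook currently do; the function name `filter_csv_by_codes` and the demo column names are hypothetical.

```python
import io
import pandas as pd

def filter_csv_by_codes(csv_source, code_column, wanted_codes, chunksize=100_000):
    """Stream a large CSV and keep only rows whose code is in wanted_codes."""
    wanted = {str(c) for c in wanted_codes}
    kept = []
    # dtype=str keeps codes such as '0001' or 'G0008' intact
    for chunk in pd.read_csv(csv_source, dtype=str, chunksize=chunksize):
        kept.append(chunk[chunk[code_column].isin(wanted)])
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for a multi-gigabyte chargemaster file
demo_csv = io.StringIO(
    "description,code,estimated_amount\n"
    "Office visit,99214,150.00\n"
    "Some other service,12345,20.00\n"
    "Office visit,99213,95.00\n"
)
filtered = filter_csv_by_codes(demo_csv, 'code', ['99213', '99214'], chunksize=2)
print(filtered)
```

Reading everything as strings trades numeric dtypes for predictable code matching, so price columns would need an explicit conversion afterwards.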

Not all Charge Masters (CMs) are formatted the same. To keep this notebook from growing too large, custom Python scripts will be made for unique CMs. This matters because some hospitals are regional or statewide 'chains' but can vary prices between locations. For example,

**AdventHealth**
- AdventHealth Orlando
- AdventHealth Tampa
- AdventHealth Hendersonville

are all AdventHealth hospitals, but their prices and available procedures vary per location. However, the same script to clean and process their CMs works because the file structure doesn't change from location to location. Normally CM structure only changes from one hospital brand to another, but I haven't looked at the majority of US hospitals, so this statement might need to be amended.

Think of this file as more of a "**Controller**" for the cleaning, while the cleaning process itself is performed by imported functions. Subsections from here on are labeled by state; be sure to check which hospitals are in each subsection before uploading data.
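A minimal sketch of this controller pattern, assuming one cleaning function per hospital brand. The names `CLEANERS`, `register_cleaner`, `clean_adventhealth`, and `clean_chargemaster` are hypothetical and do not exist in the repo; the dropped columns are the ones the AdventHealth cell in this notebook drops.

```python
import pandas as pd

CLEANERS = {}  # brand name -> cleaning function (hypothetical registry)

def register_cleaner(brand):
    """Decorator that registers a brand-specific cleaning function."""
    def decorator(func):
        CLEANERS[brand] = func
        return func
    return decorator

@register_cleaner('AdventHealth')
def clean_adventhealth(df):
    # Same file layout at every AdventHealth location, so one cleaner covers them all
    cols_to_drop = ['methodology', 'drug_information', 'standard_charge_algorithm',
                    'Hospital', 'City', 'State', 'Address']
    return df.drop(columns=cols_to_drop, errors='ignore')

def clean_chargemaster(brand, df):
    """Controller entry point: dispatch to the cleaner registered for this brand."""
    if brand not in CLEANERS:
        raise KeyError(f"No cleaner registered for brand '{brand}'")
    return CLEANERS[brand](df)

demo = pd.DataFrame({'description': ['x'], 'code': ['99214'], 'City': ['Tampa']})
cleaned = clean_chargemaster('AdventHealth', demo)
print(cleaned.columns.tolist())
```

With a registry like this, adding a new brand only means writing one function and registering it; the per-state cells stay short.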



***
## Payer & Plan Names

Naming conventions for payers and plans differ across hospitals, which makes them hard to compare. Below is an extensive regex section that tries to normalize them so that the functionality of the `.html` page remains intact.

I'm not sure yet where this fits in; I'll add it to the documentation later.

In [42]:
# imports necessary for data regex searching and manipulation
import pandas as pd
import numpy as np
import re
import os

# csv's are stored locally outside of CLEAR repo
# set up one folder then into 'ChargeMaster_Project/csv_files/'
# get path to csv_files folder outside CLEAR repo
workspace_root = os.path.dirname(os.path.abspath('CLEAR.ipynb'))
csv_folder = os.path.join(workspace_root, '..', 'ChargeMaster_Project', 'csv_files')
csv_folder = os.path.abspath(csv_folder)

# load each CM file, keeping only the 'payer_name' and 'plan_name' columns,
# and save the unique values of each to text files
for file in os.listdir(csv_folder):
    if file.endswith('.csv'):
        file_path = os.path.join(csv_folder, file)
        # read only the header row first so the column check doesn't load the whole file
        header = pd.read_csv(file_path, nrows=0).columns
        if 'payer_name' in header and 'plan_name' in header:
            # load just the two needed columns as strings to save memory and avoid dtype issues
            df = pd.read_csv(file_path, usecols=['payer_name', 'plan_name'], dtype=str)
            unique_payers = df['payer_name'].dropna().unique()
            unique_plans = df['plan_name'].dropna().unique()
            # save unique payers and plans to text files for later use
            with open(os.path.join(csv_folder, f'{file}_unique_payers.txt'), 'w') as f:
                for payer in unique_payers:
                    f.write(f"{payer}\n")
            with open(os.path.join(csv_folder, f'{file}_unique_plans.txt'), 'w') as f:
                for plan in unique_plans:
                    f.write(f"{plan}\n")
            del df  # drop dataframe to save memory
        print(f"Processed {file}. Dropping dataframe from memory...")
print("Unique payer and plan names extracted and saved.")

Processed AdventHealth_Hendersonville_CM.csv. Dropping dataframe from memory...
Processed AdventHealth_Orlando_CM.csv. Dropping dataframe from memory...
Processed AdventHealth_Tampa_CM.csv. Dropping dataframe from memory...
Processed DukeHospital_Durham.csv. Dropping dataframe from memory...
Unique payer and plan names extracted and saved.


In [None]:
# now load all the text files from the same csv_files folder for each hospital
# and combine them into one master list of unique payers and plans
all_unique_payers = set()
all_unique_plans = set()
for file in os.listdir(csv_folder):
    if file.endswith('_unique_payers.txt'):
        with open(os.path.join(csv_folder, file), 'r') as f:
            payers = f.read().splitlines()
            all_unique_payers.update(payers)
    elif file.endswith('_unique_plans.txt'):
        with open(os.path.join(csv_folder, file), 'r') as f:
            plans = f.read().splitlines()
            all_unique_plans.update(plans)

# 

In [45]:
# Enhanced Payer name standardization function
def standardize_payer_name(payer_name):
    """
    Standardize payer names while preserving important distinctions like state-specific plans
    """
    if pd.isna(payer_name) or payer_name == '':
        return payer_name
    
    # Convert to string and strip whitespace
    name = str(payer_name).strip()
    
    # Remove trailing underscores and extra spaces
    name = re.sub(r'_+$', '', name)
    name = re.sub(r'\s+', ' ', name)
    
    # Standardize common insurance company names while preserving state distinctions
    standardization_patterns = [
        # Aetna variations
        (r'\bAETNA\b.*?(?:\[[\d]+\])?', 'AETNA'),
        (r'\bAetna.*?Health\b', 'AETNA'),
        (r'\bAetna.*?Better.*?Health\b', 'AETNA'),
        
        # Cigna variations
        (r'\bCIGNA\b.*?(?:\[[\d]+\])?', 'CIGNA'),
        (r'\bCigna.*?HealthCare\b', 'CIGNA'),
        (r'\bCigna.*?Health.*?Care\b', 'CIGNA'),
        
        # UHC/United variations
        (r'\bUHC\b.*?(?:\[[\d]+\])?', 'UNITED HEALTHCARE'),
        (r'\bUNITED\s+HEALTHCARE\b', 'UNITED HEALTHCARE'),
        (r'\bUNITED\s+HEALTH\s+GROUP\b', 'UNITED HEALTHCARE'),
        (r'\bUNITED\s+MEDICAL\s+RESOURCES.*CONTRACT\b', 'UNITED HEALTHCARE'),
        (r'\bUMR\b.*?(?:\[[\d]+\])?', 'UNITED HEALTHCARE'),
        (r'\bUNITED\s+OF\s+OMAHA\b', 'UNITED OF OMAHA'),  # Different company
        
        # Humana variations
        (r'\bHUMANA\b.*?(?:\[[\d]+\])?', 'HUMANA'),
        (r'\bHumana.*?Inc\b', 'HUMANA'),
        
        # Anthem/BCBS Anthem variations
        (r'\bANTHEM\b.*?(?:\[[\d]+\])?', 'ANTHEM'),
        (r'\bAnthem.*?Blue.*?Cross\b', 'ANTHEM BLUE CROSS'),
        
        # Kaiser variations
        (r'\bKAISER\b.*?(?:\[[\d]+\])?', 'KAISER PERMANENTE'),
        (r'\bKaiser.*?Permanente\b', 'KAISER PERMANENTE'),
        
        # Wellcare variations
        (r'\bWELLCARE\b.*?(?:\[[\d]+\])?', 'WELLCARE'),
        (r'\bWell.*?Care\b', 'WELLCARE'),
        
        # Molina variations
        (r'\bMOLINA\b.*?(?:\[[\d]+\])?', 'MOLINA HEALTHCARE'),
        (r'\bMolina.*?Healthcare\b', 'MOLINA HEALTHCARE'),
        
        # Blue Cross Blue Shield - preserve state distinctions
        (r'\bBlue_Cross_&_Blue_Shield_of_([A-Za-z_]+)_?', r'BLUE CROSS BLUE SHIELD OF \1'),
        (r'\bBLUE\s+CROSS\s+BLUE\s+SHIELD\s+OF\s+([A-Z\s]+)', r'BLUE CROSS BLUE SHIELD OF \1'),
        (r'\bBCBS\s+OF\s+([A-Z\s]+)', r'BLUE CROSS BLUE SHIELD OF \1'),
        (r'\bBCBS\b', 'BLUE CROSS BLUE SHIELD'),
        
        # Medicare/Medicaid variations
        (r'\bMEDICARE\b.*?(?:\[[\d]+\])?', 'MEDICARE'),
        (r'\bMEDICAID\b.*?(?:\[[\d]+\])?', 'MEDICAID'),
        (r'\bCMS\b.*?(?:\[[\d]+\])?', 'MEDICARE'),
        
        # Tricare variations
        (r'\bTRICARE\b.*?(?:\[[\d]+\])?', 'TRICARE'),
        (r'\bTRI.*?CARE\b', 'TRICARE'),
        
        # Workers Compensation variations
        (r'\bWORKERS.*?COMP\b', 'WORKERS COMPENSATION'),
        (r'\bWORKERS.*?COMPENSATION\b', 'WORKERS COMPENSATION'),
        (r'\bWC\b(?!\s+\d)', 'WORKERS COMPENSATION'),  # Not followed by numbers
        
        # Auto Insurance variations
        (r'\bAUTO\s+INSURANCE\b', 'AUTO INSURANCE'),
        (r'\bMOTOR\s+VEHICLE\b', 'AUTO INSURANCE'),
        (r'\bPIP\b(?!\s+\d)', 'AUTO INSURANCE PIP'),
        
        # Self Pay variations
        (r'\bSELF.*?PAY\b', 'SELF PAY'),
        (r'\bCASH\b(?!\s+\d)', 'SELF PAY'),
        (r'\bSELF.*?INSURED\b', 'SELF PAY'),
        
        # Remove ID numbers in brackets at the end
        (r'\s*\[[\d]+\]\s*$', ''),
        
        # Standardize specific plans and smaller insurers
        (r'\bDUKE\s+PLUS\b', 'DUKE PLUS'),
        (r'\bMAIL\s+HANDLERS\b.*?(?:\[[\d]+\])?', 'MAIL HANDLERS'),
        (r'\bNALC\s+HEALTH\s+BENEFIT\s+PLAN\b.*?(?:\[[\d]+\])?', 'NALC HEALTH BENEFIT PLAN'),
        (r'\bFIRST\s+HEALTH\b.*?(?:\[[\d]+\])?', 'FIRST HEALTH'),
        (r'\bGOLDEN\s+RULE\s+INSURANCE\s+COMPANY\b.*?(?:\[[\d]+\])?', 'GOLDEN RULE INSURANCE'),
        (r'\bOXFORD\s+HEALTH\s+PLANS\b.*?(?:\[[\d]+\])?', 'OXFORD HEALTH PLANS'),
        (r'\bHEALTH\s+NET\b.*?(?:\[[\d]+\])?', 'HEALTH NET'),
        (r'\bAMBETTER\b.*?(?:\[[\d]+\])?', 'AMBETTER'),
        (r'\bCENTENE\b.*?(?:\[[\d]+\])?', 'CENTENE'),
        
        # Federal Employee plans
        (r'\bFEHB\b', 'FEDERAL EMPLOYEE HEALTH BENEFITS'),
        (r'\bFEDERAL\s+EMPLOYEE.*?HEALTH.*?BENEFITS\b', 'FEDERAL EMPLOYEE HEALTH BENEFITS'),
        (r'\bGEHA\b', 'GOVERNMENT EMPLOYEES HEALTH ASSOCIATION'),
        
        # Convert underscores to spaces for better readability
        (r'_', ' '),
        
        # Clean up multiple spaces
        (r'\s+', ' '),
        
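    ]
    
    # Apply standardization patterns (case-insensitive)
    for pattern, replacement in standardization_patterns:
        name = re.sub(pattern, replacement, name, flags=re.IGNORECASE)
    
    # Fix common OCR/data entry errors. These run case-sensitively: with
    # re.IGNORECASE, r'\bl\b' would also rewrite a legitimate standalone 'L'.
    name = re.sub(r'\b0\b', 'O', name)  # Replace standalone 0 with O
    name = re.sub(r'\bl\b', 'I', name)  # Replace standalone lowercase l with I
    
    return name.strip().upper()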

# Enhanced test with more examples including common healthcare payers
test_payers = [
    # Original examples
    'Blue_Cross_&_Blue_Shield_of_Florida',
    'Blue_Cross_&_Blue_Shield_of_Florida_',
    'Cigna_HealthCare',
    'Longevity',
    'Aetna_Health',
    'CIGNA [1107150]',
    'UHC',
    'AETNA [1107164]',
    'UHC [1107151]',
    'NALC HEALTH BENEFIT PLAN [1001268]',
    'DUKE PLUS',
    'MAIL HANDLERS [1001414]',
    'AETNA',
    'FIRST HEALTH [1107113]',
    'GOLDEN RULE INSURANCE COMPANY [1001209]',
    'OXFORD HEALTH PLANS [1001285]',
    'UNITED MEDICAL RESOURCES CONTRACT [1107140]',
    'UMR [1107154]',
    'CIGNA',
    
    # Additional common variations
    'Humana Inc',
    'HUMANA [123456]',
    'Kaiser Permanente',
    'KAISER [789012]',
    'Anthem Blue Cross',
    'ANTHEM [345678]',
    'Wellcare',
    'Well Care Health Plans',
    'Molina Healthcare',
    'MOLINA [901234]',
    'BCBS OF NORTH CAROLINA',
    'BCBS',
    'Medicare',
    'MEDICARE [567890]',
    'Medicaid',
    'TRICARE',
    'Tri-Care',
    'Workers Comp',
    'WORKERS COMPENSATION',
    'WC',
    'Auto Insurance',
    'Motor Vehicle',
    'PIP',
    'Self Pay',
    'CASH',
    'Self Insured',
    'FEHB',
    'Federal Employee Health Benefits',
    'GEHA',
    'Health Net',
    'AMBETTER',
    'United of Omaha'
]

print("Testing enhanced payer name standardization:")
print("-" * 60)
for payer in test_payers:
    standardized = standardize_payer_name(payer)
    print(f"{payer:<45} -> {standardized}")

# Also create a function to apply standardization to actual data
def apply_payer_standardization_to_csv(csv_file_path):
    """
    Apply payer standardization to a CSV file and return the modified dataframe
    """
    df = pd.read_csv(csv_file_path, dtype=str)
    
    if 'payer_name' in df.columns:
        print(f"Standardizing payer names in {csv_file_path}")
        unique_before = df['payer_name'].nunique()
        df['payer_name'] = df['payer_name'].apply(standardize_payer_name)
        
        # Show before/after unique counts
        print(f"Unique payer names: {unique_before} -> {df['payer_name'].nunique()} after standardization")
    
    return df

print("\n" + "="*60)
print("Standardization function ready for use on your CSV files!")
print("Use apply_payer_standardization_to_csv(filepath) to apply to actual data.")

Testing enhanced payer name standardization:
------------------------------------------------------------
Blue_Cross_&_Blue_Shield_of_Florida           -> BLUE CROSS BLUE SHIELD OF FLORIDA
Blue_Cross_&_Blue_Shield_of_Florida_          -> BLUE CROSS BLUE SHIELD OF FLORIDA
Cigna_HealthCare                              -> CIGNA
Longevity                                     -> LONGEVITY
Aetna_Health                                  -> AETNA
CIGNA [1107150]                               -> CIGNA
UHC                                           -> UNITED HEALTHCARE
AETNA [1107164]                               -> AETNA
UHC [1107151]                                 -> UNITED HEALTHCARE
NALC HEALTH BENEFIT PLAN [1001268]            -> NALC HEALTH BENEFIT PLAN
DUKE PLUS                                     -> DUKE PLUS
MAIL HANDLERS [1001414]                       -> MAIL HANDLERS
AETNA                                         -> AETNA
FIRST HEALTH [1107113]                        -> FIRST HEALTH
GO

In [None]:
# Apply standardization to all existing CSV files and update the unique payers/plans files
def update_all_csv_files_with_standardization():
    """
    Apply payer name standardization to all CSV files in the csv_folder
    and regenerate the unique payers/plans text files
    """
    updated_files = []
    
    for file in os.listdir(csv_folder):
        if file.endswith('.csv'):
            file_path = os.path.join(csv_folder, file)
            
            # Load the CSV
            df = pd.read_csv(file_path, dtype=str)
            
            # Apply standardization if payer_name column exists
            if 'payer_name' in df.columns:
                original_unique_count = df['payer_name'].nunique()
                df['payer_name'] = df['payer_name'].apply(standardize_payer_name)
                new_unique_count = df['payer_name'].nunique()
                
                # Save the updated CSV
                df.to_csv(file_path, index=False)
                
                # Regenerate unique payers file
                unique_payers = df['payer_name'].dropna().unique()
                with open(os.path.join(csv_folder, f'{file}_unique_payers.txt'), 'w') as f:
                    for payer in sorted(unique_payers):  # Sort for easier reading
                        f.write(f"{payer}\n")
                
                # Also update unique plans if exists
                if 'plan_name' in df.columns:
                    unique_plans = df['plan_name'].dropna().unique()
                    with open(os.path.join(csv_folder, f'{file}_unique_plans.txt'), 'w') as f:
                        for plan in sorted(unique_plans):
                            f.write(f"{plan}\n")
                
                updated_files.append({
                    'file': file,
                    'original_payers': original_unique_count,
                    'standardized_payers': new_unique_count,
                    'reduction': original_unique_count - new_unique_count
                })
                
                print(f"Updated {file}: {original_unique_count} -> {new_unique_count} unique payers "
                      f"({original_unique_count - new_unique_count} consolidated)")
    
    return updated_files

# Uncomment the line below to run the standardization on all your CSV files
# results = update_all_csv_files_with_standardization()

print("Standardization update function ready!")
print("Uncomment the last line to apply standardization to all CSV files.")

***
## Data Preloading Tasks

In [2]:
# hospitals.csv updater/editor
import hashlib
import requests
import json
from geopy.geocoders import Nominatim
import pandas as pd
import time
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

geolocator = Nominatim(user_agent="CLEAR-geoapi-2025")
csv_file = 'docs/data/hospitals.csv'
df = pd.read_csv(csv_file)

# construct address for geocoding only (don't modify original data)
def construct_geocoding_address(row):
    # Build clean address from original components
    address = f"{row['address']}, {row['city']}, {row['state']} {row['zip']}"
    return address

# get lat/lon from address with increased timeout and retry/delay
def get_lat_lon(address, max_retries=3, delay=2):
    for attempt in range(max_retries):
        try:
            location = geolocator.geocode(address, timeout=5)
            if location:
                return location.latitude, location.longitude
            else:
                return None, None
        except Exception as e:
            print(f"Error geocoding {address} (attempt {attempt+1}): {e}")
            time.sleep(delay)
    return None, None

# generate short unique ID based on row['name'] + full composite address (base36, 8 chars)
def generate_short_id(row):
    full_address = construct_geocoding_address(row)
    unique_string = f"{row['name']}_{full_address}"
    hash_int = int(hashlib.md5(unique_string.encode()).hexdigest(), 16)
    short_id = base36encode(hash_int)[:8]
    return short_id

# base36 encoding for shorter IDs
def base36encode(number):
    chars = '0123456789abcdefghijklmnopqrstuvwxyz'
    if number == 0:
        return '0'
    result = ''
    while number > 0:
        number, i = divmod(number, 36)
        result = chars[i] + result
    return result

# Add lat/lon and short_id to dataframe, set json_path to 'docs/data/prices/<state>/<id>.json'
def update_dataframe(df):
    
    # Don't modify the address column - just use it for geocoding
    def lat_lon_with_delay(row):
        geocoding_address = construct_geocoding_address(row)
        lat, lon = get_lat_lon(geocoding_address)
        time.sleep(1)  # 1 second delay per request
        return pd.Series([lat, lon])
    
    df[['lat', 'lon']] = df.apply(lat_lon_with_delay, axis=1)
    df['id'] = df.apply(generate_short_id, axis=1)
    df['json_path'] = df.apply(lambda row: f"docs/data/prices/{row['state']}/{row['id']}.json", axis=1)
    df.to_csv(csv_file, index=False)
    
    return

update_dataframe(df)

In [None]:
# now we need to create comparison df's for the top 200 HCPCS and CPT codes billed for 2024 & top 100 lab codes
# first load the codes from the .csv files
hcpcs_codes = pd.read_csv('docs/data/hcpcs_lvl2_top_200_codes_2024.csv')
lab_codes = pd.read_csv('docs/data/lab_top_100_codes_2024.csv')
cpt_codes = pd.read_csv('docs/data/cpt_lvl1_top_200_codes_2024.csv')

Unnamed: 0,Rank by Charges,HCPCS Code,Allowed Charges,Allowed Services,Unnamed: 4
0,1,99214,12493376407,103756876,
1,2,99213,5914372895,69301624,
2,3,99233,2693744916,22975112,
3,4,99232,2676454801,34687153,
4,5,99215,2166116667,12926784,


In [28]:
# RUN TO LOAD HOSPITALS CSV
import os

# csv's are stored locally outside of CLEAR repo
# set up one folder then into 'ChargeMaster_Project/csv_files/'
# get path to csv_files folder outside CLEAR repo
workspace_root = os.path.dirname(os.path.abspath('CLEAR.ipynb'))
csv_folder = os.path.join(workspace_root, '..', 'ChargeMaster_Project', 'csv_files')
csv_folder = os.path.abspath(csv_folder)

# define path to hospitals.csv
hospitals_csv = os.path.join(workspace_root, 'docs', 'data', 'hospitals.csv')
hospitals_csv = os.path.abspath(hospitals_csv)

# read hospitals.csv to get list of hospitals and their file paths
hospitals_df = pd.read_csv(hospitals_csv)

***
## North Carolina Hospitals

In [34]:

# ======================================================================
# --------------- DUKE HOSPITAL TESTING ----------------
# ======================================================================

# Grab row for Duke Hospital in Durham, NC
hos_name = 'Duke University Hospital'
matching_hospitals = hospitals_df[hospitals_df['name'] == hos_name]
if matching_hospitals.empty:
    # stop here: without the row there is no json_path to write to
    raise ValueError(f"Hospital '{hos_name}' not found in the dataset")
duke_row = matching_hospitals.iloc[0]

# grab json path for Duke Hospital
duke_json_path = duke_row['json_path']

# load a single csv file from csv_folder for testing
test_csv_path = os.path.join(csv_folder, 'DukeHospital_Durham.csv')
duke_df = pd.read_csv(test_csv_path)

# remove duke_df Hospital, City, State, Address columns before converting to parquet
duke_df = duke_df.drop(columns=['Hospital', 'City', 'State', 'Address'])


#code_cols = ['code_1', 'code_2', 'code_3', 'code_4']
# Check matches for each code column against hcpcs_codes, cpt_codes, and lab_codes, iteratively
# for col in code_cols:
#     print(f"Checking matches for column: {col}")
#     hcpcs_matches = duke_df[duke_df[col].isin(hcpcs_codes['HCPCS Code'])]
#     cpt_matches = duke_df[duke_df[col].isin(cpt_codes['HCPCS Code'])]
#     lab_matches = duke_df[duke_df[col].isin(lab_codes['HCPCS Code'])]
#     print(f"  HCPCS matches: {len(hcpcs_matches)}")
#     print(f"  CPT matches: {len(cpt_matches)}")
#     print(f"  Lab matches: {len(lab_matches)}")

"""

    This actually shows that code_2 contains HCPCS codes and code_3 contains CPT codes
    Checking matches for column: code_1
        HCPCS matches: 0
        CPT matches: 0
        Lab matches: 0
    Checking matches for column: code_2
        HCPCS matches: 76966
        CPT matches: 0
        Lab matches: 19987
    Checking matches for column: code_3
        HCPCS matches: 0
        CPT matches: 772
        Lab matches: 0
    Checking matches for column: code_4
        HCPCS matches: 0
        CPT matches: 0
        Lab matches: 0

"""

# Duke Hospital CM Structure
# code_2/code_3 [columns 3, 5 --> 4, 6 contain type] contain HCPCS and CPT codes, so we use those for comparison against the top 200 lists
# Columns 13-24 contain payer, plan, and pricing info, so we want all of those as well as column 0 which is the 
# description of the code [used for regex matching on the front end]
# final columns to keep: 0, 3-6, 13-24
duke_df = duke_df.iloc[:, [0] + list(range(3, 7)) + list(range(13, 25))]

# actually lets go ahead and drop some columns to conserve space
duke_df = duke_df.drop(columns=['standard_charge_algorithm', 'additional_generic_notes'])

# now we can search duke_df['code_2'] and duke_df['code_2_type'] against hcpcs_codes , cpt_codes, and lab_codes
# first search hcpcs_codes
hcpcs_matches = duke_df[duke_df['code_2'].isin(hcpcs_codes['HCPCS Code'])]
cpt_matches = duke_df[duke_df['code_3'].isin(cpt_codes['HCPCS Code'])]
lab_matches = duke_df[duke_df['code_2'].isin(lab_codes['HCPCS Code'])]

# Combine all matches into one dataframe, drop duplicates
match_dfs = [df for df in [hcpcs_matches, cpt_matches, lab_matches] if not df.empty]

if match_dfs:
    all_matches = pd.concat(match_dfs, ignore_index=True).drop_duplicates()
else:
    # Create empty DataFrame with same structure as duke_df if no matches
    all_matches = pd.DataFrame(columns=duke_df.columns)

# There are some duplicate issues, mainly rows where no est. price is given, so let's remove entries that don't have est. prices
all_matches = all_matches[all_matches['estimated_amount'].notna() & (all_matches['estimated_amount'] != '')]

# Save output data to json file for Duke json path
all_matches.to_json(duke_json_path, orient='records', lines=True)

# drop file/df from memory to save space
del duke_df
del duke_row
del test_csv_path

# ======================================================================
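
Note that `to_json(..., orient='records', lines=True)` writes JSON Lines (one JSON object per line) rather than a single JSON array, so whatever reads the file has to parse it line by line. The sketch below shows a round trip through a temporary file; the sample rows are made up for illustration and the file name `example.json` is hypothetical.

```python
import json
import os
import tempfile
import pandas as pd

sample = pd.DataFrame({
    'description': ['Office visit', 'Lab panel'],
    'code_2': ['99214', '80053'],
    'estimated_amount': [150.0, 42.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.json')
    sample.to_json(path, orient='records', lines=True)

    # Option 1: pandas reads JSON Lines when lines=True is passed;
    # dtype=False asks pandas not to infer dtypes for the data
    round_trip = pd.read_json(path, lines=True, dtype=False)

    # Option 2: plain Python, one json.loads call per non-empty line
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]

print(len(round_trip), records[0]['code_2'])
```

A consumer that expects a single JSON array (for example a plain `JSON.parse` of the whole file) would fail on this layout, which is worth keeping in mind when wiring up the front end.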




In [41]:
# ======================================================================
# --------------- ADVENTHEALTH HOSPITAL  ----------------
# ======================================================================

# Load AdventHealth Hendersonville, NC paths
hos_name = 'AdventHealth'
city_name = 'Hendersonville'
state_name = 'NC'
matching_hospitals = hospitals_df[
    (hospitals_df['name'] == hos_name) &
    (hospitals_df['state'] == state_name) &
    (hospitals_df['city'] == city_name)
]
if matching_hospitals.empty:
    # stop here: without the row there is no json_path to write to
    raise ValueError(f"Hospital '{hos_name}' in {city_name}, {state_name} not found in the dataset")
adv_nc = matching_hospitals.iloc[0]

# grab json path for AdventHealth Hendersonville, NC
adv_nc_json_path = adv_nc['json_path']

# load a single csv file from csv_folder for testing
adv_nc_csv_path = os.path.join(csv_folder, 'AdventHealth_Hendersonville_CM.csv')

# load AdventHealth Hendersonville, NC csv
adv_nc_df = pd.read_csv(adv_nc_csv_path)

# AdventHealth CM Structure
# ['description', 'drug_information', 'code', 'type',
#    'standard_charge_min', 'standard_charge_max', 'gross_charge',
#    'discounted_cash', 'setting', 'payer_name', 'plan_name',
#    'standard_charge_dollar', 'standard_charge_percentage',
#    'estimated_amount', 'methodology', 'standard_charge_algorithm',
#    'Hospital', 'City', 'State', 'Address']

# Check matches for code column against hcpcs_codes, cpt_codes, and lab_codes
# Output: 
# HCPCS matches: 20845
#   CPT matches: 2884
#   Lab matches: 13555

# Let's drop unneeded columns, and rename some before grabbing the matches and saving to json
# NOTE: common naming convention needs to be added before renaming cols
cols_to_drop = ['methodology', 'drug_information', 'standard_charge_algorithm', 'Hospital', 'City', 'State', 'Address']
adv_nc_df = adv_nc_df.drop(columns=cols_to_drop)

# now grab matches
hcpcs_matches = adv_nc_df[adv_nc_df['code'].isin(hcpcs_codes['HCPCS Code'])]
cpt_matches = adv_nc_df[adv_nc_df['code'].isin(cpt_codes['HCPCS Code'])]
lab_matches = adv_nc_df[adv_nc_df['code'].isin(lab_codes['HCPCS Code'])]

# Combine all matches into one dataframe, drop duplicates
match_dfs = [df for df in [hcpcs_matches, cpt_matches, lab_matches] if not df.empty]

if match_dfs:
    all_matches = pd.concat(match_dfs, ignore_index=True).drop_duplicates()
else:
    # Create empty DataFrame with same structure as adv_nc_df if no matches
    # (duke_df was deleted at the end of the Duke cell, so referencing it here would fail)
    all_matches = pd.DataFrame(columns=adv_nc_df.columns)

# There are some duplicate issues, mainly rows where no est. price is given, so let's remove entries that don't have est. prices
all_matches = all_matches[all_matches['estimated_amount'].notna() & (all_matches['estimated_amount'] != '')]

# Save output data to json file for AdventHealth Hendersonville, NC json path
all_matches.to_json(adv_nc_json_path, orient='records', lines=True)

# drop file/df from memory to save space
del adv_nc_df
del adv_nc
del adv_nc_csv_path

# ======================================================================

