# ARCOS Prescription Data - Create County-Year Dataset with FIPS

This notebook processes the filtered ARCOS data to create a county-year aggregated dataset.

## Goals:
1. Load the filtered parquet file (`arcos_filtered_2006_2015.parquet`) from RAW directory
2. Clean county names
3. Aggregate to county-year level
4. Add FIPS codes for merging with population data
5. Save final processed file

## Input:
- `data/raw/arcos_filtered_2006_2015.parquet` (pre-filtered for 2006-2015 and selected states, includes Dosage_Strength)

## Output:
- `data/processed/arcos_county_year_with_fips.parquet` (county-year aggregated with FIPS codes)

## Step 1: Import Libraries and Configure Settings

In [1]:
import os

# Limit Polars to 6 threads (about 27% of the 22 available).
# The variable must be set before polars is imported to take effect.
os.environ["POLARS_MAX_THREADS"] = "6"

import polars as pl
import pandas as pd

print("Libraries imported successfully!")
print(f"Polars configured to use 6 threads (out of {os.cpu_count()} available)")

Libraries imported successfully!
Polars configured to use 6 threads (out of 22 available)
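
As far as the Polars documentation describes it, the thread pool is sized once, so `POLARS_MAX_THREADS` should already be in the environment when `import polars` runs. A minimal sketch of deriving the value from a fraction of the available cores; the helper name `configure_polars_threads` and the 0.27 default are illustrative assumptions, not part of the notebook:

```python
import os

def configure_polars_threads(fraction: float = 0.27) -> int:
    """Set POLARS_MAX_THREADS to roughly `fraction` of the available cores."""
    total_cores = os.cpu_count() or 1  # cpu_count() can return None
    n_threads = max(1, round(total_cores * fraction))
    os.environ["POLARS_MAX_THREADS"] = str(n_threads)
    return n_threads

n_threads = configure_polars_threads()
print(f"POLARS_MAX_THREADS={n_threads} (of {os.cpu_count()} cores)")
# Only after this point: import polars as pl
```

On a 22-core machine the default fraction yields 6 threads, matching the setting used in this notebook.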


## Step 2: Load Filtered ARCOS Data

In [2]:
# Updated to use RAW file which contains Dosage_Strength column
filtered_file = '../data/raw/arcos_filtered_2006_2015.parquet'

print(f"Loading filtered ARCOS data from: {filtered_file}")
print("=" * 60)

df_arcos = pl.read_parquet(filtered_file)

print(f"\n✓ Data loaded successfully!")
print(f"  Rows: {df_arcos.shape[0]:,}")
print(f"  Columns: {df_arcos.shape[1]}")
print(f"  Column names: {df_arcos.columns}")

print("\n" + "=" * 60)
print("First 5 rows:")
print(df_arcos.head(5))

print("=" * 60)
print("FILTERING FOR CDC-COMPARABLE MME VALUES")
print("=" * 60)

print(f"\nBefore filtering: {df_arcos.shape[0]:,} rows")

# Define retail buyer types (CDC-comparable)
RETAIL_BUYER_TYPES = [
    'CHAIN PHARMACY',
    'RETAIL PHARMACY', 
    'HOSPITAL/CLINIC',
    'HOSP/CLINIC-VA',
    'PRACTITIONER',
    'PRACTITIONER-DW/30',
    'PRACTITIONER-DW/100',
    'PRACTITIONER-DW/275',
    'MLP-NURSE PRACTITIONER',
    'MAINT & DETOX',
    'COMP/MAINT/DETOX',
    'CENTRAL FILL PHARMACY'
]

rows_before = df_arcos.shape[0]

# Apply filters
df_arcos = df_arcos.filter(
    # Only sales (not returns 'R', transfers 'P', etc.)
    (pl.col('TRANSACTION_CODE') == 'S') &
    
    # Only retail buyers (pharmacies, hospitals, practitioners)
    (pl.col('BUYER_BUS_ACT').is_in(RETAIL_BUYER_TYPES)) &
    
    # Remove extreme outliers (per-transaction MME > 1 million is suspicious)
    (pl.col('MME') <= 1_000_000) &
    
    # Also filter out null/invalid counties
    (pl.col('BUYER_COUNTY').is_not_null())
)

print(f"After filtering: {df_arcos.shape[0]:,} rows")
print(f"Removed: {rows_before - df_arcos.shape[0]:,} rows")

# Show BUYER_BUS_ACT distribution after filtering
print("\nBuyer types in filtered data:")
buyer_counts = df_arcos.group_by('BUYER_BUS_ACT').agg(pl.len().alias('count')).sort('count', descending=True)
print(buyer_counts)

print("\n" + "=" * 60)
print("✓ Data filtered for CDC-comparable analysis")
print("=" * 60)



Loading filtered ARCOS data from: ../data/raw/arcos_filtered_2006_2015.parquet

✓ Data loaded successfully!
  Rows: 218,477,461
  Columns: 15
  Column names: ['BUYER_BUS_ACT', 'BUYER_STATE', 'BUYER_COUNTY', 'TRANSACTION_CODE', 'DRUG_CODE', 'NDC_NO', 'DRUG_NAME', 'Measure', 'MME_Conversion_Factor', 'Dosage_Strength', 'TRANSACTION_DATE', 'CALC_BASE_WT_IN_GM', 'DOSAGE_UNIT', 'MME', 'year']

First 5 rows:
shape: (5, 15)
┌────────────┬────────────┬────────────┬────────────┬───┬───────────┬───────────┬───────────┬──────┐
│ BUYER_BUS_ ┆ BUYER_STAT ┆ BUYER_COUN ┆ TRANSACTIO ┆ … ┆ CALC_BASE ┆ DOSAGE_UN ┆ MME       ┆ year │
│ ACT        ┆ E          ┆ TY         ┆ N_CODE     ┆   ┆ _WT_IN_GM ┆ IT        ┆ ---       ┆ ---  │
│ ---        ┆ ---        ┆
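
The filter cell above reports only the total number of rows removed. A plain-Python sketch of the same four conditions, counting how many records each one rejects, can help when that total looks surprising. The three records are invented for illustration, the buyer-type set is abbreviated, and `rejections` is a hypothetical helper:

```python
from collections import Counter

RETAIL = {"CHAIN PHARMACY", "RETAIL PHARMACY", "PRACTITIONER"}  # abbreviated list

# Each check mirrors one clause of the Polars filter and returns True on failure.
CHECKS = {
    "not_a_sale": lambda r: r["TRANSACTION_CODE"] != "S",
    "non_retail_buyer": lambda r: r["BUYER_BUS_ACT"] not in RETAIL,
    "mme_outlier": lambda r: r["MME"] is None or r["MME"] > 1_000_000,
    "missing_county": lambda r: r["BUYER_COUNTY"] is None,
}

def rejections(records):
    """Count, per check, how many records fail it (a record can fail several)."""
    counts = Counter()
    for rec in records:
        for name, fails in CHECKS.items():
            if fails(rec):
                counts[name] += 1
    return counts

sample = [  # made-up records, not ARCOS data
    {"TRANSACTION_CODE": "S", "BUYER_BUS_ACT": "CHAIN PHARMACY",
     "MME": 1500.0, "BUYER_COUNTY": "ADAMS"},
    {"TRANSACTION_CODE": "R", "BUYER_BUS_ACT": "DISTRIBUTOR",
     "MME": 900.0, "BUYER_COUNTY": "AIKEN"},
    {"TRANSACTION_CODE": "S", "BUYER_BUS_ACT": "RETAIL PHARMACY",
     "MME": 2_500_000.0, "BUYER_COUNTY": None},
]
counts = rejections(sample)
print(dict(counts))
```

Because a record can fail more than one check, the per-check counts may add up to more than the number of rows removed.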

## Step 3: Initial Data Quality Check

In [3]:
print("=" * 60)
print("STEP 3: INITIAL DATA QUALITY CHECK")
print("=" * 60)

# 1. Check data shape
print(f"\n1. Data Shape:")
print(f"   Rows: {df_arcos.shape[0]:,}")
print(f"   Columns: {df_arcos.shape[1]}")

# 2. Check column names and types
print(f"\n2. Column Information:")
print(df_arcos.schema)

# 3. Check for null values
print(f"\n3. Null Values Check:")
null_counts = df_arcos.null_count()
print(null_counts)

# 4. Check year range
print(f"\n4. Year Range:")
year_stats = df_arcos.select([
    pl.col("year").min().alias("min_year"),
    pl.col("year").max().alias("max_year"),
    pl.col("year").n_unique().alias("unique_years")
])
print(year_stats)

# 5. Check states
print(f"\n5. States in Data:")
states = df_arcos.select("BUYER_STATE").unique().sort("BUYER_STATE")
print(f"   Unique states: {states.shape[0]}")
print(f"   States: {states.to_series().to_list()}")

# 6. Check for negative or zero values in MME calculation columns
print(f"\n6. Value Ranges Check (for MME calculation components):")
value_checks = df_arcos.select([
    (pl.col("Dosage_Strength") <= 0).sum().alias("Dosage_Strength_zero_or_neg"),
    (pl.col("DOSAGE_UNIT") <= 0).sum().alias("DOSAGE_UNIT_zero_or_neg"),
    (pl.col("MME_Conversion_Factor") <= 0).sum().alias("MME_Conv_Factor_zero_or_neg"),
    pl.col("Dosage_Strength").min().alias("Dosage_Strength_min"),
    pl.col("Dosage_Strength").max().alias("Dosage_Strength_max"),
    pl.col("DOSAGE_UNIT").min().alias("DOSAGE_min"),
    pl.col("DOSAGE_UNIT").max().alias("DOSAGE_max"),
    pl.col("MME_Conversion_Factor").min().alias("MME_Conv_min"),
    pl.col("MME_Conversion_Factor").max().alias("MME_Conv_max")
])
print(value_checks)

print(f"\n{'=' * 60}")
print("✓ Initial quality check complete!")


STEP 3: INITIAL DATA QUALITY CHECK

1. Data Shape:
   Rows: 189,593,775
   Columns: 15

2. Column Information:
Schema({'BUYER_BUS_ACT': String, 'BUYER_STATE': String, 'BUYER_COUNTY': String, 'TRANSACTION_CODE': String, 'DRUG_CODE': String, 'NDC_NO': String, 'DRUG_NAME': String, 'Measure': String, 'MME_Conversion_Factor': Float32, 'Dosage_Strength': Float32, 'TRANSACTION_DATE': String, 'CALC_BASE_WT_IN_GM': Float32, 'DOSAGE_UNIT': Float32, 'MME': Float32, 'year': Int16})

3. Null Values Check:
shape: (1, 15)
┌─────────────┬─────────────┬─────────────┬─────────────┬───┬────────────┬────────────┬─────┬──────┐
│ BUYER_BUS_A ┆ BUYER_STATE ┆ BUYER_COUNT ┆ TRANSACTION ┆ … ┆ CALC_BASE_ ┆ DOSAGE_UNI ┆ MME ┆ year │
│ CT          ┆ ---         ┆ Y           ┆ _CODE

## Step 4: Clean County Names

In [4]:
print("=" * 60)
print("CLEAN COUNTY NAMES")
print("=" * 60)

# Show sample of original county names
print("\n1. Sample of ORIGINAL county names (before cleaning):")
sample_counties_before = df_arcos.select("BUYER_COUNTY").unique().sort("BUYER_COUNTY").head(20)
print(sample_counties_before)

# Check for various issues
print("\n2. County Name Issues:")
county_checks = df_arcos.select([
    (pl.col("BUYER_COUNTY").str.contains("(?i)county")).sum().alias("has_county_suffix"),
    (pl.col("BUYER_COUNTY").str.strip_chars() != pl.col("BUYER_COUNTY")).sum().alias("has_whitespace"),
    pl.col("BUYER_COUNTY").n_unique().alias("unique_counties_before")
])
print(county_checks)

# Clean county names
print("\n3. Cleaning county names...")
df_arcos = df_arcos.with_columns([
    pl.col("BUYER_COUNTY")
    .str.strip_chars()  # Remove leading/trailing whitespace
    .str.to_uppercase()  # Convert to uppercase for consistency
    .str.replace(r"(?i)\s+COUNTY\s*$", "")  # Remove "COUNTY" suffix (case-insensitive)
    .str.strip_chars()  # Remove any trailing whitespace after removal
    .alias("BUYER_COUNTY")
])

print("✓ Cleaning applied!")

# Show sample of cleaned county names
print("\n4. Sample of CLEANED county names (after cleaning):")
sample_counties_after = df_arcos.select("BUYER_COUNTY").unique().sort("BUYER_COUNTY").head(20)
print(sample_counties_after)

# Verify cleaning results
print("\n5. Verification:")
county_checks_after = df_arcos.select([
    (pl.col("BUYER_COUNTY").str.contains("(?i)county")).sum().alias("still_has_county_suffix"),
    pl.col("BUYER_COUNTY").n_unique().alias("unique_counties_after")
])
print(county_checks_after)

print(f"\n{'=' * 60}")
print("✓ County name cleaning complete!")

CLEAN COUNTY NAMES

1. Sample of ORIGINAL county names (before cleaning):
shape: (20, 1)
┌──────────────┐
│ BUYER_COUNTY │
│ ---          │
│ str          │
╞══════════════╡
│ ABBEVILLE    │
│ ACCOMACK     │
│ ADAMS        │
│ AIKEN        │
│ AITKIN       │
│ …            │
│ ALPINE       │
│ AMADOR       │
│ AMELIA       │
│ AMHERST      │
│ AMITE        │
└──────────────┘

2. County Name Issues:
shape: (1, 3)
┌───────────────────┬────────────────┬────────────────────────┐
│ has_county_suffix ┆ has_whitespace ┆ unique_counties_before │
│ ---               ┆ ---            ┆ ---                    │
│ u32               ┆ u32            ┆ u32                    │
╞═══════
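
The cleaning chain in this step (trim, uppercase, drop a trailing `COUNTY`) can be tried on a handful of names without loading the full dataset. The sketch below re-implements it with the standard `re` module; `clean_county` and the example names are hypothetical, chosen to show one case the suffix rule leaves alone:

```python
import re

_COUNTY_SUFFIX = re.compile(r"\s+COUNTY\s*$")

def clean_county(name: str) -> str:
    """Trim whitespace, uppercase, and drop a trailing 'COUNTY' token."""
    upper = name.strip().upper()
    return _COUNTY_SUFFIX.sub("", upper).strip()

examples = ["  Autauga County ", "AIKEN", "St. Louis county", "County Line"]
cleaned = [clean_county(n) for n in examples]
print(cleaned)
```

Note that a name such as `COUNTY LINE` is unchanged by the suffix rule but would still be counted by the verification check, which matches `county` anywhere in the string rather than only at the end.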

## Step 5: Handle Invalid/Missing Values

In [5]:
print("=" * 60)
print("HANDLE INVALID/MISSING VALUES")
print("=" * 60)

print(f"\n1. Initial row count: {df_arcos.shape[0]:,}")

# Check for null/missing values in critical columns
print("\n2. Checking for null values in critical columns:")
null_check = df_arcos.select([
    pl.col("BUYER_STATE").is_null().sum().alias("state_nulls"),
    pl.col("BUYER_COUNTY").is_null().sum().alias("county_nulls"),
    pl.col("year").is_null().sum().alias("year_nulls"),
    pl.col("Dosage_Strength").is_null().sum().alias("Dosage_Strength_nulls"),
    pl.col("DOSAGE_UNIT").is_null().sum().alias("DOSAGE_UNIT_nulls"),
    pl.col("MME_Conversion_Factor").is_null().sum().alias("MME_Conv_Factor_nulls")
])
print(null_check)

# Check for zero or negative values in key columns for MME calculation
print("\n3. Checking for invalid values (zero/negative) in MME calculation columns:")
invalid_check = df_arcos.select([
    (pl.col("Dosage_Strength") <= 0).sum().alias("Dosage_Strength_invalid"),
    (pl.col("DOSAGE_UNIT") <= 0).sum().alias("DOSAGE_UNIT_invalid"),
    (pl.col("MME_Conversion_Factor") <= 0).sum().alias("MME_Conv_Factor_invalid")
])
print(invalid_check)

# Filter out invalid records
print("\n4. Removing rows with invalid values...")

rows_before = df_arcos.shape[0]

df_arcos = df_arcos.filter(
    (pl.col("BUYER_STATE").is_not_null()) &
    (pl.col("BUYER_COUNTY").is_not_null()) &
    (pl.col("year").is_not_null()) &
    (pl.col("Dosage_Strength").is_not_null()) &
    (pl.col("DOSAGE_UNIT").is_not_null()) &
    (pl.col("MME_Conversion_Factor").is_not_null()) &
    (pl.col("Dosage_Strength") > 0) &
    (pl.col("DOSAGE_UNIT") > 0) &
    (pl.col("MME_Conversion_Factor") > 0)
)


rows_after = df_arcos.shape[0]
rows_removed = rows_before - rows_after

print(f"   Rows before: {rows_before:,}")
print(f"   Rows after: {rows_after:,}")
print(f"   Rows removed: {rows_removed:,} ({(rows_removed/rows_before)*100:.2f}%)")

# Verify no invalid values remain
print("\n5. Verification - checking for remaining issues:")
verification = df_arcos.select([
    pl.col("BUYER_STATE").is_null().sum().alias("state_nulls"),
    pl.col("BUYER_COUNTY").is_null().sum().alias("county_nulls"),
    pl.col("year").is_null().sum().alias("year_nulls"),
    (pl.col("Dosage_Strength") <= 0).sum().alias("Dosage_Strength_invalid"),
    (pl.col("DOSAGE_UNIT") <= 0).sum().alias("DOSAGE_UNIT_invalid"),
    (pl.col("MME_Conversion_Factor") <= 0).sum().alias("MME_Conv_invalid")
])
print(verification)

print(f"\n{'=' * 60}")
print("✓ Invalid values handled!")
print("✓ Ready for CDC-standard MME calculation")


HANDLE INVALID/MISSING VALUES

1. Initial row count: 189,593,775

2. Checking for null values in critical columns:
shape: (1, 6)
┌─────────────┬──────────────┬────────────┬──────────────────┬──────────────────┬──────────────────┐
│ state_nulls ┆ county_nulls ┆ year_nulls ┆ Dosage_Strength_ ┆ DOSAGE_UNIT_null ┆ MME_Conv_Factor_ │
│ ---         ┆ ---          ┆ ---        ┆ nulls            ┆ s                ┆ nulls            │
│ u32         ┆ u32          ┆ u32        ┆ ---              ┆ ---              ┆ ---              │
│             ┆              ┆            ┆ u32              ┆ u32              ┆ u32              │
╞═════════════╪══════════════╪══════
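
The filter in this step keeps a row only when every key column is present and every MME component is present and strictly positive. A plain-Python sketch of that rule for a single record follows. The explicit NaN check is an added assumption (the notebook does not test for NaN), and `is_valid` is a hypothetical helper:

```python
import math

KEYS = ("BUYER_STATE", "BUYER_COUNTY", "year")
COMPONENTS = ("Dosage_Strength", "DOSAGE_UNIT", "MME_Conversion_Factor")

def is_valid(rec: dict) -> bool:
    """True when all keys are present and all MME components are finite and > 0."""
    if any(rec.get(k) is None for k in KEYS):
        return False
    for col in COMPONENTS:
        value = rec.get(col)
        # Assumption: NaN is treated as invalid, in addition to null, zero, negative.
        if value is None or math.isnan(value) or value <= 0:
            return False
    return True

good = {"BUYER_STATE": "VA", "BUYER_COUNTY": "AMELIA", "year": 2010,
        "Dosage_Strength": 10.0, "DOSAGE_UNIT": 100.0, "MME_Conversion_Factor": 1.5}
zero_units = {**good, "DOSAGE_UNIT": 0.0}
nan_strength = {**good, "Dosage_Strength": float("nan")}
no_county = {**good, "BUYER_COUNTY": None}

print([is_valid(r) for r in (good, zero_units, nan_strength, no_county)])
```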

## Step 6: Calculate Correct MME Values

**IMPORTANT:** We need to calculate MME correctly using the CDC-standard formula:

**MME = Dosage_Strength (mg/pill) × DOSAGE_UNIT (# pills) × MME_Conversion_Factor**

The pre-calculated `MME` column in ARCOS uses base-weight in grams, which leads to values 1000x too high when used directly.
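
As a worked instance of the formula, consider a hypothetical shipment of 100 tablets at 10 mg each with a conversion factor of 1.5 (the factor commonly listed for oxycodone). The numbers are illustrative only and `mme_mg` is not a function from the notebook:

```python
def mme_mg(strength_mg_per_unit: float, dosage_units: float, conversion_factor: float) -> float:
    """Morphine milligram equivalents for one shipment record."""
    return strength_mg_per_unit * dosage_units * conversion_factor

# Hypothetical shipment: 10 mg tablets, 100 tablets, factor 1.5
shipment_mme = mme_mg(10.0, 100.0, 1.5)
print(shipment_mme)  # 1500.0
```

The result is in milligrams of morphine equivalent, which is the unit the aggregated `opioid_shipments_mme` column inherits.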


In [6]:
print("=" * 60)
print("MME CALCULATION - VERIFYING DATA AVAILABILITY")
print("=" * 60)

print("\n✓ Checking that we have all required columns for CDC-standard MME calculation:")
print("   1. Dosage_Strength (mg per pill)")
print("   2. DOSAGE_UNIT (number of pills)")
print("   3. MME_Conversion_Factor (opioid-to-morphine conversion)")

# Check availability of required columns
required_cols = ["Dosage_Strength", "DOSAGE_UNIT", "MME_Conversion_Factor"]
for col in required_cols:
    if col in df_arcos.columns:
        non_null = df_arcos.select(pl.col(col).is_not_null().sum()).item()
        non_zero = df_arcos.select((pl.col(col) > 0).sum()).item() if col != "MME_Conversion_Factor" else "N/A"
        print(f"   ✓ {col}: Present ({non_null:,} non-null, {non_zero} non-zero)")
    else:
        print(f"   ✗ {col}: MISSING")

# Sample a few records to show the calculation
print("\n📊 Sample calculation (first 5 valid records):")
sample = df_arcos.filter(
    (pl.col("Dosage_Strength") > 0) & 
    (pl.col("DOSAGE_UNIT") > 0) & 
    (pl.col("MME_Conversion_Factor") > 0)
).select([
    "Dosage_Strength",
    "DOSAGE_UNIT", 
    "MME_Conversion_Factor",
    (pl.col("Dosage_Strength") * pl.col("DOSAGE_UNIT") * pl.col("MME_Conversion_Factor")).alias("Calculated_MME")
]).head(5)
print(sample)

print("\n" + "=" * 60)
print("✓ All required columns present for CDC-standard MME calculation")
print("  Formula: Dosage_Strength × DOSAGE_UNIT × MME_Conversion_Factor")
print("=" * 60)


MME CALCULATION - VERIFYING DATA AVAILABILITY

✓ Checking that we have all required columns for CDC-standard MME calculation:
   1. Dosage_Strength (mg per pill)
   2. DOSAGE_UNIT (number of pills)
   3. MME_Conversion_Factor (opioid-to-morphine conversion)


   ✓ Dosage_Strength: Present (163,403,713 non-null, 163403713 non-zero)
   ✓ DOSAGE_UNIT: Present (163,403,713 non-null, 163403713 non-zero)
   ✓ MME_Conversion_Factor: Present (163,403,713 non-null, N/A non-zero)

📊 Sample calculation (first 5 valid records):
shape: (5, 4)
┌─────────────────┬─────────────┬───────────────────────┬────────────────┐
│ Dosage_Strength ┆ DOSAGE_UNIT ┆ MME_Conversion_Factor ┆ Calculated_MME │
│ ---             ┆ ---         ┆ ---                   ┆ ---            │
│ f32             ┆ f32         ┆ f32                   ┆ f32            │
╞═════════════════╪═════════════╪═══════════════════════╪════════════════╡
│ 60.0

In [7]:
print("=" * 60)
print("FINAL DATA SUMMARY (Memory-Efficient)")
print("=" * 60)

# Overall statistics
print("\n1. Data Dimensions:")
print(f"   Total rows: {df_arcos.shape[0]:,}")
print(f"   Total columns: {df_arcos.shape[1]}")

# Summary statistics for key variables (no groupby, just aggregations)
print("\n2. Summary Statistics for MME Calculation Components:")
summary_stats = df_arcos.select([
    pl.col("Dosage_Strength").mean().alias("avg_dosage_strength_mg"),
    pl.col("Dosage_Strength").median().alias("median_dosage_strength_mg"),
    pl.col("DOSAGE_UNIT").sum().alias("total_pills"),
    pl.col("DOSAGE_UNIT").mean().alias("avg_pills_per_transaction"),
    pl.col("MME_Conversion_Factor").mean().alias("avg_MME_conversion_factor")
])
print(summary_stats)

# Sample of clean data (no heavy operations)
print("\n3. Sample of Cleaned Data (first 10 rows):")
print(df_arcos.select(["BUYER_STATE", "BUYER_COUNTY", "year", "Dosage_Strength", "DOSAGE_UNIT", "MME_Conversion_Factor"]).head(10))

print(f"\n{'=' * 60}")
print("✓ Data is clean and ready for aggregation!")
print("  (Skipping heavy groupby operations to prevent memory issues)")


FINAL DATA SUMMARY (Memory-Efficient)

1. Data Dimensions:
   Total rows: 163,403,713
   Total columns: 15

2. Summary Statistics for MME Calculation Components:
shape: (1, 5)
┌─────────────────────┬────────────────────┬─────────────┬────────────────────┬────────────────────┐
│ avg_dosage_strength ┆ median_dosage_stre ┆ total_pills ┆ avg_pills_per_tran ┆ avg_MME_conversion │
│ _mg                 ┆ ngth_mg            ┆ ---         ┆ saction            ┆ _factor            │
│ ---                 ┆ ---                ┆ f32         ┆ ---                ┆ ---                │
│ f32                 ┆ f32                ┆             ┆ f32                ┆ f32                │
╞═════════════════════╪

## Step 7: Aggregate to County-Year Level

In [8]:
# Aggregate to county-year level using CORRECTED MME calculation
print("Aggregating to county-year level...")
print("=" * 60)

print("\n⚠️  IMPORTANT: Using CORRECTED MME Calculation")
print("   Formula: Dosage_Strength × DOSAGE_UNIT × MME_Conversion_Factor")
print("   This is the CDC-standard method (not the base-weight method)")
print("")

# Group by state, county, and year
# Calculate MME correctly: dosage strength (mg/pill) × number of pills × conversion factor
df_county_year = (
    df_arcos
    .group_by(["BUYER_STATE", "BUYER_COUNTY", "year"])
    .agg([
        (pl.col("Dosage_Strength") * pl.col("DOSAGE_UNIT") * pl.col("MME_Conversion_Factor"))
            .sum()
            .alias("opioid_shipments_mme"),
        pl.col("DOSAGE_UNIT").sum().alias("total_pills")
    ])
)

# Rename columns for clarity
df_county_year = df_county_year.rename({
    "BUYER_STATE": "state",
    "BUYER_COUNTY": "county_name"
})

print(f"✓ Aggregation complete!")
print(f"  Aggregated rows: {df_county_year.shape[0]:,}")
print(f"  Columns: {df_county_year.shape[1]}")
print(f"\n  Included columns:")
print(f"    - opioid_shipments_mme: CDC-standard method (Dosage_Strength × DOSAGE_UNIT × MME_Conversion_Factor)")
print(f"    - total_pills: Total dosage units")

# Quick sanity check on MME values
print(f"\n  MME Sanity Check:")
mme_check = df_county_year.select([
    pl.col("opioid_shipments_mme").min().alias("min_mme"),
    pl.col("opioid_shipments_mme").mean().alias("mean_mme"),
    pl.col("opioid_shipments_mme").max().alias("max_mme")
])
print(mme_check)
print(f"  ✓ Review min/mean/max above: these are county-year totals, not per-transaction or per-capita values")

print("\n" + "=" * 60)
print("Sample of aggregated data:")
print(df_county_year.head(20))


Aggregating to county-year level...

⚠️  IMPORTANT: Using CORRECTED MME Calculation
   Formula: Dosage_Strength × DOSAGE_UNIT × MME_Conversion_Factor
   This is the CDC-standard method (not the base-weight method)

✓ Aggregation complete!
  Aggregated rows: 10,248
  Columns: 5

  Included columns:
    - opioid_shipments_mme: CDC-standard method (Dosage_Strength × DOSAGE_UNIT × MME_Conversion_Factor)
    - total_pills: Total dosage units

  MME Sanity Check:
shape: (1, 3)
┌─────────┬──────────────┬──────────┐
│ min_mme ┆ mean_mme     ┆ max_mme  │
│ ---     ┆ ---          ┆ ---      │
│ f32     ┆ f32          ┆ f32      │
╞═════════╪══════════════╪══════════╡
│ 60.0    ┆ 1.01305984e8 ┆ 4.3277e9 │
└─────────┴──────────────┴──────────┘
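
The sanity-check columns are typed `f32`, so the county-year sums are held in 32-bit floats. Float32 has a 24-bit significand and cannot represent every integer above 2^24 = 16,777,216, which is small next to totals around 10^9. The stdlib sketch below emulates float32 rounding to show the effect. `to_f32` is a hypothetical helper, and Polars' internal summation order may differ from this naive addition, so treat it as an illustration of the rounding granularity rather than a measurement of the error in this dataset. One option in Polars would be to apply `.cast(pl.Float64)` to the product before `.sum()`.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

big = to_f32(16_777_216.0)       # 2**24, exactly representable in float32
after_add = to_f32(big + 1.0)    # 2**24 + 1 has no float32 representation
print(big, after_add)

county_total = 4.3277e9          # magnitude of the max_mme value reported above
delta = to_f32(county_total + 60.0) - to_f32(county_total)
print(delta)                     # how much of a 60 MME shipment survives the rounding
```

At this magnitude adjacent float32 values are 512 apart, so a small shipment added to a large running total can vanish entirely.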

## Step 8: Add FIPS Codes for Geographic Matching

In [9]:
print("Adding FIPS codes to prescription data...")
print("=" * 60)

# Load FIPS reference file
fips_file = '../reference/fips.txt'
print(f"\n1. Loading FIPS reference from: {fips_file}")

# State FIPS code mapping
state_fips_map = {
    '01': 'AL', '02': 'AK', '04': 'AZ', '05': 'AR', '06': 'CA',
    '08': 'CO', '09': 'CT', '10': 'DE', '11': 'DC', '12': 'FL',
    '13': 'GA', '15': 'HI', '16': 'ID', '17': 'IL', '18': 'IN',
    '19': 'IA', '20': 'KS', '21': 'KY', '22': 'LA', '23': 'ME',
    '24': 'MD', '25': 'MA', '26': 'MI', '27': 'MN', '28': 'MS',
    '29': 'MO', '30': 'MT', '31': 'NE', '32': 'NV', '33': 'NH',
    '34': 'NJ', '35': 'NM', '36': 'NY', '37': 'NC', '38': 'ND',
    '39': 'OH', '40': 'OK', '41': 'OR', '42': 'PA', '44': 'RI',
    '45': 'SC', '46': 'SD', '47': 'TN', '48': 'TX', '49': 'UT',
    '50': 'VT', '51': 'VA', '53': 'WA', '54': 'WV', '55': 'WI',
    '56': 'WY'
}

fips_data = []
with open(fips_file, 'r') as f:
    for line in f:
        # Look for lines with 5-digit FIPS codes (format: "    01001        Autauga County")
        stripped = line.strip()
        if len(stripped) >= 6 and stripped[:5].isdigit():
            fips_code = stripped[:5]
            place_name = stripped[5:].strip()
            
            # Skip state-level codes (ending in 000) and header rows
            if fips_code.endswith('000') or not place_name:
                continue
            
            # Extract state abbreviation from FIPS code
            state_code = fips_code[:2]
            state_abbrev = state_fips_map.get(state_code)
            
            if state_abbrev:
                # Clean county name: remove "County", "Parish", "Borough", etc.
                county_name = place_name.upper()
                # Longer suffixes first so ' CITY AND BOROUGH' is not cut short by ' BOROUGH'
                for suffix in [' CITY AND BOROUGH', ' CENSUS AREA', ' MUNICIPALITY',
                              ' COUNTY', ' PARISH', ' BOROUGH', ' CITY']:
                    if county_name.endswith(suffix):
                        county_name = county_name[:-len(suffix)].strip()
                        break
                
                fips_data.append({
                    'fips': fips_code,
                    'county_name': county_name,
                    'state': state_abbrev
                })

df_fips = pl.DataFrame(fips_data)

# Add manual mappings for common naming variations
manual_mappings = [
    # Florida
    {'fips': '12086', 'county_name': 'MIAMI-DADE', 'state': 'FL'},
    {'fips': '12109', 'county_name': 'SAINT JOHNS', 'state': 'FL'},
    {'fips': '12103', 'county_name': 'SAINT LUCIE', 'state': 'FL'},
    # Minnesota
    {'fips': '27137', 'county_name': 'SAINT LOUIS', 'state': 'MN'},
    # Alabama
    {'fips': '01049', 'county_name': 'DE KALB', 'state': 'AL'},
    {'fips': '01115', 'county_name': 'SAINT CLAIR', 'state': 'AL'},
    # Virginia Independent Cities
    {'fips': '51770', 'county_name': 'ROANOKE CITY', 'state': 'VA'},
    {'fips': '51600', 'county_name': 'FAIRFAX CITY', 'state': 'VA'},
    {'fips': '51760', 'county_name': 'RICHMOND CITY', 'state': 'VA'},
    {'fips': '51550', 'county_name': 'CHESAPEAKE CITY', 'state': 'VA'},
    {'fips': '51660', 'county_name': 'HARRISONBURG CITY', 'state': 'VA'},
    {'fips': '51840', 'county_name': 'WINCHESTER CITY', 'state': 'VA'},
    {'fips': '51595', 'county_name': 'EMPORIA CITY', 'state': 'VA'},
    {'fips': '51683', 'county_name': 'MANASSAS CITY', 'state': 'VA'},
    {'fips': '51710', 'county_name': 'NORFOLK CITY', 'state': 'VA'},
    {'fips': '51800', 'county_name': 'SUFFOLK CITY', 'state': 'VA'},
    {'fips': '51540', 'county_name': 'CHARLOTTESVILLE CITY', 'state': 'VA'},
    {'fips': '51640', 'county_name': 'GALAX CITY', 'state': 'VA'},
    {'fips': '51610', 'county_name': 'FALLS CHURCH CITY', 'state': 'VA'},
    {'fips': '51720', 'county_name': 'NORTON CITY', 'state': 'VA'},
    {'fips': '51735', 'county_name': 'POQUOSON CITY', 'state': 'VA'},
    {'fips': '51740', 'county_name': 'PORTSMOUTH CITY', 'state': 'VA'},
    {'fips': '51670', 'county_name': 'HOPEWELL CITY', 'state': 'VA'},
    {'fips': '51520', 'county_name': 'BRISTOL CITY', 'state': 'VA'},
    {'fips': '51570', 'county_name': 'COLONIAL HEIGHTS CITY', 'state': 'VA'},
    {'fips': '51620', 'county_name': 'FRANKLIN CITY', 'state': 'VA'},
    {'fips': '51690', 'county_name': 'MARTINSVILLE CITY', 'state': 'VA'},
    {'fips': '51730', 'county_name': 'PETERSBURG CITY', 'state': 'VA'},
    {'fips': '51775', 'county_name': 'SALEM CITY', 'state': 'VA'},
    {'fips': '51820', 'county_name': 'WAYNESBORO CITY', 'state': 'VA'},
    {'fips': '51830', 'county_name': 'WILLIAMSBURG CITY', 'state': 'VA'},
    {'fips': '51700', 'county_name': 'NEWPORT NEWS CITY', 'state': 'VA'},
    {'fips': '51678', 'county_name': 'LEXINGTON CITY', 'state': 'VA'},
    {'fips': '51580', 'county_name': 'COVINGTON CITY', 'state': 'VA'},
    {'fips': '51630', 'county_name': 'FREDERICKSBURG CITY', 'state': 'VA'},
    {'fips': '51650', 'county_name': 'HAMPTON CITY', 'state': 'VA'},
    {'fips': '51590', 'county_name': 'DANVILLE CITY', 'state': 'VA'},
    {'fips': '51810', 'county_name': 'VIRGINIA BEACH CITY', 'state': 'VA'},
    {'fips': '51790', 'county_name': 'STAUNTON CITY', 'state': 'VA'},
    {'fips': '51685', 'county_name': 'MANASSAS PARK CITY', 'state': 'VA'},
    {'fips': '51680', 'county_name': 'LYNCHBURG CITY', 'state': 'VA'},
    {'fips': '51530', 'county_name': 'BUENA VISTA CITY', 'state': 'VA'},
    {'fips': '51510', 'county_name': 'ALEXANDRIA CITY', 'state': 'VA'},
    # Other states
    {'fips': '32510', 'county_name': 'CARSON CITY', 'state': 'NV'},
    {'fips': '12027', 'county_name': 'DE SOTO', 'state': 'FL'},
    {'fips': '08014', 'county_name': 'BROOMFIELD', 'state': 'CO'},
]

df_manual = pl.DataFrame(manual_mappings)
df_fips = pl.concat([df_fips, df_manual])

print(f"   ✓ Loaded {len(df_fips)} FIPS codes (including manual mappings)")
print(f"   Sample FIPS data:")
print(df_fips.head(10))

# Merge FIPS codes with prescription data
print(f"\n2. Merging FIPS codes with prescription data...")
print(f"   Prescription data: {df_county_year.shape[0]} rows")

df_county_year_with_fips = df_county_year.join(
    df_fips,
    on=['state', 'county_name'],
    how='left'
)

# Check merge results
matched = df_county_year_with_fips.filter(pl.col('fips').is_not_null()).shape[0]
unmatched = df_county_year_with_fips.filter(pl.col('fips').is_null()).shape[0]
match_rate = (matched / df_county_year_with_fips.shape[0]) * 100

print(f"\n3. Merge Results:")
print(f"   Total rows: {df_county_year_with_fips.shape[0]:,}")
print(f"   Matched: {matched:,} ({match_rate:.1f}%)")
print(f"   Unmatched: {unmatched:,}")

if unmatched > 0:
    print(f"\n4. Sample of unmatched counties:")
    unmatched_counties = df_county_year_with_fips.filter(
        pl.col('fips').is_null()
    ).select(['state', 'county_name']).unique()
    print(unmatched_counties.head(20))
    
print("\n" + "=" * 60)
print("✓ FIPS codes added!")

Adding FIPS codes to prescription data...

1. Loading FIPS reference from: ../reference/fips.txt
   ✓ Loaded 3190 FIPS codes (including manual mappings)
   Sample FIPS data:
shape: (10, 3)
┌───────┬─────────────┬───────┐
│ fips  ┆ county_name ┆ state │
│ ---   ┆ ---         ┆ ---   │
│ str   ┆ str         ┆ str   │
╞═══════╪═════════════╪═══════╡
│ 01001 ┆ AUTAUGA     ┆ AL    │
│ 01003 ┆ BALDWIN     ┆ AL    │
│ 01005 ┆ BARBOUR     ┆ AL    │
│ 01007 ┆ BIBB        ┆ AL    │
│ 01009 ┆ BLOUNT      ┆ AL    │
│ 01011 ┆ BULLOCK     ┆ AL    │
│ 01013 ┆ BUTLER      ┆ AL    │
│ 01015 ┆ CALHOUN     ┆ AL    │
│ 01017 ┆ CHAMBERS    ┆ AL    │
│ 01019 ┆ CHEROKEE    ┆ AL    │
└───────┴─────────────┴───────┘

2. Mer
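
Because the join key is `(state, county_name)`, any name that appears more than once for a state in the lookup makes the left join emit extra rows. Stripping the ` CITY` suffix is one way that can happen: a Virginia independent city and the county of the same name collapse to a single key. The sketch below reproduces the suffix logic on three hand-written entries and lists the ambiguous keys. The entries use the Census codes for Roanoke County (51161), Roanoke city (51770), and Autauga County (01001); the helper names are hypothetical. Whether this explains the duplicate FIPS-year rows seen in this notebook's run is an assumption that would need checking against `fips.txt`.

```python
from collections import defaultdict

# Longest suffixes first, so ' CITY AND BOROUGH' wins over ' BOROUGH'.
SUFFIXES = [" CITY AND BOROUGH", " CENSUS AREA", " MUNICIPALITY",
            " COUNTY", " PARISH", " BOROUGH", " CITY"]

def strip_suffix(place_name: str) -> str:
    """Uppercase a place name and remove the first matching suffix."""
    name = place_name.upper()
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)].strip()
    return name

entries = [("51161", "Roanoke County", "VA"),
           ("51770", "Roanoke city", "VA"),
           ("01001", "Autauga County", "AL")]

by_key = defaultdict(set)
for fips, place, state in entries:
    by_key[(state, strip_suffix(place))].add(fips)

# Keys that map to more than one FIPS code would fan out in a left join.
ambiguous = {key: sorted(codes) for key, codes in by_key.items() if len(codes) > 1}
print(ambiguous)
```

Running a check like this on the full lookup before the join shows which county names need a disambiguating rule instead of relying on the later FIPS-year aggregation.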

## Step 9: Validate and Save Final Dataset

In [10]:
print("=" * 60)
print("FINAL VALIDATION AND SAVE")
print("=" * 60)

# Reorder columns to put FIPS first
df_final = df_county_year_with_fips.select([
    'fips', 'state', 'county_name', 'year', 
    'opioid_shipments_mme', 'total_pills'
])

# Check unique combinations before deduplication
unique_before = df_final.select(['fips', 'year']).unique().shape[0]
print(f"\nBefore deduplication:")
print(f"   Total rows: {df_final.shape[0]:,}")
print(f"   Unique FIPS-Year combinations: {unique_before:,}")

# If there are duplicates, aggregate them
if df_final.shape[0] > unique_before:
    duplicates_count = df_final.shape[0] - unique_before
    print(f"   Found {duplicates_count} duplicate FIPS-Year entries. Aggregating...")
    
    # Aggregate by FIPS and Year, summing the values
    df_final = df_final.group_by(['fips', 'year']).agg([
        pl.col('state').first(),
        pl.col('county_name').first(),
        pl.col('opioid_shipments_mme').sum(),
        pl.col('total_pills').sum()
    ]).sort(['state', 'county_name', 'year'])
    
    print(f"   After aggregation: {df_final.shape[0]:,} rows")

# Reorder columns
df_final = df_final.select([
    'fips', 'state', 'county_name', 'year', 
    'opioid_shipments_mme', 'total_pills'
])

# Final validation
unique_final = df_final.select(['fips', 'year']).unique().shape[0]
print(f"\n1. Data Quality Checks:")
print(f"   Total rows: {df_final.shape[0]:,}")
print(f"   Unique FIPS-Year: {unique_final:,}")
print(f"   Years: {df_final['year'].min()} - {df_final['year'].max()}")
print(f"   States: {df_final['state'].n_unique()}")
print(f"   Counties: {df_final['county_name'].n_unique()}")
print(f"   Missing FIPS: {df_final.filter(pl.col('fips').is_null()).shape[0]}")
print(f"   Actual Duplicates: {df_final.shape[0] - unique_final}")

# Summary statistics
print(f"\n2. MME Summary:")
mme_summary = df_final.select([
    pl.col('opioid_shipments_mme').sum().alias('Total_MME'),
    pl.col('opioid_shipments_mme').mean().alias('Mean_MME_per_county_year'),
    pl.col('opioid_shipments_mme').min().alias('Min_MME'),
    pl.col('opioid_shipments_mme').max().alias('Max_MME')
])
print(mme_summary)

print(f"\n   ✓ Review the summary above: values are county-year totals, not per-capita rates")

# Save to parquet
output_file = '../data/processed/arcos_county_year_with_fips.parquet'
print(f"\n3. Saving to: {output_file}")

df_final.write_parquet(output_file, compression='snappy')

file_size = os.path.getsize(output_file)
print(f"   ✓ File saved successfully!")
print(f"   Size: {file_size:,} bytes ({file_size / 1024:.2f} KB)")

print("\n" + "=" * 60)
print("SUMMARY:")
print("=" * 60)
print(f"  Time period: {df_final['year'].min()} - {df_final['year'].max()}")
print(f"  States: {df_final['state'].n_unique()}")
print(f"  Counties: {df_final['county_name'].n_unique()}")
print(f"  Total observations: {df_final.shape[0]:,}")
print(f"\n  Columns: {df_final.columns}")
print(f"\n  MME Column:")
print(f"    - opioid_shipments_mme: CDC-standard method")
print(f"      (Dosage_Strength × DOSAGE_UNIT × MME_Conversion_Factor)")

print("\n" + "=" * 60)
print("Sample of final data:")
print(df_final.head(10))

print("\n✓ Preprocessing complete! Ready to merge with population data.")
print("\n⚠️  NEXT STEP: Run src/build_panel.py to rebuild analysis panels with corrected MME data")


FINAL VALIDATION AND SAVE

Before deduplication:
   Total rows: 10,298
   Unique FIPS-Year combinations: 10,248
   Found 50 duplicate FIPS-Year entries. Aggregating...
   After aggregation: 10,248 rows

1. Data Quality Checks:
   Total rows: 10,248
   Unique FIPS-Year: 10,248
   Years: 2006 - 2015
   States: 14
   Counties: 773
   Missing FIPS: 0
   Actual Duplicates: 0

2. MME Summary:
shape: (1, 4)
┌───────────┬──────────────────────────┬─────────┬──────────┐
│ Total_MME ┆ Mean_MME_per_county_year ┆ Min_MME ┆ Max_MME  │
│ ---       ┆ ---                      ┆ ---     ┆ ---      │
│ f32       ┆ f32                      ┆ f32     ┆ f32      │
╞═══════════╪══════════════════════════╪═════════╪══════════╡
│ 1.0437e1
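
The `Counties` figure in the quality checks counts distinct `county_name` values, so counties that share a name across states are counted once. That is consistent with the run above reporting 10,248 rows over ten years but only 773 names. Counting distinct FIPS codes gives the number of counties instead. A small stdlib sketch with illustrative rows (not taken from the notebook's data):

```python
# (fips, state, county_name, year): illustrative rows only
rows = [
    ("17001", "IL", "ADAMS", 2006),
    ("17001", "IL", "ADAMS", 2007),
    ("42001", "PA", "ADAMS", 2006),
    ("45003", "SC", "AIKEN", 2006),
]

unique_names = len({name for _, _, name, _ in rows})
unique_fips = len({fips for fips, _, _, _ in rows})
print(f"Distinct county names: {unique_names}")
print(f"Distinct counties (by FIPS): {unique_fips}")
```

In the notebook itself, the equivalent change would be to report `df_final['fips'].n_unique()` alongside, or instead of, the name-based count.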