## Step 1: Import Libraries

In [6]:
import zipfile
import pandas as pd
import os

print("Libraries imported successfully!")

Libraries imported successfully!


## Step 2: Locate ARCOS Data File

In [7]:
# Path to extracted TSV file
tsv_path = '../data/raw/arcos_all/arcos_all.tsv'

print("ARCOS Data File:")
print("=" * 60)

if os.path.exists(tsv_path):
    file_size = os.path.getsize(tsv_path)
    print(f"✓ File found: {tsv_path}")
    print(f"  Size: {file_size:,} bytes ({file_size / (1024**2):.2f} MB)")
    print(f"  Size: {file_size / (1024**3):.2f} GB")
    print("\n✓ Ready to process!")
else:
    print(f"✗ File not found: {tsv_path}")
    print("\nPlease extract arcos_all.zip first.")

ARCOS Data File:
✓ File found: ../data/raw/arcos_all/arcos_all.tsv
  Size: 245,729,430,235 bytes (234345.85 MB)
  Size: 228.85 GB

✓ Ready to process!
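
If the TSV has not been extracted yet, the archive member can be streamed to disk with the standard library rather than loaded into memory. The following is a minimal sketch under assumptions: the helper name `extract_member` is hypothetical, and the demo uses a tiny throwaway archive because the real layout of `arcos_all.zip` (member name, nesting) is not shown in this notebook.

```python
import os
import shutil
import tempfile
import zipfile


def extract_member(zip_path, member_name, dest_path, buffer_size=16 * 1024 * 1024):
    """Stream one archive member to dest_path without loading it into memory."""
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # archive.open returns a file-like object that decompresses on the fly
        with archive.open(member_name) as src, open(dest_path, "wb") as dst:
            shutil.copyfileobj(src, dst, length=buffer_size)
    return os.path.getsize(dest_path)


# Self-contained demo on a tiny archive (file names are hypothetical).
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("demo.tsv", "BUYER_STATE\tMME\nFL\t1.5\n")

demo_tsv = os.path.join(demo_dir, "out", "demo.tsv")
demo_size = extract_member(demo_zip, "demo.tsv", demo_tsv)
print(f"Extracted {demo_size} bytes to {demo_tsv}")
```

For the real archive, `zipfile.ZipFile(zip_path).namelist()` lists the member names to pass as `member_name`.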


## Step 3: Preview Data Structure

In [8]:
# Preview first few rows
print(f"Reading from: {tsv_path}")
print("=" * 60)

df_preview = pd.read_csv(tsv_path, sep='\t', nrows=5)
        
print("\nFirst 5 rows:")
print(df_preview)

print(f"\n\nColumns ({len(df_preview.columns)} total):")
for col in df_preview.columns:
    print(f"  - {col}")

Reading from: ../data/raw/arcos_all/arcos_all.tsv

First 5 rows:
  REPORTER_DEA_NO REPORTER_BUS_ACT         REPORTER_NAME  \
0       RM0220688      DISTRIBUTOR  MCKESSON CORPORATION   
1       RM0220688      DISTRIBUTOR  MCKESSON CORPORATION   
2       RM0220688      DISTRIBUTOR  MCKESSON CORPORATION   
3       RM0220688      DISTRIBUTOR  MCKESSON CORPORATION   
4       RM0220688      DISTRIBUTOR  MCKESSON CORPORATION   

   REPORTER_ADDL_CO_INFO      REPORTER_ADDRESS1  REPORTER_ADDRESS2  \
0                    NaN  DBA MCKESSON DRUG CO.  3000 KENSKILL AVE   
1                    NaN  DBA MCKESSON DRUG CO.  3000 KENSKILL AVE   
2                    NaN  DBA MCKESSON DRUG CO.  3000 KENSKILL AVE   
3                    NaN  DBA MCKESSON DRUG CO.  3000 KENSKILL AVE   
4                    NaN  DBA MCKESSON DRUG CO.  3000 KENSKILL AVE   

         REPORTER_CITY REPORTER_STATE  REPORTER_ZIP REPORTER_COUNTY  ...  \
0  WASHINGTON CT HOUSE             OH         43160         FAYETTE  ...   
[output truncated]

## Step 4: Define Filters and Columns

## Step 4a: Understand the Bottleneck

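The extraction is slow because:
1. Reading directly from the compressed ZIP adds decompression overhead (this notebook avoids that by reading the already extracted TSV)
2. Parsing tab-separated text is slow compared to binary formats such as Parquet
3. Date parsing runs on hundreds of millions of rows
4. Even with `usecols`, pandas must still tokenize every line of the file; it only skips type conversion and storage for the unused columns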

**Alternative approaches:**
- Option 1: Use this notebook (slow but works)
- Option 2: Extract the TSV from ZIP first, then process (2-step but faster total)
- Option 3: Use polars instead of pandas (often faster at parsing; the 3-5x figure is an estimate, not measured here)
- Option 4: If you only need 2006-2015 and 14 states, filter during read (much smaller output)
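
Before committing to one of these options, it can help to estimate how many rows the file holds, since that determines the number of chunks and gives a rough bound on parsing time. The sketch below is an illustration, not part of the original notebook: it assumes rows have roughly uniform length, and `estimate_total_rows` is a hypothetical helper name.

```python
import os
import tempfile


def estimate_total_rows(path, sample_lines=100_000):
    """Estimate data rows from file size and the mean byte length of sampled rows."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as fh:
        header_bytes = len(fh.readline())  # skip the header line
        sampled_bytes = 0
        sampled_rows = 0
        for line in fh:
            sampled_bytes += len(line)
            sampled_rows += 1
            if sampled_rows >= sample_lines:
                break
    if sampled_rows == 0:
        return 0
    avg_row_bytes = sampled_bytes / sampled_rows
    return round((file_size - header_bytes) / avg_row_bytes)


# Self-contained demo: 1,000 identical rows, estimated from a 50-row sample.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv")
with open(demo_path, "w", newline="") as fh:
    fh.write("state\tmme\n")
    for _ in range(1000):
        fh.write("FL\t1.5\n")

demo_estimate = estimate_total_rows(demo_path, sample_lines=50)
print(f"Estimated rows: {demo_estimate:,}")
```

On the real file the call would be `estimate_total_rows(tsv_path)`; dividing the result by the chunk size gives the approximate number of chunks to expect.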

## Step 4b: FASTER Alternative - Filter While Reading

If you only need specific states/years, filter during extraction. This greatly reduces the output size and the memory held while processing, although the whole TSV still has to be parsed.

In [9]:
# OPTIONAL: Define filters to reduce output size dramatically
# Set to None to disable filtering and extract all data

# States to keep (14 states for DiD analysis)
states_filter = ['FL', 'WA', 'GA', 'AL', 'SC', 'NC', 'TN', 'MS', 'OR', 'CO', 'MN', 'NV', 'CA', 'VA']

# Year range
year_min, year_max = 2006, 2015

if states_filter is not None:
    print("Filtering configuration:")
    print(f"  States: {len(states_filter)} states - {', '.join(states_filter)}")
    print(f"  Years: {year_min}-{year_max}")
    print("\nThis will extract only the filtered data (much smaller output)")
else:
    print("Filtering DISABLED - extracting all data")
    print("This will save the full dataset with selected columns only")

Filtering configuration:
  States: 14 states - FL, WA, GA, AL, SC, NC, TN, MS, OR, CO, MN, NV, CA, VA
  Years: 2006-2015

This will extract only the filtered data (much smaller output)


In [10]:
# Columns to keep - focused on essential data for analysis
columns_to_keep = [
    # ESSENTIAL - Geographic identifiers
    'BUYER_STATE',              # State identifier
    'BUYER_COUNTY',             # County name
    
    # ESSENTIAL - Temporal data
    'TRANSACTION_DATE',         # Transaction date (will extract year)
    
    # ESSENTIAL - Outcome measures
    'MME',                      # Morphine Milligram Equivalents (primary outcome)
    'DOSAGE_UNIT',              # Number of pills/units (secondary outcome)
    
    # ESSENTIAL - Drug information
    'DRUG_NAME',                # Opioid type (OXYCODONE, HYDROCODONE, etc.)
    
    # OPTIONAL - For validation and flexibility
    'MME_Conversion_Factor',    # For MME validation/recalculation
    'Dosage_Strength',          # mg per unit (for MME verification)
    'CALC_BASE_WT_IN_GM',       # Total active ingredient weight (for QA)
    'DRUG_CODE',                # Drug classification code
    'NDC_NO',                   # National Drug Code
    'TRANSACTION_CODE',         # Transaction type (to filter sales vs adjustments)
    'Measure'                   # Unit type (TAB, CAP, etc.)
]

print("Configuration:")
print("=" * 60)
print(f"\nColumns to keep ({len(columns_to_keep)}):")
print("\nESSENTIAL (for analysis):")
for col in ['BUYER_STATE', 'BUYER_COUNTY', 'TRANSACTION_DATE', 'MME', 'DOSAGE_UNIT', 'DRUG_NAME']:
    print(f"  ✓ {col}")
print("\nOPTIONAL (for validation/flexibility):")
for col in ['MME_Conversion_Factor', 'Dosage_Strength', 'CALC_BASE_WT_IN_GM', 'DRUG_CODE', 'NDC_NO', 'TRANSACTION_CODE', 'Measure']:
    print(f"  - {col}")
if states_filter is not None:
    print(f"\nRow filter: {len(states_filter)} states, years {year_min}-{year_max} (applied during extraction)")
else:
    print("\nNo filtering applied - keeping all states and years for maximum flexibility")


Configuration:

Columns to keep (13):

ESSENTIAL (for analysis):
  ✓ BUYER_STATE
  ✓ BUYER_COUNTY
  ✓ TRANSACTION_DATE
  ✓ MME
  ✓ DOSAGE_UNIT
  ✓ DRUG_NAME

OPTIONAL (for validation/flexibility):
  - MME_Conversion_Factor
  - Dosage_Strength
  - CALC_BASE_WT_IN_GM
  - DRUG_CODE
  - NDC_NO
  - TRANSACTION_CODE
  - Measure

Row filter: 14 states, years 2006-2015 (applied during extraction)
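
Column names in this file mix upper case (`BUYER_STATE`) and mixed case (`Dosage_Strength`, `Measure`), and `usecols` raises a `ValueError` if any requested name is absent. A header-only read makes such mismatches visible before the long extraction starts. This is a minimal sketch; `missing_columns` is a hypothetical helper, and the demo uses an in-memory TSV rather than the real file.

```python
import io

import pandas as pd


def missing_columns(source, wanted, sep="\t"):
    """Return the wanted columns that are absent from the file header, in order."""
    # nrows=0 parses only the header line, so this is cheap even for huge files
    header = pd.read_csv(source, sep=sep, nrows=0)
    available = set(header.columns)
    return [col for col in wanted if col not in available]


# Self-contained demo with a three-column header.
demo_tsv = io.StringIO("BUYER_STATE\tBUYER_COUNTY\tMME\nFL\tDADE\t1.5\n")
demo_missing = missing_columns(demo_tsv, ["BUYER_STATE", "MME", "Dosage_Strength"])
print(f"Missing columns: {demo_missing}")
```

With the notebook's variables the call would be `missing_columns(tsv_path, columns_to_keep)`, and an empty list means every requested column exists.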


## Step 4c: TEST - Create Small Sample to Verify Columns

Before processing the full 228 GB file, create a small sample to verify that all columns (especially `Dosage_Strength`) are saved correctly.

In [11]:
print("=" * 60)
print("TESTING: Creating small sample parquet file")
print("=" * 60)

# Use the same configuration as the full extraction
test_columns = columns_to_keep.copy()
test_dtypes = {
    'BUYER_STATE': str,
    'BUYER_COUNTY': str,
    'DRUG_NAME': str,
    'DRUG_CODE': str,
    'TRANSACTION_CODE': str,
    'Measure': str,
    'NDC_NO': str,
    'MME': 'float32',
    'DOSAGE_UNIT': 'float32',
    'MME_Conversion_Factor': 'float32',
    'Dosage_Strength': 'float32',
    'CALC_BASE_WT_IN_GM': 'float32'
}

print(f"\n1. Reading sample (100,000 rows) from TSV...")
df_test = pd.read_csv(
    tsv_path,
    sep='\t',
    usecols=test_columns,
    dtype=test_dtypes,
    nrows=100000
)

# Add year column (same as full extraction)
df_test['year'] = pd.to_datetime(df_test['TRANSACTION_DATE'], format='%Y-%m-%d', errors='coerce').dt.year.astype('Int16')

print(f"   ✓ Read {len(df_test):,} rows")
print(f"   Columns: {len(df_test.columns)}")

# Save test file
test_output = '../data/raw/TEST_sample.parquet'
print(f"\n2. Saving test file: {test_output}")
df_test.to_parquet(test_output, index=False, compression='snappy')

file_size = os.path.getsize(test_output)
print(f"   ✓ Saved! Size: {file_size / (1024**2):.2f} MB")

# Verify by reading it back
print(f"\n3. Verifying by reading back...")
df_verify = pd.read_parquet(test_output)

print(f"   ✓ Read back {len(df_verify):,} rows × {df_verify.shape[1]} columns")
print(f"\n   Columns in saved file:")
for i, col in enumerate(df_verify.columns, 1):
    null_count = df_verify[col].isna().sum()
    print(f"      {i:2d}. {col:30s} (nulls: {null_count:6,})")

# Check critical columns
print(f"\n4. Critical Column Check:")
print(f"   ✓ Dosage_Strength present: {'Dosage_Strength' in df_verify.columns}")
print(f"   ✓ CALC_BASE_WT_IN_GM present: {'CALC_BASE_WT_IN_GM' in df_verify.columns}")
print(f"   ✓ MME_Conversion_Factor present: {'MME_Conversion_Factor' in df_verify.columns}")

# Show sample data
print(f"\n5. Sample data with key columns:")
print(df_verify[['BUYER_STATE', 'DRUG_NAME', 'Dosage_Strength', 'CALC_BASE_WT_IN_GM', 'DOSAGE_UNIT', 'MME', 'year']].head(10))

print("\n" + "=" * 60)
if 'Dosage_Strength' in df_verify.columns:
    print("✅ SUCCESS! Dosage_Strength is being saved correctly!")
    print("   You can proceed with the full extraction.")
else:
    print("❌ PROBLEM! Dosage_Strength is missing from saved file!")
    print("   Something is wrong with the configuration.")
print("=" * 60)

TESTING: Creating small sample parquet file

1. Reading sample (100,000 rows) from TSV...
   ✓ Read 100,000 rows
   Columns: 14

2. Saving test file: ../data/raw/TEST_sample.parquet
   ✓ Saved! Size: 0.86 MB

3. Verifying by reading back...
   ✓ Read back 100,000 rows × 14 columns

   Columns in saved file:
       1. BUYER_STATE                    (nulls:      0)
       2. BUYER_COUNTY                   (nulls:      1)
       3. TRANSACTION_CODE               (nulls:      0)
       4. DRUG_CODE                      (nulls:      0)
       5. NDC_NO                         (nulls:      0)
       6. DRUG_NAME                      (nulls:      0)
       7. Measure                        (nulls:      4)
       8. MME_Conversion_Factor          (nulls:      0)
       9. Dosage_Strength                (nulls:      0)
      10. TRANSACTION_DATE               (nulls:      0)
      11. CALC_BASE_WT_IN_GM             (nulls:      0)
      12. DOSAGE_UNIT                    (nulls: 14,139)
      [output truncated]
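
The success message in this test depends only on the column being present, not on its values. Since the first rows shown in Step 6 have `Dosage_Strength` equal to 0.0, a value-level check may be a useful complement. The sketch below is an assumption-laden illustration: `column_quality` is a hypothetical helper, and what share of zeros is acceptable is a judgment this notebook does not state.

```python
import pandas as pd


def column_quality(df, columns):
    """Report the share of null and zero values for each listed column."""
    rows = []
    for col in columns:
        values = df[col]
        rows.append({
            "column": col,
            "null_share": float(values.isna().mean()),
            "zero_share": float((values == 0).mean()),  # NaN compares as not zero
        })
    return pd.DataFrame(rows).set_index("column")


# Self-contained demo with made-up values.
demo_df = pd.DataFrame({
    "Dosage_Strength": [0.0, 0.0, 5.0, 10.0],
    "DOSAGE_UNIT": [100.0, None, 0.0, 30.0],
})
demo_report = column_quality(demo_df, ["Dosage_Strength", "DOSAGE_UNIT"])
print(demo_report)
```

Applied to the sample, `column_quality(df_verify, ['Dosage_Strength', 'DOSAGE_UNIT', 'MME'])` would show whether the zeros are isolated or pervasive.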

## Step 5: Extract All Data to Parquet (Optimized)

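**Note:** This cell applies the state and year filters from Step 4b when `states_filter` is set, and extracts all data otherwise. To keep the full dataset and filter later in analysis notebooks, set `states_filter = None` before running it.

**Optimizations applied:**
- Large chunk size (2M rows) for fewer iterations
- Pre-specified data types (`str` for text columns, `float32` instead of `float64` for numeric columns)
- Explicit date format (`%Y-%m-%d`) so dates are parsed without format inference
- A progress line after every chunk

**Estimated time:** depends on disk and parser throughput. The original estimate of 5-10 minutes (3-4x faster than default settings) is likely optimistic for a 228 GB input.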
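
The effect of choosing `float32` over `float64` can be checked directly on a synthetic column. This is a small illustration rather than a measurement of the ARCOS data; the row count and values are arbitrary.

```python
import numpy as np
import pandas as pd

n_rows = 1_000_000
values = np.linspace(0.0, 1000.0, n_rows)

# index=False counts only the data buffer, not the RangeIndex
bytes_float64 = pd.Series(values, dtype="float64").memory_usage(index=False)
bytes_float32 = pd.Series(values, dtype="float32").memory_usage(index=False)

print(f"float64: {bytes_float64 / 1024**2:.1f} MB")
print(f"float32: {bytes_float32 / 1024**2:.1f} MB")
```

The trade-off is precision: `float32` carries about seven significant decimal digits, which is why values such as 3546.399902 appear in the Step 6 output instead of a rounder number.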

In [12]:
import time

print("Extracting ARCOS data (Optimized)...")
print("=" * 60)

start_time = time.time()

# Check if filtering is enabled
use_filtering = 'states_filter' in locals() and states_filter is not None
if use_filtering:
    print(f"FILTERING ENABLED: {len(states_filter)} states, years {year_min}-{year_max}")
else:
    print("NO FILTERING: Extracting all data")
print()

# Optimized dtypes - no category to avoid memory overhead
dtypes = {
    'BUYER_STATE': str,
    'BUYER_COUNTY': str,
    'DRUG_NAME': str,
    'DRUG_CODE': str,
    'TRANSACTION_CODE': str,
    'Measure': str,
    'NDC_NO': str,
    'MME': 'float32',
    'DOSAGE_UNIT': 'float32',
    'MME_Conversion_Factor': 'float32',
    'Dosage_Strength': 'float32',
    'CALC_BASE_WT_IN_GM': 'float32'
}

# Read with chunking for progress tracking
chunk_size = 2_000_000  # 2M rows per chunk
chunks = []
total_rows = 0
chunk_num = 0

print("Reading TSV file in chunks (2M rows each)...")
print()

reader = pd.read_csv(
    tsv_path,
    sep='\t',
    usecols=columns_to_keep,
    dtype=dtypes,
    engine='c',
    chunksize=chunk_size,
    low_memory=False
)

for chunk in reader:
    chunk_num += 1
    chunk_rows = len(chunk)
    total_rows += chunk_rows
    
    # Extract year from dates
    chunk['year'] = pd.to_datetime(chunk['TRANSACTION_DATE'], format='%Y-%m-%d', errors='coerce').dt.year.astype('Int16')
    
    # Apply filtering if enabled
    if use_filtering:
        chunk = chunk[
            (chunk['BUYER_STATE'].isin(states_filter)) & 
            (chunk['year'] >= year_min) & 
            (chunk['year'] <= year_max)
        ]
    
    chunks.append(chunk)
    
    # Print progress
    if use_filtering:
        kept_rows = len(chunk)
        print(f"Chunk {chunk_num}: {chunk_rows:,} rows → {kept_rows:,} kept ({kept_rows/chunk_rows*100:.1f}%) | Total read: {total_rows:,}")
    else:
        print(f"Chunk {chunk_num}: {chunk_rows:,} rows | Total read: {total_rows:,} | Elapsed: {time.time()-start_time:.1f}s")

print()

# Concatenate all chunks
print("Concatenating chunks...")
df_all = pd.concat(chunks, ignore_index=True)

process_time = time.time() - start_time
print(f"✓ Complete! Processed {len(df_all):,} rows in {process_time:.1f}s ({process_time/60:.1f} min)")
print(f"  Columns: {df_all.shape[1]} | Year range: {df_all['year'].min()}-{df_all['year'].max()} | States: {df_all['BUYER_STATE'].nunique()}")

# Save to parquet
output_dir = '../data/raw'
os.makedirs(output_dir, exist_ok=True)

if use_filtering:
    output_file = os.path.join(output_dir, f'arcos_filtered_{year_min}_{year_max}.parquet')
else:
    output_file = os.path.join(output_dir, 'arcos_all_extracted.parquet')

print(f"\nSaving to parquet: {output_file}")
df_all.to_parquet(output_file, index=False, compression='snappy')

file_size = os.path.getsize(output_file)
total_time = time.time() - start_time

print(f"✓ Saved successfully!")
print(f"  File size: {file_size / (1024**2):.1f} MB")

print("\n" + "=" * 60)
print(f"✓ COMPLETE in {total_time:.1f} seconds ({total_time/60:.1f} minutes)")
print("=" * 60)

Extracting ARCOS data (Optimized)...
FILTERING ENABLED: 14 states, years 2006-2015

Reading TSV file in chunks (2M rows each)...

Chunk 1: 2,000,000 rows → 0 kept (0.0%) | Total read: 2,000,000
Chunk 2: 2,000,000 rows → 0 kept (0.0%) | Total read: 4,000,000
Chunk 3: 2,000,000 rows → 0 kept (0.0%) | Total read: 6,000,000
Chunk 4: 2,000,000 rows → 0 kept (0.0%) | Total read: 8,000,000
Chunk 5: 2,000,000 rows → 0 kept (0.0%) | Total read: 10,000,000
Chunk 6: 2,000,000 rows → 0 kept (0.0%) | Total read: 12,000,000
Chunk 7: 2,000,000 rows → 0 kept (0.0%) | Total read: 14,000,000
Chunk 8: 2,000,000 rows → 0 kept (0.0%) | Total read: 16,000,000
Chunk 9: 2,000,000 rows → 440,201 kept (22.0%) | Total read: 18,000,000
Chunk 10: 2,000,000 rows → 0 kept (0.0%) | Total read: 20,000,000
Chunk 11: 2,000,000 rows → 431,292 kept (21.6%) | Total read: 22,000,000
Chunk 12: 2,000,000 rows → 0 kept (0.0%) | Total read: 24,000,000
Chunk 13: 2,000,000 rows → 321,473 kept (16.1%) | Total read: 26,000,000
[output truncated]

## Step 6: Verify Saved File

In [13]:
# Verify the saved file
print("Verifying saved file...")
print("=" * 60)

df_verify = pd.read_parquet(output_file)

print(f"\nFile loaded successfully!")
print(f"  Shape: {df_verify.shape[0]:,} rows × {df_verify.shape[1]} columns")
print(f"  Year range: {df_verify['year'].min()} - {df_verify['year'].max()}")
print(f"  States: {df_verify['BUYER_STATE'].nunique()} unique states")
print(f"  Counties: {df_verify['BUYER_COUNTY'].nunique()} unique counties")
print(f"  Drugs: {sorted(df_verify['DRUG_NAME'].unique())}")

print("\nFirst 5 rows:")
print(df_verify[['BUYER_STATE', 'BUYER_COUNTY', 'DRUG_NAME', 'Dosage_Strength', 'DOSAGE_UNIT', 'MME', 'year']].head())

print("\nSummary statistics:")
print(f"  Total MME: {df_verify['MME'].sum():,.0f}")
print(f"  Total dosage units: {df_verify['DOSAGE_UNIT'].sum():,.0f}")

print("\n✓ File is valid and ready for analysis!")

Verifying saved file...

File loaded successfully!
  Shape: 218,477,461 rows × 14 columns
  Year range: 2006 - 2015
  States: 14 unique states
  Counties: 777 unique counties
  Drugs: ['BUPRENORPHINE', 'CODEINE', 'DIHYDROCODEINE', 'FENTANYL', 'HYDROCODONE', 'HYDROMORPHONE', 'LEVORPHANOL', 'MEPERIDINE', 'METHADONE', 'MORPHINE', 'OPIUM, POWDERED', 'OXYCODONE', 'OXYMORPHONE', 'TAPENTADOL']

First 5 rows:
  BUYER_STATE BUYER_COUNTY      DRUG_NAME  Dosage_Strength  DOSAGE_UNIT  \
0          CA      ALAMEDA      METHADONE              0.0          0.0   
1          CA      ALAMEDA     MEPERIDINE              0.0          0.0   
2          CA      ALAMEDA       FENTANYL              0.0          0.0   
3          CA      ALAMEDA  HYDROMORPHONE              0.0          0.0   
4          CA      ALAMEDA  HYDROMORPHONE              0.0          0.0   

           MME  year  
0   458.827209  2015  
1    39.217499  2015  
2    10.000000  2015  
3  3546.399902  2015  
4    21.278400  2015  

[output truncated]
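
The verification above prints the state count and year range for visual inspection. The same expectations can be asserted programmatically against the filter settings from Step 4b. This is a minimal sketch under assumptions: `check_extract` is a hypothetical helper, and the demo data is made up.

```python
import pandas as pd


def check_extract(df, states, year_min, year_max):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    found_states = set(df["BUYER_STATE"].dropna().unique())
    unexpected = sorted(found_states - set(states))
    absent = sorted(set(states) - found_states)
    if unexpected:
        problems.append(f"unexpected states: {unexpected}")
    if absent:
        problems.append(f"states with no rows: {absent}")
    years = df["year"].dropna()
    if years.empty:
        problems.append("no valid years")
    elif years.min() < year_min or years.max() > year_max:
        problems.append(
            f"years outside {year_min}-{year_max}: {int(years.min())}-{int(years.max())}"
        )
    return problems


# Self-contained demo: one unexpected state, one absent state, one year out of range.
demo_df = pd.DataFrame({
    "BUYER_STATE": ["FL", "GA", "TX"],
    "year": pd.array([2006, 2015, 2016], dtype="Int16"),
})
demo_problems = check_extract(demo_df, ["FL", "GA", "WA"], 2006, 2015)
for problem in demo_problems:
    print(problem)
```

With the notebook's variables the call would be `check_extract(df_verify, states_filter, year_min, year_max)`.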