# Time-Series Prescriber Dataset

**Purpose:** Create monthly time-series data for each prescriber

**Granularity:** Prescriber-Month (each row = one prescriber in one month)

**Time Range:** 2023-01 to 2025-09 (33 months)

**Output:** Panel dataset for longitudinal analysis and forecasting

## Step 0: Setup

In [10]:
import polars as pl
import numpy as np
import pandas as pd
from google.cloud import bigquery
from pathlib import Path

# Initialize BigQuery client
client = bigquery.Client()

# Create output directory
output_dir = Path('outputs')
output_dir.mkdir(exist_ok=True)

print("✅ Setup complete")

✅ Setup complete


## Step 1: Create Monthly Prescriber Metrics (rx_claims)

In [11]:
# Query: Monthly metrics for each prescriber from rx_claims
query_monthly_rx = """
SELECT 
  PRESCRIBER_NPI_NBR,
  FORMAT_DATE('%Y-%m', SERVICE_DATE_DD) as month,
  
  -- Prescriber name (most common value in the month)
  APPROX_TOP_COUNT(PRESCRIBER_NPI_NM, 1)[OFFSET(0)].value as prescriber_name,
  
  -- Volume metrics
  COUNT(*) as prescription_count,
  COUNT(DISTINCT NDC_DRUG_NM) as unique_drugs,
  SUM(DISPENSED_QUANTITY_VAL) as total_quantity,
  
  -- Revenue metrics
  SUM(TOTAL_PAID_AMT) as total_revenue,
  AVG(TOTAL_PAID_AMT) as avg_claim_amount,
  
  -- Brand preference
  COUNTIF(NDC_PREFERRED_BRAND_NM IS NOT NULL AND NDC_PREFERRED_BRAND_NM != '') as brand_count,
  
  -- Payer mix
  COUNTIF(PAYER_PLAN_CHANNEL_NM = 'Commercial') as commercial_count,
  COUNTIF(PAYER_PLAN_CHANNEL_NM = 'Medicare') as medicare_count,
  COUNTIF(PAYER_PLAN_CHANNEL_NM = 'Medicaid') as medicaid_count,
  
  -- Days-supply patterns
  COUNTIF(DAYS_SUPPLY_VAL >= 90) as supply_90day_count,
  AVG(DAYS_SUPPLY_VAL) as avg_days_supply,
  
  -- State and specialty (approximate most frequent value in the month)
  APPROX_TOP_COUNT(PRESCRIBER_NPI_STATE_CD, 1)[OFFSET(0)].value as state,
  APPROX_TOP_COUNT(CLINICAL_SERVICE_LINE, 1)[OFFSET(0)].value as specialty
  
FROM 
  `unique-bonbon-472921-q8.Claims.rx_claims`
WHERE 
  PRESCRIBER_NPI_NBR IS NOT NULL
  AND SERVICE_DATE_DD IS NOT NULL
  AND TOTAL_PAID_AMT IS NOT NULL
  AND TOTAL_PAID_AMT > 0
GROUP BY 
  PRESCRIBER_NPI_NBR, month
HAVING 
  prescription_count >= 5  -- Min 5 prescriptions per month; prescriber-months below this are dropped, leaving gaps in the panel
ORDER BY 
  PRESCRIBER_NPI_NBR, month
"""

print("Executing monthly rx_claims query...")
print("(This may take a few minutes - processing 33 months of data)")
df_monthly = pl.from_arrow(client.query(query_monthly_rx).to_arrow())

# Calculate derived features
df_monthly = df_monthly.with_columns([
    (pl.col('total_revenue') / pl.col('prescription_count')).alias('revenue_per_rx'),
    (pl.col('brand_count') / pl.col('prescription_count')).alias('brand_rate'),
    (pl.col('commercial_count') / pl.col('prescription_count')).alias('commercial_rate'),
    (pl.col('medicare_count') / pl.col('prescription_count')).alias('medicare_rate'),
    (pl.col('medicaid_count') / pl.col('prescription_count')).alias('medicaid_rate'),
    (pl.col('supply_90day_count') / pl.col('prescription_count')).alias('supply_90day_rate')
])

print(f"✅ Loaded {len(df_monthly):,} prescriber-month observations")
print(f"   Unique prescribers: {df_monthly['PRESCRIBER_NPI_NBR'].n_unique():,}")
print(f"   Unique months: {df_monthly['month'].n_unique()}")
print(f"   Date range: {df_monthly['month'].min()} to {df_monthly['month'].max()}")

Executing monthly rx_claims query...
(This may take a few minutes - processing 33 months of data)
✅ Loaded 467,665 prescriber-month observations
   Unique prescribers: 50,056
   Unique months: 33
   Date range: 2023-01 to 2025-09
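
The 33 unique months reported above match the inclusive span from 2023-01 to 2025-09. As a quick arithmetic check, here is a minimal sketch that converts a `YYYY-MM` string into a running month number; the helper names `month_index` and `months_inclusive` are hypothetical and not part of the notebook.

```python
def month_index(ym: str) -> int:
    """Map 'YYYY-MM' to a running month number (year * 12 + zero-based month)."""
    year, month = ym.split("-")
    return int(year) * 12 + (int(month) - 1)


def months_inclusive(start: str, end: str) -> int:
    """Number of calendar months from start to end, counting both endpoints."""
    return month_index(end) - month_index(start) + 1


n_months = months_inclusive("2023-01", "2025-09")
print(n_months)
```

The same index is handy later for spotting calendar gaps, since consecutive months always differ by exactly 1.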


## Step 2: Add Monthly Medical Claims Metrics

In [12]:
query_monthly_med = """
SELECT 
  CAST(PRIMARY_HCP AS INT64) as PRESCRIBER_NPI_NBR,
  FORMAT_DATE('%Y-%m', STATEMENT_FROM_DD) as month,
  
  -- Condition complexity
  COUNT(DISTINCT condition_label) as unique_conditions,
  COUNT(*) as total_med_claims,
  
  -- Procedure involvement
  COUNTIF(PROCEDURE_CD IS NOT NULL AND PROCEDURE_CD != '') as procedure_claims,
  COUNT(DISTINCT PROCEDURE_CD) as unique_procedures,
  
  -- Financial
  SUM(CLAIM_CHARGE_AMT) as total_charges
  
FROM 
  `unique-bonbon-472921-q8.Claims.medical_claims`
WHERE 
  PRIMARY_HCP IS NOT NULL
  AND STATEMENT_FROM_DD IS NOT NULL
GROUP BY 
  PRESCRIBER_NPI_NBR, month
ORDER BY 
  PRESCRIBER_NPI_NBR, month
"""

print("Loading monthly medical claims...")
df_monthly_med = pl.from_arrow(client.query(query_monthly_med).to_arrow())

# Calculate procedure rate
df_monthly_med = df_monthly_med.with_columns([
    (pl.col('procedure_claims') / pl.col('total_med_claims')).alias('procedure_rate')
])

# Join with rx_claims monthly data
df_monthly = df_monthly.join(
    df_monthly_med,
    on=['PRESCRIBER_NPI_NBR', 'month'],
    how='left'
)

# Note: this counts every provider in medical_claims, not only those matched by the left join
print(f"✅ Added medical claims for {df_monthly_med['PRESCRIBER_NPI_NBR'].n_unique():,} prescribers")

Loading monthly medical claims...
✅ Added medical claims for 924,846 prescribers


## Step 3: Add Static Features (Time-Invariant)

In [13]:
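# Pharma payments, aggregated over the full payment history (treated as time-invariant).
# Note: with no date filter, each month's row includes payments made after that month.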
query_payments = """
SELECT 
  CAST(npi_number AS INT64) as PRESCRIBER_NPI_NBR,
  COUNT(*) as payment_count,
  SUM(total_payment_amount) as total_payments,
  COUNT(DISTINCT associated_product) as unique_products_paid
FROM 
  `unique-bonbon-472921-q8.HCP.provider_payments`
WHERE 
  npi_number IS NOT NULL
  AND total_payment_amount > 0
GROUP BY 
  PRESCRIBER_NPI_NBR
"""

print("Loading pharma payments (time-invariant)...")
df_payments = pl.from_arrow(client.query(query_payments).to_arrow())

df_payments = df_payments.with_columns([
    pl.lit(1).alias('receives_payments')
])

# Join payments (broadcast to all months)
df_monthly = df_monthly.join(
    df_payments,
    on='PRESCRIBER_NPI_NBR',
    how='left'
)

# Fill nulls for payments
df_monthly = df_monthly.with_columns([
    pl.col('receives_payments').fill_null(0),
    pl.col('total_payments').fill_null(0),
    pl.col('payment_count').fill_null(0),
    pl.col('unique_products_paid').fill_null(0)
])

# Note: df_payments.height counts all providers in the payments table, not only panel matches
print(f"✅ Added payment data for {df_payments.height:,} prescribers")

Loading pharma payments (time-invariant)...
✅ Added payment data for 618,813 prescribers


In [14]:
# Biographical data (time-invariant)
query_bio = """
SELECT 
  CAST(npi_number AS INT64) as PRESCRIBER_NPI_NBR,
  ARRAY_LENGTH(certifications) as certification_count,
  ARRAY_LENGTH(education) as education_count,
  ARRAY_LENGTH(awards) as award_count,
  ARRAY_LENGTH(memberships) as membership_count
FROM 
  `unique-bonbon-472921-q8.HCP.providers_bio`
WHERE 
  npi_number IS NOT NULL
"""

print("Loading biographical data (time-invariant)...")
df_bio = pl.from_arrow(client.query(query_bio).to_arrow())

df_bio = df_bio.with_columns([
    (pl.col('certification_count').fill_null(0) + 
     pl.col('education_count').fill_null(0) + 
     pl.col('award_count').fill_null(0) + 
     pl.col('membership_count').fill_null(0)).alias('total_credentials')
])

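# Join biographical data (broadcast to all months).
# Note: query_bio does not aggregate by NPI, so an NPI that appears more than once in
# providers_bio would duplicate that prescriber's monthly rows in this left join.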
df_monthly = df_monthly.join(
    df_bio.select(['PRESCRIBER_NPI_NBR', 'certification_count', 'total_credentials']),
    on='PRESCRIBER_NPI_NBR',
    how='left'
)

df_monthly = df_monthly.with_columns([
    pl.col('certification_count').fill_null(0),
    pl.col('total_credentials').fill_null(0)
])

# Note: df_bio.height counts rows in providers_bio, not prescribers matched in the panel
print(f"✅ Added biographical data for {df_bio.height:,} prescribers")

Loading biographical data (time-invariant)...
✅ Added biographical data for 819,282 prescribers


## Step 4: Create Lagged Features (Time-Series Features)

In [15]:
# Sort by prescriber and month
df_monthly = df_monthly.sort(['PRESCRIBER_NPI_NBR', 'month'])

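# Create lagged features. shift() moves by rows within each prescriber, so "lag 1"
# is the previous observed month, which equals the previous calendar month only
# when no months are missing for that prescriber.
df_monthly = df_monthly.with_columns([
    # Lag 1 (previous observed month)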
    pl.col('prescription_count').shift(1).over('PRESCRIBER_NPI_NBR').alias('prescription_count_lag1'),
    pl.col('total_revenue').shift(1).over('PRESCRIBER_NPI_NBR').alias('total_revenue_lag1'),
    pl.col('unique_drugs').shift(1).over('PRESCRIBER_NPI_NBR').alias('unique_drugs_lag1'),
    
    # Lag 3 (3 observed months ago)
    pl.col('prescription_count').shift(3).over('PRESCRIBER_NPI_NBR').alias('prescription_count_lag3'),
    pl.col('total_revenue').shift(3).over('PRESCRIBER_NPI_NBR').alias('total_revenue_lag3'),
    
    # Rolling averages (mean over the current and 2 previous observed months)
    pl.col('prescription_count').rolling_mean(window_size=3).over('PRESCRIBER_NPI_NBR').alias('prescription_count_ma3'),
    pl.col('total_revenue').rolling_mean(window_size=3).over('PRESCRIBER_NPI_NBR').alias('total_revenue_ma3'),
    
    # Growth relative to the previous observed month
    ((pl.col('prescription_count') - pl.col('prescription_count').shift(1).over('PRESCRIBER_NPI_NBR')) / 
     pl.col('prescription_count').shift(1).over('PRESCRIBER_NPI_NBR')).alias('prescription_growth_mom'),
    
    ((pl.col('total_revenue') - pl.col('total_revenue').shift(1).over('PRESCRIBER_NPI_NBR')) / 
     pl.col('total_revenue').shift(1).over('PRESCRIBER_NPI_NBR')).alias('revenue_growth_mom')
])

print("✅ Created lagged and rolling features")
print("   - Lag 1 and Lag 3 features")
print("   - 3-month moving averages")
print("   - Month-over-month growth rates")

✅ Created lagged and rolling features
   - Lag 1 and Lag 3 features
   - 3-month moving averages
   - Month-over-month growth rates


## Step 5: Add Time Features

In [18]:
# Parse month and create time features
df_monthly = df_monthly.with_columns([
    # Extract year and month
    pl.col('month').str.slice(0, 4).cast(pl.Int32).alias('year'),
    pl.col('month').str.slice(5, 2).cast(pl.Int32).alias('month_num'),
    
    # Global time index: 1-based dense rank of the month (1 = earliest month in the data)
    pl.col('month').rank('dense').alias('time_index')
])

# Add quarter
df_monthly = df_monthly.with_columns([
    ((pl.col('month_num') - 1) // 3 + 1).alias('quarter')
])

# Tenure: count of observed months for the prescriber so far (months missing from the panel are not counted)
df_monthly = df_monthly.with_columns([
    pl.col('month').rank('dense').over('PRESCRIBER_NPI_NBR').alias('tenure_months')
])

print("✅ Added time features:")
print("   - year, month_num, quarter")
print("   - time_index (global month counter)")
print("   - tenure_months (prescriber-specific activity duration)")

✅ Added time features:
   - year, month_num, quarter
   - time_index (global month counter)
   - tenure_months (prescriber-specific activity duration)


## Step 6: Summary and Export

In [19]:
print("\n" + "="*80)
print("TIME-SERIES DATASET SUMMARY")
print("="*80)

print(f"\n📊 Dataset Dimensions:")
print(f"   Total observations: {len(df_monthly):,}")
print(f"   Unique prescribers: {df_monthly['PRESCRIBER_NPI_NBR'].n_unique():,}")
print(f"   Unique months: {df_monthly['month'].n_unique()}")
print(f"   Date range: {df_monthly['month'].min()} to {df_monthly['month'].max()}")
print(f"   Avg months per prescriber: {len(df_monthly) / df_monthly['PRESCRIBER_NPI_NBR'].n_unique():.1f}")

print(f"\n📈 Feature Categories:")
print(f"   Time-varying features (monthly):")
print(f"     - prescription_count, unique_drugs, total_revenue, avg_claim_amount")
print(f"     - brand_rate, payer_mix (commercial/medicare/medicaid rates)")
print(f"     - supply_90day_rate, avg_days_supply")
print(f"     - unique_conditions, procedure_rate (from medical_claims)")
print(f"   Time-invariant features (static):")
print(f"     - receives_payments, total_payments, payment_count")
print(f"     - certification_count, total_credentials")
print(f"   Lagged features:")
print(f"     - prescription_count_lag1, total_revenue_lag1, etc.")
print(f"     - prescription_count_ma3, total_revenue_ma3 (3-month MA)")
print(f"   Growth rates:")
print(f"     - prescription_growth_mom, revenue_growth_mom")
print(f"   Time features:")
print(f"     - year, month_num, quarter, time_index, tenure_months")

print(f"\n🔍 Sample Data:")
print(df_monthly.select([
    'PRESCRIBER_NPI_NBR', 'month', 'prescription_count', 'total_revenue', 
    'unique_drugs', 'prescription_growth_mom', 'tenure_months'
]).head(10))

# Check for nulls
print(f"\n⚠️  Null counts (lagged features will have nulls for first observations):")
null_summary = df_monthly.null_count()
print(null_summary.select([
    'prescription_count_lag1', 'prescription_count_lag3', 
    'prescription_count_ma3', 'prescription_growth_mom'
]))


TIME-SERIES DATASET SUMMARY

📊 Dataset Dimensions:
   Total observations: 476,557
   Unique prescribers: 50,056
   Unique months: 33
   Date range: 2023-01 to 2025-09
   Avg months per prescriber: 9.5

📈 Feature Categories:
   Time-varying features (monthly):
     - prescription_count, unique_drugs, total_revenue, avg_claim_amount
     - brand_rate, payer_mix (commercial/medicare/medicaid rates)
     - supply_90day_rate, avg_days_supply
     - unique_conditions, procedure_rate (from medical_claims)
   Time-invariant features (static):
     - receives_payments, total_payments, payment_count
     - certification_count, total_credentials
   Lagged features:
     - prescription_count_lag1, total_revenue_lag1, etc.
     - prescription_count_ma3, total_revenue_ma3 (3-month MA)
   Growth rates:
     - prescription_growth_mom, revenue_growth_mom
   Time features:
     - year, month_num, quarter, time_index, tenure_months

🔍 Sample Data:
shape: (10, 7)
┌──────────────┬─────────┬──────────────┬

In [20]:
# Export to Parquet (columnar format that preserves column dtypes)
output_path = output_dir / 'timeseries_prescriber_monthly.parquet'
df_monthly.write_parquet(output_path)

# Also export to CSV for easy viewing
csv_path = output_dir / 'timeseries_prescriber_monthly.csv'
df_monthly.write_csv(csv_path)

print(f"\n✅ Exported time-series dataset:")
print(f"   Parquet: {output_path}")
print(f"   CSV: {csv_path}")
print(f"\n🎉 Time-series dataset ready for:")
print(f"   - Forecasting (ARIMA, Prophet, LSTM)")
print(f"   - Trend analysis")
print(f"   - Cohort analysis")
print(f"   - Seasonality detection")
print(f"   - Panel regression models")


✅ Exported time-series dataset:
   Parquet: outputs/timeseries_prescriber_monthly.parquet
   CSV: outputs/timeseries_prescriber_monthly.csv

🎉 Time-series dataset ready for:
   - Forecasting (ARIMA, Prophet, LSTM)
   - Trend analysis
   - Cohort analysis
   - Seasonality detection
   - Panel regression models
