# Backfill Simulation - Summary Pipeline v4.0

This notebook demonstrates the **backfill capability** of the Summary Pipeline v4.0.

## What is Backfill?
Backfill handles **late-arriving data**: new records that arrive for a historical month that has already been processed.

### Scenario
1. We have processed months 2024-01 through 2024-06
2. New data arrives for 2024-03 (with a newer `base_ts` timestamp)
3. The pipeline must:
   - Detect the newer records
   - Update the 2024-03 summary rows
   - Rebuild the rolling history arrays for all affected accounts
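
The detection step in the scenario above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual logic: the function name `find_late_arrivals` and the `processed_ts` lookup are hypothetical, and the only detail taken from the notebook is that each source record carries a `base_ts` timestamp.

```python
from datetime import datetime

def find_late_arrivals(source_rows, processed_ts):
    """Return (account, month) keys whose newest source timestamp
    is newer than the timestamp already reflected in the summary.

    source_rows: iterable of (cons_acct_key, rpt_as_of_mo, base_ts) tuples.
    processed_ts: dict mapping (cons_acct_key, rpt_as_of_mo) -> processed base_ts.
    """
    newest = {}
    for acct, month, ts in source_rows:
        key = (acct, month)
        if key not in newest or ts > newest[key]:
            newest[key] = ts
    # Only keys that were processed before and now have a newer record qualify.
    return sorted(k for k, ts in newest.items()
                  if k in processed_ts and ts > processed_ts[k])

source_rows = [
    (1, "2024-03", datetime(2024, 4, 1)),
    (1, "2024-03", datetime(2024, 7, 15)),  # late-arriving record
    (2, "2024-03", datetime(2024, 4, 1)),
]
processed_ts = {
    (1, "2024-03"): datetime(2024, 4, 1),
    (2, "2024-03"): datetime(2024, 4, 1),
}
late = find_late_arrivals(source_rows, processed_ts)
print(late)  # [(1, '2024-03')]
```

Only account 1 is flagged, because it is the only key with a source record newer than what was already processed.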

---

## Step 1: Setup Spark Session

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from datetime import datetime

# Create the Spark session. No Iceberg settings are configured here;
# they are assumed to come from the environment's Spark defaults.
spark = SparkSession.builder \
    .appName("BackfillSimulation") \
    .getOrCreate()

print(f"Spark Version: {spark.version}")
print(f"Session started at: {datetime.now()}")

## Step 2: Check Current State (BEFORE Backfill)

In [None]:
# Check table counts
print("=" * 60)
print("CURRENT TABLE COUNTS")
print("=" * 60)

accounts_count = spark.sql("SELECT COUNT(*) as cnt FROM default.default.accounts_all").collect()[0]['cnt']
summary_count = spark.sql("SELECT COUNT(*) as cnt FROM default.summary").collect()[0]['cnt']
latest_count = spark.sql("SELECT COUNT(*) as cnt FROM default.latest_summary").collect()[0]['cnt']

print(f"accounts_all:   {accounts_count:,} records")
print(f"summary:        {summary_count:,} records")
print(f"latest_summary: {latest_count:,} records")

In [None]:
# Check records by month
print("\n" + "=" * 60)
print("RECORDS BY MONTH")
print("=" * 60)

spark.sql("""
    SELECT rpt_as_of_mo as month, COUNT(*) as summary_records
    FROM default.summary
    GROUP BY rpt_as_of_mo
    ORDER BY rpt_as_of_mo
""").show()

## Step 3: Select Accounts for Backfill Simulation

We'll pick 5 accounts to demonstrate the backfill. Let's see their current state.

In [None]:
# Define accounts to backfill (accounts 1-5)
BACKFILL_ACCOUNTS = [1, 2, 3, 4, 5]
BACKFILL_MONTH = '2024-03'  # We'll backfill March 2024

print(f"Accounts selected for backfill: {BACKFILL_ACCOUNTS}")
print(f"Month to backfill: {BACKFILL_MONTH}")

In [None]:
# Store BEFORE state for comparison
before_state = spark.sql(f"""
    SELECT 
        cons_acct_key,
        rpt_as_of_mo,
        past_due_am,
        days_past_due,
        balance_am,
        payment_history_grid,
        past_due_am_history[0] as past_due_latest,
        days_past_due_history[0] as dpd_latest
    FROM default.summary
    WHERE cons_acct_key IN ({','.join(map(str, BACKFILL_ACCOUNTS))})
      AND rpt_as_of_mo = '{BACKFILL_MONTH}'
    ORDER BY cons_acct_key
""").toPandas()

print("\n" + "=" * 80)
print(f"BEFORE BACKFILL - State for {BACKFILL_MONTH}")
print("=" * 80)
before_state

In [None]:
# Also check current source data timestamps
print("\nCurrent source data timestamps:")
spark.sql(f"""
    SELECT 
        cons_acct_key,
        rpt_as_of_mo,
        past_due_am,
        days_past_due_ct_4in as days_past_due,
        base_ts
    FROM default.default.accounts_all
    WHERE cons_acct_key IN ({','.join(map(str, BACKFILL_ACCOUNTS))})
      AND rpt_as_of_mo = '{BACKFILL_MONTH}'
    ORDER BY cons_acct_key, base_ts
""").show(truncate=False)

## Step 4: Insert Late-Arriving Data

Now we'll simulate late-arriving data by inserting new records with:
- **Newer timestamp** (current time)
- **Modified values** (increased past_due and days_past_due)

In [None]:
# Define the changes we'll make
PAST_DUE_INCREASE = 10000  # Add 10,000 to past_due_am
DPD_INCREASE = 30          # Add 30 days to days_past_due

print("Changes to apply:")
print(f"  - past_due_am: +{PAST_DUE_INCREASE:,}")
print(f"  - days_past_due: +{DPD_INCREASE}")

In [None]:
# Insert late-arriving records
insert_sql = f"""
INSERT INTO default.default.accounts_all
SELECT 
    cons_acct_key,
    bureau_mbr_id,
    port_type_cd,
    acct_type_dtl_cd,
    pymt_terms_cd,
    pymt_terms_dtl_cd,
    acct_open_dt,
    acct_closed_dt,
    acct_dt,
    last_pymt_dt,
    schd_pymt_dt,
    orig_pymt_due_dt,
    write_off_dt,
    acct_stat_cd,
    acct_pymt_stat_cd,
    acct_pymt_stat_dtl_cd,
    acct_credit_ext_am,
    acct_bal_am,
    past_due_am + {PAST_DUE_INCREASE} as past_due_am,
    actual_pymt_am,
    next_schd_pymt_am,
    write_off_am,
    asset_class_cd_4in,
    days_past_due_ct_4in + {DPD_INCREASE} as days_past_due_ct_4in,
    high_credit_am_4in,
    cash_limit_am_4in,
    collateral_am_4in,
    total_write_off_am_4in,
    principal_write_off_am_4in,
    settled_am_4in,
    interest_rate_4in,
    suit_filed_wilful_def_stat_cd_4in,
    wo_settled_stat_cd_4in,
    collateral_cd,
    rpt_as_of_mo,
    current_timestamp() as base_ts
FROM default.default.accounts_all
WHERE rpt_as_of_mo = '{BACKFILL_MONTH}' 
  AND cons_acct_key IN ({','.join(map(str, BACKFILL_ACCOUNTS))})
"""

print("Inserting late-arriving records...")
spark.sql(insert_sql)
print("✅ Done!")

In [None]:
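# Verify the new records were inserted.
# NOTE: the cutoff timestamp in the CASE expression is hard-coded; set it to a time just before the insert cell ran.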
print("\nSource data after insertion (showing both old and new records):")
spark.sql(f"""
    SELECT 
        cons_acct_key,
        rpt_as_of_mo,
        past_due_am,
        days_past_due_ct_4in as days_past_due,
        base_ts,
        CASE 
            WHEN base_ts > timestamp'2026-01-21 15:00:00' THEN '← NEW (late-arriving)'
            ELSE '(original)'
        END as record_type
    FROM default.default.accounts_all
    WHERE cons_acct_key IN ({','.join(map(str, BACKFILL_ACCOUNTS))})
      AND rpt_as_of_mo = '{BACKFILL_MONTH}'
    ORDER BY cons_acct_key, base_ts
""").show(truncate=False)

## Step 5: Run Backfill Pipeline

Now we'll run the backfill to process the late-arriving data.

In [None]:
# Import and run the pipeline
import sys
import json

# Add the summary_v4 directory to path
sys.path.insert(0, '/home/iceberg/notebooks/notebooks/summary_v4')

from summary_pipeline import SummaryConfig, SummaryPipeline

# Load config
config_path = '/home/iceberg/notebooks/notebooks/summary_v4/summary_config.json'
with open(config_path) as f:
    config_dict = json.load(f)

config = SummaryConfig(config_dict)
print(f"Config loaded from: {config_path}")
print(f"Source table: {config.source_table}")
print(f"Destination table: {config.destination_table}")

In [None]:
# Create pipeline instance
pipeline = SummaryPipeline(spark, config)

print(f"\nRunning backfill for month: {BACKFILL_MONTH}")
print("=" * 60)

In [None]:
# Run the backfill
from datetime import date

# Parse month
year, month = map(int, BACKFILL_MONTH.split('-'))
start_date = date(year, month, 1)
end_date = date(year, month, 1)

print("Starting backfill processing...")
print(f"Date range: {start_date} to {end_date}")
print()

# Run backfill
pipeline.run_backfill(start_date, end_date)

print("\n" + "=" * 60)
print("✅ BACKFILL COMPLETE!")
print("=" * 60)
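
The cell above passes the same date as both `start_date` and `end_date`, which covers a single month. If a backfill had to span several months, the month boundaries could be generated with a small helper like the one below. This is a sketch only: `month_starts` is a hypothetical helper, and the notebook does not show whether `run_backfill` accepts a multi-month range.

```python
from datetime import date

def month_starts(start_month, end_month):
    """Yield the first day of each month from start_month to end_month,
    inclusive. Both arguments are 'YYYY-MM' strings."""
    y, m = map(int, start_month.split("-"))
    end_y, end_m = map(int, end_month.split("-"))
    while (y, m) <= (end_y, end_m):
        yield date(y, m, 1)
        # Roll over to January of the next year after December.
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

months = list(month_starts("2023-11", "2024-02"))
print(months[0], months[-1], len(months))  # 2023-11-01 2024-02-01 4
```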

## Step 6: Check State AFTER Backfill

In [None]:
# Get AFTER state
after_state = spark.sql(f"""
    SELECT 
        cons_acct_key,
        rpt_as_of_mo,
        past_due_am,
        days_past_due,
        balance_am,
        payment_history_grid,
        past_due_am_history[0] as past_due_latest,
        days_past_due_history[0] as dpd_latest
    FROM default.summary
    WHERE cons_acct_key IN ({','.join(map(str, BACKFILL_ACCOUNTS))})
      AND rpt_as_of_mo = '{BACKFILL_MONTH}'
    ORDER BY cons_acct_key
""").toPandas()

print("\n" + "=" * 80)
print(f"AFTER BACKFILL - State for {BACKFILL_MONTH}")
print("=" * 80)
after_state

## Step 7: Compare BEFORE vs AFTER

In [None]:
# Create comparison dataframe
import pandas as pd

comparison = pd.merge(
    before_state[['cons_acct_key', 'past_due_am', 'days_past_due', 'payment_history_grid']],
    after_state[['cons_acct_key', 'past_due_am', 'days_past_due', 'payment_history_grid']],
    on='cons_acct_key',
    suffixes=('_BEFORE', '_AFTER')
)

# Calculate differences
comparison['past_due_DIFF'] = comparison['past_due_am_AFTER'] - comparison['past_due_am_BEFORE']
comparison['dpd_DIFF'] = comparison['days_past_due_AFTER'] - comparison['days_past_due_BEFORE']

print("\n" + "=" * 100)
print("COMPARISON: BEFORE vs AFTER BACKFILL")
print("=" * 100)
print(f"\nExpected changes: past_due +{PAST_DUE_INCREASE:,}, days_past_due +{DPD_INCREASE}")
print()

comparison[['cons_acct_key', 'past_due_am_BEFORE', 'past_due_am_AFTER', 'past_due_DIFF', 
            'days_past_due_BEFORE', 'days_past_due_AFTER', 'dpd_DIFF']]

In [None]:
# Verify the changes match expected
print("\n" + "=" * 60)
print("VERIFICATION")
print("=" * 60)

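# Guard against an empty comparison: .all() on an empty Series returns True
matched = not comparison.empty
all_past_due_correct = matched and (comparison['past_due_DIFF'] == PAST_DUE_INCREASE).all()
all_dpd_correct = matched and (comparison['dpd_DIFF'] == DPD_INCREASE).all()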

print(f"\npast_due_am increased by {PAST_DUE_INCREASE:,} for all accounts: ", end="")
print("✅ YES" if all_past_due_correct else "❌ NO")

print(f"days_past_due increased by {DPD_INCREASE} for all accounts: ", end="")
print("✅ YES" if all_dpd_correct else "❌ NO")

if all_past_due_correct and all_dpd_correct:
    print("\n" + "🎉" * 20)
    print("BACKFILL SIMULATION SUCCESSFUL!")
    print("🎉" * 20)
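
One caveat with this kind of check: an "every row matches" test passes vacuously when there are no rows, for example if the BEFORE and AFTER frames share no accounts. The sketch below shows the pitfall with Python's built-in `all()` and a guarded alternative; `all_match` is a hypothetical helper, not part of the pipeline.

```python
def all_match(diffs, expected):
    """True only when there is at least one diff and every diff equals `expected`."""
    return len(diffs) > 0 and all(d == expected for d in diffs)

naive_empty = all(d == 10000 for d in [])      # vacuously True
guarded_empty = all_match([], 10000)           # False: nothing was compared
guarded_ok = all_match([10000, 10000], 10000)  # True: every diff matches
print(naive_empty, guarded_empty, guarded_ok)  # True False True
```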

## Step 8: Check Impact on Future Months

The backfill should also update the rolling history arrays for months AFTER the backfilled month.
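
To make the ripple effect concrete, here is a minimal pure-Python sketch of how a correction to one month changes the rolling arrays of that month and every later month whose window still includes it. The newest-first ordering matches how the notebook reads `past_due_am_history[0]` as the latest value. The window length, the `None` padding, and the name `rebuild_history` are assumptions for illustration, not the pipeline's implementation.

```python
def rebuild_history(monthly_values, months, length=6):
    """For each month, build a newest-first list of the last `length` values.

    monthly_values: dict mapping 'YYYY-MM' -> value for one account.
    months: chronologically sorted list of 'YYYY-MM' strings.
    Windows shorter than `length` are padded with None (an assumption).
    """
    history = {}
    for i, month in enumerate(months):
        window = months[max(0, i - length + 1): i + 1]
        newest_first = [monthly_values.get(m) for m in reversed(window)]
        history[month] = newest_first + [None] * (length - len(newest_first))
    return history

months = ["2024-01", "2024-02", "2024-03", "2024-04"]
values = {"2024-01": 0, "2024-02": 500, "2024-03": 1000, "2024-04": 1200}

before = rebuild_history(values, months, length=3)
values["2024-03"] += 10000  # late-arriving correction for March
after = rebuild_history(values, months, length=3)

changed = [m for m in months if before[m] != after[m]]
print(changed)           # ['2024-03', '2024-04']
print(after["2024-04"])  # [1200, 11000, 500]
```

Months before the corrected one are untouched, which is why only the backfilled month and later months need their arrays rebuilt.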

In [None]:
# Check how the history arrays look for later months
print("\n" + "=" * 80)
print("ROLLING HISTORY ARRAYS - Account 1")
print("=" * 80)

spark.sql("""
    SELECT 
        rpt_as_of_mo,
        payment_history_grid,
        slice(past_due_am_history, 1, 6) as past_due_6mo,
        slice(days_past_due_history, 1, 6) as dpd_6mo
    FROM default.summary
    WHERE cons_acct_key = 1
    ORDER BY rpt_as_of_mo
""").show(truncate=False)

In [None]:
# Final table counts
print("\n" + "=" * 60)
print("FINAL TABLE COUNTS")
print("=" * 60)

accounts_count = spark.sql("SELECT COUNT(*) as cnt FROM default.default.accounts_all").collect()[0]['cnt']
summary_count = spark.sql("SELECT COUNT(*) as cnt FROM default.summary").collect()[0]['cnt']
latest_count = spark.sql("SELECT COUNT(*) as cnt FROM default.latest_summary").collect()[0]['cnt']

print(f"accounts_all:   {accounts_count:,} records (includes new late-arriving records)")
print(f"summary:        {summary_count:,} records (unchanged count, but values updated)")
print(f"latest_summary: {latest_count:,} records")

---

## Summary

This notebook demonstrated:

1. **Before State**: Showed the original data for selected accounts
2. **Late-Arriving Data**: Inserted new records with newer timestamps and modified values
3. **Backfill Execution**: Ran the pipeline in backfill mode
4. **After State**: Verified the summary was updated with the new values
5. **Comparison**: Confirmed the exact changes were applied

### Key Points:
- The pipeline uses `base_ts` (timestamp) to determine which record is the "winner"
- Newer records override older records for the same account/month
- Rolling history arrays are rebuilt to reflect the updated values
- The summary table record count stays the same (update, not insert)
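
The first, second, and last points can be illustrated together with a small sketch. It assumes the summary is keyed by `(cons_acct_key, rpt_as_of_mo)`; the function `pick_winners` and the dict-based upsert are hypothetical stand-ins for whatever the pipeline actually does.

```python
from datetime import datetime

def pick_winners(rows):
    """Keep, per (cons_acct_key, rpt_as_of_mo), the row with the newest base_ts."""
    winners = {}
    for row in rows:
        key = (row["cons_acct_key"], row["rpt_as_of_mo"])
        if key not in winners or row["base_ts"] > winners[key]["base_ts"]:
            winners[key] = row
    return winners

original = {"cons_acct_key": 1, "rpt_as_of_mo": "2024-03",
            "past_due_am": 500, "base_ts": datetime(2024, 4, 1)}
late = dict(original, past_due_am=10500, base_ts=datetime(2024, 7, 15))

summary = pick_winners([original])              # state before the late record
summary.update(pick_winners([original, late]))  # upsert by key after backfill

print(len(summary), summary[(1, "2024-03")]["past_due_am"])  # 1 10500
```

Because the late record shares its key with the original, the upsert replaces the value in place and the row count stays at one.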

In [None]:
# Cleanup (optional - uncomment to run)
# spark.stop()
# print("Spark session stopped.")