# Time Travel and Snapshots

One of Iceberg's most powerful features is **time travel** - the ability to query your data as it existed at any earlier snapshot that has not yet been expired.

In this notebook, we'll explore:

* **Understanding snapshots**: What they are and how they work
* **Time travel queries**: Query data from specific points in time
* **Snapshot operations**: Inspect, compare, and manage snapshots
* **Rollback**: Undo changes by rolling back to previous snapshots
* **Branching and tagging**: Create named references to snapshots
* **Snapshot expiration**: Clean up old snapshots to save storage

By the end, you'll be able to use Iceberg's time travel features for debugging, auditing, and data recovery.

In [None]:
import daft
import pyarrow as pa
import json
from pathlib import Path
from pyiceberg.catalog.sql import SqlCatalog
from datetime import datetime, timedelta
import time

%reload_ext autoreload
%autoreload 2
from helpers import inspect_iceberg_table, compare_snapshots

## Setup: Create a Table with Rich History

Let's create a table with multiple snapshots to demonstrate time travel features.

In [None]:
# Setup warehouse
warehouse_path = Path('../data/warehouse_time_travel').absolute()
warehouse_path.mkdir(parents=True, exist_ok=True)
catalog_db = warehouse_path / 'catalog.db'
catalog_db.unlink(missing_ok=True)

catalog = SqlCatalog(
    'time_travel_demo',
    **{'uri': f'sqlite:///{catalog_db}', 'warehouse': f'file://{warehouse_path}'}
)
catalog.create_namespace('demo')
print("✅ Catalog initialized")

In [None]:
# Load events and create table with multiple operations
jsonl_file = Path('../data/input/events.jsonl')

# Snapshot 1: Initial load (Week 1)
print("Creating Snapshot 1: Initial load")
with jsonl_file.open('r') as f:
    week1 = [json.loads(line) for i, line in enumerate(f) if i < 20000]
arrow_table = pa.Table.from_pylist(week1)
events_table = catalog.create_table('demo.events', schema=pa.schema(arrow_table.schema))
events_table.append(arrow_table)
time.sleep(0.1)  # Small delay to ensure different timestamps

# Snapshot 2: Week 2 data
print("Creating Snapshot 2: Week 2 data")
with jsonl_file.open('r') as f:
    week2 = [json.loads(line) for i, line in enumerate(f) if 20000 <= i < 40000]
arrow_table = pa.Table.from_pylist(week2)
events_table.append(arrow_table)
time.sleep(0.1)

# Snapshot 3: Delete bad data (discovered data quality issue)
print("Creating Snapshot 3: Delete LocationUpdate events (data quality fix)")
events_table.delete("type = 'c8y_LocationUpdate'")
time.sleep(0.1)

# Snapshot 4: Week 3 data
print("Creating Snapshot 4: Week 3 data")
with jsonl_file.open('r') as f:
    week3 = [json.loads(line) for i, line in enumerate(f) if 40000 <= i < 60000]
arrow_table = pa.Table.from_pylist(week3)
events_table.append(arrow_table)

print(f"\n✅ Created table with {len(events_table.history())} snapshots")

## Understanding Snapshots

A **snapshot** is an immutable view of a table at a specific point in time. Think of it like a Git commit:

* Each write operation (append, delete, overwrite) creates a new snapshot
* Snapshots are never modified - every change produces a new snapshot instead
* Old snapshots remain accessible (until explicitly expired)
* Each snapshot has a unique ID and timestamp
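
To make the Git analogy concrete, here is a minimal in-memory sketch of a snapshot chain. It is a toy model, not PyIceberg's implementation: `ToySnapshot`, `ToyTable`, `commit`, and `ancestry` are hypothetical names, and real snapshots reference manifest lists rather than holding rows.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ToySnapshot:
    """Hypothetical stand-in for an Iceberg snapshot (immutable once created)."""
    snapshot_id: int
    parent_snapshot_id: Optional[int]
    operation: str
    rows: tuple  # real snapshots reference data files, not rows


@dataclass
class ToyTable:
    """Hypothetical table: a dict of snapshots plus a pointer to the current one."""
    snapshots: dict = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None

    def commit(self, operation: str, rows: tuple) -> ToySnapshot:
        # Every write adds a new snapshot; existing snapshots are left untouched.
        snap = ToySnapshot(len(self.snapshots) + 1, self.current_snapshot_id, operation, rows)
        self.snapshots[snap.snapshot_id] = snap
        self.current_snapshot_id = snap.snapshot_id
        return snap

    def ancestry(self) -> list:
        # Walk parent pointers from the current snapshot back to the first one.
        chain, snap_id = [], self.current_snapshot_id
        while snap_id is not None:
            chain.append(snap_id)
            snap_id = self.snapshots[snap_id].parent_snapshot_id
        return chain


table = ToyTable()
table.commit("append", ("a", "b"))
table.commit("append", ("a", "b", "c"))
table.commit("delete", ("a", "c"))
print(table.ancestry())         # newest first
print(table.snapshots[1].rows)  # the first snapshot is still readable
```

The delete in the third commit does not touch the first snapshot's rows, which is what keeps old versions queryable.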

### Snapshot Properties

Each snapshot contains:
* **snapshot_id**: Unique identifier (64-bit integer)
* **timestamp_ms**: When this snapshot was created (milliseconds since the Unix epoch)
* **parent_snapshot_id**: Previous snapshot (forms a chain)
* **sequence_number**: Monotonically increasing number
* **manifest_list**: Path to manifest list (index of data files)
* **summary**: Operation type and statistics

Let's inspect our snapshots:

In [None]:
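# Get all snapshots. history() returns log entries that carry only an ID and a
# timestamp; snapshots() returns full Snapshot objects, including the summary.
history = events_table.snapshots()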

print(f"Total snapshots: {len(history)}\n")
print(f"{'Snap':<5} {'Timestamp':<20} {'Operation':<10} {'Records':<12} {'Files'}")
print("-" * 70)

for i, snapshot in enumerate(history, 1):
    timestamp = datetime.fromtimestamp(snapshot.timestamp_ms / 1000).strftime('%Y-%m-%d %H:%M:%S')
    operation = snapshot.summary.operation.value if snapshot.summary else 'N/A'
    
    # Get statistics from summary
    props = snapshot.summary.additional_properties if snapshot.summary else {}
    total_records = props.get('total-records', props.get('added-records', 'N/A'))
    total_files = props.get('total-data-files', props.get('added-data-files', 'N/A'))
    
    print(f"{i:<5} {timestamp:<20} {operation:<10} {str(total_records):<12} {total_files}")

In [None]:
# Inspect a specific snapshot in detail
# snapshots() returns full Snapshot objects (summary, parent, sequence number)
snap1 = events_table.snapshots()[0]
print("Snapshot 1 Details:\n")
print(f"ID: {snap1.snapshot_id}")
print(f"Timestamp: {datetime.fromtimestamp(snap1.timestamp_ms / 1000)}")
print(f"Sequence number: {getattr(snap1, 'sequence_number', 'N/A')}")
# The first snapshot has no parent, so its parent ID is None
parent_id = getattr(snap1, 'parent_snapshot_id', None)
print(f"Parent snapshot: {parent_id if parent_id is not None else 'None (first snapshot)'}")

if snap1.summary:
    print(f"\nSummary:")
    print(f"  Operation: {snap1.summary.operation.value}")
    if snap1.summary.additional_properties:
        for key, value in sorted(snap1.summary.additional_properties.items()):
            print(f"  {key}: {value}")

## Time Travel Queries

Iceberg supports two ways to query historical data:

1. **By snapshot ID**: Query a specific snapshot
2. **By timestamp**: Query data as of a specific time

### Query by Snapshot ID

This is the most precise method - you specify exactly which snapshot to query.

In [None]:
# Query current state
df_current = daft.read_iceberg(events_table)
print("Current state:")
daft.sql("SELECT COUNT(*) as total_events FROM df_current").show()

# Query Snapshot 1 (first load)
snapshot1_id = history[0].snapshot_id
df_snap1 = daft.read_iceberg(events_table, snapshot_id=snapshot1_id)
print(f"\nSnapshot 1 (ID: {snapshot1_id}):")
daft.sql("SELECT COUNT(*) as total_events FROM df_snap1").show()

# Query Snapshot 2 (after week 2)
snapshot2_id = history[1].snapshot_id
df_snap2 = daft.read_iceberg(events_table, snapshot_id=snapshot2_id)
print(f"\nSnapshot 2 (ID: {snapshot2_id}):")
daft.sql("SELECT COUNT(*) as total_events FROM df_snap2").show()

### Query by Timestamp

This finds the snapshot that was current at the specified time. Useful for questions like:
* "Show me the data as of yesterday at 3pm"
* "What did the monthly report see on March 1st?"
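
Resolving a timestamp to a snapshot amounts to finding the last snapshot committed at or before that time. The sketch below shows the lookup on a made-up snapshot log; `snapshot_log` and `snapshot_as_of` are hypothetical names for illustration, not part of PyIceberg or Daft.

```python
from bisect import bisect_right

# Hypothetical snapshot log: (timestamp_ms, snapshot_id), sorted by commit time.
snapshot_log = [(1_000, 101), (2_000, 102), (3_000, 103)]


def snapshot_as_of(log, timestamp_ms):
    """Return the ID of the snapshot that was current at timestamp_ms, or None."""
    timestamps = [ts for ts, _ in log]
    # Index of the last entry committed at or before timestamp_ms.
    idx = bisect_right(timestamps, timestamp_ms) - 1
    return log[idx][1] if idx >= 0 else None


print(snapshot_as_of(snapshot_log, 1_500))  # between the first and second commit
print(snapshot_as_of(snapshot_log, 2_000))  # exactly at the second commit
print(snapshot_as_of(snapshot_log, 500))    # before the table had any snapshot
```

A timestamp earlier than the first commit has no matching snapshot, so callers need to handle the `None` case.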

In [None]:
# Get timestamp between snapshot 1 and 2
snap1_time = history[0].timestamp_ms
snap2_time = history[1].timestamp_ms
mid_time = snap1_time + (snap2_time - snap1_time) // 2

print(f"Querying as of: {datetime.fromtimestamp(mid_time / 1000)}")
print(f"(Between snapshot 1 and 2)\n")

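# Query as of that timestamp.
# daft.read_iceberg selects a table version by snapshot ID, so resolve the
# timestamp first: take the latest history entry committed at or before mid_time.
as_of_entry = max(
    (entry for entry in events_table.history() if entry.timestamp_ms <= mid_time),
    key=lambda entry: entry.timestamp_ms,
)
df_as_of = daft.read_iceberg(events_table, snapshot_id=as_of_entry.snapshot_id)
daft.sql("SELECT COUNT(*) as total_events FROM df_as_of").show()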

print("\nThis returns Snapshot 1 data, because that was current at the specified time.")

### Time Travel Use Cases

Time travel is useful for:

1. **Debugging**: "What data did the broken job see?"
2. **Auditing**: "Show me all changes in the last 24 hours"
3. **Reproducing reports**: "Re-run the quarterly report with the exact data it used"
4. **Data recovery**: "Restore deleted records from yesterday"
5. **Testing**: "Compare results before and after a schema change"

Let's demonstrate a debugging scenario:

In [None]:
# Scenario: We deleted LocationUpdate events in Snapshot 3
# Let's verify they existed before and are gone now

# Before deletion (Snapshot 2)
snap2_id = history[1].snapshot_id
df_before = daft.read_iceberg(events_table, snapshot_id=snap2_id)
print("Before deletion (Snapshot 2):")
daft.sql("""
    SELECT
        SUM(CASE WHEN type = 'c8y_LocationUpdate' THEN 1 ELSE 0 END) as location_updates,
        COUNT(*) as total
    FROM df_before
""").show()

# After deletion (Snapshot 3)
snap3_id = history[2].snapshot_id
df_after = daft.read_iceberg(events_table, snapshot_id=snap3_id)
print("\nAfter deletion (Snapshot 3):")
daft.sql("""
    SELECT
        SUM(CASE WHEN type = 'c8y_LocationUpdate' THEN 1 ELSE 0 END) as location_updates,
        COUNT(*) as total
    FROM df_after
""").show()

print("\n✅ Time travel confirmed the deletion happened in Snapshot 3")

## Rollback: Undoing Changes

Rollback reverts the table to a previous snapshot. This is like `git reset` - it makes a previous snapshot current.

**Important**: Rollback doesn't delete snapshots or data. It just changes which snapshot is "current".
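
The following toy sketch illustrates why rollback is only a pointer change. The `metadata` dictionary and `rollback_to` function are hypothetical and greatly simplified compared with real Iceberg table metadata.

```python
# Hypothetical table metadata: snapshot IDs mapped to the files they reference.
metadata = {
    "snapshots": {1: ["f1.parquet"], 2: ["f1.parquet", "f2.parquet"], 3: ["f2.parquet"]},
    "current_snapshot_id": 3,
}


def rollback_to(meta: dict, snapshot_id: int) -> dict:
    """Return new metadata whose current pointer targets an existing snapshot."""
    if snapshot_id not in meta["snapshots"]:
        raise ValueError(f"unknown snapshot {snapshot_id}")
    # Only the pointer changes; the snapshot map is carried over unchanged.
    return {**meta, "current_snapshot_id": snapshot_id}


rolled_back = rollback_to(metadata, 2)
print(rolled_back["current_snapshot_id"])   # 2
print(sorted(rolled_back["snapshots"]))     # [1, 2, 3]

# Rolling "forward" again is the same operation with a newer snapshot ID.
restored = rollback_to(rolled_back, 3)
print(restored["current_snapshot_id"])      # 3
```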

### When to Use Rollback

* Bad data was loaded
* A bug in the ingestion pipeline
* Accidental deletion
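* Testing: roll back after a test run, then roll forward again to restore the latest state

Let's roll back to before the deletion. Because we roll back to Snapshot 2, the Week 3 append from Snapshot 4 also disappears from the current table state: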

In [None]:
# Current state (after deletion)
df = daft.read_iceberg(events_table)
print("Current state (Snapshot 4 - after deletion):")
result = daft.sql("""
    SELECT
        SUM(CASE WHEN type = 'c8y_LocationUpdate' THEN 1 ELSE 0 END) as location_updates,
        COUNT(*) as total
    FROM df
""").collect()
print(result)

# Rollback to Snapshot 2 (before deletion)
snap2_id = history[1].snapshot_id
print(f"\nRolling back to Snapshot 2 (ID: {snap2_id})...")
# commit() updates events_table in place and returns None, so don't reassign the result
events_table.manage_snapshots().rollback_to_snapshot(snap2_id).commit()

# Verify
df = daft.read_iceberg(events_table)
print("\nAfter rollback:")
result = daft.sql("""
    SELECT
        SUM(CASE WHEN type = 'c8y_LocationUpdate' THEN 1 ELSE 0 END) as location_updates,
        COUNT(*) as total
    FROM df
""").collect()
print(result)

print("\n✅ Rollback successful! LocationUpdate events are back.")

In [None]:
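# Check snapshots after rollback. snapshots() lists every snapshot once, while
# history() is the log of which snapshot was current and when.
history = events_table.snapshots()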
current_snap_id = events_table.current_snapshot().snapshot_id

print("Snapshot history after rollback:\n")
for i, snap in enumerate(history, 1):
    is_current = snap.snapshot_id == current_snap_id
    marker = " ← CURRENT" if is_current else ""
    print(f"Snapshot {i}: {snap.snapshot_id}{marker}")

print("\nNote: All snapshots still exist! We just changed which one is 'current'.")

## Branching and Tagging

Iceberg supports **named references** to snapshots:

* **Tags**: Immutable pointers to snapshots (like Git tags)
  - Use for: releases, quarterly reports, milestones
  - Example: `quarterly-report-2024-Q4`

* **Branches**: Mutable pointers that can advance (like Git branches)
  - Use for: experimental changes, staging environments
  - Example: `experiment-new-schema`
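
As a rough mental model, both kinds of reference are just names mapped to snapshot IDs, and only branches move when new commits arrive. The sketch below uses hypothetical names (`ToyRef`, `refs`, `advance`) and is not how PyIceberg stores references.

```python
from dataclasses import dataclass


@dataclass
class ToyRef:
    """Hypothetical named reference to a snapshot."""
    snapshot_id: int
    kind: str  # "tag" or "branch"


refs = {
    "main": ToyRef(snapshot_id=3, kind="branch"),
    "quarterly-report-2024-Q4": ToyRef(snapshot_id=2, kind="tag"),
}


def advance(refs: dict, name: str, new_snapshot_id: int) -> None:
    """Move a branch to a new snapshot; tags are fixed and refuse to move."""
    ref = refs[name]
    if ref.kind != "branch":
        raise ValueError(f"{name!r} is a tag and cannot be advanced")
    ref.snapshot_id = new_snapshot_id


advance(refs, "main", 4)  # a commit on main moves the branch pointer
print(refs["main"].snapshot_id)

try:
    advance(refs, "quarterly-report-2024-Q4", 4)
except ValueError as err:
    print(err)
```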

### Creating Tags

Tags are useful for marking important snapshots that you want to keep forever.

In [None]:
# Create tags for important snapshots
snap1_id = history[0].snapshot_id
snap2_id = history[1].snapshot_id

# Tag the initial load
# (create_tag takes the snapshot ID first; commit() updates events_table in place
# and returns None, so its result is not assigned back)
events_table.manage_snapshots().create_tag(snapshot_id=snap1_id, tag_name='initial-load').commit()
print(f"✅ Created tag 'initial-load' → Snapshot {snap1_id}")

# Tag week 2
events_table.manage_snapshots().create_tag(snapshot_id=snap2_id, tag_name='week-2-complete').commit()
print(f"✅ Created tag 'week-2-complete' → Snapshot {snap2_id}")

In [None]:
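# Query by tag name: resolve the tag to its snapshot ID first,
# because daft.read_iceberg expects an integer snapshot_id
initial_load_id = events_table.snapshot_by_name('initial-load').snapshot_id
df_initial = daft.read_iceberg(events_table, snapshot_id=initial_load_id)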
print("Querying 'initial-load' tag:")
daft.sql("SELECT COUNT(*) as total FROM df_initial").show()

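# Query another tag (resolved to its integer snapshot ID in the same way)
week2_id = events_table.snapshot_by_name('week-2-complete').snapshot_id
df_week2 = daft.read_iceberg(events_table, snapshot_id=week2_id)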
print("\nQuerying 'week-2-complete' tag:")
daft.sql("SELECT COUNT(*) as total FROM df_week2").show()

### Creating Branches

Branches allow you to make experimental changes without affecting the main timeline.

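Note: Branch support in PyIceberg depends on the installed version. In production with Spark and Iceberg's SQL extensions, you can:
* Create branches: `ALTER TABLE db.events CREATE BRANCH experiment AS OF VERSION 123`
* Write to branches: `INSERT INTO db.events.branch_experiment ...`
* Fast-forward `main` to a branch: `CALL catalog.system.fast_forward('db.events', 'main', 'experiment')`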

For now, we'll demonstrate the concept:

In [None]:
# Demonstrate the branch concept
print("Branch concept (Spark SQL with Iceberg extensions; catalog and db.events are placeholders):\n")
print("1. Create branch 'experiment' from current snapshot")
print("   ALTER TABLE db.events CREATE BRANCH experiment")
print()
print("2. Write to the branch")
print("   INSERT INTO db.events.branch_experiment SELECT ...")
print()
print("3. Query the branch")
print("   SELECT * FROM db.events.branch_experiment")
print()
print("4. Main table is unaffected")
print("   SELECT * FROM db.events  -- sees main branch")
print()
print("5. Fast-forward main to the branch, or drop the branch")
print("   CALL catalog.system.fast_forward('db.events', 'main', 'experiment')")
print("   ALTER TABLE db.events DROP BRANCH experiment")

## Snapshot Expiration

Iceberg never removes snapshots on its own; they accumulate until you run an expiration. This enables unlimited time travel, but:

* **Storage costs**: Old data files accumulate
* **Metadata overhead**: Manifest lists grow
* **Cleanup complexity**: Hard to know what's still needed

**Snapshot expiration** removes old snapshots and their unreferenced data files.

### Expiration Strategies

1. **Time-based**: Expire snapshots older than N days
2. **Count-based**: Keep only the last N snapshots
3. **Tagged snapshots**: Keep any snapshot that a tag still references
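
The three strategies can be combined into one selection rule. The sketch below is a simplified illustration with hypothetical names (`select_expired`, `protected_ids`); real engines apply further rules, such as per-branch retention, that are omitted here.

```python
def select_expired(snapshots, older_than_ms, retain_last, protected_ids):
    """Return IDs to expire from (snapshot_id, timestamp_ms) pairs sorted oldest first."""
    # Count-based rule: the newest `retain_last` snapshots are always kept.
    keep = {snap_id for snap_id, _ in snapshots[-retain_last:]} if retain_last > 0 else set()
    # Tag rule: snapshots referenced by a tag are always kept.
    keep |= set(protected_ids)
    # Time-based rule: only snapshots older than the cutoff are candidates.
    return [snap_id for snap_id, ts in snapshots if ts < older_than_ms and snap_id not in keep]


snapshots = [(1, 1_000), (2, 2_000), (3, 3_000), (4, 4_000)]
expired = select_expired(snapshots, older_than_ms=3_500, retain_last=1, protected_ids={1})
print(expired)
```

Here snapshot 1 survives because it is tagged and snapshot 4 survives because it is the most recent, so only snapshots 2 and 3 are selected.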

### Safe Expiration

Iceberg expiration is safe:
* Only removes data files that no retained snapshot references
* Respects the `history.expire.min-snapshots-to-keep` table property
* Honors the `history.expire.max-snapshot-age-ms` retention setting
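
The "unreferenced files" rule is a set difference: a file may be deleted only if every snapshot that references it is being expired. A minimal sketch, assuming a hypothetical `files_by_snapshot` mapping and a hypothetical `deletable_files` helper:

```python
# Hypothetical mapping of snapshot ID -> data files that snapshot references.
files_by_snapshot = {
    1: {"a.parquet"},
    2: {"a.parquet", "b.parquet"},
    3: {"b.parquet", "c.parquet"},
}


def deletable_files(files_by_snapshot, expired_ids):
    """Files referenced only by expired snapshots, i.e. safe to delete."""
    expired = set(expired_ids)
    # Everything a retained snapshot still points at must stay.
    still_referenced = set().union(
        *(files for snap_id, files in files_by_snapshot.items() if snap_id not in expired)
    )
    candidates = set().union(*(files_by_snapshot[snap_id] for snap_id in expired))
    return candidates - still_referenced


print(sorted(deletable_files(files_by_snapshot, expired_ids=[1, 2])))
```

Expiring snapshots 1 and 2 frees only `a.parquet`, because `b.parquet` is still referenced by snapshot 3.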

Let's demonstrate expiration:

In [None]:
# Check current snapshots
print("Snapshots before expiration:")
history = events_table.snapshots()  # full list of snapshots, one entry each
for i, snap in enumerate(history, 1):
    timestamp = datetime.fromtimestamp(snap.timestamp_ms / 1000).strftime('%Y-%m-%d %H:%M:%S')
    print(f"  Snapshot {i}: {snap.snapshot_id} @ {timestamp}")

print(f"\nTotal: {len(history)} snapshots")

In [None]:
# Expire old snapshots (keep last 2, plus tagged)
# Note: This is a destructive operation in production!
# For this demo, we'll show the concept without actually expiring

print("Expiration command (Spark SQL procedure, not executed in demo):")
print()
print("  CALL catalog.system.expire_snapshots(")
print("      table => 'demo.events',")
print("      older_than => TIMESTAMP '2024-01-01 00:00:00.000',  -- cutoff, e.g. 7 days ago")
print("      retain_last => 2  -- but keep at least 2 snapshots")
print("  )")
print()
print("This would:")
print("  • Remove snapshots older than the cutoff")
print("  • But always keep the last 2 snapshots")
print("  • Keep snapshots that tags still reference ('initial-load', 'week-2-complete')")
print("  • Delete data files only referenced by expired snapshots")
print()
print("⚠️  In production, test expiration carefully!")
print("    Expired snapshots can no longer be queried or rolled back to.")

## Comparing Snapshots

Let's use our helper function to compare two snapshots:

In [None]:
# Compare Snapshot 1 and Snapshot 2
snap1_id = history[0].snapshot_id
snap2_id = history[1].snapshot_id

compare_snapshots(events_table, snap1_id, snap2_id)

## Review Questions

Test your understanding:

1. **How is Iceberg time travel different from backup files?**
   - Think about query capabilities, storage efficiency, and metadata.

2. **What happens to data files when you roll back?**
   - Are files deleted? What changes in the metadata?

3. **Why might you want both tags and branches?**
   - When would you use each?

4. **Is rollback destructive?**
   - Can you undo a rollback?

5. **What's the difference between querying by snapshot ID vs. timestamp?**
   - Which is more precise? Which is more user-friendly?

6. **How does snapshot expiration decide what to delete?**
   - What are the rules for safe deletion?

## Hands-on Challenge

### Challenge 1: Simulate a Bad Load and Rollback

1. Append some test data to the events table
2. Verify the data is there
3. "Discover" it's bad data (simulate)
4. Roll back to before the bad load
5. Verify the data is gone

### Challenge 2: Create Monthly Tags

1. Look up the timestamp of each snapshot
2. Create tags like `month-2024-01`, `month-2024-02` for the matching snapshots
3. Query by tag to verify

### Challenge 3: Audit Trail

1. Create a report showing all changes to the table
2. For each snapshot: timestamp, operation, records added/deleted
3. Calculate: net change in records from start to now

Use the cells below:

In [None]:
# Challenge 1: Your code here


In [None]:
# Challenge 2: Your code here


In [None]:
# Challenge 3: Your code here


## Summary

In this notebook, we explored Iceberg's time travel capabilities:

* **Snapshots**: Immutable views of the table at specific points in time
  - Created by every write operation
  - Form a chain with parent pointers
  - Contain operation summaries and statistics

* **Time Travel**: Query historical data
  - By snapshot ID (precise)
  - By timestamp (user-friendly)
  - Use cases: debugging, auditing, reproducing reports

* **Rollback**: Undo changes
  - Non-destructive (snapshots remain)
  - Changes which snapshot is "current"
  - Can be undone by rolling back again

* **Tags and Branches**: Named references
  - Tags: Immutable, for milestones
  - Branches: Mutable, for experiments
  - Enable human-readable snapshot references

* **Snapshot Expiration**: Clean up old data
  - Time-based or count-based policies
  - Safe: only removes unreferenced files
  - Respects tags and retention settings

### Key Takeaways

1. **Time travel is cheap**: Reads resolve through metadata, with no data copying
2. **Rollback is safe**: You can roll forward again as long as the newer snapshots have not been expired
3. **Tags preserve history**: Tagged snapshots are kept for as long as the tag exists
4. **Expiration is necessary**: Balance retention vs. storage costs
5. **Audit trail is automatic**: Every change is recorded

### What's Next?

In the next notebooks:
* **Schema evolution**: Change schema without rewriting data
* **Concurrency**: Optimistic locking and conflict resolution
* **Partitioning**: Scale to millions of files
* **Object stores**: Iceberg on S3 with efficient metadata