# Schema Evolution Without Rewrites

In many traditional databases and Hive-style data lakes, changing a schema can mean rewriting existing data. With billions of rows, that is expensive and slow.

Iceberg solves this by storing **schema in metadata**, not in data files. You can:

* Add columns without rewriting data
* Remove columns without data loss (for time travel)
* Rename columns safely
* Promote types (int32 → int64)
* Reorder columns

All of these are metadata-only operations, so they complete almost **instantly**, regardless of table size.

In this notebook, we'll cover:

* Adding and removing columns
* Renaming columns
* Type promotion
* Schema versioning
* Handling schema drift

## Why Schema Evolution Matters

Imagine you have a table with 10 billion rows (10 TB). Without schema evolution:
* Adding a column: Rewrite all 10 TB (hours/days)
* Renaming a column: Rewrite all 10 TB
* Cost: Compute + storage + time

With Iceberg:
* Any schema change: Update metadata JSON (milliseconds)
* Cost: Nearly zero
* Data files: Untouched
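
To put rough numbers on the comparison above, here is a back-of-the-envelope estimate. The 500 MB/s rewrite throughput and the 10 ms metadata commit are illustrative assumptions, not measurements. Real figures depend on the cluster and the catalog.

```python
# Back-of-the-envelope comparison; all rates are assumed, not measured.
TABLE_BYTES = 10 * 10**12             # 10 TB, as in the example above
REWRITE_BYTES_PER_SEC = 500 * 10**6   # assumed aggregate rewrite throughput (500 MB/s)
METADATA_COMMIT_SEC = 0.010           # assumed duration of one metadata-only commit

rewrite_seconds = TABLE_BYTES / REWRITE_BYTES_PER_SEC
rewrite_hours = rewrite_seconds / 3600
speedup = rewrite_seconds / METADATA_COMMIT_SEC

print(f"Full rewrite:    {rewrite_hours:.1f} hours")
print(f"Metadata commit: {METADATA_COMMIT_SEC * 1000:.0f} ms")
print(f"Ratio:           about {speedup:,.0f}x")
```

Under these assumptions the rewrite takes hours while the metadata commit stays in the millisecond range, which is the gap the rest of this notebook relies on.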

In [None]:
import daft
import pyarrow as pa
from pathlib import Path
from pyiceberg.catalog.sql import SqlCatalog
from datetime import datetime

%reload_ext autoreload
%autoreload 2
from helpers import inspect_iceberg_table

In [None]:
# Setup
warehouse_path = Path('../data/warehouse_schema_evolution').absolute()
warehouse_path.mkdir(parents=True, exist_ok=True)
catalog_db = warehouse_path / 'catalog.db'
catalog_db.unlink(missing_ok=True)

catalog = SqlCatalog('schema_demo', **{'uri': f'sqlite:///{catalog_db}', 'warehouse': f'file://{warehouse_path}'})
catalog.create_namespace('demo')

# Create table with initial schema
df_events = daft.read_json('../data/input/events.jsonl')
df_sample = df_events.limit(10000)

arrow_table = df_sample.to_arrow()
events_table = catalog.create_table('demo.events', schema=pa.schema(arrow_table.schema))
events_table.append(arrow_table)

print(f"✅ Created table with {len(arrow_table):,} records")
print(f"\nInitial schema:")
for field in events_table.schema().fields:
    print(f"  • {field.name}: {field.field_type}")

## Adding Columns

When you add a column:
* New metadata version is created
* New writes include the column
* Old data files: Column reads as NULL
* No data rewrite required!
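
The "old data reads as NULL" behavior can be pictured with a toy model. This is a minimal sketch, not PyIceberg's actual reader: a data file is modeled as rows keyed by field ID, a schema as a mapping from field ID to column name, and `read_rows` is a hypothetical helper.

```python
# Toy model of ID-based column resolution (not PyIceberg's actual reader).
# A "data file" stores values keyed by field ID; a schema maps ID -> name.

def read_rows(file_rows, schema):
    """Project rows from a data file onto `schema`, filling missing IDs with None."""
    return [
        {name: row.get(field_id) for field_id, name in schema.items()}
        for row in file_rows
    ]

schema_v0 = {1: "id", 2: "type"}
schema_v1 = {1: "id", 2: "type", 3: "processed_at"}  # column added with new ID 3

old_file = [{1: "a1", 2: "click"}]                            # written under schema_v0
new_file = [{1: "b7", 2: "view", 3: "2024-01-01T00:00:00"}]   # written under schema_v1

# Both files are read through the current schema; the old file is never rewritten.
rows = read_rows(old_file, schema_v1) + read_rows(new_file, schema_v1)
for row in rows:
    print(row)
```

The old file has no value stored under ID 3, so the projection yields `None` for `processed_at` without touching the file.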

In [None]:
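# Add a new column.
# PyIceberg's add_column expects an Iceberg type rather than a PyArrow type.
# TimestampType is a microsecond-precision timestamp without time zone.
from pyiceberg.types import TimestampType

with events_table.update_schema() as update:
    update.add_column('processed_at', TimestampType(), doc='When this event was processed')

print("✅ Added column 'processed_at'")
print("\nNew schema:")
for field in events_table.schema().fields:
    print(f"  • {field.name}: {field.field_type}")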

In [None]:
# Query: Old data shows NULL for new column
df = daft.read_iceberg(events_table)
print("Old data (processed_at is NULL):")
df.select('id', 'type', 'processed_at').show(5)

# Add new data with the column
df_new = df_events.offset(10000).limit(1000)

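# Add a processed_at timestamp to each record.
# Use datetime objects rather than ISO strings so PyArrow infers a timestamp
# column that matches the table schema instead of a string column.
new_data = df_new.to_arrow().to_pylist()
processed_at = datetime.now()
for record in new_data:
    record['processed_at'] = processed_at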

new_arrow = pa.Table.from_pylist(new_data)
events_table.append(new_arrow)

print(f"\n✅ Appended {len(new_arrow):,} records with processed_at")

# Query again
df = daft.read_iceberg(events_table)
print("\nNew data (processed_at has values):")
daft.sql("""
    SELECT id, type, processed_at
    FROM df
    WHERE processed_at IS NOT NULL
    LIMIT 5
""").show()

## Removing Columns

When you remove a column:
* A new schema version is written that omits the column (earlier schema versions still list it)
* New queries don't see it
* Data files still contain it (for time travel)
* Column ID preserved (for historical queries)
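
One consequence of preserving column IDs is that a dropped column's ID is not handed out again, so re-adding a column with the same name does not resurrect old values. The sketch below is a toy model of that allocation rule. `ToySchema` is hypothetical, and its `last_column_id` counter only mirrors the idea behind the `last-column-id` field in Iceberg's table metadata.

```python
# Toy schema tracker: field IDs come from a counter that only grows.
class ToySchema:
    def __init__(self):
        self.fields = {}          # field ID -> column name (current schema only)
        self.last_column_id = 0   # highest ID ever assigned, never decremented

    def add_column(self, name):
        self.last_column_id += 1
        self.fields[self.last_column_id] = name
        return self.last_column_id

    def delete_column(self, name):
        field_id = next(fid for fid, n in self.fields.items() if n == name)
        del self.fields[field_id]  # only the schema changes; data files are untouched

schema = ToySchema()
schema.add_column("id")
old_id = schema.add_column("processed_at")
schema.delete_column("processed_at")
new_id = schema.add_column("processed_at")  # same name, but a different column

print(f"first processed_at ID: {old_id}, second processed_at ID: {new_id}")
print(schema.fields)
```

Because the two IDs differ, values written under the first `processed_at` stay invisible to the second one, even though the names match.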

In [None]:
# Remove a column
with events_table.update_schema() as update:
    update.delete_column('processed_at')

print("✅ Removed column 'processed_at'")
print("\nCurrent schema:")
for field in events_table.schema().fields:
    print(f"  • {field.name}: {field.field_type}")

# Try to query - column is gone
df = daft.read_iceberg(events_table)
print("\nAvailable columns:")
print([f.name for f in df.schema()])

### Time Travel with Old Schemas

Even though we removed the column, we can still see it in old snapshots:

In [None]:
# Query an old snapshot where the column existed
history = events_table.history()
old_snapshot = history[1].snapshot_id  # After we added the column

df_old = daft.read_iceberg(events_table, snapshot_id=old_snapshot)
print(f"Snapshot {old_snapshot} schema:")
print([f.name for f in df_old.schema()])
print("\nThe 'processed_at' column is visible in the old snapshot!")
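
Why can an old snapshot still show the column? Each snapshot records the ID of the schema that was current when it was written, and the table metadata keeps every schema version. The sketch below uses a simplified stand-in for the metadata file. The key names follow the Iceberg table spec, but the values and the flattened `fields` lists are made up for illustration, and `fields_for_snapshot` is a hypothetical helper.

```python
# Simplified stand-in for a table metadata file (values are illustrative).
metadata = {
    "current-schema-id": 0,  # after the drop, the table points at a schema without processed_at
    "schemas": [
        {"schema-id": 0, "fields": ["id", "type", "source"]},
        {"schema-id": 1, "fields": ["id", "type", "source", "processed_at"]},
    ],
    "snapshots": [
        {"snapshot-id": 101, "schema-id": 0},  # initial append
        {"snapshot-id": 102, "schema-id": 1},  # append made after the column was added
    ],
}

def fields_for_snapshot(meta, snapshot_id):
    """Return the column names of the schema recorded on the given snapshot."""
    snap = next(s for s in meta["snapshots"] if s["snapshot-id"] == snapshot_id)
    schema = next(s for s in meta["schemas"] if s["schema-id"] == snap["schema-id"])
    return schema["fields"]

print(fields_for_snapshot(metadata, 101))
print(fields_for_snapshot(metadata, 102))
```

Whether a time-travel read uses the snapshot's schema or the current one depends on the engine, so it is worth checking the output of the cell above rather than assuming it.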

## Renaming Columns

Iceberg tracks columns by **column ID** internally, not by name. This makes renaming safe:
* Old data files written under the old name: Still readable (matched by column ID)
* Queries against the current schema: Use the new name (same column ID); the old name no longer resolves
* No data rewrite needed
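
A toy sketch of what ID-based resolution does and does not guarantee after a rename. Names resolve against the schema in use, while stored values are fetched by field ID. The `select` helper and the sample values are hypothetical.

```python
# Toy lookup: names resolve through a schema, stored values are fetched by field ID.
schema_before = {"id": 1, "type": 2, "source": 3}      # name -> field ID
schema_after = {"id": 1, "type": 2, "device_id": 3}    # same ID 3, new name

data_file = {1: "a1", 2: "click", 3: "sensor-42"}      # field ID -> stored value

def select(column, schema, file_row):
    """Resolve a column name to its field ID, then fetch the value by ID."""
    if column not in schema:
        raise KeyError(f"column {column!r} not in schema")
    return file_row[schema[column]]

# The file was written when the column was called 'source', yet the new name finds it.
print(select("device_id", schema_after, data_file))

# The old name is no longer part of the current schema.
try:
    select("source", schema_after, data_file)
except KeyError as err:
    print("old name rejected:", err)
```

The data file never changes; only the name-to-ID mapping does. Code that still refers to `source` has to be updated, or has to read an older snapshot whose schema still carries that name.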

In [None]:
# Rename a column
with events_table.update_schema() as update:
    update.rename_column('source', 'device_id')

print("✅ Renamed 'source' → 'device_id'")
print("\nNew schema:")
for field in events_table.schema().fields:
    print(f"  • {field.name} (ID: {field.field_id}): {field.field_type}")

# Query with new name
df = daft.read_iceberg(events_table)
daft.sql("SELECT device_id, type FROM df LIMIT 3").show()

## Type Promotion

Iceberg supports **safe type promotions**:
* `int32` → `int64` ✅
* `float` → `double` ✅
* `decimal(10,2)` → `decimal(18,2)` ✅

Unsafe promotions are rejected:
* `int64` → `int32` ❌ (data loss)
* `string` → `int64` ❌ (incompatible)
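
The rules above can be written down as a small predicate. This is a sketch of the widening rules as listed here, not PyIceberg's own validation, and newer spec versions may allow more. It uses Iceberg's type names, where `int` is 32-bit and `long` is 64-bit; `is_safe_promotion` is a hypothetical helper.

```python
import re

_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")

def is_safe_promotion(src: str, dst: str) -> bool:
    """Return True if `src` can be widened to `dst` without losing information."""
    if (src, dst) in {("int", "long"), ("float", "double")}:
        return True
    s, d = _DECIMAL.fullmatch(src), _DECIMAL.fullmatch(dst)
    if s and d:
        s_prec, s_scale = (int(g) for g in s.groups())
        d_prec, d_scale = (int(g) for g in d.groups())
        # Precision may grow, but the scale has to stay the same.
        return d_prec > s_prec and d_scale == s_scale
    return False

for src, dst in [
    ("int", "long"),
    ("float", "double"),
    ("decimal(10,2)", "decimal(18,2)"),
    ("long", "int"),
    ("string", "long"),
    ("decimal(10,2)", "decimal(18,4)"),
]:
    print(f"{src} -> {dst}: {is_safe_promotion(src, dst)}")
```

The decimal case shows why "wider" alone is not enough: changing the scale would change how stored values are interpreted, so only the precision is allowed to grow.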

Let's create a table to demonstrate type promotion:

In [None]:
# Create a table with int32
sensor_schema = pa.schema([
    pa.field('id', pa.int32()),
    pa.field('value', pa.float32())
])

sensors_table = catalog.create_table('demo.sensors', schema=sensor_schema)
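# Pass the schema explicitly; otherwise PyArrow infers int64/double,
# which does not match the table's int32/float32 columns.
sensor_data = pa.table({'id': [1, 2, 3], 'value': [1.5, 2.5, 3.5]}, schema=sensor_schema)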
sensors_table.append(sensor_data)

print("Initial schema:")
for field in sensors_table.schema().fields:
    print(f"  • {field.name}: {field.field_type}")

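# Promote types. update_column expects Iceberg types:
# LongType is the 64-bit integer, DoubleType the 64-bit float.
from pyiceberg.types import DoubleType, LongType

with sensors_table.update_schema() as update:
    update.update_column('id', field_type=LongType())
    update.update_column('value', field_type=DoubleType())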

print("\n✅ Promoted types: int32→int64, float→double")
print("\nNew schema:")
for field in sensors_table.schema().fields:
    print(f"  • {field.name}: {field.field_type}")

# Old data is automatically promoted when read
df = daft.read_iceberg(sensors_table)
df.show()

## Schema Versioning

Every schema change creates a new **schema version**. Metadata stores all versions:
* Schema ID 0: Initial schema
* Schema ID 1: After adding column
* Schema ID 2: After renaming
* ...

Snapshots reference specific schema versions. Let's inspect:

In [None]:
# Read metadata to see all schema versions
metadata = events_table.metadata

print(f"Total schema versions: {len(metadata.schemas)}\n")

for schema in metadata.schemas:
    print(f"Schema ID {schema.schema_id}:")
    print(f"  Fields: {[f.name for f in schema.fields[:5]]}")  # Show first 5
    if len(schema.fields) > 5:
        print(f"  ... and {len(schema.fields) - 5} more")
    print()

# Show which schema each snapshot was written with
print("Snapshots and their schemas:")
for i, snap in enumerate(events_table.metadata.snapshots, 1):
    # schema_id is optional on a snapshot, so it may be None
    print(f"  Snapshot {i}: {snap.snapshot_id} (schema ID: {snap.schema_id})")
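
To see what changed between two schema versions, it helps to compare them by field ID rather than by name. The sketch below works on plain `{field_id: name}` mappings. `diff_schemas` is a hypothetical helper, and the three versions are illustrative rather than read from the table.

```python
def diff_schemas(old, new):
    """Compare two {field_id: name} mappings and classify the changes."""
    return {
        "added": sorted(new[i] for i in new.keys() - old.keys()),
        "dropped": sorted(old[i] for i in old.keys() - new.keys()),
        "renamed": sorted(
            (old[i], new[i]) for i in old.keys() & new.keys() if old[i] != new[i]
        ),
    }

v0 = {1: "id", 2: "type", 3: "source"}
v1 = {1: "id", 2: "type", 3: "source", 4: "processed_at"}
v2 = {1: "id", 2: "type", 3: "device_id"}

print(diff_schemas(v0, v1))
print(diff_schemas(v1, v2))
```

For a real table, a mapping of top-level fields could be built with `{f.field_id: f.name for f in schema.fields}`, using the same attributes printed in the cells above. Comparing by name alone would report a rename as one drop plus one add.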

## Handling Schema Drift

Real-world data isn't always clean. Let's simulate schema drift using the radiator.jsonl dataset, which has type conflicts.

### The Problem

In radiator.jsonl, `meas_ModelNumber.value` is:
* Usually an integer: `{"value": 4}`
* Sometimes a string: `{"value": " 3"}` or `{"value": "1"}`

How do we handle this?

In [None]:
# Read radiator data with Daft, then check the inferred schema to see
# how the mixed-type field was resolved
radiator_df = daft.read_json('../data/input/radiator.jsonl')

print("Daft inferred schema:")
print(radiator_df.schema())

# Sample the data
print("\nSample:")
radiator_df.select('source', 'time', 'type').show(3)

In [None]:
# Write a small sample to Iceberg using the schema Daft inferred
sample_df = radiator_df.limit(5000)
sample_arrow = sample_df.to_arrow()

radiator_table = catalog.create_table('demo.radiator', schema=pa.schema(sample_arrow.schema))
radiator_table.append(sample_arrow)

print("✅ Written radiator data to Iceberg")
print("\nIceberg schema:")
for field in radiator_table.schema().fields[:5]:
    print(f"  • {field.name}: {field.field_type}")
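
The cells above rely on whatever type the reader infers. An alternative is to normalize the conflicting field explicitly before writing, so the target column has one well-defined type. The sketch below handles the `meas_ModelNumber.value` cases described earlier (an integer, or a numeric string with optional padding). The `"n/a"` and `None` inputs are hypothetical extras, and `normalize_model_number` is not part of Daft or PyIceberg.

```python
def normalize_model_number(value):
    """Coerce an int or a numeric string (possibly padded) to int; None if not parseable."""
    if isinstance(value, bool):   # bool is a subclass of int; treat it as invalid here
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value.strip())
        except ValueError:
            return None
    return None

raw_values = [4, " 3", "1", "n/a", None]
clean_values = [normalize_model_number(v) for v in raw_values]
print(clean_values)
```

Mapping unparseable values to `None` keeps the rows instead of failing the whole write. Whether that is acceptable, or whether bad rows should be routed elsewhere, is a per-dataset decision.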

This section only scratches the surface. An open question worth exploring: what happens when you query old data after larger schema changes have accumulated in the meantime?

## Review Questions

1. **Why does Iceberg use column IDs instead of names?**
   - What problems does this solve?

2. **What would break if you renamed a column in Parquet directly?**
   - Think about existing queries and applications.

3. **When would you need to rewrite data despite schema evolution?**
   - Are there any schema changes that require rewrites?

4. **How does Iceberg handle reading old data with new schemas?**
   - What happens to missing columns?

5. **Can you demote a type (int64 → int32)?**
   - Why or why not?

6. **What happens to removed columns in old snapshots?**
   - Are they still queryable?

## Hands-on Challenge

### Challenge 1: Add Computed Columns

1. Add a column `hour` to the events table
2. Extract it from the `time` column
3. Append new data with the hour populated
4. Verify old data shows NULL, new data has values

### Challenge 2: Schema Evolution History

1. List all schema versions for a table
2. For each version, show: schema ID, number of fields
3. Identify what changed between versions

### Challenge 3: Handle Type Conflicts

1. Create a test dataset with type conflicts
2. Try writing to Iceberg
3. Use explicit schema to handle conflicts
4. Verify data integrity

Use the cells below:

In [None]:
# Challenge 1: Your code here


In [None]:
# Challenge 2: Your code here


In [None]:
# Challenge 3: Your code here


## Summary

Schema evolution in Iceberg is powerful and efficient:

* **Adding columns**: Instant, old data shows NULL
* **Removing columns**: Instant, preserved for time travel
* **Renaming columns**: Safe due to column IDs
* **Type promotion**: Safe widening only; old values are upcast on read
* **Schema versioning**: All versions preserved
* **No data rewrites**: All changes are metadata-only

### Key Insights

1. **Column IDs are the secret**: Names can change, IDs can't
2. **Metadata-only changes**: No compute cost for schema evolution
3. **Backward compatible**: Old data files stay readable under newer schemas
4. **Time travel aware**: Historical schemas preserved
5. **Type safety**: Only safe promotions allowed

### Comparison with Traditional Systems

| Operation | Traditional DB | Hive | Iceberg |
|-----------|---------------|------|----------|
| Add column | ALTER TABLE (seconds) | Add to metastore | Metadata update (ms) |
| Remove column | ALTER TABLE + rewrite | Remove from metastore | Metadata update (ms) |
| Rename column | ALTER TABLE + rewrite | Manual migration | Metadata update (ms) |
| Type promotion | Rewrite all data | Rewrite all data | Metadata update (ms) |
| Time travel | Not supported | Manual snapshots | Built-in |

### What's Next?

In the next notebooks:
* **Concurrency**: Optimistic locking and conflict resolution
* **Partitioning**: Scale to millions of files
* **Object stores**: Iceberg on S3