# 04 — Schema Evolution

Iceberg tracks schema by column **ID** (not by name or position), so you can safely:
- Add new columns
- Drop columns
- Rename columns
- Widen types (e.g., INT → BIGINT)

Schema changes **never rewrite** existing data files. Iceberg maps each file's columns to the current schema by ID at read time.
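To make the ID-based mapping concrete, here is a toy model in plain Python. It is a sketch for intuition only, not Iceberg's implementation: the `ToySchema` class, its methods, and the sample rows are all hypothetical. Stored rows are keyed by field ID, the schema maps IDs to the current names, and every schema change touches the schema alone.

```python
# Toy model of ID-based column resolution (illustrative only, not Iceberg code).

class ToySchema:
    def __init__(self):
        self.fields = {}   # field ID -> current column name
        self.next_id = 1   # IDs only ever increase, so they are never reused

    def add(self, name):
        field_id = self.next_id
        self.fields[field_id] = name
        self.next_id += 1
        return field_id

    def rename(self, old, new):
        # Only the name attached to the ID changes.
        field_id = next(i for i, n in self.fields.items() if n == old)
        self.fields[field_id] = new

    def drop(self, name):
        field_id = next(i for i, n in self.fields.items() if n == name)
        del self.fields[field_id]

    def read(self, stored_row):
        # Project a stored row (field ID -> value) through the current schema.
        # IDs absent from the row read as None; IDs absent from the schema are skipped.
        return {name: stored_row.get(fid) for fid, name in self.fields.items()}


schema = ToySchema()
schema.add("order_id")   # gets ID 1
schema.add("customer")   # gets ID 2

# A row "written" under the original schema, keyed by field ID.
old_row = {1: 101, 2: "Alice"}

schema.add("status")                         # gets ID 3
schema.rename("customer", "customer_name")   # ID 2 keeps its data
after_add_and_rename = schema.read(old_row)

# A row written after the add, carrying a value for ID 3.
new_row = {1: 102, 2: "Bob", 3: "shipped"}

schema.drop("status")
schema.add("status")                         # same name, fresh ID 4
after_readd = schema.read(new_row)

print(after_add_and_rename)
print(after_readd)
```

In this model the old row picks up the renamed column and a `None` for the added one without being touched, and a column that is dropped and re-added under the same name does not resurrect the values stored under the old ID.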

In [None]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("IcebergDemo")
    .master("local[*]")
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "../warehouse")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)
print("Spark + Iceberg ready.")

## Current Schema

In [None]:
spark.sql("DESCRIBE demo.ecommerce.orders").show(truncate=False)

## 1. Add a Column

Let's add a `status` column to track order status.

In [None]:
spark.sql("""
    ALTER TABLE demo.ecommerce.orders
    ADD COLUMNS (status STRING)
""")

spark.sql("DESCRIBE demo.ecommerce.orders").show(truncate=False)

In [None]:
# Existing rows show NULL for the new column — no data rewrite needed!
spark.sql("SELECT * FROM demo.ecommerce.orders ORDER BY order_id").show()

In [None]:
# Backfill status for existing orders (unlike ALTER TABLE, this UPDATE writes new data files)
spark.sql("""
    UPDATE demo.ecommerce.orders
    SET status = 'shipped'
    WHERE status IS NULL
""")

spark.sql("SELECT * FROM demo.ecommerce.orders ORDER BY order_id").show()

## 2. Rename a Column

Rename `customer` to `customer_name` — because Iceberg tracks columns by ID,
this is safe and doesn't break existing data.

In [None]:
spark.sql("""
    ALTER TABLE demo.ecommerce.orders
    RENAME COLUMN customer TO customer_name
""")

spark.sql("SELECT order_id, customer_name, product FROM demo.ecommerce.orders ORDER BY order_id").show()

## 3. Add Another Column with a Comment

In [None]:
spark.sql("""
    ALTER TABLE demo.ecommerce.orders
    ADD COLUMNS (shipping_address STRING COMMENT 'Customer shipping address')
""")

spark.sql("DESCRIBE demo.ecommerce.orders").show(truncate=False)

## 4. Drop a Column

We don't need `shipping_address` after all.

In [None]:
spark.sql("""
    ALTER TABLE demo.ecommerce.orders
    DROP COLUMN shipping_address
""")

spark.sql("DESCRIBE demo.ecommerce.orders").show(truncate=False)

## 5. Old Data Still Readable

Even after all these schema changes, we can still time-travel back to snapshots with the old schema. Iceberg reconciles the schema differences at read time.
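One way to see which columns a time-travel read actually returns is to compare them with the table's current columns. The helper below is a minimal sketch: `schema_diff` is a hypothetical name, and it assumes a `spark` session configured as in the first cell. It relies only on `spark.sql(...)` and the `DataFrame.columns` attribute.

```python
def schema_diff(spark, table, snapshot_id):
    """Compare column names at a snapshot with the table's current columns.

    Hypothetical helper for illustration; it compares by name, so a renamed
    column appears once in each list.
    """
    current = spark.sql(f"SELECT * FROM {table}").columns
    historical = spark.sql(
        f"SELECT * FROM {table} VERSION AS OF {snapshot_id}"
    ).columns
    return {
        "only_in_snapshot": [c for c in historical if c not in current],
        "only_in_current": [c for c in current if c not in historical],
    }
```

Call it as `schema_diff(spark, "demo.ecommerce.orders", some_snapshot_id)`, with any snapshot ID taken from `demo.ecommerce.orders.snapshots`. Neither query scans the data, because only the column lists are inspected.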

In [None]:
# Get the first snapshot (before schema changes)
first_snapshot = spark.sql("""
    SELECT snapshot_id FROM demo.ecommerce.orders.snapshots
    ORDER BY committed_at
    LIMIT 1
""").collect()[0].snapshot_id

print(f"Reading from earliest snapshot ({first_snapshot}); time travel by snapshot ID uses that snapshot's schema:")

spark.sql(f"""
    SELECT * FROM demo.ecommerce.orders
    VERSION AS OF {first_snapshot}
    ORDER BY order_id
""").show()

## Key Takeaway

| Operation           | Effect on existing data files |
|---------------------|-------------------------------|
| Add column          | None — new column reads as NULL |
| Drop column         | None — column is hidden at read time |
| Rename column       | None — mapped by column ID |
| Widen type          | None — handled at read time |

No data migration needed. No downtime. Schema changes are **metadata-only**.
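Type widening is the one operation in the table that this notebook does not run. As a sketch, the helper below checks a requested change against the promotions the Iceberg spec has long allowed (int to long, float to double, and a decimal precision increase at the same scale) and builds the matching `ALTER TABLE ... ALTER COLUMN ... TYPE` statement. The function names are hypothetical, and newer format versions may allow further promotions that this check rejects.

```python
import re

_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")

def is_safe_widening(old_type, new_type):
    """Return True for the promotions this sketch knows to be safe."""
    old, new = old_type.strip().lower(), new_type.strip().lower()
    if (old, new) in {("int", "bigint"), ("float", "double")}:
        return True
    m_old, m_new = _DECIMAL.fullmatch(old), _DECIMAL.fullmatch(new)
    if m_old and m_new:
        p_old, s_old = (int(g) for g in m_old.groups())
        p_new, s_new = (int(g) for g in m_new.groups())
        # Precision may grow; scale must stay the same.
        return s_new == s_old and p_new > p_old
    return False

def widen_column_sql(table, column, old_type, new_type):
    """Build the ALTER statement, refusing changes that are not safe widenings."""
    if not is_safe_widening(old_type, new_type):
        raise ValueError(f"{old_type} -> {new_type} is not a supported widening")
    return f"ALTER TABLE {table} ALTER COLUMN {column} TYPE {new_type}"
```

For example, if the table had an `INT` column named `quantity` (an assumption; check the `DESCRIBE` output first), `spark.sql(widen_column_sql("demo.ecommerce.orders", "quantity", "int", "bigint"))` would widen it without rewriting data files.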

**Next up:** Partitioning in notebook 05!