# 04 — Schema Evolution

Iceberg tracks schema by column **ID** (not by name or position), so you can safely:
- Add new columns
- Drop columns
- Rename columns
- Widen types (e.g., INT → BIGINT)

Schema changes **never rewrite** old data files: Iceberg maps each file's columns onto the requested schema at read time.
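To make the ID-based mapping concrete, here is a minimal pure-Python sketch. It is a toy model, not Iceberg's API: the `read` helper, the dict-based "data file", and the specific field IDs are all assumptions made for illustration.

```python
# Toy model of ID-based column resolution. This is NOT Iceberg's API;
# every name here is hypothetical and exists only for illustration.

# A "data file" stores values keyed by field ID, never by column name.
data_file = [
    {1: 1, 2: "Alice", 3: "Laptop"},
    {1: 2, 2: "Bob", 3: "Mouse"},
]


def read(rows, schema):
    """Project stored rows through a schema given as {field_id: column_name}.

    IDs missing from a row (columns added after the file was written)
    read as None; IDs absent from the schema (dropped columns) are skipped.
    """
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]


schema_v1 = {1: "order_id", 2: "customer", 3: "product"}

# Rename: ID 2 keeps its ID but gets a new name.
# Add:    "status" gets a fresh ID 4.
# Drop:   ID 3 ("product") is removed from the schema.
schema_v2 = {1: "order_id", 2: "customer_name", 4: "status"}

rows_v1 = read(data_file, schema_v1)
rows_v2 = read(data_file, schema_v2)

print(rows_v2[0])
# {'order_id': 1, 'customer_name': 'Alice', 'status': None}
```

The stored rows are never touched. Only the schema passed to `read` changes, which is the sense in which these operations are metadata-only.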

In [1]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("IcebergDemo")
    .master("local[*]")
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "../warehouse")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)
print("Spark + Iceberg ready.")

Spark + Iceberg ready.


## Current Schema

In [2]:
spark.sql("DESCRIBE demo.ecommerce.orders").show(truncate=False)

+----------+---------+-------+
|col_name  |data_type|comment|
+----------+---------+-------+
|order_id  |int      |NULL   |
|customer  |string   |NULL   |
|product   |string   |NULL   |
|quantity  |int      |NULL   |
|price     |double   |NULL   |
|order_date|date     |NULL   |
+----------+---------+-------+



## 1. Add a Column

Let's add a `status` column to track order status.

In [3]:
spark.sql("""
    ALTER TABLE demo.ecommerce.orders
    ADD COLUMNS (status STRING)
""")

spark.sql("DESCRIBE demo.ecommerce.orders").show(truncate=False)

+----------+---------+-------+
|col_name  |data_type|comment|
+----------+---------+-------+
|order_id  |int      |NULL   |
|customer  |string   |NULL   |
|product   |string   |NULL   |
|quantity  |int      |NULL   |
|price     |double   |NULL   |
|order_date|date     |NULL   |
|status    |string   |NULL   |
+----------+---------+-------+



In [4]:
# Existing rows show NULL for the new column — no data rewrite needed!
spark.sql("SELECT * FROM demo.ecommerce.orders ORDER BY order_id").show()

+--------+--------+----------+--------+------+----------+------+
|order_id|customer|   product|quantity| price|order_date|status|
+--------+--------+----------+--------+------+----------+------+
|       1|   Alice|    Laptop|       1|999.99|2024-01-15|  NULL|
|       2|     Bob|     Mouse|       2| 29.99|2024-01-16|  NULL|
|       3| Charlie|  Keyboard|       1| 79.99|2024-01-16|  NULL|
|       4|   Alice|   Monitor|       1|349.99|2024-01-17|  NULL|
|       5|   Diana|Headphones|       3| 59.99|2024-01-18|  NULL|
+--------+--------+----------+--------+------+----------+------+



In [5]:
# Set status for existing orders (unlike the ALTER TABLE above, this UPDATE does write new data files)
spark.sql("""
    UPDATE demo.ecommerce.orders
    SET status = 'shipped'
    WHERE status IS NULL
""")

spark.sql("SELECT * FROM demo.ecommerce.orders ORDER BY order_id").show()

+--------+--------+----------+--------+------+----------+-------+
|order_id|customer|   product|quantity| price|order_date| status|
+--------+--------+----------+--------+------+----------+-------+
|       1|   Alice|    Laptop|       1|999.99|2024-01-15|shipped|
|       2|     Bob|     Mouse|       2| 29.99|2024-01-16|shipped|
|       3| Charlie|  Keyboard|       1| 79.99|2024-01-16|shipped|
|       4|   Alice|   Monitor|       1|349.99|2024-01-17|shipped|
|       5|   Diana|Headphones|       3| 59.99|2024-01-18|shipped|
+--------+--------+----------+--------+------+----------+-------+



## 2. Rename a Column

Rename `customer` to `customer_name` — because Iceberg tracks columns by ID,
this is safe and doesn't break existing data.

In [6]:
spark.sql("""
    ALTER TABLE demo.ecommerce.orders
    RENAME COLUMN customer TO customer_name
""")

spark.sql("SELECT order_id, customer_name, product FROM demo.ecommerce.orders ORDER BY order_id").show()

+--------+-------------+----------+
|order_id|customer_name|   product|
+--------+-------------+----------+
|       1|        Alice|    Laptop|
|       2|          Bob|     Mouse|
|       3|      Charlie|  Keyboard|
|       4|        Alice|   Monitor|
|       5|        Diana|Headphones|
+--------+-------------+----------+



## 3. Add Another Column with a Comment

In [7]:
spark.sql("""
    ALTER TABLE demo.ecommerce.orders
    ADD COLUMNS (shipping_address STRING COMMENT 'Customer shipping address')
""")

spark.sql("DESCRIBE demo.ecommerce.orders").show(truncate=False)

+----------------+---------+-------------------------+
|col_name        |data_type|comment                  |
+----------------+---------+-------------------------+
|order_id        |int      |NULL                     |
|customer_name   |string   |NULL                     |
|product         |string   |NULL                     |
|quantity        |int      |NULL                     |
|price           |double   |NULL                     |
|order_date      |date     |NULL                     |
|status          |string   |NULL                     |
|shipping_address|string   |Customer shipping address|
+----------------+---------+-------------------------+



## 4. Drop a Column

We don't need `shipping_address` after all.

In [8]:
spark.sql("""
    ALTER TABLE demo.ecommerce.orders
    DROP COLUMN shipping_address
""")

spark.sql("DESCRIBE demo.ecommerce.orders").show(truncate=False)

+-------------+---------+-------+
|col_name     |data_type|comment|
+-------------+---------+-------+
|order_id     |int      |NULL   |
|customer_name|string   |NULL   |
|product      |string   |NULL   |
|quantity     |int      |NULL   |
|price        |double   |NULL   |
|order_date   |date     |NULL   |
|status       |string   |NULL   |
+-------------+---------+-------+
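One consequence of ID tracking is worth spelling out: if a dropped column name is added again later, it receives a new ID, so values written under the old column do not reappear. The sketch below models only that ID bookkeeping. `ToySchema` and the sample address are hypothetical and are not part of Iceberg or of this notebook's table.

```python
# Hypothetical toy schema, not Iceberg's API: it only models ID assignment.
class ToySchema:
    def __init__(self):
        self.columns = {}   # field_id -> column name
        self.last_id = 0    # highest ID ever assigned; never decreases

    def add(self, name):
        """Add a column under a fresh ID and return that ID."""
        self.last_id += 1
        self.columns[self.last_id] = name
        return self.last_id

    def drop(self, name):
        """Remove a column from the schema; its ID is never reused."""
        field_id = next(i for i, n in self.columns.items() if n == name)
        del self.columns[field_id]

    def read(self, stored_row):
        """Project a stored {field_id: value} row through the current schema."""
        return {n: stored_row.get(i) for i, n in self.columns.items()}


schema = ToySchema()
schema.add("order_id")                      # ID 1
old_id = schema.add("shipping_address")     # ID 2
stored_row = {1: 1, old_id: "12 Main St"}   # written while the column existed

schema.drop("shipping_address")
new_id = schema.add("shipping_address")     # ID 3, not 2

result = schema.read(stored_row)
print(old_id, new_id, result)
# 2 3 {'order_id': 1, 'shipping_address': None}
```

The old value is still physically present in `stored_row` under ID 2, but no column in the schema points at it any more.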



## 5. Old Data Still Readable

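Even after all these schema changes, we can still time-travel back to earlier snapshots. Time travel by snapshot ID reads with the schema that was current when the snapshot was committed, which is why the query below returns `customer` rather than `customer_name` and has no `status` column.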

In [9]:
# Get the first snapshot (before schema changes)
first_snapshot = spark.sql("""
    SELECT snapshot_id FROM demo.ecommerce.orders.snapshots
    ORDER BY committed_at
    LIMIT 1
""").collect()[0].snapshot_id

print(f"Reading from earliest snapshot ({first_snapshot}) with that snapshot's schema:")

spark.sql(f"""
    SELECT * FROM demo.ecommerce.orders
    VERSION AS OF {first_snapshot}
    ORDER BY order_id
""").show()

Reading from earliest snapshot (8255495233418216272) with that snapshot's schema:
+--------+--------+----------+--------+------+----------+
|order_id|customer|   product|quantity| price|order_date|
+--------+--------+----------+--------+------+----------+
|       1|   Alice|    Laptop|       1|999.99|2024-01-15|
|       2|     Bob|     Mouse|       2| 29.99|2024-01-16|
|       3| Charlie|  Keyboard|       1| 79.99|2024-01-16|
|       4|   Alice|   Monitor|       1|349.99|2024-01-17|
|       5|   Diana|Headphones|       3| 59.99|2024-01-18|
+--------+--------+----------+--------+------+----------+
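The old column names appear in this result because, as far as the table format goes, each snapshot can record the ID of the schema it was written with. The following is a simplified sketch of that lookup. The dictionary layout, the `columns_for` helper, and the snapshot IDs 100 and 200 are made up for illustration and do not mirror the real metadata file.

```python
# Hypothetical, heavily simplified table metadata. Keys and IDs are illustrative.
metadata = {
    "current_schema_id": 2,
    "schemas": {
        0: ["order_id", "customer", "product", "quantity", "price", "order_date"],
        1: ["order_id", "customer", "product", "quantity", "price", "order_date", "status"],
        2: ["order_id", "customer_name", "product", "quantity", "price", "order_date", "status"],
    },
    "snapshots": [
        {"snapshot_id": 100, "schema_id": 0},  # initial insert
        {"snapshot_id": 200, "schema_id": 1},  # UPDATE after adding status
    ],
}


def columns_for(metadata, snapshot_id=None):
    """Return the column names a read would expose.

    With no snapshot_id, use the table's current schema; otherwise use the
    schema recorded on that snapshot.
    """
    if snapshot_id is None:
        return metadata["schemas"][metadata["current_schema_id"]]
    snapshot = next(
        s for s in metadata["snapshots"] if s["snapshot_id"] == snapshot_id
    )
    return metadata["schemas"][snapshot["schema_id"]]


current_columns = columns_for(metadata)
first_columns = columns_for(metadata, snapshot_id=100)
print(first_columns)
# ['order_id', 'customer', 'product', 'quantity', 'price', 'order_date']
```

Note that schema-only changes such as the rename do not add a snapshot in this model, so the current schema can be newer than the schema of the latest snapshot.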



## Key Takeaway

| Operation           | Effect on existing data files |
|---------------------|-------------------------------|
| Add column          | None — new column reads as NULL |
| Drop column         | None — column is hidden at read time |
| Rename column       | None — mapped by column ID |
| Widen type          | None — handled at read time |

No data migration needed. No downtime. Schema changes are **metadata-only**.
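Type widening is the one row in the table that this notebook does not run. In Spark it would look something like `ALTER TABLE demo.ecommerce.orders ALTER COLUMN quantity TYPE BIGINT`. Only a few promotions are allowed; the sketch below encodes the ones listed in versions 1 and 2 of the Iceberg table spec, to the best of my understanding (Spark's `INT` and `BIGINT` correspond to Iceberg's `int` and `long`). The `is_safe_widening` helper is hypothetical, not a library function.

```python
import re

# Promotions allowed by the Iceberg v1/v2 table spec, as understood here:
# int -> long, float -> double, and decimal(P, S) -> decimal(P', S) with P' > P.
_PRIMITIVE_PROMOTIONS = {("int", "long"), ("float", "double")}
_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")


def is_safe_widening(old_type, new_type):
    """Return True if changing old_type to new_type is an allowed promotion."""
    if (old_type, new_type) in _PRIMITIVE_PROMOTIONS:
        return True
    old_dec = _DECIMAL.fullmatch(old_type)
    new_dec = _DECIMAL.fullmatch(new_type)
    if old_dec and new_dec:
        old_precision, old_scale = int(old_dec.group(1)), int(old_dec.group(2))
        new_precision, new_scale = int(new_dec.group(1)), int(new_dec.group(2))
        # Precision may grow; scale must stay the same.
        return new_scale == old_scale and new_precision > old_precision
    return False


print(is_safe_widening("int", "long"))     # True
print(is_safe_widening("long", "int"))     # False (narrowing)
print(is_safe_widening("int", "string"))   # False (not a promotion)
```

Anything outside this list, such as narrowing or converting a number to a string, is not a metadata-only change and would need the data to be rewritten into a new column.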

**Next up:** Partitioning in notebook 05!