# 05 — Partitioning

Iceberg introduces **hidden partitioning**: users don't need to know how a table is partitioned to write queries that benefit from it. Iceberg also supports **partition evolution**, letting you change the partitioning scheme without rewriting existing data.

In this notebook:
1. Create a partitioned table
2. See hidden partitioning in action
3. View partition info
4. Demonstrate partition pruning with EXPLAIN
5. Evolve the partition scheme

In [None]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("IcebergDemo")
    .master("local[*]")
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "../warehouse")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)
print("Spark + Iceberg ready.")

## 1. Create a Partitioned Table

In Hive, you'd partition by a literal column (e.g., `order_date`), and users would have to filter on that exact column for their queries to skip partitions.

Iceberg uses **partition transforms** — you can partition by `month(order_date)` and Iceberg handles it transparently.
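
To make the idea of a transform concrete, here is a minimal pure-Python sketch of what `month` and `day` compute, following the Iceberg table spec's definitions (months and days elapsed since 1970-01-01). The function names `month_transform` and `day_transform` are illustrative only and are not part of any Iceberg or Spark API.

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def month_transform(d: date) -> int:
    """Months elapsed since 1970-01 (illustrative helper, not an Iceberg API)."""
    return (d.year - EPOCH.year) * 12 + (d.month - EPOCH.month)

def day_transform(d: date) -> int:
    """Days elapsed since 1970-01-01 (illustrative helper, not an Iceberg API)."""
    return (d - EPOCH).days

# Two dates in the same month map to the same partition value.
print(month_transform(date(2024, 1, 15)), month_transform(date(2024, 1, 16)))
# The day transform keeps them apart.
print(day_transform(date(2024, 1, 15)), day_transform(date(2024, 1, 16)))
```

Because the partition value is derived from `order_date`, writers never have to supply it and readers never have to filter on it directly.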

In [None]:
spark.sql("DROP TABLE IF EXISTS demo.ecommerce.orders_partitioned")

spark.sql("""
    CREATE TABLE demo.ecommerce.orders_partitioned (
        order_id    INT,
        customer    STRING,
        product     STRING,
        quantity    INT,
        price       DOUBLE,
        order_date  DATE
    )
    USING iceberg
    PARTITIONED BY (month(order_date))
""")

print("Partitioned table created (partitioned by month of order_date).")

In [None]:
# Insert data spanning several months
spark.sql("""
    INSERT INTO demo.ecommerce.orders_partitioned VALUES
        (1,  'Alice',   'Laptop',     1, 999.99,  DATE '2024-01-15'),
        (2,  'Bob',     'Mouse',      2, 29.99,   DATE '2024-01-16'),
        (3,  'Charlie', 'Keyboard',   1, 79.99,   DATE '2024-02-01'),
        (4,  'Diana',   'Monitor',    1, 349.99,  DATE '2024-02-14'),
        (5,  'Eve',     'Headphones', 3, 59.99,   DATE '2024-03-01'),
        (6,  'Frank',   'Webcam',     1, 89.99,   DATE '2024-03-15'),
        (7,  'Alice',   'Tablet',     1, 449.99,  DATE '2024-04-01'),
        (8,  'Bob',     'Charger',    2, 19.99,   DATE '2024-04-10')
""")

spark.sql("SELECT * FROM demo.ecommerce.orders_partitioned ORDER BY order_date").show()

## 2. Hidden Partitioning in Action

Users query by `order_date` — they don't need to know it's partitioned by month. Iceberg automatically prunes partitions.
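
As a rough mental model of that pruning step (not Iceberg's actual planner code), the sketch below projects a range filter on `order_date` onto month partition values and keeps only the partitions that could contain matching rows. The helper names `month_transform` and `prune` are hypothetical, and the dates are the eight rows inserted above.

```python
from datetime import date, timedelta

def month_transform(d: date) -> int:
    """Months elapsed since 1970-01 (illustrative helper)."""
    return (d.year - 1970) * 12 + (d.month - 1)

# Partition values produced by the eight sample rows (January to April 2024).
order_dates = [
    date(2024, 1, 15), date(2024, 1, 16), date(2024, 2, 1), date(2024, 2, 14),
    date(2024, 3, 1), date(2024, 3, 15), date(2024, 4, 1), date(2024, 4, 10),
]
partitions = sorted({month_transform(d) for d in order_dates})

def prune(partitions, lower_inclusive: date, upper_exclusive: date):
    """Keep partitions that may hold rows with lower <= order_date < upper."""
    lo = month_transform(lower_inclusive)
    # The last date that can satisfy `order_date < upper` is the day before upper.
    hi = month_transform(upper_exclusive - timedelta(days=1))
    return [p for p in partitions if lo <= p <= hi]

kept = prune(partitions, date(2024, 1, 1), date(2024, 2, 1))
print(f"scan {len(kept)} of {len(partitions)} partitions: {kept}")
```

The filter is written purely in terms of `order_date`; the translation to partition values happens behind the scenes.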

In [None]:
# This query only needs to read the January partition
spark.sql("""
    SELECT * FROM demo.ecommerce.orders_partitioned
    WHERE order_date >= DATE '2024-01-01' AND order_date < DATE '2024-02-01'
    ORDER BY order_id
""").show()

## 3. View Partition Info

Iceberg exposes metadata tables alongside the data. The `partitions` table lists each partition with its record count and file count.
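
As a cross-check for the record counts, this small sketch groups the eight sample rows by their month partition value. It is plain Python and makes no claim about how the `partition` column is rendered in your Iceberg version (it may show the raw ordinal or a formatted value); `month_label` is a hypothetical helper that turns the ordinal back into a readable `YYYY-MM` string.

```python
from collections import Counter
from datetime import date

def month_transform(d: date) -> int:
    """Months elapsed since 1970-01 (illustrative helper)."""
    return (d.year - 1970) * 12 + (d.month - 1)

def month_label(months_since_epoch: int) -> str:
    """Render a month ordinal as YYYY-MM (hypothetical helper)."""
    years, month_index = divmod(months_since_epoch, 12)
    return f"{1970 + years:04d}-{month_index + 1:02d}"

order_dates = [
    date(2024, 1, 15), date(2024, 1, 16), date(2024, 2, 1), date(2024, 2, 14),
    date(2024, 3, 1), date(2024, 3, 15), date(2024, 4, 1), date(2024, 4, 10),
]

# Number of rows that land in each month partition.
record_counts = Counter(month_transform(d) for d in order_dates)
for value, count in sorted(record_counts.items()):
    print(value, month_label(value), count)
```

Each of the four months should report two records; the file count depends on how Spark distributed the write.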

In [None]:
spark.sql("""
    SELECT partition, record_count, file_count
    FROM demo.ecommerce.orders_partitioned.partitions
""").show(truncate=False)

## 4. Partition Pruning with EXPLAIN

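The query plan gives evidence of pruning: look for the `order_date` filter attached to the Iceberg scan node. Pushing the filter into the scan is what lets Iceberg skip files in non-matching partitions during planning.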

In [None]:
spark.sql("""
    EXPLAIN
    SELECT * FROM demo.ecommerce.orders_partitioned
    WHERE order_date = DATE '2024-03-01'
""").show(truncate=False)

## 5. Partition Evolution

Suppose requirements change and you now want to partition by **day** instead of month.

In Hive, this would require recreating the table and rewriting all data. In Iceberg, it's a metadata-only operation!

In [None]:
# Change partitioning from month to day — no data rewrite!
spark.sql("""
    ALTER TABLE demo.ecommerce.orders_partitioned
    REPLACE PARTITION FIELD month(order_date) WITH day(order_date)
""")

print("Partition scheme evolved from month(order_date) to day(order_date).")
print("Old data files are unchanged — only new writes use the new scheme.")
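
Conceptually, the table now carries two partition specs, and every data file remembers which spec it was written under. The sketch below models that idea in plain Python under simplifying assumptions: the `SPECS` mapping, the `DataFile` class, the file names, and `plan_scan` are all hypothetical stand-ins, not Iceberg's real metadata structures.

```python
from dataclasses import dataclass
from datetime import date

def month_transform(d: date) -> int:
    """Months elapsed since 1970-01 (illustrative helper)."""
    return (d.year - 1970) * 12 + (d.month - 1)

def day_transform(d: date) -> int:
    """Days elapsed since 1970-01-01 (illustrative helper)."""
    return (d - date(1970, 1, 1)).days

# Hypothetical stand-in for table metadata: spec id -> partition transform.
SPECS = {0: month_transform, 1: day_transform}

@dataclass(frozen=True)
class DataFile:
    path: str
    spec_id: int          # spec in effect when the file was written
    partition_value: int  # value of that spec's transform for the file's rows

files = [
    DataFile("jan.parquet", 0, month_transform(date(2024, 1, 15))),
    DataFile("apr.parquet", 0, month_transform(date(2024, 4, 1))),
    DataFile("may-01.parquet", 1, day_transform(date(2024, 5, 1))),
    DataFile("may-02.parquet", 1, day_transform(date(2024, 5, 2))),
]

def plan_scan(files, wanted: date):
    """Return files that may contain rows where order_date == wanted."""
    # The filter is evaluated against each file using that file's own spec.
    return [f.path for f in files if SPECS[f.spec_id](wanted) == f.partition_value]

print(plan_scan(files, date(2024, 5, 2)))   # only the day-partitioned file matches
print(plan_scan(files, date(2024, 4, 10)))  # the month-partitioned April file matches
```

Evaluating the filter per spec is why old month-level files and new day-level files can sit in the same table without any rewrite.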

In [None]:
# Insert new data — will use the new day-level partitioning
spark.sql("""
    INSERT INTO demo.ecommerce.orders_partitioned VALUES
        (9,  'Charlie', 'SSD',    1, 129.99, DATE '2024-05-01'),
        (10, 'Diana',   'RAM',    2, 64.99,  DATE '2024-05-02')
""")

spark.sql("""
    SELECT partition, spec_id, record_count, file_count
    FROM demo.ecommerce.orders_partitioned.partitions
""").show(truncate=False)

In [None]:
# All data is still queryable seamlessly
spark.sql("""
    SELECT * FROM demo.ecommerce.orders_partitioned
    ORDER BY order_date
""").show()

## Key Takeaway

| Feature                 | Hive                             | Iceberg                     |
|-------------------------|----------------------------------|-----------------------------|
| Partitioning            | Explicit partition columns       | Hidden partition transforms |
| User must know layout?  | Yes, queries need partition cols | No, Iceberg handles it      |
| Change partitioning     | Recreate table + rewrite data    | Metadata-only, instant      |
| Mixed partition layouts | Not possible                     | Old + new layouts coexist   |

---

## That's a Wrap!

You've seen the core features of Apache Iceberg:
1. **Table creation** with a standard SQL interface
2. **Full CRUD** with row-level updates
3. **Time travel** and snapshot management
4. **Schema evolution** without data rewrites
5. **Hidden partitioning** with partition evolution

All of this runs locally with zero infrastructure — just Spark + Iceberg.