# 01 — Setup & First Iceberg Table

In this notebook we will:
1. Initialize a SparkSession with Iceberg support
2. Create a namespace (database)
3. Create our first Iceberg table
4. Insert sample data
5. Run basic queries

## 1. Initialize Spark with Iceberg

In [1]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("IcebergDemo")
    .master("local[*]")
    # Pull in the Iceberg runtime JAR built for Spark 3.5 / Scala 2.12 (resolved from Maven Central on first run)
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1")
    # Register an Iceberg catalog named "demo" that stores tables under a local warehouse directory
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "../warehouse")
    # Iceberg SQL extensions (MERGE INTO, CALL procedures, etc.)
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

print(f"Spark version: {spark.version}")
print("Iceberg catalog 'demo' is ready!")

26/02/23 13:51:16 WARN Utils: Your hostname, barkha-xg1 resolves to a loopback address: 127.0.1.1; using 192.168.1.227 instead (on interface enp195s0)
26/02/23 13:51:16 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Ivy Default Cache set to: /home/barkha/.ivy2/cache
The jars for the packages stored in: /home/barkha/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.5_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-b6070ec0-ae11-4c94-9584-2e7e1dcbc37f;1.0
	confs: [default]


:: loading settings :: url = jar:file:/home/barkha/iceberg-demo/.venv/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.7.1 in central
downloading https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.7.1/iceberg-spark-runtime-3.5_2.12-1.7.1.jar ...
	[SUCCESSFUL ] org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.7.1!iceberg-spark-runtime-3.5_2.12.jar (863ms)
:: resolution report :: resolve 463ms :: artifacts dl 865ms
	:: modules in use:
	org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.7.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   1   |   1   |   0   ||   1   |   1   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-b6070ec0-ae11-4c94-9584-2

Spark version: 3.5.3
Iceberg catalog 'demo' is ready!


26/02/23 13:51:28 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
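
The three catalog settings in the cell above share the `spark.sql.catalog.<name>` prefix, so they can be generated from the catalog name. The helper below is a minimal sketch of that idea. `iceberg_catalog_conf` and `build_session` are hypothetical names, not part of PySpark or Iceberg, and the sketch leaves out `spark.jars.packages` and `spark.sql.extensions`, which would still need to be set as in the cell above.

```python
def iceberg_catalog_conf(name: str, warehouse: str) -> dict[str, str]:
    """Return the Spark settings that register a Hadoop-type Iceberg catalog."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse,
    }


def build_session(app_name: str, conf: dict[str, str]):
    """Apply every key/value pair to a SparkSession builder and start the session."""
    # Imported lazily so the helper above can be used without pyspark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).master("local[*]")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Same catalog name and warehouse path as the notebook
demo_conf = iceberg_catalog_conf("demo", "../warehouse")
for key, value in demo_conf.items():
    print(f"{key} = {value}")
```

Registering a second catalog would then only take another call with a different name and warehouse path.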


## 2. Create a Namespace

A namespace in Iceberg is like a database: it groups related tables. With the Hadoop catalog used here, each namespace is a directory under the warehouse path.

In [2]:
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.ecommerce")
spark.sql("SHOW NAMESPACES IN demo").show()

+---------+
|namespace|
+---------+
|ecommerce|
+---------+



## 3. Create an Iceberg Table

In [3]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.ecommerce.orders (
        order_id   INT,
        customer   STRING,
        product    STRING,
        quantity   INT,
        price      DOUBLE,
        order_date DATE
    )
    USING iceberg
""")

print("Table created!")
spark.sql("SHOW TABLES IN demo.ecommerce").show()

Table created!
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|ecommerce|   orders|      false|
+---------+---------+-----------+



## 4. Insert Sample Data

In [4]:
spark.sql("""
    INSERT INTO demo.ecommerce.orders VALUES
        (1,  'Alice',   'Laptop',     1, 999.99,  DATE '2024-01-15'),
        (2,  'Bob',     'Mouse',      2, 29.99,   DATE '2024-01-16'),
        (3,  'Charlie', 'Keyboard',   1, 79.99,   DATE '2024-01-16'),
        (4,  'Alice',   'Monitor',    1, 349.99,  DATE '2024-01-17'),
        (5,  'Diana',   'Headphones', 3, 59.99,   DATE '2024-01-18')
""")

print("5 rows inserted.")

5 rows inserted.
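
The same kind of insert can also be written with the DataFrame API instead of SQL. The sketch below uses `DataFrame.writeTo(...).append()` from PySpark's DataFrameWriterV2. The `append_orders` helper and the sixth order are hypothetical and not part of this notebook's sample data, so the call is left commented out to keep the query results in the next section unchanged.

```python
from datetime import date

# Column names and types copied from the CREATE TABLE statement above
ORDERS_SCHEMA = (
    "order_id INT, customer STRING, product STRING, "
    "quantity INT, price DOUBLE, order_date DATE"
)


def append_orders(spark, rows, table="demo.ecommerce.orders"):
    """Append rows to an Iceberg table through the DataFrameWriterV2 API."""
    df = spark.createDataFrame(rows, schema=ORDERS_SCHEMA)
    df.writeTo(table).append()
    return len(rows)


# Hypothetical row, not part of the notebook's sample data
extra_orders = [(6, "Eve", "Webcam", 1, 49.99, date(2024, 1, 19))]

# Uncomment to run against the live session (this adds a sixth row to the table)
# append_orders(spark, extra_orders)
```

Passing the schema explicitly keeps the DataFrame column types aligned with the table definition rather than relying on type inference.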


## 5. Query the Table

In [5]:
spark.sql("SELECT * FROM demo.ecommerce.orders ORDER BY order_id").show()

+--------+--------+----------+--------+------+----------+
|order_id|customer|   product|quantity| price|order_date|
+--------+--------+----------+--------+------+----------+
|       1|   Alice|    Laptop|       1|999.99|2024-01-15|
|       2|     Bob|     Mouse|       2| 29.99|2024-01-16|
|       3| Charlie|  Keyboard|       1| 79.99|2024-01-16|
|       4|   Alice|   Monitor|       1|349.99|2024-01-17|
|       5|   Diana|Headphones|       3| 59.99|2024-01-18|
+--------+--------+----------+--------+------+----------+



In [6]:
# Quick aggregation
spark.sql("""
    SELECT customer, COUNT(*) AS num_orders, ROUND(SUM(price * quantity), 2) AS total_spent
    FROM demo.ecommerce.orders
    GROUP BY customer
    ORDER BY total_spent DESC
""").show()

+--------+----------+-----------+
|customer|num_orders|total_spent|
+--------+----------+-----------+
|   Alice|         2|    1349.98|
|   Diana|         1|     179.97|
| Charlie|         1|      79.99|
|     Bob|         1|      59.98|
+--------+----------+-----------+
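
As a sanity check, the totals above can be reproduced in plain Python from the five inserted rows. This is only a sketch that mirrors the `GROUP BY` logic with floats (like the table's `DOUBLE` column). It does not touch Spark.

```python
from collections import defaultdict

# (customer, quantity, price) copied from the INSERT statement in section 4
orders = [
    ("Alice", 1, 999.99),
    ("Bob", 2, 29.99),
    ("Charlie", 1, 79.99),
    ("Alice", 1, 349.99),
    ("Diana", 3, 59.99),
]

totals = defaultdict(float)  # SUM(price * quantity) per customer
counts = defaultdict(int)    # COUNT(*) per customer
for customer, quantity, price in orders:
    totals[customer] += price * quantity
    counts[customer] += 1

# ORDER BY total_spent DESC, with the same two-decimal rounding as the query
summary = sorted(
    ((name, counts[name], round(totals[name], 2)) for name in totals),
    key=lambda row: row[2],
    reverse=True,
)
for row in summary:
    print(row)
```

Alice comes first because her two orders add up to 999.99 + 349.99, and Diana's single order of three headphones (3 × 59.99) outranks Charlie's and Bob's.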



## What Just Happened?

Under the hood, Iceberg created:
- **Data files** (Parquet) in `warehouse/ecommerce/orders/data/`
- **Metadata files** in `warehouse/ecommerce/orders/metadata/`: table metadata as JSON, plus manifest lists and manifests as Avro

This metadata layer, which records a new snapshot for every commit such as the `INSERT` above, is what gives Iceberg its superpowers: time travel, schema evolution, and more.
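
To look at these files yourself, a few lines of `pathlib` are enough. The sketch below groups whatever it finds under the table directory by top-level folder. `summarize_table_dir` is a hypothetical helper, and the path assumes the notebook's `../warehouse` setting resolves relative to the directory the notebook runs in.

```python
from pathlib import Path


def summarize_table_dir(table_dir: Path) -> dict[str, list[str]]:
    """Group the files under a table directory by their top-level subfolder."""
    layout: dict[str, list[str]] = {}
    for path in sorted(table_dir.rglob("*")):
        if path.is_file():
            # First path component below the table directory, e.g. "data" or "metadata"
            top = path.relative_to(table_dir).parts[0]
            layout.setdefault(top, []).append(path.name)
    return layout


# Assumed location: warehouse path from the Spark config, then namespace, then table name
table_dir = Path("../warehouse/ecommerce/orders")
if table_dir.exists():
    for folder, files in summarize_table_dir(table_dir).items():
        print(f"{folder}/ ({len(files)} file(s))")
        for name in files:
            print(f"    {name}")
else:
    print(f"{table_dir} not found; run the cells above first")
```

Running it again after later writes should show the metadata folder growing, since each commit adds new metadata files rather than rewriting old ones.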

**Next up:** CRUD operations in notebook 02!