# Iceberg Workshop: Getting Started

## How to run this workshop

The workshop consists of several code cells that are designed to be executed from top to bottom.

For example, this is a code cell that contains code to print "Hello Iceberg Summit":


In [None]:
print("Hello Iceberg Summit")

To execute a cell, click it and press Shift + Enter. The output will be displayed below the cell.

# Iceberg Metadata Structure


<img src="./imgs/iceberg-metadata.png" alt="Iceberg Metadata" style="width:500px;">

# Setup Catalog

In [None]:
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")

catalog.create_namespace_if_not_exists("demo_ns")
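
`load_catalog("default")` reads the connection settings for the catalog named `default` from PyIceberg's configuration (typically a `.pyiceberg.yaml` file or `PYICEBERG_CATALOG__DEFAULT__*` environment variables). As a rough sketch, the same settings could be passed explicitly as properties. The endpoints below mirror the Spark configuration used at the end of this workshop; the S3 credentials are an assumption (they reuse the MinIO login), and the helper `rest_catalog_properties` is hypothetical.

```python
def rest_catalog_properties(uri, s3_endpoint, access_key, secret_key):
    """Build a property dict for a REST catalog backed by S3-compatible storage."""
    return {
        "type": "rest",
        "uri": uri,
        "s3.endpoint": s3_endpoint,
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
    }


props = rest_catalog_properties(
    uri="http://rest:8181",           # REST catalog endpoint, as in the Spark config
    s3_endpoint="http://minio:9000",  # MinIO endpoint, as in the Spark config
    access_key="admin",               # assumption: same as the MinIO login
    secret_key="password",            # assumption: same as the MinIO login
)

# With the workshop services running, the catalog could then be loaded as:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("default", **props)
print(props)
```

If the workshop environment already ships a working configuration, the plain `load_catalog("default")` call is all that is needed.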

# Cleanup To Ensure Re-runnable

In [None]:
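from pyiceberg.exceptions import NoSuchTableError

try:
    # Drop the table left over from a previous run, if any
    catalog.drop_table("demo_ns.nyc_taxis")
except NoSuchTableError:
    # Nothing to clean up on the first run
    pass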

## Example Data: NYC Taxi Dataset

In this workshop, we will use the New York City Taxi & Limousine Commission's Trip Record Data, which can be downloaded from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

In [None]:
import pyarrow.parquet as pq

taxis_data_1 = pq.read_table('/home/jovyan/data/yellow_tripdata_2024-01.parquet')
taxis_data_2 = pq.read_table('/home/jovyan/data/yellow_tripdata_2024-02.parquet')
taxis_data_3 = pq.read_table('/home/jovyan/data/yellow_tripdata_2024-03.parquet')
dataset_schema = taxis_data_1.schema
dataset_schema

## Create an Iceberg table

First, we'll create an Iceberg table using the dataset's schema.

In [None]:
TABLE_NAME = "demo_ns.nyc_taxis"

In [None]:
nyc_taxis_tbl = catalog.create_table(TABLE_NAME, schema=dataset_schema)
nyc_taxis_tbl

## What happens behind table creation?

A metadata file has been created and registered as the latest metadata of table `demo_ns.nyc_taxis`. Let's log in to the MinIO console and look at the file:

- Minio Url: http://localhost:9001/
- username: admin
- password: password

The table is created at [s3://warehouse/demo_ns/nyc_taxis](http://localhost:9001/browser/warehouse/demo_ns%2Fnyc_taxis%2F): 

![](./imgs/simple_table_create.png)
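
To get a feel for what that metadata file holds, here is a minimal sketch that parses a hand-written, heavily trimmed stand-in for it. The key names follow the Iceberg table spec, but the values (including the two columns and their types) are illustrative rather than copied from the real file, and `summarize_metadata` is a hypothetical helper.

```python
import json

# Hand-written, trimmed stand-in for a freshly created table's metadata file.
# Key names follow the Iceberg table spec; the values are illustrative only.
SAMPLE_METADATA = """
{
  "format-version": 2,
  "location": "s3://warehouse/demo_ns/nyc_taxis",
  "current-schema-id": 0,
  "schemas": [
    {"schema-id": 0, "type": "struct", "fields": [
      {"id": 1, "name": "VendorID", "required": false, "type": "int"},
      {"id": 2, "name": "tpep_pickup_datetime", "required": false, "type": "timestamp"}
    ]}
  ],
  "default-spec-id": 0,
  "partition-specs": [{"spec-id": 0, "fields": []}],
  "snapshots": []
}
"""


def summarize_metadata(metadata: dict) -> dict:
    """Pick out the current schema, the default partition spec, and the snapshot count."""
    schema = next(
        s for s in metadata["schemas"]
        if s["schema-id"] == metadata["current-schema-id"]
    )
    spec = next(
        p for p in metadata["partition-specs"]
        if p["spec-id"] == metadata["default-spec-id"]
    )
    return {
        "location": metadata["location"],
        "columns": [f["name"] for f in schema["fields"]],
        "partition_fields": [f["name"] for f in spec["fields"]],
        "snapshot_count": len(metadata.get("snapshots", [])),
    }


summary = summarize_metadata(json.loads(SAMPLE_METADATA))
print(summary)
```

A newly created table has a schema and an (empty) partition spec, but no snapshots yet, because no data has been written.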

# Add data to the table

Appending data creates a new snapshot on the table.

In [None]:
nyc_taxis_tbl.append(taxis_data_1)
nyc_taxis_tbl

## Read the table

We can see that the example data has been added to the table.

In [None]:
nyc_taxis_tbl.scan(limit=10).to_pandas()

## What happens when adding data?

The data has been written into a parquet file and a new snapshot has been created.

Let's check the table location again: [s3://warehouse/demo_ns/nyc_taxis](http://localhost:9001/browser/warehouse/demo_ns%2Fnyc_taxis%2F)

We can see that the table location now has both a `metadata` and a `data` folder.

![](./imgs/simple_table_create_append_data.png)

In the `metadata` folder, we can see that some new files have been generated.

![](./imgs/simple_table_create_append_data_new_metadata.png)

In the `data` folder, we can see a new Parquet file that contains the inserted data.

![](./imgs/simple_table_create_append_data_new_data.png)


# Table Evolution: Make table partitioned

The table we just created is unpartitioned. In this example, we go one step further and partition the table by the `day` value of the `tpep_pickup_datetime` column.

In [None]:
from pyiceberg.transforms import DayTransform

with nyc_taxis_tbl.update_spec() as update_spec:
    update_spec.add_field("tpep_pickup_datetime", DayTransform())

nyc_taxis_tbl
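
Conceptually, the day transform maps each timestamp to the number of days since 1970-01-01, and rows that share the same value land in the same partition. The following is a minimal plain-Python sketch of that mapping, not PyIceberg's implementation; the pickup times and the helper `day_partition_value` are hypothetical.

```python
from collections import Counter
from datetime import date, datetime, timedelta

EPOCH = date(1970, 1, 1)


def day_partition_value(ts: datetime) -> int:
    """Days since 1970-01-01, the value the day transform derives from a timestamp."""
    return (ts.date() - EPOCH).days


# Hypothetical pickup times standing in for rows of the taxi dataset.
pickups = [
    datetime(2024, 2, 14, 8, 30),
    datetime(2024, 2, 14, 23, 59),
    datetime(2024, 2, 15, 0, 1),
]

# Count how many rows fall into each day partition.
rows_per_partition = Counter(day_partition_value(ts) for ts in pickups)
for value, count in sorted(rows_per_partition.items()):
    print(EPOCH + timedelta(days=value), value, count)
```

The first two rows share a partition even though they are more than 15 hours apart, while the third row, only two minutes later, starts a new one.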

In [None]:
nyc_taxis_tbl.append(taxis_data_2)

In [None]:
nyc_taxis_tbl.scan(limit=3).to_pandas()

In [None]:
# Note: DataFrame.size is the total number of cells (rows x columns), not the row count
nyc_taxis_tbl.scan().to_pandas().size

# Partitioned Data

If we go to the [`data` folder](http://localhost:9001/browser/warehouse/demo_ns%2Fnyc_taxis%2Fdata%2F) of table `nyc_taxis`:

![](./imgs/partition-by-day.png)

We can see that the newly inserted data is partitioned by day.

## Table Evolution: Change to partition by month for future data insertion

I changed my mind, and now I want to partition the table by the month of `tpep_pickup_datetime` for any future data insertion. No worries, we can easily do that!

Iceberg allows you to update the partitioning strategy without recreating the table or re-writing any data.

In [None]:
from pyiceberg.transforms import MonthTransform

with nyc_taxis_tbl.update_spec() as update_spec:
    update_spec.remove_field("tpep_pickup_datetime_day")
    update_spec.add_field("tpep_pickup_datetime", MonthTransform())

nyc_taxis_tbl

Now let's append some new data to the table

In [None]:
nyc_taxis_tbl.append(taxis_data_3)

If we go to the [`data` folder](http://localhost:9001/browser/warehouse/demo_ns%2Fnyc_taxis%2Fdata%2F) of table `nyc_taxis` again, we will find that the new data is partitioned by month. (You can find the folders of the new partitions at the bottom.)

![](./imgs/partition-by-month.png)

The folders of the earlier day partitions are still there because data inserted before the partition spec change remains in its original partitions.
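
Mixing layouts like this works because each data file is associated with the partition spec it was written under, so a query can be planned separately for each spec. The following is a minimal plain-Python sketch of that idea, not PyIceberg's implementation; the spec ids, file names, and the helper `files_for_range` are hypothetical.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def day_value(ts: datetime) -> int:
    """Days since 1970-01-01 (day transform)."""
    return (ts.date() - EPOCH).days


def month_value(ts: datetime) -> int:
    """Months since 1970-01 (month transform)."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


# Hypothetical spec ids: 1 = partitioned by day, 2 = partitioned by month.
SPECS = {1: day_value, 2: month_value}

# Hypothetical data files, each remembering the spec it was written with.
files = [
    {"path": "day-file-a.parquet", "spec_id": 1, "partition": day_value(datetime(2024, 2, 1))},
    {"path": "day-file-b.parquet", "spec_id": 1, "partition": day_value(datetime(2024, 2, 20))},
    {"path": "month-file.parquet", "spec_id": 2, "partition": month_value(datetime(2024, 3, 1))},
]


def files_for_range(files, start: datetime, end: datetime):
    """Keep files whose partition value can contain rows between start and end."""
    selected = []
    for f in files:
        transform = SPECS[f["spec_id"]]
        # Project the time range through the transform of the file's own spec.
        lower, upper = transform(start), transform(end)
        if lower <= f["partition"] <= upper:
            selected.append(f["path"])
    return selected


selected = files_for_range(files, datetime(2024, 2, 15), datetime(2024, 3, 10))
print(selected)
```

Files written under the day spec are pruned by day, files written under the month spec are pruned by month, and no existing file has to be rewritten.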

# Table Evolution: Change Table Schema
Iceberg supports schema evolution without rewriting any data. For example, we can rename `VendorID` to `ID`.



In [None]:
# Before rename
nyc_taxis_tbl.scan(limit=3).to_pandas()

In [None]:
with nyc_taxis_tbl.update_schema() as update:
    update.rename_column("VendorID", "ID")

In [None]:
# After rename
nyc_taxis_tbl.scan(limit=3).to_pandas()
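
The rename needs no data rewrite because Iceberg identifies columns by field ID rather than by name: data files keep their values under the same IDs, and only the ID-to-name mapping in the table metadata changes. The following is a minimal plain-Python sketch of that idea; the field IDs, the sample row, and the helper `read_rows` are hypothetical.

```python
# Values as stored in a data file, keyed by field ID (hypothetical IDs and values).
stored_rows = [
    {1: 2, 2: "2024-01-01 00:57:55"},
]

# Schema before the rename: field ID -> column name.
schema_before = {1: "VendorID", 2: "tpep_pickup_datetime"}

# Renaming only changes the name attached to field ID 1; stored_rows is untouched.
schema_after = {**schema_before, 1: "ID"}


def read_rows(rows, schema):
    """Resolve field IDs to the column names of the given schema."""
    return [{schema[field_id]: value for field_id, value in row.items()} for row in rows]


before = read_rows(stored_rows, schema_before)
after = read_rows(stored_rows, schema_after)
print(before)
print(after)
```

The same stored values are returned in both cases; only the column label differs.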

# Metadata Table

We can get more details about an Iceberg table by looking at its metadata tables.

## Partitions
For example, to learn about the existing partitions in the table, we can query the `partitions` metadata table.

In [None]:
nyc_taxis_tbl.inspect.partitions().to_pandas()

## Files

If we want to see all the data files in the table, we can query the `files` metadata table.

In [None]:
nyc_taxis_tbl.inspect.files().to_pandas()

## Snapshots

If we want to look at snapshots of the table, we can query the `snapshots` metadata table.

Every time a data change operation happens, Iceberg creates a new snapshot. In this example, we performed 3 appends, so we will have 3 snapshots. The partition spec and schema changes update the table metadata but do not create snapshots.

In [None]:
nyc_taxis_tbl.inspect.snapshots().to_pandas()
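
Each snapshot records its parent, so the snapshots form a chain that can be walked back to see what the table looked like at an earlier point. The following is a toy plain-Python model of that chain, not PyIceberg's implementation; the class names, sequential snapshot ids, and row counts are hypothetical (real snapshot ids are large generated numbers).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    operation: str
    added_rows: int


@dataclass
class ToyTable:
    snapshots: List[Snapshot] = field(default_factory=list)

    def append(self, row_count: int) -> Snapshot:
        """Record an append as a new snapshot whose parent is the current one."""
        parent = self.snapshots[-1].snapshot_id if self.snapshots else None
        snapshot = Snapshot(len(self.snapshots) + 1, parent, "append", row_count)
        self.snapshots.append(snapshot)
        return snapshot

    def rows_as_of(self, snapshot_id: int) -> int:
        """Row count visible at a snapshot, found by walking the parent chain."""
        by_id = {s.snapshot_id: s for s in self.snapshots}
        total = 0
        current = by_id.get(snapshot_id)
        while current is not None:
            total += current.added_rows
            current = by_id.get(current.parent_id)
        return total


table = ToyTable()
for rows in (100, 200, 300):  # hypothetical row counts for three appends
    table.append(rows)

print(len(table.snapshots))
print(table.rows_as_of(2), table.rows_as_of(3))
```

Reading "as of" the second snapshot ignores the rows added by the third append, which is the idea behind time travel queries.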

More metadata tables are available. You can find more information here: https://iceberg.apache.org/docs/nightly/spark-queries/#inspecting-tables

# Interoperability with other engines: Spark

Iceberg tables provide engine and platform interoperability. In the examples above, we used PyIceberg to perform all table operations. Now we will show that tables created by PyIceberg can also be consumed by Spark.

First, let's set up a Spark session.

In [None]:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.iceberg:iceberg-aws-bundle:1.8.1")
  .config("spark.sql.catalog.demo.type", "rest")
  .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")  
  .config("spark.sql.catalog.demo.uri", "http://rest:8181")
  .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")       
  .config("spark.sql.catalog.demo.warehouse", "s3://warehouse")
  .config("spark.sql.catalog.demo.s3.endpoint", "http://minio:9000")
  .config("spark.sql.catalog.demo.s3.region", "us-east-1")
  .config("spark.sql.catalog.demo.s3.path-style-access", "true")
).getOrCreate()

We can query the `nyc_taxis` table we just created.

In [None]:
spark.sql("SELECT ID, tpep_pickup_datetime, fare_amount FROM demo.demo_ns.nyc_taxis LIMIT 5").show()

We can also query the metadata tables of `nyc_taxis` in Spark. For example, the `snapshots` metadata table:

In [None]:
spark.sql("SELECT * FROM demo.demo_ns.nyc_taxis.snapshots").show()