<a href="https://colab.research.google.com/github/HonahX/iceberg-summit-workshop/blob/colab_dev/Iceberg_getting_started_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Iceberg Workshop: Getting Started

## How to run this workshop

The workshop consists of several code cells that are designed to be executed from top to bottom.

For example, this is a code cell that prints "Hello Iceberg Summit":


In [None]:
print("Hello Iceberg Summit")

Hello Iceberg Summit


To execute a cell, click it and press Shift + Enter. The output will be displayed below the cell.


# Iceberg Metadata Structure

![Iceberg metadata structure](https://github.com/HonahX/iceberg-summit-workshop/blob/main/notebooks/imgs/iceberg-metadata.png?raw=true)

# Setup

## Install Dependencies

In [2]:
%pip install pyiceberg[pyarrow,pandas,sql-sqlite]==0.9.0

Collecting pyiceberg==0.9.0 (from pyiceberg[pandas,pyarrow,sql-sqlite]==0.9.0)
  Downloading pyiceberg-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting mmh3<6.0.0,>=4.0.0 (from pyiceberg==0.9.0->pyiceberg[pandas,pyarrow,sql-sqlite]==0.9.0)
  Downloading mmh3-5.1.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting strictyaml<2.0.0,>=1.7.0 (from pyiceberg==0.9.0->pyiceberg[pandas,pyarrow,sql-sqlite]==0.9.0)
  Downloading strictyaml-1.7.3-py3-none-any.whl.metadata (11 kB)
Downloading pyiceberg-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Downloading mmh3-5.1.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (101 kB)

## Create a utility to print a directory tree

In [3]:
import os

def print_directory(root_path, indent=''):
    """Recursively print the contents of root_path as a tree."""
    try:
        entries = sorted(os.listdir(root_path))
    except FileNotFoundError:
        print(f"{indent}[Error] Path not found: {root_path}")
        return
    except PermissionError:
        print(f"{indent}[Error] Permission denied: {root_path}")
        return

    for i, entry in enumerate(entries):
        path = os.path.join(root_path, entry)
        is_last = (i == len(entries) - 1)
        branch = '└── ' if is_last else '├── '
        print(f"{indent}{branch}{entry}")
        if os.path.isdir(path):
            new_indent = indent + ('    ' if is_last else '│   ')
            print_directory(path, new_indent)

## Download Example Data

In [4]:
import os
data_dir = "/data"
os.makedirs(data_dir, exist_ok=True)

!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet -O /data/yellow_tripdata_2024-01.parquet
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-02.parquet -O /data/yellow_tripdata_2024-02.parquet
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-03.parquet -O /data/yellow_tripdata_2024-03.parquet

--2025-04-06 16:14:05--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 3.163.157.7, 3.163.157.72, 3.163.157.96, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|3.163.157.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49961641 (48M) [binary/octet-stream]
Saving to: ‘/data/yellow_tripdata_2024-01.parquet’


2025-04-06 16:14:06 (57.6 MB/s) - ‘/data/yellow_tripdata_2024-01.parquet’ saved [49961641/49961641]

--2025-04-06 16:14:06--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-02.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 3.163.157.7, 3.163.157.72, 3.163.157.96, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|3.163.157.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50349284 (48M) [binary/octet-stream]

## Setup Catalog

In [38]:
from pyiceberg.catalog import load_catalog


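warehouse = "/warehouse"
os.makedirs(warehouse, exist_ok=True)
# SQLAlchemy-style URI: "sqlite:///" followed by the absolute database path
sqlite_uri = f"sqlite:///{warehouse}/sql-catalog.db"
# The catalog name is only a label; the catalog itself is backed by the SQLite file above
catalog = load_catalog("in-memory", warehouse=warehouse, uri=sqlite_uri)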

catalog.create_namespace_if_not_exists("demo_ns")
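
PyIceberg's SQL catalog is built on SQLAlchemy, whose SQLite URIs take the form `sqlite:///` followed by the database file path, so an absolute POSIX path ends up with four slashes in total. The sketch below shows that construction on its own; `sqlite_uri_for` is a hypothetical helper for illustration, not part of PyIceberg.

```python
import os


def sqlite_uri_for(db_path: str) -> str:
    """Build a SQLAlchemy-style SQLite URI for a database file (hypothetical helper)."""
    # "sqlite:///" is the scheme prefix; an absolute POSIX path supplies the fourth slash.
    return "sqlite:///" + os.path.abspath(db_path)


print(sqlite_uri_for("/warehouse/sql-catalog.db"))
```

The resulting string can be passed as the `uri` property when loading the catalog.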

## Clean up to make the notebook re-runnable

In [7]:
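from pyiceberg.exceptions import NoSuchTableError

try:
    # Drop the table if a previous run left it behind
    catalog.drop_table("demo_ns.nyc_taxis")
except NoSuchTableError:
    # Nothing to clean up on a first run
    pass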

## Example Data: NYC Taxi Dataset

In this workshop, we will use the New York City Taxi & Limousine Commission's Trip Record Data, which can be downloaded from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page.

In [8]:
import pyarrow.parquet as pq

taxis_data_1 = pq.read_table('/data/yellow_tripdata_2024-01.parquet')
taxis_data_2 = pq.read_table('/data/yellow_tripdata_2024-02.parquet')
taxis_data_3 = pq.read_table('/data/yellow_tripdata_2024-03.parquet')
dataset_schema = taxis_data_1.schema
dataset_schema

VendorID: int32
tpep_pickup_datetime: timestamp[us]
tpep_dropoff_datetime: timestamp[us]
passenger_count: int64
trip_distance: double
RatecodeID: int64
store_and_fwd_flag: large_string
PULocationID: int32
DOLocationID: int32
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
congestion_surcharge: double
Airport_fee: double

## Create an Iceberg table

First, we'll create an Iceberg table using the dataset's schema.

In [9]:
TABLE_NAME = "demo_ns.nyc_taxis"

In [10]:
nyc_taxis_tbl = catalog.create_table(TABLE_NAME, schema=dataset_schema)
nyc_taxis_tbl

nyc_taxis(
  1: VendorID: optional int,
  2: tpep_pickup_datetime: optional timestamp,
  3: tpep_dropoff_datetime: optional timestamp,
  4: passenger_count: optional long,
  5: trip_distance: optional double,
  6: RatecodeID: optional long,
  7: store_and_fwd_flag: optional string,
  8: PULocationID: optional int,
  9: DOLocationID: optional int,
  10: payment_type: optional long,
  11: fare_amount: optional double,
  12: extra: optional double,
  13: mta_tax: optional double,
  14: tip_amount: optional double,
  15: tolls_amount: optional double,
  16: improvement_surcharge: optional double,
  17: total_amount: optional double,
  18: congestion_surcharge: optional double,
  19: Airport_fee: optional double
),
partition by: [],
sort order: [],
snapshot: null
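
Comparing the Arrow schema of the Parquet file with the table printed above shows how each Arrow type appears as an Iceberg type, and that every column receives a numeric field ID. The sketch below only records the correspondences visible in this notebook; `ARROW_TO_ICEBERG` and `describe_fields` are illustrative names, and this is not PyIceberg's conversion code.

```python
# Arrow type names (left) and the Iceberg type names they appear as in the
# table printed above. Only the types seen in this dataset are listed.
ARROW_TO_ICEBERG = {
    "int32": "int",
    "int64": "long",
    "double": "double",
    "large_string": "string",
    "timestamp[us]": "timestamp",
}


def describe_fields(arrow_fields):
    """Assign sequential field IDs starting at 1 and render Iceberg-style lines."""
    lines = []
    for field_id, (name, arrow_type) in enumerate(arrow_fields, start=1):
        iceberg_type = ARROW_TO_ICEBERG[arrow_type]
        lines.append(f"{field_id}: {name}: optional {iceberg_type}")
    return lines


# The first four columns of the dataset schema shown earlier.
sample = [
    ("VendorID", "int32"),
    ("tpep_pickup_datetime", "timestamp[us]"),
    ("tpep_dropoff_datetime", "timestamp[us]"),
    ("passenger_count", "int64"),
]
for line in describe_fields(sample):
    print(line)
```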

## What happens behind table creation?

A metadata file has been created and registered as the latest metadata of table `demo_ns.nyc_taxis`. Let's view the table's location.

In [11]:
print_directory(nyc_taxis_tbl.location())

└── metadata
    └── 00000-b28ed822-9ec9-45f8-8ca4-9e956eb83fc5.metadata.json
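
The `.metadata.json` file is plain JSON, so it can be inspected with the standard library. The sketch below assumes the top-level field names defined by the Iceberg table spec (`format-version`, `table-uuid`, `location`, `current-snapshot-id`, `schemas`). `summarize_metadata` is a hypothetical helper, and the demo runs against a small hand-written stand-in document rather than the real file.

```python
import json
import os
import tempfile


def summarize_metadata(path):
    """Return a few top-level fields from an Iceberg table metadata file."""
    with open(path, encoding="utf-8") as f:
        meta = json.load(f)
    return {
        "format-version": meta.get("format-version"),
        "table-uuid": meta.get("table-uuid"),
        "location": meta.get("location"),
        # A freshly created table has no snapshot yet.
        "current-snapshot-id": meta.get("current-snapshot-id"),
        "schema-count": len(meta.get("schemas", [])),
    }


# Demo with a made-up, heavily trimmed metadata document (not real output).
fake_metadata = {
    "format-version": 2,
    "table-uuid": "00000000-0000-0000-0000-000000000000",
    "location": "/warehouse/demo_ns.db/nyc_taxis",
    "current-snapshot-id": None,
    "schemas": [{"schema-id": 0, "type": "struct", "fields": []}],
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "00000-demo.metadata.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(fake_metadata, f)
    summary = summarize_metadata(demo_path)

print(summary)
```

To look at the real file, pass the path of the `.metadata.json` file listed under `nyc_taxis_tbl.location()` instead of the demo path.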


# Add data to the table

Appending data creates a new snapshot on the table.

In [12]:
nyc_taxis_tbl.append(taxis_data_1)
nyc_taxis_tbl

nyc_taxis(
  1: VendorID: optional int,
  2: tpep_pickup_datetime: optional timestamp,
  3: tpep_dropoff_datetime: optional timestamp,
  4: passenger_count: optional long,
  5: trip_distance: optional double,
  6: RatecodeID: optional long,
  7: store_and_fwd_flag: optional string,
  8: PULocationID: optional int,
  9: DOLocationID: optional int,
  10: payment_type: optional long,
  11: fare_amount: optional double,
  12: extra: optional double,
  13: mta_tax: optional double,
  14: tip_amount: optional double,
  15: tolls_amount: optional double,
  16: improvement_surcharge: optional double,
  17: total_amount: optional double,
  18: congestion_surcharge: optional double,
  19: Airport_fee: optional double
),
partition by: [],
sort order: [],
snapshot: Operation.APPEND: id=1112881952072389391, schema_id=0

## Read the table

We can see that the example data has been added to the table.

In [13]:
nyc_taxis_tbl.scan(limit=10).to_pandas()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee
0,2,2024-01-01 00:57:55,2024-01-01 01:17:43,1,1.72,1,N,186,79,2,17.7,1.0,0.5,0.0,0.0,1.0,22.7,2.5,0.0
1,1,2024-01-01 00:03:00,2024-01-01 00:09:36,1,1.8,1,N,140,236,1,10.0,3.5,0.5,3.75,0.0,1.0,18.75,2.5,0.0
2,1,2024-01-01 00:17:06,2024-01-01 00:35:01,1,4.7,1,N,236,79,1,23.3,3.5,0.5,3.0,0.0,1.0,31.3,2.5,0.0
3,1,2024-01-01 00:36:38,2024-01-01 00:44:56,1,1.4,1,N,79,211,1,10.0,3.5,0.5,2.0,0.0,1.0,17.0,2.5,0.0
4,1,2024-01-01 00:46:51,2024-01-01 00:52:57,1,0.8,1,N,211,148,1,7.9,3.5,0.5,3.2,0.0,1.0,16.1,2.5,0.0
5,1,2024-01-01 00:54:08,2024-01-01 01:26:31,1,4.7,1,N,148,141,1,29.6,3.5,0.5,6.9,0.0,1.0,41.5,2.5,0.0
6,2,2024-01-01 00:49:44,2024-01-01 01:15:47,2,10.82,1,N,138,181,1,45.7,6.0,0.5,10.0,0.0,1.0,64.95,0.0,1.75
7,1,2024-01-01 00:30:40,2024-01-01 00:58:40,0,3.0,1,N,246,231,2,25.4,3.5,0.5,0.0,0.0,1.0,30.4,2.5,0.0
8,2,2024-01-01 00:26:01,2024-01-01 00:54:12,1,5.44,1,N,161,261,2,31.0,1.0,0.5,0.0,0.0,1.0,36.0,2.5,0.0
9,2,2024-01-01 00:28:08,2024-01-01 00:29:16,1,0.04,1,N,113,113,2,3.0,1.0,0.5,0.0,0.0,1.0,8.0,2.5,0.0


## What happens when adding data?

The data has been written into a Parquet file and a new snapshot has been created.

Let's check the table location again:

In [14]:
print_directory(nyc_taxis_tbl.location())

├── data
│   └── 00000-0-ca442fe2-ee5c-4cf7-b44e-720efbd88733.parquet
└── metadata
    ├── 00000-b28ed822-9ec9-45f8-8ca4-9e956eb83fc5.metadata.json
    ├── 00001-a7eae212-6711-4931-a54f-38d8d478f186.metadata.json
    ├── ca442fe2-ee5c-4cf7-b44e-720efbd88733-m0.avro
    └── snap-1112881952072389391-0-ca442fe2-ee5c-4cf7-b44e-720efbd88733.avro


In the `metadata` folder, we can see that some new files have been generated:

*   new metadata file: `00001-<uuid>.metadata.json`
*   manifest file: `<uuid>-m0.avro`
*   manifest list file: `snap-<snapshot-id>-0-<uuid>.avro`

In the `data` folder, we can see a new Parquet file that contains the inserted data:

*   `00000-0-<uuid>.parquet`
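
The naming patterns in the lists above are enough to tell the file kinds apart. As a rough sketch, the regular expressions below are derived only from the file names shown in this notebook's directory listing; they are an illustration, not a guarantee about how every Iceberg writer names its files, and `classify` is a hypothetical helper.

```python
import re

# Patterns follow the file names shown in the directory listing above.
PATTERNS = [
    ("table metadata", re.compile(r"^\d{5}-[0-9a-f-]{36}\.metadata\.json$")),
    ("manifest list", re.compile(r"^snap-\d+-\d+-[0-9a-f-]{36}\.avro$")),
    ("manifest", re.compile(r"^[0-9a-f-]{36}-m\d+\.avro$")),
    ("data file", re.compile(r"^\d{5}-\d+-[0-9a-f-]{36}\.parquet$")),
]


def classify(file_name):
    """Return the kind of Iceberg file suggested by its name, or 'unknown'."""
    for kind, pattern in PATTERNS:
        if pattern.match(file_name):
            return kind
    return "unknown"


listing = [
    "00000-b28ed822-9ec9-45f8-8ca4-9e956eb83fc5.metadata.json",
    "00001-a7eae212-6711-4931-a54f-38d8d478f186.metadata.json",
    "ca442fe2-ee5c-4cf7-b44e-720efbd88733-m0.avro",
    "snap-1112881952072389391-0-ca442fe2-ee5c-4cf7-b44e-720efbd88733.avro",
    "00000-0-ca442fe2-ee5c-4cf7-b44e-720efbd88733.parquet",
]
for name in listing:
    print(f"{classify(name):15} {name}")
```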



# Table Evolution: Make table partitioned

The table we just created is unpartitioned. In this example, we take a further step and partition the table by the `day` value of the `tpep_pickup_datetime` column.

In [15]:
from pyiceberg.transforms import DayTransform

with nyc_taxis_tbl.update_spec() as update_spec:
    update_spec.add_field("tpep_pickup_datetime", DayTransform())

nyc_taxis_tbl

nyc_taxis(
  1: VendorID: optional int,
  2: tpep_pickup_datetime: optional timestamp,
  3: tpep_dropoff_datetime: optional timestamp,
  4: passenger_count: optional long,
  5: trip_distance: optional double,
  6: RatecodeID: optional long,
  7: store_and_fwd_flag: optional string,
  8: PULocationID: optional int,
  9: DOLocationID: optional int,
  10: payment_type: optional long,
  11: fare_amount: optional double,
  12: extra: optional double,
  13: mta_tax: optional double,
  14: tip_amount: optional double,
  15: tolls_amount: optional double,
  16: improvement_surcharge: optional double,
  17: total_amount: optional double,
  18: congestion_surcharge: optional double,
  19: Airport_fee: optional double
),
partition by: [tpep_pickup_datetime_day],
sort order: [],
snapshot: Operation.APPEND: id=1112881952072389391, schema_id=0
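
According to the Iceberg spec, the `day` transform maps a timestamp to the number of days since 1970-01-01, and partition folder names render that value as a date. Here is a minimal standard-library sketch of the idea. It is not PyIceberg's `DayTransform` implementation; `day_partition` and `day_label` are illustrative names, and the timestamps are sample values.

```python
from datetime import date, datetime, timedelta

EPOCH = date(1970, 1, 1)


def day_partition(ts: datetime) -> int:
    """Days since 1970-01-01 for the calendar day containing ts."""
    return (ts.date() - EPOCH).days


def day_label(days: int) -> str:
    """Render a day ordinal as the date used in partition folder names."""
    return (EPOCH + timedelta(days=days)).isoformat()


# Sample pickup times: one just before midnight, two on the following day.
pickups = [
    datetime(2024, 1, 31, 23, 59, 53),
    datetime(2024, 2, 1, 0, 18, 35),
    datetime(2024, 2, 1, 17, 5, 0),
]
for ts in pickups:
    days = day_partition(ts)
    print(ts, "->", days, "->", f"tpep_pickup_datetime_day={day_label(days)}")
```

Rows whose pickup time falls on the same calendar day share one partition value, while a pickup a few seconds before midnight lands in the previous day's partition.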

## Insert new data

The newly inserted data will be partitioned by the `day` value of the `tpep_pickup_datetime` column.

In [16]:
nyc_taxis_tbl.append(taxis_data_2)

In [17]:
nyc_taxis_tbl.scan(limit=3).to_pandas()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee
0,2,2024-01-31 23:59:53,2024-02-01 00:18:35,1,6.95,1,N,249,166,1,30.3,1.0,0.5,7.06,0.0,1.0,42.36,2.5,0.0
1,2,2024-01-31 23:59:24,2024-02-01 00:06:13,1,1.28,1,N,68,137,2,9.3,1.0,0.5,0.0,0.0,1.0,14.3,2.5,0.0
2,2,2024-01-31 23:57:33,2024-02-01 00:05:48,1,1.4,1,N,90,79,1,10.0,1.0,0.5,1.95,0.0,1.0,16.95,2.5,0.0


In [18]:
# DataFrame.size is rows times columns (19 columns here), not the row count
nyc_taxis_tbl.scan().to_pandas().size

113470850

# Partitioned Data

If we go to the `data` folder of the table, we can see the newly inserted data partitioned by date.

In [19]:
print_directory(os.path.join(nyc_taxis_tbl.location(), "data"))

├── 00000-0-ca442fe2-ee5c-4cf7-b44e-720efbd88733.parquet
├── tpep_pickup_datetime_day=2008-12-31
│   └── 00000-3-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2009-01-01
│   └── 00000-12-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-01-31
│   └── 00000-1-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-02-01
│   └── 00000-0-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-02-02
│   └── 00000-2-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-02-03
│   └── 00000-4-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-02-04
│   └── 00000-5-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-02-05
│   └── 00000-6-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-02-06
│   └── 00000-7-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024

## Table Evolution: Change to partition by month for future data insertion

I changed my mind and now I want to partition the table by the `month` of `tpep_pickup_datetime` for any future data insertion. No worries, we can easily achieve it!

Iceberg allows you to update the partitioning strategy without recreating the table or re-writing any data.

In [20]:
from pyiceberg.transforms import MonthTransform

with nyc_taxis_tbl.update_spec() as update_spec:
    update_spec.remove_field("tpep_pickup_datetime_day")
    update_spec.add_field("tpep_pickup_datetime", MonthTransform())

nyc_taxis_tbl

nyc_taxis(
  1: VendorID: optional int,
  2: tpep_pickup_datetime: optional timestamp,
  3: tpep_dropoff_datetime: optional timestamp,
  4: passenger_count: optional long,
  5: trip_distance: optional double,
  6: RatecodeID: optional long,
  7: store_and_fwd_flag: optional string,
  8: PULocationID: optional int,
  9: DOLocationID: optional int,
  10: payment_type: optional long,
  11: fare_amount: optional double,
  12: extra: optional double,
  13: mta_tax: optional double,
  14: tip_amount: optional double,
  15: tolls_amount: optional double,
  16: improvement_surcharge: optional double,
  17: total_amount: optional double,
  18: congestion_surcharge: optional double,
  19: Airport_fee: optional double
),
partition by: [tpep_pickup_datetime_month],
sort order: [],
snapshot: Operation.APPEND: id=3690222227076645093, parent_id=1112881952072389391, schema_id=0
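
According to the Iceberg spec, the `month` transform maps a timestamp to the number of months since 1970-01, which partition folder names render as `YYYY-MM`. The following is a minimal sketch of that arithmetic, not PyIceberg's `MonthTransform` implementation; `month_partition` and `month_label` are illustrative names.

```python
from datetime import datetime


def month_partition(ts: datetime) -> int:
    """Months since 1970-01 for the calendar month containing ts."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def month_label(months: int) -> str:
    """Render a month ordinal as YYYY-MM."""
    year_offset, month_index = divmod(months, 12)
    return f"{1970 + year_offset:04d}-{month_index + 1:02d}"


# Sample pickup times around a month boundary.
pickups = [
    datetime(2024, 2, 29, 23, 59, 33),
    datetime(2024, 3, 1, 0, 13, 55),
    datetime(2024, 3, 31, 23, 0, 0),
]
for ts in pickups:
    months = month_partition(ts)
    print(ts, "->", months, "->", f"tpep_pickup_datetime_month={month_label(months)}")
```

A file of March trips can therefore still produce a `2024-02` partition if some pickups happened shortly before midnight on the last day of February.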

Now let's append some new data to the table

In [21]:
nyc_taxis_tbl.append(taxis_data_3)

If we go to the `data` folder of table `nyc_taxis` again, we will find that the new data is partitioned by month. (The folders for the new partitions appear at the bottom of the listing.)

```
├── tpep_pickup_datetime_month=2002-12
│   └── 00000-2-<uuid>.parquet
├── tpep_pickup_datetime_month=2024-02
│   └── 00000-1-<uuid>.parquet
├── tpep_pickup_datetime_month=2024-03
│   └── 00000-0-<uuid>.parquet
└── tpep_pickup_datetime_month=2024-04
    └── 00000-3-<uuid>.parquet
```

In [22]:
print_directory(os.path.join(nyc_taxis_tbl.location(), "data"))

├── 00000-0-ca442fe2-ee5c-4cf7-b44e-720efbd88733.parquet
├── tpep_pickup_datetime_day=2008-12-31
│   └── 00000-3-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2009-01-01
│   └── 00000-12-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-01-31
│   └── 00000-1-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-02-01
│   └── 00000-0-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-02-02
│   └── 00000-2-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-02-03
│   └── 00000-4-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-02-04
│   └── 00000-5-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-02-05
│   └── 00000-6-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024-02-06
│   └── 00000-7-571d036f-d63c-4b9e-bb9a-722d49896eaf.parquet
├── tpep_pickup_datetime_day=2024

# Table Evolution: Change Table Schema
Iceberg supports schema evolution without rewriting any data. For example, we can rename `VendorID` to `ID`.


In [23]:
# Before rename
nyc_taxis_tbl.scan(limit=3).to_pandas()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee
0,2,2024-02-29 23:52:39,2024-02-29 23:57:31,2,0.69,1,N,234,113,1,6.5,1.0,0.5,2.3,0.0,1.0,13.8,2.5,0.0
1,2,2024-02-29 23:59:33,2024-03-01 00:18:39,2,3.43,1,N,68,148,1,19.8,1.0,0.5,3.0,0.0,1.0,27.8,2.5,0.0
2,2,2024-02-29 23:59:13,2024-03-01 00:13:55,1,8.92,1,N,132,39,1,34.5,1.0,0.5,0.0,0.0,1.0,38.75,0.0,1.75


In [24]:
with nyc_taxis_tbl.update_schema() as update:
    update.rename_column("VendorID", "ID")

In [25]:
# After rename
nyc_taxis_tbl.scan(limit=3).to_pandas()

Unnamed: 0,ID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee
0,2,2024-02-29 23:52:39,2024-02-29 23:57:31,2,0.69,1,N,234,113,1,6.5,1.0,0.5,2.3,0.0,1.0,13.8,2.5,0.0
1,2,2024-02-29 23:59:33,2024-03-01 00:18:39,2,3.43,1,N,68,148,1,19.8,1.0,0.5,3.0,0.0,1.0,27.8,2.5,0.0
2,2,2024-02-29 23:59:13,2024-03-01 00:13:55,1,8.92,1,N,132,39,1,34.5,1.0,0.5,0.0,0.0,1.0,38.75,0.0,1.75
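
The rename needs no data rewrite because Iceberg tracks columns by field ID rather than by name: existing data keeps referring to ID `1`, and only the name attached to that ID changes in the schema. Below is a simplified model of this idea, with plain dictionaries standing in for the schema and a stored row. It is illustrative only and not how PyIceberg represents either of them.

```python
# Schema: field ID -> column name (IDs as in the table printed earlier).
schema = {1: "VendorID", 2: "tpep_pickup_datetime", 4: "passenger_count"}

# A stored row keyed by field ID. In this simplified model the "data file"
# holds no column names at all.
stored_row = {1: 2, 2: "2024-02-29 23:52:39", 4: 2}


def read_row(row, current_schema):
    """Resolve field IDs to whatever names the current schema assigns."""
    return {current_schema[field_id]: value for field_id, value in row.items()}


def rename_column(current_schema, old_name, new_name):
    """Return a new schema with one column renamed; stored rows are untouched."""
    if old_name not in current_schema.values():
        raise KeyError(f"column {old_name!r} not found")
    if new_name in current_schema.values():
        raise ValueError(f"column {new_name!r} already exists")
    return {
        field_id: (new_name if name == old_name else name)
        for field_id, name in current_schema.items()
    }


before = read_row(stored_row, schema)
after = read_row(stored_row, rename_column(schema, "VendorID", "ID"))
print(before)
print(after)
```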


# Metadata Table

We can get more details about an Iceberg table by looking at its metadata tables.

## Partitions
For example, to learn about the existing partitions in the table, we can query the `partitions` metadata table.

In [26]:
nyc_taxis_tbl.inspect.partitions().to_pandas()

Unnamed: 0,partition,spec_id,record_count,file_count,total_data_file_size_in_bytes,position_delete_record_count,position_delete_file_count,equality_delete_record_count,equality_delete_file_count,last_updated_at,last_updated_snapshot_id
0,"{'tpep_pickup_datetime_day': None, 'tpep_picku...",2,3582605,1,62528393,0,0,0,0,2025-04-06 16:16:44.180,393423256772927473
1,"{'tpep_pickup_datetime_day': None, 'tpep_picku...",2,19,1,8355,0,0,0,0,2025-04-06 16:16:44.180,393423256772927473
2,"{'tpep_pickup_datetime_day': None, 'tpep_picku...",2,2,1,7581,0,0,0,0,2025-04-06 16:16:44.180,393423256772927473
3,"{'tpep_pickup_datetime_day': None, 'tpep_picku...",2,2,1,7583,0,0,0,0,2025-04-06 16:16:44.180,393423256772927473
4,"{'tpep_pickup_datetime_day': 2024-02-01, 'tpep...",1,109994,1,1956979,0,0,0,0,2025-04-06 16:16:27.839,3690222227076645093
5,"{'tpep_pickup_datetime_day': 2024-01-31, 'tpep...",1,11,1,8041,0,0,0,0,2025-04-06 16:16:27.839,3690222227076645093
6,"{'tpep_pickup_datetime_day': 2024-02-02, 'tpep...",1,105470,1,1892034,0,0,0,0,2025-04-06 16:16:27.839,3690222227076645093
7,"{'tpep_pickup_datetime_day': 2008-12-31, 'tpep...",1,1,1,7501,0,0,0,0,2025-04-06 16:16:27.839,3690222227076645093
8,"{'tpep_pickup_datetime_day': 2024-02-03, 'tpep...",1,110603,1,1949768,0,0,0,0,2025-04-06 16:16:27.839,3690222227076645093
9,"{'tpep_pickup_datetime_day': 2024-02-04, 'tpep...",1,88091,1,1660712,0,0,0,0,2025-04-06 16:16:27.839,3690222227076645093


## Files

If we want to see all the data files in the table, we can query the `files` metadata table.

In [27]:
nyc_taxis_tbl.inspect.files().to_pandas()

Unnamed: 0,content,file_path,file_format,spec_id,record_count,file_size_in_bytes,column_sizes,value_counts,null_value_counts,nan_value_counts,lower_bounds,upper_bounds,key_metadata,split_offsets,equality_ids,sort_order_id,readable_metrics
0,0,/warehouse/demo_ns.db/nyc_taxis/data/tpep_pick...,PARQUET,2,3582605,62528393,"[(1, 420938), (2, 16875604), (3, 17082862), (4...","[(1, 3582605), (2, 3582605), (3, 3582605), (4,...","[(1, 0), (2, 0), (3, 0), (4, 426190), (5, 0), ...",[],"[(1, b'\x01\x00\x00\x00'), (2, b'\x00\xa0\x9b\...","[(1, b'\x06\x00\x00\x00'), (2, b'\xc0\xfd\xa0\...",,"[4, 17986208, 35790600, 53693562]",,,"{'ID': {'column_size': 420938, 'value_count': ..."
1,0,/warehouse/demo_ns.db/nyc_taxis/data/tpep_pick...,PARQUET,2,19,8355,"[(1, 90), (2, 241), (3, 231), (4, 128), (5, 23...","[(1, 19), (2, 19), (3, 19), (4, 19), (5, 19), ...","[(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0...",[],"[(1, b'\x02\x00\x00\x00'), (2, b'\x00`>H\x8d\x...","[(1, b'\x02\x00\x00\x00'), (2, b'@\xd9m\x0e\x8...",,[4],,,"{'ID': {'column_size': 90, 'value_count': 19, ..."
2,0,/warehouse/demo_ns.db/nyc_taxis/data/tpep_pick...,PARQUET,2,2,7581,"[(1, 90), (2, 118), (3, 118), (4, 110), (5, 11...","[(1, 2), (2, 2), (3, 2), (4, 2), (5, 2), (6, 2...","[(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0...",[],"[(1, b'\x02\x00\x00\x00'), (2, b'@[\x0f\xbf\xf...","[(1, b'\x02\x00\x00\x00'), (2, b'\xc0i\x8f(\xf...",,[4],,,"{'ID': {'column_size': 90, 'value_count': 2, '..."
3,0,/warehouse/demo_ns.db/nyc_taxis/data/tpep_pick...,PARQUET,2,2,7583,"[(1, 90), (2, 118), (3, 118), (4, 110), (5, 11...","[(1, 2), (2, 2), (3, 2), (4, 2), (5, 2), (6, 2...","[(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0...",[],"[(1, b'\x02\x00\x00\x00'), (2, b'\x80%\x88\x8d...","[(1, b'\x02\x00\x00\x00'), (2, b'\x807\x1dE \x...",,[4],,,"{'ID': {'column_size': 90, 'value_count': 2, '..."
4,0,/warehouse/demo_ns.db/nyc_taxis/data/tpep_pick...,PARQUET,1,109994,1956979,"[(1, 13034), (2, 522669), (3, 531264), (4, 179...","[(1, 109994), (2, 109994), (3, 109994), (4, 10...","[(1, 0), (2, 0), (3, 0), (4, 4342), (5, 0), (6...",[],"[(1, b'\x01\x00\x00\x00'), (2, b'\x00\xc05\xad...","[(1, b'\x02\x00\x00\x00'), (2, b'\xc0\xdd\xfd\...",,[4],,,"{'ID': {'column_size': 13034, 'value_count': 1..."
5,0,/warehouse/demo_ns.db/nyc_taxis/data/tpep_pick...,PARQUET,1,11,8041,"[(1, 90), (2, 184), (3, 183), (4, 110), (5, 18...","[(1, 11), (2, 11), (3, 11), (4, 11), (5, 11), ...","[(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0...",[],"[(1, b'\x02\x00\x00\x00'), (2, b'@\xe5_\x91F\x...","[(1, b'\x02\x00\x00\x00'), (2, b'\x802\xda\xac...",,[4],,,"{'ID': {'column_size': 90, 'value_count': 11, ..."
6,0,/warehouse/demo_ns.db/nyc_taxis/data/tpep_pick...,PARQUET,1,105470,1892034,"[(1, 12520), (2, 508876), (3, 512860), (4, 190...","[(1, 105470), (2, 105470), (3, 105470), (4, 10...","[(1, 0), (2, 0), (3, 0), (4, 3807), (5, 0), (6...",[],"[(1, b'\x01\x00\x00\x00'), (2, b'@b\x1c\xcbZ\x...","[(1, b'\x02\x00\x00\x00'), (2, b'\xc0=\xd5\xe8...",,[4],,,"{'ID': {'column_size': 12520, 'value_count': 1..."
7,0,/warehouse/demo_ns.db/nyc_taxis/data/tpep_pick...,PARQUET,1,1,7501,"[(1, 90), (2, 110), (3, 110), (4, 110), (5, 11...","[(1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1...","[(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0...",[],"[(1, b'\x02\x00\x00\x00'), (2, b'@\xb2,\x91__\...","[(1, b'\x02\x00\x00\x00'), (2, b'@\xb2,\x91__\...",,[4],,,"{'ID': {'column_size': 90, 'value_count': 1, '..."
8,0,/warehouse/demo_ns.db/nyc_taxis/data/tpep_pick...,PARQUET,1,110603,1949768,"[(1, 12655), (2, 530870), (3, 534682), (4, 216...","[(1, 110603), (2, 110603), (3, 110603), (4, 11...","[(1, 0), (2, 0), (3, 0), (4, 5944), (5, 0), (6...",[],"[(1, b'\x01\x00\x00\x00'), (2, b'\x00\x80\xe4\...","[(1, b'\x02\x00\x00\x00'), (2, b'\xc0\x9d\xac\...",,[4],,,"{'ID': {'column_size': 12655, 'value_count': 1..."
9,0,/warehouse/demo_ns.db/nyc_taxis/data/tpep_pick...,PARQUET,1,88091,1660712,"[(1, 10222), (2, 468219), (3, 454875), (4, 179...","[(1, 88091), (2, 88091), (3, 88091), (4, 88091...","[(1, 0), (2, 0), (3, 0), (4, 3976), (5, 0), (6...",[],"[(1, b'\x01\x00\x00\x00'), (2, b'\x00\xe0\xbb\...","[(1, b'\x02\x00\x00\x00'), (2, b'\x80\xbbt$\x9...",,[4],,,"{'ID': {'column_size': 10222, 'value_count': 8..."
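
The `lower_bounds` and `upper_bounds` columns hold raw bytes per field ID. The Iceberg spec serializes an `int` as 4 little-endian bytes and a `timestamp` as an 8-byte little-endian count of microseconds since 1970-01-01, so the values can be decoded with `struct`. A minimal sketch follows; `decode_int` and `decode_timestamp` are hypothetical helpers. The timestamp bounds in the output above are truncated, so the sketch round-trips a sample timestamp instead of decoding a real one.

```python
import struct
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def decode_int(raw: bytes) -> int:
    """Decode a 4-byte little-endian signed integer."""
    return struct.unpack("<i", raw)[0]


def decode_timestamp(raw: bytes) -> datetime:
    """Decode an 8-byte little-endian count of microseconds since the epoch."""
    micros = struct.unpack("<q", raw)[0]
    return EPOCH + timedelta(microseconds=micros)


# Bounds of field 1 (the renamed `ID` column) from the first row above.
lower = decode_int(b"\x01\x00\x00\x00")
upper = decode_int(b"\x06\x00\x00\x00")
print("ID bounds:", lower, upper)

# Round-trip a sample timestamp to show the 8-byte encoding.
sample = datetime(2024, 3, 1, 0, 0, 0)
encoded = struct.pack("<q", (sample - EPOCH) // timedelta(microseconds=1))
print(encoded, "->", decode_timestamp(encoded))
```

Query engines can use such per-file bounds to skip files whose value range cannot match a filter.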


## Snapshots

If we want to look at snapshots of the table, we can query the `snapshots` metadata table.

Every time a data change operation is committed, Iceberg creates a new snapshot. In this example, we performed 3 appends, so the table has 3 snapshots.

In [28]:
nyc_taxis_tbl.inspect.snapshots().to_pandas()

Unnamed: 0,committed_at,snapshot_id,parent_id,operation,manifest_list,summary
0,2025-04-06 16:16:15.681,1112881952072389391,,append,/warehouse/demo_ns.db/nyc_taxis/metadata/snap-...,"[(added-files-size, 51811595), (added-data-fil..."
1,2025-04-06 16:16:27.839,3690222227076645093,1.112882e+18,append,/warehouse/demo_ns.db/nyc_taxis/metadata/snap-...,"[(added-files-size, 54024951), (added-data-fil..."
2,2025-04-06 16:16:44.180,393423256772927473,3.690222e+18,append,/warehouse/demo_ns.db/nyc_taxis/metadata/snap-...,"[(added-files-size, 62551912), (added-data-fil..."
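
Each snapshot points at its parent, so the table history forms a chain that can be walked back from the newest snapshot. The sketch below uses the snapshot IDs from this run; `lineage` is a hypothetical helper. The `parent_id` column above is displayed in floating-point notation (for example `1.112882e+18`), most likely because pandas stores a column containing a missing value as floats, so the exact integer IDs are written out here.

```python
# snapshot_id -> parent_id, using the IDs from the outputs in this notebook.
PARENTS = {
    1112881952072389391: None,
    3690222227076645093: 1112881952072389391,
    393423256772927473: 3690222227076645093,
}


def lineage(snapshot_id, parents):
    """Return the chain of snapshot IDs from snapshot_id back to the first one."""
    chain = []
    current = snapshot_id
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain


history = lineage(393423256772927473, PARENTS)
for depth, snapshot_id in enumerate(history):
    print(f"{depth} step(s) back: {snapshot_id}")
```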


# Interoperability with other engines: Spark

Iceberg tables provide engine and platform interoperability. To see how we can use Spark to read tables written by PyIceberg, please use the "Docker" version of the workshop: https://github.com/HonahX/iceberg-summit-workshop