Apache Iceberg is an open table format for huge analytic datasets. It was designed to improve on the performance and usability of Hive-style tables, and it is an alternative to other open table formats such as Hudi and Delta Lake.

PyIceberg is a Python library for interacting with Apache Iceberg tables.

In this tutorial, we will explore the metadata files of Apache Iceberg using PyIceberg and local files only.

```bash
conda install pyiceberg
conda install sqlalchemy
```

In [61]:
import shutil
import os
table_dir = "iceberg_warehouse"

if os.path.exists(table_dir):
    shutil.rmtree(table_dir)

os.makedirs(table_dir, exist_ok=True)

For the sake of demonstration, we'll configure the catalog to use the `SqlCatalog` implementation, which stores its information in a local SQLite database. We'll also configure the catalog to store data files on the local filesystem instead of an object store. This setup should not be used in production because it scales poorly.

In [62]:
from pyiceberg.catalog.sql import SqlCatalog

warehouse_path = os.path.abspath("./iceberg_warehouse")
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    },
)

What is inside this SQLite database?

This database is the actual catalog that lists the Iceberg tables.

In [63]:
catalog.create_namespace("default")

Let's open the SQLite database and look at what is inside 👀
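
One way to peek inside the catalog without an external SQLite browser is Python's built-in `sqlite3` module. The helper below is a minimal sketch: `dump_sqlite` is a hypothetical name, and the path assumes the warehouse folder configured above. The exact tables you see depend on the PyIceberg version, but once a table exists one of them should carry the `metadata_location` column.

```python
import os
import sqlite3


def dump_sqlite(db_path):
    """Return {table_name: (column_names, rows)} for every table in a SQLite file."""
    contents = {}
    conn = sqlite3.connect(db_path)
    try:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        for (name,) in tables:
            cursor = conn.execute(f'SELECT * FROM "{name}"')
            columns = [col[0] for col in cursor.description]
            contents[name] = (columns, cursor.fetchall())
    finally:
        conn.close()
    return contents


# Path of the catalog database (assumes the warehouse folder used in this tutorial)
db_path = os.path.join("iceberg_warehouse", "pyiceberg_catalog.db")

# Only open the file if it already exists; sqlite3.connect would otherwise create it
if os.path.exists(db_path):
    for name, (columns, rows) in dump_sqlite(db_path).items():
        print(name, columns)
        for row in rows:
            print("  ", row)
```

Running this again after each write is a quick way to watch the catalog change.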

In [64]:
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, LongType

# Step 3: Initialize an Iceberg table

# Define the schema for the table.
# Each column needs its own field_id: Iceberg tracks columns by ID, not by name.
schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=False),
    NestedField(field_id=2, name="name", field_type=StringType(), required=False),
)

# Create the table in the catalog. Its files are stored under the warehouse
# path configured above, in default.db/my_table.
table = catalog.create_table("default.my_table", schema)


You can see from the SQLite catalog that there is a reference to a JSON metadata file. Let's open it.
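
The metadata file is plain JSON, so the standard library is enough to read it. The sketch below uses two hypothetical helpers, `load_table_metadata` and `summarize_metadata`, and assumes the location is a local path or a `file://` URI, as in this tutorial. The keys it reads are top-level fields of the Iceberg table metadata format.

```python
import json
from urllib.parse import urlparse
from urllib.request import url2pathname


def load_table_metadata(metadata_location):
    """Load an Iceberg metadata JSON file from a local path or a file:// URI."""
    parsed = urlparse(metadata_location)
    path = url2pathname(parsed.path) if parsed.scheme == "file" else metadata_location
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_metadata(metadata):
    """Pick out a few top-level fields of the table metadata."""
    keys = ["format-version", "table-uuid", "location", "current-snapshot-id"]
    summary = {key: metadata.get(key) for key in keys}
    summary["snapshot-count"] = len(metadata.get("snapshots", []))
    return summary
```

With the table created above, something like `summarize_metadata(load_table_metadata(table.metadata_location))` should work, assuming the `Table` object exposes `metadata_location` as it does in the PyIceberg versions I am aware of.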

...

Now we can write data to the table. PyIceberg integrates nicely with PyArrow: we create an Arrow table and append it to the Iceberg table.

In [65]:
import pyarrow as pa


# Step 4: Add some data to the table

# Define some data
data = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"}
]
# Create a PyArrow Table from the list of dictionaries
arrow_table = pa.Table.from_pylist(data)

# Display the Arrow table before writing it to the Iceberg table
arrow_table

pyarrow.Table
id: int64
name: string
----
id: [[1,2]]
name: [["Alice","Bob"]]

In [66]:
table.append(arrow_table)



In [67]:
table.scan().to_arrow()



pyarrow.Table
id: int64
name: large_string
----
id: [[1,2]]
name: [["Alice","Bob"]]

Now, let's look again at the catalog database. The `metadata_location` has changed. It is now pointing to a new JSON file. We can now look at it. We can see that the `snapshots` list has a record. That record is now referring to a `manifest-list`.
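
To make the link between snapshots and manifest lists explicit, here is a small sketch that works on the parsed metadata JSON (a plain `dict`). The function names are hypothetical, and the example dictionary is made up and heavily trimmed; only the key names (`snapshots`, `snapshot-id`, `manifest-list`, `current-snapshot-id`) come from the Iceberg metadata format.

```python
def manifest_lists_by_snapshot(metadata):
    """Map each snapshot-id in a parsed metadata JSON to its manifest-list path."""
    return {
        snapshot["snapshot-id"]: snapshot.get("manifest-list")
        for snapshot in metadata.get("snapshots", [])
    }


def current_manifest_list(metadata):
    """Return the manifest-list path of the current snapshot, or None if there is none."""
    current_id = metadata.get("current-snapshot-id")
    return manifest_lists_by_snapshot(metadata).get(current_id)


# Hypothetical, heavily trimmed metadata for illustration only
example = {
    "current-snapshot-id": 2,
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "file:///tmp/metadata/snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "file:///tmp/metadata/snap-2.avro"},
    ],
}
print(current_manifest_list(example))
```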

Now, let's jump back to our slides 🛑

...

Let's now open the `manifest-list` file. It is actually an Avro file.

In [57]:

from avro.datafile import DataFileReader
from avro.io import DatumReader

metadata_folder = './iceberg_warehouse/default.db/my_table/metadata'

reader = DataFileReader(open(os.path.join(metadata_folder, 'snap-4588393757938124859-0-01d98fe8-70df-420d-bdb4-b06812f13d9d.avro'), "rb"), DatumReader())
for record in reader:
    # each record of the manifest list is returned as a dictionary
    print(record)
reader.close()

{'manifest_path': 'file:///Users/marcosantoni/Desktop/data-lake-course/local_pyiceberg/iceberg_warehouse/default.db/my_table/metadata/01d98fe8-70df-420d-bdb4-b06812f13d9d-m0.avro', 'manifest_length': 4367, 'partition_spec_id': 0, 'content': 0, 'sequence_number': 1, 'min_sequence_number': 1, 'added_snapshot_id': 4588393757938124859, 'added_files_count': 1, 'existing_files_count': 0, 'deleted_files_count': 0, 'added_rows_count': 2, 'existing_rows_count': 0, 'deleted_rows_count': 0, 'partitions': [], 'key_metadata': None}
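
The snapshot file name used above changes on every run, so it can be handy to look it up rather than hard-code it. The sketch below assumes manifest lists follow the `snap-*.avro` naming seen in this tutorial and orders them by modification time, which is only a heuristic; the authoritative pointer is the `manifest-list` entry in the metadata JSON. `find_manifest_lists` is a hypothetical helper.

```python
import glob
import os


def find_manifest_lists(metadata_folder):
    """Return snap-*.avro files in a metadata folder, oldest first by modification time."""
    paths = glob.glob(os.path.join(metadata_folder, "snap-*.avro"))
    return sorted(paths, key=os.path.getmtime)


metadata_folder = "./iceberg_warehouse/default.db/my_table/metadata"
manifest_lists = find_manifest_lists(metadata_folder)
if manifest_lists:
    print("Latest manifest list:", manifest_lists[-1])
```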


Then, let's look at the actual manifest file.

In [58]:

from avro.datafile import DataFileReader
from avro.io import DatumReader

metadata_folder = './iceberg_warehouse/default.db/my_table/metadata'

reader = DataFileReader(open(os.path.join(metadata_folder, '01d98fe8-70df-420d-bdb4-b06812f13d9d-m0.avro'), "rb"), DatumReader())
for record in reader:
    # each manifest entry is returned as a dictionary
    print(record)
reader.close()

{'status': 1, 'snapshot_id': 4588393757938124859, 'sequence_number': None, 'file_sequence_number': None, 'data_file': {'content': 0, 'file_path': 'file:///Users/marcosantoni/Desktop/data-lake-course/local_pyiceberg/iceberg_warehouse/default.db/my_table/data/00000-0-01d98fe8-70df-420d-bdb4-b06812f13d9d.parquet', 'file_format': 'PARQUET', 'partition': {}, 'record_count': 2, 'file_size_in_bytes': 915, 'column_sizes': [{'key': 1, 'value': 118}, {'key': 2, 'value': 90}], 'value_counts': [{'key': 1, 'value': 2}, {'key': 2, 'value': 2}], 'null_value_counts': [{'key': 1, 'value': 0}, {'key': 2, 'value': 0}], 'nan_value_counts': [], 'lower_bounds': [{'key': 1, 'value': b'\x01\x00\x00\x00\x00\x00\x00\x00'}, {'key': 2, 'value': b'Alice'}], 'upper_bounds': [{'key': 1, 'value': b'\x02\x00\x00\x00\x00\x00\x00\x00'}, {'key': 2, 'value': b'Bob'}], 'key_metadata': None, 'split_offsets': [4], 'equality_ids': None, 'sort_order_id': None}}
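
The `lower_bounds` and `upper_bounds` values are raw bytes. Iceberg serializes a `long` as an 8-byte little-endian integer and a `string` as UTF-8 bytes, so they can be decoded with the standard library. This is a minimal sketch that only handles the two types used in this tutorial; `decode_bound` is a hypothetical helper.

```python
import struct


def decode_bound(raw, iceberg_type):
    """Decode a lower/upper bound value from a manifest entry.

    Only the two types used in this tutorial are handled (an assumption
    made to keep the sketch short).
    """
    if iceberg_type == "long":
        # long values are stored as 8-byte little-endian integers
        return struct.unpack("<q", raw)[0]
    if iceberg_type == "string":
        # strings are stored as UTF-8 bytes
        return raw.decode("utf-8")
    raise ValueError(f"unsupported type: {iceberg_type}")


# Field IDs as they appear in the manifest entry above: 1 is `id`, 2 is `name`
field_types = {1: "long", 2: "string"}

# Bounds copied from the manifest entry printed above
lower_bounds = [
    {"key": 1, "value": b"\x01\x00\x00\x00\x00\x00\x00\x00"},
    {"key": 2, "value": b"Alice"},
]
upper_bounds = [
    {"key": 1, "value": b"\x02\x00\x00\x00\x00\x00\x00\x00"},
    {"key": 2, "value": b"Bob"},
]

decoded_lower = {b["key"]: decode_bound(b["value"], field_types[b["key"]]) for b in lower_bounds}
decoded_upper = {b["key"]: decode_bound(b["value"], field_types[b["key"]]) for b in upper_bounds}
print(decoded_lower)  # {1: 1, 2: 'Alice'}
print(decoded_upper)  # {1: 2, 2: 'Bob'}
```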


/Users/marcosantoni/miniconda3/envs/data_file_formats/lib/python3.12/site-packages/avro/schema.py:1233: IgnoredLogicalType: Unknown map, using array.


What is inside the Parquet file? It contains the actual rows of the Iceberg table, which is why it lives in the `data` folder.

In [None]:
parquet_file = '00000-0-01d98fe8-70df-420d-bdb4-b06812f13d9d.parquet'
table_dir = 'iceberg_warehouse/default.db/my_table/data'

import pyarrow.parquet as pq

table_from_parquet = pq.read_table(os.path.join(table_dir, parquet_file))
table_from_parquet

pyarrow.Table
id: int64
name: string
----
id: [[1,2]]
name: [["Alice","Bob"]]

Let's add another record.

In [68]:
# Define some data
data = [
    {"id": 3, "name": "Daniel"}
]
# Create a PyArrow Table from the list of dictionaries
arrow_table = pa.Table.from_pylist(data)
table.append(arrow_table)



What is now in the SQLite catalog? The `metadata_location` now has a new path. Let's open that JSON file.

Let's open the latest `manifest-list`. It now contains two records, each with its own `manifest_path`. The record for the first insert has `added_rows_count` equal to `2`, while the record for the latest insert has `added_rows_count` equal to `1`.

In [69]:
from avro.datafile import DataFileReader
from avro.io import DatumReader

metadata_folder = './iceberg_warehouse/default.db/my_table/metadata'

reader = DataFileReader(open(os.path.join(metadata_folder, 'snap-5542359964143702113-0-4ebc7486-8441-4f91-8b59-6857b02be11c.avro'), "rb"), DatumReader())
for record in reader:
    # each record of the manifest list is returned as a dictionary
    print(record)
reader.close()

{'manifest_path': 'file:///Users/marcosantoni/Desktop/data-lake-course/local_pyiceberg/iceberg_warehouse/default.db/my_table/metadata/4ebc7486-8441-4f91-8b59-6857b02be11c-m0.avro', 'manifest_length': 4372, 'partition_spec_id': 0, 'content': 0, 'sequence_number': 2, 'min_sequence_number': 2, 'added_snapshot_id': 5542359964143702113, 'added_files_count': 1, 'existing_files_count': 0, 'deleted_files_count': 0, 'added_rows_count': 1, 'existing_rows_count': 0, 'deleted_rows_count': 0, 'partitions': [], 'key_metadata': None}
{'manifest_path': 'file:///Users/marcosantoni/Desktop/data-lake-course/local_pyiceberg/iceberg_warehouse/default.db/my_table/metadata/c8943001-21fe-4384-aadc-9c505d248da3-m0.avro', 'manifest_length': 4368, 'partition_spec_id': 0, 'content': 0, 'sequence_number': 1, 'min_sequence_number': 1, 'added_snapshot_id': 4869232518670708754, 'added_files_count': 1, 'existing_files_count': 0, 'deleted_files_count': 0, 'added_rows_count': 2, 'existing_rows_count': 0, 'deleted_rows_c
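
The row counts in the manifest list already tell us how many rows the table holds, without opening any data file. The sketch below sums them for data manifests (`content` equal to `0`). It ignores delete manifests, so treat it as valid only for append-only tables like this one; `live_row_count` is a hypothetical helper, and the entries are trimmed copies of the two records printed above.

```python
def live_row_count(manifest_list_entries):
    """Sum added and existing rows over all data manifests (content == 0)."""
    return sum(
        entry["added_rows_count"] + entry["existing_rows_count"]
        for entry in manifest_list_entries
        if entry["content"] == 0
    )


# Trimmed copies of the two records printed above (other fields omitted)
entries = [
    {"content": 0, "added_rows_count": 1, "existing_rows_count": 0},
    {"content": 0, "added_rows_count": 2, "existing_rows_count": 0},
]
print(live_row_count(entries))  # 3
```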

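Let's open the manifest file of the latest insert. It stores per-column statistics such as `lower_bounds` and `upper_bounds`, which compute engines can use to skip data files that cannot match a query filter, making queries faster.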

In [70]:
from avro.datafile import DataFileReader
from avro.io import DatumReader

metadata_folder = './iceberg_warehouse/default.db/my_table/metadata'

reader = DataFileReader(open(os.path.join(metadata_folder, '4ebc7486-8441-4f91-8b59-6857b02be11c-m0.avro'), "rb"), DatumReader())
for record in reader:
    # each manifest entry is returned as a dictionary
    print(record)
reader.close()

{'status': 1, 'snapshot_id': 5542359964143702113, 'sequence_number': None, 'file_sequence_number': None, 'data_file': {'content': 0, 'file_path': 'file:///Users/marcosantoni/Desktop/data-lake-course/local_pyiceberg/iceberg_warehouse/default.db/my_table/data/00000-0-4ebc7486-8441-4f91-8b59-6857b02be11c.parquet', 'file_format': 'PARQUET', 'partition': {}, 'record_count': 1, 'file_size_in_bytes': 909, 'column_sizes': [{'key': 1, 'value': 110}, {'key': 2, 'value': 88}], 'value_counts': [{'key': 1, 'value': 1}, {'key': 2, 'value': 1}], 'null_value_counts': [{'key': 1, 'value': 0}, {'key': 2, 'value': 0}], 'nan_value_counts': [], 'lower_bounds': [{'key': 1, 'value': b'\x03\x00\x00\x00\x00\x00\x00\x00'}, {'key': 2, 'value': b'Daniel'}], 'upper_bounds': [{'key': 1, 'value': b'\x03\x00\x00\x00\x00\x00\x00\x00'}, {'key': 2, 'value': b'Daniel'}], 'key_metadata': None, 'split_offsets': [4], 'equality_ids': None, 'sort_order_id': None}}
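
To illustrate how these bounds help, here is a simplified sketch of file pruning for an equality filter on the `id` column. It is not how any particular engine implements it; `may_contain` and `decode_long` are hypothetical helpers, and the two dictionaries are trimmed copies of the `data_file` records printed in this tutorial.

```python
import struct


def decode_long(raw):
    """Decode an 8-byte little-endian long, as stored in manifest bounds."""
    return struct.unpack("<q", raw)[0]


def may_contain(data_file, field_id, value):
    """Return False only if the file's bounds prove `value` cannot be in the column."""
    lower = {b["key"]: b["value"] for b in data_file["lower_bounds"]}
    upper = {b["key"]: b["value"] for b in data_file["upper_bounds"]}
    if field_id not in lower or field_id not in upper:
        return True  # no statistics, so the file has to be read
    return decode_long(lower[field_id]) <= value <= decode_long(upper[field_id])


# Bounds of the `id` column (field ID 1) for the two data files written so far
first_file = {
    "lower_bounds": [{"key": 1, "value": b"\x01\x00\x00\x00\x00\x00\x00\x00"}],
    "upper_bounds": [{"key": 1, "value": b"\x02\x00\x00\x00\x00\x00\x00\x00"}],
}
second_file = {
    "lower_bounds": [{"key": 1, "value": b"\x03\x00\x00\x00\x00\x00\x00\x00"}],
    "upper_bounds": [{"key": 1, "value": b"\x03\x00\x00\x00\x00\x00\x00\x00"}],
}

files = {"first append": first_file, "second append": second_file}

# Which files would a query with the filter `id = 3` have to read?
to_read = [name for name, f in files.items() if may_contain(f, field_id=1, value=3)]
print(to_read)  # ['second append']
```

For the filter `id = 3`, the first file can be skipped entirely because its `id` values lie between 1 and 2.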


/Users/marcosantoni/miniconda3/envs/data_file_formats/lib/python3.12/site-packages/avro/schema.py:1233: IgnoredLogicalType: Unknown map, using array.


Finally, let's open the data file.

In [71]:
parquet_file = '00000-0-4ebc7486-8441-4f91-8b59-6857b02be11c.parquet'
table_dir = 'iceberg_warehouse/default.db/my_table/data'

import pyarrow.parquet as pq

table_from_parquet = pq.read_table(os.path.join(table_dir, parquet_file))
table_from_parquet

pyarrow.Table
id: int64
name: string
----
id: [[3]]
name: [["Daniel"]]

Each append operation creates a new snapshot of the table and a new data file.
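
The same story can be read from the `snapshots` list of the latest metadata JSON, where each snapshot may carry a `summary`. The sketch below extracts a few summary fields; they are optional in the Iceberg format, so missing values come back as `None`. `snapshot_history` is a hypothetical helper and the example metadata is made up for illustration, not copied from a real run.

```python
def snapshot_history(metadata):
    """List (snapshot-id, operation, added-data-files, total-records) per snapshot."""
    history = []
    for snapshot in metadata.get("snapshots", []):
        summary = snapshot.get("summary", {})
        history.append(
            (
                snapshot["snapshot-id"],
                summary.get("operation"),
                summary.get("added-data-files"),
                summary.get("total-records"),
            )
        )
    return history


# Hypothetical, trimmed metadata shaped like the result of two appends
example = {
    "snapshots": [
        {"snapshot-id": 1, "summary": {"operation": "append", "added-data-files": "1", "total-records": "2"}},
        {"snapshot-id": 2, "summary": {"operation": "append", "added-data-files": "1", "total-records": "3"}},
    ]
}
for row in snapshot_history(example):
    print(row)
```

Note that summary values are stored as strings in the metadata file, so convert them before doing arithmetic.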