# Inside the Iceberg: Metadata Structures

In the previous notebook, we saw Iceberg from the outside - creating tables, appending data, querying. Now let's look inside to understand exactly how Iceberg works.

We'll explore:

* **The Catalog Database**: What's in the SQLite file?
* **Metadata JSON Files**: Schema, snapshots, and table history
* **Manifest Files (AVRO)**: Lists of data files with statistics
* **Data Files (Parquet)**: The actual data
* **The Complete Picture**: How a query uses all these pieces

By the end, you'll understand how Iceberg achieves atomic commits, time travel, and fast queries.

In [None]:
import daft
import pyarrow as pa
from pathlib import Path
from pyiceberg.catalog.sql import SqlCatalog
import json  # used below to parse the metadata JSON files
import sqlite3
from datetime import datetime

%reload_ext autoreload
%autoreload 2
from helpers import inspect_iceberg_table, inspect_metadata_json, inspect_manifest

## Setup: Create a Table with History

First, let's create a table with multiple snapshots so we have interesting metadata to explore.

In [None]:
# Setup warehouse
warehouse_path = Path('../data/warehouse_metadata').absolute()
warehouse_path.mkdir(parents=True, exist_ok=True)
catalog_db = warehouse_path / 'catalog.db'
catalog_db.unlink(missing_ok=True)

catalog = SqlCatalog(
    'metadata_demo',
    **{'uri': f'sqlite:///{catalog_db}', 'warehouse': f'file://{warehouse_path}'}
)
catalog.create_namespace('demo')
print("✅ Catalog initialized")

In [None]:
# Load events data and create table with multiple operations
df_events = daft.read_json('../data/input/events.jsonl')

# Snapshot 1: Initial data
print("Snapshot 1: Creating initial load...")
df_batch1 = df_events.limit(30000)
arrow_table = df_batch1.to_arrow()
events_table = catalog.create_table('demo.events', schema=pa.schema(arrow_table.schema))
events_table.append(arrow_table)
print(f"Snapshot 1: Appended {len(arrow_table):,} records")

# Snapshot 2: Append more
print("Snapshot 2: Appending more data...")
df_batch2 = df_events.offset(30000).limit(30000)
arrow_table = df_batch2.to_arrow()
events_table.append(arrow_table)
print(f"Snapshot 2: Appended {len(arrow_table):,} more records")

# Snapshot 3: Delete some records
print("Snapshot 3: Deleting LocationUpdate events...")
events_table.delete("type = 'c8y_LocationUpdate'")
print("Snapshot 3: Deleted LocationUpdate events")

print(f"\n✅ Created table with {len(events_table.history())} snapshots")

## The Catalog Database

The catalog database is the **entry point** to all Iceberg tables. It's a simple SQLite database (in our case) that stores:

* **Table locations**: Where each table's metadata lives
* **Namespace properties**: Configuration for database schemas
* **Atomic pointers**: Current metadata file for each table

### Why Use a Catalog?

The catalog enables **atomic commits**. When a writer updates a table:

1. Write new metadata JSON file
2. Update catalog pointer atomically (SQL UPDATE)
3. If UPDATE fails → commit failed, retry

The catalog is the **single source of truth** for which metadata file is current.

### Inspecting the Catalog

Let's look inside the SQLite database:

In [None]:
# Connect to the catalog database
conn = sqlite3.connect(catalog_db)
cursor = conn.cursor()

# Show all tables in the catalog database
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = cursor.fetchall()
print("Tables in catalog database:")
for table in tables:
    print(f"  • {table[0]}")

In [None]:
# Show the schema of iceberg_tables
cursor.execute("PRAGMA table_info(iceberg_tables)")
columns = cursor.fetchall()
print("Schema of 'iceberg_tables':")
for col in columns:
    print(f"  {col[1]}: {col[2]}")

In [None]:
# Query the iceberg_tables table
cursor.execute("SELECT * FROM iceberg_tables")
rows = cursor.fetchall()

print("Registered Iceberg tables:\n")
for row in rows:
    catalog_name, namespace, table_name, metadata_location, prev_metadata = row
    print(f"Table: {namespace}.{table_name}")
    print(f"  Current metadata: {Path(metadata_location).name}")
    if prev_metadata:
        print(f"  Previous metadata: {Path(prev_metadata).name}")
    print()

### What We See

The `iceberg_tables` table has:

* **metadata_location**: Points to the **current** metadata JSON file
* **previous_metadata_location**: Points to the **previous** metadata JSON file

This is how Iceberg achieves **atomic commits**:

```sql
UPDATE iceberg_tables
SET metadata_location = 'new_metadata.json',
    previous_metadata_location = 'old_metadata.json'
WHERE table_name = 'events'
  AND metadata_location = 'old_metadata.json'  -- Optimistic lock!
```

If two writers try to commit at the same time:
- First succeeds (updates the row)
- Second fails (WHERE clause doesn't match anymore)
- Second must retry with the new metadata

This is **optimistic concurrency control**!
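
The race between two writers can be reproduced with a few lines of `sqlite3`. This is a minimal sketch, not PyIceberg's actual commit code: it uses an in-memory database, a simplified `iceberg_tables` table with only three columns, and made-up metadata file names. The helper `try_commit` is hypothetical.

```python
import sqlite3

# In-memory stand-in for the catalog; simplified schema, assumed for illustration
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iceberg_tables ("
    "table_name TEXT PRIMARY KEY, "
    "metadata_location TEXT, "
    "previous_metadata_location TEXT)"
)
conn.execute("INSERT INTO iceberg_tables VALUES ('events', 'v1.metadata.json', NULL)")
conn.commit()

def try_commit(conn, table_name, expected, new):
    """Swap the metadata pointer only if it still equals `expected`."""
    cur = conn.execute(
        "UPDATE iceberg_tables "
        "SET metadata_location = ?, previous_metadata_location = ? "
        "WHERE table_name = ? AND metadata_location = ?",
        (new, expected, table_name, expected),
    )
    conn.commit()
    # rowcount is 1 if the WHERE clause matched, 0 if another writer got there first
    return cur.rowcount == 1

# Both writers started from v1 and now try to publish their own new metadata file
writer_a_ok = try_commit(conn, "events", "v1.metadata.json", "v2-a.metadata.json")
writer_b_ok = try_commit(conn, "events", "v1.metadata.json", "v2-b.metadata.json")

current = conn.execute(
    "SELECT metadata_location FROM iceberg_tables WHERE table_name = 'events'"
).fetchone()[0]
print(writer_a_ok, writer_b_ok, current)  # True False v2-a.metadata.json
```

Writer B sees zero updated rows, so it knows its base metadata is stale and must re-read the table before retrying.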

In [None]:
conn.close()

## Metadata JSON Files

Each commit creates a **new metadata JSON file**. This file contains:

* **Schema versions**: All schema versions (for time travel)
* **Partition specs**: All partition specs (for partition evolution)
* **Snapshots**: All snapshots with their manifest lists
* **Snapshot log**: Chronological list of snapshots
* **Current snapshot ID**: Pointer to the current snapshot
* **Metadata log**: History of metadata files

Let's find and inspect a metadata JSON file:

In [None]:
# Find metadata files
table_dir = Path(events_table.location().replace('file://', ''))
metadata_files = sorted(table_dir.glob('metadata/*.metadata.json'))

print(f"Found {len(metadata_files)} metadata file(s):")
for i, mf in enumerate(metadata_files, 1):
    size = mf.stat().st_size
    print(f"  {i}. {mf.name} ({size:,} bytes, {size/1024:.1f} KB)")

# Use the latest metadata file
latest_metadata = metadata_files[-1]
print(f"\nUsing latest: {latest_metadata.name}")

In [None]:
# Parse the metadata JSON
with open(latest_metadata, 'r') as f:
    metadata = json.load(f)

# Show top-level structure
print("Top-level keys in metadata JSON:")
for key in metadata.keys():
    value = metadata[key]
    if isinstance(value, list):
        print(f"  • {key}: list with {len(value)} item(s)")
    elif isinstance(value, dict):
        print(f"  • {key}: dict with {len(value)} key(s)")
    else:
        print(f"  • {key}: {type(value).__name__}")

In [None]:
# Show key metadata values
print("Key metadata values:\n")
print(f"Table UUID: {metadata['table-uuid']}")
print(f"Format version: {metadata['format-version']}")
print(f"Location: {metadata['location']}")
print(f"Last updated: {datetime.fromtimestamp(metadata['last-updated-ms']/1000).strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Current snapshot ID: {metadata['current-snapshot-id']}")
print(f"Last sequence number: {metadata['last-sequence-number']}")
print(f"\nNumber of schemas: {len(metadata['schemas'])}")
print(f"Number of partition specs: {len(metadata['partition-specs'])}")
print(f"Number of snapshots: {len(metadata['snapshots'])}")

### Schemas in Metadata

Iceberg stores **all schema versions** in the metadata. Each schema has a unique ID.

When you read a snapshot, Iceberg uses the schema that was current at that snapshot. This enables:
* **Time travel with old schemas**
* **Schema evolution without rewrites**
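
The schema lookup for a snapshot is a simple join on `schema-id`. The sketch below runs on a hand-written `toy_metadata` dictionary (an assumption for illustration, not read from the table), and `schema_for_snapshot` is a hypothetical helper rather than a PyIceberg function.

```python
# Toy metadata with two schema versions; values are made up for illustration
toy_metadata = {
    "current-schema-id": 1,
    "schemas": [
        {"schema-id": 0, "fields": [{"id": 1, "name": "type", "type": "string"}]},
        {"schema-id": 1, "fields": [
            {"id": 1, "name": "type", "type": "string"},
            {"id": 2, "name": "source", "type": "string"},
        ]},
    ],
    "snapshots": [
        {"snapshot-id": 100, "schema-id": 0},
        {"snapshot-id": 200, "schema-id": 1},
    ],
}

def schema_for_snapshot(metadata, snapshot_id):
    """Return the schema dict that was current when `snapshot_id` was written."""
    snapshot = next(s for s in metadata["snapshots"] if s["snapshot-id"] == snapshot_id)
    # Fall back to the table's current schema if the snapshot carries no schema-id
    schema_id = snapshot.get("schema-id", metadata["current-schema-id"])
    return next(s for s in metadata["schemas"] if s["schema-id"] == schema_id)

old_fields = [f["name"] for f in schema_for_snapshot(toy_metadata, 100)["fields"]]
new_fields = [f["name"] for f in schema_for_snapshot(toy_metadata, 200)["fields"]]
print(old_fields)  # ['type']
print(new_fields)  # ['type', 'source']
```

Because fields are tracked by ID rather than by name or position, adding the `source` column changes only the metadata; no data file has to be rewritten.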

Let's inspect the schemas:

In [None]:
# Show all schemas
for schema in metadata['schemas']:
    print(f"Schema ID {schema['schema-id']}:")
    print(f"  Fields: {len(schema['fields'])}")
    for field in schema['fields'][:3]:  # Show first 3 fields
        print(f"    • {field['name']} (ID {field['id']}): {field['type']}")
    if len(schema['fields']) > 3:
        print(f"    ... and {len(schema['fields']) - 3} more fields")
    print()

### Snapshots in Metadata

Each snapshot represents a **commit** to the table. Snapshots contain:

* **snapshot-id**: Unique identifier
* **timestamp-ms**: When this snapshot was created
* **manifest-list**: Path to AVRO file listing manifests
* **schema-id**: Which schema version to use
* **summary**: Statistics (operation, files added/deleted, records added/deleted)

Let's inspect the snapshots:

In [None]:
# Show all snapshots
print(f"Total snapshots: {len(metadata['snapshots'])}\n")

for i, snapshot in enumerate(metadata['snapshots'], 1):
    print(f"Snapshot {i}: ID {snapshot['snapshot-id']}")
    print(f"  Timestamp: {datetime.fromtimestamp(snapshot['timestamp-ms']/1000).strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"  Sequence number: {snapshot.get('sequence-number', 'N/A')}")
    print(f"  Schema ID: {snapshot['schema-id']}")
    print(f"  Manifest list: {Path(snapshot['manifest-list']).name}")

    if 'summary' in snapshot:
        print("  Summary:")
        for key, value in sorted(snapshot['summary'].items()):
            print(f"    {key}: {value}")
    print()

### Snapshot Log

The `snapshot-log` is a chronological list of snapshots with timestamps. This enables:
* **Time travel by timestamp**: "Show me data as of 2024-12-01"
* **Audit trail**: When was each commit made?

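The log is separate from `snapshots` because it answers a different question: `snapshots` lists every snapshot that is still valid, while `snapshot-log` records each change of the table's *current* snapshot, including rollbacks to an earlier one.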
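
Time travel by timestamp amounts to finding the last log entry at or before the requested time. Here is a minimal sketch on a hand-written log; the entries in `toy_log` and the helper `snapshot_as_of` are assumptions for illustration, not part of PyIceberg.

```python
# Toy snapshot log; timestamps are milliseconds since the epoch, values are made up
toy_log = [
    {"timestamp-ms": 1000, "snapshot-id": 11},
    {"timestamp-ms": 2000, "snapshot-id": 22},
    {"timestamp-ms": 3000, "snapshot-id": 33},
]

def snapshot_as_of(snapshot_log, as_of_ms):
    """Return the ID of the snapshot that was current at `as_of_ms`, or None."""
    candidates = [e for e in snapshot_log if e["timestamp-ms"] <= as_of_ms]
    if not candidates:
        return None  # the table did not exist yet at that time
    return max(candidates, key=lambda e: e["timestamp-ms"])["snapshot-id"]

print(snapshot_as_of(toy_log, 2500))  # 22
print(snapshot_as_of(toy_log, 500))   # None
```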

In [None]:
# Show snapshot log
print("Snapshot log (chronological):")
for entry in metadata['snapshot-log']:
    snap_id = entry['snapshot-id']
    timestamp = datetime.fromtimestamp(entry['timestamp-ms']/1000).strftime('%Y-%m-%d %H:%M:%S')
    is_current = snap_id == metadata['current-snapshot-id']
    marker = " ← CURRENT" if is_current else ""
    print(f"  {timestamp}: Snapshot {snap_id}{marker}")

### Metadata Log

The `metadata-log` tracks which metadata files existed and when. This is used for:
* **Metadata file expiration**: Clean up old metadata files
* **Debugging**: Understand table history
* **Consistency checks**: Verify metadata chain

In [None]:
# Show metadata log
if 'metadata-log' in metadata:
    print("Metadata file history:")
    for entry in metadata['metadata-log']:
        meta_file = Path(entry['metadata-file']).name
        timestamp = datetime.fromtimestamp(entry['timestamp-ms']/1000).strftime('%Y-%m-%d %H:%M:%S')
        print(f"  {timestamp}: {meta_file}")
else:
    print("No metadata log (first metadata file)")

### Using the Helper Function

Now let's use our helper to visualize the metadata in a more readable format:

In [None]:
inspect_metadata_json(latest_metadata)

## Manifest Files (AVRO)

Manifests are the **index** that tells Iceberg which data files exist and where they are. The hierarchy is:

```
Snapshot
  └─ Manifest List (AVRO) ← Points to one or more manifests
       ├─ Manifest 1 (AVRO) ← Lists a set of data files with statistics
       ├─ Manifest 2 (AVRO) ← Lists another set of data files
       └─ Manifest N (AVRO) ← Lists another set of data files
```

Each **manifest file** contains:
* **Data file paths**: Where the Parquet files are
* **Partition values**: What partition each file belongs to
* **Statistics**: Record counts, min/max values, null counts
* **File metadata**: Size, format, compression

This metadata enables **predicate pushdown** - skipping files without reading them.
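
The skipping decision for an equality filter is a range check against each file's min/max statistics. The sketch below uses hand-written statistics that are already decoded to strings and keyed by column name (both assumptions for readability; real manifests key them by field ID and store binary values). `may_contain` is a hypothetical helper.

```python
# Hypothetical per-file statistics, already decoded to Python strings
toy_files = [
    {"path": "a.parquet",
     "lower": {"type": "c8y_Alarm"}, "upper": {"type": "c8y_Event"}},
    {"path": "b.parquet",
     "lower": {"type": "c8y_LocationUpdate"}, "upper": {"type": "c8y_Measurement"}},
]

def may_contain(file_stats, column, value):
    """True if `value` falls inside the file's [lower, upper] range for `column`."""
    lower = file_stats["lower"].get(column)
    upper = file_stats["upper"].get(column)
    if lower is None or upper is None:
        return True  # no statistics: the file cannot be ruled out
    return lower <= value <= upper

kept = [f["path"] for f in toy_files if may_contain(f, "type", "c8y_Event")]
print(kept)  # ['a.parquet']
```

A file that passes the check is not guaranteed to contain the value; the statistics can only prove that a file is irrelevant, never that it is relevant.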

### Finding Manifest Files

Manifest files are named with the pattern `<uuid>-m<N>.avro`.

In [None]:
# Find manifest files
manifest_files = sorted(table_dir.glob('metadata/*-m*.avro'))
print(f"Found {len(manifest_files)} manifest file(s):")
for mf in manifest_files:
    size = mf.stat().st_size
    print(f"  • {mf.name} ({size:,} bytes)")

# Pick first manifest to inspect
if manifest_files:
    manifest_to_inspect = manifest_files[0]
    print(f"\nWill inspect: {manifest_to_inspect.name}")

### Reading Manifest Files

Manifests are AVRO files. Let's read one and see what's inside.

If you don't have `fastavro` installed, run: `pip install fastavro`

In [None]:
try:
    import fastavro

    # Read the manifest
    with open(manifest_to_inspect, 'rb') as f:
        reader = fastavro.reader(f)
        records = list(reader)

    print(f"Manifest contains {len(records)} entry(ies)\n")

    # Show first entry in detail
    if records:
        entry = records[0]
        print("First entry structure:")
        print(f"  Status: {entry.get('status', 'N/A')}  (0=EXISTING, 1=ADDED, 2=DELETED)")

        data_file = entry.get('data_file', {})
        print(f"\n  Data file:")
        print(f"    Path: {Path(data_file.get('file_path', 'N/A')).name}")
        print(f"    Format: {data_file.get('file_format', 'N/A')}")
        print(f"    Records: {data_file.get('record_count', 0):,}")
        print(f"    Size: {data_file.get('file_size_in_bytes', 0):,} bytes")

        def as_pairs(m):
            # Iceberg stores int-keyed maps as Avro arrays of {'key', 'value'} records,
            # so fastavro returns a list of dicts here rather than a dict
            return list(m.items()) if isinstance(m, dict) else [(kv['key'], kv['value']) for kv in m]

        if data_file.get('value_counts'):
            print(f"\n    Value counts (first 3 columns, keyed by field ID):")
            for col, count in as_pairs(data_file['value_counts'])[:3]:
                print(f"      {col}: {count:,}")

        if data_file.get('lower_bounds'):
            print(f"\n    Lower bounds (first 2):")
            for col, val in as_pairs(data_file['lower_bounds'])[:2]:
                print(f"      {col}: {val!r}")

        if data_file.get('upper_bounds'):
            print(f"\n    Upper bounds (first 2):")
            for col, val in as_pairs(data_file['upper_bounds'])[:2]:
                print(f"      {col}: {val!r}")

except ImportError:
    print("⚠️  fastavro not installed. Install with: pip install fastavro")
    print("   We'll skip the detailed manifest inspection.")
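
The lower and upper bounds printed by the cell above are raw bytes. Iceberg serializes each bound in a compact binary form: strings as UTF-8, and numeric types as little-endian values. The sketch below decodes a few common types; `decode_bound` is a hypothetical helper that covers only the types listed, and you would need to look up each column's type in the schema by its field ID before calling it.

```python
import struct

def decode_bound(raw, iceberg_type):
    """Decode one lower/upper bound value from its binary form."""
    if iceberg_type == "string":
        return raw.decode("utf-8")
    if iceberg_type == "int":
        return struct.unpack("<i", raw)[0]   # 4-byte little-endian
    if iceberg_type == "long":
        return struct.unpack("<q", raw)[0]   # 8-byte little-endian
    if iceberg_type == "double":
        return struct.unpack("<d", raw)[0]   # 8-byte little-endian IEEE 754
    raise ValueError(f"type not handled in this sketch: {iceberg_type}")

print(decode_bound(b"c8y_Event", "string"))            # c8y_Event
print(decode_bound(struct.pack("<q", 30000), "long"))  # 30000
```

Keep in mind that string bounds may be truncated by the writer, so a decoded bound can be a prefix of the real minimum or maximum rather than the full value.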

### Using the Helper Function

Let's use our helper to visualize the manifest:

In [None]:
if manifest_files:
    inspect_manifest(manifest_to_inspect)
else:
    print("No manifest files found")

## Data Files (Parquet)

Finally, the actual data! Data files are standard **Parquet files**. Iceberg doesn't change Parquet - it just tracks them in manifests.

Key properties:
* **Immutable**: Once written, never modified
* **Referenced by manifests**: Manifests point to data files
* **Multiple files per table**: Each append creates new files
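* **Deletes never modify files in place**: Depending on the write mode, Iceberg either writes separate delete files that mark rows as deleted, or writes new data files without the deleted rows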

Let's find and inspect a data file:

In [None]:
# Find data files
data_files = sorted(table_dir.glob('data/*.parquet'))
print(f"Found {len(data_files)} data file(s):")

total_size = 0
for df in data_files:
    size = df.stat().st_size
    total_size += size
    print(f"  • {df.name} ({size / 1024 / 1024:.2f} MB)")

print(f"\nTotal data size: {total_size / 1024 / 1024:.2f} MB")

In [None]:
# Inspect first data file with PyArrow
if data_files:
    import pyarrow.parquet as pq

    data_file = data_files[0]
    pq_file = pq.ParquetFile(data_file)

    print(f"Inspecting: {data_file.name}\n")
    print(f"Total rows: {pq_file.metadata.num_rows:,}")
    print(f"Total columns: {pq_file.metadata.num_columns}")
    print(f"Row groups: {pq_file.metadata.num_row_groups}")
    print(f"Format version: {pq_file.metadata.format_version}")
    print(f"Created by: {pq_file.metadata.created_by}")

    print(f"\nSchema:")
    for i, field in enumerate(pq_file.schema):
        print(f"  {i+1}. {field.name}: {field.physical_type}")

## The Complete Picture: Tracing a Query

Now let's trace how a query uses all these metadata structures:

```
SELECT * FROM events WHERE type = 'c8y_Event' AND time > '2024-01-01'
```

### Step-by-Step Query Execution

1. **Catalog Lookup** (SQLite)
   - Query: `SELECT metadata_location FROM iceberg_tables WHERE table_name = 'events'`
   - Result: Path to current metadata JSON file

2. **Read Metadata JSON**
   - Parse: `current-snapshot-id`
   - Find snapshot with that ID
   - Get: `manifest-list` path

3. **Read Manifest List** (AVRO)
   - Lists all manifest files for this snapshot
   - Each entry summarizes the partition value ranges its manifest covers
   - On a partitioned table, manifests whose ranges can't match the filter are **skipped without being opened**

4. **Read Manifests** (AVRO)
   - Each remaining manifest lists its data files
   - Every entry carries per-file statistics: `lower_bounds`, `upper_bounds`, null counts (keyed by field ID; column names are used here for readability)

5. **Predicate Pushdown on Files**
   - For each data file in the remaining manifests:
     - `type = 'c8y_Event'` can match only if `lower_bounds['type']` ≤ 'c8y_Event' ≤ `upper_bounds['type']`
     - `time > '2024-01-01'` can match only if `upper_bounds['time']` > '2024-01-01'
   - Skip files where either check fails

6. **Read Data Files** (Parquet)
   - Read only files that passed predicate pushdown
   - Within each file, read only necessary columns
   - Apply row-level filters

This is why Iceberg is fast - it reads minimal metadata to skip most of the data!
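
The whole lookup chain can be sketched with plain dictionaries standing in for the four layers. Everything in this example is made up for illustration: the file names, the statistics, and the `plan_scan` helper, which is not a PyIceberg function. Bounds are shown as already-decoded strings keyed by column name.

```python
# Toy stand-ins for catalog, metadata JSON, manifest lists and manifests
toy_catalog = {"demo.events": "v3.metadata.json"}
toy_metadata_files = {
    "v3.metadata.json": {
        "current-snapshot-id": 3,
        "snapshots": [
            {"snapshot-id": 2, "manifest-list": "snap-2.avro"},
            {"snapshot-id": 3, "manifest-list": "snap-3.avro"},
        ],
    }
}
toy_manifest_lists = {"snap-2.avro": ["m0.avro"], "snap-3.avro": ["m0.avro", "m1.avro"]}
toy_manifests = {
    "m0.avro": [
        {"path": "f1.parquet", "type": ("c8y_Alarm", "c8y_Event"),
         "time": ("2023-06-01", "2023-12-31")},
        {"path": "f2.parquet", "type": ("c8y_Event", "c8y_Event"),
         "time": ("2024-01-15", "2024-02-01")},
    ],
    "m1.avro": [
        {"path": "f3.parquet", "type": ("c8y_Measurement", "c8y_Measurement"),
         "time": ("2024-03-01", "2024-03-31")},
    ],
}

def plan_scan(table, event_type, after):
    """Return the data files that may hold rows with type = event_type AND time > after."""
    # Catalog lookup, then resolve the current snapshot inside the metadata JSON
    metadata = toy_metadata_files[toy_catalog[table]]
    snapshot = next(s for s in metadata["snapshots"]
                    if s["snapshot-id"] == metadata["current-snapshot-id"])
    selected = []
    # Manifest list -> manifests -> per-file (lower, upper) statistics
    for manifest in toy_manifest_lists[snapshot["manifest-list"]]:
        for f in toy_manifests[manifest]:
            type_lo, type_hi = f["type"]
            _, time_hi = f["time"]
            if type_lo <= event_type <= type_hi and time_hi > after:
                selected.append(f["path"])
    return selected

files_to_read = plan_scan("demo.events", "c8y_Event", "2024-01-01")
print(files_to_read)  # ['f2.parquet']
```

Two of the three files are ruled out from metadata alone: `f1.parquet` ends before the requested time range, and `f3.parquet` holds no value that could equal `c8y_Event`.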

### Visualizing the Hierarchy

In [None]:
# Show the complete metadata hierarchy
print("Complete Iceberg Metadata Hierarchy:\n")
print("1. 📚 Catalog (SQLite)")
print(f"   {catalog_db.name}")
print(f"   └─ Table: demo.events → {latest_metadata.name}")
print()
print("2. 📄 Metadata JSON")
print(f"   {latest_metadata.name}")
print(f"   ├─ Schema: {len(metadata['schemas'])} version(s)")
print(f"   ├─ Partition specs: {len(metadata['partition-specs'])}")
print(f"   └─ Snapshots: {len(metadata['snapshots'])}")
print()
print("3. 📦 Manifest Files (AVRO)")
for i, mf in enumerate(manifest_files, 1):
    print(f"   {mf.name} ({mf.stat().st_size:,} bytes)")
print()
print("4. 💾 Data Files (Parquet)")
for i, df in enumerate(data_files, 1):
    print(f"   {df.name} ({df.stat().st_size / 1024 / 1024:.2f} MB)")
print()
print(f"Total metadata overhead: {sum(mf.stat().st_size for mf in metadata_files + manifest_files) / 1024:.1f} KB")
print(f"Total data size: {sum(df.stat().st_size for df in data_files) / 1024 / 1024:.2f} MB")

## Review Questions

Test your understanding:

1. **Why store metadata in multiple JSON files instead of one?**
   - Hint: Think about atomicity and append-only operations.

2. **What would happen if you directly edited a data file?**
   - Would the manifest notice? Would queries see your changes?

3. **How does Iceberg achieve atomic commits with SQLite?**
   - What SQL statement is used? What makes it atomic?

4. **Why separate manifest lists from manifest files?**
   - Why not put all data files in one manifest?

5. **How does predicate pushdown work?**
   - At what levels can files/partitions be skipped?

6. **What's the metadata overhead for this table?**
   - Calculate: metadata size / data size
   - Is this reasonable?

## Hands-on Challenge

### Challenge 1: Parse Metadata Manually

1. Open the latest metadata JSON in a text editor
2. Find the `current-snapshot-id`
3. Locate that snapshot in the `snapshots` array
4. Extract the `manifest-list` path
5. Verify this file exists in the metadata directory

### Challenge 2: Analyze Manifest Statistics

1. Read a manifest file using fastavro
2. For each data file entry, extract:
   - Record count
   - File size
   - Lower/upper bounds for 'type' column
3. Calculate: total records, average file size

### Challenge 3: Simulate Predicate Pushdown

1. Write a query filter: `type = 'c8y_Measurement'`
2. Read manifests and check statistics
3. Count how many files would be skipped
4. Calculate: % of data skipped

Use the cells below:

In [None]:
# Challenge 1: Your code here


In [None]:
# Challenge 2: Your code here


In [None]:
# Challenge 3: Your code here


## Summary

In this deep dive, we explored:

* **Catalog Database**: The atomic pointer to current metadata
  - Enables optimistic concurrency control
  - Single UPDATE statement makes commits atomic

* **Metadata JSON**: Complete table state
  - All schema versions (for time travel)
  - All snapshots with manifest lists
  - Snapshot log for temporal queries
  - Metadata log for file management

* **Manifest Files**: Index of data files
  - AVRO format for efficiency
  - Per-file statistics for pruning
  - Partition information
  - Enables predicate pushdown

* **Data Files**: Immutable Parquet
  - Never modified after creation
  - Referenced by manifests
  - Standard Parquet format

### Key Insights

1. **Metadata is append-only**: Each commit writes new files; old ones are kept until they are explicitly expired
2. **Catalog is the single source of truth**: Points to current metadata
3. **Statistics enable pruning**: Skip files/partitions without reading
4. **Everything is versioned**: Time travel works by reading old snapshots
5. **Minimal metadata overhead**: ~KB of metadata per GB of data

### What's Next?

Now that you understand the internal structures, we'll explore:
* **Time travel**: Using snapshots for historical queries
* **Schema evolution**: How column changes work in metadata
* **Concurrency**: Simulating optimistic locking conflicts
* **Partitioning**: Managing millions of files efficiently