# Getting Started with Apache Iceberg

This notebook provides a comprehensive introduction to Apache Iceberg:

* What is Iceberg?
* Setting up Iceberg
* Creating and managing tables
* Appending data
* Querying tables
* Basic operations

After going through this notebook, you should understand what Iceberg is and how to perform basic table operations.

## What is Apache Iceberg?

Iceberg is a table format for huge analytic datasets. Think of it as "Git for data":

* **Parquet** is a **file** format (like individual source code files)
* **Iceberg** is a **table** format (like a Git repository that versions those files)

The Iceberg table format is [standardized under the umbrella of the Apache Software Foundation](https://iceberg.apache.org/terms/). 


### Key Problems Iceberg Solves

Working with raw Parquet files in a data lake has challenges:

1. **No ACID transactions**: What if a write fails halfway through, or a query reads the files while a write is still in progress? You get inconsistent data.
2. **No schema evolution**: Adding a column means rewriting all files.
3. **Slow queries**: Query engines must list and scan all files to find relevant data.
4. **No time travel**: Can't query data as it was yesterday.
5. **Unsafe concurrent writes**: Two writers can corrupt each other's changes.

### How Iceberg Fixes This

Iceberg adds a **metadata layer** on top of Parquet files:

```
Traditional Data Lake       Iceberg Data Lake
┌─────────────────┐        ┌─────────────────┐
│  Parquet Files  │        │ Iceberg Catalog │  ← Points to current metadata
│  (scattered,    │        │                 │
│   untracked)    │        └────────┬────────┘
└─────────────────┘                 │
                            ┌───────▼────────┐
                            │ Metadata JSON  │  ← Schema, snapshots, history
                            └───────┬────────┘
                            ┌───────▼────────┐
                            │ Manifest Files │  ← Lists of data files + stats
                            │   (AVRO)       │
                            └───────┬────────┘
                            ┌───────▼────────┐
                            │ Parquet Files  │  ← Actual data (immutable)
                            │   (tracked,    │
                            │    versioned)  │
                            └────────────────┘
```

This separation of metadata and data provides:

* **ACID transactions**: All-or-nothing commits via atomic metadata updates
* **Schema evolution**: Change schema in metadata without touching data files
* **Fast queries**: Read manifests to find relevant files (no directory listing)
* **Time travel**: Each snapshot points to a specific set of files
* **Safe concurrency**: Optimistic locking detects conflicts
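
To make the schema evolution point concrete, here is a toy model in plain Python. It is not the PyIceberg API. It relies on one property of the format: Iceberg identifies columns by numeric field IDs rather than by name. `DataFile` and `read_rows` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """An immutable data file; values are keyed by field ID, not column name."""
    rows: tuple  # each row is a dict {field_id: value}


def read_rows(schema: dict, files: list) -> list:
    """Project every stored row onto the current schema (field_id -> name).

    Columns that an older file does not contain are filled with None.
    """
    return [
        {name: row.get(field_id) for field_id, name in schema.items()}
        for f in files
        for row in f.rows
    ]


# Version 1 of the schema and one file written with it.
schema_v1 = {1: "device_id", 2: "temperature"}
old_file = DataFile(rows=({1: "dev-a", 2: 21.5},))

# Evolve: rename field 2 and add field 3. Only the metadata changes;
# old_file is not touched.
schema_v2 = {1: "device_id", 2: "temp_celsius", 3: "humidity"}
new_file = DataFile(rows=({1: "dev-b", 2: 19.0, 3: 0.4},))

rows = read_rows(schema_v2, [old_file, new_file])
for row in rows:
    print(row)
```

The old file is read under the new column name, and the added column comes back as `None` for rows written before the change.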

### Iceberg vs. Other Approaches

| Approach | Schema Changes | Time Travel | ACID | Concurrent Writes |
|----------|---------------|-------------|------|------------------|
| **Raw Parquet** | Rewrite all data | Manual snapshots | ❌ | ❌ |
| **Parquet + Hive** | Rewrite all data | Manual snapshots | ❌ | ⚠️ (locking issues) |
| **Iceberg** | Metadata only | Built-in | ✅ | ✅ (optimistic) |
| **Delta Lake** | Metadata only | Built-in | ✅ | ✅ (optimistic) |

*Note: Delta Lake is similar to Iceberg but is more tightly coupled to Spark. Iceberg works with any engine (Daft, Spark, Trino, Flink, etc.)*

### When to Use Iceberg

Use Iceberg when you need:

* **Large-scale analytics**: Billions of rows, millions of files
* **Schema flexibility**: Schema changes without downtime
* **Data governance**: Auditing, versioning, compliance
* **Concurrent access**: Multiple readers and writers
* **Query performance**: Fast queries on partitioned data

Don't use Iceberg for:

* **Small datasets**: < 1 GB - the metadata overhead isn't worth it
* **Single-file scenarios**: Just use Parquet directly
* **Real-time updates**: Similar to Parquet, Iceberg is designed for batch and micro-batch writes, not for streaming row-by-row updates

## Setting Up Iceberg

Iceberg requires a **catalog** to manage table metadata. The catalog stores:

* Table locations
* Current metadata file pointers
* Namespace (database) information

We'll use **SqlCatalog** with SQLite for learning purposes. SqlCatalog with SQLite is **not** suitable for production concurrent writes, but it's perfect for learning and development.

For production use, Iceberg defines a standard REST Catalog API. Production implementations include:

* **[Apache Polaris](https://polaris.apache.org/)**: Java-based implementation, cooperatively developed by Snowflake and Dremio.
* **[Lakekeeper](https://docs.lakekeeper.io/)**: An independent, lightweight Rust-based implementation.

Let's set up the catalog again and create a namespace in it to hold our tables.

In [1]:
import daft
import pyarrow as pa
import json
import shutil
from pathlib import Path
from pyiceberg.catalog.sql import SqlCatalog

%reload_ext autoreload
%autoreload 2
from helpers import inspect_iceberg_table

In [2]:
warehouse_path = Path('../data/warehouse_getting_started').absolute()
shutil.rmtree(warehouse_path, ignore_errors=True)
warehouse_path.mkdir(parents=True, exist_ok=True)
catalog_db = warehouse_path / 'catalog.db'
catalog_db.unlink(missing_ok=True)
catalog = SqlCatalog(
    'getting_started',
    **{
        'uri': f'sqlite:///{catalog_db}',
        'warehouse': f'file://{warehouse_path}'
    }
)

# Create a namespace (like a database schema)
catalog.create_namespace('iot')

print(f"✅ Catalog initialized at {warehouse_path}")
print(f"   Catalog DB: {catalog_db}")
print(f"   Namespace 'iot' created")

✅ Catalog initialized at /Users/eickler/Documents/knee-deep-in-the-lake/02_iceberg/../data/warehouse_getting_started
   Catalog DB: /Users/eickler/Documents/knee-deep-in-the-lake/02_iceberg/../data/warehouse_getting_started/catalog.db
   Namespace 'iot' created


## Creating Tables

There are two main approaches to create Iceberg tables:

1. **Define schema explicitly**: Create table with a pre-defined schema, in this case through PyArrow
2. **Infer from data**: Let Iceberg tools discover the schema from your data

Choosing the right approach depends on your environment. For example, in the IoT space, it is often incredibly hard to centrally define a schema that covers all the devices from different vendors in your IoT installation and to align it with software deployments to all devices. For this reason, Cumulocity infers all schemas from the source data to avoid unintentional data loss.

### Approach 1: Explicit Schema

Use this when you know your schema upfront and want strict validation (i.e., rejecting data that does not adhere to the schema).
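
As a rough illustration of what "strict validation" means, the following stdlib-only sketch checks records against a hand-written schema description. It mirrors the four columns used in the cell below, but the `SCHEMA` dictionary and the `validate` function are hypothetical. PyArrow and PyIceberg perform their own, more thorough checks.

```python
from datetime import datetime

# Hypothetical schema description: column name -> (Python type, required?)
SCHEMA = {
    "device_id": (str, True),
    "timestamp": (datetime, True),
    "temperature": (float, False),
    "humidity": (float, False),
}


def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    # Columns the schema does not know about are rejected.
    for name in record:
        if name not in SCHEMA:
            problems.append(f"unexpected column: {name}")
    # Required columns must be present, and present values must have the right type.
    for name, (expected_type, required) in SCHEMA.items():
        value = record.get(name)
        if value is None:
            if required:
                problems.append(f"missing required column: {name}")
        elif not isinstance(value, expected_type):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems


good = {"device_id": "dev-a", "timestamp": datetime(2024, 8, 14, 12, 9), "temperature": 21.5}
bad = {"timestamp": "2024-08-14", "pressure": 1013.0}

print(validate(good))
print(validate(bad))
```

The second record is rejected for three reasons: an unknown column, a missing required column, and a timestamp that arrived as a string.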

In [3]:
# Define schema explicitly
schema = pa.schema([
    pa.field('device_id', pa.string(), nullable=False),
    pa.field('timestamp', pa.timestamp('ms'), nullable=False),
    pa.field('temperature', pa.float64()),
    pa.field('humidity', pa.float64()),
])

# Create table with explicit schema
sensors_table = catalog.create_table(
    'iot.sensors',
    schema=schema
)

print("✅ Created 'iot.sensors' table with explicit schema")
print(f"   Location: {sensors_table.location()}")

✅ Created 'iot.sensors' table with explicit schema
   Location: file:///Users/eickler/Documents/knee-deep-in-the-lake/02_iceberg/../data/warehouse_getting_started/iot/sensors


### Approach 2: Infer Schema from Data

Use this when you're working with JSON or other semi-structured data and want to ingest all data.
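
The idea behind inference can be sketched with the standard library alone: scan the records and collect, per field, the value types that occur. This is only the principle. Daft's actual inference is more elaborate, and the two sample events and the `infer_schema` helper below are made up for illustration.

```python
import json


def infer_schema(lines) -> dict:
    """Map each field name to the set of JSON value types seen for it."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for name, value in record.items():
            schema.setdefault(name, set()).add(type(value).__name__)
    return schema


# Two made-up events; the second carries a field the first does not have.
sample = [
    '{"id": "1", "type": "OperationMode", "text": "Automatic"}',
    '{"id": "2", "type": "c8y_Alarm", "severity": 3}',
]

schema = infer_schema(sample)
for name, types in sorted(schema.items()):
    print(name, sorted(types))
```

No field is dropped: a column that appears in only some records still ends up in the inferred schema, which is the property that avoids data loss.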

In [4]:
df_events = daft.read_json('../data/input/events.jsonl')
df_sample = df_events.limit(50000)
arrow_table = df_sample.to_arrow()

print(f"📊 Loaded {len(arrow_table):,} events from JSON")
print(f"   Auto-discovered schema: {arrow_table.schema}")

# Create Iceberg table from inferred schema
events_table = catalog.create_table(
    'iot.events',
    schema=pa.schema(arrow_table.schema)
)

print(f"\n✅ Created 'iot.events' table")
print(f"   Location: {events_table.location()}")

📊 Loaded 50,000 events from JSON
   Auto-discovered schema: creationTime: timestamp[s, tz=+00:00]
id: large_string
source: large_string
text: large_string
time: timestamp[s, tz=+00:00]
type: large_string

✅ Created 'iot.events' table
   Location: file:///Users/eickler/Documents/knee-deep-in-the-lake/02_iceberg/../data/warehouse_getting_started/iot/events


## Appending Data

Appending data to an Iceberg table creates a new **snapshot**. Each snapshot is an immutable view of the table at a point in time.
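
A toy model may help before running the real append. In this sketch, which is plain Python and not PyIceberg, every append produces a new frozen `Snapshot` that holds the complete set of visible files, so older snapshots stay readable. `ToyTable` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset  # the complete set of data files visible in this snapshot


@dataclass
class ToyTable:
    snapshots: list = field(default_factory=list)

    def append(self, new_file: str) -> Snapshot:
        """Create a new snapshot: the previous file set plus the new file."""
        previous = self.snapshots[-1].files if self.snapshots else frozenset()
        snap = Snapshot(len(self.snapshots) + 1, previous | {new_file})
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id=None) -> frozenset:
        """Return the files of the current snapshot, or of an older one by ID."""
        if snapshot_id is None:
            return self.snapshots[-1].files
        return next(s.files for s in self.snapshots if s.snapshot_id == snapshot_id)


table = ToyTable()
table.append("batch-1.parquet")
table.append("batch-2.parquet")

print(sorted(table.scan()))               # current state: both files
print(sorted(table.scan(snapshot_id=1)))  # time travel: only the first file
```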

In [5]:
from datetime import datetime

# Append the events data we loaded earlier
events_table.append(arrow_table)

print(f"✅ Appended {len(arrow_table):,} records")

# Check the table history
history = events_table.history()
print(f"\n📜 Table now has {len(history)} snapshot(s)")
for i, snapshot in enumerate(history, 1):
    time = datetime.fromtimestamp(snapshot.timestamp_ms / 1000)
    print(f"   Snapshot {i}: ID {snapshot.snapshot_id}, Time: {time}")

✅ Appended 50,000 records

📜 Table now has 1 snapshot(s)
   Snapshot 1: ID 769645367980875530, Time: 2026-02-16 19:38:05.777000


### What Happened Behind the Scenes?

When you appended data:

1. **Data file created**: A new Parquet file was written (to `data/`)
2. **Manifest created**: An AVRO manifest file lists this new data file (to `metadata/`)
3. **Metadata updated**: A new metadata JSON file was created (to `metadata/`)
4. **Catalog updated**: The catalog now points to the new metadata file

Since the catalog update moves the pointer as part of a catalog database transaction, the entire update is **atomic**. From the outside, either everything happened or nothing did. (A failed commit may leave some orphaned files behind, though.)
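
The pointer swap can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration of a compare-and-swap commit, not PyIceberg's implementation: the `toy_catalog` table, its columns, and the `commit` function are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_catalog (table_name TEXT PRIMARY KEY, metadata_location TEXT)"
)
conn.execute("INSERT INTO toy_catalog VALUES ('iot.events', '00000.metadata.json')")
conn.commit()


def commit(conn, table_name: str, expected: str, new: str) -> bool:
    """Swap the metadata pointer only if it still has the expected value."""
    with conn:  # one transaction: committed on success, rolled back on error
        cursor = conn.execute(
            "UPDATE toy_catalog SET metadata_location = ? "
            "WHERE table_name = ? AND metadata_location = ?",
            (new, table_name, expected),
        )
    return cursor.rowcount == 1


# Two writers both prepared their changes on top of version 00000.
first = commit(conn, "iot.events", "00000.metadata.json", "00001-a.metadata.json")
second = commit(conn, "iot.events", "00000.metadata.json", "00001-b.metadata.json")

current = conn.execute("SELECT metadata_location FROM toy_catalog").fetchone()[0]
print(first, second, current)
```

The first writer wins and the second sees that the pointer has moved. In a real system the losing writer would typically re-read the table state and retry, and the files it already wrote are the leftovers mentioned above.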

### What Got Created?

When you create a table, Iceberg creates this directory structure:

```
warehouse_getting_started/
├── iot/                  ← Namespace
│   ├── events/           ← Table
│   │   ├── data/         ← Parquet files go here (the data that we appended above)
│   │   └── metadata/     ← Metadata JSON and manifest AVRO files
│   └── sensors/          ← Table
│       └── metadata/     ← Just the metadata JSON, no data here yet
└── catalog.db            ← The SQLite database
```

Let's verify:

In [6]:
# List the warehouse structure
import os
for root, dirs, files in os.walk(warehouse_path):
    level = root.replace(str(warehouse_path), '').count(os.sep)
    indent = ' ' * 2 * level
    print(f'{indent}{os.path.basename(root)}/')
    subindent = ' ' * 2 * (level + 1)
    for file in files[:5]:  # Show first 5 files only
        print(f'{subindent}{file}')
    if len(files) > 5:
        print(f'{subindent}... and {len(files) - 5} more files')

warehouse_getting_started/
  catalog.db
  iot/
    sensors/
      metadata/
        00000-6c38818e-bb68-42ed-8fc5-0ee6de0ca9f5.metadata.json
    events/
      data/
        00000-0-c9ce97f6-f716-4c40-a1b6-72464047b7b4.parquet
      metadata/
        00001-9cc114e2-b2dc-4539-be69-08bddb8040fa.metadata.json
        snap-769645367980875530-0-c9ce97f6-f716-4c40-a1b6-72464047b7b4.avro
        c9ce97f6-f716-4c40-a1b6-72464047b7b4-m0.avro
        00000-1ce98da7-447c-41ac-9b34-29bd64ec6e77.metadata.json


The `events` table now has two metadata versions, one starting with `00000` and one starting with `00001`. The first one only contains the table description. The second adds a snapshot whose manifest list (`snap-*.avro`) points to the manifest (`*-m0.avro`), which in turn lists the data files and their statistics. We will look more closely into this in the next section.
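
To check this without opening the files by hand, a short stdlib sketch can list each metadata version and how many snapshots it records. The `snapshots` key is part of the Iceberg metadata JSON. The helper names and the stand-in files written to a temporary directory are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def version_of(path: Path) -> int:
    """Extract the numeric prefix, e.g. 1 from '00001-<uuid>.metadata.json'."""
    return int(path.name.split("-", 1)[0])


def metadata_versions(metadata_dir: Path) -> list:
    """Return (version, snapshot count) per metadata file, oldest first."""
    files = sorted(metadata_dir.glob("*.metadata.json"), key=version_of)
    return [
        (version_of(p), len(json.loads(p.read_text()).get("snapshots", [])))
        for p in files
    ]


# Build a tiny stand-in for a table's metadata/ directory (values are made up).
with tempfile.TemporaryDirectory() as tmp:
    metadata_dir = Path(tmp)
    (metadata_dir / "00000-aaaa.metadata.json").write_text(json.dumps({"snapshots": []}))
    (metadata_dir / "00001-bbbb.metadata.json").write_text(json.dumps({
        "current-snapshot-id": 42,
        "snapshots": [{"snapshot-id": 42, "manifest-list": "snap-42-0-cccc.avro"}],
    }))
    versions = metadata_versions(metadata_dir)

print(versions)  # [(0, 0), (1, 1)]
```

Pointing `metadata_dir` at `warehouse_path / 'iot' / 'events' / 'metadata'` should give the same picture for the real table. Note that the catalog, not the file name, is the authority on which metadata file is current.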

## Querying Tables

We again use Daft to read and query Iceberg tables. Daft understands Iceberg's metadata and uses it for the optimizations we discussed in the Parquet chapter, such as column pruning and predicate pushdown. Because the manifests carry per-file statistics, Daft can skip irrelevant files, even across millions of Parquet files, without opening each of them.

Let's run our queries again, this time on the Iceberg table instead of the raw Parquet file.
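
The file-skipping idea can be shown in a few lines of plain Python. This is a simplified sketch of min/max pruning, not Daft's planner. `ManifestEntry`, `prune`, and the three file entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """Per-file statistics as a manifest might record them for one column."""
    path: str
    min_time: str  # ISO 8601 timestamps compare correctly as strings
    max_time: str


def prune(entries, lower: str, upper: str) -> list:
    """Keep only files whose [min, max] range overlaps the query range."""
    return [e.path for e in entries if e.max_time >= lower and e.min_time <= upper]


manifest = [
    ManifestEntry("day-1.parquet", "2024-08-14T00:00:00", "2024-08-14T23:59:59"),
    ManifestEntry("day-2.parquet", "2024-08-15T00:00:00", "2024-08-15T23:59:59"),
    ManifestEntry("day-3.parquet", "2024-08-16T00:00:00", "2024-08-16T23:59:59"),
]

# Query: WHERE time BETWEEN '2024-08-15T06:00:00' AND '2024-08-15T18:00:00'
to_read = prune(manifest, "2024-08-15T06:00:00", "2024-08-15T18:00:00")
print(to_read)  # ['day-2.parquet']
```

Only one of the three files needs to be opened. The decision is made from the statistics alone.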

In [7]:
df = daft.read_iceberg(events_table)
df.show(5)


creationTime Timestamp[us; UTC],id String,source String,text String,time Timestamp[us; UTC],type String
2024-08-14 12:09:38 +00:00,353672,140673,Automatic,2024-08-14 12:09:39 +00:00,OperationMode
2024-08-14 12:09:38 +00:00,353673,140672,Automatic,2024-08-14 12:09:39 +00:00,OperationMode
2024-08-14 12:09:38 +00:00,353768,140709,Automatic,2024-08-14 12:09:39 +00:00,OperationMode
2024-08-14 12:13:53 +00:00,353776,140672,Starting to work on workpiece 2024_9550021,2024-08-14 12:13:53 +00:00,c8y_StartWorkpieceStep
2024-08-14 12:15:14 +00:00,353679,140672,Stop to work on workpiece 2024_9550021,2024-08-14 12:15:14 +00:00,c8y_StopWorkpieceStep


In [8]:
# Count the number of events by type
daft.sql("""
    SELECT type, COUNT(*) as count
    FROM df
    GROUP BY type
    ORDER BY count DESC
""").show()

type String,count UInt64
c8y_StartWorkpieceStep,20167
c8y_StopWorkpieceStep,20151
OperationMode,9682


In [9]:
# Show devices with the most events, this time using the Dataframe API instead of SQL
(
    df
    .groupby('source')
    .agg(daft.col('source').count().alias('event_count'))
    .sort('event_count', desc=True)
    .show()
)

source String,event_count UInt64
140707,47925
2119198,1078
140672,495
140673,266
140709,236


## Basic Operations

Let's explore common table operations:

* **Append**: Add more data
* **Overwrite**: Replace table contents
* **Delete**: Remove records matching a condition

Each operation creates a new snapshot.

### Appending More Data

Let's load an additional batch of events and add it to the table.

In [10]:
df_batch2 = df_events.offset(50000).limit(50000)
batch2_arrow = df_batch2.to_arrow()
events_table.append(batch2_arrow)

df = daft.read_iceberg(events_table)
daft.sql("""
    SELECT type, COUNT(*) as count
    FROM df
    GROUP BY type
    ORDER BY count DESC
    LIMIT 5
""").show()

type String,count UInt64
c8y_StartWorkpieceStep,35879
c8y_StopWorkpieceStep,35850
OperationMode,28271


### Deleting Records

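Iceberg supports **row-level deletes** using predicates. The table format allows two strategies: writing **delete files** that mark rows as deleted without rewriting the data files (merge-on-read), or writing new data files that omit the deleted rows and replacing the affected files in the next snapshot (copy-on-write). In this run, PyIceberg took the copy-on-write path: the snapshot summary under "Inspecting Table Metadata" reports two data files removed, two added, and `total-delete-files: 0`.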
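
The Iceberg format describes two ways to carry out such a delete, and a toy example shows that both lead to the same visible rows. The sketch below uses plain Python lists as stand-ins for files. Which strategy a real writer uses depends on the library and its configuration.

```python
# Stand-in for the 'type' column of one data file (values are illustrative).
rows_in_file = [
    "OperationMode",
    "c8y_StartWorkpieceStep",
    "OperationMode",
    "c8y_StopWorkpieceStep",
]


def matches(event_type: str) -> bool:
    """The delete predicate: type = 'OperationMode'."""
    return event_type == "OperationMode"


# Strategy 1, merge-on-read: keep the data file and record the deleted row
# positions separately; readers filter them out at query time.
position_deletes = {i for i, t in enumerate(rows_in_file) if matches(t)}
merge_on_read = [t for i, t in enumerate(rows_in_file) if i not in position_deletes]

# Strategy 2, copy-on-write: write a replacement file without the deleted rows;
# the new snapshot references the replacement instead of the original.
copy_on_write = [t for t in rows_in_file if not matches(t)]

print(sorted(position_deletes))        # [0, 2]
print(merge_on_read == copy_on_write)  # True
```

In both cases the original file stays unchanged, so snapshots taken before the delete still see all four rows.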

In [11]:
events_table.delete("type = 'OperationMode'")

df = daft.read_iceberg(events_table)
daft.sql("""
    SELECT type, COUNT(*) as count
    FROM df
    GROUP BY type
    ORDER BY count DESC
    LIMIT 5
""").show()

type String,count UInt64
c8y_StartWorkpieceStep,35879
c8y_StopWorkpieceStep,35850


### Overwriting Data

Overwrite replaces the entire table contents with new data. This is useful for reprocessing or fixing data issues. We'll just use a very small sample to show it.

In [12]:
sample_table = catalog.create_table(
    'iot.sample',
    schema=pa.schema([pa.field('id', pa.int64()), pa.field('value', pa.string())])
)

data1 = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
sample_table.append(data1)
daft.read_iceberg(sample_table).show()

id Int64,value String
1,a
2,b
3,c


In [13]:
data2 = pa.table({'id': [10, 20], 'value': ['x', 'y']})
sample_table.overwrite(data2)
daft.read_iceberg(sample_table).show()


id Int64,value String
10,x
20,y


## Inspecting Table Metadata

Let's use our helper function to visualize the complete table structure:

In [None]:
inspect_iceberg_table(events_table)

0,1
Table UUID,7a823ee0-ae21-4113-9ccd-f719c3fd79d2
Location,file:///Users/eickler/Documents/knee-deep-in-the-lake/02_iceberg/../data/warehouse_getting_started/iot/events
Format Version,2
Last Updated,2026-02-16 19:38:06
Current Snapshot ID,7263287393199513725
Total Snapshots,3

0,1
Snapshot ID,7263287393199513725
Timestamp,2026-02-16 19:38:06
Summary,added-data-files: 2 added-files-size: 1744081 added-records: 71729 deleted-data-files: 2 deleted-records: 100000 removed-files-size: 2457475 total-data-files: 2 total-delete-files: 0 total-equality-deletes: 0 total-files-size: 1744081 total-position-deletes: 0 total-records: 71729


## Review Questions

Test your understanding:

1. **What problem does Iceberg solve that Parquet alone doesn't?**
   - Think about transactions, versioning, and metadata management.

2. **Why is metadata separation important for large datasets?**
   - Hint: What happens when you need to list millions of files in S3?

3. **How is Iceberg different from just versioning Parquet files in folders?**
   - Consider: `data/2024-01-01/`, `data/2024-01-02/` vs. Iceberg snapshots

4. **What happens to data files when you delete records?**
   - Are the Parquet files rewritten? What gets created instead?

5. **Why does each operation create a new snapshot?**
   - What would happen without snapshots?

6. **Can you query a table while someone else is writing to it?**
   - Think about snapshot isolation.

## Hands-on Challenge

Now it's your turn! Try these exercises:

### Challenge 1: Create a Table from cmdata.jsonl

1. Load `../data/input/cmdata.jsonl`
2. Create an Iceberg table called `iot.devices`
3. Query the table to find how many devices have alarms

### Challenge 2: Append and Verify

1. Append the first 1000 records from events.jsonl to a new table
2. Append the next 1000 records
3. Verify the total count is 2000
4. Inspect the table history - how many snapshots?

### Challenge 3: Explore Metadata Files

1. Find the metadata JSON file for your events table
2. Open it in a text editor or use `inspect_metadata_json()` (covered in the next notebook)
3. Find the `current-snapshot-id` value
4. Can you locate the corresponding snapshot in the `snapshots` array?

Use the cells below for your solution:

In [15]:
# Challenge 1: Your code here


In [16]:
# Challenge 2: Your code here


In [17]:
# Challenge 3: Your code here


## Summary

In this notebook, we covered:

* **What Iceberg is**: A table format that adds ACID transactions, schema evolution, and time travel to data lakes
* **Why it matters**: Solves key problems with raw Parquet files (no transactions, slow queries, no versioning)
* **Setting up**: Using SqlCatalog with SQLite for development
* **Creating tables**: Both explicit schema and inferred from data
* **Appending data**: Each append creates a new immutable snapshot
* **Querying**: Using Daft with SQL and DataFrame API
* **Basic operations**: Append, delete, overwrite - all with ACID guarantees

### Key Takeaways

1. **Metadata is the magic**: Iceberg's metadata layer enables all its features
2. **Snapshots are immutable**: Once created, a snapshot never changes
3. **Operations are atomic**: All-or-nothing commits via catalog updates
4. **Query engines read metadata**: Daft reads manifests to find relevant files
5. **Data files are never edited in place**: Deletes either add delete files or write new data files, and existing files stay untouched for older snapshots
