Iceberg REST Catalog Flow

```mermaid
graph TD

    subgraph Query Engines
        A[SparkSQL]
        P[Trino Coordinator]
    end

    A -->|Query| B[Iceberg Extensions]
    P -->|Query| B
    B -->|Get Metadata| C[REST Catalog]

    A -->|Distribute| W[Worker Nodes, Spark Executors]
    P -->|Distribute| W
    W -->|Read/Write Data| E

    C -->|Read/Write Metadata| E[MinIO/S3]
```

Iceberg

This repository demonstrates a simple example of using Apache Iceberg with both SparkSQL and Trino (formerly PrestoSQL) query engines. The architecture diagram above shows the key differences in how these engines interact with Iceberg tables:

SparkSQL Flow

- SparkSQL sends queries through the Iceberg extensions, which handle table management.
- Metadata is retrieved from the REST Catalog service.
- The actual data files are accessed through S3FileIO.
- Data is stored in MinIO (S3-compatible storage).
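
The flow above maps onto a handful of Spark catalog properties. The following is a minimal sketch rather than this repository's `main.py`: the catalog name `demo`, the `s3://warehouse/` location, and the helper name `iceberg_spark_conf` are assumptions for illustration, and the default ports are the ones this README lists for the REST Catalog and MinIO.

```python
def iceberg_spark_conf(
    catalog: str = "demo",                       # hypothetical catalog name
    rest_uri: str = "http://localhost:8181",     # REST Catalog endpoint
    s3_endpoint: str = "http://localhost:9000",  # MinIO endpoint
    warehouse: str = "s3://warehouse/",          # hypothetical bucket
) -> dict[str, str]:
    """Build Spark properties for an Iceberg REST catalog on S3-compatible storage."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        # Registers the catalog and points it at the REST Catalog service.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": rest_uri,
        # Data files go through S3FileIO to the S3-compatible endpoint.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }


if __name__ == "__main__":
    for key, value in iceberg_spark_conf().items():
        print(f"{key}={value}")
```

Each pair can be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`. Credentials and region are left out on purpose; in this sketch they are assumed to come from the environment.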

Trino Flow

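- The Trino coordinator receives SQL queries and distributes work across worker nodes.
- Worker nodes execute their share of the query in parallel through Trino's Iceberg connector (the "Iceberg Extensions" node in the diagram).
- Data access is distributed across the workers, each reading and writing data files in S3-compatible storage directly, so I/O scales with the number of workers.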
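
On the Trino side, the coordinator and workers learn about the catalog from a properties file under `etc/catalog/`. The sketch below renders one such file; it is not a copy of what this repository keeps in `presto-catalog`. The hostnames `rest` and `minio`, the region value, and the helper name are assumptions, and the `fs.native-s3.enabled` and `s3.*` keys apply to recent Trino releases (older releases used `hive.s3.*` names instead).

```python
def trino_iceberg_catalog_properties(
    rest_uri: str = "http://rest:8181",      # hypothetical service hostname
    s3_endpoint: str = "http://minio:9000",  # hypothetical service hostname
    s3_region: str = "us-east-1",            # placeholder region
) -> str:
    """Render a Trino catalog file for an Iceberg REST catalog."""
    lines = [
        # Use the Iceberg connector and resolve tables through the REST Catalog.
        "connector.name=iceberg",
        "iceberg.catalog.type=rest",
        f"iceberg.rest-catalog.uri={rest_uri}",
        # S3-compatible storage settings (recent Trino releases).
        "fs.native-s3.enabled=true",
        f"s3.endpoint={s3_endpoint}",
        f"s3.region={s3_region}",
        "s3.path-style-access=true",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(trino_iceberg_catalog_properties(), end="")
```

The same file is read by every node in the cluster, which is what lets each worker reach the storage endpoint on its own.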

Key Differences:

- Architecture: In Spark, the driver plans each job and the executors handle both compute and I/O. In Trino, a coordinator plans queries and dedicated workers execute them.
- Scalability: Trino is designed for interactive, concurrent queries across a shared cluster, while Spark excels at batch processing.
- Query Optimization: Spark uses the Catalyst optimizer, while Trino has its own cost-based optimizer.

Getting Started

Start the services:
```
docker compose up -d
```
Wait for all services to be healthy:
- REST Catalog (port 8181)
- MinIO (port 9000)
- Trino (port 8081)
- Spark (port 8080), only if you run Spark in a container
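
One way to wait for these ports from a script is to poll them until they accept TCP connections. This is a rough sketch using only the standard library; the function names are made up for illustration, and an open port only shows that a service is listening, not that it is fully healthy.

```python
import socket
import time

# Ports as listed above; add "Spark": 8080 if the container is enabled.
SERVICES = {"REST Catalog": 8181, "MinIO": 9000, "Trino": 8081}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(
    services: dict[str, int],
    host: str = "localhost",
    deadline_s: float = 60.0,
    poll_s: float = 2.0,
) -> list[str]:
    """Poll until every port accepts connections or the deadline passes.

    Returns the names of services that are still unreachable (empty if all are up).
    """
    end = time.monotonic() + deadline_s
    pending = dict(services)
    while pending:
        # Keep only the services whose port is still closed.
        pending = {n: p for n, p in pending.items() if not port_is_open(host, p)}
        if not pending or time.monotonic() >= end:
            break
        time.sleep(poll_s)
    return sorted(pending)
```

Calling `wait_for_services(SERVICES)` returns an empty list once all three ports are reachable, or the names of the services that never came up within the deadline.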

Run the example:

```
docker compose exec spark-iceberg python /home/iceberg/main.py
```

Verify the results:
- Check MinIO console at http://localhost:9001
- View Trino UI at http://localhost:8081
- Examine Spark UI at http://localhost:8080

Launch Spark

If you want to run Spark locally, launch it with the following command:
```
spark-submit \
    --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,org.apache.iceberg:iceberg-aws-bundle:1.7.1 \
    src/main.py
```
Otherwise, use the spark-iceberg container by uncommenting the spark-iceberg service in the docker-compose.yaml file. Keep in mind that you then need to change the endpoints in the main.py and docker-compose.yaml files to match the ones on the Docker bridge network.
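
The endpoint switch can be kept in one place instead of being edited by hand. The sketch below is illustrative only: the service hostnames `rest` and `minio` and the `RUN_IN_DOCKER` variable are hypothetical (check docker-compose.yaml for the real service names), and it assumes the container ports match the published ports 8181 and 9000.

```python
import os

# Hypothetical compose service names; replace with the ones in docker-compose.yaml.
DOCKER_HOSTS = {"rest": "rest", "minio": "minio"}


def resolve_endpoints(in_docker: bool) -> dict[str, str]:
    """Pick REST Catalog and S3 endpoints for local or in-container runs."""
    rest_host = DOCKER_HOSTS["rest"] if in_docker else "localhost"
    s3_host = DOCKER_HOSTS["minio"] if in_docker else "localhost"
    return {
        "rest_uri": f"http://{rest_host}:8181",
        "s3_endpoint": f"http://{s3_host}:9000",
    }


def running_in_docker_from_env(env=os.environ) -> bool:
    """Read the hypothetical RUN_IN_DOCKER flag from the environment."""
    return env.get("RUN_IN_DOCKER", "").lower() in {"1", "true", "yes"}


if __name__ == "__main__":
    print(resolve_endpoints(running_in_docker_from_env()))
```

With this shape, a local `spark-submit` run uses the localhost endpoints by default, and the container only needs the flag set in its environment.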
