Welcome! This repository provides a full working environment to explore Apache Polaris integrated with Apache Spark. Follow the steps below to bring up the environment, configure your Polaris catalog, and run interactive Spark SQL queries and jobs.
Spin up the Polaris services using Docker Compose:

```bash
docker compose up -d
```

This will start Polaris and its dependencies locally. By default, Polaris will be available at:

```
http://localhost:8181
```
You'll need Python 3.8+ and virtualenv.
Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows use .venv\Scripts\activate
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Run the bootstrap script to set up your Polaris catalog, roles, and user:

```bash
python bootstrap.py
```

This will:
- Create a catalog named `polariscatalog`
- Create and assign roles
- Register a user and return their credentials
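If you want to sanity-check the generated credentials before wiring them into Spark, you can request an OAuth token from Polaris directly. This is a minimal sketch using only the standard library; it assumes the Iceberg REST catalog's standard `v1/oauth/tokens` client-credentials endpoint and the client ID/secret printed by `bootstrap.py` — adjust the URI if your deployment differs.

```python
import json
import urllib.parse
import urllib.request

POLARIS_URI = "http://localhost:8181/api/catalog"


def build_token_request(client_id, client_secret, scope="PRINCIPAL_ROLE:ALL"):
    """Build the client-credentials token request: (URL, form-encoded body)."""
    url = f"{POLARIS_URI}/v1/oauth/tokens"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }).encode()
    return url, body


def fetch_token(client_id, client_secret):
    """POST the request and return the bearer access token."""
    url, body = build_token_request(client_id, client_secret)
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]


# Usage (after bootstrap.py has printed your credentials):
#   token = fetch_token("your_client_id", "your_client_secret")
```

A `200` response with an `access_token` field confirms the principal and its roles were created correctly.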
Launch the Spark UI by visiting:
http://localhost:8888
You can use Spark interactively or run Python scripts. Here's an example script to configure Spark with Polaris:

```python
import pyspark
from pyspark.sql import SparkSession

POLARIS_URI = 'http://polaris:8181/api/catalog'
POLARIS_CATALOG_NAME = 'polariscatalog'
POLARIS_CREDENTIALS = 'your_client_id:your_client_secret'
POLARIS_SCOPE = 'PRINCIPAL_ROLE:ALL'

conf = (
    pyspark.SparkConf()
    .setAppName('PolarisSparkApp')
    .set('spark.jars.packages',
         'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,'
         'org.apache.hadoop:hadoop-aws:3.4.0')
    .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .set('spark.sql.catalog.polaris', 'org.apache.iceberg.spark.SparkCatalog')
    .set('spark.sql.catalog.polaris.warehouse', POLARIS_CATALOG_NAME)
    .set('spark.sql.catalog.polaris.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog')
    .set('spark.sql.catalog.polaris.uri', POLARIS_URI)
    .set('spark.sql.catalog.polaris.credential', POLARIS_CREDENTIALS)
    .set('spark.sql.catalog.polaris.scope', POLARIS_SCOPE)
    .set('spark.sql.catalog.polaris.token-refresh-enabled', 'true')
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("✅ Spark session configured for Polaris is running.")

# List namespaces
spark.sql("SHOW NAMESPACES IN polaris").show()

# Create a namespace if it doesn't exist
spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.db").show()

# List tables in a namespace
spark.sql("SHOW TABLES IN polaris.db").show()

# Define sample data
data = [
    (1, "Alice", "2023-01-01"),
    (2, "Bob", "2023-01-02"),
    (3, "Charlie", "2023-01-03"),
]
columns = ["id", "name", "event_date"]
df = spark.createDataFrame(data, columns)

# Write a new Iceberg table in Polaris
df.writeTo("polaris.db.events") \
    .partitionedBy("event_date") \
    .tableProperty("write.format.default", "parquet") \
    .create()

# Verify table creation
spark.sql("SHOW TABLES IN polaris.db").show()
```
- Replace `your_client_id:your_client_secret` with the credentials generated by `bootstrap.py`.
- You may update the example script to explore different Polaris features.
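Everything the Spark catalog does goes through the Iceberg REST API, so you can also inspect the catalog with plain HTTP. The sketch below lists top-level namespaces; the `namespaces_url` helper is my own, and it assumes (per the Iceberg REST spec as Polaris exposes it) that the catalog name serves as the URL prefix — verify against your deployment's `v1/config` response.

```python
import json
import urllib.request

POLARIS_URI = "http://localhost:8181/api/catalog"
CATALOG = "polariscatalog"


def namespaces_url(base, catalog):
    """Iceberg REST: GET /v1/{prefix}/namespaces lists top-level namespaces."""
    return f"{base}/v1/{catalog}/namespaces"


def list_namespaces(token):
    """Fetch namespaces using a bearer token (e.g. from the OAuth endpoint)."""
    req = urllib.request.Request(namespaces_url(POLARIS_URI, CATALOG))
    req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["namespaces"]


# Usage:
#   print(list_namespaces(token))  # e.g. [["db"]] after the script above ran
```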
⚠️ Important Note on File Permissions
When using the FILE storage type for your Polaris catalog, the Spark container may preemptively create the data and metadata directories during write operations. However, this can cause a permissions conflict when the Polaris container later tries to write metadata files to the same directories.
You may encounter an error like:

```
ServiceUnavailableException: Failed to create file: file:/data/db/events/metadata/00000-...
```
This happens because:
- Spark (running as one user) creates the folder.
- Polaris (running as another user) tries to write to it but lacks permissions.
Before starting your environment or running your Spark job, manually set the appropriate permissions on the shared volume or directory:

```bash
mkdir -p /data/db/events/metadata/
chmod -R 777 /data
```

Or, if using Docker volumes, ensure the volume is mounted and writable by all relevant containers:
```yaml
volumes:
  data-volume:
    driver: local

services:
  spark:
    volumes:
      - data-volume:/data
  polaris:
    volumes:
      - data-volume:/data
```

When using the FILE storage type in Polaris, it's your responsibility to ensure that the underlying directory structure for each Iceberg table exists before attempting to create or write to the table from Spark.
In this local setup:
- Spark and Polaris share the same file-backed storage directory (`./icebergdata`, mounted as `/data`).
- Polaris must be able to write metadata files, but if Spark creates the directory first (e.g., during a `.create()` call), the folder may end up with restrictive permissions.
- This leads to errors like:

```
ServiceUnavailableException: Failed to create file: file:/data/...
```
To prevent this, use the helper script below to create the required folders with the correct permissions (777).
This script ensures the necessary folder structure exists for a given table and sets appropriate permissions.
```bash
cd icebergdata
python table_setup.py <namespace.table>
```

Example:

```bash
cd icebergdata
python table_setup.py test.table
```

This will create:

```
./icebergdata/test/table/metadata/
./icebergdata/test/table/data/
```

and ensure both are `chmod 777`.
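For reference, the helper's behavior can be approximated in a few lines. This is a sketch of the idea (create the `metadata/` and `data/` directories for a table and mark them world-writable), not the actual contents of `table_setup.py`:

```python
import os
import sys


def setup_table_dirs(root, qualified_name):
    """Create <root>/<namespace>/<table>/{metadata,data} with mode 777.

    qualified_name is "namespace.table", e.g. "test.table".
    Returns the list of directories created.
    """
    namespace, table = qualified_name.split(".", 1)
    created = []
    for sub in ("metadata", "data"):
        path = os.path.join(root, namespace, table, sub)
        os.makedirs(path, exist_ok=True)
        # chmod explicitly: makedirs honors the umask, so the dirs could
        # otherwise end up unwritable for the Polaris container's user.
        os.chmod(path, 0o777)
        created.append(path)
    return created


if __name__ == "__main__" and len(sys.argv) > 1:
    for p in setup_table_dirs(".", sys.argv[1]):
        print(p)
```

Run it from `icebergdata/` so the directories land under the volume that both containers mount as `/data`.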