Welcome! This repository provides a full working environment to explore Apache Polaris integrated with Apache Spark. Follow the steps below to bring up the environment, configure your Polaris catalog, and run interactive Spark SQL queries and jobs.
Spin up the Polaris services using Docker Compose:

```bash
docker compose up -d
```

This will start Polaris and its dependencies locally. By default, Polaris will be available at:

```
http://localhost:8181
```
You'll need Python 3.8+ and virtualenv.
Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows use .venv\Scripts\activate
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Run the bootstrap script to set up your Polaris catalog, roles, and user:

```bash
python bootstrap.py
```

This will:
- Create a catalog named `polariscatalog`
- Create and assign roles
- Register a user and return their credentials
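If you want to sanity-check the generated credentials before wiring them into Spark, you can request an OAuth token from Polaris directly. This is a minimal sketch using only the standard library; it assumes the Iceberg REST catalog's standard `v1/oauth/tokens` client-credentials endpoint and the client ID/secret printed by `bootstrap.py` — adjust the URI if your deployment differs.

```python
import json
import urllib.parse
import urllib.request

POLARIS_URI = "http://localhost:8181/api/catalog"


def build_token_request(client_id, client_secret, scope="PRINCIPAL_ROLE:ALL"):
    """Build the client-credentials token request: (URL, form-encoded body)."""
    url = f"{POLARIS_URI}/v1/oauth/tokens"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }).encode()
    return url, body


def fetch_token(client_id, client_secret):
    """POST the request and return the bearer access token."""
    url, body = build_token_request(client_id, client_secret)
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]


# Usage (after bootstrap.py has printed your credentials):
#   token = fetch_token("your_client_id", "your_client_secret")
```

A `200` response with an `access_token` field confirms the principal and its roles were created correctly.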
Launch the Spark UI by visiting:
http://localhost:8888
You can use Spark interactively or run Python scripts. Here's an example script to configure Spark with Polaris:

```python
import pyspark
from pyspark.sql import SparkSession

POLARIS_URI = 'http://polaris:8181/api/catalog'
POLARIS_CATALOG_NAME = 'polariscatalog'
POLARIS_CREDENTIALS = 'your_client_id:your_client_secret'
POLARIS_SCOPE = 'PRINCIPAL_ROLE:ALL'

conf = (
    pyspark.SparkConf()
    .setAppName('PolarisSparkApp')
    .set('spark.jars.packages',
         'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,'
         'org.apache.hadoop:hadoop-aws:3.4.0')
    .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .set('spark.sql.catalog.polaris', 'org.apache.iceberg.spark.SparkCatalog')
    .set('spark.sql.catalog.polaris.warehouse', POLARIS_CATALOG_NAME)
    .set('spark.sql.catalog.polaris.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog')
    .set('spark.sql.catalog.polaris.uri', POLARIS_URI)
    .set('spark.sql.catalog.polaris.credential', POLARIS_CREDENTIALS)
    .set('spark.sql.catalog.polaris.scope', POLARIS_SCOPE)
    .set('spark.sql.catalog.polaris.token-refresh-enabled', 'true')
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("✅ Spark session configured for Polaris is running.")

# List namespaces
spark.sql("SHOW NAMESPACES IN polaris").show()

# Create a namespace if it doesn't exist
spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.db").show()

# List tables in a namespace
spark.sql("SHOW TABLES IN polaris.db").show()

# Define sample data
data = [
    (1, "Alice", "2023-01-01"),
    (2, "Bob", "2023-01-02"),
    (3, "Charlie", "2023-01-03"),
]
columns = ["id", "name", "event_date"]
df = spark.createDataFrame(data, columns)

# Write a new Iceberg table in Polaris
df.writeTo("polaris.db.events") \
    .partitionedBy("event_date") \
    .tableProperty("write.format.default", "parquet") \
    .create()

# Verify table creation
spark.sql("SHOW TABLES IN polaris.db").show()
```
- Replace `your_client_id:your_client_secret` with the credentials generated by `bootstrap.py`.
- You may update the example script to explore different Polaris features.
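Everything the Spark catalog does goes through the Iceberg REST API, so you can also inspect the catalog with plain HTTP. The sketch below lists top-level namespaces; the `namespaces_url` helper is my own, and it assumes (per the Iceberg REST spec as Polaris exposes it) that the catalog name serves as the URL prefix — verify against your deployment's `v1/config` response.

```python
import json
import urllib.request

POLARIS_URI = "http://localhost:8181/api/catalog"
CATALOG = "polariscatalog"


def namespaces_url(base, catalog):
    """Iceberg REST: GET /v1/{prefix}/namespaces lists top-level namespaces."""
    return f"{base}/v1/{catalog}/namespaces"


def list_namespaces(token):
    """Fetch namespaces using a bearer token (e.g. from the OAuth endpoint)."""
    req = urllib.request.Request(namespaces_url(POLARIS_URI, CATALOG))
    req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["namespaces"]


# Usage:
#   print(list_namespaces(token))  # e.g. [["db"]] after the script above ran
```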
⚠️ Important Note on File Permissions
When using the FILE storage type for your Polaris catalog, the Spark container may preemptively create the data and metadata directories during write operations. However, this can cause a permissions conflict when the Polaris container later tries to write metadata files to the same directories.
You may encounter an error like:

```
ServiceUnavailableException: Failed to create file: file:/data/db/events/metadata/00000-...
```
This happens because:
- Spark (running as one user) creates the folder.
- Polaris (running as another user) tries to write to it but lacks permissions.
Before starting your environment or running your Spark job, manually set the appropriate permissions on the shared volume or directory:

```bash
mkdir -p /data/db/events/metadata/
chmod -R 777 /data
```

Or, if using Docker volumes, ensure the volume is mounted and writable by all relevant containers:
```yaml
volumes:
  data-volume:
    driver: local

services:
  spark:
    volumes:
      - data-volume:/data
  polaris:
    volumes:
      - data-volume:/data
```

When using the FILE storage type in Polaris, it's your responsibility to ensure that the underlying directory structure for each Iceberg table exists before attempting to create or write to the table from Spark.
In this local setup:
- Spark and Polaris share the same file-backed storage directory (`./icebergdata`, mounted as `/data`).
- Polaris must be able to write metadata files, but if Spark creates the directory first (e.g., during a `.create()` call), the folder may end up with restrictive permissions.
- This leads to errors like:

```
ServiceUnavailableException: Failed to create file: file:/data/...
```
To prevent this, use the helper script below to create the required folders with the correct permissions (777).
This script ensures the necessary folder structure exists for a given table and sets appropriate permissions.
```bash
cd icebergdata
python table_setup.py <namespace.table>
```

Example:

```bash
cd icebergdata
python table_setup.py test.table
```

This will create:

```
./icebergdata/test/table/metadata/
./icebergdata/test/table/data/
```

and ensure both are `chmod 777`.
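For reference, the helper's behavior can be approximated in a few lines. This is a sketch of the idea (create the `metadata/` and `data/` directories for a table and mark them world-writable), not the actual contents of `table_setup.py`:

```python
import os
import sys


def setup_table_dirs(root, qualified_name):
    """Create <root>/<namespace>/<table>/{metadata,data} with mode 777.

    qualified_name is "namespace.table", e.g. "test.table".
    Returns the list of directories created.
    """
    namespace, table = qualified_name.split(".", 1)
    created = []
    for sub in ("metadata", "data"):
        path = os.path.join(root, namespace, table, sub)
        os.makedirs(path, exist_ok=True)
        # chmod explicitly: makedirs honors the umask, so the dirs could
        # otherwise end up unwritable for the Polaris container's user.
        os.chmod(path, 0o777)
        created.append(path)
    return created


if __name__ == "__main__" and len(sys.argv) > 1:
    for p in setup_table_dirs(".", sys.argv[1]):
        print(p)
```

Run it from `icebergdata/` so the directories land under the volume that both containers mount as `/data`.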