AlexMercedCoder/quick-test-polaris-environment

Polaris Environment Setup Guide

Welcome! This repository provides a full working environment to explore Apache Polaris integrated with Apache Spark. Follow the steps below to bring up the environment, configure your Polaris catalog, and run interactive Spark SQL queries and jobs.


🚀 Step 1: Start the Polaris Environment

Spin up the Polaris services using Docker Compose:

docker compose up -d

This will start Polaris and its dependencies locally. By default, Polaris will be available at:

http://localhost:8181
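The containers may need a few seconds before they accept connections after `docker compose up -d` returns. As a minimal sketch using only the standard library, the helper below polls a TCP port until it opens. It only shows that something is listening on the port; it does not confirm that Polaris itself is healthy. The name `wait_for_port` and the timeout values are illustrative and not part of this repository.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example call (8181 is the default Polaris port mentioned above):
# wait_for_port("localhost", 8181)
```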

🐍 Step 2: Bootstrap Your Polaris Catalog

You'll need Python 3.8+ with the built-in venv module (used in the commands below).

Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows use .venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Run the bootstrap script to set up your Polaris catalog, roles, and user:

python bootstrap.py

This will:

  • Create a catalog named polariscatalog

  • Create and assign roles

  • Register a user and return their credentials

💻 Step 3: Open Spark and Interact with Polaris

Launch the Spark UI by visiting:

http://localhost:8888

You can use Spark interactively or run Python scripts. Here's an example script to configure Spark with Polaris:

import pyspark
from pyspark.sql import SparkSession

POLARIS_URI = 'http://polaris:8181/api/catalog'
POLARIS_CATALOG_NAME = 'polariscatalog'
POLARIS_CREDENTIALS = 'your_client_id:your_client_secret'
POLARIS_SCOPE = 'PRINCIPAL_ROLE:ALL'

conf = (
    pyspark.SparkConf()
        .setAppName('PolarisSparkApp')
        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.hadoop:hadoop-aws:3.4.0')
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
        .set('spark.sql.catalog.polaris', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.polaris.warehouse', POLARIS_CATALOG_NAME)
        .set('spark.sql.catalog.polaris.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog')
        .set('spark.sql.catalog.polaris.uri', POLARIS_URI)
        .set('spark.sql.catalog.polaris.credential', POLARIS_CREDENTIALS)
        .set('spark.sql.catalog.polaris.scope', POLARIS_SCOPE)
        .set('spark.sql.catalog.polaris.token-refresh-enabled', 'true')
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("✅ Spark session configured for Polaris is running.")

Example: Spark SQL with Polaris

# List namespaces
spark.sql("SHOW NAMESPACES IN polaris").show()

# Create a namespace if it doesn't exist
spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.db")

# List tables in a namespace
spark.sql("SHOW TABLES IN polaris.db").show()

# Define sample data
data = [
    (1, "Alice", "2023-01-01"),
    (2, "Bob", "2023-01-02"),
    (3, "Charlie", "2023-01-03"),
]
columns = ["id", "name", "event_date"]
df = spark.createDataFrame(data, columns)

# Write a new Iceberg table in Polaris (create() fails if the table already exists)
df.writeTo("polaris.db.events") \
  .partitionedBy("event_date") \
  .tableProperty("write.format.default", "parquet") \
  .create()

# Verify table creation
spark.sql("SHOW TABLES IN polaris.db").show()

🧾 Notes

  • Replace your_client_id:your_client_secret with the credentials generated by bootstrap.py.

  • You may update the example script to explore different Polaris features.
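To keep the generated credentials out of the script itself, one option is to read them from an environment variable. This is a minimal sketch: the environment variable name `POLARIS_CREDENTIALS` and the helper `load_credentials` are assumptions for illustration, not something this repository defines. The variable must be set in whatever environment actually runs the Spark code.

```python
import os


def load_credentials(env_var: str = "POLARIS_CREDENTIALS") -> str:
    """Return 'client_id:client_secret' from the environment after a basic shape check."""
    value = os.environ.get(env_var, "").strip()
    client_id, sep, client_secret = value.partition(":")
    if not sep or not client_id or not client_secret:
        raise ValueError(f"{env_var} must be set to 'client_id:client_secret'")
    if client_id == "your_client_id":
        # Catch the placeholder from the example script being left in place.
        raise ValueError(f"{env_var} still contains the placeholder value")
    return value
```

In the example script, `POLARIS_CREDENTIALS = load_credentials()` could then replace the hard-coded string.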

⚠️ Important Note on File Permissions

When using the FILE storage type for your Polaris catalog, the Spark container may preemptively create the data and metadata directories during write operations. However, this can cause a permissions conflict when the Polaris container later tries to write metadata files to the same directories.

🐞 Problem

You may encounter an error like:

ServiceUnavailableException: Failed to create file: file:/data/db/events/metadata/00000-...

This happens because:

  • Spark (running as one user) creates the folder.
  • Polaris (running as another user) tries to write to it but lacks permissions.

✅ Solution

Before starting your environment or running your Spark job, manually set the appropriate permissions on the shared volume or directory:

mkdir -p /data/db/events/metadata/
chmod -R 777 /data

Or, if using Docker volumes, ensure the volume is mounted and writable by all relevant containers:

volumes:
  data-volume:
    driver: local

services:
  spark:
    volumes:
      - data-volume:/data

  polaris:
    volumes:
      - data-volume:/data
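To see which directories under the shared mount could still block the other container, a short diagnostic can list every directory that is not world-writable. This is a sketch, not part of the repository; `find_restricted_dirs` is a hypothetical helper, and `/data` is the mount path used above.

```python
import os
import stat
from typing import List


def find_restricted_dirs(root: str) -> List[str]:
    """Return directories under root (root included) that lack world-write permission."""
    restricted = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = os.stat(dirpath).st_mode
        # S_IWOTH is the "others may write" bit, the one that matters when
        # Spark and Polaris run as different users.
        if not mode & stat.S_IWOTH:
            restricted.append(dirpath)
    return sorted(restricted)


# Example call from inside a container that mounts the shared volume:
# print(find_restricted_dirs("/data"))
```

An empty list means every directory under the root is writable by any user, which is the state that `chmod -R 777 /data` produces.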

📁 Setting Up Iceberg Table Directories (Required for FILE Storage)

When using the FILE storage type in Polaris, it's your responsibility to ensure that the underlying directory structure for each Iceberg table exists before attempting to create or write to the table from Spark.

❓ Why is this necessary?

In this local setup:

  • Spark and Polaris share the same file-backed storage directory (./icebergdata, mounted as /data).
  • Polaris must be able to write metadata files, but if Spark creates the directory first (e.g., during a .create() call), the folder may end up with restrictive permissions.
  • This leads to errors like:
ServiceUnavailableException: Failed to create file: file:/data/...

To prevent this, use the helper script below to create the required folders with the correct permissions (777).

🛠️ How to Use table_setup.py

This script ensures the necessary folder structure exists for a given table and sets appropriate permissions.

✅ Usage

cd icebergdata
python table_setup.py <namespace.table>

Example:

cd icebergdata
python table_setup.py test.table

This will create:

./icebergdata/test/table/metadata/
./icebergdata/test/table/data/

Both directories are set to mode 777.

⚠️ Important: Always run this script before creating a table using Spark when using FILE storage in Polaris.
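The repository's table_setup.py is not reproduced here. As a rough sketch of a script with the behavior described above, the function below assumes the argument is a dotted `namespace.table` identifier and that directories are created relative to a base directory. The name `setup_table_dirs` and its `base_dir` parameter are illustrative, and the real script may differ.

```python
import os
from typing import List


def setup_table_dirs(identifier: str, base_dir: str = ".") -> List[str]:
    """Create <base_dir>/<namespace>/<table>/{metadata,data} and set mode 777."""
    parts = identifier.split(".")
    if len(parts) < 2 or not all(parts):
        raise ValueError("expected an identifier like 'namespace.table'")

    table_dir = os.path.join(base_dir, *parts)
    created = []
    for sub in ("metadata", "data"):
        path = os.path.join(table_dir, sub)
        os.makedirs(path, exist_ok=True)
        created.append(path)

    # os.makedirs is subject to the process umask, so set the mode explicitly
    # on each namespace/table level and on the two leaf directories.
    current = base_dir
    for part in parts:
        current = os.path.join(current, part)
        os.chmod(current, 0o777)
    for path in created:
        os.chmod(path, 0o777)
    return created
```

Under these assumptions, calling `setup_table_dirs("test.table")` from inside icebergdata would produce the two directories listed in the example above.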
