# Spark sanity-check notebook

This notebook creates a SparkSession against your cluster and runs a tiny job. It also enables event logging so the Spark History Server can display the run.

**Tips**:
- When running *inside the Docker JupyterLab* container, set `SPARK_MASTER_URL` to `spark://spark-master:7077` (already set by Terraform). The event logs will go to `/opt/bitnami/spark/tmp/spark-events`.
- When running *on the host*, set `SPARK_MASTER_URL` to `spark://localhost:7077`. The notebook falls back to a writable local folder (e.g. `/tmp/spark-events`) if the Bitnami path isn't writable.
- Make sure a Java runtime (e.g. OpenJDK 17) is installed when running on the host.
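
The tips above can be checked before a session is created. The following is a minimal sketch, assuming a hypothetical helper named `preflight` (it is not part of this notebook); it reports the master URL that would be picked up and whether a `java` executable is visible on `PATH`. Java may also be located through `JAVA_HOME`, so a missing `java` here is a hint rather than proof that the runtime is absent.

```python
import os
import shutil


def preflight(env=None):
    """Report the master URL in effect and the `java` executable found on PATH.

    `preflight` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    # Same default the notebook uses when SPARK_MASTER_URL is unset.
    master = env.get("SPARK_MASTER_URL", "spark://localhost:7077")
    # shutil.which returns None when no `java` executable is found on PATH.
    java = shutil.which("java")
    return {"master": master, "java": java}


print(preflight())
```

Passing a plain dict instead of `os.environ` makes it easy to try out both the container and the host settings without changing the real environment.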


In [None]:
import os
import pathlib
from pyspark.sql import SparkSession

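def pick_event_log_dir():
    """Return the first candidate event-log URI whose directory is writable."""
    candidates = [
        "file:/opt/bitnami/spark/tmp/spark-events",  # Docker Bitnami image path
        "file:/event-logs",                          # extra mount (optional)
        "file:/tmp/spark-events",                    # host fallback
    ]
    for uri in candidates:
        path = pathlib.Path(uri[len("file:"):])
        probe = path / "._write_test"
        try:
            path.mkdir(parents=True, exist_ok=True)
            probe.write_text("ok")  # confirm the directory is actually writable
            probe.unlink()          # remove the probe so it does not clutter the log dir
            return uri
        except OSError:
            continue
    # Nothing was writable; return the host fallback and let Spark report the error.
    return "file:/tmp/spark-events"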

master = os.getenv("SPARK_MASTER_URL", "spark://localhost:7077")
event_log_dir = pick_event_log_dir()

spark = (SparkSession.builder
         .master(master)
         .appName("hist-check")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", event_log_dir)
         .getOrCreate())

print("Spark version:", spark.version)
print("Master:", master)
print("eventLog.dir:", event_log_dir)


In [None]:
# Run a tiny job
df = spark.range(100000)
df.selectExpr("sum(id) AS total").show()
print("Done.")
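
The total shown by the job can be checked without Spark, since `spark.range(100000)` produces the ids 0 through 99999. A small sketch, assuming a hypothetical helper named `expected_range_sum` that uses the closed form n(n - 1)/2:

```python
def expected_range_sum(n):
    """Sum of the integers 0, 1, ..., n - 1 (hypothetical helper for illustration)."""
    return n * (n - 1) // 2


# Cross-check the closed form against a direct summation.
assert expected_range_sum(100000) == sum(range(100000))
print(expected_range_sum(100000))  # 4999950000
```

If the `total` column printed by the job differs from this value, the session is probably not running the job it appears to be running.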


In [None]:
# Clean shutdown (optional)
spark.stop()
print("Spark stopped.")
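
Once the session has stopped, the event-log directory should contain an entry for the run. The sketch below assumes a hypothetical helper named `list_event_logs` that lists the entries of a `file:` URI such as the one returned by `pick_event_log_dir()`. Spark normally names event logs after the application ID and marks logs of still-running applications with an `.inprogress` suffix; the file name used in the demonstration is made up.

```python
import pathlib
import tempfile


def list_event_logs(uri):
    """Return the sorted entry names of a `file:` event-log directory.

    `list_event_logs` is a hypothetical helper for illustration only.
    """
    prefix = "file:"
    path = pathlib.Path(uri[len(prefix):] if uri.startswith(prefix) else uri)
    if not path.is_dir():
        return []
    return sorted(entry.name for entry in path.iterdir())


# Demonstration on a throwaway directory with a made-up log name.
with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / "app-example").write_text("{}")
    print(list_event_logs("file:" + tmp))  # ['app-example']
```

In the notebook itself, `list_event_logs(event_log_dir)` would show what was written for this run. The History Server only displays the run if it reads from the same directory.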
