# OpenLineage + Spark (Demo)

This notebook runs a simple Spark job that:
- creates a small CSV dataset
- reads it with Spark
- writes it out as Parquet

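Because `spark.extraListeners` is set to the OpenLineage Spark listener (`io.openlineage.spark.agent.OpenLineageSparkListener`), the job emits OpenLineage events to the configured endpoint (`spark.openlineage.transport.url`). The notebook code sets neither option itself; both come from the container's Spark configuration.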
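
For reference, the settings involved can be sketched as a plain dictionary applied to a session builder. This is a minimal sketch, not the demo's actual configuration: the listener class name is the one shipped with the OpenLineage Spark integration, while the transport URL, the namespace value, and the helper name `apply_conf` are assumptions made for illustration.

```python
# Sketch only: in this demo the equivalent keys are read from spark-defaults.conf.
OPENLINEAGE_CONF = {
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": "http://marquez:5000",  # assumed endpoint
    "spark.openlineage.namespace": "demo",  # assumed namespace
}


def apply_conf(builder, conf=OPENLINEAGE_CONF):
    """Apply each key/value pair to a SparkSession builder and return it."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark available, this would be used as `apply_conf(SparkSession.builder.appName("openlineage-demo")).getOrCreate()`. The OpenLineage Spark jar also has to be on the classpath for the listener to load; how that is done here is not shown in this notebook.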


In [1]:
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("openlineage-demo")
    .getOrCreate()
)

spark.version

'3.5.3'

In [2]:
# Create a small CSV file locally inside the container
data_dir = "/home/jovyan/work/data"
os.makedirs(data_dir, exist_ok=True)
csv_path = os.path.join(data_dir, "people.csv")

with open(csv_path, "w", encoding="utf-8") as f:
    f.write("id,name,age\n")
    f.write("1,Ana,30\n")
    f.write("2,Nika,28\n")
    f.write("3,Salome,31\n")

csv_path

'/home/jovyan/work/data/people.csv'

In [3]:
df = spark.read.option("header", True).csv(csv_path)
df.show()

+---+------+---+
| id|  name|age|
+---+------+---+
|  1|   Ana| 30|
|  2|  Nika| 28|
|  3|Salome| 31|
+---+------+---+



In [4]:
# Rename one column and write the result as Parquet. The read -> write pair
# is the kind of input/output edge that lineage tools record.
out_path = os.path.join(data_dir, "people_parquet")
(
    df
    .withColumnRenamed("name", "full_name")
    .write
    .mode("overwrite")
    .parquet(out_path)
)

out_path

'/home/jovyan/work/data/people_parquet'
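
To make "lineage" concrete, the write above can be pictured as an event that links an input dataset to an output dataset. The snippet below is a hand-built sketch of the general shape of an OpenLineage run event, not a capture of what the listener sends: the job name, the `demo` namespace, the `file` dataset namespace, and the `producer` value are all assumed, and real events carry more fields (for example `schemaURL` and facets).

```python
import json
import uuid
from datetime import datetime, timezone


def sketch_run_event(event_type, job_name, input_path, output_path):
    """Build a dict shaped like an OpenLineage run event (illustrative only)."""
    return {
        "eventType": event_type,  # e.g. START or COMPLETE
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "demo", "name": job_name},  # assumed values
        "inputs": [{"namespace": "file", "name": input_path}],
        "outputs": [{"namespace": "file", "name": output_path}],
        "producer": "https://example.com/sketch",  # placeholder
    }


event = sketch_run_event(
    "COMPLETE",
    "openlineage_demo",  # hypothetical job name
    "/home/jovyan/work/data/people.csv",
    "/home/jovyan/work/data/people_parquet",
)
print(json.dumps(event, indent=2))
```

The point of the sketch is the `inputs` and `outputs` lists: a backend draws the lineage graph by connecting datasets that appear in them for the same job.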

## Where to look for lineage

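- If you run Marquez, open its UI and look for the job and its most recent run; the job name is typically derived from the Spark app name (`openlineage-demo`).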
- If you point the transport URL to a different OpenLineage backend/collector, check that service for incoming events.
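
If Marquez is the backend, the same check can be done from code instead of the UI. The sketch below assumes the Marquez REST API is reachable at a base URL such as `http://localhost:5000` and that the jobs listing returns a JSON object with a `jobs` array; the base URL, the namespace, and the helper names are assumptions, and the response shape may differ across Marquez versions.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def jobs_url(base_url, namespace):
    """Return the Marquez REST URL that lists jobs in a namespace."""
    encoded = quote(namespace, safe="")
    return f"{base_url.rstrip('/')}/api/v1/namespaces/{encoded}/jobs"


def list_job_names(base_url, namespace, timeout=5):
    """Fetch job names from Marquez; needs a reachable Marquez API."""
    with urlopen(jobs_url(base_url, namespace), timeout=timeout) as resp:
        payload = json.load(resp)
    return [job["name"] for job in payload.get("jobs", [])]
```

For example, `list_job_names("http://localhost:5000", "demo")` would return the job names Marquez has recorded in the assumed `demo` namespace after the cells above have run.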

The listener and transport settings live in `${SPARK_HOME}/conf/spark-defaults.conf` inside the container.
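
To confirm which OpenLineage settings are active, the file can be read and filtered from a notebook cell. This is a minimal sketch that handles only the common whitespace-separated `key value` layout and `#` comments; the function names are made up for this example, and it assumes `SPARK_HOME` is set in the environment.

```python
import os


def read_spark_defaults(text):
    """Parse 'key value' lines, skipping blank lines and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def openlineage_settings(conf):
    """Keep only the listener entry and the spark.openlineage.* entries."""
    return {
        key: value
        for key, value in conf.items()
        if key == "spark.extraListeners" or key.startswith("spark.openlineage.")
    }


def load_from_container():
    """Read spark-defaults.conf under SPARK_HOME (assumes the variable is set)."""
    path = os.path.join(os.environ["SPARK_HOME"], "conf", "spark-defaults.conf")
    with open(path, encoding="utf-8") as f:
        return openlineage_settings(read_spark_defaults(f.read()))
```

Calling `load_from_container()` in a cell returns a dictionary of the relevant keys, which is a quick way to verify the transport URL before looking for events downstream.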
