# Bronze Layer: Streaming Ingestion
This notebook reads the raw JSON events that the RabbitMQ consumer writes to the `bronze` S3 bucket.

We'll use PySpark Structured Streaming to read, parse, and store these events in Delta Lake format for downstream processing.

### 1. Init Spark

In [1]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("BronzeIngest")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)


### 2. Define Schema & Source Path

In [2]:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("timestamp", StringType()),  # will cast to Timestamp later
    StructField("location", StructType([
        StructField("lat", DoubleType()),
        StructField("lon", DoubleType())
    ]))
])

source_path = "/mnt/s3mock/bronze/realtime/events"
checkpoint_path = "/mnt/s3mock/bronze/realtime/_checkpoints"
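To make the schema concrete, here is a minimal sketch of one event line that would fit it, plus a plain-Python check of its shape. The sample values and the `conforms` helper are hypothetical and not part of the pipeline. Note that the check is stricter than Spark: in its default permissive mode, the JSON reader sets missing or mismatched fields to null rather than rejecting the record.

```python
import json

# Hypothetical sample event: field names follow the schema above, values are invented.
sample_line = (
    '{"user_id": "u-123", "event_type": "click", '
    '"timestamp": "2024-01-15T10:30:00Z", '
    '"location": {"lat": 52.52, "lon": 13.405}}'
)

# Top-level string fields declared in the schema.
EXPECTED_STRING_FIELDS = ("user_id", "event_type", "timestamp")


def is_number(value) -> bool:
    """True for int/float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def conforms(line: str) -> bool:
    """Return True if one JSON line has the fields and types the schema declares."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(event, dict):
        return False
    if not all(isinstance(event.get(name), str) for name in EXPECTED_STRING_FIELDS):
        return False
    location = event.get("location")
    if not isinstance(location, dict):
        return False
    return is_number(location.get("lat")) and is_number(location.get("lon"))
```

A check like this could be run on a handful of files from `source_path` before starting the stream, to catch producer-side changes that would otherwise surface only as null columns.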

### 3. Start Structured Stream

In [3]:
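from pyspark.sql.functions import col, to_timestamp

# Read newly arriving JSON files from the source path using the explicit schema
raw_df = (
    spark.readStream
    .schema(schema)
    .json(source_path)
)

# Parse the string `timestamp` field into a proper timestamp column
cleaned_df = raw_df.withColumn("event_timestamp", to_timestamp(col("timestamp")))

# Append each micro-batch to the Delta table; the checkpoint tracks which files were processed
query = (
    cleaned_df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .start("/mnt/s3mock/silver/realtime/events")
)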

### 4. Monitor Progress

In [None]:
# Blocks this cell until the streaming query stops or fails
query.awaitTermination()
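Because `awaitTermination()` blocks, it does not report progress by itself. One option is to poll `query.lastProgress` in a separate cell before blocking. The sketch below assumes `lastProgress` returns a dict-like object (as in PySpark 3.x) with the `batchId`, `numInputRows`, and `processedRowsPerSecond` keys; `summarize_progress` is a hypothetical helper, not a Spark API.

```python
from typing import Optional


def summarize_progress(progress: Optional[dict]) -> str:
    """Format one progress report (e.g. query.lastProgress) as a single line."""
    if not progress:
        # lastProgress is None until the first micro-batch has completed
        return "no micro-batch completed yet"
    batch_id = progress.get("batchId")
    rows = progress.get("numInputRows") or 0
    rate = progress.get("processedRowsPerSecond") or 0.0
    return f"batch {batch_id}: {rows} rows in, {rate:.1f} rows/s processed"


# Possible usage in a cell run before awaitTermination():
#
#   import time
#   while query.isActive:
#       print(summarize_progress(query.lastProgress))
#       time.sleep(10)
```

`query.status` gives a coarser view (whether data is available and whether a trigger is active) if per-batch numbers are not needed.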