# Part 2: Streaming application using Spark Structured Streaming  
In this task, you will implement Spark Structured Streaming to consume the data from task 1 and perform a prediction.    
Important:   
-	This task uses PySpark Structured Streaming with PySpark Dataframe APIs and PySpark ML.  
-	You also need your pipeline model from A2A to make predictions and persist the results.  

1.	Write code to create a SparkSession, which 1) uses four cores with a proper application name; 2) use the Melbourne timezone; 3) ensure a checkpoint location has been set.


In [1]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.5.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 pyspark-shell'
# Import SparkConf class into program
from pyspark import SparkConf

# local[*]: run Spark in local mode with as many working processors as logical cores on your machine
# If we want Spark to run locally with 'k' worker threads, we can specify as "local[k]".
master = "local[4]"
# The `appName` field is a name to be shown on the Spark cluster UI page
app_name = "Assignment2B"
# Setup configuration parameters for Spark
spark_conf = SparkConf().setMaster(master).setAppName(app_name) \
                        .set("spark.sql.streaming.checkpointLocation", "checkpoints")

# Import SparkContext and SparkSession classes
from pyspark import SparkContext # Spark
from pyspark.sql import SparkSession # Spark SQL

# Method 1: Using SparkSession
spark = SparkSession.builder.config(conf=spark_conf).config("spark.sql.session.timeZone", "GMT+10").getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('ERROR')

from pyspark.sql import functions as F

2.	Write code to define the data schema for the data files, following the data types suggested in the metadata file. Load the static datasets (e.g. building information) into data frames. (You can reuse your code from 2A.)


In [2]:
# Adapted from GPT
from pyspark.sql.types import (
    StructType, StructField,
    IntegerType, StringType, DecimalType, TimestampType, DateType, DoubleType
)

# 1. Meters Table
meters_schema = StructType([
    StructField("building_id", IntegerType(), False),
    StructField("meter_type", StringType(), False),   # Char(1) -> StringType
    StructField("ts", TimestampType(), False),
    StructField("value", DecimalType(15, 4), False),
    StructField("row_id", IntegerType(), False)
])

# 2. Buildings Table
buildings_schema = StructType([
    StructField("site_id", IntegerType(), False),
    StructField("building_id", IntegerType(), False),
    StructField("primary_use", StringType(), True),
    StructField("square_feet", IntegerType(), True),
    StructField("floor_count", IntegerType(), True),
    StructField("row_id", IntegerType(), False),
    StructField("year_built", IntegerType(), True),
    StructField("latent_y", DecimalType(6, 4), True),
    StructField("latent_s", DecimalType(6, 4), True),
    StructField("latent_r", DecimalType(6, 4), True)
])

# 3. Weather Table
weather_schema = StructType([
    StructField("site_id", StringType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("air_temperature", DecimalType(5, 3), True),
    StructField("cloud_coverage", DecimalType(5, 3), True), # Is an Integer, but ends with a ".0", so read as a DecimalType
    StructField("dew_temperature", DecimalType(5, 3), True),
    StructField("sea_level_pressure", DecimalType(8, 3), True),
    StructField("wind_direction", DecimalType(5, 3), True), # Is an Integer, but ends with a ".0", so read as a DecimalType
    StructField("wind_speed", DecimalType(5, 3), True),
    StructField("weather_ts", TimestampType(), False) # new field
])

buildings_df = spark.read.csv(
    "data/new_building_information.csv",
    header=True,
    schema=buildings_schema
)

weather_df = spark.read.csv(
    "data/weather.csv",
    header=True,
    schema=weather_schema
)


3.	Using the Kafka topic from the producer in Task 1, ingest the streaming data into Spark Streaming, assuming all data comes in the String format. Except for the 'weather_ts' column, you shall receive it as an Int type. Load the new building information CSV file into a dataframe. Then, the data frames should be transformed into the proper formats following the metadata file schema, similar to assignment 2A.


In [3]:


#configuration
hostip = "192.168.0.6"#"10.192.90.63" #change to your machine IP address
topic = 'weather_data'

df_raw = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", f'{hostip}:9092') \
    .option("subscribe", topic) \
    .load()

df_str = df_raw.selectExpr("CAST(value AS STRING) as json_str")

weather_stream = (
    df_str
    .withColumn("data", F.from_json(F.col("json_str"), F.ArrayType(weather_schema)))
    .select(F.explode(F.col("data")).alias("r"))
    .select("r.*")
)


4.	Use a watermark on weather_ts, if data points are received 5 seconds late, discard the data.

In [4]:
weather_stream = weather_stream.withWatermark("weather_ts", '5 seconds')

5.	Perform the necessary transformation you used in A2A. (note: every student may have used different features, feel free to reuse the code you have written in A2A. If you built an end-to-end pipeline, you can ignore this task.) 

In [5]:

# from A2A which was from GPT
# Get global_means, site_month_means, site_means
# weather_df is history, weather_stream is current

# Split timestamp to date, month, time bucket
weather_df = weather_df.withColumn("date", F.to_date("timestamp")).withColumn(
    "time",
    F.when(F.hour("timestamp") <= 5, "0-6h")
     .when(F.hour("timestamp") <= 11, "6-12h")
     .when(F.hour("timestamp") <= 17, "12-18h")
     .when(F.hour("timestamp") <= 23, "18-24h")
).withColumn("month", F.month("timestamp"))

# Choose which columns to impute
impute_cols = [
    "air_temperature",
    "cloud_coverage",
    "dew_temperature",
    "sea_level_pressure",
    "wind_direction",
    "wind_speed"
]

# Compute global_means, site_month_means, site_means
global_means = weather_df.select(
    *[F.mean(c).alias(c) for c in impute_cols]
).first().asDict()

site_month_means = weather_df.groupBy("site_id", "month").agg(
    *[F.mean(c).alias(f"{c}_site_month_mean") for c in impute_cols]
)

site_means = weather_df.groupBy("site_id").agg(
    *[F.mean(c).alias(f"{c}_site_mean") for c in impute_cols]
)
    
# Skip Garbage collection
# del site_month_means
# del site_means
# del global_means
# spark.catalog.clearCache()

# Transform weather_stream
# Split timestamp to date, month, time bucket
weather_stream = weather_stream.withColumn("date", F.to_date("timestamp")).withColumn(
    "time",
    F.when(F.hour("timestamp") <= 5, "0-6h")
     .when(F.hour("timestamp") <= 11, "6-12h")
     .when(F.hour("timestamp") <= 17, "12-18h")
     .when(F.hour("timestamp") <= 23, "18-24h")
).withColumn("month", F.month("timestamp"))


# Step 1: site_id + month
weather_stream = weather_stream.join(site_month_means, on=["site_id", "month"], how="left")
for c in impute_cols:
    weather_stream = weather_stream.withColumn(
        c, F.coalesce(c, F.col(f"{c}_site_month_mean"))
    ).drop(f"{c}_site_month_mean")
    
# Step 2: site_id
weather_stream = weather_stream.join(site_means, on="site_id", how="left")
for c in impute_cols:
    weather_stream = weather_stream.withColumn(
        c, F.coalesce(c, F.col(f"{c}_site_mean"))
    ).drop(f"{c}_site_mean")

# Step 3: global fallback
for c in impute_cols:
    weather_stream = weather_stream.withColumn(
        c, F.coalesce(c, F.lit(global_means[c]))
    )
# # Aggregate by time bucket
# weather_stream = (
#     weather_stream
#     .groupBy(
#         "site_id", "date", "time", "month",
#         F.window("weather_ts", "5 seconds")
#     )
#     .agg(
#         F.mean("air_temperature").cast(DecimalType(5, 3)).alias("air_temperature"),
#         F.mean("cloud_coverage").cast(DecimalType(5, 3)).alias("cloud_coverage"),
#         F.mean("dew_temperature").cast(DecimalType(5, 3)).alias("dew_temperature"),
#         F.mean("sea_level_pressure").cast(DecimalType(8, 3)).alias("sea_level_pressure"),
#         F.mean("wind_direction").cast(DecimalType(5, 3)).alias("wind_direction"),
#         F.mean("wind_speed").cast(DecimalType(5, 3)).alias("wind_speed")     
# #         F.max("weather_ts").alias("weather_ts")   # 👈 reattach representative timestamp
#     )    
# #     # Extract representative event time
# #     .withColumn("weather_ts", F.col("window.end"))
# #     .drop("window")
# )

# Add custom columns
weather_stream = (
    weather_stream
    .withColumn("dew_depression", F.col("air_temperature") - F.col("dew_temperature"))
    .withColumn("nonideal_temp", (F.col("air_temperature") - 18)**2)
    .drop("air_temperature")
    .drop("dew_temperature")
)

# No need to add median temp and peak-offpeak as our pipeline model later does not use them
feature_df = buildings_df.join(weather_stream, ["site_id"])


6.	Load your pipeline model and perform the following aggregations:  
a)	Print the prediction from your model as a stream comes in.  
b)	Every 7 seconds, print the total energy consumption for each 6-hour interval, aggregated by building, and print 20 records. (Note: This is simulating energy data each day in a week)  
c)	Every 14 seconds, for each site, print the daily total energy consumption.  

In [6]:

# --- 1. Spark + model setup ---
# from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
import time

model = PipelineModel.load("models/best_model_rmsle")

# --- 2. Apply model ---
predictions = model.transform(feature_df).withColumnRenamed("prediction", "log_power_usage")

checkpoint_dir = os.path.abspath("checkpoints/weather_stream")
os.makedirs(checkpoint_dir, exist_ok=True)

In [7]:
### 6a
# Show live predictions
query_live = (
    predictions
        .select("site_id", "building_id", "log_power_usage")
        .writeStream
        .outputMode("append")
        .format("memory")
        .queryName("live_predictions")
        .start()
)

print("Waiting for first batch...")
# Wait for the first microbatch to finish
while query_live.lastProgress is None:
    time.sleep(1)
time.sleep(5)
spark.sql("select * from live_predictions").show()


Waiting for first batch...
+-------+-----------+------------------+
|site_id|building_id|   log_power_usage|
+-------+-----------+------------------+
|      8|        861| 4.061107855980998|
|      8|        843| 4.428725433461416|
|      8|        822| 2.389271160411927|
|      8|        866|   4.7190681590531|
|      8|        865| 4.725603405704713|
|      8|        834| 4.037425349753118|
|      8|        811|2.6361125623621566|
|      9|        977|3.5224304499167083|
|      9|        993|7.8203338563111675|
|      9|        992| 5.751885548834196|
|      9|        990|5.9585415358470835|
|      9|        982| 6.099074356589539|
|      9|        981|6.5416565754162175|
|      9|        974| 7.254738639766967|
|      9|        973| 7.572680107210483|
|      9|        959| 6.602787329961556|
|      9|        957|6.6434303000431925|
|      9|        954| 7.653788027556944|
|      9|        952| 6.937003238959837|
|      9|        947| 6.230174505306105|
+-------+-----------+---------

In [8]:
### 6b
# (Task 6b requires aggregation by 6-hour interval)
building_6h = (
    predictions
        .groupBy(
            "building_id",
            "date",
            "time",  # Group by 6 hours of event-time
            F.window("weather_ts", "5 seconds") # Watermark 5 seconds of processing time
        )
        .agg(F.sum("log_power_usage").alias("total_power_6h"))
        .drop("window")
)

# --- Print to console every 7 seconds ---
query_building_6h = (
    building_6h
        .writeStream
        .outputMode("update") # 'update' mode is correct for windowed aggregations
        .format("memory")
        .queryName("building_6h")
        .trigger(processingTime="7 seconds")
        .start()
)

print("Waiting for first batch...")
# Wait for the first microbatch to finish
while query_building_6h.lastProgress is None:
    time.sleep(1)
time.sleep(5)    
spark.sql("select * from building_6h").show()

Waiting for first batch...
+-----------+----+----+--------------+
|building_id|date|time|total_power_6h|
+-----------+----+----+--------------+
+-----------+----+----+--------------+



In [9]:
### 6c
# (Task 6c requires daily aggregation by site)
site_daily = (
    predictions
        .groupBy(
            "site_id", 
            "date", # Group by daily windows of event-time
            F.window("weather_ts", "5 seconds") # Watermark 5 seconds of processing time
        )
        .agg(F.sum("log_power_usage").alias("total_power_day"))
        .drop("window")
)

# --- Print to console every 14 seconds ---
query_site_daily = (
    site_daily
        .writeStream
        .outputMode("update") # 'update' mode is correct
        .format("memory")
        .queryName("site_daily")
        .trigger(processingTime="14 seconds")
        .start()
)
print("Waiting for first batch...")
# Wait for the first microbatch to finish
while query_site_daily.lastProgress is None:
    time.sleep(1)
time.sleep(5)
spark.sql("select * from site_daily").show()

Waiting for first batch...
+-------+----------+------------------+
|site_id|      date|   total_power_day|
+-------+----------+------------------+
|     12|2022-01-05| 572.7342103477073|
|     10|2022-01-05| 469.9349332503669|
|     12|2022-01-04|162.95446694173177|
|     13|2022-01-05|2370.3350094118846|
|     13|2022-01-04| 674.4028832870381|
|      9|2022-01-04|300.89999848006653|
|      5|2022-01-05|1124.6531168216864|
|     14|2022-01-04|473.18583547849533|
|      0|2022-01-05|1523.0941193922542|
|      3|2022-01-05| 3986.921797130366|
|      2|2022-01-04|213.10652652331166|
|     14|2022-01-05|1666.5162041026665|
|      6|2022-01-04| 65.19596300214755|
|      5|2022-01-04|141.23216305172062|
|      1|2022-01-04| 98.12558570774381|
|     15|2022-01-04|467.56681801979727|
|      6|2022-01-05| 516.1672388089638|
|      4|2022-01-04| 174.6069838722403|
|      8|2022-01-05|192.43712424573005|
|      9|2022-01-05|2117.8548405524016|
+-------+----------+------------------+
only showing 

7.	Save the data from 6 to Parquet files as streams. (Hint: Parquet files support streaming writing/reading. The file keeps updating while new batches arrive.)

In [11]:
# 7a(save 6a)

# Save predictions to Parquet incrementally
query_live_parquet = (
    predictions
        .select("site_id", "building_id", "time", "log_power_usage")
        .writeStream
        .outputMode("append")
        .format("parquet")
        .option("path", "data/live_predictions")
        .option("checkpointLocation", checkpoint_dir + "/live_predictions")
        .start()
)
print("Waiting for Parquet microbatch...")
while query_live_parquet.lastProgress is None:
    time.sleep(1)
time.sleep(5)

Waiting for Parquet microbatch...


In [None]:
# 7b(save 6b)
query_building_6h_parquet = (
    building_6h
        .writeStream
        .outputMode("append")
        .format("parquet")
        .option("path", "data/building_6h")
        .option("checkpointLocation", checkpoint_dir + "/building_6h")
        .start()
)
print("Waiting for Parquet microbatch...")
while query_building_6h_parquet.lastProgress is None:
    time.sleep(1)
time.sleep(5)

Waiting for Parquet microbatch...


In [None]:
# 7c(save 6c)
query_site_daily_parquet = (
    site_daily
        .writeStream
        .outputMode("append")
        .format("parquet")
        .option("path", "data/site_daily")
        .option("checkpointLocation", checkpoint_dir + "/site_daily")
        .trigger(processingTime="14 seconds")
        .start()
)
print("Waiting for Parquet microbatch...")
while query_site_daily_parquet.lastProgress is None:
    time.sleep(1)
time.sleep(5)

Waiting for Parquet microbatch...


8.	Read the parquet files from task 7 as data streams and send them to Kafka topics with appropriate names.
(Note: You shall read the parquet files as a streaming data frame and send messages to the Kafka topic when new data appears in the parquet file.)

In [14]:
import json
kafka_ip = hostip + ":9092"

# Schema for 6a/7a/8a (live_predictions)
live_pred_schema = StructType([
    StructField("site_id", IntegerType(), True),
    StructField("building_id", IntegerType(), True),
    StructField("time", StringType(), True),
    StructField("log_power_usage", DoubleType(), True)
])

# Schema for 6b/7b/8b (building_6h)
building_6h_schema = StructType([
    StructField("building_id", IntegerType(), True),
    StructField("date", DateType(), True),
    StructField("time", StringType(), True),
#     StructField("window", StructType([
#         StructField("start", TimestampType(), True),
#         StructField("end", TimestampType(), True)
#     ]), True),
    StructField("total_power_6h", DoubleType(), True)
])

# Schema for 6c/7c/8c (site_daily)
site_daily_schema = StructType([
    StructField("site_id", IntegerType(), True),
    StructField("date", DateType(), True),
#     StructField("window", StructType([
#         StructField("start", TimestampType(), True),
#         StructField("end", TimestampType(), True)
#     ]), True),
    StructField("total_power_day", DoubleType(), True)
])

In [15]:

# Stream 1
read_parquet_live_predictions = (
    spark.readStream
         .format("parquet")
         .schema(live_pred_schema)
         .load("data/live_predictions")
)

# send predictions to Kafka
kafka_live_predictions = (
    read_parquet_live_predictions
        .selectExpr("\"predictions\" AS key", 
                    "to_json(struct(*)) AS value")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", kafka_ip)
        .option("topic", "live_predictions")
        .option("checkpointLocation", checkpoint_dir + "/kafka/live_predictions")
        .outputMode("append")
        .start()
)


In [16]:
# Stream 2
read_parquet_building_6h = (
    spark.readStream
         .format("parquet")
         .schema(building_6h_schema)
         .load("data/building_6h")
)

# send predictions to Kafka
kafka_building_6h = (
    read_parquet_building_6h
        .selectExpr("\"predictions\" AS key", 
                        "to_json(struct(*)) AS value")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", kafka_ip)
        .option("topic", "building_6h")
        .option("checkpointLocation", checkpoint_dir + "/kafka/building_6h")
        .outputMode("append")
        .start()
)


In [17]:
# Stream 3
read_parquet_site_daily = (
    spark.readStream
         .format("parquet")
         .schema(site_daily_schema)
         .load("data/site_daily")
)

# send predictions to Kafka
kafka_site_daily = (
    read_parquet_site_daily
        .selectExpr("\"predictions\" AS key", 
                        "to_json(struct(*)) AS value")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", kafka_ip)
        .option("topic", "site_daily")
        .option("checkpointLocation", checkpoint_dir + "/kafka/site_daily")
        .outputMode("append")
        .start()
)

In [None]:
while query_site_daily_parquet.lastProgress is None:
    time.sleep(1)
print(kafka_site_daily.lastProgress)
print("Is active?", kafka_site_daily.isActive)
print("Status:", kafka_site_daily.status)

None
Is active? True
Status: {'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}
<bound method StreamingQuery.exception of <pyspark.sql.streaming.query.StreamingQuery object at 0x71f5abf6c3a0>>


In [19]:
import time, json

for i in range(5):
    time.sleep(3)
    print(json.dumps(kafka_site_daily.lastProgress, indent=2))


null
null
null
null
{
  "id": "25a09da5-e0db-49d8-b517-fbbb795da9c6",
  "runId": "484509b0-772b-41ed-9bc6-fb201e01cec7",
  "name": null,
  "timestamp": "2025-10-24T05:48:53.160Z",
  "batchId": 0,
  "numInputRows": 0,
  "inputRowsPerSecond": 0.0,
  "processedRowsPerSecond": 0.0,
  "durationMs": {
    "addBatch": 12257,
    "commitOffsets": 289,
    "getBatch": 81,
    "latestOffset": 519,
    "queryPlanning": 2,
    "triggerExecution": 13428,
    "walCommit": 259
  },
  "stateOperators": [],
  "sources": [
    {
      "description": "FileStreamSource[file:/home/student/Assignment/data/site_daily]",
      "startOffset": null,
      "endOffset": {
        "logOffset": 0
      },
      "latestOffset": null,
      "numInputRows": 0,
      "inputRowsPerSecond": 0.0,
      "processedRowsPerSecond": 0.0
    }
  ],
  "sink": {
    "description": "org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable@48fc4c72",
    "numOutputRows": 0
  }
}


In [20]:
spark.read.parquet("data/live_predictions").show(5)
spark.read.parquet("data/building_6h").show(5)
spark.read.parquet("data/site_daily").show(5)


+-------+-----------+------+------------------+
|site_id|building_id|  time|   log_power_usage|
+-------+-----------+------+------------------+
|      6|        766|18-24h| 5.477508030876975|
|      6|        764|18-24h|5.1696915897919995|
|      6|        760|18-24h| 5.685122953716053|
|      6|        753|18-24h| 5.766261711592816|
|      6|        750|18-24h| 5.142296791595273|
+-------+-----------+------+------------------+
only showing top 5 rows

+-----------+----------+------+------------------+
|building_id|      date|  time|    total_power_6h|
+-----------+----------+------+------------------+
|       1185|2022-01-09|18-24h|17.251736703858818|
|        678|2022-01-11|  0-6h|  22.8457083760589|
|         97|2022-01-11|  0-6h| 45.60107833960657|
|       1083|2022-01-11|  0-6h|26.684846194365164|
|         15|2022-01-10|12-18h| 7.049988453300645|
+-----------+----------+------+------------------+
only showing top 5 rows

+-------+----------+------------------+
|site_id|      date

# Prompt: 
 i have the following code file, Task3_consumer, meant to get 2 kafka consumers to consume the 2 aggregated data streams. i want to visualize data flowing in, and then later plot the predicted vs actual energy consumption. this code block plots fine, but it keeps flickering so much you can barely actually see anything. the problem is that it keeps redrawing everything, especially as the incoming date changes quite frequently. for the site_daily stream, the site_id is statically 0-15 and can be kept there, but for the building_6h stream, the incoming building_ids vary widely and the rankings would keep changing. for both streams, i want the max value on the y axis to be set to at least 500 and 5000 for building_6h and site_daily streams respectively, updating to be more if higher values come in. help me to fix these issues.


to fix the redrawing problem, i want to collect all the data continuously, but to only redraw these graphs every 5 seconds. this should help fix the other problem where data spanning a few days come in out of order, due to lag or other limitations, so the kafka consumer can collect the full day's data before redrawing the graph. it is okay for the graphs to be late, as long as it can collect all the data first. it should store data until 7 days of data at once are stored, and then only start plotting 1 week old data, 1 day at a time (2 days of history for site_daily). plotting only 1 week old data should ensure that all the data we want to plot for that day has arrived. do not skip days of data being plotted, after a day's data has finished its plot and is no longer in use, it can be dropped.
the part where it redraws a graphs every 5 seconds should be in sync with it plotting data 1 week old, 1 day at a time. the producer publishes every 5 seconds, but it may not be a full day's data, and it may arrive out of order, hence the waiting for a whole day's data to come before redrawing the graph.


i would also like a debugger to show how many data points for each date is currently being stored. 
the A2B-Task3_consumer_atleo4 copy.ipynb file contains the code, the A2B_Specification_2025S2.pdf describes what was supposed to be done. 

note that "weather_ts" represents the processor timestamp of the producer, while "date" refers to the date of the data collection time, and "time" is a string corresponding to whether it is 0-6h, 6-12h, 12-18h, 18-24h.  

the current setup already receives data, and is able to plot some of it, so it can be assumed that the producer is mostly correct.  
the schema for the kafka stream that the kafka consumer receives, as per end of Task2_spark_streaming, is:
import json
kafka_ip = hostip + ":9092"

# Schema for 6a/7a/8a (live_predictions)
live_pred_schema = StructType([
    StructField("site_id", IntegerType(), True),
    StructField("building_id", IntegerType(), True),
    StructField("time", StringType(), True),
    StructField("log_power_usage", DoubleType(), True)
])

# Schema for 6b/7b/8b (building_6h)
building_6h_schema = StructType([
    StructField("building_id", IntegerType(), True),
    StructField("date", DateType(), True),
    StructField("time", StringType(), True),
#     StructField("window", StructType([
#         StructField("start", TimestampType(), True),
#         StructField("end", TimestampType(), True)
#     ]), True),
    StructField("total_power_6h", DoubleType(), True)
])

# Schema for 6c/7c/8c (site_daily)
site_daily_schema = StructType([
    StructField("site_id", IntegerType(), True),
    StructField("date", DateType(), True),
#     StructField("window", StructType([
#         StructField("start", TimestampType(), True),
#         StructField("end", TimestampType(), True)
#     ]), True),
    StructField("total_power_day", DoubleType(), True)
])

and they are written as such:
# Stream 3
read_parquet_site_daily = (
    spark.readStream
         .format("parquet")
         .schema(site_daily_schema)
         .load("data/site_daily")
)

# send predictions to Kafka
kafka_site_daily = (
    read_parquet_site_daily
        .selectExpr("\"predictions\" AS key", 
                        "to_json(struct(*)) AS value")
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", kafka_ip)
        .option("topic", "site_daily")
        .option("checkpointLocation", checkpoint_dir + "/kafka/site_daily")
        .outputMode("append")
        .start()
)
with an identical one for building_6h