# Part 2: Streaming application using Spark Structured Streaming  
In this task, you will implement Spark Structured Streaming to consume the data from task 1 and perform a prediction.    
Important:   
-	This task uses PySpark Structured Streaming with PySpark Dataframe APIs and PySpark ML.  
-	You also need your pipeline model from A2A to make predictions and persist the results.  

1.	Write code to create a SparkSession that 1) uses four cores with a proper application name; 2) uses the Melbourne timezone; 3) ensures a checkpoint location has been set.


In [1]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.0.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 pyspark-shell'

# Import SparkConf class into program
from pyspark import SparkConf

# local[*]: run Spark in local mode with as many working processors as logical cores on your machine
# If we want Spark to run locally with 'k' worker threads, we can specify as "local[k]".
master = "local[4]"
# The `appName` field is a name to be shown on the Spark cluster UI page
app_name = "Assignment2B"
# Setup configuration parameters for Spark
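# spark.sql.streaming.checkpointLocation is the default checkpoint location used by streaming queries
spark_conf = SparkConf().setMaster(master).setAppName(app_name) \
                        .set("spark.sql.streaming.checkpointLocation", "checkpoints")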

# Import SparkContext and SparkSession classes
from pyspark import SparkContext # Spark
from pyspark.sql import SparkSession # Spark SQL

# Method 1: Using SparkSession
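# Use the region ID rather than a fixed offset so daylight saving time is handled
spark = (
    SparkSession.builder
    .config(conf=spark_conf)
    .config("spark.sql.session.timeZone", "Australia/Melbourne")
    .getOrCreate()
)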
sc = spark.sparkContext
sc.setLogLevel('ERROR')

from pyspark.sql import functions as F
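
The timezone requirement is easy to get subtly wrong, because a fixed offset such as GMT+10 matches Melbourne only outside daylight saving time. The sketch below uses only the Python standard library (it assumes the system time zone database is available) and the hypothetical helper `melbourne_utc_offset_hours` to show the offset changing over a year.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def melbourne_utc_offset_hours(dt_utc):
    """Return Melbourne's UTC offset in hours at the given UTC instant."""
    local = dt_utc.astimezone(ZoneInfo("Australia/Melbourne"))
    return local.utcoffset() / timedelta(hours=1)


# July is standard time (AEST); January is daylight saving time (AEDT).
winter_offset = melbourne_utc_offset_hours(datetime(2022, 7, 1, tzinfo=timezone.utc))
summer_offset = melbourne_utc_offset_hours(datetime(2022, 1, 1, tzinfo=timezone.utc))
print(winter_offset, summer_offset)
```

A region ID such as "Australia/Melbourne" follows both offsets, whereas a fixed offset is an hour out for roughly half the year.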

2.	Write code to define the data schema for the data files, following the data types suggested in the metadata file. Load the static datasets (e.g. building information) into data frames. (You can reuse your code from 2A.)


In [2]:
# Adapted from GPT
from pyspark.sql.types import (
    StructType, StructField,
    IntegerType, StringType, DecimalType, TimestampType
)

# 1. Meters Table
meters_schema = StructType([
    StructField("building_id", IntegerType(), False),
    StructField("meter_type", StringType(), False),   # Char(1) -> StringType
    StructField("ts", TimestampType(), False),
    StructField("value", DecimalType(15, 4), False),
    StructField("row_id", IntegerType(), False)
])

# 2. Buildings Table
buildings_schema = StructType([
    StructField("site_id", IntegerType(), False),
    StructField("building_id", IntegerType(), False),
    StructField("primary_use", StringType(), True),
    StructField("square_feet", IntegerType(), True),
    StructField("floor_count", IntegerType(), True),
    StructField("row_id", IntegerType(), False),
    StructField("year_built", IntegerType(), True),
    StructField("latent_y", DecimalType(6, 4), True),
    StructField("latent_s", DecimalType(6, 4), True),
    StructField("latent_r", DecimalType(6, 4), True)
])

# 3. Weather Table
# weather_schema = StructType([
#     StructField("site_id", StringType(), False),
#     StructField("timestamp", StringType(), False),
#     StructField("air_temperature", StringType(), True),
#     StructField("cloud_coverage", StringType(), True), # Is an Integer, but ends with a ".0", so read as a DecimalType
#     StructField("dew_temperature", StringType(), True),
#     StructField("sea_level_pressure", StringType(), True),
#     StructField("wind_direction", StringType(), True), # Is an Integer, but ends with a ".0", so read as a DecimalType
#     StructField("wind_speed", StringType(), True),
#     StructField("weather_ts", StringType(), False) # new field
# ])

weather_schema = StructType([
    StructField("site_id", StringType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("air_temperature", DecimalType(5, 3), True),
    StructField("cloud_coverage", DecimalType(5, 3), True), # Is an Integer, but ends with a ".0", so read as a DecimalType
    StructField("dew_temperature", DecimalType(5, 3), True),
    StructField("sea_level_pressure", DecimalType(8, 3), True),
    StructField("wind_direction", DecimalType(5, 3), True), # Is an Integer, but ends with a ".0", so read as a DecimalType
    StructField("wind_speed", DecimalType(5, 3), True),
    StructField("weather_ts", TimestampType(), False) # new field
])


# buildings_df = spark.read.csv(
#     "data/building_information.csv",
#     header=True,
#     schema=buildings_schema
# )

buildings_df = spark.read.csv(
    "data/new_building_information.csv",
    header=True,
    schema=buildings_schema
)



3.	Using the Kafka topic from the producer in Task 1, ingest the streaming data into Spark Streaming. Assume all data arrives as strings, except for the 'weather_ts' column, which arrives as an Int. Load the new building information CSV file into a data frame. Then transform the data frames into the proper formats following the metadata file schema, as in assignment 2A.


In [3]:


# Configuration
hostip = "192.168.0.6"  # change to your machine's IP address (previously "10.192.90.63")
topic = 'weather_data'

df_raw = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", f'{hostip}:9092') \
    .option("subscribe", topic) \
    .load()

df_str = df_raw.selectExpr("CAST(value AS STRING) as json_str")
# df_parsed = df_str.withColumn(
#     "data",
#     F.from_json(F.col("json_str"), F.ArrayType(weather_schema))
# )
# df_exploded = df_parsed.select(F.explode(F.col("data")).alias("record"))
# df_final = df_exploded






weather_df = (
    df_str
    .withColumn("data", F.from_json(F.col("json_str"), F.ArrayType(weather_schema)))
    .select(F.explode(F.col("data")).alias("r"))
    .select("r.*")
)
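
To make the "strings in, typed values out" step concrete, here is a minimal plain-Python sketch of what parsing one Kafka message involves. The message body, the subset of fields shown, and the reading of `weather_ts` as epoch seconds are all assumptions for illustration; the real producer output may differ.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal

# Hypothetical message body: a JSON array of records whose values are strings,
# except weather_ts, which is assumed here to be an integer epoch in seconds.
message = json.dumps([
    {"site_id": "0", "timestamp": "2022-05-20 22:00:00",
     "air_temperature": "27.8", "cloud_coverage": "", "weather_ts": 1760334500}
])


def to_decimal(value):
    """Convert a string to Decimal, treating empty or missing values as None."""
    return Decimal(value) if value not in (None, "") else None


def parse_record(raw):
    """Cast one raw record to typed Python values."""
    return {
        "site_id": raw["site_id"],
        "timestamp": datetime.strptime(raw["timestamp"], "%Y-%m-%d %H:%M:%S"),
        "air_temperature": to_decimal(raw.get("air_temperature")),
        "cloud_coverage": to_decimal(raw.get("cloud_coverage")),
        "weather_ts": datetime.fromtimestamp(raw["weather_ts"], tz=timezone.utc),
    }


records = [parse_record(r) for r in json.loads(message)]
print(records[0])
```

In Spark the same casting is expressed through the schema passed to `from_json`, so missing or unparseable values surface as NULL rather than raising.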


4.	Use a watermark on weather_ts: if data points are received more than 5 seconds late, discard them.

In [4]:
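# withWatermark returns a new DataFrame, so the result must be assigned back
weather_df = weather_df.withWatermark("weather_ts", "5 seconds")
weather_df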

DataFrame[site_id: string, timestamp: timestamp, air_temperature: decimal(5,3), cloud_coverage: decimal(5,3), dew_temperature: decimal(5,3), sea_level_pressure: decimal(8,3), wind_direction: decimal(5,3), wind_speed: decimal(5,3), weather_ts: timestamp]
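
The rule behind a 5-second watermark can be sketched in plain Python. This is a simplified model, not Spark's implementation: Spark updates the watermark once per micro-batch rather than per record, and rows are only dropped by stateful operators such as aggregations. The helper `split_by_watermark` is hypothetical.

```python
from datetime import datetime, timedelta


def split_by_watermark(event_times, delay=timedelta(seconds=5)):
    """Split event times into kept and dropped, using watermark = max seen - delay."""
    max_seen = None
    kept, dropped = [], []
    for ts in event_times:
        watermark = max_seen - delay if max_seen is not None else None
        if watermark is not None and ts < watermark:
            dropped.append(ts)  # older than the watermark: too late
        else:
            kept.append(ts)
            max_seen = ts if max_seen is None else max(max_seen, ts)
    return kept, dropped


base = datetime(2025, 10, 13, 5, 48, 20)
arrivals = [base,
            base + timedelta(seconds=10),
            base + timedelta(seconds=2),   # 8 s behind the latest event
            base + timedelta(seconds=6)]   # 4 s behind the latest event
kept, dropped = split_by_watermark(arrivals)
print(len(kept), len(dropped))
```

Only the record that trails the latest event time by more than 5 seconds is treated as late.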

5.	Perform the necessary transformation you used in A2A. (note: every student may have used different features, feel free to reuse the code you have written in A2A. If you built an end-to-end pipeline, you can ignore this task.) 

In [None]:

# from A2A which was from GPT
# Weather df
# Split timestamp to date, month, time bucket
weather_df = weather_df.withColumn("date", F.to_date("timestamp")).withColumn(
    "time",
    F.when(F.hour("timestamp") <= 5, "0-6h")
     .when(F.hour("timestamp") <= 11, "6-12h")
     .when(F.hour("timestamp") <= 17, "12-18h")
     .when(F.hour("timestamp") <= 23, "18-24h")
).withColumn("month", F.month("timestamp"))

# Choose which columns to impute
impute_cols = [
    "air_temperature",
    "cloud_coverage",
    "dew_temperature",
    "sea_level_pressure",
    "wind_direction",
    "wind_speed"
]

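# Global means once
# NOTE: first() is a batch action and raises an AnalysisException on a streaming
# DataFrame; here the means have to come from static data (e.g. the A2A training
# set) or be computed per micro-batch inside foreachBatch.
global_means = weather_df.select(
    *[F.mean(c).alias(c) for c in impute_cols]
).first().asDict()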

# Step 1: site_id + month
site_month_means = weather_df.groupBy("site_id", "month").agg(
    *[F.mean(c).alias(f"{c}_site_month_mean") for c in impute_cols]
)
weather_df = weather_df.join(site_month_means, on=["site_id", "month"], how="left")
for c in impute_cols:
    weather_df = weather_df.withColumn(
        c, F.coalesce(c, F.col(f"{c}_site_month_mean"))
    ).drop(f"{c}_site_month_mean")

# Garbage collection
weather_df = weather_df.unpersist()

# Step 2: site_id
site_means = weather_df.groupBy("site_id").agg(
    *[F.mean(c).alias(f"{c}_site_mean") for c in impute_cols]
)
weather_df = weather_df.join(site_means, on="site_id", how="left")
for c in impute_cols:
    weather_df = weather_df.withColumn(
        c, F.coalesce(c, F.col(f"{c}_site_mean"))
    ).drop(f"{c}_site_mean")

# Step 3: global fallback
for c in impute_cols:
    weather_df = weather_df.withColumn(
        c, F.coalesce(c, F.lit(global_means[c]))
    )
    
# Garbage collection
del site_month_means
del site_means
del global_means
spark.catalog.clearCache()
    
# Aggregate by time bucket
weather_df = (
    weather_df.groupBy("site_id", "date", "time", "month")
    .agg(
        F.mean("air_temperature").cast(DecimalType(5, 3)).alias("air_temperature"),
        F.mean("cloud_coverage").cast(DecimalType(5, 3)).alias("cloud_coverage"),
        F.mean("dew_temperature").cast(DecimalType(5, 3)).alias("dew_temperature"),
        F.mean("sea_level_pressure").cast(DecimalType(8, 3)).alias("sea_level_pressure"),
        F.mean("wind_direction").cast(DecimalType(5, 3)).alias("wind_direction"),
        F.mean("wind_speed").cast(DecimalType(5, 3)).alias("wind_speed"),        
    )
)

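# show() is not supported on a streaming DataFrame; inspect the rows through a
# writeStream sink instead (see the foreachBatch query further down).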

# No need to add median temp and peak-offpeak as our pipeline model later does not use them
# buildings_df has no "date" or "time" column, so the join key is site_id only
feature_df = buildings_df.join(weather_df, "site_id")
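
The three-level fallback used for imputation above (site and month mean, then site mean, then global mean) can be checked on a handful of rows in plain Python. This is a simplified sketch with the hypothetical helper `impute_hierarchical`: unlike the Spark code, it computes every mean from the observed values only, rather than recomputing the site mean after the first fill.

```python
from collections import defaultdict
from statistics import mean


def impute_hierarchical(rows, col):
    """Fill None in `col` from the site+month mean, then site mean, then global mean."""
    by_site_month, by_site, observed = defaultdict(list), defaultdict(list), []
    for r in rows:
        if r[col] is not None:
            by_site_month[(r["site_id"], r["month"])].append(r[col])
            by_site[r["site_id"]].append(r[col])
            observed.append(r[col])
    global_mean = mean(observed) if observed else None

    filled = []
    for r in rows:
        value = r[col]
        if value is None:
            key = (r["site_id"], r["month"])
            if by_site_month.get(key):
                value = mean(by_site_month[key])         # level 1: site + month
            elif by_site.get(r["site_id"]):
                value = mean(by_site[r["site_id"]])      # level 2: site
            else:
                value = global_mean                      # level 3: global
        filled.append({**r, col: value})
    return filled


rows = [
    {"site_id": 0, "month": 5, "air_temperature": 20.0},
    {"site_id": 0, "month": 5, "air_temperature": 24.0},
    {"site_id": 0, "month": 5, "air_temperature": None},  # filled from site + month
    {"site_id": 0, "month": 6, "air_temperature": None},  # filled from site
    {"site_id": 1, "month": 1, "air_temperature": None},  # filled from global
    {"site_id": 2, "month": 1, "air_temperature": 28.0},
]
result = impute_hierarchical(rows, "air_temperature")
print([r["air_temperature"] for r in result])
```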



6.	Load your pipeline model and perform the following aggregations:  
a)	Print the prediction from your model as a stream comes in.  
b)	Every 7 seconds, print the total energy consumption for each 6-hour interval, aggregated by building, and print 20 records. (Note: This is simulating energy data each day in a week)  
c)	Every 14 seconds, for each site, print the daily total energy consumption.  

In [None]:
# I want to predict (and print) with the model I created earlier as the stream comes in. The model was saved as:
# ```
# # Define the path to save the model
# model_path = "models/best_model_rmsle"

# # Save the better trained model. 
# if rf_cv_metrics["rmsle"] < gbt_cv_metrics["rmsle"]:
#     rf_cv_model.bestModel.save(model_path)
# else:
#     gbt_cv_model.bestModel.save(model_path)
# ```
# Where exactly do I load and run the model (is it model.transform(feature_df)?) to predict and append another column, "value"?
# feature_df is df_final from above, with some imputation and a join applied.

In [None]:
# 6a
# --- 1. Model setup ---
from pyspark.ml import PipelineModel

# The saved artefact is the cross-validator's bestModel, so it is loaded as a
# PipelineModel (assuming the cross-validated estimator was a Pipeline).
model = PipelineModel.load("models/best_model_rmsle")

# --- 2. Apply model ---
predictions = model.transform(feature_df).withColumnRenamed("prediction", "power_usage")

# --- 3. Output ---
query = (
    predictions
        # "timestamp" no longer exists after the time-bucket aggregation
        .select("building_id", "site_id", "date", "time", "power_usage")
        .writeStream
        .format("console")
        .outputMode("append")
        .option("truncate", False)
        .start()
)

query.awaitTermination()
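
Parts b) and c) differ only in their grouping columns and trigger interval, so one possible approach is a shared `foreachBatch` handler. The sketch below is an assumption-laden outline rather than a tested solution: `make_totals_printer` and `start_totals_query` are hypothetical helpers, `stream_df` is assumed to be a streaming DataFrame that already carries every column the pipeline expects, and the totals cover each micro-batch only, not the whole stream.

```python
def make_totals_printer(model, group_cols, n_rows=20):
    """Build a foreachBatch handler that predicts and prints summed predictions."""
    def print_totals(batch_df, epoch_id):
        # Imported here so the helper can be defined before Spark is initialised
        from pyspark.sql import functions as F

        predictions = model.transform(batch_df)
        totals = predictions.groupBy(*group_cols).agg(
            F.sum("prediction").alias("total_power_usage")
        )
        totals.show(n_rows, truncate=False)

    return print_totals


def start_totals_query(stream_df, model, group_cols, interval, n_rows=20):
    """Start a streaming query that prints per-batch totals at a fixed interval."""
    return (
        stream_df.writeStream
        .foreachBatch(make_totals_printer(model, group_cols, n_rows))
        .trigger(processingTime=interval)
        .start()
    )


# Grouping columns and intervals for 6b and 6c (column names follow the cells above)
INTERVAL_TOTALS = (["building_id", "date", "time"], "7 seconds")
DAILY_SITE_TOTALS = (["site_id", "date"], "14 seconds")
```

With these in place, `start_totals_query(feature_df, model, *INTERVAL_TOTALS)` would correspond to 6b and `start_totals_query(feature_df, model, *DAILY_SITE_TOTALS)` to 6c. Inside `foreachBatch` each micro-batch is a static DataFrame, so batch-only operations such as `show()` are allowed.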

In [6]:
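def foreach_batch_function(df, epoch_id):
    # Print the first 10 rows of each micro-batch without truncating columns
    df.show(10, False)

# foreachBatch defines the sink itself, so no separate format("console") is needed
query = (
    weather_df.writeStream
        .foreachBatch(foreach_batch_function)
        .outputMode("append")
        .start()
)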

query.awaitTermination()

+-------+---------+---------------+--------------+---------------+------------------+--------------+----------+----------+----+----+-----+
|site_id|timestamp|air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts|date|time|month|
+-------+---------+---------------+--------------+---------------+------------------+--------------+----------+----------+----+----+-----+
+-------+---------+---------------+--------------+---------------+------------------+--------------+----------+----------+----+----+-----+

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|0      |2022-05-20 22:00:00|27.800         |NULL          |20.000         |1016.100          |NULL          |5.700     |2025-10-13 05:48:20.777813|2022-05-20|18-24h|5    |
|0      |2022-05-20 23:00:00|27.200         |NULL          |19.400         |1016.700          |0.000         |0.000     |2025-10-13 05:48:20.77783 |2022-05-20|18-24h|5    |
|0      |2022-05-21 00:00:00|26.700         |6.000         |19.400         |1017.500          |0.000         |0.000     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|0      |2022-06-09 22:00:00|25.000         |NULL          |23.300         |1014.800          |NULL          |3.100     |2025-10-13 05:48:40.811425|2022-06-09|18-24h|6    |
|0      |2022-06-09 23:00:00|25.600         |NULL          |23.300         |1015.000          |NULL          |2.100     |2025-10-13 05:48:40.811443|2022-06-09|18-24h|6    |
|0      |2022-06-10 00:00:00|25.000         |6.000         |23.300         |1015.200          |NULL          |1.500     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|0      |2022-06-29 22:00:00|26.100         |NULL          |23.300         |1014.700          |NULL          |3.100     |2025-10-13 05:49:00.842813|2022-06-29|18-24h|6    |
|0      |2022-06-29 23:00:00|27.800         |NULL          |23.900         |1014.700          |NULL          |3.100     |2025-10-13 05:49:00.842831|2022-06-29|18-24h|6    |
|0      |2022-06-30 00:00:00|26.700         |8.000         |23.900         |1015.300          |NULL          |3.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|0      |2022-07-19 22:00:00|25.000         |NULL          |22.800         |1020.900          |NULL          |2.100     |2025-10-13 05:49:20.877458|2022-07-19|18-24h|7    |
|0      |2022-07-19 23:00:00|25.600         |NULL          |22.800         |1019.700          |70.000        |1.500     |2025-10-13 05:49:20.87748 |2022-07-19|18-24h|7    |
|0      |2022-07-20 00:00:00|26.100         |4.000         |22.200         |1019.700          |0.000         |0.000     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|0      |2022-08-08 22:00:00|26.100         |NULL          |22.800         |1014.900          |NULL          |4.100     |2025-10-13 05:49:40.914788|2022-08-08|18-24h|8    |
|0      |2022-08-08 23:00:00|26.700         |NULL          |22.800         |1014.600          |NULL          |6.200     |2025-10-13 05:49:40.914804|2022-08-08|18-24h|8    |
|0      |2022-08-09 00:00:00|25.000         |6.000         |22.200         |1014.700          |NULL          |4.600     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|0      |2022-08-28 22:00:00|27.800         |NULL          |23.300         |1013.700          |NULL          |5.700     |2025-10-13 05:50:00.945254|2022-08-28|18-24h|8    |
|0      |2022-08-28 23:00:00|27.800         |NULL          |23.900         |1013.600          |70.000        |3.600     |2025-10-13 05:50:00.94527 |2022-08-28|18-24h|8    |
|0      |2022-08-29 00:00:00|27.200         |6.000         |23.300         |1013.900          |80.000        |4.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|0      |2022-09-17 22:00:00|28.300         |NULL          |23.900         |1015.300          |NULL          |2.100     |2025-10-13 05:50:20.978832|2022-09-17|18-24h|9    |
|0      |2022-09-17 23:00:00|28.300         |4.000         |24.400         |1015.300          |NULL          |2.100     |2025-10-13 05:50:20.978849|2022-09-17|18-24h|9    |
|0      |2022-09-18 00:00:00|28.300         |4.000         |23.300         |1015.600          |50.000        |4.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|0      |2022-10-07 22:00:00|24.400         |NULL          |22.200         |1000.100          |NULL          |12.900    |2025-10-13 05:50:41.011104|2022-10-07|18-24h|10   |
|0      |2022-10-07 23:00:00|24.400         |NULL          |22.200         |1000.800          |NULL          |10.800    |2025-10-13 05:50:41.01112 |2022-10-07|18-24h|10   |
|0      |2022-10-08 00:00:00|25.000         |8.000         |22.200         |1001.800          |NULL          |9.800     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|0      |2022-10-27 22:00:00|25.000         |NULL          |16.100         |1020.800          |70.000        |5.700     |2025-10-13 05:51:01.040945|2022-10-27|18-24h|10   |
|0      |2022-10-27 23:00:00|24.400         |NULL          |16.100         |1021.000          |60.000        |4.600     |2025-10-13 05:51:01.040961|2022-10-27|18-24h|10   |
|0      |2022-10-28 00:00:00|23.900         |8.000         |16.700         |1021.600          |40.000        |4.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|0      |2022-11-16 22:00:00|23.300         |2.000         |8.300          |1013.600          |NULL          |3.100     |2025-10-13 05:51:21.069106|2022-11-16|18-24h|11   |
|0      |2022-11-16 23:00:00|20.600         |0.000         |8.300          |1014.100          |70.000        |3.600     |2025-10-13 05:51:21.069124|2022-11-16|18-24h|11   |
|0      |2022-11-17 00:00:00|18.900         |0.000         |9.400          |1014.900          |70.000        |2.600     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|0      |2022-12-06 22:00:00|24.400         |NULL          |22.200         |1011.400          |NULL          |4.600     |2025-10-13 05:51:41.101218|2022-12-06|18-24h|12   |
|0      |2022-12-06 23:00:00|23.300         |NULL          |21.700         |1012.200          |NULL          |3.100     |2025-10-13 05:51:41.101235|2022-12-06|18-24h|12   |
|0      |2022-12-07 00:00:00|23.300         |2.000         |21.100         |1012.600          |NULL          |3.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|0      |2022-12-26 22:00:00|24.400         |NULL          |18.300         |1025.500          |NULL          |5.100     |2025-10-13 05:52:01.130622|2022-12-26|18-24h|12   |
|0      |2022-12-26 23:00:00|23.300         |NULL          |18.900         |1025.700          |NULL          |3.600     |2025-10-13 05:52:01.130641|2022-12-26|18-24h|12   |
|0      |2022-12-27 00:00:00|23.300         |8.000         |19.400         |1026.200          |90.000        |1.500     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-01-15 22:00:00|2.800          |NULL          |-0.700         |1025.100          |NULL          |3.600     |2025-10-13 05:52:21.16206 |2022-01-15|18-24h|1    |
|1      |2022-01-15 23:00:00|2.600          |NULL          |0.000          |1025.800          |NULL          |2.600     |2025-10-13 05:52:21.162078|2022-01-15|18-24h|1    |
|1      |2022-01-16 00:00:00|2.400          |NULL          |0.000          |1026.400          |NULL          |3.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-02-04 22:00:00|10.500         |NULL          |9.100          |1026.100          |NULL          |4.600     |2025-10-13 05:52:41.197672|2022-02-04|18-24h|2    |
|1      |2022-02-04 23:00:00|10.700         |NULL          |9.300          |1025.900          |NULL          |5.100     |2025-10-13 05:52:41.197688|2022-02-04|18-24h|2    |
|1      |2022-02-05 00:00:00|10.500         |NULL          |8.700          |1025.800          |NULL          |5.700     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-02-25 09:00:00|2.000          |NULL          |-0.600         |1017.400          |NULL          |2.600     |2025-10-13 05:53:01.227931|2022-02-25|6-12h |2    |
|1      |2022-02-25 10:00:00|3.000          |NULL          |-0.100         |1017.600          |NULL          |3.100     |2025-10-13 05:53:01.227947|2022-02-25|6-12h |2    |
|1      |2022-02-25 11:00:00|3.300          |NULL          |-0.200         |1017.800          |NULL          |3.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-03-16 09:00:00|7.900          |NULL          |2.800          |1030.000          |50.000        |5.700     |2025-10-13 05:53:21.260998|2022-03-16|6-12h |3    |
|1      |2022-03-16 10:00:00|7.900          |NULL          |2.100          |1030.200          |70.000        |6.200     |2025-10-13 05:53:21.261016|2022-03-16|6-12h |3    |
|1      |2022-03-16 11:00:00|7.900          |NULL          |1.800          |1030.100          |50.000        |5.700     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-04-05 14:00:00|13.500         |NULL          |3.300          |1007.600          |NULL          |6.700     |2025-10-13 05:53:41.295043|2022-04-05|12-18h|4    |
|1      |2022-04-05 15:00:00|15.200         |NULL          |2.200          |1007.800          |NULL          |5.100     |2025-10-13 05:53:41.295087|2022-04-05|12-18h|4    |
|1      |2022-04-05 16:00:00|15.000         |NULL          |1.100          |1008.100          |NULL          |5.700     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-04-25 15:00:00|9.100          |NULL          |5.600          |1007.300          |NULL          |5.700     |2025-10-13 05:54:01.328728|2022-04-25|12-18h|4    |
|1      |2022-04-25 16:00:00|11.400         |NULL          |6.000          |1006.400          |NULL          |6.200     |2025-10-13 05:54:01.328746|2022-04-25|12-18h|4    |
|1      |2022-04-25 17:00:00|11.800         |NULL          |2.200          |1006.100          |NULL          |5.700     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-05-15 15:00:00|16.200         |NULL          |2.800          |1021.500          |NULL          |3.600     |2025-10-13 05:54:21.365236|2022-05-15|12-18h|5    |
|1      |2022-05-15 16:00:00|16.200         |NULL          |3.500          |1021.300          |NULL          |4.100     |2025-10-13 05:54:21.365254|2022-05-15|12-18h|5    |
|1      |2022-05-15 17:00:00|16.900         |NULL          |3.700          |1021.000          |NULL          |3.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-06-04 15:00:00|18.300         |NULL          |15.000         |1017.400          |NULL          |2.600     |2025-10-13 05:54:41.397211|2022-06-04|12-18h|6    |
|1      |2022-06-04 16:00:00|18.600         |NULL          |15.700         |1017.300          |NULL          |2.600     |2025-10-13 05:54:41.397229|2022-06-04|12-18h|6    |
|1      |2022-06-04 17:00:00|18.600         |NULL          |15.800         |1017.300          |30.000        |2.600     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-06-24 15:00:00|20.600         |NULL          |11.900         |1017.100          |NULL          |9.300     |2025-10-13 05:55:01.429088|2022-06-24|12-18h|6    |
|1      |2022-06-24 16:00:00|21.200         |NULL          |12.700         |1016.900          |NULL          |7.700     |2025-10-13 05:55:01.429105|2022-06-24|12-18h|6    |
|1      |2022-06-24 17:00:00|19.400         |NULL          |12.300         |1017.100          |NULL          |8.200     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-07-14 15:00:00|20.300         |NULL          |8.000          |1024.800          |NULL          |3.600     |2025-10-13 05:55:21.462399|2022-07-14|12-18h|7    |
|1      |2022-07-14 16:00:00|21.800         |NULL          |7.100          |1024.800          |NULL          |3.600     |2025-10-13 05:55:21.462416|2022-07-14|12-18h|7    |
|1      |2022-07-14 17:00:00|20.200         |NULL          |6.600          |1025.000          |NULL          |4.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-08-03 15:00:00|23.300         |NULL          |14.400         |1006.700          |NULL          |7.700     |2025-10-13 05:55:41.494561|2022-08-03|12-18h|8    |
|1      |2022-08-03 16:00:00|22.600         |NULL          |13.700         |1006.500          |NULL          |8.800     |2025-10-13 05:55:41.494578|2022-08-03|12-18h|8    |
|1      |2022-08-03 17:00:00|21.800         |NULL          |13.700         |1006.300          |NULL          |8.200     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-08-23 15:00:00|30.200         |0.000         |13.100         |1020.300          |NULL          |6.700     |2025-10-13 05:56:01.5279  |2022-08-23|12-18h|8    |
|1      |2022-08-23 16:00:00|30.100         |0.000         |14.600         |1020.000          |NULL          |6.200     |2025-10-13 05:56:01.527915|2022-08-23|12-18h|8    |
|1      |2022-08-23 17:00:00|28.900         |0.000         |15.000         |1020.000          |NULL          |5.700     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-09-12 15:00:00|24.400         |NULL          |15.600         |1012.700          |NULL          |3.600     |2025-10-13 05:56:21.561351|2022-09-12|12-18h|9    |
|1      |2022-09-12 16:00:00|24.100         |NULL          |15.400         |1012.900          |NULL          |5.100     |2025-10-13 05:56:21.561366|2022-09-12|12-18h|9    |
|1      |2022-09-12 17:00:00|23.300         |NULL          |15.500         |1012.600          |NULL          |3.600     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-10-02 16:00:00|17.100         |NULL          |6.500          |1019.000          |NULL          |2.100     |2025-10-13 05:56:41.59554 |2022-10-02|12-18h|10   |
|1      |2022-10-02 17:00:00|15.700         |NULL          |5.600          |1019.600          |NULL          |2.600     |2025-10-13 05:56:41.595556|2022-10-02|12-18h|10   |
|1      |2022-10-02 18:00:00|13.900         |0.000         |6.800          |1020.600          |NULL          |2.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-10-22 15:00:00|13.200         |NULL          |6.300          |1015.200          |20.000        |2.100     |2025-10-13 05:57:01.62359 |2022-10-22|12-18h|10   |
|1      |2022-10-22 16:00:00|12.800         |NULL          |6.100          |1015.100          |30.000        |4.100     |2025-10-13 05:57:01.623607|2022-10-22|12-18h|10   |
|1      |2022-10-22 17:00:00|12.200         |NULL          |6.400          |1015.100          |40.000        |3.600     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-11-11 15:00:00|9.700          |0.000         |3.500          |1022.900          |NULL          |2.100     |2025-10-13 05:57:21.653509|2022-11-11|12-18h|11   |
|1      |2022-11-11 16:00:00|7.800          |0.000         |3.000          |1023.100          |NULL          |2.100     |2025-10-13 05:57:21.653527|2022-11-11|12-18h|11   |
|1      |2022-11-11 17:00:00|7.100          |0.000         |2.900          |1023.300          |NULL          |1.500     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-12-01 18:00:00|4.700          |0.000         |2.000          |1029.100          |NULL          |2.600     |2025-10-13 05:57:41.686399|2022-12-01|18-24h|12   |
|1      |2022-12-01 19:00:00|4.500          |0.000         |2.000          |1028.900          |NULL          |2.600     |2025-10-13 05:57:41.686417|2022-12-01|18-24h|12   |
|1      |2022-12-01 20:00:00|3.400          |0.000         |1.700          |1028.500          |NULL          |2.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|1      |2022-12-21 18:00:00|10.000         |NULL          |7.700          |1020.900          |NULL          |3.600     |2025-10-13 05:58:01.717329|2022-12-21|18-24h|12   |
|1      |2022-12-21 19:00:00|10.200         |NULL          |8.200          |1021.400          |NULL          |3.600     |2025-10-13 05:58:01.717347|2022-12-21|18-24h|12   |
|1      |2022-12-21 20:00:00|10.000         |NULL          |8.300          |1022.100          |NULL          |3.600     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|2      |2022-01-10 18:00:00|10.000         |8.000         |3.900          |1022.300          |90.000        |2.600     |2025-10-13 05:58:21.750936|2022-01-10|18-24h|1    |
|2      |2022-01-10 19:00:00|11.100         |NULL          |2.800          |1021.600          |50.000        |2.600     |2025-10-13 05:58:21.750952|2022-01-10|18-24h|1    |
|2      |2022-01-10 20:00:00|11.700         |NULL          |2.800          |1020.500          |0.000         |0.000     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|2      |2022-01-30 18:00:00|16.700         |2.000         |-1.700         |1014.300          |NULL          |4.100     |2025-10-13 05:58:41.783274|2022-01-30|18-24h|1    |
|2      |2022-01-30 19:00:00|18.900         |2.000         |-2.800         |1013.200          |NULL          |2.600     |2025-10-13 05:58:41.783291|2022-01-30|18-24h|1    |
|2      |2022-01-30 20:00:00|21.100         |4.000         |-3.900         |1011.900          |NULL          |2.600     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|2      |2022-02-19 18:00:00|21.700         |6.000         |0.000          |1016.200          |NULL          |3.100     |2025-10-13 05:59:01.814785|2022-02-19|18-24h|2    |
|2      |2022-02-19 19:00:00|23.900         |NULL          |-1.100         |1015.800          |80.000        |2.100     |2025-10-13 05:59:01.814801|2022-02-19|18-24h|2    |
|2      |2022-02-19 20:00:00|25.600         |NULL          |-2.200         |1015.000          |NULL          |2.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|2      |2022-03-10 18:00:00|22.800         |2.000         |-2.800         |1016.100          |NULL          |2.100     |2025-10-13 05:59:21.842457|2022-03-10|18-24h|3    |
|2      |2022-03-10 19:00:00|25.000         |2.000         |-3.900         |1015.600          |NULL          |2.100     |2025-10-13 05:59:21.842474|2022-03-10|18-24h|3    |
|2      |2022-03-10 20:00:00|27.200         |0.000         |-5.000         |1014.800          |NULL          |3.100     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|2      |2022-03-30 18:00:00|16.100         |2.000         |-1.100         |1012.800          |50.000        |2.600     |2025-10-13 05:59:41.874769|2022-03-30|18-24h|3    |
|2      |2022-03-30 19:00:00|17.200         |4.000         |-2.200         |1012.300          |NULL          |2.600     |2025-10-13 05:59:41.874786|2022-03-30|18-24h|3    |
|2      |2022-03-30 20:00:00|19.400         |4.000         |-2.200         |1011.500          |NULL          |4.600     |2025-10-13 05:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|2      |2022-04-19 18:00:00|26.700         |2.000         |-5.000         |1014.300          |0.000         |0.000     |2025-10-13 06:00:01.904812|2022-04-19|18-24h|4    |
|2      |2022-04-19 19:00:00|28.900         |2.000         |-5.000         |1013.500          |NULL          |2.100     |2025-10-13 06:00:01.904829|2022-04-19|18-24h|4    |
|2      |2022-04-19 20:00:00|30.000         |4.000         |-6.100         |1012.600          |NULL          |2.100     |2025-10-13 06:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|2      |2022-05-09 18:00:00|24.400         |2.000         |5.000          |1011.500          |0.000         |0.000     |2025-10-13 06:00:21.937982|2022-05-09|18-24h|5    |
|2      |2022-05-09 19:00:00|25.600         |2.000         |5.000          |1010.900          |NULL          |1.500     |2025-10-13 06:00:21.937998|2022-05-09|18-24h|5    |
|2      |2022-05-09 20:00:00|27.200         |2.000         |4.400          |1010.000          |NULL          |1.500     |2025-10-13 06:

+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|2      |2022-05-29 18:00:00|31.100         |0.000         |2.200          |1011.000          |NULL          |2.100     |2025-10-13 06:00:41.961786|2022-05-29|18-24h|5    |
|2      |2022-05-29 19:00:00|32.800         |0.000         |0.600          |1010.400          |NULL          |5.100     |2025-10-13 06:00:41.961805|2022-05-29|18-24h|5    |
|2      |2022-05-29 20:00:00|33.300         |0.000         |-1.100         |1009.500          |NULL          |NULL      |2025-10-13 06:

ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/opt/conda/lib/python3.10/site-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/opt/conda/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|site_id|timestamp          |air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|weather_ts                |date      |time  |month|
+-------+-------------------+---------------+--------------+---------------+------------------+--------------+----------+--------------------------+----------+------+-----+
|2      |2022-06-03 18:00:00|37.200         |2.000         |0.600          |1010.800          |0.000         |0.000     |2025-10-13 06:00:46.970298|2022-06-03|18-24h|6    |
|2      |2022-06-03 19:00:00|39.400         |2.000         |-0.600         |1010.100          |NULL          |3.600     |2025-10-13 06:00:46.970313|2022-06-03|18-24h|6    |
|2      |2022-06-03 20:00:00|40.600         |2.000         |-0.600         |1009.300          |NULL          |3.100     |2025-10-13 06:

KeyboardInterrupt: 
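
The traceback above appears when the kernel is interrupted while the notebook is blocked waiting on the streaming query. One way to avoid it is to let the query run for a bounded time and then stop it explicitly. The following is a minimal sketch, assuming `query` is the `StreamingQuery` returned by `.start()`; the helper name `run_for` is hypothetical and not part of the original notebook.

```python
def run_for(query, seconds):
    """Let a streaming query run for at most `seconds`, then stop it.

    `query` is assumed to be a pyspark.sql.streaming.StreamingQuery.
    Returns True if the query terminated on its own within the timeout.
    """
    try:
        # awaitTermination(timeout) blocks for up to `timeout` seconds and
        # returns whether the query has terminated in that time.
        finished = query.awaitTermination(seconds)
    except KeyboardInterrupt:
        # A manual interrupt is treated like a timeout.
        finished = False
    finally:
        # Always stop the query so it does not keep running in the background.
        query.stop()
    return finished
```

For example, `run_for(query, 60)` would print console batches for about a minute and then stop the query without relying on a kernel interrupt.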

In [None]:
# 6b


In [None]:
# 6c


7.	Save the data from task 6 to Parquet files as streams. (Hint: Parquet files support streaming reads and writes. New files are added to the output directory as new batches arrive.)
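
A possible shape for this step is sketched below. It assumes the streaming DataFrames from task 6 already exist; the helper names, the `parquet` base directory, and the sink name used in the usage note are hypothetical placeholders, not part of the original notebook. Each sink gets its own checkpoint directory, since the file sink requires one per query.

```python
import os


def parquet_sink_options(name, base_dir="parquet"):
    """Return the output path and a dedicated checkpoint directory for one sink."""
    return {
        "path": os.path.join(base_dir, name),
        "checkpointLocation": os.path.join(base_dir, "_checkpoints", name),
    }


def write_stream_to_parquet(stream_df, name, base_dir="parquet"):
    """Start a query that appends each micro-batch of `stream_df` to Parquet files.

    `stream_df` is assumed to be a streaming DataFrame (for example the 6a result).
    """
    opts = parquet_sink_options(name, base_dir)
    return (
        stream_df.writeStream
        .format("parquet")
        .outputMode("append")  # the file sink supports append mode only
        .option("path", opts["path"])
        .option("checkpointLocation", opts["checkpointLocation"])
        .queryName(f"{name}_to_parquet")
        .start()
    )
```

A call such as `write_stream_to_parquet(df_6a, "stream_6a")` would then return a `StreamingQuery` handle. If the DataFrame contains an aggregation, append mode also needs a watermark on the event-time column before the write will start.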

In [None]:
# 7a (save 6a)


In [None]:
# 7b (save 6b)


In [None]:
# 7c (save 6c)

8.	Read the Parquet files from task 7 as data streams and send them to Kafka topics with appropriate names.
(Note: Read the Parquet files as streaming data frames, so that messages are sent to the Kafka topic whenever new data appears in the Parquet output.)
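
The sketch below shows one way this could look. It assumes `spark` is the session created earlier, that the schema of each Parquet output is known, and that a Kafka broker is reachable; the helper names, topic, broker address, and checkpoint path are hypothetical and must be replaced with real values. Rows are serialised to JSON because the Kafka sink expects a string or binary `value` column.

```python
def kafka_sink_options(topic, bootstrap_servers, checkpoint_dir):
    """Collect the options the Kafka sink needs for one topic."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "topic": topic,
        "checkpointLocation": checkpoint_dir,
    }


def read_parquet_stream(spark, path, schema):
    """Read a Parquet directory as a streaming DataFrame.

    Streaming file sources need an explicit schema unless schema
    inference for streaming has been enabled in the Spark configuration.
    """
    return spark.readStream.schema(schema).parquet(path)


def write_stream_to_kafka(stream_df, topic, bootstrap_servers, checkpoint_dir):
    """Send every new row of `stream_df` to `topic` as one JSON message."""
    # Pack all columns into a single JSON string named `value`.
    payload = stream_df.selectExpr("to_json(struct(*)) AS value")
    return (
        payload.writeStream
        .format("kafka")
        .options(**kafka_sink_options(topic, bootstrap_servers, checkpoint_dir))
        .start()
    )
```

Each of the three streams would call `read_parquet_stream` on its own task 7 directory and `write_stream_to_kafka` with its own topic name and checkpoint directory, so that the queries do not share state.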

In [None]:
# Stream 1


In [None]:
# Stream 2


In [None]:
# Stream 3
