# FIT5202 Assignment 2A : Building Models for Building Energy Prediction

## Table of Contents
*  
    * [Part 1 : Data Loading, Transformation and Exploration](#part-1)
    * [Part 2 : Feature extraction and ML training](#part-2)
    * [Part 3 : Hyperparameter Tuning and Model Optimisation](#part-3)  
Please add code/markdown cells as needed.

# Part 1: Data Loading, Transformation and Exploration <a class="anchor" name="part-1"></a>
## 1.1 Data Loading
In this section, you must load the given datasets into PySpark DataFrames and use DataFrame functions to process the data. For plotting, various visualisation packages can be used, but please ensure that you have included instructions to install the additional packages and that the installation will be successful in the provided Docker container (in case your marker needs to clear the notebook and rerun it).

### 1.1.1 Data Loading <a class="anchor" name="1.1"></a>
1.1.1 Write the code to create a SparkSession. For creating the SparkSession, you need to use a SparkConf object to configure the Spark app with a proper application name, to ensure the maximum partition size does not exceed 32MB, and to run locally with all CPU cores on your machine

In [1]:
# Import SparkConf class into program
from pyspark import SparkConf

# local[*]: run Spark in local mode with as many working processors as logical cores on your machine
# If we want Spark to run locally with 'k' worker threads, we can specify as "local[k]".
master = "local[*]"
# The `appName` field is a name to be shown on the Spark cluster UI page
app_name = "Assignment2A"
# Setup configuration parameters for Spark
spark_conf = SparkConf().setMaster(master).setAppName(app_name)

# Import SparkContext and SparkSession classes
from pyspark import SparkContext # Spark
from pyspark.sql import SparkSession # Spark SQL

# Method 1: Using SparkSession
# spark = SparkSession.builder.config(conf=spark_conf).config("spark.sql.session.timeZone", "GMT+10").getOrCreate()
spark = SparkSession.builder.config(conf=spark_conf).config("spark.sql.files.maxPartitionBytes", "33554432").getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('ERROR')
from pyspark.sql import functions as F

1.1.2 Write code to define the schemas for the datasets, following the data types suggested in the metadata file. 

In [2]:
# Adapted from GPT
from pyspark.sql.types import (
    StructType, StructField,
    IntegerType, StringType, DecimalType, TimestampType
)

# 1. Meters Table
meters_schema = StructType([
    StructField("building_id", IntegerType(), False),
    StructField("meter_type", StringType(), False),   # Char(1) -> StringType
    StructField("ts", TimestampType(), False),
    StructField("value", DecimalType(10, 4), False),  # Adjust precision/scale if needed
    StructField("row_id", IntegerType(), False)
])

# 2. Buildings Table
buildings_schema = StructType([
    StructField("site_id", IntegerType(), False),
    StructField("building_id", IntegerType(), False),
    StructField("primary_use", StringType(), True),
    StructField("square_feet", IntegerType(), True),
    StructField("floor_count", IntegerType(), True),
    StructField("row_id", IntegerType(), False),
    StructField("year_built", IntegerType(), True),
    StructField("latent_y", DecimalType(10, 4), True),
    StructField("latent_s", DecimalType(10, 4), True),
    StructField("latent_r", DecimalType(10, 4), True)
])

# 3. Weather Table
weather_schema = StructType([
    StructField("site_id", IntegerType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("air_temperature", DecimalType(6, 2), True),
    StructField("cloud_coverage", DecimalType(6, 2), True), # Is an Integer, but ends with a ".0", so read as a DecimalType
    StructField("dew_temperature", DecimalType(6, 2), True),
    StructField("sea_level_pressure", DecimalType(8, 2), True),
    StructField("wind_direction", DecimalType(6, 2), True), # Is an Integer, but ends with a ".0", so read as a DecimalType
    StructField("wind_speed", DecimalType(6, 2), True)
])


1.1.3 Using your schemas, load the CSV files into separate data frames. Print the schemas of all data frames. 

In [3]:
# from GPT
meters_df = spark.read.csv(
    "data/meters.csv",
    header=True,
    schema=meters_schema
)

buildings_df = spark.read.csv(
    "data/building_information.csv",
    header=True,
    schema=buildings_schema
)

weather_df = spark.read.csv(
    "data/weather.csv",
    header=True,
    schema=weather_schema
)
print("Meters DF:")
meters_df.printSchema()
print("Buildings DF:")
buildings_df.printSchema()
print("Weather DF")
weather_df.printSchema()


Meters DF:
root
 |-- building_id: integer (nullable = true)
 |-- meter_type: string (nullable = true)
 |-- ts: timestamp (nullable = true)
 |-- value: decimal(10,4) (nullable = true)
 |-- row_id: integer (nullable = true)

Buildings DF:
root
 |-- site_id: integer (nullable = true)
 |-- building_id: integer (nullable = true)
 |-- primary_use: string (nullable = true)
 |-- square_feet: integer (nullable = true)
 |-- floor_count: integer (nullable = true)
 |-- row_id: integer (nullable = true)
 |-- year_built: integer (nullable = true)
 |-- latent_y: decimal(10,4) (nullable = true)
 |-- latent_s: decimal(10,4) (nullable = true)
 |-- latent_r: decimal(10,4) (nullable = true)

Weather DF
root
 |-- site_id: integer (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- air_temperature: decimal(6,2) (nullable = true)
 |-- cloud_coverage: decimal(6,2) (nullable = true)
 |-- dew_temperature: decimal(6,2) (nullable = true)
 |-- sea_level_pressure: decimal(8,2) (nullable = true)
 |-- 

### 1.2 Data Transformation to Create Features <a class="anchor" name="1.2"></a>
In this section, we primarily have three tasks:  
1.2.1 The dataset includes sensors with hourly energy measurements. However, as a grid operator, we don’t need this level of granularity and lowering it can reduce the amount of data we need to process. For each building, we will aggregate the metered energy consumption in 6-hour intervals (0:00-5:59, 6:00-11:59, 12:00-17:59, 18:00-23:59). This will be our target (label) column for this prediction. Perform the aggregation for each building.


In [4]:
# Adapted from A1 and GPT
# Meters, weather have ts, timestamp
# meters_df = meters_df.withColumn(
#     "time",
#     F.when(F.hour(F.col("ts")) <= 5, "0-6h")
#      .when(F.hour(F.col("ts")) <= 11, "6-12h")
#      .when(F.hour(F.col("ts")) <= 17, "12-18h")
#      .when(F.hour(F.col("ts")) <= 23, "18-24h")    
# ).filter(F.col("time").isNotNull())

# Meters
meters_df = meters_df.withColumn("date", F.to_date("ts")).withColumn(
    "time",
    F.when(F.hour("ts") <= 5, "0-6h")
     .when(F.hour("ts") <= 11, "6-12h")
     .when(F.hour("ts") <= 17, "12-18h")
     .when(F.hour("ts") <= 23, "18-24h")
)

meters_df = (
    meters_df.groupBy("building_id", "meter_type", "date", "time")
    .agg(F.sum("value").alias("value_sum"))
)
meters_df.show(5)


+-----------+----------+----------+------+---------+
|building_id|meter_type|      date|  time|value_sum|
+-----------+----------+----------+------+---------+
|        244|         c|2022-01-01| 6-12h|  36.3642|
|       1214|         c|2022-01-01| 6-12h| 444.3222|
|       1259|         c|2022-01-01|12-18h| 385.0202|
|        920|         c|2022-01-01|18-24h| 270.4458|
|        945|         c|2022-01-01|18-24h|1107.7720|
+-----------+----------+----------+------+---------+
only showing top 5 rows



In the weather dataset, there are some missing values (null or empty strings). It may lower the quality of our model. Imputation is a way to deal with those missing values. Imputation is the process of replacing missing values in a dataset with substituted, or "imputed," values. It's a way to handle gaps in your data so that you can still analyse it effectively without having to delete incomplete records.  
1.2.2 Refer to the Spark MLLib imputation API and fill in the missing values in the weather dataset. You can use mean values as the strategy.  https://spark.apache.org/docs/3.5.5/api/python/reference/api/pyspark.ml.feature.Imputer.html

In [5]:
# from GPT
from pyspark.sql import functions as F

impute_cols = ["cloud_coverage", "sea_level_pressure", "wind_direction"]
weather_df = (
    weather_df
    .withColumn("month", F.month(F.col("date")))    
)
# Compute site-level means
site_means = weather_df.groupBy("site_id","month").agg(
    *[F.mean(c).alias(f"{c}_mean") for c in impute_cols]
)

# Join means back to weather_df
weather_df_with_means = weather_df.join(site_means, on=["site_id","month"], how="left")

# Replace nulls with site-level mean
for c in impute_cols:
    weather_df_with_means = weather_df_with_means.withColumn(
        c,
        F.when(F.col(c).isNull(), F.col(f"{c}_mean")).otherwise(F.col(c))
    )

# Drop helper mean columns
weather_df_imputed = weather_df_with_means.drop(*[f"{c}_mean" for c in impute_cols])


# from pyspark.ml.feature import Imputer

# # Select the numeric columns in weather that may have missing values
# impute_cols = [
#     "cloud_coverage",
#     "sea_level_pressure",
#     "wind_direction"
# ]

# # Create the Imputer
# imputer = Imputer(
#     inputCols=impute_cols,        # columns with missing values
#     outputCols=impute_cols,       # overwrite same columns
#     strategy="mean"               # use mean values
# )

# # Fit on the dataframe and transform
# weather_df_imputed = imputer.fit(weather_df).transform(weather_df)

# Weather
weather_df_imputed = weather_df_imputed.withColumn("date", F.to_date("timestamp")).withColumn(
    "time",
    F.when(F.hour("timestamp") <= 5, "0-6h")
     .when(F.hour("timestamp") <= 11, "6-12h")
     .when(F.hour("timestamp") <= 17, "12-18h")
     .when(F.hour("timestamp") <= 23, "18-24h")
)
weather_agg = (
    weather_df_imputed.groupBy("site_id", "date", "time")
    .agg(
        F.mean("air_temperature").alias("air_temperature"),
        F.mean("cloud_coverage").alias("cloud_coverage"),
        F.mean("dew_temperature").alias("dew_temperature"),
        F.mean("sea_level_pressure").alias("sea_level_pressure"),
        F.mean("wind_direction").alias("wind_direction"),
        F.mean("wind_speed").alias("wind_speed"),
    )
)


weather_agg.show(5)


+-------+----------+------+---------------+--------------+---------------+------------------+--------------+----------+
|site_id|      date|  time|air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|
+-------+----------+------+---------------+--------------+---------------+------------------+--------------+----------+
|      0|2022-02-28|  0-6h|      13.425000|      2.550000|       6.150000|       1023.633333|     92.500000|  2.150000|
|      0|2022-03-19|  0-6h|      23.883333|      3.125000|      16.383333|       1013.966667|    328.333333|  2.300000|
|      0|2022-05-06|18-24h|      24.333333|      1.333333|       3.133333|       1011.016667|    283.333333|  6.516667|
|      0|2022-05-14|  0-6h|      26.400000|      2.333333|      17.316667|       1015.633333|    288.333333|  3.700000|
|      0|2022-05-20|  0-6h|      23.533333|      3.075000|      21.216667|       1016.500000|    136.666667|  1.883333|
+-------+----------+------+-------------

We know that different seasons may affect energy consumption—for instance, a heater in winter and a cooler in summer. Extracting peak seasons (summer and winter) or off-peak seasons (Spring and Autumn) might be more useful than directly using the month as numerical values.   
1.2.3 The dataset has 16 sites in total, whose locations may span across different countries. Add a column (peak/off-peak) to the weather data frame based on the average air temperature. The top 3 hottest months and the 3 coldest months are considered “peak”, and the rest of the year is considered “off-peak”. 

In [17]:
from pyspark.sql.window import Window
# adapted from A1, tuned with GPT

monthly_temp = (
    weather_agg
    .groupBy("site_id", "month")
    .agg(F.expr("percentile_approx(air_temperature, 0.5)").alias("median_temp"))
)

# Define windows
w_asc = Window.partitionBy("site_id").orderBy(F.col("median_temp").asc())
w_desc = Window.partitionBy("site_id").orderBy(F.col("median_temp").desc())

monthly_temp_ranked = (
    monthly_temp
    .withColumn("rank_cold", F.row_number().over(w_asc))
    .withColumn("rank_hot", F.row_number().over(w_desc))
)

weather_df_peak_seasons = (
    monthly_temp_ranked
    .withColumn(
        "peak_offpeak",
        F.when((F.col("rank_cold") <= 3) | (F.col("rank_hot") <= 3), "peak")
         .otherwise("off-peak")
    ).orderBy("site_id","month")
    .drop("rank_cold","rank_hot")
)

# # Peak seasons = bottom and top 3 hottest months
# bottom_3_months = [row["month"] for row in monthly_temp.orderBy(F.col("median_temp").asc()).limit(3).collect()]
# top_3_months = [row["month"] for row in monthly_temp.orderBy(F.col("median_temp").desc()).limit(3).collect()]
# peak_months = bottom_3_months + top_3_months

# # Add column with peak/off-peak
# weather_df_peak_seasons = (
#     monthly_temp
#     .withColumn(
#         "peak/off-peak",
#         F.when(F.col("month").isin(peak_months), "peak")
#          .otherwise("off-peak")
#     )
# )

# weather_df_peak_seasons.show()

weather_final = (
    weather_df_months
    .join(weather_df_peak_seasons, ["month"],"left")
)
weather_final.show(3)

+-------+-----+-----------+------------+
|site_id|month|median_temp|peak_offpeak|
+-------+-----+-----------+------------+
|      0|    1|  14.900000|        peak|
|      0|    2|  16.133333|        peak|
|      0|    3|  21.466667|    off-peak|
|      0|    4|  22.233333|    off-peak|
|      0|    5|  24.700000|    off-peak|
|      0|    6|  27.300000|        peak|
|      0|    7|  28.433333|        peak|
|      0|    8|  27.300000|        peak|
|      0|    9|  26.766667|    off-peak|
|      0|   10|  24.600000|    off-peak|
|      0|   11|  20.116667|        peak|
|      0|   12|  20.266667|    off-peak|
|      1|    1|   6.416667|        peak|
|      1|    2|   5.350000|        peak|
|      1|    3|   6.850000|    off-peak|
|      1|    4|   9.000000|    off-peak|
|      1|    5|  13.983333|    off-peak|
|      1|    6|  16.100000|    off-peak|
|      1|    7|  18.233333|        peak|
|      1|    8|  18.716667|        peak|
+-------+-----+-----------+------------+
only showing top

Create a data frame with all relevant columns at this stage, we refer to this data frame as feature_df.

+-----+-------+----------+----+---------------+--------------+---------------+------------------+--------------+----------+-------+-----------+------------+
|month|site_id|      date|time|air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|site_id|median_temp|peak_offpeak|
+-----+-------+----------+----+---------------+--------------+---------------+------------------+--------------+----------+-------+-----------+------------+
|    2|      0|2022-02-28|0-6h|      13.425000|      2.550000|       6.150000|       1023.633333|     92.500000|  2.150000|     15|  -1.283333|        peak|
|    2|      0|2022-02-28|0-6h|      13.425000|      2.550000|       6.150000|       1023.633333|     92.500000|  2.150000|     14|   2.500000|        peak|
|    2|      0|2022-02-28|0-6h|      13.425000|      2.550000|       6.150000|       1023.633333|     92.500000|  2.150000|     13|  -2.483333|        peak|
+-----+-------+----------+----+---------------+-----------

### 1.3 Exploring the Data <a class="anchor" name="1.3"></a>
You can use either the CDA or the EDA method mentioned in Lab 5.  
Some ideas for CDA:  
a)	Older building may not be as efficient as new ones, therefore need more energy for cooling/heating. It’s not necessarily true though, if the buildings are built with higher standard or renovated later.  
b)	A multifloored or larger building obviously consumes more energy.  

1.	With the feature_df, write code to show the basic statistics:  
a) For each numeric column, show count, mean, stddev, min, max, 25 percentile, 50 percentile, 75 percentile;  
b) For each non-numeric column, display the top-5 values and the corresponding counts;  
c) For each boolean column, display the value and count. (note: pandas describe is allowed for this task.) (5%)

In [8]:

print("Meters DF:")
meters_df.printSchema()
print("Buildings DF:")
buildings_df.printSchema()
print("Weather DF")
weather_df.printSchema()


Meters DF:
root
 |-- building_id: integer (nullable = true)
 |-- meter_type: string (nullable = true)
 |-- ts: timestamp (nullable = true)
 |-- value: decimal(10,4) (nullable = true)
 |-- row_id: integer (nullable = true)
 |-- date: date (nullable = true)
 |-- time: string (nullable = true)

Buildings DF:
root
 |-- site_id: integer (nullable = true)
 |-- building_id: integer (nullable = true)
 |-- primary_use: string (nullable = true)
 |-- square_feet: integer (nullable = true)
 |-- floor_count: integer (nullable = true)
 |-- row_id: integer (nullable = true)
 |-- year_built: integer (nullable = true)
 |-- latent_y: decimal(10,4) (nullable = true)
 |-- latent_s: decimal(10,4) (nullable = true)
 |-- latent_r: decimal(10,4) (nullable = true)

Weather DF
root
 |-- site_id: integer (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- air_temperature: decimal(6,2) (nullable = true)
 |-- cloud_coverage: decimal(6,2) (nullable = true)
 |-- dew_temperature: decimal(6,2) (nullable

In [9]:
feature_df.printSchema()

root
 |-- month: integer (nullable = true)
 |-- site_id: integer (nullable = true)
 |-- date: date (nullable = true)
 |-- time: string (nullable = true)
 |-- air_temperature: decimal(10,6) (nullable = true)
 |-- cloud_coverage: decimal(10,6) (nullable = true)
 |-- dew_temperature: decimal(10,6) (nullable = true)
 |-- sea_level_pressure: decimal(12,6) (nullable = true)
 |-- wind_direction: decimal(10,6) (nullable = true)
 |-- wind_speed: decimal(10,6) (nullable = true)
 |-- site_id: integer (nullable = true)
 |-- median_temp: decimal(10,6) (nullable = true)
 |-- rank_cold: integer (nullable = true)
 |-- rank_hot: integer (nullable = true)
 |-- peak_offpeak: string (nullable = true)



2.	Explore the dataframe and write code to present two plots of multivariate analysis, describe your plots and discuss the findings from the plots. (5% each).  
○	150 words max for each plot’s description and discussion.  
○	Note: In the building metadata table, there are some latent columns (data that may or may not be helpful, their meanings is unknown due to privacy and data security concerns).  
○	Feel free to use any plotting libraries: matplotlib, seabon, plotly, etc. You can refer to https://samplecode.link  


In [10]:
# Multivariate analysis 1: how is electricity usage influenced by building size/type?

In [11]:
# Multivariate analysis 2: how is electricity usage influenced by weather?

## Part 2. Feature extraction and ML training <a class="anchor" name="part-2"></a>
In this section, you must use PySpark DataFrame functions and ML packages for data preparation, model building, and evaluation. Other ML packages, such as scikit-learn, should not be used to process the data; however, it’s fine to use them to display the result or evaluate your model.  
### 2.1 Discuss the feature selection and prepare the feature columns

2.1.1 Based on the data exploration from 1.2 and considering the use case, discuss the importance of those features (For example, which features may be useless and should be removed, which feature has a significant impact on the label column, which should be transformed), which features you are planning to use? Discuss the reasons for selecting them and how you plan to create/transform them.  
○	300 words max for the discussion  
○	Please only use the provided data for model building  
○	You can create/add additional features based on the dataset  
○	Hint - Use the insights from the data exploration/domain knowledge/statistical models to consider whether to create more feature columns, whether to remove some columns  

2.1.2 Write code to create/transform the columns based on your discussion above.

### 2.2 Preparing Spark ML Transformers/Estimators for features, labels, and models  <a class="anchor" name="2.2"></a>

**2.2.1 Write code to create Transformers/Estimators for transforming/assembling the columns you selected above in 2.1 and create ML model Estimators for Random Forest (RF) and Gradient-boosted tree (GBT) model.
Please DO NOT fit/transform the data yet.**

**2.2.2. Write code to include the above Transformers/Estimators into two pipelines.
Please DO NOT fit/transform the data yet.**

### 2.3 Preparing the training data and testing data  
Write code to split the data for training and testing, using 2025 as the random seed. You can decide the train/test split ratio based on the resources available on your laptop.  
Note: Due to the large dataset size, you can use random sampling (say 20% of the dataset). 

### 2.4 Training and evaluating models  
2.4.1 Write code to use the corresponding ML Pipelines to train the models on the training data from 2.3. And then use the trained models to predict the testing data from 2.3

2.4.2 For both models (RF and GBT): with the test data, decide on which metrics to use for model evaluation and discuss which one is the better model (no word limit; please keep it concise). You may also use a plot for visualisation (not mandatory).

2.4.3 3.	Save the better model (you’ll need it for A2B).
(Note: You may need to go through a few training loops or use more data to create a better-performing model.)

### Part 3. Hyperparameter Tuning and Model Optimisation <a class="anchor" name="part-3"></a>  
Apply the techniques you have learnt from the labs, for example, CrossValidator, TrainValidationSplit, ParamGridBuilder, etc., to perform further hyperparameter tuning and model optimisation.  
The assessment is based on the quality of your work/process, not the quality of your model. Please include your thoughts/ideas/discussions.

## References:
Please add your references below: