# FIT5202 Assignment 2A : Building Models for Building Energy Prediction

## Table of Contents
* [Part 1 : Data Loading, Transformation and Exploration](#part-1)
* [Part 2 : Feature extraction and ML training](#part-2)
* [Part 3 : Hyperparameter Tuning and Model Optimisation](#part-3)

Please add code/markdown cells as needed.

# Part 1: Data Loading, Transformation and Exploration <a class="anchor" name="part-1"></a>
## 1.1 Data Loading
In this section, you must load the given datasets into PySpark DataFrames and use DataFrame functions to process the data. For plotting, various visualisation packages can be used, but please ensure that you have included instructions to install the additional packages and that the installation will be successful in the provided Docker container (in case your marker needs to clear the notebook and rerun it).

### 1.1.1 Data Loading <a class="anchor" name="1.1"></a>
1.1.1 Write the code to create a SparkSession. To create the SparkSession, use a SparkConf object to configure the Spark app with a proper application name, to ensure the maximum partition size does not exceed 32MB, and to run locally with all CPU cores on your machine.

In [1]:
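from pyspark import SparkConf
from pyspark.sql import SparkSession

# Configure the app through a SparkConf object, as the task requires.
spark_conf = (
    SparkConf()
    .setAppName("FIT5202 Assignment 2A")
    .setMaster("local[*]")  # run locally on all available CPU cores
    .set("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))  # cap partitions at 32 MB
)

spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()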

1.1.2 Write code to define the schemas for the datasets, following the data types suggested in the metadata file. 

In [2]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

meter_schema = StructType([
    StructField("building_id", IntegerType(), True),
    StructField("meter_type", StringType(), True), # Reading as String is safer than CharType
    StructField("ts", TimestampType(), True),
    StructField("value", DoubleType(), True),
    StructField("row_id", IntegerType(), True)
])

building_schema = StructType([
    StructField("site_id", IntegerType(), True),
    StructField("building_id", IntegerType(), True),
    StructField("primary_use", StringType(), True),
    StructField("square_feet", IntegerType(), True),
    StructField("floor_count", IntegerType(), True),
    StructField("row_id", IntegerType(), True),
    StructField("year_built", IntegerType(), True),
    StructField("latent_y", DoubleType(), True),
    StructField("latent_s", DoubleType(), True),
    StructField("latent_r", DoubleType(), True)
])

# Schema for weather.csv
# We match the exact order from the header, noting precip_depth_1_hr is missing.
weather_schema = StructType([
    StructField("site_id", IntegerType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("air_temperature", DoubleType(), True),
    StructField("cloud_coverage", DoubleType(), True),
    StructField("dew_temperature", DoubleType(), True),
    StructField("sea_level_pressure", DoubleType(), True),
    StructField("wind_direction", DoubleType(), True),
    StructField("wind_speed", DoubleType(), True)
])

1.1.3 Using your schemas, load the CSV files into separate data frames. Print the schemas of all data frames. 

In [3]:
building_df = spark.read.csv("dataset/building_information.csv", header=True, schema=building_schema)
weather_df = spark.read.csv("dataset/weather.csv", header=True, schema=weather_schema)
meter_df = spark.read.csv("dataset/meters.csv", header=True, schema=meter_schema)

# --- Verification Step ---
print("--- Correctly Loaded building_df ---")
building_df.printSchema()
building_df.show(5)

print("\n--- Correctly Loaded weather_df ---")
weather_df.printSchema()
weather_df.show(5)

print("\n--- Correctly Loaded meter_df ---")
meter_df.printSchema()
meter_df.show(5)

--- Correctly Loaded building_df ---
root
 |-- site_id: integer (nullable = true)
 |-- building_id: integer (nullable = true)
 |-- primary_use: string (nullable = true)
 |-- square_feet: integer (nullable = true)
 |-- floor_count: integer (nullable = true)
 |-- row_id: integer (nullable = true)
 |-- year_built: integer (nullable = true)
 |-- latent_y: double (nullable = true)
 |-- latent_s: double (nullable = true)
 |-- latent_r: double (nullable = true)

+-------+-----------+-------------+-----------+-----------+------+----------+--------+---------+--------+
|site_id|building_id|  primary_use|square_feet|floor_count|row_id|year_built|latent_y| latent_s|latent_r|
+-------+-----------+-------------+-----------+-----------+------+----------+--------+---------+--------+
|      2|        165|    Warehouse|       3877|          1|   166|      1982|    18.0|3.5884957|     3.0|
|      2|        229|    Education|     140092|          1|   230|      1999|     1.0|5.1464133|     3.0|
|      1| 

### 1.2 Data Transformation to Create Features <a class="anchor" name="1.2"></a>
In this section, we primarily have three tasks:  
1.2.1 The dataset includes sensors with hourly energy measurements. However, as a grid operator, we don’t need this level of granularity and lowering it can reduce the amount of data we need to process. For each building, we will aggregate the metered energy consumption in 6-hour intervals (0:00-5:59, 6:00-11:59, 12:00-17:59, 18:00-23:59). This will be our target (label) column for this prediction. Perform the aggregation for each building.
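
To make the interval boundaries concrete, the sketch below shows in plain Python (not Spark) how an hourly timestamp maps to the start of its 6-hour interval, and how readings are summed per interval. The helper name `six_hour_bucket_start` and the sample readings are hypothetical and only illustrate the arithmetic; in the notebook the grouping itself is done with PySpark DataFrame functions.

```python
from datetime import datetime


def six_hour_bucket_start(ts: datetime) -> datetime:
    """Return the start of the 6-hour interval (00, 06, 12 or 18) containing ts."""
    bucket_hour = (ts.hour // 6) * 6  # 0-5 -> 0, 6-11 -> 6, 12-17 -> 12, 18-23 -> 18
    return ts.replace(hour=bucket_hour, minute=0, second=0, microsecond=0)


# Hypothetical hourly readings for one building: (timestamp, value)
readings = [
    (datetime(2022, 1, 1, 0, 0), 1.5),
    (datetime(2022, 1, 1, 5, 0), 2.5),
    (datetime(2022, 1, 1, 6, 0), 4.0),
    (datetime(2022, 1, 1, 23, 0), 3.0),
]

# Sum the readings per 6-hour interval, mirroring a groupBy followed by sum.
totals = {}
for ts, value in readings:
    start = six_hour_bucket_start(ts)
    totals[start] = totals.get(start, 0.0) + value
```

The readings at 00:00 and 05:00 land in the same interval, while the 06:00 reading opens the next one.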


In [4]:
from pyspark.sql.functions import window, sum as _sum

# --- 1.2.1 Aggregate Meter Readings ---

# We will group by the building_id and a 6-hour window on the 'ts' (timestamp) column.
# Then, we will sum the 'value' column to get the total energy consumption for that period.
# Note: We alias the sum function to avoid conflict with the Python built-in sum.
meter_df_agg = meter_df.groupBy(
    "building_id",
    window("ts", "6 hours")
).agg(
    _sum("value").alias("total_energy_consumption")
)

# Show the results. This is our target (label) column for the ML model.
print("--- Aggregated Meter Readings (6-hour intervals) ---")
meter_df_agg.show()

--- Aggregated Meter Readings (6-hour intervals) ---
+-----------+--------------------+------------------------+
|building_id|              window|total_energy_consumption|
+-----------+--------------------+------------------------+
|        258|{2022-01-01 12:00...|      242.17629999999997|
|        928|{2022-01-01 12:00...|                4054.541|
|        996|{2022-01-01 12:00...|               2030.5453|
|       1168|{2022-01-02 12:00...|             308836.8331|
|        920|{2022-01-02 18:00...|       576.0677000000001|
|       1342|{2022-01-02 18:00...|                903.0262|
|       1388|{2022-01-03 06:00...|                478.5848|
|       1211|{2022-01-03 18:00...|               12046.066|
|        801|{2022-01-04 00:00...|       48967.21000000001|
|       1312|{2022-01-04 00:00...|               2476.4169|
|       1088|{2022-01-04 06:00...|             153586.5986|
|       1169|{2022-01-04 12:00...|               31927.523|
|       1355|{2022-01-04 12:00...|       6878.9

In the weather dataset, there are some missing values (null or empty strings), which may lower the quality of our model. Imputation handles them by replacing the missing values in a dataset with substituted, or "imputed", values, so that you can still analyse the data effectively without having to delete incomplete records.  
1.2.2 Refer to the Spark MLLib imputation API and fill in the missing values in the weather dataset. You can use mean values as the strategy.  https://spark.apache.org/docs/3.5.5/api/python/reference/api/pyspark.ml.feature.Imputer.html
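
As a rough illustration of what a mean strategy does, the plain-Python sketch below computes the mean over the observed values only and uses it to fill the gaps. The function `impute_mean` and the sample temperatures are hypothetical; this is not the Spark `Imputer` itself, only the idea behind it.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)  # the mean is computed over observed values only
    return [fill if v is None else v for v in values]


# Hypothetical air temperatures with two missing readings
air_temperature = [20.0, None, 24.0, None, 22.0]
imputed = impute_mean(air_temperature)
```

Because every gap in a column receives the same value, mean imputation keeps the column mean unchanged but reduces its variance.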

In [5]:
from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col, count, when

# --- 1.2.2 Impute Missing Values in Weather Data ---

# Columns that need imputation.
impute_cols = [
    "air_temperature",
    "cloud_coverage",
    "dew_temperature",
    "sea_level_pressure",
    "wind_direction",
    "wind_speed"
]

# Create the Imputer with the 'mean' strategy.
imputer = Imputer(
    inputCols=impute_cols,
    outputCols=[f"{c}_imputed" for c in impute_cols]
).setStrategy("mean")

# Fit the imputer to the data to learn the means.
imputer_model = imputer.fit(weather_df)

# Transform the data to fill the NULLs.
weather_df_imputed = imputer_model.transform(weather_df)

# Clean up the dataframe by dropping the original columns and renaming the imputed ones.
weather_df_cleaned = weather_df_imputed
for c in impute_cols:
    weather_df_cleaned = weather_df_cleaned.drop(c).withColumnRenamed(f"{c}_imputed", c)

# Verify that our imputation worked. The null counts should now be zero.
print("--- Null counts in weather_df after imputation ---")
weather_df_cleaned.select([count(when(col(c).isNull(), c)).alias(c) for c in impute_cols]).show()

print("\n--- Cleaned Weather Data ---")
weather_df_cleaned.show()

--- Null counts in weather_df after imputation ---
+---------------+--------------+---------------+------------------+--------------+----------+
|air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|
+---------------+--------------+---------------+------------------+--------------+----------+
|              0|             0|              0|                 0|             0|         0|
+---------------+--------------+---------------+------------------+--------------+----------+


--- Cleaned Weather Data ---
+-------+-------------------+---------------+------------------+---------------+------------------+--------------+----------+
|site_id|          timestamp|air_temperature|    cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|
+-------+-------------------+---------------+------------------+---------------+------------------+--------------+----------+
|      0|2022-01-01 22:00:00|           26.7|2.1493059490084985|      

In [6]:
from pyspark.sql.functions import month, col

# 1) Do we have all 12 months for site 0 in the base (cleaned/imputed) weather DF?
weather_df_cleaned.filter(col("site_id")==0) \
    .select(month("timestamp").alias("m")) \
    .groupBy("m").count().orderBy("m").show(100)

# Expect months 1..12 present with non-zero counts.

+---+-----+
|  m|count|
+---+-----+
|  1|  744|
|  2|  696|
|  3|  744|
|  4|  720|
|  5|  744|
|  6|  720|
|  7|  744|
|  8|  744|
|  9|  720|
| 10|  744|
| 11|  720|
| 12|  744|
+---+-----+



We know that different seasons may affect energy consumption, for instance through heating in winter and cooling in summer. Extracting peak seasons (summer and winter) or off-peak seasons (spring and autumn) might be more useful than directly using the month as a numerical value.   
1.2.3 The dataset has 16 sites in total, whose locations may span across different countries. Add a column (peak/off-peak) to the weather data frame based on the average air temperature. The top 3 hottest months and the 3 coldest months are considered “peak”, and the rest of the year is considered “off-peak”. 
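
The rule can be checked by hand for a single site. The plain-Python sketch below uses the Site 0 monthly average temperatures (rounded to two decimals) and the Site 0 hourly row counts that appear in this notebook's outputs; the variable names are illustrative only, and the Spark implementation has to apply the same ranking per site.

```python
# Site 0 monthly mean air temperatures (rounded) and hourly row counts,
# copied from the outputs shown in this notebook.
avg_temp = {1: 14.71, 2: 16.14, 3: 21.27, 4: 22.43, 5: 24.73, 6: 27.37,
            7: 28.55, 8: 27.61, 9: 26.87, 10: 24.04, 11: 20.06, 12: 19.96}
rows_per_month = {1: 744, 2: 696, 3: 744, 4: 720, 5: 744, 6: 720,
                  7: 744, 8: 744, 9: 720, 10: 744, 11: 720, 12: 744}

# Months ordered from coldest to hottest by average temperature.
ranked = sorted(avg_temp, key=avg_temp.get)
coldest, hottest = set(ranked[:3]), set(ranked[-3:])
peak_months = coldest | hottest

# Label every month, then count how many hourly rows end up labelled "peak".
season_type = {m: "peak" if m in peak_months else "off-peak" for m in avg_temp}
peak_rows = sum(rows_per_month[m] for m in peak_months)
```

For Site 0 this gives January, February and December as the coldest months and June, July and August as the hottest, and the resulting 4392 peak rows agree with the Validation 1 count for Site 0.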

In [7]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# --- Base alias (must be the full cleaned weather DF; do NOT pre-filter it) ---
w = weather_df_cleaned.alias("w")

# 1) Monthly avg temperature per site
monthly_avg = (
    w.withColumn("month", F.month("w.timestamp"))
     .groupBy("w.site_id", "month")
     .agg(F.avg("w.air_temperature").alias("avg_temp"))
     .alias("ma")
)

# 2) Tie-safe top-3 hottest and top-3 coldest months per site using row_number
win_cold = Window.partitionBy("ma.site_id").orderBy(F.col("avg_temp").asc())
win_hot  = Window.partitionBy("ma.site_id").orderBy(F.col("avg_temp").desc())

cold3 = (
    monthly_avg
    .withColumn("cold_rank", F.row_number().over(win_cold))
    .filter(F.col("cold_rank") <= 3)
    .select(F.col("ma.site_id").alias("site_id"), "month")
)

hot3 = (
    monthly_avg
    .withColumn("hot_rank", F.row_number().over(win_hot))
    .filter(F.col("hot_rank") <= 3)
    .select(F.col("ma.site_id").alias("site_id"), "month")
)

peak_months = (
    cold3.unionByName(hot3)
         .dropDuplicates(["site_id", "month"])
         .alias("pm")
)

# 3) LEFT join back to full weather to label season_type
weather_with_season = (
    w.join(
        peak_months,
        on=(
            (F.col("w.site_id") == F.col("pm.site_id")) &
            (F.month(F.col("w.timestamp")) == F.col("pm.month"))
        ),
        how="left"
    )
    .withColumn(
        "season_type",
        F.when(F.col("pm.month").isNotNull(), F.lit("peak")).otherwise(F.lit("off-peak"))
    )
    .drop("pm.site_id", "pm.month")
)

# (Optional) keep a clean projection (nice for downstream joins)
weather_with_season = weather_with_season.select(
    "w.timestamp", "w.site_id",
    "w.air_temperature", "w.cloud_coverage", "w.dew_temperature",
    "w.sea_level_pressure", "w.wind_direction", "w.wind_speed",
    "season_type"
)


In [8]:
# --- VALIDATION SUITE ---

# 0) Validate the monthly average temperatures
# This is the foundational data that the ranking is based on.
print("--- Validation 0: Average Monthly Temperature for Site 0 ---")
monthly_avg.filter(F.col("site_id") == 0).orderBy("month").show()


# 1) Site 0 must have BOTH labels now
print("\n--- Validation 1: Peak/Off-Peak Row Count for Site 0 ---")
weather_with_season.filter(F.col("site_id")==0) \
    .groupBy("season_type").count().show()

# 2) Exactly 6 peak months per site (months, not rows)
print("\n--- Validation 2: Count of Peak/Off-Peak Months Per Site ---")
months_by_label = (weather_with_season
    .select("site_id", F.month("timestamp").alias("m"), "season_type")
    .dropDuplicates(["site_id","m","season_type"])
    .groupBy("site_id","season_type").count()
)
months_by_label.orderBy("site_id","season_type").show(50)

# 3) Spot-check distinct months per site are 12
print("\n--- Validation 3: Total Distinct Months Per Site ---")
distinct_months = (weather_with_season
    .select("site_id", F.month("timestamp").alias("m"))
    .dropDuplicates()
    .groupBy("site_id").count()
)
distinct_months.orderBy("site_id").show(50)

--- Validation 0: Average Monthly Temperature for Site 0 ---
+-------+-----+------------------+
|site_id|month|          avg_temp|
+-------+-----+------------------+
|      0|    1|14.713110644374733|
|      0|    2| 16.13965517241379|
|      0|    3|  21.2662634408602|
|      0|    4| 22.43125000000002|
|      0|    5|  24.7342741935484|
|      0|    6|27.366388888888856|
|      0|    7| 28.55282258064521|
|      0|    8|27.613575268817204|
|      0|    9|26.871944444444473|
|      0|   10| 24.03817204301075|
|      0|   11|20.055416666666662|
|      0|   12|19.956989247311842|
+-------+-----+------------------+


--- Validation 1: Peak/Off-Peak Row Count for Site 0 ---
+-----------+-----+
|season_type|count|
+-----------+-----+
|   off-peak| 4392|
|       peak| 4392|
+-----------+-----+


--- Validation 2: Count of Peak/Off-Peak Months Per Site ---
+-------+-----------+-----+
|site_id|season_type|count|
+-------+-----------+-----+
|      0|   off-peak|    6|
|      0|       peak|    

Create a data frame with all relevant columns at this stage; we refer to this data frame as feature_df.

### 1.3 Exploring the Data <a class="anchor" name="1.3"></a>
You can use either the CDA or the EDA method mentioned in Lab 5.  
Some ideas for CDA:  
a)	Older buildings may not be as efficient as new ones and may therefore need more energy for cooling/heating. This is not necessarily true, though, if the buildings were built to a higher standard or renovated later.  
b)	A multi-floor or larger building obviously consumes more energy.  

1.	With the feature_df, write code to show the basic statistics:  
a) For each numeric column, show count, mean, stddev, min, max, 25th percentile, 50th percentile, and 75th percentile;  
b) For each non-numeric column, display the top-5 values and the corresponding counts;  
c) For each boolean column, display the value and count. (Note: pandas describe is allowed for this task.) (5%)

2.	Explore the dataframe and write code to present two plots of multivariate analysis, describe your plots and discuss the findings from the plots. (5% each).  
○	150 words max for each plot’s description and discussion.  
○	Note: In the building metadata table, there are some latent columns (data that may or may not be helpful; their meaning is unknown due to privacy and data security concerns).  
○	Feel free to use any plotting libraries: matplotlib, seaborn, plotly, etc. You can refer to https://samplecode.link  


# Part 2: Feature extraction and ML training <a class="anchor" name="part-2"></a>
In this section, you must use PySpark DataFrame functions and ML packages for data preparation, model building, and evaluation. Other ML packages, such as scikit-learn, should not be used to process the data; however, it’s fine to use them to display the result or evaluate your model.  
### 2.1 Discuss the feature selection and prepare the feature columns

2.1.1 Based on the data exploration from 1.3 and considering the use case, discuss the importance of the features (for example, which features may be useless and should be removed, which have a significant impact on the label column, and which should be transformed). Which features are you planning to use? Discuss the reasons for selecting them and how you plan to create/transform them.  
○	300 words max for the discussion  
○	Please only use the provided data for model building  
○	You can create/add additional features based on the dataset  
○	Hint - Use the insights from the data exploration/domain knowledge/statistical models to decide whether to create more feature columns or remove some columns  

2.1.2 Write code to create/transform the columns based on your discussion above.

### 2.2 Preparing Spark ML Transformers/Estimators for features, labels, and models  <a class="anchor" name="2.2"></a>

**2.2.1 Write code to create Transformers/Estimators for transforming/assembling the columns you selected above in 2.1, and create ML model Estimators for the Random Forest (RF) and Gradient-boosted tree (GBT) models.
Please DO NOT fit/transform the data yet.**

**2.2.2. Write code to include the above Transformers/Estimators into two pipelines.
Please DO NOT fit/transform the data yet.**

### 2.3 Preparing the training data and testing data  
Write code to split the data for training and testing, using 2025 as the random seed. You can decide the train/test split ratio based on the resources available on your laptop.  
Note: Due to the large dataset size, you can use random sampling (say 20% of the dataset). 
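
The role of the seed can be illustrated without Spark. The sketch below is a plain-Python analogue, assuming a hypothetical list of row ids: it samples roughly 20% of the rows, splits the sample about 80/20, and shows that repeating the procedure with seed 2025 reproduces the same partition. In PySpark, the corresponding DataFrame methods are `sample(fraction=..., seed=...)` and `randomSplit(weights, seed=...)`.

```python
import random


def sample_and_split(row_ids, fraction=0.2, train_ratio=0.8, seed=2025):
    """Bernoulli-sample the rows, then assign each sampled row to train or test."""
    rng = random.Random(seed)  # a private generator keeps the result reproducible
    sampled = [r for r in row_ids if rng.random() < fraction]
    train, test = [], []
    for r in sampled:
        (train if rng.random() < train_ratio else test).append(r)
    return train, test


row_ids = list(range(10_000))  # hypothetical row identifiers
train_a, test_a = sample_and_split(row_ids)
train_b, test_b = sample_and_split(row_ids)  # same seed, so the same split
```

The 80/20 ratio here is only an example; the task leaves the ratio to you.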

### 2.4 Training and evaluating models  
2.4.1 Write code to use the corresponding ML Pipelines to train the models on the training data from 2.3, and then use the trained models to predict on the testing data from 2.3.

2.4.2 For both models (RF and GBT): with the test data, decide on which metrics to use for model evaluation and discuss which one is the better model (no word limit; please keep it concise). You may also use a plot for visualisation (not mandatory).
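
Since the label is a continuous energy total, regression metrics such as RMSE, MAE and R² are natural candidates; `RegressionEvaluator` in `pyspark.ml.evaluation` exposes these through its `metricName` parameter. The plain-Python sketch below computes the three metrics by hand on hypothetical values, purely to show what each one measures; the function name `regression_metrics` is an assumption for illustration.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Compute MAE, RMSE and R^2 for paired actual/predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n            # average absolute error
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    rmse = sqrt(ss_res / n)                          # penalises large errors more
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot                         # share of variance explained
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Hypothetical labels and predictions
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
metrics = regression_metrics(actual, predicted)
```

RMSE and MAE are in the units of the label, so they are only comparable between models evaluated on the same test data, whereas R² is unitless.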

2.4.3 Save the better model (you’ll need it for A2B).
(Note: You may need to go through a few training loops or use more data to create a better-performing model.)

# Part 3: Hyperparameter Tuning and Model Optimisation <a class="anchor" name="part-3"></a>  
Apply the techniques you have learnt from the labs, for example, CrossValidator, TrainValidationSplit, ParamGridBuilder, etc., to perform further hyperparameter tuning and model optimisation.  
The assessment is based on the quality of your work/process, not the quality of your model. Please include your thoughts/ideas/discussions.
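
Before launching a search, it helps to estimate how many model fits it implies, since every parameter combination is trained once per fold. The plain-Python sketch below enumerates a small grid the way a parameter grid expands it. The parameter values and the fold count are assumptions chosen for illustration; `numTrees` and `maxDepth` are Random Forest parameters in Spark ML.

```python
from itertools import product

# Hypothetical search space for a Random Forest; values are illustrative only.
param_grid = {
    "numTrees": [20, 50],
    "maxDepth": [5, 10, 15],
}

# Every combination of values is one candidate model configuration.
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]

num_folds = 3  # assumed number of cross-validation folds

# k-fold cross-validation: one fit per candidate per fold, plus the final refit
# of the best candidate on the full training data.
cv_fits = len(candidates) * num_folds + 1

# A single train/validation split: one fit per candidate, plus the final refit.
tvs_fits = len(candidates) + 1
```

With this grid, cross-validation needs 19 fits against 7 for a single train/validation split, which is the trade-off to weigh against the resources available on a laptop.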

## References:
Please add your references below: