# FIT5202 Assignment 2A : Building Models for Building Energy Prediction

## Table of Contents
* [Part 1: Data Loading, Transformation and Exploration](#part-1)
* [Part 2: Feature extraction and ML training](#part-2)
* [Part 3: Hyperparameter Tuning and Model Optimisation](#part-3)

Please add code/markdown cells as needed.

# Part 1: Data Loading, Transformation and Exploration <a class="anchor" name="part-1"></a>
## 1.1 Data Loading
In this section, you must load the given datasets into PySpark DataFrames and use DataFrame functions to process the data. For plotting, various visualisation packages can be used, but please ensure that you have included instructions to install the additional packages and that the installation will be successful in the provided Docker container (in case your marker needs to clear the notebook and rerun it).

### 1.1.1 Data Loading <a class="anchor" name="1.1"></a>
1.1.1 Write the code to create a SparkSession. Use a SparkConf object to configure the Spark app with a proper application name, to ensure the maximum partition size does not exceed 32 MB, and to run locally with all CPU cores on your machine.

In [1]:
# Import SparkConf class into program
from pyspark import SparkConf

# local[*]: run Spark in local mode with as many working processors as logical cores on your machine
# If we want Spark to run locally with 'k' worker threads, we can specify as "local[k]".
master = "local[*]"
# The `appName` field is a name to be shown on the Spark cluster UI page
app_name = "Assignment2A"
# Setup configuration parameters for Spark
spark_conf = SparkConf().setMaster(master).setAppName(app_name)

# Import SparkSession (the Spark SQL entry point) and the SQL functions module
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Cap each input partition at 32 MB (32 * 1024 * 1024 = 33554432 bytes)
spark = (
    SparkSession.builder
    .config(conf=spark_conf)
    .config("spark.sql.files.maxPartitionBytes", "33554432")
    .getOrCreate()
)
sc = spark.sparkContext
sc.setLogLevel('ERROR')
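
The literal `33554432` passed to `spark.sql.files.maxPartitionBytes` is 32 MB expressed in bytes. As a small plain-Python sketch (the helper name `megabytes_to_bytes` is hypothetical and not part of Spark), the value can be derived rather than typed by hand:

```python
def megabytes_to_bytes(megabytes: int) -> int:
    """Convert binary megabytes (1 MB = 1024 * 1024 bytes) to bytes."""
    return megabytes * 1024 * 1024


# The notebook passes the setting to Spark as a string of bytes
max_partition_bytes = str(megabytes_to_bytes(32))
print(max_partition_bytes)
```

The resulting string could then be supplied as the value of `spark.sql.files.maxPartitionBytes` in place of the hard-coded literal.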

1.1.2 Write code to define the schemas for the datasets, following the data types suggested in the metadata file. 

In [2]:
# Adapted from GPT
from pyspark.sql.types import (
    StructType, StructField,
    IntegerType, StringType, DecimalType, TimestampType
)

# 1. Meters Table
meters_schema = StructType([
    StructField("building_id", IntegerType(), False),
    StructField("meter_type", StringType(), False),   # Char(1) -> StringType
    StructField("ts", TimestampType(), False),
    StructField("value", DecimalType(15, 4), False),
    StructField("row_id", IntegerType(), False)
])

# 2. Buildings Table
buildings_schema = StructType([
    StructField("site_id", IntegerType(), False),
    StructField("building_id", IntegerType(), False),
    StructField("primary_use", StringType(), True),
    StructField("square_feet", IntegerType(), True),
    StructField("floor_count", IntegerType(), True),
    StructField("row_id", IntegerType(), False),
    StructField("year_built", IntegerType(), True),
    StructField("latent_y", DecimalType(6, 4), True),
    StructField("latent_s", DecimalType(6, 4), True),
    StructField("latent_r", DecimalType(6, 4), True)
])

# 3. Weather Table
weather_schema = StructType([
    StructField("site_id", IntegerType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("air_temperature", DecimalType(5, 3), True),
    StructField("cloud_coverage", DecimalType(5, 3), True), # Is an Integer, but ends with a ".0", so read as a DecimalType
    StructField("dew_temperature", DecimalType(5, 3), True),
    StructField("sea_level_pressure", DecimalType(8, 3), True),
    StructField("wind_direction", DecimalType(5, 3), True), # Is an Integer, but ends with a ".0", so read as a DecimalType
    StructField("wind_speed", DecimalType(5, 3), True)
])
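
`DecimalType(precision, scale)` holds at most `precision` digits in total, `scale` of them after the decimal point, so `DecimalType(5, 3)` tops out at 99.999. The plain-Python sketch below checks whether a sample value fits a given precision and scale. The helper `fits_decimal` is hypothetical and uses Python's `decimal` module rather than Spark, and the sample values are illustrative, not taken from the dataset. A check like this may be worth running for columns such as `wind_direction`, whose values could reach 360 if the column holds compass bearings in degrees (an assumption about the data, to be verified against the metadata file).

```python
from decimal import Decimal


def fits_decimal(text: str, precision: int, scale: int) -> bool:
    """Return True if the number in `text` fits a decimal(precision, scale) column."""
    # Round to `scale` fractional digits (Python's default rounding, which may
    # differ from Spark's rounding mode in exact half-way cases)
    rounded = Decimal(text).quantize(Decimal(10) ** -scale)
    # The integer part may use at most (precision - scale) digits
    return abs(rounded) < Decimal(10) ** (precision - scale)


print(fits_decimal("22.850", 5, 3))    # a temperature-like value
print(fits_decimal("360.0", 5, 3))     # a bearing-like value
print(fits_decimal("1022.067", 8, 3))  # a pressure-like value
```

If a value does not fit, widening the precision (for example `DecimalType(6, 3)`) would be one option, provided it remains consistent with the metadata file.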


1.1.3 Using your schemas, load the CSV files into separate data frames. Print the schemas of all data frames. 

In [3]:
# from GPT
meters_df = spark.read.csv(
    "data/meters.csv",
    header=True,
    schema=meters_schema
)

buildings_df = spark.read.csv(
    "data/building_information.csv",
    header=True,
    schema=buildings_schema
)

weather_df = spark.read.csv(
    "data/weather.csv",
    header=True,
    schema=weather_schema
)
print("Meters DF:")
meters_df.printSchema()
print("Buildings DF:")
buildings_df.printSchema()
print("Weather DF")
weather_df.printSchema()


Meters DF:
root
 |-- building_id: integer (nullable = true)
 |-- meter_type: string (nullable = true)
 |-- ts: timestamp (nullable = true)
 |-- value: decimal(15,4) (nullable = true)
 |-- row_id: integer (nullable = true)

Buildings DF:
root
 |-- site_id: integer (nullable = true)
 |-- building_id: integer (nullable = true)
 |-- primary_use: string (nullable = true)
 |-- square_feet: integer (nullable = true)
 |-- floor_count: integer (nullable = true)
 |-- row_id: integer (nullable = true)
 |-- year_built: integer (nullable = true)
 |-- latent_y: decimal(6,4) (nullable = true)
 |-- latent_s: decimal(6,4) (nullable = true)
 |-- latent_r: decimal(6,4) (nullable = true)

Weather DF
root
 |-- site_id: integer (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- air_temperature: decimal(5,3) (nullable = true)
 |-- cloud_coverage: decimal(5,3) (nullable = true)
 |-- dew_temperature: decimal(5,3) (nullable = true)
 |-- sea_level_pressure: decimal(8,3) (nullable = true)
 |-- win

### 1.2 Data Transformation to Create Features <a class="anchor" name="1.2"></a>
In this section, we primarily have three tasks:  
1.2.1 The dataset includes sensors with hourly energy measurements. However, as a grid operator, we don’t need this level of granularity and lowering it can reduce the amount of data we need to process. For each building, we will aggregate the metered energy consumption in 6-hour intervals (0:00-5:59, 6:00-11:59, 12:00-17:59, 18:00-23:59). This will be our target (label) column for this prediction. Perform the aggregation for each building.


In [4]:
# Adapted from A1 and GPT
# Meters df
# Split timestamp to date and time bucket
meters_df = meters_df.withColumn("date", F.to_date("ts")).withColumn(
    "time",
    F.when(F.hour("ts") <= 5, "0-6h")
     .when(F.hour("ts") <= 11, "6-12h")
     .when(F.hour("ts") <= 17, "12-18h")
     .when(F.hour("ts") <= 23, "18-24h")
     .otherwise("unknown")   # catch any unexpected cases
)

# Aggregate by time bucket
meters_df = (
    meters_df.groupBy("building_id", "meter_type", "date", "time")
    .agg(F.sum("value").cast(DecimalType(15, 4)).alias("power_usage"))
)
meters_df.show(3)


+-----------+----------+----------+------+-----------+
|building_id|meter_type|      date|  time|power_usage|
+-----------+----------+----------+------+-----------+
|        244|         c|2022-01-01| 6-12h|    36.3642|
|       1214|         c|2022-01-01| 6-12h|   444.3222|
|       1259|         c|2022-01-01|12-18h|   385.0202|
+-----------+----------+----------+------+-----------+
only showing top 3 rows
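
The bucketing rule used above can be sanity-checked outside Spark. The following plain-Python sketch reproduces the same four labels with integer division; the function name `six_hour_bucket` is hypothetical and the code is only a check of the boundary hours, not a replacement for the DataFrame logic.

```python
BUCKET_LABELS = ["0-6h", "6-12h", "12-18h", "18-24h"]


def six_hour_bucket(hour: int) -> str:
    """Map an hour of day (0-23) to its 6-hour interval label."""
    if not 0 <= hour <= 23:
        return "unknown"  # mirrors the catch-all branch in the Spark code
    return BUCKET_LABELS[hour // 6]


# Print the label at each interval boundary
for hour in (0, 5, 6, 11, 12, 17, 18, 23):
    print(hour, six_hour_bucket(hour))
```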



In the weather dataset, some values are missing (null or empty strings), which may lower the quality of our model. Imputation is the process of replacing missing values in a dataset with substituted, or "imputed", values. It lets you handle gaps in the data and still analyse it effectively without having to delete incomplete records.  
1.2.2 Refer to the Spark MLlib imputation API and fill in the missing values in the weather dataset. You can use mean values as the strategy. https://spark.apache.org/docs/3.5.5/api/python/reference/api/pyspark.ml.feature.Imputer.html
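
The cell that follows does not call `Imputer` directly; it fills each gap with the site-and-month mean, then the site mean, then the global mean. The plain-Python sketch below illustrates that fallback chain on a handful of made-up rows. The function `impute_with_fallback` is hypothetical, and as a simplification it computes every mean from the observed values only.

```python
from statistics import mean


def impute_with_fallback(rows, column):
    """Fill None in `column` with the site+month mean, then the site mean, then the global mean."""
    def group_means(key):
        groups = {}
        for r in rows:
            if r[column] is not None:
                groups.setdefault(key(r), []).append(r[column])
        return {k: mean(v) for k, v in groups.items()}

    by_site_month = group_means(lambda r: (r["site_id"], r["month"]))
    by_site = group_means(lambda r: r["site_id"])
    overall = group_means(lambda r: None).get(None)

    filled = []
    for r in rows:
        value = r[column]
        if value is None:
            value = by_site_month.get((r["site_id"], r["month"]))
        if value is None:
            value = by_site.get(r["site_id"])
        if value is None:
            value = overall
        filled.append({**r, column: value})
    return filled


# Made-up sample rows (not from the dataset)
rows = [
    {"site_id": 1, "month": 1, "wind_speed": 2.0},
    {"site_id": 1, "month": 1, "wind_speed": 4.0},
    {"site_id": 1, "month": 3, "wind_speed": 9.0},
    {"site_id": 3, "month": 1, "wind_speed": 1.0},
    {"site_id": 1, "month": 1, "wind_speed": None},  # site 1, month 1 has observations
    {"site_id": 1, "month": 2, "wind_speed": None},  # only site 1 as a whole has observations
    {"site_id": 2, "month": 1, "wind_speed": None},  # site 2 has no observations at all
]
filled = impute_with_fallback(rows, "wind_speed")
for r in filled[4:]:
    print(r)
```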

In [5]:
# from GPT
# Weather df
# Split timestamp to date, month, time bucket
weather_df = weather_df.withColumn("date", F.to_date("timestamp")).withColumn(
    "time",
    F.when(F.hour("timestamp") <= 5, "0-6h")
     .when(F.hour("timestamp") <= 11, "6-12h")
     .when(F.hour("timestamp") <= 17, "12-18h")
     .when(F.hour("timestamp") <= 23, "18-24h")
).withColumn("month", F.month("timestamp"))

# Choose which columns to impute
impute_cols = [
    "air_temperature",
    "cloud_coverage",
    "dew_temperature",
    "sea_level_pressure",
    "wind_direction",
    "wind_speed"
]

# Global means once
global_means = weather_df.select(
    *[F.mean(c).alias(c) for c in impute_cols]
).first().asDict()

# Step 1: site_id + month
site_month_means = weather_df.groupBy("site_id", "month").agg(
    *[F.mean(c).alias(f"{c}_site_month_mean") for c in impute_cols]
)
weather_df = weather_df.join(site_month_means, on=["site_id", "month"], how="left")
for c in impute_cols:
    weather_df = weather_df.withColumn(
        c, F.coalesce(c, F.col(f"{c}_site_month_mean"))
    ).drop(f"{c}_site_month_mean")

# Release any cached copy of weather_df (a no-op unless it was cached earlier)
weather_df = weather_df.unpersist()

# Step 2: site_id
site_means = weather_df.groupBy("site_id").agg(
    *[F.mean(c).alias(f"{c}_site_mean") for c in impute_cols]
)
weather_df = weather_df.join(site_means, on="site_id", how="left")
for c in impute_cols:
    weather_df = weather_df.withColumn(
        c, F.coalesce(c, F.col(f"{c}_site_mean"))
    ).drop(f"{c}_site_mean")

# Step 3: global fallback
for c in impute_cols:
    weather_df = weather_df.withColumn(
        c, F.coalesce(c, F.lit(global_means[c]))
    )
    
# Drop references to the helper objects and clear any cached data
del site_month_means
del site_means
del global_means
spark.catalog.clearCache()
    
# Aggregate by time bucket
weather_df = (
    weather_df.groupBy("site_id", "date", "time", "month")
    .agg(
        F.mean("air_temperature").cast(DecimalType(5, 3)).alias("air_temperature"),
        F.mean("cloud_coverage").cast(DecimalType(5, 3)).alias("cloud_coverage"),
        F.mean("dew_temperature").cast(DecimalType(5, 3)).alias("dew_temperature"),
        F.mean("sea_level_pressure").cast(DecimalType(8, 3)).alias("sea_level_pressure"),
        F.mean("wind_direction").cast(DecimalType(5, 3)).alias("wind_direction"),
        F.mean("wind_speed").cast(DecimalType(5, 3)).alias("wind_speed"),        
    )
)

weather_df.show(5)


+-------+----------+------+-----+---------------+--------------+---------------+------------------+--------------+----------+
|site_id|      date|  time|month|air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|
+-------+----------+------+-----+---------------+--------------+---------------+------------------+--------------+----------+
|      1|2022-01-01|  0-6h|    1|          2.800|         0.000|          1.950|          1022.067|        49.722|     1.617|
|      1|2022-01-01| 6-12h|    1|          3.117|         0.000|          1.683|          1021.850|        73.222|     2.300|
|      1|2022-01-01|12-18h|    1|          7.450|         0.000|          5.450|          1015.683|        59.667|     5.917|
|      1|2022-01-01|18-24h|    1|          8.183|         0.000|          6.433|          1008.167|        59.667|     8.050|
|      1|2022-01-02|  0-6h|    1|          8.800|         0.000|          8.017|          1001.400|        59.667|    

We know that different seasons may affect energy consumption: for instance, a heater in winter and a cooler in summer. Extracting peak seasons (summer and winter) or off-peak seasons (spring and autumn) might be more useful than directly using the month as a numerical value.   
1.2.3 The dataset has 16 sites in total, whose locations may span across different countries. Add a column (peak/off-peak) to the weather data frame based on the average air temperature. The top 3 hottest months and the 3 coldest months are considered “peak”, and the rest of the year is considered “off-peak”. 
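
The labelling rule can be sketched in plain Python for a single site: rank the twelve monthly temperatures, then mark the three lowest and three highest as "peak". The function `label_peak_months` is hypothetical and the temperatures below are made up for one imaginary northern-hemisphere site.

```python
def label_peak_months(monthly_temp, n=3):
    """Label the n coldest and n hottest months 'peak' and the rest 'off-peak'."""
    ranked = sorted(monthly_temp, key=monthly_temp.get)  # months, coldest first
    peak = set(ranked[:n]) | set(ranked[-n:])
    return {m: ("peak" if m in peak else "off-peak") for m in monthly_temp}


# Made-up monthly temperatures in degrees Celsius (month number -> temperature)
monthly_temp = {
    1: 2.0, 2: 3.5, 3: 7.0, 4: 11.0, 5: 15.0, 6: 19.0,
    7: 22.0, 8: 21.5, 9: 17.0, 10: 12.0, 11: 6.5, 12: 3.0,
}
labels = label_peak_months(monthly_temp)
peak_months = sorted(m for m, label in labels.items() if label == "peak")
print(peak_months)
```

Because the ranking is done per site, a southern-hemisphere site would get a different set of peak months from the same rule.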

In [6]:
from pyspark.sql.window import Window
# adapted from A1, tuned with GPT
# Approximate median air temperature per site and month (used here in place of the mean)
monthly_temp = (
    weather_df
    .groupBy("site_id", "month")
    .agg(F.expr("percentile_approx(air_temperature, 0.5)").alias("median_temp"))
)

# Define windows
w_asc = Window.partitionBy("site_id").orderBy(F.col("median_temp").asc())
w_desc = Window.partitionBy("site_id").orderBy(F.col("median_temp").desc())

# Rank + add peak/offpeak column in one step, drop ranks immediately
monthly_temp_ranked = (
    monthly_temp
    .withColumn("rank_cold", F.row_number().over(w_asc))
    .withColumn("rank_hot", F.row_number().over(w_desc))
    .withColumn(
        "peak_offpeak",
        F.when((F.col("rank_cold") <= 3) | (F.col("rank_hot") <= 3), "peak")
         .otherwise("off-peak")
    )
    .select("site_id", "month", "median_temp", "peak_offpeak")    
)

# Join back
weather_df = weather_df.join(monthly_temp_ranked, ["site_id","month"], "left")

weather_df.show(3)
del monthly_temp_ranked

+-------+-----+----------+------+---------------+--------------+---------------+------------------+--------------+----------+-----------+------------+
|site_id|month|      date|  time|air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|median_temp|peak_offpeak|
+-------+-----+----------+------+---------------+--------------+---------------+------------------+--------------+----------+-----------+------------+
|      1|    1|2022-01-01|  0-6h|          2.800|         0.000|          1.950|          1022.067|        49.722|     1.617|      6.417|        peak|
|      1|    1|2022-01-01| 6-12h|          3.117|         0.000|          1.683|          1021.850|        73.222|     2.300|      6.417|        peak|
|      1|    1|2022-01-01|12-18h|          7.450|         0.000|          5.450|          1015.683|        59.667|     5.917|      6.417|        peak|
+-------+-----+----------+------+---------------+--------------+---------------+--------------

Create a data frame with all relevant columns at this stage; we refer to this data frame as feature_df.

In [7]:
# Meters -> Join with Buildings (building_id)
meters_buildings = meters_df.join(buildings_df, ["building_id"], "left")
# Meters + Buildings -> Join with Weather (site_id, date, time); default inner join drops rows without matching weather
feature_df = meters_buildings.join(weather_df, ["site_id", "date", "time"])
feature_df.show(3)

+-------+----------+------+-----------+----------+-----------+-----------+-----------+-----------+------+----------+--------+--------+--------+-----+---------------+--------------+---------------+------------------+--------------+----------+-----------+------------+
|site_id|      date|  time|building_id|meter_type|power_usage|primary_use|square_feet|floor_count|row_id|year_built|latent_y|latent_s|latent_r|month|air_temperature|cloud_coverage|dew_temperature|sea_level_pressure|wind_direction|wind_speed|median_temp|peak_offpeak|
+-------+----------+------+-----------+----------+-----------+-----------+-----------+-----------+------+----------+--------+--------+--------+-----+---------------+--------------+---------------+------------------+--------------+----------+-----------+------------+
|      0|2022-11-04|12-18h|          0|         e|  1094.1450|  Education|       7432|          1|     1|      2015| 15.0000|  3.8711|  0.0000|   11|         22.850|         2.534|         17.417|   

In [8]:
meters_buildings.printSchema()
weather_df.printSchema()

root
 |-- building_id: integer (nullable = true)
 |-- meter_type: string (nullable = true)
 |-- date: date (nullable = true)
 |-- time: string (nullable = false)
 |-- power_usage: decimal(15,4) (nullable = true)
 |-- site_id: integer (nullable = true)
 |-- primary_use: string (nullable = true)
 |-- square_feet: integer (nullable = true)
 |-- floor_count: integer (nullable = true)
 |-- row_id: integer (nullable = true)
 |-- year_built: integer (nullable = true)
 |-- latent_y: decimal(6,4) (nullable = true)
 |-- latent_s: decimal(6,4) (nullable = true)
 |-- latent_r: decimal(6,4) (nullable = true)

root
 |-- site_id: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- date: date (nullable = true)
 |-- time: string (nullable = true)
 |-- air_temperature: decimal(5,3) (nullable = true)
 |-- cloud_coverage: decimal(5,3) (nullable = true)
 |-- dew_temperature: decimal(5,3) (nullable = true)
 |-- sea_level_pressure: decimal(8,3) (nullable = true)
 |-- wind_direction: decimal(

In [9]:
# Meters DF:
# root
#  |-- building_id: integer (nullable = true)
#  |-- meter_type: string (nullable = true)
#  |-- ts: timestamp (nullable = true)
#  |-- value: decimal(15,4) (nullable = true)
#  |-- row_id: integer (nullable = true)

# Buildings DF:
# root
#  |-- site_id: integer (nullable = true)
#  |-- building_id: integer (nullable = true)
#  |-- primary_use: string (nullable = true)
#  |-- square_feet: integer (nullable = true)
#  |-- floor_count: integer (nullable = true)
#  |-- row_id: integer (nullable = true)
#  |-- year_built: integer (nullable = true)
#  |-- latent_y: decimal(6,4) (nullable = true)
#  |-- latent_s: decimal(6,4) (nullable = true)
#  |-- latent_r: decimal(6,4) (nullable = true)

# Weather DF
# root
#  |-- site_id: integer (nullable = true)
#  |-- timestamp: timestamp (nullable = true)
#  |-- air_temperature: decimal(5,3) (nullable = true)
#  |-- cloud_coverage: decimal(5,3) (nullable = true)
#  |-- dew_temperature: decimal(5,3) (nullable = true)
#  |-- sea_level_pressure: decimal(8,3) (nullable = true)
#  |-- wind_direction: decimal(5,3) (nullable = true)
#  |-- wind_speed: decimal(5,3) (nullable = true)

# Extra columns: month, median_temp, peak_offpeak
print("Feature df: ")
feature_df.printSchema()

Feature df: 
root
 |-- site_id: integer (nullable = true)
 |-- date: date (nullable = true)
 |-- time: string (nullable = false)
 |-- building_id: integer (nullable = true)
 |-- meter_type: string (nullable = true)
 |-- power_usage: decimal(15,4) (nullable = true)
 |-- primary_use: string (nullable = true)
 |-- square_feet: integer (nullable = true)
 |-- floor_count: integer (nullable = true)
 |-- row_id: integer (nullable = true)
 |-- year_built: integer (nullable = true)
 |-- latent_y: decimal(6,4) (nullable = true)
 |-- latent_s: decimal(6,4) (nullable = true)
 |-- latent_r: decimal(6,4) (nullable = true)
 |-- month: integer (nullable = true)
 |-- air_temperature: decimal(5,3) (nullable = true)
 |-- cloud_coverage: decimal(5,3) (nullable = true)
 |-- dew_temperature: decimal(5,3) (nullable = true)
 |-- sea_level_pressure: decimal(8,3) (nullable = true)
 |-- wind_direction: decimal(5,3) (nullable = true)
 |-- wind_speed: decimal(5,3) (nullable = true)
 |-- median_temp: decimal(5,3) (

### 1.3 Exploring the Data <a class="anchor" name="1.3"></a>
You can use either the CDA or the EDA method mentioned in Lab 5.  
Some ideas for CDA:  
a) Older buildings may not be as efficient as new ones and may therefore need more energy for cooling/heating. This is not necessarily true, though, if the buildings were built to a higher standard or renovated later.  
b) A multi-floored or larger building obviously consumes more energy.  

1.	With the feature_df, write code to show the basic statistics:  
a) For each numeric column, show count, mean, stddev, min, max, 25th percentile, 50th percentile, 75th percentile;  
b) For each non-numeric column, display the top-5 values and the corresponding counts;  
c) For each boolean column, display the value and count. (note: pandas describe is allowed for this task.) (5%)
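
As a point of reference for what these statistics mean, the sketch below computes them in plain Python with the standard library. The helpers `describe` and `top_values` are hypothetical, the sample values are made up, and the quartiles use Python's "inclusive" interpolation, which may differ slightly from Spark's approximate quantiles.

```python
import statistics
from collections import Counter


def describe(values):
    """Return count, mean, sample stddev, min, quartiles and max for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


def top_values(values, k=5):
    """Return the k most frequent values with their counts."""
    return Counter(values).most_common(k)


# Made-up sample data (not from the dataset)
square_feet = [1000, 2000, 3000, 4000, 5000]
primary_use = ["Education", "Office", "Education", "Lodging", "Education", "Office"]

summary = describe(square_feet)
print(summary)
print(top_values(primary_use, k=2))
```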

In [10]:
# ### IT WORKS, but I'm commenting it out so I don't recompute this on "run all"

# # Get columns grouped by type
# numeric_cols = [f.name for f in feature_df.schema.fields if f.dataType.simpleString().startswith(("int", "double", "float", "decimal", "long", "short"))]
# string_cols  = [f.name for f in feature_df.schema.fields if f.dataType.simpleString().startswith("string")]
# boolean_cols = [f.name for f in feature_df.schema.fields if f.dataType.simpleString().startswith("boolean")]

# # === A) Numeric columns stats ===
# for col in numeric_cols:
#     # Base aggregations
#     stats = feature_df.select(
#         F.count(F.col(col)).alias("count"),
#         F.mean(F.col(col)).cast(DecimalType(15, 4)).alias("mean"),
#         F.stddev(F.col(col)).cast(DecimalType(15, 4)).alias("stddev"),
#         F.min(F.col(col)).alias("min"),
#         F.max(F.col(col)).cast(DecimalType(15, 4)).alias("max"),
#     ).collect()[0].asDict()

#     # Add quantiles
#     quantiles = feature_df.approxQuantile(col, [0.25, 0.5, 0.75], 0.01)
#     stats.update({"25%": quantiles[0], "50%": quantiles[1], "75%": quantiles[2]})

#     # Convert to DataFrame for nice display
#     df_stats = feature_df.sparkSession.createDataFrame([stats])
#     print(f"--- Stats for numeric column: {col} ---")
#     df_stats.show(truncate=False)

# # === B) Non-numeric (string) columns top-5 values ===
# for col in string_cols:
#     print(f"--- Top 5 values for string column: {col} ---")
#     feature_df.groupBy(col).count().orderBy(F.desc("count")).show(5)

# # === C) Boolean columns counts ===
# for col in boolean_cols:
#     print(f"--- Value counts for boolean column: {col} ---")
#     feature_df.groupBy(col).count().show()



In [11]:

# print("Meters DF:")
# meters_df.printSchema()
# print("Buildings DF:")
# buildings_df.printSchema()
# print("Weather DF")
# weather_df.printSchema()
# print("Feature DF")
# feature_df.printSchema()

2. Explore the dataframe and write code to present two plots of multivariate analysis, describe your plots and discuss the findings from the plots. (5% each).  
* 150 words max for each plot’s description and discussion.  
* Note: In the building metadata table, there are some latent columns (data that may or may not be helpful; their meaning is unknown due to privacy and data security concerns).  
* Feel free to use any plotting libraries: matplotlib, seaborn, plotly, etc. You can refer to https://samplecode.link  


In [15]:
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml import Pipeline

def glm_gamma_pipeline(df, feature_cols, label_col="power_usage"):
    """
    Fit a Gamma GLM with log link to given features.
    """
    # Handle categorical features
    stages = []
    for colname in feature_cols:
        if df.schema[colname].dataType.simpleString() == "string":
            stages.append(StringIndexer(inputCol=colname, outputCol=colname + "_idx", handleInvalid="keep"))
    
    # Replace categorical names with indexed versions
    input_features = [col + "_idx" if df.schema[col].dataType.simpleString() == "string" else col for col in feature_cols]
    
    # Assemble into feature vector
    assembler = VectorAssembler(inputCols=input_features, outputCol="features")
    stages.append(assembler)
    
    # GLM with Gamma distribution + log link
    glm = GeneralizedLinearRegression(
        featuresCol="features",
        labelCol=label_col,
        family="gamma",
        link="log",
        maxIter=50,
        regParam=0.0
    )
    stages.append(glm)
    
    pipeline = Pipeline(stages=stages)
    model = pipeline.fit(df)
    
    # Extract GLM stage
    glm_model = model.stages[-1]
    
    # Collect feature importance (coefficients)
    feature_importance = list(zip(input_features, glm_model.coefficients.toArray()))
    
    return glm_model, feature_importance

# === Run separately for building and weather ===
building_features = ["primary_use", "square_feet", "floor_count", "year_built", "latent_y", "latent_s", "latent_r"]
weather_features  = ["air_temperature", "cloud_coverage", "dew_temperature", "sea_level_pressure", "wind_direction", "wind_speed", "median_temp", "peak_offpeak"]

# Add a tiny offset so power_usage is strictly positive, as the Gamma family requires (this is not a log transform)
df = feature_df.withColumn("power_usage", F.col("power_usage") + F.lit(1e-6))


glm_building, building_importance = glm_gamma_pipeline(df, building_features)
glm_weather, weather_importance   = glm_gamma_pipeline(df, weather_features)

# Print results
print("=== Building Features Importance (Gamma GLM) ===")
for f, c in building_importance:
    print(f"{f:20s} {c:.4f}")

print("\n=== Weather Features Importance (Gamma GLM) ===")
for f, c in weather_importance:
    print(f"{f:20s} {c:.4f}")


Py4JJavaError: An error occurred while calling o731.fit.
: java.lang.AssertionError: assertion failed: Sum of weights cannot be zero.
	at scala.Predef$.assert(Predef.scala:223)
	at org.apache.spark.ml.optim.WeightedLeastSquares$Aggregator.validate(WeightedLeastSquares.scala:426)
	at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:108)
	at org.apache.spark.ml.optim.IterativelyReweightedLeastSquares.fit(IterativelyReweightedLeastSquares.scala:91)
	at org.apache.spark.ml.regression.GeneralizedLinearRegression.$anonfun$train$1(GeneralizedLinearRegression.scala:434)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:380)
	at org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:247)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)


In [24]:
# # Count total rows and how many have zero power_usage
# df_counts = feature_df.agg(
#     F.count("*").alias("total_rows"),
#     F.sum(F.when(F.col("power_usage") == 0, 1).otherwise(0)).alias("zero_rows")
# )

# df_counts.show()
# df.filter(F.col("power_usage") <= 0).count()
# assembler = VectorAssembler(inputCols=building_features, outputCol="features")
# assembled = assembler.transform(df)
# print(assembled.count())
# df.select([F.countDistinct(c).alias(c) for c in building_features + weather_features]).show()


# numeric_features = [
#     c for c in input_features 
#     if str(df.schema[c].dataType) != "StringType"
# ]

# assembler = VectorAssembler(
#     inputCols=numeric_features,
#     outputCol="features"
# )

# assembled = assembler.transform(df)

# print("Row count:", assembled.count())
# assembled.select("features", "power_usage").show(5, truncate=False)


In [None]:
# from pyspark.sql import functions as F
# from pyspark.sql.types import DoubleType

# def iqr_filter(df, numeric_cols):
#     """
#     Apply IQR filtering to all numeric columns in a Spark DataFrame.
#     Returns filtered DataFrame.
#     """
#     for col in numeric_cols:
#         # compute Q1, Q3
#         q1, q3 = df.approxQuantile(col, [0.25, 0.75], 0.01)
#         iqr = q3 - q1
#         lower = q1 - 1.5 * iqr
#         upper = q3 + 1.5 * iqr

#         df = df.filter((F.col(col) >= lower) & (F.col(col) <= upper))
#     return df

# # Identify numeric columns
# numeric_cols = [
#     f.name for f in feature_df.schema.fields
#     if f.dataType.simpleString().startswith(("int", "double", "float", "decimal", "long", "short"))
# ]

# # Cast to DoubleType (for MLlib)
# df = feature_df
# for col in numeric_cols:
#     df = df.withColumn(col, F.col(col).cast(DoubleType()))

# # Log scale power usage column to smooth out outliers
# df = df.withColumn("power_usage", F.log1p("power_usage"))
# # Drop nulls
# df = df.filter(F.col("power_usage").isNotNull())


# # Apply IQR filter
# df_filtered = iqr_filter(df, numeric_cols)
# print("Row count after IQR filtering:", df_filtered.count())

# # Feature groups
# building_feats = ["primary_use", "square_feet", "floor_count", "year_built",
#                   "latent_y", "latent_s", "latent_r"]

# weather_feats  = ["air_temperature", "cloud_coverage", "dew_temperature",
#                   "sea_level_pressure", "wind_direction", "wind_speed",
#                   "median_temp", "peak_offpeak"]



In [None]:
# from pyspark.ml.feature import VectorAssembler, StringIndexer, StandardScaler
# from pyspark.ml.regression import LinearRegression
# from pyspark.ml.evaluation import RegressionEvaluator

# def run_regression(df, features, label="power_usage"):
#     """
#     Runs a linear regression for given features.
#     Returns model, predictions, coef_df, metrics.
#     """
#     # Index categorical if present
#     indexers = []
#     input_cols = []
#     for col in features:
#         if col in ["primary_use", "peak_offpeak"]:
#             idx_col = col + "_idx"
#             indexers.append(StringIndexer(inputCol=col, outputCol=idx_col, handleInvalid="keep"))
#             input_cols.append(idx_col)
#         else:
#             input_cols.append(col)

#     assembler = VectorAssembler(inputCols=input_cols, outputCol="features", handleInvalid="keep")
#     scaler = StandardScaler(inputCol="features", outputCol="features_scaled", withMean=True, withStd=True)

#     lr = LinearRegression(featuresCol="features_scaled", labelCol=label, maxIter=100)

#     # Pipeline
#     from pyspark.ml import Pipeline
#     pipeline = Pipeline(stages=indexers + [assembler, scaler, lr])

#     # Train/test split
#     train, test = df.randomSplit([0.8, 0.2], seed=42)

#     model = pipeline.fit(train)
#     preds = model.transform(test)

#     # Metrics
#     evaluator_rmse = RegressionEvaluator(labelCol=label, predictionCol="prediction", metricName="rmse")
#     evaluator_r2   = RegressionEvaluator(labelCol=label, predictionCol="prediction", metricName="r2")
#     rmse = evaluator_rmse.evaluate(preds)
#     r2   = evaluator_r2.evaluate(preds)

#     # Extract coefficients
#     lr_model = model.stages[-1]  # last stage is LinearRegression
#     feature_order = assembler.getInputCols()
#     coefs = lr_model.coefficients.toArray().tolist()
#     coef_df = pd.DataFrame({"feature": feature_order, "coefficient": coefs})

#     return model, preds, coef_df, {"rmse": rmse, "r2": r2, "intercept": lr_model.intercept}

# # Analysis 1: Building features
# model_building, preds_building, coef_building, metrics_building = run_regression(df_filtered, building_feats)
# print("=== Building features regression ===")
# print(metrics_building)
# print(coef_building)

# # Analysis 2: Weather features
# model_weather, preds_weather, coef_weather, metrics_weather = run_regression(df_filtered, weather_feats)
# print("=== Weather features regression ===")
# print(metrics_weather)
# print(coef_weather)


In [None]:
# pred_pd = preds_building.select("power_usage", "prediction").toPandas()
# residuals = pred_pd["power_usage"] - pred_pd["prediction"]

# plt.figure(figsize=(7,6))
# plt.scatter(pred_pd["power_usage"], pred_pd["prediction"], alpha=0.5)
# plt.plot([pred_pd["power_usage"].min(), pred_pd["power_usage"].max()],
#          [pred_pd["power_usage"].min(), pred_pd["power_usage"].max()], "--")
# plt.xlabel("Actual power_usage")
# plt.ylabel("Predicted power_usage")
# plt.title("Building features: Predicted vs Actual")
# plt.show()


In [23]:
def check_nulls_nans(df):
    checks = []
    for c, dtype in df.dtypes:
        if dtype in ("double", "float"):  # only floating-point columns can hold NaN; decimals report as e.g. "decimal(6,4)"
            checks.append(
                F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c + "_invalid")
            )
        else:  # only null check
            checks.append(
                F.count(F.when(F.col(c).isNull(), c)).alias(c + "_nulls")
            )
    return df.select(checks)

# Columns to review: wind_speed, wind_direction, sea_level_pressure

check_nulls_nans(df).show(truncate=False)

+-------------+----------+----------+-----------------+----------------+-------------------+-----------------+-----------------+-----------------+------------+----------------+--------------+--------------+--------------+-----------+---------------------+--------------------+---------------------+------------------------+--------------------+----------------+-----------------+------------------+
|site_id_nulls|date_nulls|time_nulls|building_id_nulls|meter_type_nulls|power_usage_invalid|primary_use_nulls|square_feet_nulls|floor_count_nulls|row_id_nulls|year_built_nulls|latent_y_nulls|latent_s_nulls|latent_r_nulls|month_nulls|air_temperature_nulls|cloud_coverage_nulls|dew_temperature_nulls|sea_level_pressure_nulls|wind_direction_nulls|wind_speed_nulls|median_temp_nulls|peak_offpeak_nulls|
+-------------+----------+----------+-----------------+----------------+-------------------+-----------------+-----------------+-----------------+------------+----------------+--------------+-----------
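
One detail that explains the column suffixes in the output above: `df.dtypes` reports decimal columns together with their precision and scale (for example `decimal(6,4)`), so comparing against the bare string `"decimal"` does not select them, and decimal values cannot hold NaN in any case. The following is a minimal plain-Python sketch of how such type strings could be classified; the helper names `is_nan_capable` and `is_decimal` are hypothetical and not part of the notebook.

```python
def is_nan_capable(dtype: str) -> bool:
    """Only floating-point Spark types can hold NaN."""
    return dtype in ("float", "double")


def is_decimal(dtype: str) -> bool:
    """Match decimal types regardless of precision and scale, e.g. 'decimal(6,4)'."""
    return dtype.startswith("decimal")


# Sample (column, dtype) pairs in the form returned by DataFrame.dtypes
sample_dtypes = [
    ("power_usage", "double"),
    ("latent_y", "decimal(6,4)"),
    ("site_id", "int"),
]

for name, dtype in sample_dtypes:
    print(name, dtype, is_nan_capable(dtype), is_decimal(dtype))
```

With this split, only `power_usage` needs the combined NaN-or-null check; the decimal and integer columns need a null check only.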

In [None]:
# check_nulls_nans(meters_df).show(truncate=False)
# # check_nulls_nans(pre_meters_df).show(truncate=False)

# check_nulls_nans(weather_df).show(truncate=False)

In [25]:
df.printSchema()

root
 |-- site_id: integer (nullable = true)
 |-- date: date (nullable = true)
 |-- time: string (nullable = false)
 |-- building_id: integer (nullable = true)
 |-- meter_type: string (nullable = true)
 |-- power_usage: double (nullable = true)
 |-- primary_use: string (nullable = true)
 |-- square_feet: integer (nullable = true)
 |-- floor_count: integer (nullable = true)
 |-- row_id: integer (nullable = true)
 |-- year_built: integer (nullable = true)
 |-- latent_y: decimal(6,4) (nullable = true)
 |-- latent_s: decimal(6,4) (nullable = true)
 |-- latent_r: decimal(6,4) (nullable = true)
 |-- month: integer (nullable = true)
 |-- air_temperature: decimal(5,3) (nullable = true)
 |-- cloud_coverage: decimal(5,3) (nullable = true)
 |-- dew_temperature: decimal(5,3) (nullable = true)
 |-- sea_level_pressure: decimal(8,3) (nullable = true)
 |-- wind_direction: decimal(5,3) (nullable = true)
 |-- wind_speed: decimal(5,3) (nullable = true)
 |-- median_temp: decimal(5,3) (nullable = true)
 |-

## Part 2. Feature extraction and ML training <a class="anchor" name="part-2"></a>
In this section, you must use PySpark DataFrame functions and ML packages for data preparation, model building, and evaluation. Other ML packages, such as scikit-learn, should not be used to process the data; however, it’s fine to use them to display the result or evaluate your model.  
### 2.1 Discuss the feature selection and prepare the feature columns

2.1.1 Based on the data exploration in 1.2 and the use case, discuss the importance of the features: for example, which features may be useless and should be removed, which have a significant impact on the label column, and which should be transformed. State which features you plan to use, why you selected them, and how you plan to create or transform them.
* 300 words max for the discussion
* Please only use the provided data for model building
* You can create/add additional features based on the dataset
* Hint: use the insights from the data exploration, domain knowledge, or statistical models to decide whether to create more feature columns or remove some columns

2.1.2 Write code to create/transform the columns based on your discussion above.

### 2.2 Preparing Spark ML Transformers/Estimators for features, labels, and models  <a class="anchor" name="2.2"></a>

**2.2.1 Write code to create Transformers/Estimators for transforming/assembling the columns you selected above in 2.1, and create ML model Estimators for the Random Forest (RF) and Gradient-Boosted Tree (GBT) models.
Please DO NOT fit/transform the data yet.**

**2.2.2. Write code to include the above Transformers/Estimators into two pipelines.
Please DO NOT fit/transform the data yet.**

### 2.3 Preparing the training data and testing data  
Write code to split the data for training and testing, using 2025 as the random seed. You can decide the train/test split ratio based on the resources available on your laptop.  
Note: Due to the large dataset size, you can use random sampling (say 20% of the dataset). 

### 2.4 Training and evaluating models  
2.4.1 Write code to use the corresponding ML Pipelines to train the models on the training data from 2.3, then use the trained models to predict on the testing data from 2.3.

2.4.2 For both models (RF and GBT): with the test data, decide on which metrics to use for model evaluation and discuss which one is the better model (no word limit; please keep it concise). You may also use a plot for visualisation (not mandatory).
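
To make the choice of metric concrete, the plain-Python sketch below works through RMSE, MAE, and R² on a handful of made-up actual and predicted values. The numbers are illustrative only and are not results from this dataset. These are the standard definitions that correspond to the `rmse`, `mae`, and `r2` metric names of PySpark's `RegressionEvaluator`.

```python
import math

# Illustrative values only (assumed for this example, not taken from the dataset)
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

# RMSE: square root of the mean squared error; penalises large errors more heavily
rmse = math.sqrt(sum(e * e for e in errors) / n)

# MAE: mean absolute error; same units as the label, less sensitive to outliers
mae = sum(abs(e) for e in errors) / n

# R^2: share of the label variance explained by the model (1.0 is a perfect fit)
mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")
```

RMSE and MAE are expressed in the units of the label, so they are only comparable between models evaluated on the same test set, whereas R² is unitless.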

2.4.3 Save the better model (you’ll need it for A2B).
(Note: You may need to go through a few training loops or use more data to create a better-performing model.)

## Part 3. Hyperparameter Tuning and Model Optimisation <a class="anchor" name="part-3"></a>
Apply the techniques you have learnt from the labs, for example, CrossValidator, TrainValidationSplit, ParamGridBuilder, etc., to perform further hyperparameter tuning and model optimisation.  
The assessment is based on the quality of your work/process, not the quality of your model. Please include your thoughts/ideas/discussions.
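
The cost of a hyperparameter search grows multiplicatively, so it can help to count model fits before launching one. The plain-Python sketch below enumerates a hypothetical Random Forest grid and compares k-fold cross-validation (as in `CrossValidator`) with a single train/validation split (as in `TrainValidationSplit`). `numTrees` and `maxDepth` are real `RandomForestRegressor` parameters, but the candidate values and the fold count are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical search space (values are assumptions, not recommendations)
param_grid = {
    "numTrees": [20, 50, 100],
    "maxDepth": [5, 10],
}
num_folds = 3  # assumed number of cross-validation folds

# Every combination of parameter values, as a grid builder would enumerate them
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# k-fold CV fits each combination once per fold, then refits the best one on all data
cv_fits = len(combinations) * num_folds + 1

# A single train/validation split fits each combination once, then refits the best one
tvs_fits = len(combinations) + 1

print(f"{len(combinations)} combinations: CV fits={cv_fits}, single-split fits={tvs_fits}")
```

On a laptop-sized budget, a count like this is one way to justify choosing a smaller grid, fewer folds, or a single validation split on a data sample.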

## References:
Please add your references below: