#### Silver_to_Gold (Feature Engineering)
This notebook creates machine-learning features from clean Silver data.


#### FEATURE ENGINEERING APPROACH (GOLD LAYER)

In this notebook, we prepare the GOLD dataset, the final
machine-learning-ready version of the GCC climate data.

The goal here is to build the features and targets that future models will use.


#### 1. Converting Time-Series into Supervised ML Format
To enable machine learning, we transform the climate time-series into  
input/output pairs:

- Input (X): today’s weather  
- Output (y): tomorrow’s weather  

This makes the dataset usable for forecasting tasks later.
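
The pairing can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's implementation: the function name `make_supervised_pairs`, the station IDs, and the sample values are hypothetical. It pairs each row with the next row of the same station, which corresponds to "tomorrow" only if every station has one row per date.

```python
from itertools import groupby
from operator import itemgetter


def make_supervised_pairs(rows):
    """Pair each day's record (X) with the next row's TMAX (y), per station.

    rows: list of (station_id, iso_date, tmax) tuples in any order.
    """
    pairs = []
    # Sort by station, then by date, so consecutive rows are consecutive days.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _station, group in groupby(ordered, key=itemgetter(0)):
        days = list(group)
        # The last row of a station has no successor, so it yields no pair.
        for today, tomorrow in zip(days, days[1:]):
            pairs.append((today, tomorrow[2]))
    return pairs


# Hypothetical sample: station ST_B has a single row and produces no pair.
sample = [
    ("ST_A", "2015-01-02", 27.4),
    ("ST_A", "2015-01-01", 27.4),
    ("ST_A", "2015-01-03", 27.0),
    ("ST_B", "2015-01-01", 30.1),
]
pairs = make_supervised_pairs(sample)
```

Pairs never cross station boundaries, which mirrors the role of partitioning by station ID before shifting values.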


#### 2. Features Added to the GOLD Dataset

##### • Lag Features (yesterday’s values)
- `TMAX_lag1`  
- `TMIN_lag1`  
- `PRCP_lag1`  

Comparing each lag with the current value captures short-term, day-to-day changes.


##### • Rolling Averages (weekly trends)
- `TMAX_7d_avg`  
- `TMIN_7d_avg`  
- `TAVG_7d_avg`  

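Each average is taken over up to seven preceding rows of the same station (the current day is excluded), which helps detect weekly warming/cooling patterns.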
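
To make the lag and trailing-average semantics concrete, here is a small plain-Python sketch for a single station. The helper names `lag1` and `trailing_mean` and the sample values are hypothetical. It assumes the behavior of Spark's `avg` over a frame of preceding rows: missing values are skipped, and the result is missing only when the frame holds no usable value.

```python
def lag1(values):
    """Previous row's value for each position; None for the first row."""
    return [None] + values[:-1]


def trailing_mean(values, window=7):
    """Mean of up to `window` values before each position.

    The current value is excluded. None entries are skipped, and the
    result is None when no usable value precedes the position.
    """
    result = []
    for i in range(len(values)):
        previous = [v for v in values[max(0, i - window):i] if v is not None]
        result.append(sum(previous) / len(previous) if previous else None)
    return result


# Hypothetical daily values for one station, already sorted by date.
tavg = [20.0, 22.0, 24.0, 26.0]

tavg_lag1 = lag1(tavg)               # [None, 20.0, 22.0, 24.0]
tavg_7d_avg = trailing_mean(tavg)    # [None, 20.0, 21.0, 22.0]
```

Only the first position is `None`. Positions two to seven are averaged over fewer than seven values, so early "7-day" averages are based on a shorter history.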


##### • Date-Based Features
- `day_of_week`
- `month`
- `day_of_year`

These help future models learn seasonal behavior.
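
As a cross-check of the encodings, the same three features can be derived with Python's standard library. This is a sketch with a hypothetical helper name, `date_features`. It assumes the numbering used by Spark's `dayofweek`, where 1 is Sunday and 7 is Saturday, which differs from Python's `isoweekday` (1 is Monday).

```python
from datetime import date


def date_features(d):
    """Calendar features for one date, using Spark's day-of-week numbering."""
    return {
        # isoweekday(): Monday=1 .. Sunday=7; remap to Sunday=1 .. Saturday=7.
        "day_of_week": d.isoweekday() % 7 + 1,
        "month": d.month,
        "day_of_year": d.timetuple().tm_yday,
    }


friday = date_features(date(2015, 1, 2))
sunday = date_features(date(2015, 1, 4))
```

For 2015-01-02 (a Friday) this gives `day_of_week` 6, `month` 1, and `day_of_year` 2.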


#### 3. Target Variables for Future Models
We prepare the two prediction targets:

- **`tmax_nextday`** → tomorrow’s maximum temperature (`TMAX` of the following day)  
- **`rain_nextday`** → 1 if tomorrow’s precipitation (`PRCP`) is greater than 0, otherwise 0

These will be used later in the forecasting models.
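
The two targets can be sketched in plain Python for a single station. The function name `next_day_targets` and the sample values are hypothetical. The sketch assumes that a missing next-day measurement leaves the target missing rather than defaulting it to 0.

```python
def next_day_targets(tmax, prcp):
    """Return (tmax_nextday, rain_nextday) aligned with the input days.

    Both lists must be sorted by date and have the same length.
    """
    n = len(tmax)
    tmax_next = [tmax[i + 1] if i + 1 < n else None for i in range(n)]
    rain_next = []
    for i in range(n):
        next_prcp = prcp[i + 1] if i + 1 < n else None
        # Keep the target missing when tomorrow's precipitation is unknown.
        rain_next.append(None if next_prcp is None else int(next_prcp > 0))
    return tmax_next, rain_next


# Hypothetical values: the last PRCP reading is missing.
tmax = [27.4, 27.0, 23.5, 26.1]
prcp = [0.0, 1.2, 0.0, None]
tmax_next, rain_next = next_day_targets(tmax, prcp)
```

The final day never has a target, and a missing reading removes the target of the day before it, so both kinds of rows have to be dropped before training.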

The next stage (model training) will be done in a separate notebook.


In [0]:
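# Authenticate to the ADLS Gen2 storage account that holds the lakehouse.
# "<account key>" is a placeholder: load the real key from a secret store instead of hard-coding it.
spark.conf.set(
    "fs.azure.account.key.qatarclimateanalysis.dfs.core.windows.net",
    "<account key>"
)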

In [0]:
# Read silver data
silver_path = "abfss://lakehouse@qatarclimateanalysis.dfs.core.windows.net/silver/gcc_silver.parquet"
df_silver = spark.read.parquet(silver_path)


In [0]:
from pyspark.sql.window import Window
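# Import only the functions used below; a wildcard import would shadow Python built-ins such as sum, min, and max
from pyspark.sql.functions import avg, dayofweek, dayofyear, lag, lead, month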


# 1. Create a time window per station ordered by date
# Look at each station separately
# Inside each station, sort rows by date
# Ensures the correct calculation order
w = Window.partitionBy("ID").orderBy("date")


# 2. Create next-day TMAX target
df_gold = df_silver.withColumn(
    "tmax_nextday",
    lead("TMAX").over(w)
)


# 3. Create next-day rain target (1 if tomorrow's PRCP > 0, else 0; NULL if tomorrow's PRCP is missing)
df_gold = df_gold.withColumn(
    "rain_nextday",
    (lead("PRCP").over(w) > 0).cast("int")
)


# 4. Create lag features (yesterday's values)
# These let the model compare today with the previous day
df_gold = df_gold.withColumn("TMAX_lag1", lag("TMAX", 1).over(w))
df_gold = df_gold.withColumn("TMIN_lag1", lag("TMIN", 1).over(w))
df_gold = df_gold.withColumn("PRCP_lag1", lag("PRCP", 1).over(w))


# 5. Create 7-day rolling averages (weekly trend)
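# The frame covers the 7 rows before the current row, so today's value is excluded.
# It is row-based, so it spans 7 calendar days only when a station has one row per date.
# The averages capture weekly patterns such as heatwaves and cooling trends.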
w7 = Window.partitionBy("ID").orderBy("date") \
     .rowsBetween(-7, -1)

df_gold = df_gold.withColumn("TMAX_7d_avg", avg("TMAX").over(w7))
df_gold = df_gold.withColumn("TMIN_7d_avg", avg("TMIN").over(w7))
df_gold = df_gold.withColumn("TAVG_7d_avg", avg("TAVG").over(w7))


# 6. Drop rows where lag/rolling features are NULL
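# The first row of each station has no preceding row, so its lag and rolling values are NULL.
# Rows 2-7 are kept: their averages use the preceding rows that exist (fewer than 7).
# Rows whose previous-day TMAX, TMIN, or PRCP is NULL are dropped as well.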
df_gold = df_gold.dropna(subset=[
    "TMAX_lag1", "TMIN_lag1", "PRCP_lag1",
    "TMAX_7d_avg", "TMIN_7d_avg", "TAVG_7d_avg"
])

# 7. Add date-based features
# These features help the model learn seasonal behavior in climate.
df_gold = df_gold.withColumn("day_of_week", dayofweek("date"))
df_gold = df_gold.withColumn("month", month("date"))
df_gold = df_gold.withColumn("day_of_year", dayofyear("date"))



# 8. Remove rows where next-day targets are null
# The last row of each station has no next day, so its targets are NULL
df_gold = df_gold.dropna(subset=["tmax_nextday", "rain_nextday"])


# 9. Sort final GOLD table
df_gold = df_gold.orderBy("ID", "date")
display(df_gold.limit(10))


ID,date,PRCP,TAVG,TMAX,TMIN,tmax_nextday,rain_nextday,TMAX_lag1,TMIN_lag1,PRCP_lag1,TMAX_7d_avg,TMIN_7d_avg,TAVG_7d_avg,day_of_week,month,day_of_year
AE000041196,2015-01-02,0.0,19.9,27.4,12.7,27.4,0,27.4,12.5,0.0,27.4,12.5,20.6,6,1,2
AE000041196,2015-01-03,0.0,20.6,27.4,14.0,27.4,0,27.4,12.7,0.0,27.4,12.6,20.25,7,1,3
AE000041196,2015-01-04,0.0,19.7,27.4,14.0,27.0,0,27.4,14.0,0.0,27.399999999999995,13.066666666666668,20.366666666666667,1,1,4
AE000041196,2015-01-05,0.0,19.6,27.0,14.0,27.0,0,27.4,14.0,0.0,27.4,13.3,20.2,2,1,5
AE000041196,2015-01-06,0.0,20.1,27.0,14.0,27.0,0,27.0,14.0,0.0,27.32,13.44,20.08,3,1,6
AE000041196,2015-01-07,0.0,20.1,27.0,13.6,27.2,0,27.0,14.0,0.0,27.266666666666666,13.533333333333331,20.08333333333333,4,1,7
AE000041196,2015-01-08,0.0,20.3,27.2,13.6,23.5,0,27.0,13.6,0.0,27.228571428571428,13.542857142857144,20.085714285714285,5,1,8
AE000041196,2015-01-09,0.0,19.4,23.5,13.6,26.1,0,27.2,13.6,0.0,27.2,13.7,20.042857142857144,6,1,9
AE000041196,2015-01-10,0.0,19.6,26.1,13.6,25.0,0,23.5,13.6,0.0,26.642857142857142,13.828571428571426,19.971428571428568,7,1,10
AE000041196,2015-01-11,0.0,21.1,25.0,10.0,23.1,0,26.1,13.6,0.0,26.457142857142856,13.77142857142857,19.828571428571426,1,1,11
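
The displayed rows allow a quick consistency check of the shifted columns. The sketch below copies `date`, `TMAX`, `tmax_nextday`, and `TMAX_lag1` from the ten rows above and verifies that each row's `tmax_nextday` equals the following row's `TMAX`, and each row's `TMAX_lag1` equals the preceding row's `TMAX`. The helper name `check_shift_consistency` is hypothetical, and the check assumes the rows are consecutive days of one station.

```python
# (date, TMAX, tmax_nextday, TMAX_lag1) copied from the displayed output.
rows = [
    ("2015-01-02", 27.4, 27.4, 27.4),
    ("2015-01-03", 27.4, 27.4, 27.4),
    ("2015-01-04", 27.4, 27.0, 27.4),
    ("2015-01-05", 27.0, 27.0, 27.4),
    ("2015-01-06", 27.0, 27.0, 27.0),
    ("2015-01-07", 27.0, 27.2, 27.0),
    ("2015-01-08", 27.2, 23.5, 27.0),
    ("2015-01-09", 23.5, 26.1, 27.2),
    ("2015-01-10", 26.1, 25.0, 23.5),
    ("2015-01-11", 25.0, 23.1, 26.1),
]


def check_shift_consistency(rows):
    """Return a list of messages for rows whose shifted columns disagree."""
    problems = []
    for prev, curr in zip(rows, rows[1:]):
        if prev[2] != curr[1]:
            problems.append(f"{prev[0]}: tmax_nextday does not match next TMAX")
        if curr[3] != prev[1]:
            problems.append(f"{curr[0]}: TMAX_lag1 does not match previous TMAX")
    return problems


problems = check_shift_consistency(rows)
```

An empty `problems` list indicates that the lead and lag columns line up for this sample.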


#### Save Gold

In [0]:
# Save the GOLD dataset
gold_path = "abfss://lakehouse@qatarclimateanalysis.dfs.core.windows.net/gold/gcc_gold.parquet"
df_gold.write.mode("overwrite").parquet(gold_path)
