## 06 - Transform Hourly Weather Data (Silver Layer)

This notebook reads daily-ingested weather data from the Bronze Delta table and processes it into a cleaned and enriched Silver table for analysis and dashboarding.

### Purpose
Standardize Seattle hourly weather forecasts for reliable downstream use by filtering out incomplete records, casting and renaming columns, and adding processing metadata to the raw API data.

### Workflow Summary
- Reads today’s hourly weather forecast from the Bronze Delta table
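- Filters out records with missing `forecast_time` or `temperature`
- Casts `temperature` to an integer and renames `shortForecast` to `condition`
- Adds a `processed_at` timestamp and keeps one row per `forecast_time`
- Appends the cleaned data to `dbfs:/silver/weather/`, partitioned by `ingestion_date`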
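
The row-level logic in the steps above (plus the de-duplication the notebook's code applies) can be checked locally without a Spark cluster. The following is a minimal plain-Python sketch of that logic, not the notebook's PySpark implementation. The helper name `to_silver_rows` and the sample records are hypothetical and exist only for illustration.

```python
from datetime import datetime, timezone


def to_silver_rows(bronze_rows, ingestion_date):
    """Plain-Python analogue of the Silver transform (hypothetical helper)."""
    seen_times = set()
    silver = []
    for row in bronze_rows:
        # Drop rows that are missing a critical field
        if row.get("forecast_time") is None or row.get("temperature") is None:
            continue
        # Keep one row per forecast_time (this sketch keeps the first one seen)
        if row["forecast_time"] in seen_times:
            continue
        seen_times.add(row["forecast_time"])
        silver.append({
            "forecast_time": row["forecast_time"],
            "temperature": int(row["temperature"]),    # int() truncates floats
            "condition": row.get("shortForecast"),     # renamed column
            "windSpeed": row.get("windSpeed"),
            "forecast_retrieved_at": row.get("forecast_retrieved_at"),
            "processed_at": datetime.now(timezone.utc),
            "ingestion_date": ingestion_date,
        })
    return silver


# Made-up sample records covering the duplicate and missing-value cases
sample_rows = [
    {"forecast_time": "2024-05-01T10:00:00-07:00", "temperature": 58,
     "shortForecast": "Sunny", "windSpeed": "5 mph"},
    {"forecast_time": "2024-05-01T10:00:00-07:00", "temperature": 58,
     "shortForecast": "Sunny", "windSpeed": "5 mph"},   # duplicate forecast_time
    {"forecast_time": None, "temperature": 60,
     "shortForecast": "Cloudy", "windSpeed": "3 mph"},  # missing forecast_time
    {"forecast_time": "2024-05-01T11:00:00-07:00", "temperature": None,
     "shortForecast": "Rain", "windSpeed": "8 mph"},    # missing temperature
    {"forecast_time": "2024-05-01T12:00:00-07:00", "temperature": 61.9,
     "shortForecast": "Cloudy", "windSpeed": "6 mph"},
]

silver_rows = to_silver_rows(sample_rows, "2024-05-01")
for r in silver_rows:
    print(r["forecast_time"], r["temperature"], r["condition"])
```

Unlike this sketch, Spark's `dropDuplicates` does not guarantee which of the duplicate rows is kept, so the sketch's "first one wins" rule is an assumption made for readability.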


In [0]:
# Reset helper: uncomment to delete the existing Silver table before a full rebuild (destructive).
# dbutils.fs.rm("dbfs:/silver/weather/", recurse=True)

In [0]:
from pyspark.sql import functions as F
import datetime as dt

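# Use today's date (in the driver's time zone) as the partition to process.
# This must match the ingestion_date value written by the Bronze ingestion job.
INGESTION_DATE = dt.date.today().isoformat()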

# Define input/output paths
BRONZE_PATH = "dbfs:/bronze/weather/"     # Source: raw weather forecast
SILVER_PATH = "dbfs:/silver/weather/"     # Destination: cleaned weather data


In [0]:
# Read only today's ingested weather records from Bronze table
df_bronze = (
    spark.read.format("delta")
    .load(BRONZE_PATH)
    .filter(F.col("ingestion_date") == INGESTION_DATE)
)
# Preview important fields
df_bronze.select("forecast_time", "temperature", "shortForecast", "windSpeed").show(5, truncate=False)


In [0]:
# ✅ Remove rows with missing critical fields (timestamp or temperature)
df_bronze = df_bronze.filter(
    F.col("forecast_time").isNotNull() &
    F.col("temperature").isNotNull()
)


In [0]:
# Select and rename key columns, add metadata
df_silver = (
    df_bronze.select(
        F.col("forecast_time"),
        F.col("temperature").cast("int"),                   # Cast temperature to integer
        F.col("shortForecast").alias("condition"),          # Rename shortForecast to condition
        F.col("windSpeed"),
        F.col("forecast_retrieved_at")
    )
    .withColumn("processed_at", F.current_timestamp())      # Timestamp for transformation time
    .withColumn("ingestion_date", F.lit(INGESTION_DATE))
    .dropDuplicates(["forecast_time"])                      # Keep one row per forecast_time (which duplicate survives is arbitrary)
)


In [0]:
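# Save transformed data to the Silver Delta table, partitioned by ingestion date.
# Note: append mode is not idempotent; re-running this notebook on the same day
# writes that day's rows again.
(
    df_silver.write
    .format("delta")
    .mode("append")
    .partitionBy("ingestion_date")
    .save(SILVER_PATH)
)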

print("✓ Weather Silver table saved (partitioned append)")
