## 05 - Ingest Hourly Weather Data from NWS API (Bronze Layer)

This notebook fetches hourly weather forecasts for Seattle from the National Weather Service (NWS) API and stores the data in a partitioned Bronze Delta Lake table.

### Purpose
Collect time-stamped hourly forecasts that can later be aligned with transit activity for analysis and visualization.

### Workflow Summary
- Calls NOAA’s NWS REST API for hourly Seattle forecasts
- Parses the JSON response and drops forecast periods that lack a `startTime` or `temperature`
- Adds `ingestion_date` and `forecast_retrieved_at` columns for data lineage
- Writes to `dbfs:/bronze/weather/` partitioned by `ingestion_date`
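
The parse, filter, and lineage steps above can be sketched without Spark on a hand-written payload. The sample values and the helper name `prepare_periods` are hypothetical; only the `properties.periods`, `startTime`, and `temperature` keys and the two lineage columns come from this notebook.

```python
import datetime as dt

def prepare_periods(payload, ingestion_date, retrieved_at):
    """Keep complete forecast periods and attach lineage fields (hypothetical helper)."""
    periods = payload.get("properties", {}).get("periods", [])
    rows = []
    for period in periods:
        # Drop periods missing a start time or a temperature
        if period.get("startTime") is None or period.get("temperature") is None:
            continue
        rows.append({
            **period,
            "ingestion_date": ingestion_date,
            "forecast_retrieved_at": retrieved_at,
        })
    return rows

# Made-up sample payload shaped like the fields this notebook reads
sample = {
    "properties": {
        "periods": [
            {"startTime": "2024-05-01T14:00:00-07:00", "temperature": 58, "shortForecast": "Cloudy"},
            {"startTime": None, "temperature": 57, "shortForecast": "Cloudy"},
            {"startTime": "2024-05-01T16:00:00-07:00", "temperature": None, "shortForecast": "Rain"},
        ]
    }
}

rows = prepare_periods(
    sample,
    dt.date(2024, 5, 1).isoformat(),
    dt.datetime(2024, 5, 1, 21, 5).isoformat(),
)
print(len(rows))
```

Only the first sample period survives the filter, because the other two are each missing one of the required fields.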


In [0]:
# Development only: uncomment to delete the Bronze weather table and start fresh
# dbutils.fs.rm("dbfs:/bronze/weather/", recurse=True)

In [0]:
import datetime as dt
import requests
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Get today's date for partitioning
INGESTION_DATE = dt.date.today().isoformat()

# Define Delta table path (Bronze layer, no date in path)
BRONZE_WEATHER_PATH = "dbfs:/bronze/weather/"


In [0]:
# NOAA NWS API endpoint for hourly forecast in Seattle area
URL = "https://api.weather.gov/gridpoints/SEW/123,67/forecast/hourly"
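
The path segments in this URL follow the NWS gridpoint pattern: a forecast office ID (`SEW`) followed by grid X and Y coordinates (`123,67`). The NWS `/points/{lat},{lon}` endpoint reports these values for a given coordinate. A minimal sketch of assembling the URL from its parts, where `hourly_forecast_url` is a hypothetical helper and not part of this notebook:

```python
def hourly_forecast_url(office: str, grid_x: int, grid_y: int) -> str:
    """Build the hourly-forecast URL for an NWS grid cell (hypothetical helper)."""
    return f"https://api.weather.gov/gridpoints/{office}/{grid_x},{grid_y}/forecast/hourly"

# Reproduces the URL used in this notebook from its three components
url = hourly_forecast_url("SEW", 123, 67)
print(url)
```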


In [0]:
# Custom User-Agent is required by NOAA API for identification
headers = {
    "User-Agent": "Elham Weather Data Ingest - elham.afruzi@gmail.com"
}


In [0]:
# Make API request to fetch the weather data (time out rather than hang indefinitely)
response = requests.get(URL, headers=headers, timeout=30)


In [0]:
# Print status and a preview of the response for debugging
print("Status Code:", response.status_code)
print("Response Headers:", response.headers)
print("Raw Response:", response.text[:500])  # Show only first 500 characters
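
The request above is made once, so a failed call shows up only in the printed status code. A minimal sketch of retrying with exponential backoff, assuming 5xx responses are transient and worth retrying (an assumption, not something this notebook establishes). `fetch_with_retry` and the fake `fetch` callable are hypothetical; in the notebook, `fetch` would wrap `requests.get(URL, headers=headers)` and return `(response.status_code, response.text)`.

```python
import time

def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-5xx status or attempts run out (hypothetical helper)."""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        # Treat 5xx as transient (assumption); return anything else immediately
        if status < 500 or attempt == max_attempts:
            return status, body
        # Back off 1x, 2x, 4x ... the base delay between attempts
        sleep(base_delay * 2 ** (attempt - 1))

# Fake fetcher standing in for the HTTP call: fails twice, then succeeds
responses = iter([(503, ""), (500, ""), (200, '{"properties": {"periods": []}}')])
delays = []  # records the requested delays instead of actually sleeping
status, body = fetch_with_retry(lambda: next(responses), sleep=delays.append)
print(status, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without waiting; the default still uses `time.sleep`.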


In [0]:
# Fail fast on HTTP errors, then parse the JSON response
response.raise_for_status()
data = response.json()

In [0]:
# Extract 'periods' (hourly forecast entries) from JSON
props = data.get("properties", {})
periods = props.get("periods", [])

print(f"Updated at: {props.get('updated', 'N/A')}")
print(f"Hours returned: {len(periods)}")


In [0]:
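# Step 1: Convert JSON periods into a DataFrame
# (Spark cannot infer a schema from an empty list, so stop early if nothing came back)
if not periods:
    raise ValueError("NWS response contained no forecast periods")
df_weather = spark.createDataFrame(periods)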

# Step 2: Filter out incomplete records (missing time or temperature)
df_weather = df_weather.filter(
    F.col("startTime").isNotNull() & F.col("temperature").isNotNull()
)

# Step 3: Add ingestion and forecast timestamps
df_weather = (
    df_weather
    .withColumn("forecast_retrieved_at", F.current_timestamp())
    .withColumn("forecast_time", F.to_timestamp("startTime"))
    .withColumn("ingestion_date", F.lit(INGESTION_DATE))
)

# Step 4: Display a few rows for verification
df_weather.select("forecast_time", "temperature", "shortForecast", "ingestion_date").show(5, truncate=False)
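
`startTime` arrives as an ISO 8601 string that carries a UTC offset, and `F.to_timestamp` turns it into a single instant; how Spark displays that instant depends on the session time zone. The same conversion can be checked in plain Python. The sample value below is made up.

```python
from datetime import datetime, timezone

start_time = "2024-05-01T14:00:00-07:00"  # made-up sample in the startTime format

# Parse the offset-aware string, then normalize the instant to UTC
local = datetime.fromisoformat(start_time)
utc = local.astimezone(timezone.utc)
print(utc.isoformat())
```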


In [0]:
# Write to Delta table (Bronze layer), partitioned by ingestion date
(
    df_weather.write
    .format("delta")
    .mode("append")
    .partitionBy("ingestion_date")
    .save(BRONZE_WEATHER_PATH)
)

print("✓ Hourly forecast saved to Bronze (partitioned)")
