In [0]:
# ==============================================================
# 🧱 DATABRICKS HACKATHON PROJECT
# ==============================================================
# NOTEBOOK: 00 Environment Setup & Master Data
# PURPOSE:  Initializes the environment, creates foundational tables, and seeds initial data 
# AUTHOR:   Chintan Shah
# ==============================================================

# 🧱 Setup & Master Data

This notebook initializes the environment, creates foundational tables,
and seeds initial data for the project.
It includes:
  1. Catalog & Schema setup
  2. Volume creation
  3. City Master (dimension table)
  4. Run configuration
  5. Pipeline logging

## 🧭 Step 1 – Environment Initialization

In this step:
- Create a **Catalog** and **Schema** to organize project assets.
- Set the active context for all operations.

Schema: `env_data`  
Catalog: `env_catalog`

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.getActiveSession()

catalog_name = "env_catalog"
schema_name = "env_data"

# Create catalog & schema if not already present
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog_name}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.{schema_name} COMMENT 'Schema for Weather + Pollution Data'")
spark.sql(f"USE CATALOG {catalog_name}")
spark.sql(f"USE SCHEMA {schema_name}")

print(f"✅ Environment initialized: Catalog = {catalog_name}, Schema = {schema_name}")

✅ Environment initialized: Catalog = env_catalog, Schema = env_data
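
Because the catalog and schema names are interpolated straight into SQL strings, it can help to validate them once and build fully qualified names in one place. The following is a minimal sketch; `qualified_name` is a hypothetical helper and not part of the original notebook.

```python
import re

# Hypothetical helper (an assumption for illustration, not in the original notebook).
# Accepts only plain identifiers: letters, digits, and underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified_name(*parts: str) -> str:
    """Join identifier parts into a dotted name, rejecting anything unsafe."""
    for part in parts:
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"Unsafe identifier: {part!r}")
    return ".".join(parts)


catalog_name = "env_catalog"
schema_name = "env_data"

city_master = qualified_name(catalog_name, schema_name, "city_master")
print(city_master)  # env_catalog.env_data.city_master
```

The resulting string can be used wherever the notebook builds `f"{catalog_name}.{schema_name}.<object>"` by hand.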


## 📦 Step 2 – Volume Setup

Volumes serve as **managed storage zones** for raw and processed data.  
We’ll create:
- Landing zones for **weather** and **pollution** data (daily/nightly & 15-min feeds)
- A **checkpoint** volume for streaming job state tracking

In [0]:
volumes = {
    "weather_landing": "Landing zone for raw weather API responses",
    "pollution_landing": "Landing zone for raw pollution API responses",
    "weather_live": "Landing zone for near-live 15-min weather data",
    "pollution_live": "Landing zone for near-live 15-min pollution data",
    "checkpoints": "Checkpoint locations for Autoloader and streaming jobs"
}

# Create each volume with comments
for vol, comment in volumes.items():
    spark.sql(f"CREATE VOLUME IF NOT EXISTS {catalog_name}.{schema_name}.{vol} COMMENT '{comment}'")

print("✅ All volumes created successfully.")

✅ All volumes created successfully.
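
Unity Catalog volumes are addressed through paths of the form `/Volumes/<catalog>/<schema>/<volume>`. The sketch below derives those paths for the volumes created above; `volume_path` is a hypothetical helper, and the `weather_live` subfolder under `checkpoints` is only an assumed layout.

```python
catalog_name = "env_catalog"
schema_name = "env_data"
volume_names = [
    "weather_landing",
    "pollution_landing",
    "weather_live",
    "pollution_live",
    "checkpoints",
]


def volume_path(catalog: str, schema: str, volume: str, *subdirs: str) -> str:
    """Return the /Volumes path for a volume, plus optional subfolders."""
    return "/".join(["/Volumes", catalog, schema, volume, *subdirs])


# One path per volume, keyed by volume name.
paths = {name: volume_path(catalog_name, schema_name, name) for name in volume_names}

# Assumed layout: one checkpoint subfolder per streaming feed.
checkpoint = volume_path(catalog_name, schema_name, "checkpoints", "weather_live")

print(paths["weather_landing"])  # /Volumes/env_catalog/env_data/weather_landing
print(checkpoint)                # /Volumes/env_catalog/env_data/checkpoints/weather_live
```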


## 🌍 Step 3 – City Master Dimension Table

**Purpose:** Maintain a controlled, versioned list of cities.  

**Design Highlights:**
- Surrogate key (`dim_key`) generated automatically as an identity column  
- Insert-only by design: changes are recorded as new rows instead of updates/deletes, which preserves history  
- `is_latest` flag → marks the current record if city details (lat/long) ever change  
- `is_active` flag → determines which cities participate in regular ingestion  
- `effective_from` / `effective_to` → record validity window


In [0]:
table_name = f"{catalog_name}.{schema_name}.city_master"

# Drop if exists for clean reruns
spark.sql(f"DROP TABLE IF EXISTS {table_name}")

# 1️⃣ Create basic structure (without defaults)
spark.sql(f"""
CREATE TABLE {table_name} (
  dim_key BIGINT GENERATED ALWAYS AS IDENTITY,
  city_name STRING NOT NULL,
  latitude DOUBLE,
  longitude DOUBLE,
  country_code STRING NOT NULL,
  country_name STRING,
  region STRING,
  is_latest BOOLEAN,
  is_active BOOLEAN,
  effective_from TIMESTAMP,
  effective_to TIMESTAMP,
  created_ts TIMESTAMP
)
COMMENT 'Dimension table for City Master (insert-only, version controlled)'
""")

# 2️⃣ Enable default support on the table
spark.sql(f"""
ALTER TABLE {table_name}
SET TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported')
""")

# 3️⃣ Apply default expressions now
spark.sql(f"""
ALTER TABLE {table_name}
ALTER COLUMN is_latest SET DEFAULT TRUE
""")
spark.sql(f"""
ALTER TABLE {table_name}
ALTER COLUMN is_active SET DEFAULT TRUE
""")
spark.sql(f"""
ALTER TABLE {table_name}
ALTER COLUMN effective_from SET DEFAULT current_timestamp()
""")
spark.sql(f"""
ALTER TABLE {table_name}
ALTER COLUMN created_ts SET DEFAULT current_timestamp()
""")

print("✅ City master table created successfully with defaults enabled.")

✅ City master table created successfully with defaults enabled.
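
Once the table is seeded, ingestion notebooks would typically read only the current, active version of each city. A minimal sketch of such a query, using the column names from the DDL above; `current_cities_query` is a hypothetical helper and not part of the original notebook.

```python
def current_cities_query(table_name: str) -> str:
    """Build a query returning the current, active version of each city."""
    return (
        "SELECT dim_key, city_name, latitude, longitude, country_code "
        f"FROM {table_name} "
        "WHERE is_latest = TRUE AND is_active = TRUE"
    )


query = current_cities_query("env_catalog.env_data.city_master")
print(query)

# In a notebook, the result could be loaded with:
#   cities_df = spark.sql(query)
```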


### ✨ Load Initial City Master Data

Load top **global** and **Indian** cities as seed data.

These cities will be the base for all weather & pollution data collection.

In [0]:
# We insert only into user-controlled columns.
# Auto-generated / default columns (dim_key, created_ts, etc.) are filled automatically.

city_data = [
    # --- Global Cities ---
    ("New York", 40.7128, -74.0060, "US", "United States", "North America"),
    ("London", 51.5072, -0.1276, "GB", "United Kingdom", "Europe"),
    ("Dubai", 25.276987, 55.296249, "AE", "United Arab Emirates", "Middle East"),
    ("Singapore", 1.3521, 103.8198, "SG", "Singapore", "Asia"),
    ("Sydney", -33.8688, 151.2093, "AU", "Australia", "Oceania"),
    ("Amsterdam", 52.3676, 4.9041, "NL", "Netherlands", "Europe"),
    ("Tokyo", 35.6762, 139.6503, "JP", "Japan", "Asia"),
    ("Moscow", 55.7558, 37.6173, "RU", "Russia", "Europe"),
    ("Hong Kong", 22.3193, 114.1694, "HK", "Hong Kong", "Asia"),
    ("Shanghai", 31.2304, 121.4737, "CN", "China", "Asia"),
    ("Paris", 48.8566, 2.3522, "FR", "France", "Europe"),
    ("San Francisco", 37.7749, -122.4194, "US", "United States", "North America"),
    ("Los Angeles", 34.0522, -118.2437, "US", "United States", "North America"),
    ("Frankfurt", 50.1109, 8.6821, "DE", "Germany", "Europe"),
    ("Zurich", 47.3769, 8.5417, "CH", "Switzerland", "Europe"),

    # --- Indian Cities ---
    ("Mumbai", 19.0760, 72.8777, "IN", "India", "Asia"),
    ("Delhi", 28.6139, 77.2090, "IN", "India", "Asia"),
    ("Bengaluru", 12.9716, 77.5946, "IN", "India", "Asia"),
    ("Hyderabad", 17.3850, 78.4867, "IN", "India", "Asia"),
    ("Chennai", 13.0827, 80.2707, "IN", "India", "Asia"),
    ("Kolkata", 22.5726, 88.3639, "IN", "India", "Asia"),
    ("Pune", 18.5204, 73.8567, "IN", "India", "Asia"),
    ("Ahmedabad", 23.0225, 72.5714, "IN", "India", "Asia"),
    ("Jaipur", 26.9124, 75.7873, "IN", "India", "Asia"),
    ("Surat", 21.1702, 72.8311, "IN", "India", "Asia")
]

columns = ["city_name", "latitude", "longitude", "country_code", "country_name", "region"]
city_df = spark.createDataFrame(city_data, columns)

# Append only the user-supplied columns; dim_key and the defaulted columns
# are omitted from the DataFrame and filled in by the table.
(
    city_df
    .write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable(f"{catalog_name}.{schema_name}.city_master")
)

print(f"✅ Inserted {city_df.count()} seed cities into city_master.")

✅ Inserted 25 seed cities into city_master.
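
The DataFrame write above relies on the table filling in the columns that are missing from `city_df`. An alternative that names the target columns explicitly is an `INSERT INTO ... (columns) SELECT` through a temporary view. The sketch below is one way to do that; `insert_with_column_list` and the view name `seed_rows_tmp` are hypothetical.

```python
def insert_with_column_list(spark, df, table_name, columns, view_name="seed_rows_tmp"):
    """Append df to table_name, naming the target columns explicitly.

    `spark` is the active SparkSession and `df` a DataFrame that contains
    every column listed in `columns`.
    """
    # Expose the DataFrame to SQL under a temporary name.
    df.createOrReplaceTempView(view_name)
    col_list = ", ".join(columns)
    statement = (
        f"INSERT INTO {table_name} ({col_list}) "
        f"SELECT {col_list} FROM {view_name}"
    )
    spark.sql(statement)
    return statement


# Usage in the notebook (assumes `spark`, `city_df`, and `columns` from the cell above):
#   insert_with_column_list(spark, city_df, "env_catalog.env_data.city_master", columns)
```

Columns left out of the list, such as `dim_key` and `created_ts`, are then populated by the identity and default definitions on the table.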


## ⚙️ Step 4 – Run Configuration Table

The **run_config** table defines pipeline parameters.
These help control execution logic dynamically without code changes.

In [0]:
table_name = f"{catalog_name}.{schema_name}.run_config"

spark.sql(f"""
CREATE OR REPLACE TABLE {table_name} (
  config_name STRING,
  config_value STRING,
  updated_ts TIMESTAMP
)
COMMENT 'Configuration parameters for data pipelines'
""")

# Enable default support on the table
spark.sql(f"""
ALTER TABLE {table_name}
SET TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported')
""")

# Apply default expressions now
spark.sql(f"""
ALTER TABLE {table_name}
ALTER COLUMN updated_ts SET DEFAULT current_timestamp()
""")

config_data = [
    ("default_timezone", "GMT"),
    ("historical_start_date", "2023-01-01"),
    ("forecast_horizon_days", "7"),
    ("refresh_frequency_minutes", "15")
]

columns = ["config_name", "config_value"]
config_df = spark.createDataFrame(config_data, columns)

# Append only config_name and config_value; updated_ts is omitted
# from the DataFrame and filled in by the column default.
(
    config_df
    .write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable(table_name)
)

print("✅ run_config table created and populated.")


✅ run_config table created and populated.
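
Every `config_value` is stored as a string, so pipelines need to convert values to the right type when reading them. A minimal sketch of that step, assuming the four parameters seeded above; `parse_run_config` is a hypothetical helper.

```python
from datetime import date


def parse_run_config(rows):
    """Convert (config_name, config_value) string pairs into typed values."""
    raw = dict(rows)
    return {
        "default_timezone": raw["default_timezone"],
        "historical_start_date": date.fromisoformat(raw["historical_start_date"]),
        "forecast_horizon_days": int(raw["forecast_horizon_days"]),
        "refresh_frequency_minutes": int(raw["refresh_frequency_minutes"]),
    }


# In a notebook the pairs could come from the table, for example:
#   rows = [(r.config_name, r.config_value)
#           for r in spark.table("env_catalog.env_data.run_config").collect()]
rows = [
    ("default_timezone", "GMT"),
    ("historical_start_date", "2023-01-01"),
    ("forecast_horizon_days", "7"),
    ("refresh_frequency_minutes", "15"),
]

config = parse_run_config(rows)
print(config["refresh_frequency_minutes"] * 60)  # 900 (refresh interval in seconds)
```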


## 🧾 Step 5 – Pipeline Log Table

The **pipeline_log** table captures audit information for all pipeline runs.  
Fields:
- `run_id`, `pipeline_name`, `run_type`, `triggered_by`  
- `start_time`, `end_time`, `status`  
- `records_processed`, `earliest_ts`, `latest_ts`  
- `remarks`, `created_ts`

In [0]:
table_name = f"{catalog_name}.{schema_name}.pipeline_log"
spark.sql(f"""
CREATE OR REPLACE TABLE {table_name} (
  run_id STRING,
  pipeline_name STRING,
  run_type STRING COMMENT 'HISTORICAL, DAILY, 15MIN', 
  start_time TIMESTAMP COMMENT 'Run start time', 
  end_time TIMESTAMP COMMENT 'Run end time', 
  status STRING COMMENT 'SUCCESS, FAILED, RUNNING',
  records_processed BIGINT,
  earliest_ts TIMESTAMP COMMENT 'Earliest data timestamp', 
  latest_ts TIMESTAMP COMMENT 'Latest data timestamp', 
  triggered_by STRING COMMENT 'Manual, Scheduled, Event',
  remarks STRING,
  created_ts TIMESTAMP
)
COMMENT 'Pipeline execution audit and monitoring log'
""")

# Enable default support on the table
spark.sql(f"""
ALTER TABLE {table_name}
SET TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported')
""")

# Apply default expressions now
spark.sql(f"""
ALTER TABLE {table_name}
ALTER COLUMN created_ts SET DEFAULT current_timestamp()
""")

print("✅ pipeline_log table created.")

✅ pipeline_log table created.
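
A pipeline would normally write one row when a run starts and fill in the outcome later. The sketch below builds a parameterized `INSERT` for the start of a run. `build_run_start` and the pipeline name `weather_daily` are hypothetical, and executing the statement assumes a runtime where `spark.sql` accepts an `args` dictionary for named parameter markers (PySpark 3.4 or later).

```python
import uuid


def build_run_start(table_name, pipeline_name, run_type, triggered_by):
    """Return (statement, args) that record the start of a pipeline run."""
    statement = (
        f"INSERT INTO {table_name} "
        "(run_id, pipeline_name, run_type, start_time, status, triggered_by) "
        "VALUES (:run_id, :pipeline_name, :run_type, "
        "current_timestamp(), 'RUNNING', :triggered_by)"
    )
    args = {
        "run_id": str(uuid.uuid4()),  # unique id for this run
        "pipeline_name": pipeline_name,
        "run_type": run_type,          # HISTORICAL, DAILY, or 15MIN
        "triggered_by": triggered_by,  # Manual, Scheduled, or Event
    }
    return statement, args


statement, args = build_run_start(
    "env_catalog.env_data.pipeline_log", "weather_daily", "DAILY", "Manual"
)

# In a notebook (assumption: named parameter markers are supported):
#   spark.sql(statement, args=args)
```

`created_ts` is left out of the column list so that the column default supplies it.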


## 🔍 Step 6 – Verification

Quick validation to confirm setup success:
- All tables exist under `env_data`
- City Master contains seed data
- Config table shows expected parameters

In [0]:
print("📋 Tables in schema:")
display(spark.sql(f"SHOW TABLES IN {catalog_name}.{schema_name}"))

print("🌍 City Master sample:")
display(spark.sql(f"SELECT city_name, country_name, is_active AS active_flag FROM {catalog_name}.{schema_name}.city_master LIMIT 10"))

print("⚙️ Config settings:")
display(spark.sql(f"SELECT * FROM {catalog_name}.{schema_name}.run_config"))

📋 Tables in schema:


database,tableName,isTemporary
env_data,city_master,False
env_data,pipeline_log,False
env_data,run_config,False


🌍 City Master sample:


city_name,country_name,active_flag
New York,United States,True
London,United Kingdom,True
Dubai,United Arab Emirates,True
Singapore,Singapore,True
Sydney,Australia,True
Amsterdam,Netherlands,True
Tokyo,Japan,True
Moscow,Russia,True
Hong Kong,Hong Kong,True
Shanghai,China,True


⚙️ Config settings:


config_name,config_value,updated_ts
default_timezone,GMT,2025-11-11T09:03:39.386Z
historical_start_date,2023-01-01,2025-11-11T09:03:39.386Z
forecast_horizon_days,7,2025-11-11T09:03:39.386Z
refresh_frequency_minutes,15,2025-11-11T09:03:39.386Z
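
The visual checks above can also be expressed as assertions so that a rerun fails loudly when something is missing. A minimal sketch, assuming the three tables, 25 seed cities, and four parameters created in this notebook; `check_setup` is a hypothetical helper.

```python
EXPECTED_TABLES = {"city_master", "pipeline_log", "run_config"}
EXPECTED_CONFIG = {
    "default_timezone",
    "historical_start_date",
    "forecast_horizon_days",
    "refresh_frequency_minutes",
}
EXPECTED_CITY_COUNT = 25


def check_setup(found_tables, city_count, config_names):
    """Return a list of human-readable problems; an empty list means success."""
    problems = []
    missing_tables = EXPECTED_TABLES - set(found_tables)
    if missing_tables:
        problems.append(f"missing tables: {sorted(missing_tables)}")
    if city_count != EXPECTED_CITY_COUNT:
        problems.append(f"expected {EXPECTED_CITY_COUNT} cities, found {city_count}")
    missing_config = EXPECTED_CONFIG - set(config_names)
    if missing_config:
        problems.append(f"missing config: {sorted(missing_config)}")
    return problems


# In a notebook the inputs could be collected from Spark, for example:
#   found_tables = [r.tableName for r in spark.sql("SHOW TABLES IN env_catalog.env_data").collect()]
#   city_count = spark.table("env_catalog.env_data.city_master").count()
problems = check_setup(
    ["city_master", "pipeline_log", "run_config"], 25, sorted(EXPECTED_CONFIG)
)
assert not problems, problems
```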


---
## 🏁 Summary

**Setup Completed Successfully**

| Component | Description |
|------------|-------------|
| 📚 Catalog/Schema | `env_catalog.env_data` created |
| 📦 Volumes | Weather, Pollution, Live Feeds, Checkpoints |
| 🌍 City Master | 25 Cities (Insert-only, Versioned) |
| ⚙️ Config | Runtime parameters for dynamic control |
| 🧾 Pipeline Log | Execution tracking & monitoring |

Next → Proceed to **Phase 1: Data Ingestion Pipelines (Historical, Daily, 15-min)**  
using the `city_master` and configuration tables as references.

---
