# Day 1: Environment Setup & Data Exploration
# Smart City IoT Analytics Pipeline

---

## 🎯 LEARNING OBJECTIVES:
- Configure Spark cluster and development environment
- Understand IoT data characteristics and challenges  
- Implement basic data ingestion patterns
- Explore PySpark DataFrame operations

## 📅 SCHEDULE:
**Morning (4 hours):**
1. Environment Setup (2 hours)
2. Data Exploration (2 hours)

**Afternoon (4 hours):**  
3. Basic Data Ingestion (2 hours)
4. Initial Data Transformations (2 hours)

## ✅ DELIVERABLES:
- Working Spark cluster with all services running
- Data ingestion notebook with basic EDA
- Documentation of data quality findings  
- Initial data loading pipeline functions

---

In [1]:
print("🚀 Welcome to the Smart City IoT Analytics Pipeline!")
print("=" * 60)

🚀 Welcome to the Smart City IoT Analytics Pipeline!


In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import json
import warnings
warnings.filterwarnings('ignore')

# Import PySpark libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pyspark.sql.functions as F
import pyspark.testing

---

# SECTION 1: ENVIRONMENT SETUP (Morning - 2 hours)

---

## TODO 1.1: Initialize Spark Session (15 minutes)

🎯 **TASK:** Create a Spark session configured for local development  
💡 **HINT:** Use SparkSession.builder with appropriate configurations  
📚 **DOCS:** https://spark.apache.org/docs/latest/sql-getting-started.html

**TODO:** Create Spark session with the following configurations:
- App name: "SmartCityIoTPipeline-Day1"
- Master: "local[*]" (use all available cores)
- Memory: "4g" for driver
- Additional configs for better performance

In [3]:
# TODO: Create Spark session with the following configurations:
jdbc_jar_path = "/Users/sai/Documents/Projects/ninth-week/SparkCity/postgresql-42.7.3.jar"
spark = (SparkSession.builder
         .appName("SmartCityIoTPipeline-Day1")  # TODO: Add your app name
         .master("local[*]")   # TODO: Add master configuration
         .config("spark.driver.memory", "4g")  # TODO: Set memory
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         .config("spark.jars", jdbc_jar_path)  # <-- Add this line for JDBC driver
         .getOrCreate())

# TODO: Verify Spark session is working
print("✅ Spark Session Details:")
print(f"   App Name: {spark.sparkContext.appName}")
print(f"   Spark Version: {spark.version}")
print(f"   Master: {spark.sparkContext.master}")
print(f"   Default Parallelism: {spark.sparkContext.defaultParallelism}")

25/09/04 19:13:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


✅ Spark Session Details:
   App Name: SmartCityIoTPipeline-Day1
   Spark Version: 4.0.0
   Master: local[*]
   Default Parallelism: 8


## TODO 1.2: Verify Infrastructure (15 minutes)

🎯 **TASK:** Check that all infrastructure services are running  
💡 **HINT:** Test database connectivity and file system access

In [43]:
# TODO: Test PostgreSQL connection
def test_database_connection():
    """Test connection to PostgreSQL database"""
    try:
        # Database connection parameters
        db_properties = {
            "user": "postgres",
            "password": "password", 
            "driver": "org.postgresql.Driver"
        }
        
        # TODO: Replace with actual connection test
        # Test query - should create a simple DataFrame from database
        test_df = spark.read.jdbc(
            url="jdbc:postgresql://localhost:5432/smartcity",
            table="(SELECT 1 as test_column) as test_table",
            properties=db_properties
        )
        
        # TODO: Collect and display result
        result = test_df.collect()
        print("✅ Database connection successful!")
        return True
        
    except Exception as e:
        print(f"❌ Database connection failed: {str(e)}")
        print("💡 Make sure PostgreSQL container is running: docker-compose up -d")
        return False

# TODO: Run the database connection test
db_connected = test_database_connection()

# TODO: Check Spark UI accessibility
print("\n🌐 Spark UI should be accessible at: http://localhost:4040")
print("   (Open this in your browser to monitor Spark jobs)")

✅ Database connection successful!

🌐 Spark UI should be accessible at: http://localhost:4040
   (Open this in your browser to monitor Spark jobs)


## TODO 1.3: Generate Sample Data (30 minutes)

🎯 **TASK:** Run the data generation script to create sample IoT data  
💡 **HINT:** Use the provided data generation script or run it manually

In [4]:
# Generate Sample IoT Data
import os
import subprocess

def check_and_generate_data():
    """Check if data exists, generate if missing"""
    data_files = ["traffic_sensors.csv", "air_quality.json", "weather_data.parquet", 
                  "energy_meters.csv", "city_zones.csv"]
    data_path = "data/raw"
    
    # Check existing files
    existing = [f for f in data_files if os.path.exists(f"{data_path}/{f}")]
    
    if len(existing) == len(data_files):
        print(f"✅ All {len(data_files)} data files found!")
        return True
    
    # Generate missing data
    print(f"🔄 Generating data... ({len(existing)}/{len(data_files)} files exist)")
    
    try:
        # Get the project root (go up one level from notebooks folder)
        notebook_dir = os.getcwd()  # Current directory (notebooks/)
        project_root = os.path.dirname(notebook_dir)  # Go up one level to SparkCity/
        
        print(f"   Project root: {project_root}")
        print(f"   Running script from: {project_root}")
        
        # Run from project root directory
        result = subprocess.run(
            ["python", "scripts/generate_data.py"], 
            cwd=project_root,  # Run from SparkCity/ directory
            capture_output=True, 
            text=True, 
            timeout=300
        )
        
        if result.returncode == 0:
            print("✅ Data generation successful!")
            return True
        else:
            print("❌ Generation failed!")
            if result.stderr:
                print(f"   Error: {result.stderr.strip()}")
            if result.stdout:
                print(f"   Output: {result.stdout.strip()}")
            return False
            
    except Exception as e:
        print(f"❌ Error: {e}")
        return False

# Run data check/generation
data_ready = check_and_generate_data()

# If failed, provide clear manual instructions
if not data_ready:
    print("\n🔧 MANUAL FIX:")
    print("1. Open terminal")
    print("2. Run: cd /Users/sai/Documents/Projects/ninth-week/SparkCity")
    print("3. Run: python scripts/generate_data.py")
    print("4. Re-run this cell")

🔄 Generating data... (0/5 files exist)
   Project root: /Users/sai/Documents/Projects/ninth-week/SparkCity
   Running script from: /Users/sai/Documents/Projects/ninth-week/SparkCity
✅ Data generation successful!


---

# SECTION 2: DATA EXPLORATION (Morning - 2 hours)

---

In [5]:
print("\n" + "=" * 60)
print("📊 SECTION 2: EXPLORATORY DATA ANALYSIS")
print("=" * 60)


📊 SECTION 2: EXPLORATORY DATA ANALYSIS


## TODO 2.1: Load and Examine Data Sources (45 minutes)

🎯 **TASK:** Load each data source and examine its structure  
💡 **HINT:** Use appropriate Spark readers for different file formats  
📚 **CONCEPTS:** Schema inference, file formats, data types

In [6]:
# Define data directory
data_dir = "../data/raw"

In [7]:
# TODO: Load city zones reference data
print("📍 Loading City Zones Reference Data...")
try:
    zones_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{data_dir}/city_zones.csv")
    
    # TODO: Display basic information about zones
    print(f"   📊 Records: {zones_df.count()}")
    print(f"   📋 Schema:")
    zones_df.printSchema()
    
    # TODO: Show sample data
    print(f"   🔍 Sample Data:")
    zones_df.show(5, truncate=False)
    
except Exception as e:
    print(f"❌ Error loading zones data: {str(e)}")

📍 Loading City Zones Reference Data...
   📊 Records: 8
   📋 Schema:
root
 |-- zone_id: string (nullable = true)
 |-- zone_name: string (nullable = true)
 |-- zone_type: string (nullable = true)
 |-- lat_min: double (nullable = true)
 |-- lat_max: double (nullable = true)
 |-- lon_min: double (nullable = true)
 |-- lon_max: double (nullable = true)
 |-- population: integer (nullable = true)

   🔍 Sample Data:
+--------+------------------+-----------+-------+-------+-------+-------+----------+
|zone_id |zone_name         |zone_type  |lat_min|lat_max|lon_min|lon_max|population|
+--------+------------------+-----------+-------+-------+-------+-------+----------+
|ZONE_001|Downtown          |commercial |40.72  |40.74  |-74.01 |-73.99 |25000     |
|ZONE_002|Financial District|commercial |40.7   |40.72  |-74.02 |-74.0  |15000     |
|ZONE_003|Residential North |residential|40.76  |40.8   |-74.0  |-73.98 |45000     |
|ZONE_004|Residential South |residential|40.7   |40.72  |-73.98 |-73.96 |38000

In [8]:
# TODO: Load traffic sensors data  
print("\n🚗 Loading Traffic Sensors Data...")
try:
    # TODO: Load CSV file with proper options
    traffic_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{data_dir}/traffic_sensors.csv")
    
    # TODO: Display basic information
    print(f"   📊 Records: {traffic_df.count()}")
    print(f"   📋 Schema:")
    traffic_df.printSchema()
    
    # TODO: Show sample data
    print(f"   🔍 Sample Data:")
    traffic_df.show(5)
    
except Exception as e:
    print(f"❌ Error loading traffic data: {str(e)}")


🚗 Loading Traffic Sensors Data...
   📊 Records: 100850
   📋 Schema:
root
 |-- sensor_id: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- location_lat: double (nullable = true)
 |-- location_lon: double (nullable = true)
 |-- vehicle_count: integer (nullable = true)
 |-- avg_speed: double (nullable = true)
 |-- congestion_level: string (nullable = true)
 |-- road_type: string (nullable = true)

   🔍 Sample Data:
+-----------+--------------------+------------------+------------------+-------------+------------------+----------------+-----------+
|  sensor_id|           timestamp|      location_lat|      location_lon|vehicle_count|         avg_speed|congestion_level|  road_type|
+-----------+--------------------+------------------+------------------+-------------+------------------+----------------+-----------+
|TRAFFIC_001|2025-08-28 19:13:...| 40.78941849302994| -74.0041930409108|           32|16.417025239365007|            high|residential|
|TRAFFIC_002|2025-

In [9]:
# TODO: Load air quality data (JSON format) - CORRECTED VERSION
print("\n🌫️ Loading Air Quality Data...")
try:
    # First, try loading without corrupt record handling since the JSON seems valid
    print("   🔍 Loading JSON file...")
    
    air_quality_df = (spark.read
                     .option("multiline", "true")
                     .json(f"{data_dir}/air_quality.json"))
    
    # Check if we loaded successfully
    total_records = air_quality_df.count()
    print(f"   📊 Total records loaded: {total_records}")
    
    # Check the schema
    print(f"   📋 Schema:")
    air_quality_df.printSchema()
    
    # Show sample data
    print(f"   🔍 Sample Data:")
    air_quality_df.show(5)
    
    # Check for any missing sensor_id values (basic quality check)
    missing_sensor_ids = air_quality_df.filter(F.col("sensor_id").isNull()).count()
    if missing_sensor_ids > 0:
        print(f"   ⚠️  Found {missing_sensor_ids} records with missing sensor_id")
    else:
        print("   ✅ All records have sensor_id values")
    
    print("   ✅ Air quality data loaded successfully!")
        
except Exception as e:
    print(f"❌ Error loading air quality data: {str(e)}")
    
    # If the above fails, try alternative approach
    print("   🔄 Trying alternative loading method...")
    try:
        # Try without multiline option
        air_quality_df = spark.read.json(f"{data_dir}/air_quality.json")
        
        total_records = air_quality_df.count()
        print(f"   📊 Alternative method - Records loaded: {total_records}")
        
        if total_records > 0:
            print("   ✅ Alternative method successful!")
            air_quality_df.printSchema()
            air_quality_df.show(5)
        else:
            print("   ❌ No records loaded with alternative method")
            air_quality_df = None
            
    except Exception as e2:
        print(f"   ❌ Alternative method also failed: {str(e2)}")
        air_quality_df = None


🌫️ Loading Air Quality Data...
   🔍 Loading JSON file...
   📊 Total records loaded: 13460
   📋 Schema:
root
 |-- co: double (nullable = true)
 |-- humidity: double (nullable = true)
 |-- location_lat: double (nullable = true)
 |-- location_lon: double (nullable = true)
 |-- no2: double (nullable = true)
 |-- pm10: double (nullable = true)
 |-- pm25: double (nullable = true)
 |-- sensor_id: string (nullable = true)
 |-- temperature: double (nullable = true)
 |-- timestamp: string (nullable = true)

   🔍 Sample Data:
+------------------+-----------------+------------------+------------------+------------------+------------------+------------------+---------+------------------+--------------------+
|                co|         humidity|      location_lat|      location_lon|               no2|              pm10|              pm25|sensor_id|       temperature|           timestamp|
+------------------+-----------------+------------------+------------------+------------------+---------------

In [10]:
# TODO: Load weather data (Parquet format)
print("\n🌤️ Loading Weather Data...")
try:
    # TODO: Load Parquet file - another different format!
    weather_df = spark.read.parquet(f"{data_dir}/weather_data.parquet")
    
    # TODO: Display basic information
    print(f"   📊 Records: {weather_df.count()}")
    print(f"   📋 Schema:")
    weather_df.printSchema()
    
    # TODO: Show sample data
    print(f"   🔍 Sample Data:")
    weather_df.show(5)
    
except Exception as e:
    print(f"❌ Error loading weather data: {str(e)}")


🌤️ Loading Weather Data...
   📊 Records: 3370
   📋 Schema:
root
 |-- station_id: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- location_lat: double (nullable = true)
 |-- location_lon: double (nullable = true)
 |-- temperature: double (nullable = true)
 |-- humidity: double (nullable = true)
 |-- wind_speed: double (nullable = true)
 |-- wind_direction: double (nullable = true)
 |-- precipitation: double (nullable = true)
 |-- pressure: double (nullable = true)

   🔍 Sample Data:
+-----------+--------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+
| station_id|           timestamp|      location_lat|      location_lon|       temperature|          humidity|        wind_speed|    wind_direction|     precipitation|          pressure|
+-----------+--------------------+------------------+------------------+------------------+------------------+-----

In [11]:
# TODO: Load energy meters data
print("\n⚡ Loading Energy Meters Data...")
try:
    # TODO: Load CSV file
    energy_df = spark.read.option("header", "true").option("inferSchema", "true").csv(f"{data_dir}/energy_meters.csv")
    
    # TODO: Display basic information
    print(f"   📊 Records: {energy_df.count()}")
    print(f"   📋 Schema:")
    energy_df.printSchema()
    
    # TODO: Show sample data
    print(f"   🔍 Sample Data:")
    energy_df.show(5)
    
except Exception as e:
    print(f"❌ Error loading energy data: {str(e)}")


⚡ Loading Energy Meters Data...
   📊 Records: 201800
   📋 Schema:
root
 |-- meter_id: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- building_type: string (nullable = true)
 |-- location_lat: double (nullable = true)
 |-- location_lon: double (nullable = true)
 |-- power_consumption: double (nullable = true)
 |-- voltage: double (nullable = true)
 |-- current: double (nullable = true)
 |-- power_factor: double (nullable = true)

   🔍 Sample Data:
+-----------+--------------------+-------------+------------------+------------------+------------------+------------------+------------------+------------------+
|   meter_id|           timestamp|building_type|      location_lat|      location_lon| power_consumption|           voltage|           current|      power_factor|
+-----------+--------------------+-------------+------------------+------------------+------------------+------------------+------------------+------------------+
|ENERGY_0001|2025-08-28 19:13:..

## TODO 2.2: Basic Data Quality Assessment (45 minutes)

🎯 **TASK:** Assess data quality across all datasets  
💡 **HINT:** Check for missing values, duplicates, data ranges  
📚 **CONCEPTS:** Data profiling, quality metrics, anomaly detection

In [12]:
def assess_data_quality(df, dataset_name):
    """
    Perform basic data quality assessment on a DataFrame
    
    Args:
        df: Spark DataFrame to assess
        dataset_name: Name of the dataset for reporting
    """
    print(f"\n📋 Data Quality Assessment: {dataset_name}")
    print("-" * 50)
    
    # TODO: Basic statistics
    total_rows = df.count()
    total_cols = len(df.columns)
    print(f"   📊 Dimensions: {total_rows:,} rows × {total_cols} columns")
    
    # TODO: Check for missing values
    print(f"   🔍 Missing Values:")
    for col in df.columns:
        missing_count = df.filter(F.col(col).isNull()).count()
        missing_pct = (missing_count / total_rows) * 100
        if missing_count > 0:
            print(f"      {col}: {missing_count:,} ({missing_pct:.2f}%)")
    
    # TODO: Check for duplicate records
    duplicate_count = total_rows - df.dropDuplicates().count()
    if duplicate_count > 0:
        print(f"   🔄 Duplicate Records: {duplicate_count:,}")
    else:
        print(f"   ✅ No duplicate records found")
    
    # TODO: Numeric column statistics
    numeric_cols = [field.name for field in df.schema.fields 
                   if field.dataType in [IntegerType(), DoubleType(), FloatType(), LongType()]]
    
    if numeric_cols:
        print(f"   📈 Numeric Columns Summary:")
        # Show basic statistics for numeric columns
        df.select(numeric_cols).describe().show()

In [13]:
# TODO: Assess quality for each dataset
datasets = [
    (zones_df, "City Zones"),
    (traffic_df, "Traffic Sensors"), 
    (air_quality_df, "Air Quality"),
    (weather_df, "Weather Stations"),
    (energy_df, "Energy Meters")
]

for df, name in datasets:
    try:
        assess_data_quality(df, name)
    except Exception as e:
        print(f"❌ Error assessing {name}: {str(e)}")


📋 Data Quality Assessment: City Zones
--------------------------------------------------
   📊 Dimensions: 8 rows × 8 columns
   🔍 Missing Values:
   ✅ No duplicate records found
   📈 Numeric Columns Summary:


25/09/04 19:14:34 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-------+--------------------+--------------------+-------------------+-------------------+------------------+
|summary|             lat_min|             lat_max|            lon_min|            lon_max|        population|
+-------+--------------------+--------------------+-------------------+-------------------+------------------+
|  count|                   8|                   8|                  8|                  8|                 8|
|   mean|  40.730000000000004|             40.7525| -73.99125000000001|          -73.97125|           21250.0|
| stddev|0.023904572186687328|0.028157719063465373|0.02474873734153055|0.02474873734153458|14260.334598358582|
|    min|                40.7|               40.72|             -74.02|              -74.0|              5000|
|    max|               40.76|                40.8|             -73.96|             -73.94|             45000|
+-------+--------------------+--------------------+-------------------+-------------------+------------------+



## TODO 2.3: Temporal Analysis (30 minutes)

🎯 **TASK:** Analyze temporal patterns in the IoT data  
💡 **HINT:** Look at data distribution over time, identify patterns  
📚 **CONCEPTS:** Time series analysis, temporal patterns, data distribution

In [14]:
print("\n" + "=" * 60) 
print("⏰ TEMPORAL PATTERN ANALYSIS")
print("=" * 60)


⏰ TEMPORAL PATTERN ANALYSIS


In [15]:
# TODO: Analyze traffic patterns by hour
print("\n🚗 Traffic Patterns by Hour:")
try:
    # TODO: Extract hour from timestamp and analyze vehicle counts
    traffic_hourly = (traffic_df
                     .withColumn("hour", F.hour("timestamp"))
                     .groupBy("hour")
                     .agg(F.avg("vehicle_count").alias("avg_vehicles"),
                          F.count("*").alias("readings"))
                     .orderBy("hour"))
    
    # TODO: Show the results
    traffic_hourly.show(24)
    
    # TODO: What patterns do you notice? Add your observations here:
    print("📝 OBSERVATIONS:")
    print("   - Rush hour patterns: [YOUR ANALYSIS HERE]")
    print("   - Off-peak periods: [YOUR ANALYSIS HERE]")
    print("   - Peak traffic hours: [YOUR ANALYSIS HERE]")
    
except Exception as e:
    print(f"❌ Error analyzing traffic patterns: {str(e)}")


🚗 Traffic Patterns by Hour:
+----+------------------+--------+
|hour|      avg_vehicles|readings|
+----+------------------+--------+
|   0|18.821190476190477|    4200|
|   1|18.846190476190475|    4200|
|   2|18.988095238095237|    4200|
|   3|18.810238095238095|    4200|
|   4|18.785476190476192|    4200|
|   5|18.865238095238094|    4200|
|   6|19.023809523809526|    4200|
|   7| 32.33428571428571|    4200|
|   8|32.309761904761906|    4200|
|   9| 32.45738095238095|    4200|
|  10|18.930238095238096|    4200|
|  11| 18.48857142857143|    4200|
|  12|18.714523809523808|    4200|
|  13|18.919285714285714|    4200|
|  14|18.798333333333332|    4200|
|  15|18.811666666666667|    4200|
|  16|            18.595|    4200|
|  17| 32.43785714285714|    4200|
|  18| 32.24738095238095|    4200|
|  19|32.567764705882354|    4250|
|  20|             18.66|    4200|
|  21|18.568571428571428|    4200|
|  22| 18.61452380952381|    4200|
|  23|19.010238095238094|    4200|
+----+------------------+-

In [16]:
# TODO: Analyze air quality patterns by day of week
print("\n🌫️ Air Quality Patterns by Day of Week:")
try:
    # TODO: Extract day of week and analyze PM2.5 levels
    air_quality_daily = (air_quality_df
                        .withColumn("day_of_week", F.dayofweek("timestamp"))
                        .groupBy("day_of_week")
                        .agg(F.avg("pm25").alias("avg_pm25"),
                             F.avg("no2").alias("avg_no2"))
                        .orderBy("day_of_week"))
    
    # TODO: Show results
    air_quality_daily.show()
    
    # TODO: Add your observations
    print("📝 OBSERVATIONS:")
    print("   - Weekday vs weekend patterns: [YOUR ANALYSIS HERE]")
    print("   - Pollution trends: [YOUR ANALYSIS HERE]")
    
except Exception as e:
    print(f"❌ Error analyzing air quality patterns: {str(e)}")


🌫️ Air Quality Patterns by Day of Week:
+-----------+------------------+------------------+
|day_of_week|          avg_pm25|           avg_no2|
+-----------+------------------+------------------+
|          1|26.913545169521978| 32.50367568259643|
|          2|26.748436329345452| 32.07425375361587|
|          3|26.622715112231916| 32.06719516571267|
|          4|26.569904062775556|32.443047171509015|
|          5|26.922534687700107| 32.17436968878897|
|          6|27.208702754080225|32.356298105760736|
|          7|26.967586432048975|32.481645560951094|
+-----------+------------------+------------------+

📝 OBSERVATIONS:
   - Weekday vs weekend patterns: [YOUR ANALYSIS HERE]
   - Pollution trends: [YOUR ANALYSIS HERE]


---

# SECTION 3: BASIC DATA INGESTION (Afternoon - 2 hours)

---

In [17]:
print("\n" + "=" * 60)
print("📥 SECTION 3: DATA INGESTION PIPELINE")
print("=" * 60)


📥 SECTION 3: DATA INGESTION PIPELINE


## TODO 3.1: Create Reusable Data Loading Functions (60 minutes)

🎯 **TASK:** Create reusable functions for loading different data formats  
💡 **HINT:** Handle schema validation and error handling  
📚 **CONCEPTS:** Function design, error handling, schema enforcement

In [19]:
def load_csv_data(file_path, expected_schema=None):
    """
    Load CSV data with proper error handling and schema validation
    
    Args:
        file_path: Path to CSV file
        expected_schema: Optional StructType for schema enforcement
        
    Returns:
        Spark DataFrame or None if error
    """
    try:
        # TODO: Implement CSV loading with options
        df = spark.read.option("header", "true").csv(file_path)

        # TODO: Add schema validation if provided
        if expected_schema:
            # Validate schema matches expected
            pass
            
        print(f"✅ Successfully loaded CSV: {file_path}")
        return df
        
    except Exception as e:
        print(f"❌ Error loading CSV {file_path}: {str(e)}")
        return None

def load_json_data(file_path):
    """
    Load JSON data with error handling
    
    Args:
        file_path: Path to JSON file
        
    Returns:
        Spark DataFrame or None if error
    """
    try:
        # TODO: Implement JSON loading
        df = spark.read.json(file_path)
        
        print(f"✅ Successfully loaded JSON: {file_path}")
        return df
        
    except Exception as e:
        print(f"❌ Error loading JSON {file_path}: {str(e)}")
        return None

def load_parquet_data(file_path):
    """
    Load Parquet data with error handling
    
    Args:
        file_path: Path to Parquet file
        
    Returns:
        Spark DataFrame or None if error
    """
    try:
        # TODO: Implement Parquet loading
        df = spark.read.parquet(file_path)
        
        print(f"✅ Successfully loaded Parquet: {file_path}")
        return df
        
    except Exception as e:
        print(f"❌ Error loading Parquet {file_path}: {str(e)}")
        return None

In [20]:
# TODO: Test your loading functions
print("🧪 Testing Data Loading Functions:")

test_files = [
    (f"{data_dir}/city_zones.csv", "CSV", load_csv_data),
    (f"{data_dir}/air_quality.json", "JSON", load_json_data), 
    (f"{data_dir}/weather_data.parquet", "Parquet", load_parquet_data)
]

for file_path, file_type, load_func in test_files:
    print(f"\n   Testing {file_type} loader...")
    test_df = load_func(file_path)
    if test_df:
        print(f"      Records loaded: {test_df.count():,}")

🧪 Testing Data Loading Functions:

   Testing CSV loader...
✅ Successfully loaded CSV: ../data/raw/city_zones.csv
      Records loaded: 8

   Testing JSON loader...
✅ Successfully loaded JSON: ../data/raw/air_quality.json
      Records loaded: 161,522

   Testing Parquet loader...
✅ Successfully loaded Parquet: ../data/raw/weather_data.parquet
      Records loaded: 3,370


## TODO 3.2: Schema Definition and Enforcement (60 minutes)

🎯 **TASK:** Define explicit schemas for data consistency  
💡 **HINT:** Use StructType and StructField for schema definition  
📚 **CONCEPTS:** Schema design, data types, schema enforcement

In [21]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

# TODO: Define schema for traffic sensors
traffic_schema = StructType([
    StructField("sensor_id", StringType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("location_lat", DoubleType(), False),
    StructField("location_lon", DoubleType(), False),
    # TODO: Add remaining fields
    StructField("vehicle_count", IntegerType(), False),
    StructField("avg_speed", DoubleType(), False),
    StructField("congestion_level", StringType(), False),
    StructField("road_type", StringType(), False),
])

# TODO: Define schema for air quality data
air_quality_schema = StructType([
    # TODO: Define all fields for air quality data
    # Hint: Look at the JSON structure and define appropriate types
    StructField("sensor_id", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("location_lat", DoubleType(), True),
    StructField("location_lon", DoubleType(), True),
    StructField("pm25", DoubleType(), True),
    StructField("pm10", DoubleType(), True),
    StructField("no2", DoubleType(), True),
    StructField("co", DoubleType(), True),
    StructField("temperature", DoubleType(), True),
    StructField("humidity", DoubleType(), True)
])

# TODO: Define schema for weather data
weather_schema = StructType([
    # TODO: Define all fields for weather data
    StructField("station_id", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("location_lat", DoubleType(), True),
    StructField("location_lon", DoubleType(), True),
    StructField("temperature", DoubleType(), True),
    StructField("humidity", DoubleType(), True),    
    StructField("wind_speed", DoubleType(), True),
    StructField("wind_direction", StringType(), True),
    StructField("precipitation", DoubleType(), True),
    StructField("pressure", DoubleType(), True)
])

# TODO: Define schema for energy data
energy_schema = StructType([
    # TODO: Define all fields for energy data
    StructField("meter_id", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("location_lat", DoubleType(), True),
    StructField("location_lon", DoubleType(), True),
    StructField("energy_consumption_kwh", DoubleType(), True),
    StructField("peak_demand_kw", DoubleType(), True),
    StructField("voltage", DoubleType(), True),
    StructField("current", DoubleType(), True),
    StructField("power_factor", DoubleType(), True),
    StructField("frequency", DoubleType(), True)
])

In [35]:
# TODO: Test schema enforcement
print("\n🔍 Testing Schema Enforcement:")

def load_with_schema(file_path, schema, file_format="csv"):
    """Load data with explicit schema enforcement"""
    try:
        if file_format == "csv":
            df = spark.read.schema(schema).option("header", "true").csv(file_path)
        elif file_format == "json":
            df = spark.read.schema(schema).json(file_path)
        elif file_format == "parquet":
            df = spark.read.schema(schema).parquet(file_path)
        
        print(f"✅ Schema enforcement successful for {file_path}")
        return df
        
    except Exception as e:
        print(f"❌ Schema enforcement failed for {file_path}: {str(e)}")
        return None

# TODO: Test with one of your schemas
test_schema_df = load_with_schema(f"{data_dir}/traffic_sensors.csv", air_quality_schema, "csv")
if test_schema_df:
    print("   Schema enforcement test passed!")
    test_schema_df.printSchema()


🔍 Testing Schema Enforcement:
✅ Schema enforcement successful for ../data/raw/traffic_sensors.csv
   Schema enforcement test passed!
root
 |-- sensor_id: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- location_lat: double (nullable = true)
 |-- location_lon: double (nullable = true)
 |-- pm25: double (nullable = true)
 |-- pm10: double (nullable = true)
 |-- no2: double (nullable = true)
 |-- co: double (nullable = true)
 |-- temperature: double (nullable = true)
 |-- humidity: double (nullable = true)



---

# SECTION 4: INITIAL DATA TRANSFORMATIONS (Afternoon - 2 hours)

---

In [22]:
print("\n" + "=" * 60)
print("🔄 SECTION 4: DATA TRANSFORMATIONS")
print("=" * 60)


🔄 SECTION 4: DATA TRANSFORMATIONS


## TODO 4.1: Timestamp Standardization (45 minutes)

🎯 **TASK:** Standardize timestamp formats across all datasets  
💡 **HINT:** Some datasets may have different timestamp formats  
📚 **CONCEPTS:** Date/time handling, format standardization, timezone handling

In [23]:
def standardize_timestamps(df, timestamp_col="timestamp"):
    """
    Standardize timestamp column across datasets
    
    Args:
        df: Input DataFrame
        timestamp_col: Name of timestamp column
        
    Returns:
        DataFrame with standardized timestamps
    """
    try:
        # TODO: Convert timestamps to standard format
        standardized_df = (df
                          .withColumn("timestamp_std", F.to_timestamp(F.col(timestamp_col)))
                          .drop(timestamp_col)
                          .withColumnRenamed("timestamp_std", timestamp_col))
        
        # TODO: Add derived time columns
        result_df = (standardized_df
                    .withColumn("year", F.year(timestamp_col))
                    .withColumn("month", F.month(timestamp_col))
                    .withColumn("day", F.dayofmonth(timestamp_col))
                    .withColumn("hour", F.hour(timestamp_col))
                    .withColumn("day_of_week", F.dayofweek(timestamp_col))
                    .withColumn("is_weekend", F.when(F.dayofweek(timestamp_col).isin([1, 7]), True).otherwise(False)))
        
        return result_df
        
    except Exception as e:
        print(f"❌ Error standardizing timestamps: {str(e)}")
        return df

In [26]:
# TODO: Test timestamp standardization
print("⏰ Testing Timestamp Standardization:")

# Test with traffic data
traffic_std = standardize_timestamps(traffic_df)
print("   Traffic data timestamp standardization:")
traffic_std.select("timestamp", "year", "month", "day", "hour", "day_of_week", "is_weekend").show(5)

⏰ Testing Timestamp Standardization:
   Traffic data timestamp standardization:
+--------------------+----+-----+---+----+-----------+----------+
|           timestamp|year|month|day|hour|day_of_week|is_weekend|
+--------------------+----+-----+---+----+-----------+----------+
|2025-08-28 19:13:...|2025|    8| 28|  19|          5|     false|
|2025-08-28 19:13:...|2025|    8| 28|  19|          5|     false|
|2025-08-28 19:13:...|2025|    8| 28|  19|          5|     false|
|2025-08-28 19:13:...|2025|    8| 28|  19|          5|     false|
|2025-08-28 19:13:...|2025|    8| 28|  19|          5|     false|
+--------------------+----+-----+---+----+-----------+----------+
only showing top 5 rows


## TODO 4.2: Geographic Zone Mapping (45 minutes)

🎯 **TASK:** Map sensor locations to city zones  
💡 **HINT:** Join sensor coordinates with zone boundaries  
📚 **CONCEPTS:** Spatial joins, geographic data, coordinate systems

In [28]:
def map_to_zones(sensor_df, zones_df):
    """
    Map sensor locations to city zones
    
    Args:
        sensor_df: DataFrame with sensor locations (lat, lon)
        zones_df: DataFrame with zone boundaries
        
    Returns:
        DataFrame with zone information added
    """
    try:
        # TODO: Create join condition for geographic mapping
        # A sensor is in a zone if its coordinates fall within zone boundaries
        join_condition = (
            (sensor_df.location_lat >= zones_df.lat_min) &
            (sensor_df.location_lat <= zones_df.lat_max) &
            (sensor_df.location_lon >= zones_df.lon_min) &
            (sensor_df.location_lon <= zones_df.lon_max)
        )
        
        # TODO: Perform the join
        result_df = (sensor_df
                    .join(zones_df, join_condition, "left")
                    .select(sensor_df["*"], 
                           zones_df.zone_id, 
                           zones_df.zone_name, 
                           zones_df.zone_type))
        
        return result_df
        
    except Exception as e:
        print(f"❌ Error mapping to zones: {str(e)}")
        return sensor_df

In [29]:
# TODO: Test zone mapping
print("\n🗺️ Testing Geographic Zone Mapping:")

# Test with traffic sensors
traffic_with_zones = map_to_zones(traffic_std, zones_df)
print("   Traffic sensors with zone mapping:")
traffic_with_zones.select("sensor_id", "location_lat", "location_lon", "zone_id", "zone_type").show(10)

# TODO: Verify mapping worked correctly
zone_distribution = traffic_with_zones.groupBy("zone_type").count().orderBy(F.desc("count"))
print("   Sensors by zone type:")
zone_distribution.show()


🗺️ Testing Geographic Zone Mapping:
   Traffic sensors with zone mapping:
+-----------+------------------+------------------+--------+----------+
|  sensor_id|      location_lat|      location_lon| zone_id| zone_type|
+-----------+------------------+------------------+--------+----------+
|TRAFFIC_001| 40.78941849302994| -74.0041930409108|    NULL|      NULL|
|TRAFFIC_002| 40.77304212963235|-73.91903939743946|    NULL|      NULL|
|TRAFFIC_003|40.797412843124846|-73.91180691102876|    NULL|      NULL|
|TRAFFIC_004| 40.76447004554071|-73.91102045449436|    NULL|      NULL|
|TRAFFIC_005|  40.7580006756147|-73.94232435909976|ZONE_008|    retail|
|TRAFFIC_006|40.705001678241466|-73.98101112420137|    NULL|      NULL|
|TRAFFIC_007| 40.75002147634578|-73.90253544928802|    NULL|      NULL|
|TRAFFIC_008| 40.75922357184977|-74.01290766686316|ZONE_005|industrial|
|TRAFFIC_009| 40.72301729656365|-73.92172955478621|    NULL|      NULL|
|TRAFFIC_010|40.781269783589245|-73.92301264097411|    NULL| 

## TODO 4.3: Data Type Conversions and Validations (30 minutes)

🎯 **TASK:** Ensure proper data types and add validation columns  
💡 **HINT:** Cast columns to appropriate types, add data quality flags  
📚 **CONCEPTS:** Data type conversion, validation rules, data quality flags

In [30]:
def add_data_quality_flags(df, sensor_type):
    """
    Add data quality validation flags to DataFrame
    
    Args:
        df: Input DataFrame
        sensor_type: Type of sensor for specific validations
        
    Returns:
        DataFrame with quality flags added
    """
    try:
        result_df = df
        
        # TODO: Add general quality flags
        result_df = result_df.withColumn("has_missing_values", 
                                        F.when(F.col("sensor_id").isNull(), True).otherwise(False))
        
        # TODO: Add sensor-specific validations
        if sensor_type == "traffic":
            # Traffic-specific validations
            result_df = (result_df
                        .withColumn("valid_speed", 
                                   F.when((F.col("avg_speed") >= 0) & (F.col("avg_speed") <= 100), True).otherwise(False))
                        .withColumn("valid_vehicle_count",
                                   F.when(F.col("vehicle_count") >= 0, True).otherwise(False)))
        
        elif sensor_type == "air_quality":
            # Air quality specific validations
            result_df = (result_df
                        .withColumn("valid_pm25",
                                   F.when((F.col("pm25") >= 0) & (F.col("pm25") <= 500), True).otherwise(False))
                        .withColumn("valid_temperature",
                                   F.when((F.col("temperature") >= -50) & (F.col("temperature") <= 50), True).otherwise(False)))
        
        # TODO: Add more sensor-specific validations
        
        return result_df
        
    except Exception as e:
        print(f"❌ Error adding quality flags: {str(e)}")
        return df

In [31]:
# TODO: Test data quality flags
print("\n🏷️ Testing Data Quality Flags:")

# Test with traffic data
traffic_with_flags = add_data_quality_flags(traffic_with_zones, "traffic")
print("   Traffic data with quality flags:")
traffic_with_flags.select("sensor_id", "avg_speed", "vehicle_count", "valid_speed", "valid_vehicle_count").show(10)

# TODO: Check quality flag distribution
quality_stats = (traffic_with_flags
                .agg(F.sum(F.when(F.col("valid_speed"), 1).otherwise(0)).alias("valid_speed_count"),
                     F.sum(F.when(F.col("valid_vehicle_count"), 1).otherwise(0)).alias("valid_vehicle_count_count"),
                     F.count("*").alias("total_records")))

print("   Quality statistics:")
quality_stats.show()


🏷️ Testing Data Quality Flags:
   Traffic data with quality flags:
+-----------+------------------+-------------+-----------+-------------------+
|  sensor_id|         avg_speed|vehicle_count|valid_speed|valid_vehicle_count|
+-----------+------------------+-------------+-----------+-------------------+
|TRAFFIC_001|16.417025239365007|           32|       true|               true|
|TRAFFIC_002|27.763808497333677|           25|       true|               true|
|TRAFFIC_003|13.221915930299499|           31|       true|               true|
|TRAFFIC_004|30.487846213776308|           28|       true|               true|
|TRAFFIC_005| 25.34875715626183|           32|       true|               true|
|TRAFFIC_006| 39.08658395648244|           31|       true|               true|
|TRAFFIC_007|16.090411274953127|           69|       true|               true|
|TRAFFIC_008|  23.6876171366908|           23|       true|               true|
|TRAFFIC_009|              10.0|           23|       true|     

---

# DAY 1 DELIVERABLES & CHECKPOINTS

---

In [32]:
print("\n" + "=" * 60)
print("📋 DAY 1 COMPLETION CHECKLIST")
print("=" * 60)


📋 DAY 1 COMPLETION CHECKLIST


In [49]:
# TODO: Complete this checklist by running the validation functions

def validate_day1_completion():
    """Validate that Day 1 objectives have been met"""
    
    checklist = {
        "spark_session_created": False,
        "database_connection_tested": False,
        "data_loaded_successfully": False,
        "data_quality_assessed": False,
        "loading_functions_created": False,
        "schemas_defined": False,
        "timestamp_standardization_working": False,
        "zone_mapping_implemented": False,
        "quality_flags_added": False
    }
    
    # TODO: Add validation logic for each item
    try:
        # Check Spark session
        if spark and spark.sparkContext._jsc:
            checklist["spark_session_created"] = True
            
        # Check if data exists
        if ('traffic_df' in globals() and traffic_df.count() > 0) and ('weather_df' in globals() and weather_df.count() > 0) and ('air_quality_df' in globals() and air_quality_df is not None and air_quality_df.count() > 0) and ('energy_df' in globals() and energy_df.count() > 0) and ('zones_df' in globals() and zones_df.count() > 0):
            checklist["data_loaded_successfully"] = True
            
        # TODO: Add more validation checks
        if 'db_connected' in globals() and db_connected:
            checklist["database_connection_tested"] = True

        if  "assess_data_quality" in globals():
            checklist["data_quality_assessed"] = True
        
        loading_functions = ['load_csv_data', 'load_json_data', 'load_parquet_data']
        if all(func in globals() for func in loading_functions):
            checklist["loading_functions_created"] = True
        
        schema_vars = ['traffic_schema', 'air_quality_schema', 'weather_schema', 'energy_schema']
        if all(schema in globals() for schema in schema_vars):
            checklist["schemas_defined"] = True
        
        if 'standardize_timestamps' in globals() and 'traffic_std' in globals():
            checklist["timestamp_standardization_working"] = True
        
        if 'map_to_zones' in globals() and 'traffic_with_zones' in globals():
            checklist["zone_mapping_implemented"] = True
         
        if 'add_data_quality_flags' in globals() and 'traffic_with_flags' in globals():
            checklist["quality_flags_added"] = True
        
    except Exception as e:
        print(f"❌ Validation error: {str(e)}")
    
    # Display results
    print("✅ COMPLETION STATUS:")
    for item, status in checklist.items():
        status_icon = "✅" if status else "❌"
        print(f"   {status_icon} {item.replace('_', ' ').title()}")
    
    import builtins
    completion_rate = builtins.sum(checklist.values()) / len(checklist) * 100
    print(f"\n📊 Overall Completion: {completion_rate:.1f}%")
    
    if completion_rate >= 80:
        print("🎉 Great job! You're ready for Day 2!")
    else:
        print("📝 Please review incomplete items before proceeding to Day 2.")
    
    return checklist

# TODO: Run the validation
completion_status = validate_day1_completion()

✅ COMPLETION STATUS:
   ✅ Spark Session Created
   ✅ Database Connection Tested
   ✅ Data Loaded Successfully
   ✅ Data Quality Assessed
   ✅ Loading Functions Created
   ✅ Schemas Defined
   ✅ Timestamp Standardization Working
   ✅ Zone Mapping Implemented
   ✅ Quality Flags Added

📊 Overall Completion: 100.0%
🎉 Great job! You're ready for Day 2!


---

# 🚀 WHAT'S NEXT?

---

## 📅 DAY 2 PREVIEW: Data Quality & Cleaning Pipeline

Tomorrow you'll work on:
1. 🔍 Comprehensive data quality assessment
2. 🧹 Advanced cleaning procedures for IoT sensor data  
3. 📊 Missing data handling and interpolation strategies
4. 🚨 Outlier detection and treatment methods
5. 📏 Data standardization and normalization

## 📚 RECOMMENDED PREPARATION:
- Review PySpark DataFrame operations
- Read about time series data quality challenges
- Familiarize yourself with statistical outlier detection methods

## 💾 SAVE YOUR WORK:
- Commit your notebook to Git
- Document any issues or questions for tomorrow
- Save any custom functions you created

## 🤝 QUESTIONS?
- Post in the class discussion forum
- Review Spark documentation for any unclear concepts
- Prepare questions for tomorrow's Q&A session

In [None]:
# TODO: Save your progress
print("\n💾 Don't forget to save your notebook and commit your changes!")

# Clean up (optional)
# spark.stop()