# Extreme Weather Data Analysis

This notebook analyzes historical precipitation data from 2020 through 2024 to find the hourly time slots with the highest mean precipitation. It then creates grid runs for those slots and generates grids and contours for these extreme weather events, so we can test how the map visualization handles more "wild" data.

## Setup and Configuration

First, let's import the required libraries and set up the database connection.

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime, timezone
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Add the services directory to the path so we can import ETL modules
services_path = Path('../services').resolve()
sys.path.append(str(services_path))

print(f"Services path: {services_path}")
print(f"Current working directory: {os.getcwd()}")

## Database Connection Setup

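Set up the database connection. You'll need to provide your connection string through the `DATABASE_URL` environment variable; the fallback value in the code is only a placeholder and will not connect.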
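
Before connecting, it can help to confirm that `DATABASE_URL` is well formed without printing the password. The helper below is a minimal sketch using only the standard library; the name `describe_database_url` and the example credentials are illustrative and not part of the project.

```python
from urllib.parse import urlsplit

def describe_database_url(url):
    """Return a printable summary of a PostgreSQL URL with the password masked."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"Unexpected scheme: {parts.scheme!r}")
    if not parts.hostname or not parts.path.strip("/"):
        raise ValueError("DATABASE_URL must include a host and a database name")
    user = parts.username or "<no user>"
    # Accessing .port raises ValueError when the port is not a number
    port = parts.port or 5432  # 5432 is PostgreSQL's default port
    return f"{parts.scheme}://{user}:***@{parts.hostname}:{port}{parts.path}"

# Example with made-up credentials
print(describe_database_url("postgresql://etl_user:s3cret@db.example.com:5432/weather"))
```

Passing the unreplaced placeholder (`...@host:port/database`) raises `ValueError`, because `port` is not a number, so a forgotten environment variable shows up before any connection attempt.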

In [None]:
# Database configuration
# Replace with your actual database URL
DATABASE_URL = os.getenv('DATABASE_URL', 'postgresql://username:password@host:port/database')

def get_db_connection():
    """Create a database connection"""
    return psycopg2.connect(DATABASE_URL)

# Test connection
try:
    with get_db_connection() as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            print("✅ Database connection successful")
except Exception as e:
    print(f"❌ Database connection failed: {e}")
    print("Please set the DATABASE_URL environment variable")

## Analyze Historical Data (2020-2024)

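Let's find the hourly slots with the highest mean precipitation in the historical data. Because the query keeps only readings with `value_mm > 0`, the mean is taken over readings that recorded rain, not over all sensors. A slot qualifies only if it has at least 5 such readings and a mean above 1.0 mm.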
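
To make the aggregation concrete, here is a minimal pure-Python sketch of the same logic on in-memory data. The function name `top_hourly_means` and the sample readings are illustrative, not part of the project; the default thresholds mirror the `HAVING` clause of the query in the next cell.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def top_hourly_means(readings, min_count=5, min_mean=1.0, limit=10):
    """Rank hourly slots by mean precipitation.

    readings: iterable of (datetime, value_mm) pairs.
    Returns a list of (hour_slot, reading_count, mean_mm, max_mm) tuples.
    """
    buckets = defaultdict(list)
    for ts, value_mm in readings:
        if value_mm > 0:  # same role as WHERE value_mm > 0
            hour_slot = ts.replace(minute=0, second=0, microsecond=0)
            buckets[hour_slot].append(value_mm)

    rows = []
    for hour_slot, values in buckets.items():
        avg = mean(values)
        # Same role as HAVING COUNT(*) >= 5 AND AVG(value_mm) > 1.0
        if len(values) >= min_count and avg > min_mean:
            rows.append((hour_slot, len(values), avg, max(values)))

    rows.sort(key=lambda row: row[2], reverse=True)  # ORDER BY mean DESC
    return rows[:limit]

# Made-up readings: a wet hour and a drizzle hour
readings = [(datetime(2022, 5, 1, 10, m), v) for m, v in
            [(0, 2.0), (10, 4.0), (20, 6.0), (30, 8.0), (40, 10.0), (50, 0.0)]]
readings += [(datetime(2022, 5, 1, 11, m), 0.5) for m in range(5)]

for row in top_hourly_means(readings):
    print(row)
```

Note that the zero reading in the first hour is dropped before averaging, which is why the mean describes wet readings only.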

In [None]:
def find_extreme_weather_timestamps(limit=10):
    """Find timestamps with highest mean precipitation values from 2020-2024"""
    
    query = """
    SELECT 
        DATE_TRUNC('hour', ts) as hour_slot,
        COUNT(*) as sensor_count,
        AVG(value_mm) as mean_precipitation,
        MAX(value_mm) as max_precipitation,
        STDDEV(value_mm) as std_precipitation,
        SUM(value_mm) as total_precipitation
    FROM clean_measurements 
    WHERE ts >= '2020-01-01' 
      AND ts < '2025-01-01'
      AND value_mm > 0  -- Only consider readings with actual precipitation
    GROUP BY hour_slot
    HAVING COUNT(*) >= 5  -- At least 5 readings with precipitation in the hour
       AND AVG(value_mm) > 1.0  -- Significant precipitation events (mean above 1 mm)
    ORDER BY mean_precipitation DESC
    LIMIT %s;
    """
    
    with get_db_connection() as conn:
        df = pd.read_sql_query(query, conn, params=[limit])
    
    return df

# Find the top 10 extreme weather events
extreme_events = find_extreme_weather_timestamps(10)
print(f"Found {len(extreme_events)} extreme weather events:")
print(extreme_events)

## Visualize the Extreme Events

Let's visualize these extreme precipitation events.

In [None]:
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Mean precipitation by event rank
axes[0, 0].bar(range(len(extreme_events)), extreme_events['mean_precipitation'])
axes[0, 0].set_title('Mean Precipitation by Event Rank')
axes[0, 0].set_xlabel('Event Rank (1 = highest)')
axes[0, 0].set_ylabel('Mean Precipitation (mm)')
axes[0, 0].set_xticks(range(len(extreme_events)))
axes[0, 0].set_xticklabels([f"{i+1}" for i in range(len(extreme_events))])

# 2. Max vs Mean precipitation
axes[0, 1].scatter(extreme_events['mean_precipitation'], extreme_events['max_precipitation'], 
                   s=extreme_events['sensor_count']*10, alpha=0.6)
axes[0, 1].set_title('Max vs Mean Precipitation\n(bubble size = sensor count)')
axes[0, 1].set_xlabel('Mean Precipitation (mm)')
axes[0, 1].set_ylabel('Max Precipitation (mm)')

# 3. Sensor count distribution
axes[1, 0].bar(range(len(extreme_events)), extreme_events['sensor_count'])
axes[1, 0].set_title('Number of Sensors Reporting')
axes[1, 0].set_xlabel('Event Rank')
axes[1, 0].set_ylabel('Sensor Count')
axes[1, 0].set_xticks(range(len(extreme_events)))
axes[1, 0].set_xticklabels([f"{i+1}" for i in range(len(extreme_events))])

# 4. Timeline of events
extreme_events['hour_slot'] = pd.to_datetime(extreme_events['hour_slot'])
axes[1, 1].scatter(extreme_events['hour_slot'], extreme_events['mean_precipitation'], 
                   s=100, alpha=0.7)
axes[1, 1].set_title('Timeline of Extreme Events')
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Mean Precipitation (mm)')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Display detailed information about the top 5 events
print("\n🌧️ TOP 5 EXTREME WEATHER EVENTS:")
print("=" * 80)
for i, row in extreme_events.head(5).iterrows():
    print(f"Rank {i+1}: {row['hour_slot']}")
    print(f"  Mean: {row['mean_precipitation']:.2f}mm | Max: {row['max_precipitation']:.2f}mm")
    print(f"  Sensors: {row['sensor_count']} | Total: {row['total_precipitation']:.2f}mm")
    print()

## Check Existing Grid Runs

Let's check if grids already exist for these timestamps.
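
The comparison in the next cell only works if both sets of timestamps use the same time zone convention, and the `'%Y-%m-%d %H:00:00+00'` format labels values as UTC without converting them. The sketch below shows one way to normalize before comparing. The helper names are hypothetical, and it assumes that timezone-naive values are already in UTC.

```python
from datetime import datetime, timedelta, timezone

def to_utc_hour(ts):
    """Normalize a datetime to a timezone-aware UTC hour slot."""
    if ts.tzinfo is None:
        # Assumption: naive timestamps are already expressed in UTC
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)

def missing_slots(wanted, existing):
    """Return the wanted timestamps whose UTC hour slot is not in existing."""
    existing_keys = {to_utc_hour(ts) for ts in existing}
    return [ts for ts in wanted if to_utc_hour(ts) not in existing_keys]

# Made-up example: one existing run stored with a -05:00 offset
wanted = [datetime(2022, 5, 1, 14, 0), datetime(2022, 5, 1, 15, 0)]
existing = [datetime(2022, 5, 1, 9, 30, tzinfo=timezone(timedelta(hours=-5)))]
print(missing_slots(wanted, existing))
```

Here 09:30 at UTC-5 falls in the 14:00 UTC slot, so only the 15:00 slot is reported as missing.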

In [None]:
def check_existing_grids(timestamps):
    """Check which timestamps already have grid runs"""
    
    # Convert timestamps to string format for SQL query
    timestamp_list = [ts.strftime('%Y-%m-%d %H:00:00+00') for ts in timestamps]
    
    query = """
    SELECT ts, status, blob_url_json, blob_url_contours, message
    FROM grid_runs 
    WHERE ts = ANY(%s::timestamptz[])  -- cast the string list; assumes ts is a timestamptz column
    ORDER BY ts DESC;
    """
    
    with get_db_connection() as conn:
        df = pd.read_sql_query(query, conn, params=[timestamp_list])
    
    return df

# Check existing grids
existing_grids = check_existing_grids(extreme_events['hour_slot'])
print(f"Existing grid runs for extreme events: {len(existing_grids)}")
if len(existing_grids) > 0:
    print(existing_grids[['ts', 'status', 'message']].head())
else:
    print("No existing grids found for these timestamps.")

# Identify timestamps that need grid generation
if len(existing_grids) > 0:
    existing_timestamps = set(existing_grids['ts'])
    needed_timestamps = [ts for ts in extreme_events['hour_slot'] 
                        if ts not in existing_timestamps]
else:
    needed_timestamps = list(extreme_events['hour_slot'])

print(f"\nTimestamps needing grid generation: {len(needed_timestamps)}")
for ts in needed_timestamps[:5]:  # Show first 5
    print(f"  - {ts}")

## Generate Grid Runs for Extreme Events

Now let's create grid run entries for the extreme weather timestamps.
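
The insert in the next cell relies on `ON CONFLICT (ts, res_m) DO NOTHING` so that rerunning the notebook does not create duplicates. The sketch below illustrates that pattern with an in-memory SQLite database as a stand-in for PostgreSQL. The trimmed-down table, the helper name `insert_pending_run`, and the use of `rowcount` in place of `RETURNING` are assumptions made for the example; it needs SQLite 3.24 or newer for the `ON CONFLICT` clause.

```python
import sqlite3

def insert_pending_run(conn, ts, res_m):
    """Insert a pending run unless (ts, res_m) already exists.

    Returns True when a row was added, False when the insert was skipped.
    """
    cur = conn.execute(
        "INSERT INTO grid_runs (ts, res_m, status) VALUES (?, ?, 'pending') "
        "ON CONFLICT (ts, res_m) DO NOTHING",
        (ts, res_m),
    )
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grid_runs ("
    " id INTEGER PRIMARY KEY,"
    " ts TEXT NOT NULL,"
    " res_m INTEGER NOT NULL,"
    " status TEXT NOT NULL,"
    " UNIQUE (ts, res_m))"  # the conflict target must match a unique constraint
)

first = insert_pending_run(conn, "2022-05-01 14:00:00+00", 500)
second = insert_pending_run(conn, "2022-05-01 14:00:00+00", 500)   # same key, skipped
other_res = insert_pending_run(conn, "2022-05-01 14:00:00+00", 1000)  # new resolution
conn.commit()

total_rows = conn.execute("SELECT COUNT(*) FROM grid_runs").fetchone()[0]
print(first, second, other_res, total_rows)
```

The same timestamp at a different resolution is a separate run, because the uniqueness key is the pair `(ts, res_m)`.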

In [None]:
import json  # serializes the bbox; imported before the function that uses it

def create_grid_run_entries(timestamps, resolution_m=500):
    """Create pending grid run entries for the specified timestamps"""
    
    # Default bounding box for Antioquia region
    bbox = {
        "west": -77.2,
        "south": 4.8, 
        "east": -73.5,
        "north": 8.8
    }
    
    insert_query = """
    INSERT INTO grid_runs (ts, res_m, bbox, status, message)
    VALUES (%s, %s, %s, 'pending', 'Created for extreme weather analysis')
    ON CONFLICT (ts, res_m) DO NOTHING
    RETURNING id, ts;
    """
    
    created_count = 0
    
    with get_db_connection() as conn:
        with conn.cursor() as cur:
            for ts in timestamps:
                try:
                    cur.execute(insert_query, (
                        ts.strftime('%Y-%m-%d %H:00:00+00'),
                        resolution_m,
                        json.dumps(bbox)
                    ))
                    # RETURNING yields no row when ON CONFLICT skipped the insert
                    result = cur.fetchone()
                    conn.commit()  # commit per row so a later failure cannot undo it
                    if result:
                        created_count += 1
                        print(f"✅ Created grid run {result[0]} for {result[1]}")
                    else:
                        print(f"⚠️  Grid run already exists for {ts}")
                except Exception as e:
                    conn.rollback()  # clear the aborted transaction before the next insert
                    print(f"❌ Error creating grid run for {ts}: {e}")
    
    return created_count

# Create grid run entries for timestamps that need them
if needed_timestamps:
    print(f"Creating grid run entries for {len(needed_timestamps)} timestamps...")
    created = create_grid_run_entries(needed_timestamps[:10])  # Limit to top 10
    print(f"\n✅ Created {created} new grid run entries")
else:
    print("All extreme weather timestamps already have grid runs")

## Run the ETL Service

Now let's run the ETL service to generate the actual grids and contours.
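
The cells below configure the run through environment variables, which are always strings. A value such as `'false'` is still truthy in Python, so flags like `DRY_RUN` need explicit parsing. How the ETL itself reads these variables is not shown in this notebook; the helpers below (`env_flag`, `missing_env`) are a hypothetical sketch of one common approach.

```python
import os

# Assumption: these spellings count as "enabled"; the ETL may accept others
TRUE_VALUES = {"1", "true", "yes", "on"}

def env_flag(name, default=False):
    """Read a boolean flag from the environment, ignoring case and whitespace."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in TRUE_VALUES

def missing_env(names):
    """Return the variable names that are unset or empty."""
    return [name for name in names if not os.getenv(name)]

print("Dry run:", env_flag("DRY_RUN"))
print("Missing:", missing_env(["DATABASE_URL", "VERCEL_BLOB_RW_TOKEN", "VERCEL_BLOB_BASE_URL"]))
```

Checking the parsed flag before calling the ETL makes it harder to upload to blob storage by accident.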

In [None]:
# Import ETL modules
try:
    from etl.main import run as run_etl
    from etl.config import load as load_config
    print("✅ ETL modules imported successfully")
except ImportError as e:
    print(f"❌ Failed to import ETL modules: {e}")
    print("Check that the services directory is on sys.path and its dependencies are installed.")
    print("The ETL also needs these environment variables:")
    print("- DATABASE_URL")
    print("- VERCEL_BLOB_RW_TOKEN")
    print("- VERCEL_BLOB_BASE_URL")

In [None]:
# Set ETL options for this run (the required variables checked below must already be set in your environment)
os.environ['GRID_INTERVAL_MIN'] = '60'
os.environ['GRID_RESOLUTION_M'] = '500'
os.environ['ETL_MAX_SLOTS'] = '10'  # Process up to 10 slots
os.environ['DRY_RUN'] = 'false'  # Set to 'true' for testing without uploads

# Check required environment variables
required_vars = ['DATABASE_URL', 'VERCEL_BLOB_RW_TOKEN', 'VERCEL_BLOB_BASE_URL']
missing_vars = [var for var in required_vars if not os.getenv(var)]

if missing_vars:
    print(f"❌ Missing required environment variables: {missing_vars}")
    print("Please set these before running the ETL:")
    for var in missing_vars:
        print(f"export {var}=your_value_here")
else:
    print("✅ All required environment variables are set")
    print("Ready to run ETL service")

In [None]:
# Run the ETL service
# WARNING: With DRY_RUN set to 'false', this processes pending grid runs and uploads to Vercel Blob storage

if not missing_vars:
    print("🚀 Starting ETL service for extreme weather data...")
    print("This may take several minutes depending on the amount of data.")
    
    try:
        run_etl()
        print("✅ ETL service completed successfully")
    except Exception as e:
        print(f"❌ ETL service failed: {e}")
        import traceback
        traceback.print_exc()
else:
    print("⚠️  Skipping ETL run due to missing environment variables")
    print("Set DRY_RUN=true if you want to test without uploads")

## Check Results

Let's check the results of the grid generation.

In [None]:
def check_grid_results():
    """Check the status of grid runs for extreme weather events"""
    
    # Get the extreme event timestamps again
    timestamp_list = [ts.strftime('%Y-%m-%d %H:00:00+00') for ts in extreme_events['hour_slot']]
    
    query = """
    SELECT 
        ts,
        status,
        CASE 
            WHEN blob_url_json IS NOT NULL THEN 'Yes'
            ELSE 'No'
        END as has_grid_data,
        CASE 
            WHEN blob_url_contours IS NOT NULL THEN 'Yes'
            ELSE 'No'
        END as has_contours,
        message,
        updated_at
    FROM grid_runs 
    WHERE ts = ANY(%s::timestamptz[])  -- cast the string list; assumes ts is a timestamptz column
    ORDER BY ts DESC;
    """
    
    with get_db_connection() as conn:
        df = pd.read_sql_query(query, conn, params=[timestamp_list])
    
    return df

# Check results
results = check_grid_results()
print(f"Grid run results for {len(results)} extreme weather events:")
print("=" * 80)
print(results[['ts', 'status', 'has_grid_data', 'has_contours', 'message']].to_string())

# Summary statistics
status_counts = results['status'].value_counts()
print(f"\n📊 SUMMARY:")
print(f"✅ Done: {status_counts.get('done', 0)}")
print(f"⏳ Pending: {status_counts.get('pending', 0)}")
print(f"❌ Failed: {status_counts.get('failed', 0)}")

successful_grids = results[results['status'] == 'done']
if len(successful_grids) > 0:
    print(f"\n🎉 Successfully generated {len(successful_grids)} grids for extreme weather events!")
    print("\nThese timestamps are now available in your Flutter app for testing:")
    for _, row in successful_grids.iterrows():
        print(f"  - {row['ts']} (grid data: {row['has_grid_data']}, contours: {row['has_contours']})")
else:
    print("\n⚠️  No grids were successfully generated yet.")

## Test the Map Visualization

This section describes how to test the extreme weather data in your Flutter app.

### 🚀 Testing Instructions

Now that we've generated grids for extreme weather events, you can test them in your Flutter app:

1. **Start your Flutter app** and open the precipitation viewer
2. **Use the timeline slider** to navigate to the extreme weather timestamps we generated
3. **Observe how the map handles** the higher precipitation values:
   - Check if the color scale properly represents the extreme values
   - Verify that contours display correctly for high precipitation areas
   - Test the performance with more complex contour data

### 🔍 What to Look For:

- **Color saturation**: Do the highest values show distinct colors?
- **Contour density**: Are there more contour lines in areas of high precipitation?
- **Performance**: Does the app remain responsive with complex grid data?
- **Scale accuracy**: Do the legend values match the displayed data?

### 🛠️ Potential Adjustments:

Based on your testing, you might need to:
- Adjust the color scale thresholds in `app_constants.dart`
- Modify contour generation parameters
- Optimize rendering for complex datasets
- Update the legend to better represent extreme values
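
To reason about color saturation before touching the app, a threshold-based color scale can be sketched as below. The thresholds and hex colors are made up for illustration; the real values live in `app_constants.dart`, and the app's actual mapping may differ.

```python
from bisect import bisect_right

# Hypothetical bin edges in mm and one color per bin (len(COLORS) == len(THRESHOLDS_MM) + 1)
THRESHOLDS_MM = [1.0, 5.0, 10.0, 25.0, 50.0]
COLORS = ["#e0f3ff", "#a6d8ff", "#4fa8f5", "#1f6fd1", "#6a2fb5", "#b0005a"]

def color_for(value_mm):
    """Map a precipitation value to a bin color; None means nothing is drawn."""
    if value_mm <= 0:
        return None
    # bisect_right counts how many thresholds are <= value_mm, which is the bin index
    return COLORS[bisect_right(THRESHOLDS_MM, value_mm)]

for value in [0.0, 0.5, 7.0, 60.0, 600.0]:
    print(value, color_for(value))
```

With these bins, 60 mm and 600 mm share the top color. If the extreme events look uniformly saturated on the map, adding higher thresholds is one adjustment to try.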