# CONFLUENCE Tutorial: Continental-Scale Modeling - North America

This notebook demonstrates how to set up a continental-scale SUMMA model for North America. We'll move quickly through the workflow, focusing on the scale differences from previous tutorials.

## Key Points
- **Scale**: From country (Iceland) to continent (North America)
- **Computational considerations**: At ~24.7 million km², the domain is roughly 240 times the area of Iceland, so every step handles far more data
- **Simplified approach**: Quick setup for demonstration
- **High-performance computing**: Required for continental modeling

## 1. Quick Setup

In [1]:
# Import required libraries
import sys
from pathlib import Path
import yaml
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import numpy as np
from shapely.geometry import box
import contextily as cx
from datetime import datetime
import xarray as xr

# Add CONFLUENCE to path
confluence_path = Path('../').resolve()
sys.path.append(str(confluence_path))

# Import CONFLUENCE
from CONFLUENCE import CONFLUENCE

plt.style.use('default')
%matplotlib inline

print(f"Working from: {confluence_path}")

## 2. Initialize CONFLUENCE for Continental Domain
We'll configure CONFLUENCE for a continental-scale domain with appropriate settings for North America.
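
The `BOUNDING_BOX_COORDS` value set in the next cell is a single `North/West/South/East` string, so a swapped pair of values is easy to miss. A minimal sketch of a sanity check is shown here. `parse_bounding_box` is a hypothetical helper written for this tutorial, not part of CONFLUENCE, and it assumes the box does not cross the antimeridian.

```python
def parse_bounding_box(coords):
    """Parse a 'lat_max/lon_min/lat_min/lon_max' string into a dict of floats."""
    parts = coords.split("/")
    if len(parts) != 4:
        raise ValueError(f"Expected 4 values separated by '/', got {len(parts)}")

    lat_max, lon_min, lat_min, lon_max = map(float, parts)

    # South must lie below north, and both must be valid latitudes
    if not (-90 <= lat_min < lat_max <= 90):
        raise ValueError(f"Invalid latitude range: {lat_min} to {lat_max}")
    # West must lie left of east (assumes no antimeridian crossing)
    if not (-180 <= lon_min < lon_max <= 180):
        raise ValueError(f"Invalid longitude range: {lon_min} to {lon_max}")

    return {"lat_max": lat_max, "lon_min": lon_min,
            "lat_min": lat_min, "lon_max": lon_max}


bbox_check = parse_bounding_box("83.0/-170.0/5.0/-50.0")
print(bbox_check)
```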

In [None]:
# Set directory paths
CONFLUENCE_CODE_DIR = confluence_path
CONFLUENCE_DATA_DIR = Path('/work/comphyd_lab/data/CONFLUENCE_data')  # Change this to your own data directory

# Verify paths exist
if not CONFLUENCE_CODE_DIR.exists():
    raise FileNotFoundError(f"CONFLUENCE code directory not found: {CONFLUENCE_CODE_DIR}")

if not CONFLUENCE_DATA_DIR.exists():
    print(f"Data directory doesn't exist. Creating: {CONFLUENCE_DATA_DIR}")
    CONFLUENCE_DATA_DIR.mkdir(parents=True, exist_ok=True)

# Load template configuration
config_template_path = CONFLUENCE_CODE_DIR / '0_config_files' / 'config_template.yaml'
with open(config_template_path, 'r') as f:
    config_dict = yaml.safe_load(f)

# Update with North America-specific settings
config_dict['CONFLUENCE_CODE_DIR'] = str(CONFLUENCE_CODE_DIR)
config_dict['CONFLUENCE_DATA_DIR'] = str(CONFLUENCE_DATA_DIR)

# Set North America domain and continental-specific settings
config_dict['DOMAIN_NAME'] = "North_America_continent"  
config_dict['EXPERIMENT_ID'] = "continental_run_1"
config_dict['EXPERIMENT_TIME_START'] = "2018-01-01 01:00"
config_dict['EXPERIMENT_TIME_END'] = "2019-12-31 23:00"  # Shorter period for demonstration
config_dict['SPATIAL_MODE'] = "Distributed"

# North America continent regional domain settings
config_dict['BOUNDING_BOX_COORDS'] = "83.0/-170.0/5.0/-50.0"  # North/West/South/East
config_dict['POUR_POINT_COORDS'] = "n/a"  # Not needed for continental domain
config_dict['DELINEATE_BY_POURPOINT'] = False
config_dict['DELINEATE_COASTAL_WATERSHEDS'] = True
config_dict['DOMAIN_DEFINITION_METHOD'] = "delineate_geofabric" 
config_dict['STREAM_THRESHOLD'] = 7500  # Larger threshold for continental scale
config_dict['DOMAIN_DISCRETIZATION'] = "GRUs"
config_dict['MIN_GRU_SIZE'] = 50  # Larger minimum size for continental domain
config_dict['MPI_PROCESSES'] = 40  # Higher for parallel processing

# Write to temporary config file
temp_config_path = CONFLUENCE_CODE_DIR / '0_config_files' / 'config_north_america.yaml'
with open(temp_config_path, 'w') as f:
    yaml.dump(config_dict, f)

# Initialize CONFLUENCE
confluence = CONFLUENCE(temp_config_path)

# Parse bounding box for visualization
bbox = config_dict['BOUNDING_BOX_COORDS'].split('/')
lat_max, lon_min, lat_min, lon_max = map(float, bbox)

# Display configuration
print("=== North America Continental Configuration ===")
print(f"Domain Name: {confluence.config['DOMAIN_NAME']}")
print(f"Bounding Box: {confluence.config['BOUNDING_BOX_COORDS']}")
print(f"Delineate by Pour Point: {confluence.config['DELINEATE_BY_POURPOINT']} (Full continent!)")
print(f"Include Coastal Watersheds: {confluence.config['DELINEATE_COASTAL_WATERSHEDS']}")
print(f"Stream Threshold: {confluence.config['STREAM_THRESHOLD']} (larger for continental scale)")
print(f"Min GRU Size: {confluence.config['MIN_GRU_SIZE']} km²")
print(f"MPI Processes: {confluence.config['MPI_PROCESSES']} (high for parallel processing)")

# Display geographic extent
print(f"\nGeographic extent:")
print(f"  North: {lat_max}°N (Arctic)")
print(f"  South: {lat_min}°N (Central America)")
print(f"  West: {lon_min}°E (Pacific)")
print(f"  East: {lon_max}°E (Atlantic)")
print(f"  Extent: {abs(lat_max - lat_min):.1f}° latitude × {abs(lon_max - lon_min):.1f}° longitude")

## 3. Visualize Continental Domain
Let's visualize the North America domain to understand the scale we're working with.
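
Before plotting, it can help to put a number on the bounding box itself. The sketch below estimates the area of the latitude/longitude box on a spherical Earth (an approximation; the 6371 km mean radius and the helper name `latlon_box_area_km2` are choices made here, not CONFLUENCE settings). Under these assumptions the box covers about 77 million km², roughly three times the ~24.7 million km² continental area, because the rectangle also encloses large stretches of ocean.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def latlon_box_area_km2(lat_min, lat_max, lon_min, lon_max):
    """Area of a latitude/longitude box on a sphere, in km²."""
    dlon = math.radians(lon_max - lon_min)
    # Area of a spherical band scales with the difference in sin(latitude)
    band = math.sin(math.radians(lat_max)) - math.sin(math.radians(lat_min))
    return EARTH_RADIUS_KM ** 2 * dlon * band


box_area = latlon_box_area_km2(5.0, 83.0, -170.0, -50.0)
print(f"Bounding box area: {box_area / 1e6:.1f} million km²")
```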

In [None]:
# Create a visualization of North America's extent
fig, ax = plt.subplots(figsize=(14, 10))

# Create bounding box
na_box = box(lon_min, lat_min, lon_max, lat_max)
na_bbox = gpd.GeoDataFrame({'geometry': [na_box]}, crs='EPSG:4326')

# Plot bounding box
na_bbox.boundary.plot(ax=ax, color='red', linewidth=2, label='Study Area')

# Add some context
ax.set_xlim(lon_min - 5, lon_max + 5)
ax.set_ylim(lat_min - 5, lat_max + 5)
ax.set_xlabel('Longitude', fontsize=12)
ax.set_ylabel('Latitude', fontsize=12)
ax.set_title('North America Continental Domain', fontsize=16, fontweight='bold')
ax.grid(True, alpha=0.3)

# Add annotation
ax.text(0.5, 0.95, 'Continental-scale domain with coastal watersheds',
        ha='center', va='top', transform=ax.transAxes,
        bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7),
        fontsize=12)

plt.tight_layout()
plt.show()

# Create scale comparison visualization
fig, ax = plt.subplots(figsize=(12, 8))

# Define area scales for comparison
scales = {
    'Bow River at Banff': 2_210,
    'Iceland': 103_000,
    'North America': 24_700_000
}

# Create logarithmic scale bar chart
ax.bar(scales.keys(), np.log10(list(scales.values())), color=['lightblue', 'lightgreen', 'coral'])
ax.set_ylabel('Area (log₁₀ km²)', fontsize=12)
ax.set_title('Scale Comparison: Watershed to Continent', fontsize=14, fontweight='bold')

# Add area labels
for i, (domain, area) in enumerate(scales.items()):
    ax.text(i, np.log10(area) + 0.1, f'{area:,} km²', ha='center', fontsize=10)

ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

## 4. Project Setup - Continental Scale
Let's set up our continental-scale project structure. This will be similar to the regional domain but will need to handle much larger data volumes.
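
Since storage is the first constraint most users hit, a quick free-space check before creating the project can save a failed run. This is a minimal sketch using the standard library. The 1 TB requirement and the `"."` placeholder path are assumptions for illustration; in practice you would pass your own data directory.

```python
import shutil


def free_space_gb(path):
    """Return free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path, required_gb):
    """True if the filesystem holding `path` has at least `required_gb` free."""
    return free_space_gb(path) >= required_gb


REQUIRED_GB = 1000  # assumption: ~1 TB for continental inputs and outputs
data_root = "."     # placeholder; replace with your CONFLUENCE data directory

if has_enough_space(data_root, REQUIRED_GB):
    print(f"OK: {free_space_gb(data_root):.0f} GB free")
else:
    print(f"Warning: only {free_space_gb(data_root):.0f} GB free, "
          f"{REQUIRED_GB} GB suggested")
```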

In [None]:
# Step 1: Project Initialization
print("=== Step 1: Project Initialization ===")
print("Setting up continental project...")

# Setup project
project_dir = confluence.managers['project'].setup_project()

# Note: We skip pour point creation for continental domains
print("Pour point creation skipped (continental domain)")

# List created directories
print("\nCreated directories:")
for item in sorted(project_dir.iterdir()):
    if item.is_dir():
        print(f"  📁 {item.name}")

print("\nDirectory purposes:")
print("  📁 shapefiles: Domain geometry (thousands of watersheds, river networks)")
print("  📁 attributes: Static characteristics (elevation, soil, land cover)")
print("  📁 forcing: Meteorological inputs (precipitation, temperature)")
print("  📁 simulations: Model outputs")
print("  📁 evaluation: Performance metrics and comparisons")
print("  📁 plots: Visualizations")

print("\n⚠️ Note: Continental scale will require much more disk space!")
print("   Expected total storage requirement: 100s of GB to multiple TB")

## 5. Geospatial Domain Definition - Continental Data Acquisition
We need to acquire continental-scale geospatial data, which will be much larger than for regional domains.

In [None]:
# Step 2: Geospatial Domain Definition and Analysis
print("=== Step 2: Geospatial Domain Definition and Analysis ===")

# Acquire attributes
print("Acquiring continental-scale attributes (DEM, soil, land cover)...")
print("Note: This downloads LARGE datasets")
print("Expected data size: Several GB")
confluence.managers['data'].acquire_attributes()
print("✓ Continental attributes acquired\n")

## 6. Continental Domain Delineation
Now we'll delineate the entire continent, creating thousands of watersheds. This is computationally intensive.

In [None]:
# Define continental domain
print("Delineating continental domain...")
print(f"Method: {confluence.config['DOMAIN_DEFINITION_METHOD']}")
print(f"Stream threshold: {confluence.config['STREAM_THRESHOLD']} (high for continent)")
print(f"MPI processes: {confluence.config['MPI_PROCESSES']} (parallel processing)")
print("This creates thousands of watersheds across North America...")
print("⚠️ Warning: This step may take several hours on high-performance computing resources")

watershed_path = confluence.managers['domain'].define_domain()

# Check results
basin_path = project_dir / 'shapefiles' / 'river_basins'
network_path = project_dir / 'shapefiles' / 'river_network'

basin_count = 0
if basin_path.exists():
    basin_files = list(basin_path.glob('*.shp'))
    if basin_files:
        basins = gpd.read_file(basin_files[0])
        basin_count = len(basins)
        print(f"\n✓ Created {basin_count} watersheds across North America")
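        # Measure area in an equal-area projection: areas computed in a geographic
        # CRS (degrees) are not in m². EPSG:6933 is one global equal-area option.
        if basins.crs is not None and basins.crs.is_geographic:
            basins_area = basins.to_crs('EPSG:6933')
        else:
            basins_area = basins
        print(f"Total area: {basins_area.geometry.area.sum() / 1e6:.0f} km²")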

network_count = 0
if network_path.exists():
    network_files = list(network_path.glob('*.shp'))
    if network_files:
        rivers = gpd.read_file(network_files[0])
        network_count = len(rivers)
        print(f"✓ Created river network with {network_count} segments")

## 7. Continental Watershed Discretization
Now we need to discretize our continental domain into GRUs and HRUs, which will create tens of thousands of computational units.
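
Because loading every HRU geometry can be slow, it can help to count the features first. One option that needs no geospatial library is to read the shapefile's `.shx` index, which in the ESRI shapefile format has a 100-byte header followed by one 8-byte entry per record. The sketch below assumes the `.shx` file sits next to the `.shp`; `count_shapefile_records` is a hypothetical helper, not a CONFLUENCE or GeoPandas function.

```python
import struct
from pathlib import Path


def count_shapefile_records(shp_path):
    """Count records in a shapefile by reading only its .shx index header."""
    shx_path = Path(shp_path).with_suffix(".shx")
    with open(shx_path, "rb") as f:
        header = f.read(100)
    if len(header) < 100:
        raise ValueError(f"{shx_path} is too short to be a valid index file")

    # Bytes 0-3: file code (always 9994), big-endian
    file_code = struct.unpack(">i", header[0:4])[0]
    if file_code != 9994:
        raise ValueError(f"{shx_path} does not look like a shapefile index")

    # Bytes 24-27: total file length in 16-bit words, big-endian
    length_bytes = struct.unpack(">i", header[24:28])[0] * 2

    # Everything after the 100-byte header is 8 bytes per record
    return (length_bytes - 100) // 8
```

With the files produced by the discretization step, this could be called as `count_shapefile_records(hru_files[0])` to get an exact count instead of an estimate from a sample.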

In [None]:
# Discretize continental domain
print(f"Creating continental HRUs using method: {confluence.config['DOMAIN_DISCRETIZATION']}")
print("⚠️ Warning: This step may take many hours and require significant memory")
print(f"Minimum GRU size: {confluence.config['MIN_GRU_SIZE']} km² (larger than regional domain)")

hru_path = confluence.managers['domain'].discretize_domain()

# Check results
hru_path = project_dir / 'shapefiles' / 'catchment'
if hru_path.exists():
    hru_files = list(hru_path.glob('*.shp'))
    if hru_files:
        # Note: For continental scale, we might not want to load all HRUs at once
        print("\n⚠️ Continental HRU file may be very large. Loading sample statistics instead.")
        
        # Get basic file stats without loading entire shapefile
        hru_file_size = hru_files[0].stat().st_size / (1024**2)  # Size in MB
        print(f"HRU shapefile size: {hru_file_size:.1f} MB")
        
        # Option to load a small sample of HRUs for statistics
        print("Loading small sample of HRUs for statistics...")
        sample_size = 1000  # Read at most the first 1000 features, independent of basin_count
        hru_sample = gpd.read_file(hru_files[0], rows=slice(0, sample_size))
        
        print(f"Sample contains {len(hru_sample)} HRUs")
        sample_grus = hru_sample['GRU_ID'].nunique()
        print(f"GRUs in sample: {sample_grus}")
        
        # Estimate total counts
        if basin_count > 0 and sample_grus > 0:
            est_hru_total = len(hru_sample) * (basin_count / sample_grus)
            print(f"Estimated total HRUs: ~{est_hru_total:.0f} (based on sample)")
        
        print("\nContinental statistics:")
        print(f"  - Total watersheds (GRUs): ~{basin_count}")
        print(f"  - Computational units (HRUs): Tens to hundreds of thousands")
        print(f"  - Domain extent: {abs(lat_max - lat_min):.1f}° latitude × {abs(lon_max - lon_min):.1f}° longitude")

## 8. Model Agnostic Processing - Continental Forcing Data
For continental domains, forcing data acquisition and processing is particularly data-intensive.
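
A rough size estimate makes the storage warning concrete. The sketch below multiplies grid cells by time steps, variables, and bytes per value. The 0.25° grid, hourly time step, seven forcing variables, and 4-byte floats are assumptions made for illustration (they are typical for ERA5-style forcing but are not read from the CONFLUENCE configuration), and the result ignores compression. Under these assumptions the raw volume for the tutorial period is on the order of 70 GB.

```python
from datetime import datetime


def count_hourly_steps(start, end, fmt="%Y-%m-%d %H:%M"):
    """Number of hourly time steps from start to end, both inclusive."""
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    return int((t1 - t0).total_seconds() // 3600) + 1


def count_grid_cells(lat_min, lat_max, lon_min, lon_max, resolution_deg):
    """Number of grid points covering the box, including both edges."""
    n_lat = round((lat_max - lat_min) / resolution_deg) + 1
    n_lon = round((lon_max - lon_min) / resolution_deg) + 1
    return n_lat * n_lon


RESOLUTION_DEG = 0.25  # assumption: ERA5-like regular grid
N_VARIABLES = 7        # assumption: number of forcing variables
BYTES_PER_VALUE = 4    # assumption: 32-bit floats

n_steps = count_hourly_steps("2018-01-01 01:00", "2019-12-31 23:00")
n_cells = count_grid_cells(5.0, 83.0, -170.0, -50.0, RESOLUTION_DEG)
raw_gb = n_steps * n_cells * N_VARIABLES * BYTES_PER_VALUE / 1e9

print(f"Time steps: {n_steps:,}")
print(f"Grid cells: {n_cells:,}")
print(f"Uncompressed forcing volume: ~{raw_gb:.0f} GB")
```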

In [None]:
# Step 3: Model Agnostic Data Pre-Processing
print("=== Step 3: Model Agnostic Data Pre-Processing ===")

# Process observed data
print("Processing observed streamflow data...")
print("Note: For continental modeling, we often use a subset of gauged watersheds for evaluation")
confluence.managers['data'].process_observed_data()

# Acquire continental-scale forcings
print(f"\nAcquiring continental forcing data: {confluence.config['FORCING_DATASET']}")
print("Expected data size: 10s to 100s of GB")
print(f"Period: {confluence.config['EXPERIMENT_TIME_START']} to {confluence.config['EXPERIMENT_TIME_END']}")
print("⚠️ Warning: This step will take several hours and require significant storage")
confluence.managers['data'].acquire_forcings()

# Run model-agnostic preprocessing
print("\nRunning continental-scale model-agnostic preprocessing...")
print("This step remaps climate data to tens of thousands of HRUs")
print("⚠️ Warning: High memory requirements (10s of GB)")
confluence.managers['data'].run_model_agnostic_preprocessing()

## 9. Model-Specific Preprocessing for Continental Domain
Preparing model input files for continental-scale modeling presents unique challenges.

In [None]:
# Step 4: Model Specific Processing and Initialization
print("=== Step 4: Model Specific Processing and Initialization ===")

# Preprocess models
print(f"Preparing continental-scale {confluence.config['HYDROLOGICAL_MODEL']} input files...")
print(f"Model: {confluence.config['HYDROLOGICAL_MODEL']}")
print(f"Routing: {confluence.config['ROUTING_MODEL']}")
print("⚠️ Warning: This will create very large input files")
print("Expected file sizes: Several GB per input file")
confluence.managers['model'].preprocess_models()

print("\n=== Continental Model Configuration Complete ===")
print(f"Model: {confluence.config['HYDROLOGICAL_MODEL']}")
print(f"Domain: {confluence.config['DOMAIN_NAME']}")
print(f"Number of GRUs: ~{basin_count}")
print("Number of HRUs: Tens to hundreds of thousands")
print("\nModel is now ready for execution with high-performance computing resources")

## 10. Continental Model Running Considerations
For demonstration, we'll discuss running a continental model without actually running it, as it would require significant computational resources.
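
Whether the continent is split into regions or spread across MPI processes, the work is usually divided into contiguous blocks of GRUs. The sketch below shows one way to compute near-equal blocks. `split_into_chunks` is a hypothetical helper and the GRU count of 25,000 is an illustrative placeholder, not a result from this domain; how CONFLUENCE itself distributes work may differ.

```python
def split_into_chunks(n_items, n_chunks):
    """Split range(n_items) into n_chunks contiguous (start, count) pairs.

    Chunk sizes differ by at most one; earlier chunks take the remainder.
    """
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(n_items, n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        count = base + (1 if i < extra else 0)
        chunks.append((start, count))
        start += count
    return chunks


N_GRUS = 25_000   # hypothetical GRU count for illustration
N_PROCESSES = 40  # matches MPI_PROCESSES in this tutorial's configuration

chunks = split_into_chunks(N_GRUS, N_PROCESSES)
print(f"First chunk (start, count): {chunks[0]}")
print(f"Last chunk (start, count):  {chunks[-1]}")
```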

In [None]:
# Note: We don't actually run the continental model here
print("=== Continental Model Execution Considerations ===")
print("Running a continental-scale model requires significant HPC resources.")
print("\nTo run the model (when adequate resources are available):")
print("  confluence.managers['model'].run_models()")

print("\nTypical resource requirements:")
print("  - Memory: 100+ GB RAM")
print("  - CPU: 40+ cores for parallel processing")
print("  - Storage: 1+ TB for inputs/outputs")
print("  - Runtime: Days to weeks")

print("\nCommon execution strategies:")
print("  1. Break continent into regions and run separately")
print("  2. Use MPI for massive parallelization")
print("  3. Run shorter test periods before full simulation")
print("  4. Use HPC job scheduling for long-running simulations")

print("\nFor this tutorial, we've completed the continental model setup")
print("without running the full model due to computational constraints.")

## 11. Continental-Scale Considerations

### Computational Requirements
- **Memory**: 100+ GB for data processing
- **Storage**: 1+ TB for inputs/outputs
- **CPU**: High parallelization (40+ cores)
- **Runtime**: Days to weeks for full simulations

### Key Configuration Differences
```yaml
STREAM_THRESHOLD: 7500       # Higher for continental scale
MPI_PROCESSES: 40            # More parallel processes
MIN_GRU_SIZE: 50             # Larger minimum size to manage computational load
FORCING_DATASET: ERA5        # Global reanalysis data
```
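
To see exactly which settings differ from a baseline, the two configuration dictionaries can be compared key by key. This is a minimal sketch: `changed_keys` is a hypothetical helper, and the `template` values below are invented placeholders for illustration, not the actual contents of `config_template.yaml`.

```python
def changed_keys(base, updated):
    """Return {key: (old, new)} for keys that are new or changed in `updated`."""
    missing = object()  # sentinel so that a stored None is not mistaken for absent
    diff = {}
    for key, new in updated.items():
        old = base.get(key, missing)
        if old is missing:
            diff[key] = (None, new)
        elif old != new:
            diff[key] = (old, new)
    return diff


# Hypothetical baseline values, for illustration only
template = {"STREAM_THRESHOLD": 5000, "MPI_PROCESSES": 4, "FORCING_DATASET": "ERA5"}
continental = {**template, "STREAM_THRESHOLD": 7500,
               "MPI_PROCESSES": 40, "MIN_GRU_SIZE": 50}

for key, (old, new) in changed_keys(template, continental).items():
    print(f"{key}: {old} -> {new}")
```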

### Challenges at Continental Scale
1. **Data Volume**: Terabytes of input and output data
2. **Heterogeneity**: Diverse climates, terrains, vegetation
3. **Calibration**: How to calibrate thousands of watersheds?
4. **Validation**: Limited observations for many basins
5. **Computation**: Requires HPC resources

### Use Cases
- Climate change impact assessment
- Continental water balance
- Large-scale flood forecasting
- Water resources planning
- Earth system modeling

## 12. Summary
Let's summarize what we've accomplished with our continental-scale setup.

In [None]:
# Final summary
print("=== Continental Model Setup Complete ===\n")
print(f"Domain: {confluence.config['DOMAIN_NAME']}")
print(f"Scale: Continental (~24.7 million km²)")
print(f"Model: {confluence.config['HYDROLOGICAL_MODEL']}")
print(f"Status: Ready for simulation")

print("\nScale progression in tutorials:")
print("  1. Watershed: Bow at Banff (~2,200 km²)")
print("  2. Country: Iceland (~103,000 km²)")
print("  3. Continent: North America (~24,700,000 km²)")

print("\nKey output locations:")
print(f"  - Basin shapefiles: {basin_path}")
print(f"  - River network: {network_path}")
print(f"  - HRU shapefiles: {hru_path}")
print(f"  - Model settings: {project_dir}/settings/{confluence.config['HYDROLOGICAL_MODEL']}/")
print(f"  - Future simulation results: {project_dir}/simulations/{confluence.config['EXPERIMENT_ID']}/")

print("\n🎉 You've successfully scaled from watershed to continent!")