# CONFLUENCE Tutorial: Distributed Basin Workflow with Delineation

This notebook demonstrates the distributed modeling approach using the delineation method. We'll use the same Bow River at Banff pour point as the lumped tutorial, but build a distributed model with multiple GRUs (Grouped Response Units).

## Key Differences from Lumped Model

- **Domain Method**: `delineate` instead of `lumped`
- **Stream Threshold**: 5000 (controls how finely the watershed is split; a higher threshold yields fewer sub-basins)
- **Multiple GRUs**: Each sub-basin becomes a GRU
- **Routing**: mizuRoute connects the GRUs
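
To make the stream threshold concrete, here is a minimal sketch. It assumes the threshold is compared against a flow-accumulation grid (the number of upstream cells draining through each cell), which is the common convention in delineation tools; how CONFLUENCE applies `STREAM_THRESHOLD` internally is not shown in this notebook. The grid values and the helper `count_stream_cells` are invented for illustration.

```python
import numpy as np

# Hypothetical flow-accumulation grid: each value is the number of upstream
# cells draining through that cell (values invented for illustration).
flow_accumulation = np.array([
    [1,    2,    1, 1],
    [3,  900,    4, 2],
    [2, 2500, 1200, 1],
    [1, 6000,    3, 1],
    [1, 9500,    2, 1],
])

def count_stream_cells(accumulation, threshold):
    """Count cells whose upstream contributing area meets the threshold."""
    return int((accumulation >= threshold).sum())

for threshold in (1000, 5000):
    n_cells = count_stream_cells(flow_accumulation, threshold)
    print(f"threshold={threshold}: {n_cells} stream cells")
```

A lower threshold marks more cells as stream, which generally produces a denser network and more sub-basins; a higher threshold does the opposite.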

## Learning Objectives

1. Understand watershed delineation with stream networks
2. Create a distributed model with multiple GRUs

## 1. Setup and Import Libraries

In [None]:
# Import required libraries
import sys
import os
from pathlib import Path
import yaml
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
from datetime import datetime
import numpy as np
import contextily as cx
import xarray as xr
from IPython.display import Image, display

# Add CONFLUENCE to path
confluence_path = Path('../').resolve()
sys.path.append(str(confluence_path))

# Import main CONFLUENCE class
from CONFLUENCE import CONFLUENCE

# Set up plotting style
plt.style.use('default')
%matplotlib inline

## 2. Initialize CONFLUENCE
First, let's set up our directories and load the configuration. We'll modify the configuration from Tutorial 1 to create a distributed model.

In [None]:
# Set directory paths
CONFLUENCE_CODE_DIR = confluence_path
CONFLUENCE_DATA_DIR = Path('/work/comphyd_lab/data/CONFLUENCE_data')  # ← User should modify this path

# Load template configuration
config_path = CONFLUENCE_CODE_DIR / '0_config_files' / 'config_template.yaml'

# Read config file
with open(config_path, 'r') as f:
    config_dict = yaml.safe_load(f)

# Update core paths
config_dict['CONFLUENCE_CODE_DIR'] = str(CONFLUENCE_CODE_DIR)
config_dict['CONFLUENCE_DATA_DIR'] = str(CONFLUENCE_DATA_DIR)

# Modify for distributed delineation
config_dict['DOMAIN_NAME'] = 'Bow_at_Banff_distributed'
config_dict['EXPERIMENT_ID'] = 'distributed_tutorial'
config_dict['DOMAIN_DEFINITION_METHOD'] = 'delineate'  # Changed from 'lumped'
config_dict['STREAM_THRESHOLD'] = 5000  # Higher threshold for fewer sub-basins
config_dict['DOMAIN_DISCRETIZATION'] = 'GRUs'  # Keep as GRUs
config_dict['SPATIAL_MODE'] = 'Distributed'  # Changed from 'Lumped'

# Save updated config to a temporary file
temp_config_path = CONFLUENCE_CODE_DIR / '0_config_files' / 'config_distributed.yaml'
with open(temp_config_path, 'w') as f:
    yaml.dump(config_dict, f)

# Initialize CONFLUENCE
confluence = CONFLUENCE(temp_config_path)

# Display configuration
print("=== Directory Configuration ===")
print(f"Code Directory: {CONFLUENCE_CODE_DIR}")
print(f"Data Directory: {CONFLUENCE_DATA_DIR}")
print("\n=== Key Configuration Settings ===")
print(f"Domain Name: {confluence.config['DOMAIN_NAME']}")
print(f"Pour Point: {confluence.config['POUR_POINT_COORDS']}")
print(f"Domain Method: {confluence.config['DOMAIN_DEFINITION_METHOD']}")
print(f"Stream Threshold: {confluence.config['STREAM_THRESHOLD']}")
print(f"Spatial Mode: {confluence.config['SPATIAL_MODE']}")
print(f"Model: {confluence.config['HYDROLOGICAL_MODEL']}")
print(f"Simulation Period: {confluence.config['EXPERIMENT_TIME_START']} to {confluence.config['EXPERIMENT_TIME_END']}")

## 3. Project Setup - Organizing the Modeling Workflow

First, we'll establish a well-organized project structure, similar to what we did in Tutorial 1.

In [None]:
# Step 1: Project Initialization
print("=== Step 1: Project Initialization ===")

# Setup project
project_dir = confluence.managers['project'].setup_project()

# Create pour point
pour_point_path = confluence.managers['project'].create_pour_point()

# List created directories
print("\nCreated directories:")
for item in sorted(project_dir.iterdir()):
    if item.is_dir():
        print(f"  📁 {item.name}")

print("\nNote: The pour point location is identical to the lumped model.")
print("The difference is in how we subdivide the watershed above this point.")

## 4. Geospatial Domain Definition - Data Acquisition and Preparation

We'll reuse some of the geospatial data from the lumped model tutorial, where appropriate.

In [None]:
# Check if we can reuse data from the lumped model
lumped_dem_path = CONFLUENCE_DATA_DIR / 'domain_Bow_at_Banff_lumped' / 'attributes' / 'elevation' / 'dem'
lumped_forcing_path = CONFLUENCE_DATA_DIR / 'domain_Bow_at_Banff_lumped' / 'forcing' / 'raw_data'
can_reuse = lumped_dem_path.exists()
can_reuse_forcing = lumped_forcing_path.exists()

if can_reuse or can_reuse_forcing:
    import shutil
    
    # Create a function to copy files with name substitution
    def copy_with_name_substitution(src_path, dst_path, old_str='_lumped', new_str='_distributed'):
        if not src_path.exists():
            return False
            
        # Create destination directory if it doesn't exist
        dst_path.parent.mkdir(parents=True, exist_ok=True)
        
        if src_path.is_dir():
            # Copy entire directory
            if not dst_path.exists():
                dst_path.mkdir(parents=True, exist_ok=True)
                
            # Copy all files with name substitution
            for src_file in src_path.glob('**/*'):
                if src_file.is_file():
                    # Create relative path
                    rel_path = src_file.relative_to(src_path)
                    # Create new filename with substitution
                    new_name = src_file.name.replace(old_str, new_str)
                    # Create destination path
                    dst_file = dst_path / rel_path.parent / new_name
                    # Create parent directories if they don't exist
                    dst_file.parent.mkdir(parents=True, exist_ok=True)
                    # Copy the file
                    shutil.copy2(src_file, dst_file)
            return True
        elif src_path.is_file():
            # Copy single file with name substitution
            new_name = dst_path.name.replace(old_str, new_str)
            dst_file = dst_path.parent / new_name
            dst_file.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src_path, dst_file)
            return True
        
        return False

    print("Found existing geospatial data from lumped model. Copying and renaming files...")
    
    # Copy and rename DEM and other attribute data
    if can_reuse:
        # Define paths
        src_attr_path = CONFLUENCE_DATA_DIR / 'domain_Bow_at_Banff_lumped' / 'attributes'
        dst_attr_path = project_dir / 'attributes'
        
        # Copy attributes with name substitution
        copied = copy_with_name_substitution(src_attr_path, dst_attr_path, '_lumped', '_distributed')
        if copied:
            print("✓ Copied and renamed attribute files from lumped model")
    
    # Copy and rename forcing data
    if can_reuse_forcing:
        # Define paths
        src_forcing_path = CONFLUENCE_DATA_DIR / 'domain_Bow_at_Banff_lumped' / 'forcing' / 'raw_data'
        dst_forcing_path = project_dir / 'forcing' / 'raw_data'
        
        # Copy forcing data with name substitution
        copied = copy_with_name_substitution(src_forcing_path, dst_forcing_path, '_lumped', '_distributed')
        if copied:
            print("✓ Copied and renamed forcing data from lumped model")
            
    print("The distributed model will use these copied files as a starting point.")
else:
    print("No existing data found from the lumped model. Will acquire all data from scratch.")

    # Step 2: Geospatial Domain Definition - Data Acquisition
    print("\n=== Step 2: Geospatial Domain Definition - Data Acquisition ===")
    
    # Acquire attributes
    print("Acquiring geospatial attributes (DEM, soil, land cover)...")
    confluence.managers['data'].acquire_attributes()

    # Acquire forcings
    print(f"\nAcquiring forcing data: {confluence.config['FORCING_DATASET']}")
    confluence.managers['data'].acquire_forcings()
    
print("\n✓ Geospatial attributes acquired")

## 6. Geospatial Domain Definition - Delineation with Stream Network

This is where the main difference from the lumped workflow occurs: we'll create multiple sub-basins connected by a stream network.
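
As a rough illustration of what "connected by a stream network" means, the sketch below stores a network as a mapping from each segment to its downstream segment and walks it to the outlet. The segment IDs, the use of `0` as an outlet marker, and both helper functions are hypothetical; they are not taken from the CONFLUENCE or mizuRoute file formats.

```python
# Hypothetical network: segment ID -> downstream segment ID (0 marks the outlet).
downstream = {11: 13, 12: 13, 13: 15, 14: 15, 15: 0}

def path_to_outlet(segment, downstream_of):
    """Follow downstream links from a segment until the outlet marker (0)."""
    path = [segment]
    while downstream_of[path[-1]] != 0:
        path.append(downstream_of[path[-1]])
    return path

def upstream_segments(target, downstream_of):
    """Return all segments whose flow eventually passes through the target."""
    return sorted(
        seg for seg in downstream_of
        if target in path_to_outlet(seg, downstream_of)[1:]
    )

print(path_to_outlet(11, downstream))
print(upstream_segments(15, downstream))
```

Each sub-basin drains to one segment, so this kind of topology is what lets a routing model accumulate flow from headwater GRUs down to the pour point.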

In [None]:
# Step 3: Geospatial Domain Definition - Delineation
print("=== Step 3: Geospatial Domain Definition - Delineation ===")

# Define domain
print(f"Delineating distributed watershed...")
print(f"Method: {confluence.config['DOMAIN_DEFINITION_METHOD']}")
print(f"Stream threshold: {confluence.config['STREAM_THRESHOLD']}")
print("\nThis will create multiple sub-basins connected by a stream network.")

watershed_path = confluence.managers['domain'].define_domain()

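# Check outputs
basin_path = project_dir / 'shapefiles' / 'river_basins'
network_path = project_dir / 'shapefiles' / 'river_network'

# Default to empty lists so the checks below work even if a folder is missing
basin_files = list(basin_path.glob('*.shp')) if basin_path.exists() else []
network_files = list(network_path.glob('*.shp')) if network_path.exists() else []

if basin_path.exists():
    print(f"\n✓ Created basin shapefiles: {len(basin_files)}")

if network_path.exists():
    print(f"✓ Created river network shapefiles: {len(network_files)}")

# Load and check number of basins
if basin_files:
    gdf = gpd.read_file(basin_files[0])
    print(f"\nNumber of sub-basins (GRUs): {len(gdf)}")
    # Areas computed in degrees are meaningless, so reproject geographic layers to a metric CRS first
    area_gdf = gdf.to_crs(gdf.estimate_utm_crs()) if gdf.crs is not None and gdf.crs.is_geographic else gdf
    print(f"Total area: {area_gdf.geometry.area.sum() / 1e6:.2f} km²")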

## 7. Visualize the Distributed Domain

In [None]:
# Visualize the delineated domain with stream network
basin_files = list((project_dir / 'shapefiles' / 'river_basins').glob('*.shp'))
network_files = list((project_dir / 'shapefiles' / 'river_network').glob('*.shp'))
    
if basin_files and network_files:
    fig, ax = plt.subplots(figsize=(12, 10))
    
    # Load data
    basins = gpd.read_file(basin_files[0])
    rivers = gpd.read_file(network_files[0])
    
    # Plot basins with different colors
    basins.plot(ax=ax, column='GRU_ID', cmap='viridis', 
               alpha=0.7, edgecolor='black', linewidth=0.5)
    
    # Plot river network
    rivers.plot(ax=ax, color='blue', linewidth=2)
    
    # Add pour point
    pour_point = gpd.read_file(pour_point_path)
    pour_point.plot(ax=ax, color='red', markersize=150, marker='o', zorder=5)
    
    ax.set_title(f'Distributed Domain: {len(basins)} Sub-basins', fontsize=16, fontweight='bold')
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    
    # Add colorbar for GRU IDs
    sm = plt.cm.ScalarMappable(cmap='viridis', 
                               norm=plt.Normalize(vmin=basins['GRU_ID'].min(), 
                                                 vmax=basins['GRU_ID'].max()))
    sm.set_array([])  # Public API instead of the private _A attribute
    cbar = fig.colorbar(sm, ax=ax, shrink=0.8)
    cbar.set_label('GRU ID', fontsize=12)
    
    plt.tight_layout()
    plt.show()

## 8. Geospatial Domain Definition - Discretization

Now we'll create Hydrologic Response Units (HRUs) based on the Grouped Response Units (GRUs) we just created.
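
The one-to-one case used in this tutorial can be sketched with a small table. This is only an illustration of the GRU-to-HRU relationship, not the CONFLUENCE discretization code; the areas and the `HRU_ID` and `area_km2` column names are assumptions (only `GRU_ID` appears in this notebook).

```python
import pandas as pd

# Hypothetical GRU table; IDs and areas are invented for illustration.
grus = pd.DataFrame({"GRU_ID": [1, 2, 3], "area_km2": [120.0, 85.5, 240.2]})

# With GRU-based discretization every GRU becomes exactly one HRU,
# so the HRU table is a copy of the GRU table with its own identifier.
hrus = grus.copy()
hrus["HRU_ID"] = range(1, len(hrus) + 1)

# Count HRUs per GRU; in the one-to-one case every count is 1.
hrus_per_gru = hrus.groupby("GRU_ID").size()
print(hrus_per_gru.to_dict())
```

Other discretization choices (for example by elevation band or land cover) would split each GRU into several HRUs, and the counts per GRU would then exceed one.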

In [None]:
# Step 4: Geospatial Domain Definition - Discretization
print("=== Step 4: Geospatial Domain Definition - Discretization ===")

# Discretize domain
print(f"Creating HRUs based on GRUs...")
print(f"Method: {confluence.config['DOMAIN_DISCRETIZATION']}")
print("For this tutorial: 1 GRU = 1 HRU (simplest case)")

hru_path = confluence.managers['domain'].discretize_domain()

# Check the created HRU shapefile
catchment_path = project_dir / 'shapefiles' / 'catchment'
if catchment_path.exists():
    hru_files = list(catchment_path.glob('*.shp'))
    print(f"\n✓ Created HRU shapefiles: {len(hru_files)}")
    
    if hru_files:
        hru_gdf = gpd.read_file(hru_files[0])
        print(f"\nHRU Statistics:")
        print(f"Number of HRUs: {len(hru_gdf)}")
        print(f"Number of GRUs: {hru_gdf['GRU_ID'].nunique()}")
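        # Reproject to a metric CRS first if the layer is in degrees, so the area is in m²
        hru_area = hru_gdf.to_crs(hru_gdf.estimate_utm_crs()) if hru_gdf.crs is not None and hru_gdf.crs.is_geographic else hru_gdf
        print(f"Total area: {hru_area.geometry.area.sum() / 1e6:.2f} km²")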
        
        # Show HRU distribution
        hru_counts = hru_gdf.groupby('GRU_ID').size()
        print(f"\nHRUs per GRU:")
        for gru_id, count in hru_counts.items():
            print(f"  GRU {gru_id}: {count} HRUs")

## 9. Model Agnostic Data Processing - Observed Data

The observed streamflow data will be the same for both the lumped and distributed models since they use the same pour point.

In [None]:
# Step 5: Model Agnostic Data Processing - Observed Data
print("=== Step 5: Model Agnostic Data Processing - Observed Data ===")

# Check if we can reuse data from the lumped model
lumped_obs_path = CONFLUENCE_DATA_DIR / 'domain_Bow_at_Banff_lumped' / 'observations' / 'streamflow' / 'preprocessed'
can_reuse_obs = lumped_obs_path.exists() and any(lumped_obs_path.glob('*.csv'))

if can_reuse_obs:
    print("Found existing observed data from lumped model. Reusing...")
    # We can proceed, but CONFLUENCE will handle the reuse internally

# Process observed data
print("Processing observed streamflow data...")
confluence.managers['data'].process_observed_data()

# Visualize observed streamflow data
obs_path = project_dir / 'observations' / 'streamflow' / 'preprocessed' / f"{confluence.config['DOMAIN_NAME']}_streamflow_processed.csv"
if obs_path.exists():
    obs_df = pd.read_csv(obs_path)
    obs_df['datetime'] = pd.to_datetime(obs_df['datetime'])
    
    fig, ax = plt.subplots(figsize=(14, 6))
    ax.plot(obs_df['datetime'], obs_df['discharge_cms'], 
            linewidth=1.5, color='blue', alpha=0.7)
    
    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel('Discharge (m³/s)', fontsize=12)
    ax.set_title(f'Observed Streamflow - Bow River at Banff (WSC Station: {confluence.config["STATION_ID"]})', 
                fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3)
    
    # Add statistics
    ax.text(0.02, 0.95, f'Mean: {obs_df["discharge_cms"].mean():.1f} m³/s\nMax: {obs_df["discharge_cms"].max():.1f} m³/s', 
            transform=ax.transAxes, 
            bbox=dict(boxstyle='round,pad=0.5', facecolor='white', alpha=0.8),
            verticalalignment='top')
    
    plt.tight_layout()
    plt.show()

## 10. Model Agnostic Data Processing - Preprocessing

In [None]:
# Step 7: Model Agnostic Data Processing - Preprocessing
print("=== Step 7: Model Agnostic Data Processing - Preprocessing ===")

# Run model-agnostic preprocessing
print("\nRunning model-agnostic preprocessing...")
confluence.managers['data'].run_model_agnostic_preprocessing()

print("\n✓ Model-agnostic preprocessing completed")

## 11. Model-Specific Processing - Preprocessing

Now we prepare inputs specific to our chosen hydrological model (SUMMA in this case), set up for a distributed configuration.

In [None]:
# Step 8: Model Specific Processing and Initialization
print("=== Step 8: Model Specific Processing and Initialization ===")

# Preprocess models
print(f"Preparing {confluence.config['HYDROLOGICAL_MODEL']} input files...")
print(f"Note: For distributed mode with {confluence.config['HYDROLOGICAL_MODEL']}, this includes generating:")
print(f"  - Model parameter files for each GRU")
print(f"  - Routing configuration for river network")

confluence.managers['model'].preprocess_models()

print("\n✓ Model-specific preprocessing completed")

## 13. Run the Distributed Model

Now we execute the SUMMA model in distributed mode with routing.
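
To see why routing is needed, it helps to look at how per-GRU runoff relates to discharge at the outlet. The sketch below converts runoff depth to discharge and sums it across GRUs. It is a steady-state simplification that ignores travel time along the network, so it is not the method mizuRoute uses; the runoff values, areas, and function name are invented for illustration.

```python
def runoff_to_discharge(runoff_mm_per_day, area_km2):
    """Convert a runoff depth over an area into discharge in m³/s."""
    # mm -> m (1e-3) and km² -> m² (1e6) give a daily volume in m³
    volume_m3_per_day = runoff_mm_per_day * 1e-3 * area_km2 * 1e6
    return volume_m3_per_day / 86400.0  # seconds per day

# Hypothetical runoff (mm/day) and area (km²) for three GRUs
gru_runoff = {1: 2.0, 2: 1.5, 3: 0.5}
gru_area = {1: 100.0, 2: 200.0, 3: 50.0}

# Without routing delays, outlet discharge is just the sum over GRUs
outlet_discharge = sum(
    runoff_to_discharge(gru_runoff[g], gru_area[g]) for g in gru_runoff
)
print(f"Unrouted outlet discharge: {outlet_discharge:.2f} m³/s")
```

A routing model replaces the plain sum with flow that is delayed and attenuated as it moves through each river segment, which is why the distributed hydrograph can differ in timing from this simple total.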

In [None]:
# Step 9: Run the Distributed Model
print("=== Step 9: Run the Distributed Model ===")

# Run the model
print(f"Running distributed {confluence.config['HYDROLOGICAL_MODEL']} model...")
print("Number of GRUs: see the delineation output above")
print("Note: This will take longer than the lumped model due to multiple units.")

confluence.managers['model'].run_models()

print("\n✓ Model execution completed")

## 15. Visualize Distributed Model Results
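
The cell below reports RMSE, NSE, PBIAS and KGE. As a compact reference, here is a minimal sketch of those metrics as a standalone function, using the same formulas as the cell and tested on synthetic data. The function name `streamflow_metrics` and the sample values are illustrative assumptions, not part of CONFLUENCE.

```python
import numpy as np
import pandas as pd

def streamflow_metrics(obs, sim):
    """Return RMSE, NSE, PBIAS and KGE for two pandas Series sharing an index."""
    # Pair the series and drop time steps where either value is missing
    paired = pd.concat([obs, sim], axis=1, keys=["obs", "sim"]).dropna()
    o, s = paired["obs"], paired["sim"]

    rmse = float(np.sqrt(((o - s) ** 2).mean()))
    nse = float(1 - ((o - s) ** 2).sum() / ((o - o.mean()) ** 2).sum())
    pbias = float(100 * (s.sum() - o.sum()) / o.sum())

    r = o.corr(s)                # linear correlation
    alpha = s.std() / o.std()    # relative variability
    beta = s.mean() / o.mean()   # bias ratio
    kge = float(1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))
    return {"RMSE": rmse, "NSE": nse, "PBIAS": pbias, "KGE": kge}

# Synthetic example: a simulation that overestimates every value by 10%
dates = pd.date_range("2020-01-01", periods=5, freq="D")
obs = pd.Series([10.0, 20.0, 30.0, 20.0, 10.0], index=dates)
sim = obs * 1.1
print(streamflow_metrics(obs, sim))
```

NSE and KGE equal 1 for a perfect match, and a positive PBIAS under this sign convention means the simulation overestimates total flow.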

In [None]:
# Step 14: Visualize Observed vs. Simulated Streamflow for Distributed Model
print("=== Step 14: Visualizing Model Results (Distributed) ===")

import numpy as np
import matplotlib.dates as mdates

# Load and plot simulation results
sim_path = project_dir / 'simulations' / confluence.config['EXPERIMENT_ID'] / 'mizuRoute'
sim_files = list(sim_path.glob('*.nc'))

if not sim_files:
    print("No mizuRoute simulation results found. Check if model execution was successful.")
    print(f"Expected path: {sim_path}")
    
    # Check for alternative locations
    alt_sim_paths = list(Path(config_dict['CONFLUENCE_DATA_DIR']).glob(
        f"domain_{config_dict['DOMAIN_NAME']}/simulations/{config_dict['EXPERIMENT_ID']}/mizuRoute/*.nc"))
    
    if alt_sim_paths:
        sim_files = alt_sim_paths
        print(f"Found alternative simulation data at: {sim_files[0]}")
    else:
        print("No simulation results found anywhere. Visualization cannot proceed.")

if sim_files:
    try:
        # Load simulation data
        print(f"Loading simulation data from: {sim_files[0]}")
        sim_data = xr.open_dataset(sim_files[0])
        
        # Load observation data
        obs_path = project_dir / 'observations' / 'streamflow' / 'preprocessed' / f"{confluence.config['DOMAIN_NAME']}_streamflow_processed.csv"
        
        if not obs_path.exists():
            print(f"Warning: Observation data not found at expected path: {obs_path}")
            print("Checking for alternative locations...")
            alt_obs_paths = list(Path(config_dict['CONFLUENCE_DATA_DIR']).glob(
                f"domain_{config_dict['DOMAIN_NAME']}/observations/streamflow/preprocessed/*_streamflow_processed.csv"))
            
            if alt_obs_paths:
                obs_path = alt_obs_paths[0]
                print(f"Found alternative observation data at: {obs_path}")
            else:
                print("No observation data found. Only simulations will be displayed.")
        
        if obs_path.exists():
            print(f"Loading observation data from: {obs_path}")
            obs_df = pd.read_csv(obs_path)
            obs_df['datetime'] = pd.to_datetime(obs_df['datetime'])
            obs_df.set_index('datetime', inplace=True)
            print(f"Observation period: {obs_df.index.min()} to {obs_df.index.max()}")
        else:
            obs_df = None
            
        # Find the segment ID for the outlet
        reach_id = int(confluence.config.get('SIM_REACH_ID', 0))
        print(f"Using reach ID for outlet: {reach_id}")
        
        if 'reachID' in sim_data.variables:
            # Find the index of our target reach
            reach_indices = np.where(sim_data.reachID.values == reach_id)[0]
            
            if len(reach_indices) > 0:
                reach_idx = reach_indices[0]
                print(f"Found reach ID {reach_id} at index {reach_idx}")
                
                # Extract simulated flow at outlet
                if 'IRFroutedRunoff' in sim_data.variables:
                    print("Extracting IRFroutedRunoff variable")
                    
                    # Extract flow at the outlet segment.
                    # reach_idx is a positional index, so use isel along the dimension reachID is defined on.
                    reach_dim = sim_data.reachID.dims[0]
                    sim_flow = sim_data.IRFroutedRunoff.isel({reach_dim: reach_idx}).to_series()
                    
                    sim_df = pd.DataFrame(sim_flow)
                    sim_df.columns = ['discharge_cms']
                    
                    # Determine common time period if observations exist
                    if obs_df is not None:
                        # Align to daily timestep for comparison
                        obs_daily = obs_df.resample('D').mean()
                        sim_daily = sim_df.resample('D').mean()
                        
                        # Find overlapping time period
                        start_date = max(obs_daily.index.min(), sim_daily.index.min())
                        end_date = min(obs_daily.index.max(), sim_daily.index.max())
                        
                        # Advance start date by 1 month to skip initial spinup
                        start_date = start_date + pd.DateOffset(months=1)
                        
                        print(f"Common data period (after skipping 1 month spinup): {start_date} to {end_date}")
                        
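                        # Filter to common period
                        obs_period = obs_daily.loc[start_date:end_date]
                        sim_period = sim_daily.loc[start_date:end_date]

                        # Keep only days with valid values in both series so every metric uses the same sample
                        common_idx = obs_period['discharge_cms'].dropna().index.intersection(
                            sim_period['discharge_cms'].dropna().index)
                        obs_period = obs_period.loc[common_idx]
                        sim_period = sim_period.loc[common_idx]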
                        
                        # Calculate performance metrics
                        rmse = np.sqrt(((obs_period['discharge_cms'] - sim_period['discharge_cms'])**2).mean())
                        
                        # Calculate Nash-Sutcliffe Efficiency (NSE)
                        mean_obs = obs_period['discharge_cms'].mean()
                        numerator = ((obs_period['discharge_cms'] - sim_period['discharge_cms'])**2).sum()
                        denominator = ((obs_period['discharge_cms'] - mean_obs)**2).sum()
                        nse = 1 - (numerator / denominator)
                        
                        # Calculate Percent Bias (PBIAS)
                        pbias = 100 * (sim_period['discharge_cms'].sum() - obs_period['discharge_cms'].sum()) / obs_period['discharge_cms'].sum()
                        
                        # Calculate Kling-Gupta Efficiency (KGE)
                        r = obs_period['discharge_cms'].corr(sim_period['discharge_cms'])  # Correlation
                        alpha = sim_period['discharge_cms'].std() / obs_period['discharge_cms'].std()  # Relative variability
                        beta = sim_period['discharge_cms'].mean() / obs_period['discharge_cms'].mean()  # Bias ratio
                        kge = 1 - ((r - 1)**2 + (alpha - 1)**2 + (beta - 1)**2)**0.5
                        
                        print(f"Performance metrics:")
                        print(f"  - RMSE: {rmse:.2f} m³/s")
                        print(f"  - NSE: {nse:.2f}")
                        print(f"  - PBIAS: {pbias:.2f}%")
                        print(f"  - KGE: {kge:.2f}")
                        
                        # Create figure with two subplots for time series and flow duration curve
                        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 16))
                        fig.suptitle(f"Distributed Model Results - {confluence.config['DOMAIN_NAME'].replace('_', ' ').title()}", 
                                     fontsize=16, fontweight='bold')
                        
                        # Plot time series
                        ax1.plot(obs_period.index, obs_period['discharge_cms'], 
                                 'b-', label='Observed', linewidth=1.5, alpha=0.7)
                        ax1.plot(sim_period.index, sim_period['discharge_cms'], 
                                 'r-', label='Simulated (Distributed)', linewidth=1.5, alpha=0.7)
                        
                        # Add calibration/evaluation period shading if configured
                        if 'CALIBRATION_PERIOD' in confluence.config and 'EVALUATION_PERIOD' in confluence.config:
                            cal_start = pd.Timestamp(confluence.config.get('CALIBRATION_PERIOD').split(',')[0].strip())
                            cal_end = pd.Timestamp(confluence.config.get('CALIBRATION_PERIOD').split(',')[1].strip())
                            eval_start = pd.Timestamp(confluence.config.get('EVALUATION_PERIOD').split(',')[0].strip())
                            eval_end = pd.Timestamp(confluence.config.get('EVALUATION_PERIOD').split(',')[1].strip())
                            
                            # Only shade if within the plot range
                            if cal_start <= end_date and cal_end >= start_date:
                                valid_cal_start = max(cal_start, start_date)
                                valid_cal_end = min(cal_end, end_date)
                                ax1.axvspan(valid_cal_start, valid_cal_end, alpha=0.2, color='gray', label='Calibration Period')
                            
                            if eval_start <= end_date and eval_end >= start_date:
                                valid_eval_start = max(eval_start, start_date)
                                valid_eval_end = min(eval_end, end_date)
                                ax1.axvspan(valid_eval_start, valid_eval_end, alpha=0.2, color='lightblue', label='Evaluation Period')
                        
                        ax1.set_xlabel('Date', fontsize=12)
                        ax1.set_ylabel('Discharge (m³/s)', fontsize=12)
                        ax1.set_title('Streamflow Comparison', fontsize=14)
                        ax1.legend(loc='upper right', fontsize=10)
                        ax1.grid(True, linestyle=':', alpha=0.6)
                        ax1.set_facecolor('#f0f0f0')
                        
                        # Format x-axis
                        ax1.xaxis.set_major_locator(mdates.YearLocator())
                        ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
                        
                        # Add metrics as text
                        ax1.text(0.02, 0.95, 
                                 f"RMSE: {rmse:.2f} m³/s\nNSE: {nse:.2f}\nPBIAS: {pbias:.2f}%\nKGE: {kge:.2f}",
                                 transform=ax1.transAxes, 
                                 fontsize=12,
                                 bbox=dict(facecolor='white', alpha=0.8, boxstyle='round,pad=0.5'))
                        
                        # Plot flow duration curve
                        # Sort values in descending order
                        obs_sorted = obs_period['discharge_cms'].sort_values(ascending=False)
                        sim_sorted = sim_period['discharge_cms'].sort_values(ascending=False)
                        
                        # Calculate exceedance probabilities
                        obs_ranks = np.arange(1., len(obs_sorted) + 1) / len(obs_sorted)
                        sim_ranks = np.arange(1., len(sim_sorted) + 1) / len(sim_sorted)
                        
                        # Plot Flow Duration Curves
                        ax2.loglog(obs_ranks * 100, obs_sorted, 'b-', label='Observed', linewidth=2)
                        ax2.loglog(sim_ranks * 100, sim_sorted, 'r-', label='Simulated', linewidth=2)
                        
                        ax2.set_xlabel('Exceedance Probability (%)', fontsize=12)
                        ax2.set_ylabel('Discharge (m³/s)', fontsize=12)
                        ax2.set_title('Flow Duration Curve', fontsize=14)
                        ax2.legend(loc='best', fontsize=10)
                        ax2.grid(True, which='both', linestyle=':', alpha=0.6)
                        ax2.set_facecolor('#f0f0f0')
                        
                        # Add flow regime regions
                        ax2.axvspan(0, 20, alpha=0.2, color='blue', label='High Flows')
                        ax2.axvspan(20, 70, alpha=0.2, color='green', label='Medium Flows')
                        ax2.axvspan(70, 100, alpha=0.2, color='red', label='Low Flows')
                        
                        # Save the plot to file
                        plot_folder = project_dir / "plots" / "results"
                        plot_folder.mkdir(parents=True, exist_ok=True)
                        plot_filename = plot_folder / f"{confluence.config['EXPERIMENT_ID']}_streamflow_comparison.png"
                        
                        plt.tight_layout()
                        plt.subplots_adjust(top=0.93)
                        plt.savefig(plot_filename, dpi=300, bbox_inches='tight')
                        print(f"Plot saved to: {plot_filename}")
                        
                        plt.show()
                    else:
                        # If no observations, just plot simulation
                        fig, ax = plt.subplots(figsize=(14, 6))
                        ax.plot(sim_df.index, sim_df['discharge_cms'], 
                                color='red', linewidth=1.5, label='Simulated (Distributed)')
                        
                        ax.set_xlabel('Date', fontsize=12)
                        ax.set_ylabel('Discharge (m³/s)', fontsize=12)
                        ax.set_title(f'Distributed Model Results - {confluence.config["DOMAIN_NAME"].replace("_", " ").title()}', 
                                    fontsize=14, fontweight='bold')
                        ax.grid(True, alpha=0.3)
                        ax.legend(fontsize=10)
                        
                        plt.tight_layout()
                        plt.show()
                else:
                    print("Error: IRFroutedRunoff variable not found in simulation output")
                    print(f"Available variables: {list(sim_data.variables)}")
            else:
                print(f"Error: Could not find reach ID {reach_id} in simulation output")
                print(f"Available reach IDs: {sim_data.reachID.values}")
        else:
            print("Error: reachID variable not found in simulation output")
            print(f"Available variables: {list(sim_data.variables)}")
        
        # Close the dataset
        sim_data.close()
    except Exception as e:
        print(f"Error visualizing simulation results: {str(e)}")
        import traceback
        print(traceback.format_exc())
else:
    print("No simulation results found. Check model execution.")

## 16. Optimization and Analysis (Optional)


## 17. Compare Lumped vs Distributed Results (Optional)

If you've completed the lumped model tutorial, we can compare results between the two approaches.
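
Before two simulations can be compared against observations, all series need a shared time step and period. The sketch below shows one way to do that with pandas, using daily means and the overlapping date range. The synthetic series and the helper `align_daily` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly simulation starting one day after the observations
sim_hourly = pd.Series(
    np.arange(72, dtype=float),
    index=pd.date_range("2020-01-02", periods=72, freq=pd.Timedelta(hours=1)),
    name="sim",
)
# Hypothetical daily observations
obs_daily = pd.Series(
    [10.0, 11.0, 12.0, 13.0],
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
    name="obs",
)

def align_daily(obs, sim):
    """Resample both series to daily means and keep only the overlapping days."""
    obs_d = obs.resample("D").mean()
    sim_d = sim.resample("D").mean()
    start = max(obs_d.index.min(), sim_d.index.min())
    end = min(obs_d.index.max(), sim_d.index.max())
    return pd.concat([obs_d.loc[start:end], sim_d.loc[start:end]], axis=1).dropna()

aligned = align_daily(obs_daily, sim_hourly)
print(aligned)
```

Applying the same alignment to the lumped and distributed outputs keeps the comparison fair, since both are then scored on identical days.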

In [None]:
# Step 15: Compare Lumped vs. Distributed Model Results
print("=== Step 15: Comparing Lumped and Distributed Model Results ===")

# Import necessary libraries if not already imported
import numpy as np
import matplotlib.dates as mdates

# Set paths for both lumped and distributed model results
lumped_domain = 'Bow_at_Banff_lumped_tutorial'
lumped_sim_path = CONFLUENCE_DATA_DIR / f'domain_{lumped_domain}' / 'simulations' / 'tutorial_run' / 'SUMMA'
dist_sim_path = project_dir / 'simulations' / confluence.config['EXPERIMENT_ID'] / 'mizuRoute'

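# Check if paths exist and find simulation files
# Keep both as lists of Paths so the emptiness checks and the [0] lookups below behave as intended
lumped_sim_file = lumped_sim_path / 'tutorial_run_timestep.nc'
lumped_sim_files = [lumped_sim_file] if lumped_sim_file.exists() else []
dist_sim_files = list(dist_sim_path.glob('*.nc')) if dist_sim_path.exists() else []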

# Check what result files are available
if not lumped_sim_files and not dist_sim_files:
    print("Neither lumped nor distributed model results found. Run both models first for comparison.")
elif not lumped_sim_files:
    print("Lumped model results not found. Complete the lumped model tutorial first for comparison.")
elif not dist_sim_files:
    print("Distributed simulation results not found. Run the distributed model first.")
else:
    print("Found both lumped and distributed model results. Creating comparison plot...")
    
    try:
        # Load lumped simulation data
        print(f"Loading lumped model results from: {lumped_sim_files[0]}")
        lumped_data = xr.open_dataset(lumped_sim_files[0])
        
        # Load distributed simulation data
        print(f"Loading distributed model results from: {dist_sim_files[0]}")
        dist_data = xr.open_dataset(dist_sim_files[0])
        
        # Load observation data
        obs_path = project_dir / 'observations' / 'streamflow' / 'preprocessed' / f"{confluence.config['DOMAIN_NAME']}_streamflow_processed.csv"
        
        if not obs_path.exists():
            print(f"Observation data not found at: {obs_path}")
            # Try to find observations in the lumped domain directory
            alt_obs_path = CONFLUENCE_DATA_DIR / f'domain_{lumped_domain}' / 'observations' / 'streamflow' / 'preprocessed' / f"{lumped_domain}_streamflow_processed.csv"
            if alt_obs_path.exists():
                obs_path = alt_obs_path
                print(f"Found alternative observation data at: {obs_path}")
            else:
                print("No observation data found. Only simulations will be compared.")
        
        obs_df = None
        if obs_path.exists():
            print(f"Loading observation data from: {obs_path}")
            obs_df = pd.read_csv(obs_path)
            obs_df['datetime'] = pd.to_datetime(obs_df['datetime'])
            obs_df.set_index('datetime', inplace=True)
            print(f"Observation period: {obs_df.index.min()} to {obs_df.index.max()}")
        
        # Define reach IDs for each model
        lumped_reach_id = int(config_dict.get('SIM_REACH_ID', 0))
        dist_reach_id = int(confluence.config.get('SIM_REACH_ID', 0))
        
        # Extract flows based on available structure
        lumped_flow = None
        dist_flow = None
        
        # Extract lumped flow
        print(f"Extracting lumped model flow for reach ID: {lumped_reach_id}")
        if 'reachID' in lumped_data.variables and 'IRFroutedRunoff' in lumped_data.variables:
            reach_indices = np.where(lumped_data.reachID.values == lumped_reach_id)[0]
            if len(reach_indices) > 0:
                reach_idx = reach_indices[0]
                if 'seg' in lumped_data.dims:
                    # reach_idx is a positional index, so select with isel rather than sel
                    lumped_flow = lumped_data.IRFroutedRunoff.isel(seg=reach_idx).to_series()
                else:
                    lumped_flow = lumped_data.IRFroutedRunoff.isel(reachID=reach_idx).to_series()
                
                lumped_df = pd.DataFrame(lumped_flow)
                lumped_df.columns = ['discharge_cms']
            else:
                print(f"Warning: Reach ID {lumped_reach_id} not found in lumped model output")
                print(f"Available reach IDs: {lumped_data.reachID.values}")
        else:
            print("Warning: Required variables not found in lumped model output")
            print(f"Available variables: {list(lumped_data.variables)}")
        
        # Extract distributed flow
        print(f"Extracting distributed model flow for reach ID: {dist_reach_id}")
        if 'reachID' in dist_data.variables and 'IRFroutedRunoff' in dist_data.variables:
            reach_indices = np.where(dist_data.reachID.values == dist_reach_id)[0]
            if len(reach_indices) > 0:
                reach_idx = reach_indices[0]
                if 'seg' in dist_data.dims:
                    # reach_idx is a positional index, so select with isel rather than sel
                    dist_flow = dist_data.IRFroutedRunoff.isel(seg=reach_idx).to_series()
                else:
                    dist_flow = dist_data.IRFroutedRunoff.isel(reachID=reach_idx).to_series()
                
                dist_df = pd.DataFrame(dist_flow)
                dist_df.columns = ['discharge_cms']
            else:
                print(f"Warning: Reach ID {dist_reach_id} not found in distributed model output")
                print(f"Available reach IDs: {dist_data.reachID.values}")
        else:
            print("Warning: Required variables not found in distributed model output")
            print(f"Available variables: {list(dist_data.variables)}")
        
        # Proceed only if both flows are extracted
        if lumped_flow is not None and dist_flow is not None:
            # Resample to daily for comparison
            lumped_daily = lumped_df.resample('D').mean()
            dist_daily = dist_df.resample('D').mean()
            
            # Determine common time period
            if obs_df is not None:
                obs_daily = obs_df.resample('D').mean()
                start_date = max(obs_daily.index.min(), lumped_daily.index.min(), dist_daily.index.min())
                end_date = min(obs_daily.index.max(), lumped_daily.index.max(), dist_daily.index.max())
            else:
                start_date = max(lumped_daily.index.min(), dist_daily.index.min())
                end_date = min(lumped_daily.index.max(), dist_daily.index.max())
            
            # Advance start date by 1 month to skip spinup
            start_date = start_date + pd.DateOffset(months=1)
            
            print(f"Common comparison period (after skipping 1 month spinup): {start_date} to {end_date}")
            
            # Filter to common period
            lumped_period = lumped_daily.loc[start_date:end_date]
            dist_period = dist_daily.loc[start_date:end_date]
            
            if obs_df is not None:
                obs_period = obs_daily.loc[start_date:end_date]
                
                # Calculate metrics for both models
                metrics = []
                
                # Lumped model metrics
                lumped_rmse = np.sqrt(((obs_period['discharge_cms'] - lumped_period['discharge_cms'])**2).mean())
                lumped_nse = 1 - (((obs_period['discharge_cms'] - lumped_period['discharge_cms'])**2).sum() / 
                                 ((obs_period['discharge_cms'] - obs_period['discharge_cms'].mean())**2).sum())
                lumped_pbias = 100 * (lumped_period['discharge_cms'].sum() - obs_period['discharge_cms'].sum()) / obs_period['discharge_cms'].sum()
                
                # Calculate Kling-Gupta Efficiency for lumped model
                r_lumped = obs_period['discharge_cms'].corr(lumped_period['discharge_cms'])
                alpha_lumped = lumped_period['discharge_cms'].std() / obs_period['discharge_cms'].std()
                beta_lumped = lumped_period['discharge_cms'].mean() / obs_period['discharge_cms'].mean()
                kge_lumped = 1 - ((r_lumped - 1)**2 + (alpha_lumped - 1)**2 + (beta_lumped - 1)**2)**0.5
                
                metrics.append({
                    'model': 'Lumped',
                    'RMSE': f"{lumped_rmse:.2f} m³/s",
                    'NSE': f"{lumped_nse:.3f}",
                    'PBIAS': f"{lumped_pbias:.2f}%",
                    'KGE': f"{kge_lumped:.3f}"
                })
                
                # Distributed model metrics
                dist_rmse = np.sqrt(((obs_period['discharge_cms'] - dist_period['discharge_cms'])**2).mean())
                dist_nse = 1 - (((obs_period['discharge_cms'] - dist_period['discharge_cms'])**2).sum() / 
                               ((obs_period['discharge_cms'] - obs_period['discharge_cms'].mean())**2).sum())
                dist_pbias = 100 * (dist_period['discharge_cms'].sum() - obs_period['discharge_cms'].sum()) / obs_period['discharge_cms'].sum()
                
                # Calculate Kling-Gupta Efficiency for distributed model
                r_dist = obs_period['discharge_cms'].corr(dist_period['discharge_cms'])
                alpha_dist = dist_period['discharge_cms'].std() / obs_period['discharge_cms'].std()
                beta_dist = dist_period['discharge_cms'].mean() / obs_period['discharge_cms'].mean()
                kge_dist = 1 - ((r_dist - 1)**2 + (alpha_dist - 1)**2 + (beta_dist - 1)**2)**0.5
                
                metrics.append({
                    'model': 'Distributed',
                    'RMSE': f"{dist_rmse:.2f} m³/s",
                    'NSE': f"{dist_nse:.3f}",
                    'PBIAS': f"{dist_pbias:.2f}%",
                    'KGE': f"{kge_dist:.3f}"
                })
                
                # Print metrics table
                print("\nPerformance Metrics Comparison:")
                metrics_df = pd.DataFrame(metrics).set_index('model')
                print(metrics_df)
            
            # Create figure
            fig = plt.figure(figsize=(15, 15))
            gs = gridspec.GridSpec(3, 1, height_ratios=[2, 1, 1])
            
            # Timeseries plot
            ax1 = fig.add_subplot(gs[0])
            
            # Plot observations if available
            if obs_df is not None:
                ax1.plot(obs_period.index, obs_period['discharge_cms'], 
                        color='black', linewidth=2, label='Observed', zorder=3)
            
            # Plot lumped model results
            ax1.plot(lumped_period.index, lumped_period['discharge_cms'], 
                    color='#1f77b4', linewidth=1.5, alpha=0.8, label='Lumped Model', zorder=2)
            
            # Plot distributed model results
            ax1.plot(dist_period.index, dist_period['discharge_cms'], 
                    color='#ff7f0e', linewidth=1.5, alpha=0.8, label='Distributed Model', zorder=1)
            
            # Add calibration/evaluation period shading if configured
            if 'CALIBRATION_PERIOD' in confluence.config and 'EVALUATION_PERIOD' in confluence.config:
                cal_start = pd.Timestamp(confluence.config.get('CALIBRATION_PERIOD').split(',')[0].strip())
                cal_end = pd.Timestamp(confluence.config.get('CALIBRATION_PERIOD').split(',')[1].strip())
                eval_start = pd.Timestamp(confluence.config.get('EVALUATION_PERIOD').split(',')[0].strip())
                eval_end = pd.Timestamp(confluence.config.get('EVALUATION_PERIOD').split(',')[1].strip())
                
                # Only shade if within the plot range
                if cal_start <= end_date and cal_end >= start_date:
                    valid_cal_start = max(cal_start, start_date)
                    valid_cal_end = min(cal_end, end_date)
                    ax1.axvspan(valid_cal_start, valid_cal_end, alpha=0.2, color='gray', label='Calibration Period')
                
                if eval_start <= end_date and eval_end >= start_date:
                    valid_eval_start = max(eval_start, start_date)
                    valid_eval_end = min(eval_end, end_date)
                    ax1.axvspan(valid_eval_start, valid_eval_end, alpha=0.2, color='lightblue', label='Evaluation Period')
            
            ax1.set_xlabel('Date', fontsize=12)
            ax1.set_ylabel('Discharge (m³/s)', fontsize=12)
            ax1.set_title('Lumped vs Distributed Model Comparison', fontsize=14, fontweight='bold')
            ax1.grid(True, alpha=0.3)
            ax1.legend(fontsize=10, loc='upper right')
            
            # Format x-axis to show years
            ax1.xaxis.set_major_locator(mdates.YearLocator())
            ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
            
            # Add metrics if observations are available
            if obs_df is not None:
                # Add metrics table as text
                metrics_text = "Performance Metrics:\n"
                metrics_text += f"Lumped Model:    RMSE: {lumped_rmse:.2f} m³/s    NSE: {lumped_nse:.3f}    KGE: {kge_lumped:.3f}\n"
                metrics_text += f"Distributed Model:    RMSE: {dist_rmse:.2f} m³/s    NSE: {dist_nse:.3f}    KGE: {kge_dist:.3f}"
                
                ax1.text(0.01, 0.02, metrics_text, transform=ax1.transAxes,
                        bbox=dict(facecolor='white', alpha=0.8, boxstyle='round,pad=0.5'),
                        fontsize=10, verticalalignment='bottom')
            
            # Flow Duration Curve
            ax2 = fig.add_subplot(gs[1])
            
            if obs_df is not None:
                # Sort values in descending order
                obs_sorted = obs_period['discharge_cms'].sort_values(ascending=False)
                lumped_sorted = lumped_period['discharge_cms'].sort_values(ascending=False)
                dist_sorted = dist_period['discharge_cms'].sort_values(ascending=False)
                
                # Calculate exceedance probabilities
                obs_ranks = np.arange(1., len(obs_sorted) + 1) / len(obs_sorted)
                lumped_ranks = np.arange(1., len(lumped_sorted) + 1) / len(lumped_sorted)
                dist_ranks = np.arange(1., len(dist_sorted) + 1) / len(dist_sorted)
                
                # Plot Flow Duration Curves
                ax2.semilogy(obs_ranks * 100, obs_sorted, 'k-', label='Observed', linewidth=2)
                ax2.semilogy(lumped_ranks * 100, lumped_sorted, '-', color='#1f77b4', label='Lumped Model', linewidth=1.5)
                ax2.semilogy(dist_ranks * 100, dist_sorted, '-', color='#ff7f0e', label='Distributed Model', linewidth=1.5)
                
                ax2.set_xlabel('Exceedance Probability (%)', fontsize=12)
                ax2.set_ylabel('Discharge (m³/s)', fontsize=12)
                ax2.set_title('Flow Duration Curve', fontsize=14)
                ax2.legend(loc='best', fontsize=10)
                ax2.grid(True, which='both', alpha=0.3)
                
                # Add flow regime regions
                ax2.axvspan(0, 20, alpha=0.2, color='blue')
                ax2.axvspan(20, 70, alpha=0.2, color='green')
                ax2.axvspan(70, 100, alpha=0.2, color='red')
                
                # Add text labels for flow regions
                ax2.text(10, ax2.get_ylim()[1] * 0.8, 'High Flows', fontsize=10, ha='center')
                ax2.text(45, ax2.get_ylim()[1] * 0.1, 'Medium Flows', fontsize=10, ha='center')
                ax2.text(85, ax2.get_ylim()[1] * 0.02, 'Low Flows', fontsize=10, ha='center')
            
            # Error analysis
            ax3 = fig.add_subplot(gs[2])
            
            if obs_df is not None:
                # Calculate errors for both models
                lumped_error = lumped_period['discharge_cms'] - obs_period['discharge_cms']
                dist_error = dist_period['discharge_cms'] - obs_period['discharge_cms']
                
                # Plot errors
                ax3.plot(lumped_period.index, lumped_error, '-', color='#1f77b4', label='Lumped Model Error', alpha=0.7)
                ax3.plot(dist_period.index, dist_error, '-', color='#ff7f0e', label='Distributed Model Error', alpha=0.7)
                
                # Add zero line
                ax3.axhline(y=0, color='k', linestyle='-', linewidth=0.8)
                
                ax3.set_xlabel('Date', fontsize=12)
                ax3.set_ylabel('Error (m³/s)', fontsize=12)
                ax3.set_title('Model Error (Simulated - Observed)', fontsize=14)
                ax3.legend(fontsize=10)
                ax3.grid(True, alpha=0.3)
                
                # Format x-axis to match the first plot
                ax3.xaxis.set_major_locator(mdates.YearLocator())
                ax3.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
            
            # Save and show the plot
            plt.tight_layout()
            
            # Save the plot
            plot_folder = project_dir / "plots" / "results"
            plot_folder.mkdir(parents=True, exist_ok=True)
            plot_filename = plot_folder / 'lumped_vs_distributed_comparison.png'
            plt.savefig(plot_filename, dpi=300, bbox_inches='tight')
            print(f"Plot saved to: {plot_filename}")
            
            plt.show()
            
            # Close datasets
            lumped_data.close()
            dist_data.close()
        else:
            print("Error: Failed to extract flow data from one or both models.")
    
    except Exception as e:
        print(f"Error during comparison: {str(e)}")
        import traceback
        print(traceback.format_exc())
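
The metric formulas in the cell above are written out twice, once per model, and rely on pandas index alignment to pair observed and simulated values. As a minimal sketch, and not part of CONFLUENCE itself, the hypothetical helper `streamflow_metrics` below computes the same RMSE, NSE, PBIAS and KGE after explicitly dropping timestamps that are missing from either series. The demo series are synthetic and only illustrate the call.

```python
import numpy as np
import pandas as pd


def streamflow_metrics(obs: pd.Series, sim: pd.Series) -> dict:
    """Return RMSE, NSE, PBIAS and KGE for timestamps where both series have data.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    # Keep only timestamps that are present and non-missing in both series
    paired = pd.concat([obs, sim], axis=1, keys=['obs', 'sim']).dropna()
    o, s = paired['obs'], paired['sim']

    # Same formulas as in the comparison cell
    rmse = float(np.sqrt(((s - o) ** 2).mean()))
    nse = float(1 - ((s - o) ** 2).sum() / ((o - o.mean()) ** 2).sum())
    pbias = float(100 * (s.sum() - o.sum()) / o.sum())

    # Kling-Gupta Efficiency components: correlation, variability ratio, bias ratio
    r = o.corr(s)
    alpha = s.std() / o.std()
    beta = s.mean() / o.mean()
    kge = float(1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))

    return {'RMSE': rmse, 'NSE': nse, 'PBIAS': pbias, 'KGE': kge}


# Synthetic example: a perfect simulation and one that overestimates flow by 10%
dates = pd.date_range('2020-01-01', periods=5, freq='D')
obs_demo = pd.Series([10.0, 12.0, 15.0, 11.0, 9.0], index=dates)
perfect = streamflow_metrics(obs_demo, obs_demo)
biased = streamflow_metrics(obs_demo, obs_demo * 1.1)
print(perfect)
print(biased)
```

In the comparison cell, it could be called as `streamflow_metrics(obs_period['discharge_cms'], lumped_period['discharge_cms'])` and again with `dist_period`, so both models are scored on exactly the same paired days.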

In [None]:
# Alternative: Run the complete workflow in one step
# (Uncomment to use this instead of the step-by-step approach)

# confluence.run_workflow()