# CONFLUENCE Tutorial: Lumped Basin Workflow (Bow River at Banff)

This notebook walks through a complete workflow for a lumped basin model using the Bow River at Banff as an example. We'll execute each step individually to understand what's happening at each stage.

## Overview of This Tutorial

We'll work through the simplest case in catchment modeling: a lumped basin model. This treats the entire watershed as a single unit, making it an ideal starting point for understanding the CONFLUENCE workflow.

We'll run through:
1. Project setup and configuration
2. Domain definition (watershed delineation)
3. Data acquisition (forcings and attributes)
4. Model preprocessing
5. Model execution
6. Results visualization

## 1. Setup and Import Libraries

In [None]:
# Import required libraries
import sys
import os
from pathlib import Path
import yaml
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
from datetime import datetime
import contextily as cx
import xarray as xr

# Add CONFLUENCE to path
confluence_path = Path('../').resolve()
sys.path.append(str(confluence_path))

# Import main CONFLUENCE class
from CONFLUENCE import CONFLUENCE

# Set up plotting style
plt.style.use('default')
%matplotlib inline

## 2. Initialize CONFLUENCE
First, let's set up our directories and load the configuration. CONFLUENCE uses a centralized configuration file that controls all aspects of the modeling workflow.

In [None]:
# Set directory paths
CONFLUENCE_CODE_DIR = confluence_path
CONFLUENCE_DATA_DIR = Path('/work/comphyd_lab/data/CONFLUENCE_data')  # ← User should modify this path

# Load and update configuration
config_path = CONFLUENCE_CODE_DIR / '0_config_files' / 'config_template.yaml'

# Read config file and update paths
with open(config_path, 'r') as f:
    config_dict = yaml.safe_load(f)

# Update paths and settings 
config_dict['CONFLUENCE_CODE_DIR'] = str(CONFLUENCE_CODE_DIR)
config_dict['CONFLUENCE_DATA_DIR'] = str(CONFLUENCE_DATA_DIR)

# Save updated config to a temporary file
temp_config_path = CONFLUENCE_CODE_DIR / '0_config_files' / 'config_notebook.yaml'
with open(temp_config_path, 'w') as f:
    yaml.dump(config_dict, f)

# Initialize CONFLUENCE
confluence = CONFLUENCE(temp_config_path)

# Display configuration
print("=== Directory Configuration ===")
print(f"Code Directory: {CONFLUENCE_CODE_DIR}")
print(f"Data Directory: {CONFLUENCE_DATA_DIR}")
print("\n=== Key Configuration Settings ===")
print(f"Domain Name: {confluence.config['DOMAIN_NAME']}")
print(f"Pour Point: {confluence.config['POUR_POINT_COORDS']}")
print(f"Spatial Mode: {confluence.config['SPATIAL_MODE']}")
print(f"Model: {confluence.config['HYDROLOGICAL_MODEL']}")
print(f"Simulation Period: {confluence.config['EXPERIMENT_TIME_START']} to {confluence.config['EXPERIMENT_TIME_END']}")
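
Because every later step reads its settings from this one configuration, it can help to check up front that the keys we rely on are actually present. The sketch below is not part of CONFLUENCE: `missing_config_keys` is a hypothetical helper, the key list is simply the set of keys this cell reads, and the example values are placeholders.

```python
# Hypothetical helper (not a CONFLUENCE function): report config keys that are
# absent or empty before starting a long workflow.
REQUIRED_KEYS = [
    "CONFLUENCE_CODE_DIR",
    "CONFLUENCE_DATA_DIR",
    "DOMAIN_NAME",
    "POUR_POINT_COORDS",
    "SPATIAL_MODE",
    "HYDROLOGICAL_MODEL",
    "EXPERIMENT_TIME_START",
    "EXPERIMENT_TIME_END",
]


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are missing or set to None."""
    return [key for key in required if config.get(key) is None]


# Placeholder values for illustration only; a real config comes from the YAML file.
example_config = {
    "CONFLUENCE_CODE_DIR": "/path/to/code",
    "CONFLUENCE_DATA_DIR": "/path/to/data",
    "DOMAIN_NAME": "example_basin",
    "POUR_POINT_COORDS": "<lat>/<lon>",
    "SPATIAL_MODE": "Lumped",
    "HYDROLOGICAL_MODEL": "SUMMA",
}

missing = missing_config_keys(example_config)
print("Missing keys:", missing)
```

In the notebook, the same check could be applied to `config_dict` right after `yaml.safe_load`.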

## 3. Project Setup - Organizing the Modeling Workflow
The first step in any CONFLUENCE workflow is to establish a well-organized project structure. This might seem trivial, but it's crucial for:

- Maintaining consistency across different experiments
- Ensuring all components can find required files
- Enabling reproducibility
- Facilitating collaboration

In [None]:
# Step 1: Project Initialization
print("=== Step 1: Project Initialization ===")

# Setup project
project_dir = confluence.managers['project'].setup_project()

# Create pour point
pour_point_path = confluence.managers['project'].create_pour_point()

# List created directories
print("\nCreated directories:")
for item in sorted(project_dir.iterdir()):
    if item.is_dir():
        print(f"  📁 {item.name}")

print("\nDirectory purposes:")
print("  📁 shapefiles: Domain geometry (watershed, pour points, river network)")
print("  📁 attributes: Static characteristics (elevation, soil, land cover)")
print("  📁 forcing: Meteorological inputs (precipitation, temperature)")
print("  📁 simulations: Model outputs")
print("  📁 evaluation: Performance metrics and comparisons")
print("  📁 plots: Visualizations")
print("  📁 optimisation: Calibration results")
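
To make the layout concrete, here is a minimal sketch that builds the same set of subdirectories in a temporary location. It is not the implementation of `setup_project()`; the helper `make_project_tree` is hypothetical, and the `domain_<DOMAIN_NAME>` folder name is assumed from the simulation path used later in this notebook.

```python
import tempfile
from pathlib import Path

# Subdirectory names taken from the "Directory purposes" list above.
SUBDIRS = [
    "shapefiles",
    "attributes",
    "forcing",
    "simulations",
    "evaluation",
    "plots",
    "optimisation",
]


def make_project_tree(data_dir, domain_name, subdirs=SUBDIRS):
    """Create <data_dir>/domain_<domain_name>/<subdir> for each subdir (hypothetical helper)."""
    project_root = Path(data_dir) / f"domain_{domain_name}"
    for name in subdirs:
        (project_root / name).mkdir(parents=True, exist_ok=True)
    return project_root


# Build the tree in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    example_root = make_project_tree(tmp, "example_basin")
    created = sorted(p.name for p in example_root.iterdir() if p.is_dir())
    print(created)
```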

## 4. Geospatial Domain Definition and Analysis - Attribute Data Acquisition
Before we can delineate the watershed, we need elevation data. CONFLUENCE also acquires soil and land cover data at this stage for later use in the model.

In [None]:
# Step 2: Geospatial Domain Definition and Analysis
print("=== Step 2: Geospatial Domain Definition and Analysis ===")

# Acquire attributes
print("Acquiring geospatial attributes (DEM, soil, land cover)...")
confluence.managers['data'].acquire_attributes()

## 5. Geospatial Domain Definition and Analysis - Delineation 

In [None]:
# Define domain
print(f"\nDelineating watershed using method: {confluence.config['DOMAIN_DEFINITION_METHOD']}")
watershed_path = confluence.managers['domain'].define_domain()

# Check outputs
print("\nDomain definition complete:")
print(f"  - Watershed defined: {watershed_path is not None}")

## 6. Geospatial Domain Definition and Analysis - Discretisation 

In [None]:
# Discretize domain
print(f"\nCreating HRUs using method: {confluence.config['DOMAIN_DISCRETIZATION']}")
hru_path = confluence.managers['domain'].discretize_domain()

# Check outputs
print("\nDomain discretization complete:")
print(f"  - HRUs created: {hru_path is not None}")

## 7. Visualize the Delineated Domain
Let's see what our watershed looks like:

In [None]:
# Visualize the watershed
basin_path = project_dir / 'shapefiles' / 'river_basins'
if basin_path.exists():
    basin_files = list(basin_path.glob('*.shp'))
    
    if basin_files:
        fig, ax = plt.subplots(figsize=(12, 10))
        
        # Load watershed and pour point
        basin_gdf = gpd.read_file(basin_files[0])
        pour_point_gdf = gpd.read_file(pour_point_path)
        
        # Reproject for visualization
        basin_web = basin_gdf.to_crs(epsg=3857)
        pour_web = pour_point_gdf.to_crs(epsg=3857)
        
        # Plot watershed
        basin_web.plot(ax=ax, facecolor='lightblue', edgecolor='navy', 
                       linewidth=2, alpha=0.7)
        
        # Add pour point
        pour_web.plot(ax=ax, color='red', markersize=200, marker='o', 
                      edgecolor='white', linewidth=2, zorder=5)
                
        # Set extent
        minx, miny, maxx, maxy = basin_web.total_bounds
        pad = 5000
        ax.set_xlim(minx - pad, maxx + pad)
        ax.set_ylim(miny - pad, maxy + pad)
        
        ax.set_title('Bow River Watershed at Banff \n All water from this area flows to the pour point', 
                    fontsize=16, fontweight='bold', pad=20)
        
        ax.axis('off')
        plt.tight_layout()
        plt.show()

## 8. Model Agnostic Data Pre-Processing - Observed data
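For a lumped model, the entire watershed becomes a single Hydrologic Response Unit (HRU). This simplification assumes uniform characteristics across the watershed, which is an approximation but a useful one for many applications. With the domain in place, we now process the observed streamflow record that the simulation is compared against in the results section.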


In [None]:
# Step 3: Model Agnostic Data Pre-Processing
print("=== Step 3: Model Agnostic Data Pre-Processing ===")

# Process observed data
print("Processing observed streamflow data...")
confluence.managers['data'].process_observed_data()

In [None]:
# Visualize observed streamflow data
obs_path = project_dir / 'observations' / 'streamflow' / 'preprocessed' / f"{confluence.config['DOMAIN_NAME']}_streamflow_processed.csv"
if obs_path.exists():
    obs_df = pd.read_csv(obs_path)
    obs_df['datetime'] = pd.to_datetime(obs_df['datetime'])
    
    fig, ax = plt.subplots(figsize=(14, 6))
    ax.plot(obs_df['datetime'], obs_df['discharge_cms'], 
            linewidth=1.5, color='blue', alpha=0.7)
    
    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel('Discharge (m³/s)', fontsize=12)
    ax.set_title(f'Observed Streamflow - Bow River at Banff (WSC Station: {confluence.config["STATION_ID"]})', 
                fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3)
    
    # Add statistics
    ax.text(0.02, 0.95, f'Mean: {obs_df["discharge_cms"].mean():.1f} m³/s\nMax: {obs_df["discharge_cms"].max():.1f} m³/s', 
            transform=ax.transAxes, 
            bbox=dict(boxstyle='round,pad=0.5', facecolor='white', alpha=0.8),
            verticalalignment='top')
    
    plt.tight_layout()
    plt.show()

## 9. Model Agnostic Data Pre-Processing - Forcing data

In [None]:
# Acquire forcings
print(f"\nAcquiring forcing data: {confluence.config['FORCING_DATASET']}")
confluence.managers['data'].acquire_forcings()

## 10. Model Agnostic Data Pre-Processing - Remapping and zonal statistics

In [None]:
# Run model-agnostic preprocessing
print("\nRunning model-agnostic preprocessing...")
confluence.managers['data'].run_model_agnostic_preprocessing()
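
For a lumped basin, remapping gridded forcing to the catchment comes down to a weighted average over the grid cells that overlap it. The sketch below illustrates that idea only; it is not CONFLUENCE's remapping code, `basin_average` is a hypothetical helper, and the overlap areas and precipitation values are made up for the example.

```python
def basin_average(cell_values, cell_weights):
    """Weighted mean of gridded values over a single basin (hypothetical helper)."""
    total_weight = sum(cell_weights)
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive number")
    return sum(v * w for v, w in zip(cell_values, cell_weights)) / total_weight


# Overlap area (km²) between each forcing grid cell and the basin: illustrative numbers.
overlap_km2 = [120.0, 300.0, 80.0]

# Precipitation in each grid cell for two time steps (mm/h): illustrative numbers.
precip_by_step = [
    [1.0, 2.0, 4.0],
    [0.0, 0.5, 1.0],
]

# One basin-average value per time step.
basin_precip = [basin_average(step, overlap_km2) for step in precip_by_step]
print(basin_precip)
```

Cells with a larger overlap pull the basin value towards their own, which is why the first time step averages to about 2.08 mm/h rather than the unweighted mean of about 2.33 mm/h.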

## 12. Model-Specific - Preprocessing
Now we prepare inputs specific to our chosen hydrological model (SUMMA in this case). Each model has its own requirements for input format and configuration.

In [None]:
# Step 4: Model Specific Processing and Initialization
print("=== Step 4: Model Specific Processing and Initialization ===")

# Preprocess models
print(f"Preparing {confluence.config['HYDROLOGICAL_MODEL']} input files...")
confluence.managers['model'].preprocess_models()

## 13. Model-Specific - Instantiation

In [None]:
# Run models
print(f"\nRunning {confluence.config['HYDROLOGICAL_MODEL']} model...")
confluence.managers['model'].run_models()

print("\nModel run complete")
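
The results section treats the model's routed runoff as a depth rate (m/s) and multiplies it by the catchment area (m²) to obtain discharge (m³/s). The arithmetic is simple but easy to get wrong by a factor of 10⁶ if the area is in km², so here is a small worked example. The area and runoff values are illustrative and are not the Bow at Banff figures.

```python
SECONDS_PER_DAY = 86400


def runoff_to_discharge(runoff_m_per_s, area_m2):
    """Convert a runoff depth rate (m/s) over an area (m²) to discharge (m³/s)."""
    return runoff_m_per_s * area_m2


# Illustrative values only.
area_km2 = 2000.0
area_m2 = area_km2 * 1e6          # km² -> m²
runoff_m_per_s = 2.0e-8           # runoff depth rate

discharge_cms = runoff_to_discharge(runoff_m_per_s, area_m2)
runoff_mm_per_day = runoff_m_per_s * SECONDS_PER_DAY * 1000  # m/s -> mm/day

print(f"Discharge: {discharge_cms:.1f} m³/s")
print(f"Equivalent runoff: {runoff_mm_per_day:.3f} mm/day")
```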

## 14. Visualisation of Results

In [None]:
# Step 14: Visualize Observed vs. Simulated Streamflow
print("=== Step 14: Comparing Observed vs. Simulated Streamflow ===")
import numpy as np 

# 1. Load the observed streamflow data
obs_path = project_dir / 'observations' / 'streamflow' / 'preprocessed' / f"{confluence.config['DOMAIN_NAME']}_streamflow_processed.csv"
if not obs_path.exists():
    print(f"Warning: Observed streamflow data not found at {obs_path}")
    print("Checking for alternative locations...")
    alt_paths = list(Path(config_dict['CONFLUENCE_DATA_DIR']).glob(f"**/observations/streamflow/preprocessed/*_streamflow_processed.csv"))
    if alt_paths:
        obs_path = alt_paths[0]
        print(f"Found alternative streamflow data at: {obs_path}")
    else:
        print("No observed streamflow data found. Only simulated data will be displayed.")

# 2. Load the simulated streamflow data from SUMMA output
sim_path = Path(config_dict['CONFLUENCE_DATA_DIR']) / f"domain_{config_dict['DOMAIN_NAME']}" / "simulations" / config_dict['EXPERIMENT_ID'] / "SUMMA" / f"{config_dict['EXPERIMENT_ID']}_timestep.nc"

# Check for alternative NetCDF file patterns if not found
if not sim_path.exists():
    print(f"Simulated data not found at {sim_path}")
    print("Checking for alternative NetCDF files...")
    alt_sim_paths = list(Path(config_dict['CONFLUENCE_DATA_DIR']).glob(
        f"domain_{config_dict['DOMAIN_NAME']}/simulations/{config_dict['EXPERIMENT_ID']}/SUMMA/*.nc"))
    
    if alt_sim_paths:
        sim_path = alt_sim_paths[0]
        print(f"Found alternative simulation data at: {sim_path}")
    else:
        raise FileNotFoundError(f"No simulation results found for experiment {config_dict['EXPERIMENT_ID']}")

# Load simulated data
print(f"Loading simulated data from: {sim_path}")
ds = xr.open_dataset(sim_path)

# Extract averageRoutedRunoff
print("Extracting 'averageRoutedRunoff' variable...")
if 'averageRoutedRunoff' in ds:
    # Extract and convert to DataFrame
    sim_runoff = ds['averageRoutedRunoff'].to_dataframe().reset_index()
    
    # Get catchment area from the river basin shapefile to convert from m/s to m³/s
    basin_shapefile = config_dict.get('RIVER_BASINS_NAME', 'default')
    if basin_shapefile == 'default':
        basin_shapefile = f"{config_dict['DOMAIN_NAME']}_riverBasins_{config_dict.get('DOMAIN_DEFINITION_METHOD', 'lumped')}.shp"
    
    basin_path = project_dir / "shapefiles" / "river_basins" / basin_shapefile
    
    try:
        print(f"Loading catchment shapefile from: {basin_path}")
        basin_gdf = gpd.read_file(basin_path)
        area_col = config_dict.get('RIVER_BASIN_SHP_AREA', 'GRU_area')
        
        # Area should be in m²
        if area_col in basin_gdf.columns:
            area_m2 = basin_gdf[area_col].sum()
            print(f"Catchment area: {area_m2:.2f} m² ({area_m2/1e6:.2f} km²)")
            
            # Convert from m/s to m³/s by multiplying by area in m²
            # Assuming first GRU for lumped basin simulation if multiple GRUs exist
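            if 'gru' in sim_runoff.columns:
                # Select the first GRU by its actual label rather than assuming the label is 1
                first_gru = sim_runoff['gru'].iloc[0]
                sim_runoff = sim_runoff[sim_runoff['gru'] == first_gru][['time', 'averageRoutedRunoff']]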
            else:
                sim_runoff = sim_runoff[['time', 'averageRoutedRunoff']]
            
            # Convert units: m/s -> m³/s
            sim_runoff['discharge_cms'] = sim_runoff['averageRoutedRunoff'] * area_m2
            print(f"Converted runoff from m/s to m³/s (multiplied by basin area)")
        else:
            print(f"Warning: Area column '{area_col}' not found in catchment shapefile")
            sim_runoff['discharge_cms'] = sim_runoff['averageRoutedRunoff']  # Use raw values as fallback
    except Exception as e:
        print(f"Error getting basin area: {str(e)}. Using raw values.")
        sim_runoff['discharge_cms'] = sim_runoff['averageRoutedRunoff']  # Use raw values as fallback
    
    # Set index to time for easier processing
    sim_runoff.set_index('time', inplace=True)
    sim_df = sim_runoff[['discharge_cms']]
else:
    print("Warning: 'averageRoutedRunoff' variable not found in the SUMMA output")
    print("Available variables:", list(ds.data_vars))
    raise ValueError("Required 'averageRoutedRunoff' variable not found in SUMMA output")

# Load observed data
obs_df = None
if obs_path.exists():
    print(f"Loading observed streamflow data from: {obs_path}")
    obs_df = pd.read_csv(obs_path)
    obs_df['datetime'] = pd.to_datetime(obs_df['datetime'])
    obs_df.set_index('datetime', inplace=True)
    print(f"Observed data period: {obs_df.index.min()} to {obs_df.index.max()}")
    print(f"Observed streamflow range: {obs_df['discharge_cms'].min():.2f} to {obs_df['discharge_cms'].max():.2f} m³/s")

# Show simulated data info
print(f"Simulated data period: {sim_df.index.min()} to {sim_df.index.max()}")
print(f"Simulated streamflow range: {sim_df['discharge_cms'].min():.2f} to {sim_df['discharge_cms'].max():.2f} m³/s")

# Find common date range if observed data exists
if obs_df is not None:
    # Ensure same frequency for both datasets
    obs_daily = obs_df.resample('D').mean()  # Daily mean if multiple obs per day
    sim_daily = sim_df.resample('D').mean()  # Daily mean if sub-daily sim data
    
    # Find common date range
    start_date = max(obs_daily.index.min(), sim_daily.index.min())
    end_date = min(obs_daily.index.max(), sim_daily.index.max())
    
    print(f"\nCommon data period: {start_date} to {end_date}")

    # Advance the start date to skip the initial spinup
    start_date = start_date + pd.Timedelta(days=30)
    
    # Filter to common period
    obs_period = obs_daily.loc[start_date:end_date]
    sim_period = sim_daily.loc[start_date:end_date]
    
    # Calculate performance metrics
    # Calculate root mean square error (RMSE)
    rmse = ((obs_period['discharge_cms'] - sim_period['discharge_cms'])**2).mean()**0.5
    
    # Calculate Nash-Sutcliffe Efficiency (NSE)
    mean_obs = obs_period['discharge_cms'].mean()
    numerator = ((obs_period['discharge_cms'] - sim_period['discharge_cms'])**2).sum()
    denominator = ((obs_period['discharge_cms'] - mean_obs)**2).sum()
    nse = 1 - (numerator / denominator)
    
    # Calculate Percent Bias (PBIAS)
    pbias = 100 * (sim_period['discharge_cms'].sum() - obs_period['discharge_cms'].sum()) / obs_period['discharge_cms'].sum()
    
    # Calculate Kling-Gupta Efficiency (KGE)
    r = obs_period['discharge_cms'].corr(sim_period['discharge_cms'])  # Correlation
    alpha = sim_period['discharge_cms'].std() / obs_period['discharge_cms'].std()  # Relative variability
    beta = sim_period['discharge_cms'].mean() / obs_period['discharge_cms'].mean()  # Bias ratio
    kge = 1 - ((r - 1)**2 + (alpha - 1)**2 + (beta - 1)**2)**0.5
    
    print(f"Performance metrics:")
    print(f"  - RMSE: {rmse:.2f} m³/s")
    print(f"  - NSE: {nse:.2f}")
    print(f"  - PBIAS: {pbias:.2f}%")
    print(f"  - KGE: {kge:.2f}")
    
    # Create visualizations
    plt.figure(figsize=(16, 12))
    
    # 1. Time Series Plot - Full Period
    plt.subplot(2, 1, 1)
    plt.plot(obs_period.index, obs_period['discharge_cms'], 'b-', label='Observed', linewidth=1.5, alpha=0.7)
    plt.plot(sim_period.index, sim_period['discharge_cms'], 'r-', label='Simulated', linewidth=1.5, alpha=0.7)
    
    plt.title(f'Observed vs. Simulated Streamflow - {config_dict["DOMAIN_NAME"].replace("_", " ").title()}', fontsize=14)
    plt.xlabel('Date', fontsize=12)
    plt.ylabel('Discharge (m³/s)', fontsize=12)
    plt.grid(True, alpha=0.3)
    plt.legend(fontsize=12)
    
    # Add performance metrics as text box
    plt.text(0.02, 0.95, 
             f"RMSE: {rmse:.2f} m³/s\nNSE: {nse:.2f}\nPBIAS: {pbias:.2f}%\nKGE: {kge:.2f}",
             transform=plt.gca().transAxes, 
             fontsize=12,
             bbox=dict(facecolor='white', alpha=0.8, boxstyle='round,pad=0.5'))
    
    # 2. Scatter Plot with 1:1 line
    plt.subplot(2, 2, 3)
    plt.scatter(obs_period['discharge_cms'], sim_period['discharge_cms'], alpha=0.5, color='blue')
    
    # Add 1:1 line
    max_val = max(obs_period['discharge_cms'].max(), sim_period['discharge_cms'].max())
    plt.plot([0, max_val], [0, max_val], 'k--', label='1:1 line')
    
    plt.title('Observed vs. Simulated Comparison', fontsize=14)
    plt.xlabel('Observed Discharge (m³/s)', fontsize=12)
    plt.ylabel('Simulated Discharge (m³/s)', fontsize=12)
    plt.grid(True, alpha=0.3)
    plt.legend()
    
    # 3. Annual cycle plot - by month
    plt.subplot(2, 2, 4)
    
    # Calculate monthly means
    obs_monthly = obs_period.groupby(obs_period.index.month)['discharge_cms'].mean()
    sim_monthly = sim_period.groupby(sim_period.index.month)['discharge_cms'].mean()
    
    # Get month names for x-axis
    month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    
    # Plot
    plt.plot(range(1, 13), obs_monthly.reindex(range(1, 13)), 'b-o', label='Observed', linewidth=2)
    plt.plot(range(1, 13), sim_monthly.reindex(range(1, 13)), 'r-o', label='Simulated', linewidth=2)
    
    plt.title('Annual Cycle (Monthly Average)', fontsize=14)
    plt.xlabel('Month', fontsize=12)
    plt.ylabel('Average Discharge (m³/s)', fontsize=12)
    plt.grid(True, alpha=0.3)
    plt.xticks(range(1, 13), month_names)
    plt.legend()
    
    plt.tight_layout()
    plt.show()
    
    # Create Flow Duration Curve
    plt.figure(figsize=(10, 6))
    
    # Sort values in descending order
    obs_sorted = obs_period['discharge_cms'].sort_values(ascending=False)
    sim_sorted = sim_period['discharge_cms'].sort_values(ascending=False)
    
    # Calculate exceedance probabilities
    obs_ranks = np.arange(1., len(obs_sorted) + 1) / len(obs_sorted)
    sim_ranks = np.arange(1., len(sim_sorted) + 1) / len(sim_sorted)
    
    # Plot Flow Duration Curves
    plt.semilogy(obs_ranks * 100, obs_sorted, 'b-', label='Observed', linewidth=2)
    plt.semilogy(sim_ranks * 100, sim_sorted, 'r-', label='Simulated', linewidth=2)
    
    plt.title('Flow Duration Curve', fontsize=14)
    plt.xlabel('Exceedance Probability (%)', fontsize=12)
    plt.ylabel('Discharge (m³/s)', fontsize=12)
    plt.grid(True, which='both', alpha=0.3)
    plt.legend(fontsize=12)
    
    # Add low, medium and high flow regions
    plt.axvspan(0, 20, alpha=0.2, color='blue', label='High Flows')
    plt.axvspan(20, 70, alpha=0.2, color='green', label='Medium Flows')
    plt.axvspan(70, 100, alpha=0.2, color='red', label='Low Flows')
    
    # Add text labels for flow regions
    plt.text(10, max(obs_sorted.max(), sim_sorted.max()) * 0.8, 'High Flows', fontsize=10, ha='center')
    plt.text(45, max(obs_sorted.max(), sim_sorted.max()) * 0.1, 'Medium Flows', fontsize=10, ha='center')
    plt.text(85, max(obs_sorted.max(), sim_sorted.max()) * 0.02, 'Low Flows', fontsize=10, ha='center')
    
    plt.tight_layout()
    plt.show()

else:
    # If no observed data, just plot simulated
    plt.figure(figsize=(14, 6))
    plt.plot(sim_df.index, sim_df['discharge_cms'], '-', label='Simulated Streamflow', color='blue', linewidth=1.5)
    plt.title(f"Simulated Streamflow - {config_dict['DOMAIN_NAME'].replace('_', ' ').title()}", fontsize=14)
    plt.xlabel('Date', fontsize=12)
    plt.ylabel('Discharge (m³/s)', fontsize=12)
    plt.grid(True, alpha=0.3)
    plt.legend(fontsize=12)
    plt.tight_layout()
    plt.show()

# Close the dataset
ds.close()

print("\nStreamflow visualization complete")
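
The four scores computed in the cell above can also be written as small standalone functions, which makes their definitions easier to inspect. This is a plain-Python sketch that follows the same formulas as the notebook (including the sample standard deviation that pandas uses by default); the function names are our own and the input values are made up.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _sample_std(values):
    m = _mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def rmse(obs, sim):
    """Root mean square error, in the units of the data."""
    return math.sqrt(_mean([(o - s) ** 2 for o, s in zip(obs, sim)]))


def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is perfect, 0 matches the observed mean."""
    mean_obs = _mean(obs)
    numerator = sum((o - s) ** 2 for o, s in zip(obs, sim))
    denominator = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - numerator / denominator


def pbias(obs, sim):
    """Percent bias; positive means the simulation overestimates total volume."""
    return 100 * (sum(sim) - sum(obs)) / sum(obs)


def kge(obs, sim):
    """Kling-Gupta Efficiency from correlation, variability ratio and bias ratio."""
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    r = cov / math.sqrt(
        sum((o - mo) ** 2 for o in obs) * sum((s - ms) ** 2 for s in sim)
    )
    alpha = _sample_std(sim) / _sample_std(obs)
    beta = ms / mo
    return 1 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Made-up series: the simulation is exactly twice the observation.
obs = [10.0, 20.0, 30.0, 40.0]
sim = [2 * o for o in obs]

scores = {"RMSE": rmse(obs, sim), "NSE": nse(obs, sim),
          "PBIAS": pbias(obs, sim), "KGE": kge(obs, sim)}
print(scores)
```

Doubling every flow keeps the correlation at 1, yet NSE drops to -5 and KGE to about -0.41, with a PBIAS of 100%. Timing can be perfect while the volume is badly off, which is why the notebook reports several metrics rather than one.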

## 11. Model Agnostic Data Pre-Processing - Benchmarking

In [None]:
# Run benchmarking
print("\nRunning benchmarking analysis...")
benchmark_results = confluence.managers['analysis'].run_benchmarking()
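
What `run_benchmarking()` computes is not shown in this notebook. As general background, a benchmark is a deliberately simple predictor that a model should outperform. The sketch below scores two common choices with NSE, the mean of the observations and persistence (yesterday's flow), on made-up data. It is an assumption-laden illustration, not a description of CONFLUENCE's benchmarking module.

```python
def nse(obs, sim):
    """Nash-Sutcliffe Efficiency of sim against obs."""
    mean_obs = sum(obs) / len(obs)
    numerator = sum((o - s) ** 2 for o, s in zip(obs, sim))
    denominator = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - numerator / denominator


# Made-up daily flows (m³/s).
flows = [5.0, 6.0, 9.0, 14.0, 11.0, 7.0]

# Benchmark 1: always predict the mean observed flow.
mean_prediction = [sum(flows) / len(flows)] * len(flows)
mean_benchmark_nse = nse(flows, mean_prediction)

# Benchmark 2: persistence, i.e. predict each day with the previous day's flow.
# The first day has no predecessor, so it is left out of the evaluation.
persistence_nse = nse(flows[1:], flows[:-1])

print(f"Mean-flow benchmark NSE: {mean_benchmark_nse:.3f}")
print(f"Persistence benchmark NSE: {persistence_nse:.3f}")
```

The mean-flow benchmark scores an NSE of 0 by construction, so a simulated NSE is best read relative to such reference values rather than in isolation.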

## 15. Optional Steps - Optimization and Analysis

In [None]:
# Step 5 & 6: Optional Steps (Optimization and Analysis)
print("=== Step 5 & 6: Optional Steps ===")


## Alternative - Run Complete Workflow

In [None]:
# Alternative: Run the complete workflow in one step
# (Uncomment to use this instead of the step-by-step approach)

# confluence.run_workflow()
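
The internals of `run_workflow()` are not shown here, but the step-by-step cells in this notebook already define an order that can be replayed from a simple list. The sketch below uses a stand-in `RecordingManager` class (hypothetical, so the example runs without CONFLUENCE installed) and a hypothetical `run_steps` helper; the manager and method names are the ones called earlier in this notebook.

```python
# Manager/method pairs in the order this notebook calls them.
STEPS = [
    ("project", "setup_project"),
    ("project", "create_pour_point"),
    ("data", "acquire_attributes"),
    ("domain", "define_domain"),
    ("domain", "discretize_domain"),
    ("data", "process_observed_data"),
    ("data", "acquire_forcings"),
    ("data", "run_model_agnostic_preprocessing"),
    ("model", "preprocess_models"),
    ("model", "run_models"),
]


class RecordingManager:
    """Stand-in for a CONFLUENCE manager: records method calls instead of doing work."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getattr__(self, method_name):
        # Only reached for attributes that do not exist, i.e. the workflow methods.
        def call(*args, **kwargs):
            self.log.append(f"{self.name}.{method_name}")
        return call


def run_steps(managers, steps=STEPS):
    """Call each (manager, method) pair in order (hypothetical helper)."""
    for manager_name, method_name in steps:
        getattr(managers[manager_name], method_name)()


log = []
managers = {name: RecordingManager(name, log)
            for name in ("project", "data", "domain", "model")}
run_steps(managers)
print("\n".join(log))
```

With a real instance, passing `confluence.managers` in place of the stand-ins should replay the same sequence, assuming it supports the dictionary-style access used throughout this notebook.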

## Summary: Understanding the CONFLUENCE Workflow
Congratulations! You've completed a full lumped basin modeling workflow with CONFLUENCE. 

## Next Steps You Could Try:

- Experiment with different models (change HYDROLOGICAL_MODEL)
- Try distributed modeling (change SPATIAL_MODE to 'Distributed')
- Calibrate the model (use the optimization module)
- Analyze model sensitivity to different parameters
- Compare multiple model structures (decision analysis)
