# SYMFLUENCE Tutorial 04a — Logan River Workshop (Lumped SUMMA, Cloud Data)

## Introduction

This workshop notebook demonstrates how to set up a lumped SUMMA model for the Logan River at Logan, Utah using cloud-based data sources. The workflow includes:

1. **Configuration** — Set up a lumped basin model for the Logan River
2. **Domain Definition** — Delineate the watershed using TauDEM
3. **Data Acquisition** — Fetch AORC forcing data and USGS streamflow observations from cloud sources
4. **Model Execution** — Run SUMMA with mizuRoute routing
5. **Evaluation & Calibration** — Assess model performance and calibrate parameters

The **Logan River at Logan** is a snow-dominated mountain watershed in the Bear River Range of the Wasatch Mountains. USGS station 10109000 provides streamflow observations. The watershed covers approximately 218 km² with elevations ranging from ~1,400 m to over 2,900 m.

### 2i2c Environment Setup

This notebook is designed to work with the 2i2c JupyterHub environment using the pre-installed SYMFLUENCE virtual environment at `/tmp/symfluence`.

**Launch this notebook from the CLI:**
```bash
symfluence example launch 4a
```

In [None]:
# Environment verification
import sys
import warnings
from pathlib import Path

# Suppress experimental module warnings for cleaner output
warnings.filterwarnings('ignore', message='.*is an EXPERIMENTAL module.*')
warnings.filterwarnings('ignore', message='.*import failed.*')

print(f"Python executable: {sys.executable}")

# Verify SYMFLUENCE is available
try:
    import symfluence
    print(f"SYMFLUENCE version: {symfluence.__version__}")
    print(f"SYMFLUENCE location: {Path(symfluence.__file__).parent}")
except ImportError:
    print("ERROR: SYMFLUENCE not found. Please activate the symfluence environment.")
    sys.exit(1)

In [None]:
# Fix working directory if running from .ipynb_checkpoints
import os
from pathlib import Path

current_dir = Path.cwd()
print(f"Current directory: {current_dir}")

# If we're in .ipynb_checkpoints, move up to parent directory
if '.ipynb_checkpoints' in str(current_dir):
    correct_dir = current_dir.parent
    os.chdir(correct_dir)
    print(f"Changed to: {Path.cwd()}")
else:
    print("Working directory is correct")

# Verify we're in the workshop notebooks directory
expected_notebook = Path.cwd() / '04a_logan_river_workshop.ipynb'
if not expected_notebook.exists():
    print(f"WARNING: Expected notebook not found at {expected_notebook}")
else:
    print("Notebook location verified")

## Step 1 — Configuration

Create a configuration for the Logan River lumped basin model. Key settings:
- **Domain**: Lumped (single HRU) representation
- **Forcing**: AORC (Analysis of Record for Calibration), 1 km hourly gridded data
- **Observations**: USGS streamflow from station 10109000
- **Period**: 4 years (2018-2021) with 1-year spinup
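
The period split can be sanity-checked with plain pandas before any data is downloaded. The sketch below is not part of SYMFLUENCE: `parse_period` is a hypothetical helper, and the date strings mirror the `'start, end'` format passed to the configuration in this step.

```python
import pandas as pd

def parse_period(period):
    """Split a 'start, end' string into two Timestamps (hypothetical helper)."""
    start, end = (pd.Timestamp(part.strip()) for part in period.split(','))
    return start, end

periods = {
    'spinup': '2018-01-01, 2018-12-31',
    'calibration': '2019-01-01, 2020-12-31',
    'evaluation': '2021-01-01, 2021-12-31',
}
parsed = {name: parse_period(text) for name, text in periods.items()}

# Each window should start exactly one day after the previous window ends
names = list(parsed)
for prev, nxt in zip(names, names[1:]):
    gap = parsed[nxt][0] - parsed[prev][1]
    assert gap == pd.Timedelta(days=1), f"{prev} -> {nxt} gap is {gap}"

# Report the length of each window (end date counted as inclusive)
for name, (start, end) in parsed.items():
    n_days = (end - start).days + 1
    print(f"{name:12s} {start.date()} to {end.date()} ({n_days} days)")
```

If a window is shifted later on, the contiguity check fails immediately instead of silently leaving a gap or overlap between periods.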

In [None]:
# Step 1 — Create basin-scale configuration using new config system

from pathlib import Path
from symfluence import SYMFLUENCE
from symfluence.core.config.models import SymfluenceConfig
import os

# === Logan River Basin Configuration ===
# Uses the from_minimal factory: most settings are lowercase snake_case,
# while path/install settings and OPTIMIZATION_METHODS use UPPERCASE names

# Ensure we're working from the correct directory
current_dir = Path.cwd()
print(f"INITIAL working directory: {current_dir}")

if '.ipynb_checkpoints' in str(current_dir):
    current_dir = current_dir.parent
    os.chdir(current_dir)
    print(f"CHANGED to: {Path.cwd()}")
else:
    print("Working directory is correct")

# Set explicit paths
# Navigate to repo root (2 levels up from notebooks directory)
repo_root = current_dir.parent.parent
data_dir = repo_root.parent / 'SYMFLUENCE_data'  # Sibling to repo

taudem_path = str(data_dir / 'installs' / 'TauDEM' / 'bin')
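# Assumes mizuRoute is installed under the same data directory as TauDEM;
# adjust this path if your installation lives elsewhere
mizuroute_path = str(data_dir / 'installs' / 'mizuRoute' / 'route' / 'bin')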

print(f"Notebook directory: {current_dir}")
print(f"Repo root (CODE_DIR): {repo_root}")
print(f"Data directory (DATA_DIR): {data_dir}")
print(f"TauDEM directory: {taudem_path}")
print(f"mizuRoute directory: {mizuroute_path}")

config = SymfluenceConfig.from_minimal(
    # ============================================================================
    # BASIC IDENTIFICATION
    # ============================================================================
    domain_name='Logan_River_at_Logan',
    experiment_id='workshop_run_1',
    
    # ============================================================================
    # PATHS (explicit to avoid working directory issues)
    # Use UPPERCASE parameter names as required by from_minimal
    # ============================================================================
    SYMFLUENCE_DATA_DIR=str(data_dir),
    SYMFLUENCE_CODE_DIR=str(repo_root),
    TAUDEM_DIR=taudem_path,
    SUMMA_INSTALL_PATH='default',  # Will use DATA_DIR/installs/summa/bin/summa_sundials.exe
    MIZUROUTE_INSTALL_PATH=mizuroute_path,
    
    # ============================================================================
    # HYDROLOGICAL MODEL
    # ============================================================================
    model='SUMMA',                              # Hydrological model (sets defaults)
    routing_model='mizuRoute',                  # Routing model
    
    # ============================================================================
    # SIMULATION PERIOD (4 years: 2018-2021)
    # ============================================================================
    time_start='2018-01-01 01:00',
    time_end='2021-12-31 23:00',
    
    # Time period definitions
    spinup_period='2018-01-01, 2018-12-31',        # Year 1: Model spinup
    calibration_period='2019-01-01, 2020-12-31',   # Years 2-3: Parameter calibration
    evaluation_period='2021-01-01, 2021-12-31',    # Year 4: Model evaluation
    
    # ============================================================================
    # SPATIAL DOMAIN (Logan River at Logan, UT)
    # ============================================================================
    # USGS station 10109000: 41.7443°N, 111.8086°W
    pour_point_coords='41.743098/-111.786432',
    bounding_box_coords='42.15/-111.90/41.70/-111.40',  # lat_max/lon_min/lat_min/lon_max
    
    # Domain discretization
    definition_method='lumped',
    discretization='GRUs',
    lumped_watershed_method='TauDEM',
    
    # ============================================================================
    # DATA SOURCES & FORCING
    # ============================================================================
    data_access='cloud',                        # Use cloud-based data sources
    forcing_dataset='AORC',                     # NOAA Analysis of Record for Calibration
    forcing_measurement_height=10,              # AORC wind measurements at 10m
    
    # DEM source (use Copernicus - free and open)
    dem_source='copernicus',                    # Copernicus DEM (30m, global, free)
    download_dem=True,
    
    # ============================================================================
    # STREAMFLOW OBSERVATIONS
    # ============================================================================
    station_id='10109000',                      # USGS station ID
    streamflow_data_provider='USGS',
    download_usgs_data=True,
    
    # ============================================================================
    # CALIBRATION SETTINGS
    # ============================================================================
    # Enable optimization methods
    OPTIMIZATION_METHODS=['iteration'],         # Enable iterative optimization
    
    # Parameters to calibrate
    params_to_calibrate='k_soil,theta_sat,aquiferBaseflowExp,aquiferBaseflowRate,qSurfScale,summerLAI,frozenPrecipMultip,Fcapil,tempCritRain,heightCanopyTop,heightCanopyBottom,windReductionParam,vGn_n',
    basin_params_to_calibrate='routingGammaScale,routingGammaShape',
    
    # Optimization configuration
    optimization_target='streamflow',
    optimization_algorithm='DDS',               # Dynamically Dimensioned Search
    optimization_metric='KGE',                  # Kling-Gupta Efficiency
    calibration_timestep='hourly',
    iterations=100,                             # Number of calibration iterations
)

# Verify the config was created correctly
print("\nConfig verification:")
print(f"  Config taudem_dir: {config.paths.taudem_dir}")
print(f"  Optimization methods: {config.optimization.methods}")

# ============================================================================
# SAVE CONFIGURATION (OPTIONAL)
# ============================================================================
config_path = Path('./config_logan_river_lumped.yaml')
config_dict = config.to_dict(flatten=True)
import yaml
with open(config_path, 'w') as f:
    yaml.dump(config_dict, f, default_flow_style=False, sort_keys=False)
print(f"\nConfiguration saved to: {config_path}")

# ============================================================================
# INITIALIZE SYMFLUENCE
# ============================================================================
symfluence = SYMFLUENCE(config)

# Create project structure
project_dir = symfluence.managers['project'].setup_project()
pour_point_path = symfluence.managers['project'].create_pour_point()

print(f"\nProject structure created at: {project_dir}")
print(f"Pour point shapefile: {pour_point_path}")
print("="*80)

## Step 2 — Domain Definition

Delineate the Logan River watershed using TauDEM and create a single lumped HRU.

### Step 2a — Geospatial Attribute Acquisition

Acquire elevation, land cover, and soil data from cloud sources.

In [None]:
# Step 2a — Acquire geospatial attributes from cloud
symfluence.managers['data'].acquire_attributes()
print("Attribute acquisition complete")

### Step 2b — Watershed Delineation

Delineate the watershed boundary from the pour point using TauDEM.
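
To illustrate the idea behind pour-point delineation, here is a toy sketch on a 3 x 3 elevation grid: each cell drains to its steepest-descent neighbour (the D8 rule), and the watershed is every cell whose flow path reaches the pour point. This is not TauDEM's implementation, which also conditions the DEM (for example by removing pits) and works on full rasters; the function names and the tiny DEM are made up for illustration.

```python
from collections import deque

import numpy as np

# Offsets (row, col) to the 8 neighbouring cells
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def d8_downstream(dem):
    """Map each cell to its steepest-descent neighbour (None if no lower neighbour)."""
    rows, cols = dem.shape
    down = {}
    for r in range(rows):
        for c in range(cols):
            best, best_slope = None, 0.0
            for dr, dc in NEIGHBOURS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    # Diagonal neighbours are sqrt(2) cell widths away
                    slope = (dem[r, c] - dem[rr, cc]) / np.hypot(dr, dc)
                    if slope > best_slope:
                        best, best_slope = (rr, cc), slope
            down[(r, c)] = best
    return down

def delineate(dem, pour_point):
    """Boolean mask of all cells that drain through pour_point (row, col)."""
    # Invert the downstream mapping so we can walk upstream from the pour point
    upstream = {}
    for cell, target in d8_downstream(dem).items():
        if target is not None:
            upstream.setdefault(target, []).append(cell)

    mask = np.zeros(dem.shape, dtype=bool)
    queue = deque([pour_point])
    while queue:
        cell = queue.popleft()
        if not mask[cell]:
            mask[cell] = True
            queue.extend(upstream.get(cell, []))
    return mask

# Made-up elevations (m) sloping towards the bottom-right corner
dem = np.array([[9.0, 8.0, 7.0],
                [8.0, 6.0, 5.0],
                [7.0, 5.0, 1.0]])

basin = delineate(dem, pour_point=(2, 2))
sub_basin = delineate(dem, pour_point=(1, 2))
print(f"Cells draining to (2, 2): {basin.sum()}")
print(f"Cells draining to (1, 2): {sub_basin.sum()}")
```

Moving the pour point upstream shrinks the mask, which is why the pour point coordinates in the configuration determine the basin area reported in the visualization step.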

In [None]:
# Step 2b — Watershed delineation
watershed_path = symfluence.managers['domain'].define_domain()
print("Watershed delineation complete")
print(f"Watershed file: {watershed_path}")

### Step 2c — Domain Discretization

Create a single lumped HRU for the watershed.

In [None]:
# Step 2c — Discretization (single lumped HRU)
hru_path = symfluence.managers['domain'].discretize_domain()
print("Domain discretization complete")
print(f"HRU file: {hru_path}")

### Step 2d — Visualization

Visualize the delineated watershed and pour point.

In [None]:
# Step 2d — Basin visualization

import geopandas as gpd
import matplotlib.pyplot as plt

# Load spatial data
basin_path = project_dir / 'shapefiles' / 'river_basins' / f"{config.domain.name}_riverBasins_lumped.shp"
hru_file = project_dir / 'shapefiles' / 'catchment' / 'lumped' / config.domain.experiment_id / f"{config.domain.name}_HRUs_GRUs.shp"

watershed_gdf = gpd.read_file(str(basin_path))
hru_gdf = gpd.read_file(str(hru_file))
pour_point_gdf = gpd.read_file(pour_point_path)

# Calculate area (UTM Zone 12N for Utah)
watershed_proj = watershed_gdf.to_crs('EPSG:32612')
area_km2 = watershed_proj.geometry.area.sum() / 1e6

# Plot
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
watershed_gdf.boundary.plot(ax=ax, color='blue', linewidth=2, label='Watershed')
hru_gdf.plot(ax=ax, facecolor='lightblue', edgecolor='blue', alpha=0.3)
pour_point_gdf.plot(ax=ax, color='red', markersize=150, marker='*', label=f'Pour Point (USGS {config.evaluation.streamflow.station_id})')

ax.set_title(f"Logan River at Logan\nArea: {area_km2:.0f} km²", fontweight='bold', fontsize=14)
ax.legend(loc='upper right')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.tight_layout()
plt.show()

print(f"Watershed area: {area_km2:.0f} km²")
print(f"Number of HRUs: {len(hru_gdf)} (lumped)")

## Step 3 — Data Acquisition and Preprocessing

Fetch forcing data (AORC) and streamflow observations (USGS) from cloud sources.

### Step 3a — USGS Streamflow Observations

Download and process USGS streamflow data for station 10109000.
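
USGS publishes discharge in cubic feet per second (parameter code 00060), while the evaluation later in this notebook works in m³/s. The sketch below shows that unit conversion on made-up values. The column name `discharge_cfs` is an assumption for illustration, `discharge_cms` matches the column read in the evaluation step, and the real processing happens inside `process_observed_data()`.

```python
import pandas as pd

# One cubic foot in cubic metres (exact, since 1 ft = 0.3048 m by definition)
CFS_TO_CMS = 0.3048 ** 3

raw = pd.DataFrame(
    {'discharge_cfs': [100.0, 250.0, 1000.0]},  # made-up values, not gauge data
    index=pd.to_datetime(['2019-05-01 00:00', '2019-05-01 01:00', '2019-05-01 02:00']),
)
raw.index.name = 'datetime'

# Convert to cubic metres per second
raw['discharge_cms'] = raw['discharge_cfs'] * CFS_TO_CMS
print(raw.round(3))
```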

In [None]:
# Step 3a — Download and process USGS streamflow data
symfluence.managers['data'].process_observed_data()

print("USGS streamflow data acquisition complete")

### Step 3b — AORC Meteorological Forcing

Download AORC forcing data from NOAA's cloud archive (AWS S3). AORC provides:
- 1 km spatial resolution
- Hourly temporal resolution
- Complete forcing variables: precipitation, temperature, humidity, wind, radiation, pressure

In [None]:
# Step 3b — Acquire AORC forcing data from cloud
symfluence.managers['data'].acquire_forcings()
print("AORC forcing acquisition complete")

### Step 3c — Model-Agnostic Preprocessing

Standardize forcing data: variable names, units, and spatial averaging over the watershed.
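
For a lumped model, gridded forcing is reduced to a single basin-average time series. A minimal sketch of that reduction, assuming an area-weighted mean in which each grid cell is weighted by the fraction of the cell lying inside the basin. The arrays are made up and the variable names are illustrative; SYMFLUENCE's internal remapping may differ in detail.

```python
import numpy as np

# Made-up air temperature (K) with shape (time, y, x): 4 timesteps on a 2 x 3 grid
temperature = 270.0 + np.arange(24, dtype=float).reshape(4, 2, 3)

# Fraction of each grid cell that lies inside the basin (0 = outside, 1 = fully inside)
inside_fraction = np.array([[0.00, 0.50, 1.00],
                            [0.25, 1.00, 0.25]])

# Flatten the spatial dimensions and take a weighted mean per timestep
flat = temperature.reshape(len(temperature), -1)
basin_mean = np.average(flat, axis=1, weights=inside_fraction.ravel())

print(basin_mean)
```

Cells that only touch the basin edge contribute in proportion to their overlap, so the result does not depend on how the grid happens to line up with the watershed boundary.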

In [None]:
# Step 3c — Model-agnostic preprocessing
symfluence.managers['data'].run_model_agnostic_preprocessing()
print("Model-agnostic preprocessing complete")

## Step 4 — Model Configuration and Execution

Configure SUMMA for the lumped basin and run the simulation with mizuRoute routing.

In [None]:
# Step 4a — SUMMA-specific preprocessing
symfluence.managers['model'].preprocess_models()
print("SUMMA configuration complete")

In [None]:
# Step 4b — Model execution
print(f"Running {config.model.hydrological_model} with {config.model.routing_model}...")
print(f"Simulation period: {config.domain.time_start} to {config.domain.time_end}")
symfluence.managers['model'].run_models()
print("Basin-scale simulation complete")

## Step 5 — Streamflow Evaluation

Compare simulated streamflow against USGS observations using standard hydrological metrics.
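
For reference, the three metrics computed in this step can be written as follows, where $Q_{o,t}$ and $Q_{s,t}$ are observed and simulated discharge at time $t$, $r$ is their linear correlation, and $\mu$ and $\sigma$ denote mean and standard deviation. The KGE expression is the original three-component form, which is what the `kge` function in this step evaluates.

$$
\mathrm{NSE} = 1 - \frac{\sum_t \left(Q_{o,t} - Q_{s,t}\right)^2}{\sum_t \left(Q_{o,t} - \mu_o\right)^2}
$$

$$
\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2},
\qquad \alpha = \frac{\sigma_s}{\sigma_o}, \quad \beta = \frac{\mu_s}{\mu_o}
$$

$$
\mathrm{PBIAS} = 100 \cdot \frac{\sum_t \left(Q_{s,t} - Q_{o,t}\right)}{\sum_t Q_{o,t}}
$$

NSE and KGE both equal 1 for a perfect simulation. With this sign convention, a positive PBIAS means the model overestimates total flow volume.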

In [None]:
# Step 5 — Streamflow evaluation

import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr

# Load basin area from shapefile
basin_path = project_dir / 'shapefiles' / 'river_basins' / f"{config.domain.name}_riverBasins_lumped.shp"
watershed_gdf = gpd.read_file(str(basin_path))
watershed_proj = watershed_gdf.to_crs('EPSG:32612')  # UTM Zone 12N for Utah
basin_area_m2 = watershed_proj.geometry.area.sum()
basin_area_km2 = basin_area_m2 / 1e6

print(f"Basin area: {basin_area_km2:.2f} km²")

# Load observed streamflow
obs_path = project_dir / "observations" / "streamflow" / "preprocessed" / f"{config.domain.name}_streamflow_processed.csv"
obs_df = pd.read_csv(obs_path, parse_dates=['datetime'])
obs_df.set_index('datetime', inplace=True)

# Load simulated streamflow from SUMMA output
sim_dir = project_dir / "simulations" / config.domain.experiment_id / "SUMMA"
sim_files = sorted(sim_dir.glob('*_timestep.nc'))  # sorted so the file choice is deterministic
if not sim_files:
    raise FileNotFoundError(f"No SUMMA output found in: {sim_dir}")

sim_ds = xr.open_dataset(sim_files[0])
sim_df = sim_ds['averageRoutedRunoff'].to_dataframe().reset_index()
sim_df = sim_df.rename(columns={'time': 'datetime', 'averageRoutedRunoff': 'discharge_m_s'})
sim_df.set_index('datetime', inplace=True)

# Convert from m/s to m³/s
sim_df['discharge_sim'] = sim_df['discharge_m_s'] * basin_area_m2

# Exclude spinup period (its end date is inclusive, so evaluation starts the next day)
spinup_end = pd.to_datetime(config.domain.spinup_period.split(',')[1].strip())
eval_start = spinup_end.normalize() + pd.Timedelta(days=1)
print(f"Excluding spinup period through: {spinup_end.date()}")

# Merge and align
eval_df = obs_df.join(sim_df[['discharge_sim']], how='inner')
eval_df = eval_df[eval_df.index >= eval_start]

obs_valid = eval_df['discharge_cms'].dropna()
sim_valid = eval_df.loc[obs_valid.index, 'discharge_sim']

print(f"Evaluation period: {obs_valid.index[0]} to {obs_valid.index[-1]}")
print(f"Number of timesteps: {len(obs_valid)}")

# Calculate evaluation metrics
def nse(obs, sim):
    return float(1 - np.sum((obs - sim)**2) / np.sum((obs - obs.mean())**2))

def kge(obs, sim):
    r = obs.corr(sim)
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return float(1 - np.sqrt((r-1)**2 + (alpha-1)**2 + (beta-1)**2))

def pbias(obs, sim):
    return float(100 * (sim.sum() - obs.sum()) / obs.sum())

metrics = {
    'NSE': round(nse(obs_valid, sim_valid), 3),
    'KGE': round(kge(obs_valid, sim_valid), 3),
    'PBIAS': round(pbias(obs_valid, sim_valid), 1)
}

print("\nPerformance Metrics (Uncalibrated):")
for k, v in metrics.items():
    print(f"  {k}: {v}")

In [None]:
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Time series (top left)
axes[0, 0].plot(obs_valid.index, obs_valid.values, 'b-', label='Observed (USGS)', linewidth=1.2, alpha=0.7)
axes[0, 0].plot(sim_valid.index, sim_valid.values, 'r-', label='Simulated (SUMMA)', linewidth=1.2, alpha=0.7)
axes[0, 0].set_ylabel('Discharge (m³/s)')
axes[0, 0].set_title('Streamflow Time Series')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].text(0.02, 0.95, f"NSE: {metrics['NSE']}\nKGE: {metrics['KGE']}\nBias: {metrics['PBIAS']}%",
                transform=axes[0, 0].transAxes, verticalalignment='top',
                bbox=dict(facecolor='white', alpha=0.8), fontsize=9)

# Scatter (top right)
axes[0, 1].scatter(obs_valid, sim_valid, alpha=0.5, s=10)
max_val = max(obs_valid.max(), sim_valid.max())
axes[0, 1].plot([0, max_val], [0, max_val], 'k--', alpha=0.5)
axes[0, 1].set_xlabel('Observed (m³/s)')
axes[0, 1].set_ylabel('Simulated (m³/s)')
axes[0, 1].set_title('Observed vs Simulated')
axes[0, 1].grid(True, alpha=0.3)

# Monthly climatology (bottom left)
monthly_obs = obs_valid.groupby(obs_valid.index.month).mean()
monthly_sim = sim_valid.groupby(sim_valid.index.month).mean()
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
axes[1, 0].plot(monthly_obs.index, monthly_obs.values, 'b-o', label='Observed', markersize=6)
axes[1, 0].plot(monthly_sim.index, monthly_sim.values, 'r-o', label='Simulated', markersize=6)
axes[1, 0].set_xticks(range(1, 13))
axes[1, 0].set_xticklabels(month_names)
axes[1, 0].set_ylabel('Mean Discharge (m³/s)')
axes[1, 0].set_title('Seasonal Flow Regime')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Flow duration curve (bottom right)
obs_sorted = obs_valid.sort_values(ascending=False)
sim_sorted = sim_valid.sort_values(ascending=False)
obs_ranks = np.arange(1., len(obs_sorted) + 1) / len(obs_sorted) * 100
sim_ranks = np.arange(1., len(sim_sorted) + 1) / len(sim_sorted) * 100
axes[1, 1].semilogy(obs_ranks, obs_sorted, 'b-', label='Observed', linewidth=2)
axes[1, 1].semilogy(sim_ranks, sim_sorted, 'r-', label='Simulated', linewidth=2)
axes[1, 1].set_xlabel('Exceedance Probability (%)')
axes[1, 1].set_ylabel('Discharge (m³/s)')
axes[1, 1].set_title('Flow Duration Curve')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.suptitle(f'Logan River at Logan — Lumped SUMMA Evaluation', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nStreamflow evaluation complete")

## Step 5b — Model Calibration

Calibrate SUMMA parameters using the DDS (Dynamically Dimensioned Search) algorithm to improve model performance. The calibration optimizes KGE over the calibration period (2019-2020).
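
To make the search strategy concrete, here is a minimal DDS sketch on a toy two-parameter objective. It follows the general shape of the published algorithm: each iteration perturbs a random subset of parameters, the probability of selecting a parameter shrinks as `1 - ln(i) / ln(n_iter)`, and a candidate is kept only if it does not worsen the best score. This is not SYMFLUENCE's optimizer; `dds_maximize`, `toy_score`, and the bounds are illustrative assumptions.

```python
import numpy as np

def dds_maximize(objective, lower, upper, n_iter=100, r=0.2, seed=0):
    """Minimal DDS sketch: greedy search that perturbs fewer parameters over time."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    n_params = lower.size

    best_x = rng.uniform(lower, upper)  # random initial parameter set
    best_score = objective(best_x)
    history = [best_score]

    for i in range(1, n_iter + 1):
        # Selection probability falls from 1 (first iteration) to 0 (last iteration)
        p_select = 1.0 - np.log(i) / np.log(n_iter)
        selected = rng.random(n_params) < p_select
        if not selected.any():  # always perturb at least one parameter
            selected[rng.integers(n_params)] = True

        # Perturb selected parameters with a normal step scaled by the parameter range
        candidate = best_x.copy()
        step = r * (upper - lower) * rng.standard_normal(n_params)
        candidate[selected] += step[selected]

        # Reflect values that leave the bounds, then clip anything still outside
        below, above = candidate < lower, candidate > upper
        candidate[below] = 2 * lower[below] - candidate[below]
        candidate[above] = 2 * upper[above] - candidate[above]
        candidate = np.clip(candidate, lower, upper)

        score = objective(candidate)
        if score >= best_score:  # greedy acceptance
            best_x, best_score = candidate, score
        history.append(best_score)

    return best_x, best_score, history

def toy_score(x):
    """Toy objective with a maximum of 1.0 at (0.3, 0.7), loosely mimicking KGE."""
    return 1.0 - np.sum((x - np.array([0.3, 0.7])) ** 2)

best_x, best_score, history = dds_maximize(toy_score, lower=[0.0, 0.0], upper=[1.0, 1.0])
print(f"Best parameters: {best_x.round(3)}, best score: {best_score:.4f}")
```

Because acceptance is greedy, the best score never decreases from one iteration to the next, which is the behaviour to expect in a calibration progress plot.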

In [None]:
# Step 5b — Run calibration
print("Starting calibration...")
print(f"Algorithm: {config.optimization.algorithm}")
print(f"Metric: {config.optimization.metric}")
print(f"Iterations: {config.optimization.iterations}")
print(f"Calibration period: {config.domain.calibration_period}")

results_file = symfluence.managers['optimization'].calibrate_model()
print("\nCalibration complete!")
print(f"Results file: {results_file}")

### View Calibration Results

In [None]:
# Load and display calibration results
if results_file and Path(results_file).exists():
    results_df = pd.read_csv(results_file)
    
    print("Calibration Progress:")
    print(f"  Best {config.optimization.metric}: {results_df['best_score'].iloc[-1]:.4f}")
    print(f"  Initial {config.optimization.metric}: {results_df['best_score'].iloc[0]:.4f}")
    print(f"  Improvement: {results_df['best_score'].iloc[-1] - results_df['best_score'].iloc[0]:.4f}")
    
    # Plot calibration progress
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(results_df['generation'], results_df['best_score'], 'b-', linewidth=2)
    ax.set_xlabel('Iteration')
    ax.set_ylabel(f'Best {config.optimization.metric}')
    ax.set_title('Calibration Progress')
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("No calibration results found.")