# CONFLUENCE Tutorial - 10: CAMELS Large Sample Study (Multi-Basin Streamflow Analysis)

## Introduction

This tutorial concludes the CONFLUENCE large sample studies series with systematic streamflow modeling across many watersheds, using the CAMELS-SPAT dataset (Knoben et al., 2025). The previous large sample tutorials focused on point-scale processes (FLUXNET energy fluxes, NorSWE snow dynamics). This tutorial moves to the watershed scale and to streamflow, the variable that integrates all upstream processes. Streamflow simulation is the classic application of hydrological modeling and a demanding test of CONFLUENCE's capabilities across diverse watersheds.

### CAMELS-SPAT: The Gold Standard for Large Sample Hydrology

CAMELS-SPAT builds on the CAMELS family of datasets, which were designed to support large sample studies in hydrology:

**Comprehensive Coverage**:
- **CAMELS-US**: 671 watersheds across the contiguous United States
- **Global Extensions**: CAMELS-GB, CAMELS-BR, CAMELS-CL, CAMELS-AUS, CAMELS-FR
- **Climate Diversity**: Arid to humid, tropical to continental, coastal to mountainous
- **Scale Range**: 4 to 25,000 km² watersheds

**Standardized Framework**:
- **Meteorological Forcing**: Gridded precipitation and temperature data
- **Streamflow Observations**: Quality-controlled daily discharge time series
- **Catchment Attributes**: Topographic, geologic, soil, and vegetation characteristics
- **Minimal Human Impact**: Focus on near-natural watersheds

**Research Impact**:
- **Benchmark Studies**: Standard dataset for model comparison
- **Process Understanding**: Systematic analysis of hydrological controls
- **Machine Learning**: Training data for data-driven approaches
- **Climate Studies**: Assessment of climate change impacts on hydrology

### Streamflow: The Integrative Hydrological Variable

Streamflow represents the integrated response of all watershed processes:

**Process Integration**:
- **Precipitation Processing**: Interception, infiltration, and runoff generation
- **Evapotranspiration**: Plant water use and soil moisture dynamics
- **Groundwater Interactions**: Baseflow contributions and storage dynamics
- **Routing Processes**: Travel time and channel hydraulics
- **Snow Processes**: Seasonal storage and release in cold regions

**Observational Advantages**:
- **Direct Measurement**: Streamflow is directly observable at gauging stations
- **Integrative Nature**: A single gauge record reflects processes across the entire upstream area
- **Long Records**: Many sites have decades of continuous observations
- **Management Relevance**: Direct connection to water resources applications

### Scientific Importance of Multi-Basin Streamflow Analysis

Large sample streamflow studies address fundamental questions in hydrology:

**Hydrological Controls**:
- **Climate vs. Landscape**: Relative importance of meteorological vs. physical controls
- **Scale Dependencies**: How hydrological processes scale from hillslopes to watersheds
- **Threshold Behaviors**: Nonlinear responses to climate and landscape characteristics
- **Regional Patterns**: Systematic variations across physiographic regions

**Model Evaluation**:
- **Process Representation**: Which model components are most important?
- **Parameter Transferability**: Can parameters be regionalized effectively?
- **Uncertainty Quantification**: How does model uncertainty vary across environments?
- **Structural Adequacy**: Are current model structures sufficient?

### CAMELS vs. Previous Large Sample Studies

This tutorial complements our previous large sample analyses:

| Dataset | Scale | Focus | Validation | Complexity |
|---------|-------|-------|------------|------------|
| **FLUXNET** | Point | Energy/carbon fluxes | Flux measurements | Ecosystem interactions |
| **NorSWE** | Point | Snow dynamics | State variables | Phase change physics |
| **CAMELS** | Watershed | Streamflow | Discharge observations | Process integration |

### Unique Challenges of Multi-Basin Streamflow Modeling

Watershed-scale streamflow modeling presents distinct challenges:

**Spatial Heterogeneity**:
- **Landscape Diversity**: Elevation, slope, soil, and vegetation gradients
- **Climate Variability**: Precipitation and temperature patterns within watersheds
- **Geological Controls**: Subsurface heterogeneity and groundwater systems
- **Scale Interactions**: Processes operating at different spatial scales

**Temporal Dynamics**:
- **Multiple Timescales**: Event response, seasonal cycles, and long-term trends
- **Memory Effects**: Antecedent conditions and storage dynamics
- **Extreme Events**: Floods, droughts, and their watershed-scale impacts
- **Interannual Variability**: Year-to-year and decadal climate variations

### CONFLUENCE's Advantages for Multi-Basin Studies

CONFLUENCE's design provides unique advantages for large sample streamflow analysis:

**Consistent Methodology**:
- **Standardized Workflow**: Same modeling approach across all watersheds
- **Automated Processing**: Efficient setup and execution for hundreds of basins
- **Reproducible Science**: Complete documentation of modeling decisions
- **Quality Control**: Systematic evaluation of model performance

**Physical Realism**:
- **Process-Based Models**: Explicit representation of hydrological processes
- **Flexible Structure**: Adaptable to different watershed characteristics
- **Multi-Model Capability**: Compare different model structures
- **Uncertainty Assessment**: Quantify parameter and structural uncertainty

### Research Questions for Multi-Basin Analysis

Large sample streamflow studies enable investigation of fundamental hydrological questions:

1. **Process Controls**: What are the dominant controls on streamflow generation across different environments?
2. **Model Performance**: How does model performance vary with climate, topography, and soil characteristics?
3. **Parameter Patterns**: Are there systematic patterns in optimal parameter values across watersheds?
4. **Prediction Capability**: Can models trained in one region predict streamflow in another?
5. **Climate Sensitivity**: How sensitive is streamflow to climate variability and change?

### Expected Outcomes

This tutorial demonstrates several key capabilities for multi-basin streamflow analysis:

1. **Watershed-Scale Configuration**: Adapt CONFLUENCE for diverse watershed characteristics
2. **Streamflow Validation**: Compare simulated and observed hydrographs across sites
3. **Performance Analysis**: Evaluate model performance using multiple metrics
4. **Regional Patterns**: Identify systematic variations in model performance
5. **Process Diagnostics**: Understand reasons for model success and failure

### Methodological Framework

Multi-basin streamflow studies rest on two methodological choices: which sites to include and how to evaluate the model at each of them.

**Site Selection**:
- **Climate Gradients**: Represent aridity, temperature, and seasonality gradients
- **Physiographic Diversity**: Include different geological and topographic settings
- **Scale Representation**: Cover the range of watershed sizes
- **Data Quality**: Ensure reliable streamflow and meteorological data
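
One way to put the site selection criteria above into practice is to stratify candidate basins by an attribute and sample from each stratum. The sketch below is illustrative only: the `aridity` column, the synthetic attribute table, and the helper `select_stratified_sites` are assumptions made for this example, not part of CONFLUENCE or CAMELS-SPAT.

```python
import numpy as np
import pandas as pd


def select_stratified_sites(attrs, column, n_bins=5, per_bin=1, seed=0):
    """Sample `per_bin` basins from each quantile bin of `column`.

    Hypothetical helper: stratifying on a single attribute is a simple way
    to cover one gradient; real studies often stratify on several.
    """
    # Assign each basin to a quantile bin of the chosen attribute
    bins = pd.qcut(attrs[column], q=n_bins, labels=False, duplicates="drop")
    bins = bins.rename("_bin")
    # Draw the same number of basins from every bin
    return attrs.groupby(bins).sample(n=per_bin, random_state=seed)


# Synthetic attribute table standing in for real catchment attributes
rng = np.random.default_rng(42)
attrs = pd.DataFrame({
    "ID": [f"basin_{i:02d}" for i in range(20)],
    "aridity": rng.uniform(0.2, 3.0, size=20),
})

selected = select_stratified_sites(attrs, "aridity", n_bins=5, per_bin=1)
print(selected.sort_values("aridity"))
```

Sampling per bin keeps the selection from clustering in the most common climate class, which a plain random draw over all basins would tend to do.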

**Model Evaluation**:
- **Multiple Metrics**: Nash-Sutcliffe efficiency, Kling-Gupta efficiency, bias
- **Flow Components**: Evaluate high flows, low flows, and timing
- **Seasonal Performance**: Assess model performance across different seasons
- **Extreme Events**: Evaluate performance during floods and droughts
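
For reference, the metrics named above have the following standard definitions for observed discharge $Q_{o,t}$ and simulated discharge $Q_{s,t}$ over time steps $t = 1, \dots, T$. The notation is introduced here for illustration. The bias is written as percent bias with positive values indicating overestimation, and the Kling-Gupta efficiency is given in its commonly used original form.

$$
\mathrm{NSE} = 1 - \frac{\sum_{t=1}^{T} \left(Q_{o,t} - Q_{s,t}\right)^2}{\sum_{t=1}^{T} \left(Q_{o,t} - \bar{Q}_o\right)^2}
$$

$$
\mathrm{PBIAS} = 100 \cdot \frac{\sum_{t=1}^{T} \left(Q_{s,t} - Q_{o,t}\right)}{\sum_{t=1}^{T} Q_{o,t}}
$$

$$
\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2},
\qquad \alpha = \frac{\sigma_s}{\sigma_o}, \quad \beta = \frac{\mu_s}{\mu_o}
$$

Here $r$ is the linear correlation between simulated and observed flows, $\sigma$ denotes the standard deviation, and $\mu$ the mean. NSE and KGE both equal 1 for a perfect simulation.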

### Tutorial Structure

This tutorial follows the established large sample framework while emphasizing streamflow-specific aspects:

1. **CAMELS Site Selection**: Choose representative watersheds across environmental gradients
2. **Watershed Configuration**: Adapt CONFLUENCE for diverse basin characteristics
3. **Streamflow-Focused Setup**: Configure for discharge validation and routing
4. **Batch Processing**: Execute CONFLUENCE across multiple watersheds
5. **Hydrograph Analysis**: Collect and analyze streamflow time series
6. **Performance Assessment**: Evaluate model performance across sites
7. **Regional Synthesis**: Identify patterns and controls on model performance

### Scientific Impact

Multi-basin streamflow studies contribute to advancing hydrological science:

- **Process Understanding**: Identify universal vs. regional hydrological controls
- **Model Development**: Improve model structure and parameterization
- **Water Resources**: Enhance streamflow prediction for management applications
- **Climate Applications**: Improve projections of streamflow under changing climate
- **Ungauged Basins**: Develop approaches for prediction in ungauged watersheds

### Tutorial Series Culmination

This tutorial represents the ultimate demonstration of CONFLUENCE's capabilities:

**Complete Skill Integration**:
- **Point-scale understanding**: Foundation in individual processes
- **Spatial scaling**: Watershed-scale process integration
- **Large sample methods**: Systematic multi-site analysis
- **Workflow automation**: Efficient processing of hundreds of sites

**Hydrological Scope**:
- **Process diversity**: Energy balance, snow dynamics, and streamflow
- **Scale range**: Points to watersheds to continental domains
- **Temporal coverage**: Event-scale to multi-decadal analysis
- **Environmental gradients**: Complete range of hydroclimatic conditions

**Methodological Sophistication**:
- **Model complexity**: From simple to sophisticated process representations
- **Uncertainty quantification**: Parameter and structural uncertainty assessment
- **Comparative analysis**: Systematic evaluation across multiple sites
- **Reproducible science**: Complete workflow documentation and automation

### Practical Applications

The skills developed in this tutorial have immediate practical applications:

- **Water Resources Management**: Streamflow prediction for reservoir operations
- **Flood Forecasting**: Improved understanding of extreme event generation
- **Climate Change Assessment**: Quantifying future streamflow changes
- **Ecological Applications**: Instream flow requirements and habitat assessment
- **Policy Support**: Science-based water allocation and management decisions

By completing this tutorial, you will have worked through the full range of CONFLUENCE applications, from individual process understanding to large sample comparative analysis. Systematic multi-site analysis of this kind supports both theoretical advances and practical applications in water resources management.

Combining CONFLUENCE's workflow automation with the CAMELS-SPAT watershed database provides a framework for studying how watersheds function across diverse hydroclimatic environments.

In [None]:
import sys
import os
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import subprocess
import yaml
from datetime import datetime

# Add CONFLUENCE to path
confluence_path = Path('../').resolve()
sys.path.append(str(confluence_path))

# Set up plotting style
plt.style.use('default')
%matplotlib inline

In [None]:
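# Configuration for the large sample experiment
# NOTE: the absolute paths below are specific to the original authors' system;
# adjust them to match your own CONFLUENCE installation and data location.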
experiment_config = {
    'dataset': 'camels-spat',
    'max_watersheds': 5,
    'dry_run': False,
    'experiment_name': 'camelsspat_tutorial',
    'template_config': '/home/darri.eythorsson/code/CONFLUENCE/0_config_files/config_distributed_basin_template.yaml',
    'config_dir': '/home/darri.eythorsson/code/CONFLUENCE/0_config_files/camels_spat',
    'camelsspat_script': '/home/darri.eythorsson/code/CONFLUENCE/9_scripts/run_watersheds_camelsspat.py',
    'camelsspat_dir': '/work/comphyd_lab/data/_to-be-moved/camels-spat-upload/shapefiles/meso-scale/shapes-distributed',
    'metadata_csv': 'camels-spat-metadata.csv'
}

# Create experiment directory
experiment_dir = Path(f"./experiments/{experiment_config['experiment_name']}")
experiment_dir.mkdir(parents=True, exist_ok=True)

# Save configuration
with open(experiment_dir / 'experiment_config.yaml', 'w') as f:
    yaml.dump(experiment_config, f)
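
Because the configuration relies on several absolute paths, it can help to confirm that they exist before launching anything. The following is a minimal sketch under that assumption. The helper `find_missing_paths` and the placeholder paths are hypothetical and not part of CONFLUENCE.

```python
from pathlib import Path


def find_missing_paths(config, keys):
    """Return {key: path} for configured paths that do not exist on disk."""
    missing = {}
    for key in keys:
        path = Path(config[key])
        if not path.exists():
            missing[key] = str(path)
    return missing


# Example with placeholder paths (hypothetical, for illustration only)
example_config = {
    "template_config": "/nonexistent/config_template.yaml",
    "config_dir": ".",
}
missing = find_missing_paths(example_config, ["template_config", "config_dir"])
for key, path in missing.items():
    print(f"Missing {key}: {path}")
```

With the tutorial's configuration, the same check could be run on `experiment_config` with the keys `template_config`, `config_dir`, `camelsspat_script`, and `camelsspat_dir`.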

In [None]:
# Import function from the CAMELS-SPAT script to extract shapefile info
sys.path.append(str(Path(experiment_config['camelsspat_script']).parent))
from run_watersheds_camelsspat import extract_shapefile_info

# Check if we already have watershed info cached
watersheds_csv = experiment_dir / 'camelsspat_watersheds.csv'

if watersheds_csv.exists():
    print("Loading existing watershed information")
    watersheds_df = pd.read_csv(watersheds_csv)
else:
    print("Extracting watershed information...")
    watersheds_df = extract_shapefile_info(experiment_config['camelsspat_dir'])
    watersheds_df.to_csv(watersheds_csv, index=False)

print(f"Found {len(watersheds_df)} watersheds")

In [None]:
# Load CAMELS-SPAT metadata if available
metadata_path = experiment_config['metadata_csv']

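if os.path.exists(metadata_path):
    print("Loading and merging metadata")
    metadata_df = pd.read_csv(metadata_path)
    metadata_df.columns = [col.strip() for col in metadata_df.columns]
    # Compare IDs as strings so both merge keys share a dtype.
    # IDs with leading zeros only survive if the CSV stores them as text.
    metadata_df['ID'] = metadata_df['ID'].astype(str).str.strip()
    
    # Strip the leading country prefix (capital letters plus underscore) from the shapefile ID
    watersheds_df['Metadata_ID'] = watersheds_df['ID'].astype(str).str.replace(r'^[A-Z]+_', '', regex=True)
    watersheds_merged = pd.merge(
        watersheds_df, metadata_df, 
        left_on='Metadata_ID', right_on='ID',
        how='left', suffixes=('', '_metadata')
    )
    watersheds_df = watersheds_merged
else:
    print("No metadata file found. Proceeding with shapefile information only.")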
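
A left merge silently keeps watersheds that have no metadata match, so it is worth checking how many rows actually matched. The sketch below reproduces the prefix-stripping rule on synthetic data and uses the `indicator` option of `pd.merge` to list unmatched IDs. The example IDs and the `area_km2` column are hypothetical.

```python
import pandas as pd

# Synthetic stand-ins; the IDs and the `area_km2` values are hypothetical
shapes = pd.DataFrame({"ID": ["USA_01013500", "CAN_05BB001", "USA_99999999"]})
metadata = pd.DataFrame({
    "ID": ["01013500", "05BB001"],
    "area_km2": [100.0, 200.0],
})

# Same rule as in the cell above: drop a leading run of capital letters
# followed by an underscore
shapes["Metadata_ID"] = shapes["ID"].str.replace(r"^[A-Z]+_", "", regex=True)

merged = pd.merge(
    shapes, metadata,
    left_on="Metadata_ID", right_on="ID",
    how="left", suffixes=("", "_metadata"),
    indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
)

unmatched = merged.loc[merged["_merge"] == "left_only", "ID"].tolist()
print(f"Matched {len(merged) - len(unmatched)} of {len(merged)} watersheds")
print(f"Unmatched IDs: {unmatched}")
```

If gauge IDs carry leading zeros, reading the metadata with `pd.read_csv(..., dtype=str)` keeps them intact. Otherwise pandas may parse them as integers and the keys will no longer match.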

In [None]:
# Launch the large sample experiment.
# Job submission is disabled by default so the notebook can be re-run
# without resubmitting; set the flag to True to launch.
launch_experiment = False

if launch_experiment:
    cmd = ['python', experiment_config['camelsspat_script']]

    # Add command-line arguments
    if experiment_config['dry_run']:
        cmd.append('--dry-run')

    if experiment_config['max_watersheds'] > 0:
        cmd.extend(['--max-watersheds', str(experiment_config['max_watersheds'])])

    print(f"Launching CONFLUENCE for {experiment_config['max_watersheds']} watersheds")
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Submission script exited with code {result.returncode}:\n{result.stderr}")

    # Save submission log
    with open(experiment_dir / 'submission.log', 'w') as f:
        f.write(result.stdout)

In [None]:
# Parse submission log and check job status
import getpass
import re
import shutil

submission_log = experiment_dir / 'submission.log'
submitted_jobs = []

if submission_log.exists():
    with open(submission_log, 'r') as f:
        log_content = f.read()
    
    # Extract job submissions
    pattern = r'Domain: ([^,]+), Job ID: (\d+)'
    matches = re.findall(pattern, log_content)
    for domain, job_id in matches:
        submitted_jobs.append({'domain': domain, 'job_id': job_id})
    
    if submitted_jobs:
        jobs_df = pd.DataFrame(submitted_jobs)
        print(f"Submitted {len(jobs_df)} jobs")

# Check job status
def check_job_status(user=None):
    """Return `squeue` output for the user, or a note if SLURM is unavailable."""
    if shutil.which('squeue') is None:
        return "squeue not found on this system; skipping job status check."
    user = user or os.environ.get('USER') or getpass.getuser()
    cmd = ['squeue', '-u', user]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout

print("\nCurrent jobs:")
print(check_job_status())
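
The job-parsing step depends on the submission script printing lines that match the regular expression above. The short sketch below applies that pattern to a made-up log. The sample lines are hypothetical and only show which text the pattern captures.

```python
import re

# Hypothetical log lines in the format the regular expression expects
sample_log = """\
Submitting jobs...
Domain: USA_01013500, Job ID: 123456
Domain: CAN_05BB001, Job ID: 123457
Skipped domain with missing shapefile
"""

pattern = r"Domain: ([^,]+), Job ID: (\d+)"
jobs = [
    {"domain": domain, "job_id": job_id}
    for domain, job_id in re.findall(pattern, sample_log)
]
print(jobs)
```

Lines that do not follow the `Domain: ..., Job ID: ...` format are ignored, so a change in the script's output format would yield an empty job list rather than an error.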

In [None]:
# Find completed watershed simulations
confluence_data_dir = Path("/work/comphyd_lab/data/CONFLUENCE_data")
camelsspat_dir = confluence_data_dir / "camels_spat"

completed = []
if camelsspat_dir.exists():
    for domain_dir in camelsspat_dir.glob("domain_*"):
        watershed_id = domain_dir.name.replace('domain_', '')
        sim_dir = domain_dir / "simulations"
        
        if sim_dir.exists() and any(sim_dir.rglob("*.nc")):
            completed.append({
                'watershed_id': watershed_id,
                'domain_dir': domain_dir,
                'sim_dir': sim_dir
            })

print(f"Completed simulations: {len(completed)}")

In [None]:
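# Define functions to load model outputs
def load_summa_output(sim_dir, variable='scalarSWE'):
    """Load one SUMMA output variable as a time series DataFrame, or None."""
    import xarray as xr
    summa_output_dir = sim_dir / 'run_1' / 'SUMMA'
    if not summa_output_dir.exists():
        return None
    # Sort so the file choice is deterministic when several files match
    output_files = sorted(summa_output_dir.glob("*timestep*.nc"))
    if output_files:
        with xr.open_dataset(output_files[0]) as ds:
            if variable in ds.variables:
                da = ds[variable]
                # Average over any non-time dimensions (e.g. HRUs) so the series
                # length matches the time axis; note this mean is unweighted
                other_dims = [d for d in da.dims if d != 'time']
                if other_dims:
                    da = da.mean(dim=other_dims)
                return pd.DataFrame({
                    'time': pd.to_datetime(ds.time.values),
                    'value': da.values
                })
    return None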

def load_streamflow(sim_dir):
    import xarray as xr
    mizuroute_dir = sim_dir / 'run_1' / 'mizuRoute'
    if not mizuroute_dir.exists():
        print(f"mizuRoute directory not found: {mizuroute_dir}")
        return None
    
    # Sort so the first file is chosen deterministically; only that file is read
    output_files = sorted(mizuroute_dir.glob("*.nc"))
    if not output_files:
        print(f"No netCDF files found in: {mizuroute_dir}")
        return None
    
    try:
        ds = xr.open_dataset(output_files[0])
        
        # Try to find streamflow variable
        for var in ['IRFroutedRunoff', 'routedRunoff', 'discharge']:
            if var in ds.variables:
                # Check dimensions
                var_dims = ds[var].dims
                
                if len(var_dims) == 1 and 'time' in var_dims:
                    # Single dimension (time only)
                    return pd.DataFrame({
                        'time': pd.to_datetime(ds.time.values),
                        'simulated': ds[var].values
                    })
                
                elif len(var_dims) > 1:
                    # Multiple dimensions (likely time and segments/reaches)
                    time_dim = 'time'
                    reach_dims = [d for d in var_dims if d != time_dim]
                    
                    if not reach_dims:
                        print(f"Unexpected dimensions in {var}: {var_dims}")
                        continue
                    
                    reach_dim = reach_dims[0]
                    
                    # Find the right reach/segment to use
                    # If there's a single reach, use it
                    n_reaches = ds.sizes[reach_dim]
                    if n_reaches == 1:
                        reach_idx = 0
                    else:
                        # TODO: Add logic to find the appropriate outlet reach
                        # For now, use the last reach, which is assumed
                        # (but not guaranteed) to be the outlet
                        reach_idx = n_reaches - 1
                    
                    # Extract data for the selected reach
                    flow_data = ds[var].isel({reach_dim: reach_idx}).values
                    
                    # Make sure lengths match
                    if len(ds.time) == len(flow_data):
                        return pd.DataFrame({
                            'time': pd.to_datetime(ds.time.values),
                            'simulated': flow_data
                        })
                    else:
                        print(f"Length mismatch: time ({len(ds.time)}) vs flow ({len(flow_data)})")
        
        print(f"No suitable flow variable found in {output_files[0]}")
        return None
        
    except Exception as e:
        print(f"Error loading streamflow data: {e}")
        return None

def load_observed_streamflow(domain_dir, watershed_id):
    obs_dir = domain_dir / 'observations' / 'streamflow' / 'preprocessed'
    obs_file = list(obs_dir.glob(f"*{watershed_id}*streamflow*.csv"))
    
    if obs_file:
        obs_df = pd.read_csv(obs_file[0])
        time_col = 'datetime' if 'datetime' in obs_df.columns else 'date'
        flow_col = 'discharge_cms' if 'discharge_cms' in obs_df.columns else 'streamflow'
        
        return pd.DataFrame({
            'time': pd.to_datetime(obs_df[time_col]),
            'observed': obs_df[flow_col]
        })
    return None
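
The `TODO` in `load_streamflow` leaves outlet selection open. One possible heuristic, sketched below with plain NumPy on synthetic data, is to pick the reach with the largest mean flow, on the assumption that the outlet of a dendritic network carries the most discharge. The helper `pick_outlet_index` is hypothetical. Matching a known outlet segment ID from the network topology, where available, would be more robust.

```python
import numpy as np


def pick_outlet_index(flow):
    """Return the column index of the reach with the largest mean flow.

    `flow` is a 2-D array shaped (time, reach). Heuristic assumption: the
    outlet reach carries the largest mean discharge in the network.
    """
    flow = np.asarray(flow, dtype=float)
    mean_flow = np.nanmean(flow, axis=0)  # ignore missing time steps
    return int(np.nanargmax(mean_flow))


# Synthetic example: three reaches, the second one has the largest flow
flow = np.array([
    [1.0, 5.0, 2.0],
    [2.0, 7.0, 3.0],
    [np.nan, 6.0, 2.5],
])
outlet = pick_outlet_index(flow)
print(f"Outlet reach index: {outlet}")
```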

In [None]:
# Calculate performance metrics
def calculate_metrics(obs, sim):
    mask = ~(np.isnan(obs) | np.isnan(sim))
    obs, sim = obs[mask], sim[mask]
    
    if len(obs) == 0:
        return {}
    
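    # Nash-Sutcliffe Efficiency (undefined when observations have zero variance)
    obs_variance_sum = np.sum((obs - np.mean(obs))**2)
    nse = 1 - np.sum((obs - sim)**2) / obs_variance_sum if obs_variance_sum != 0 else np.nan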
    # RMSE
    rmse = np.sqrt(np.mean((obs - sim)**2))
    # Percent Bias
    pbias = 100 * np.sum(sim - obs) / np.sum(obs) if np.sum(obs) != 0 else np.nan
    # Correlation
    corr = np.corrcoef(obs, sim)[0, 1] if len(obs) > 1 else np.nan
    
    return {'NSE': nse, 'RMSE': rmse, 'PBIAS': pbias, 'Correlation': corr}

# Calculate metrics for completed watersheds
metrics_list = []
for ws in completed:
    sim_flow = load_streamflow(ws['sim_dir'])
    obs_flow = load_observed_streamflow(ws['domain_dir'], ws['watershed_id'])
    
    if sim_flow is not None and obs_flow is not None:
        merged = pd.merge(obs_flow, sim_flow, on='time', how='inner')
        metrics = calculate_metrics(merged['observed'].values, merged['simulated'].values)
        metrics['watershed_id'] = ws['watershed_id']
        metrics_list.append(metrics)

if metrics_list:
    metrics_df = pd.DataFrame(metrics_list)
    print("Performance Metrics:")
    print(metrics_df)
    metrics_df.to_csv(experiment_dir / 'performance_metrics.csv', index=False)
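
The introduction lists the Kling-Gupta efficiency among the evaluation metrics, but `calculate_metrics` does not compute it. A minimal sketch of the original KGE formulation follows. The function name `kling_gupta_efficiency` is an assumption for this example and is not taken from CONFLUENCE.

```python
import numpy as np


def kling_gupta_efficiency(obs, sim):
    """Kling-Gupta efficiency (original formulation); NaN if undefined."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    # Keep only time steps where both series have values
    mask = ~(np.isnan(obs) | np.isnan(sim))
    obs, sim = obs[mask], sim[mask]
    if len(obs) < 2 or np.std(obs) == 0 or np.std(sim) == 0 or np.mean(obs) == 0:
        return np.nan
    r = np.corrcoef(obs, sim)[0, 1]          # linear correlation
    alpha = np.std(sim) / np.std(obs)        # variability ratio
    beta = np.mean(sim) / np.mean(obs)       # bias ratio
    return 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


obs = np.array([1.0, 2.0, 3.0, 4.0, np.nan])
perfect = kling_gupta_efficiency(obs, obs)
doubled = kling_gupta_efficiency(obs, 2 * obs)
print(perfect, doubled)
```

A simulation that doubles every observed value keeps a perfect correlation but is penalized through the variability and bias terms, which is the kind of decomposition that a single NSE value does not expose.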

In [None]:
# Create summary report
if completed:
    print("### CAMELS-SPAT Experiment Summary Report ###")
    print(f"Experiment: {experiment_config['experiment_name']}")
    print(f"Date: {datetime.now().strftime('%Y-%m-%d')}")
    print(f"Completed simulations: {len(completed)}")
    
    if metrics_list:
        print("\nOverall Model Performance:")
        for metric in ['NSE', 'RMSE', 'PBIAS', 'Correlation']:
            if metric in metrics_df.columns:
                print(f"  Average {metric}: {metrics_df[metric].mean():.3f}")
                
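    # Save report to file, including the metrics summary printed above
    report_path = experiment_dir / 'experiment_report.txt'
    with open(report_path, 'w') as f:
        f.write("CAMELS-SPAT Experiment Summary\n")
        f.write(f"Date: {datetime.now().strftime('%Y-%m-%d')}\n")
        f.write(f"Experiment: {experiment_config['experiment_name']}\n")
        f.write(f"Completed simulations: {len(completed)}\n")
        if metrics_list:
            for metric in ['NSE', 'RMSE', 'PBIAS', 'Correlation']:
                if metric in metrics_df.columns:
                    f.write(f"Average {metric}: {metrics_df[metric].mean():.3f}\n")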
    
    print(f"\nSummary report saved to {report_path}")
else:
    print("No completed simulations found yet.")
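
The regional synthesis step named in the tutorial structure can start from a simple grouping of metrics by a catchment attribute. The sketch below uses synthetic values. The `aridity` column, the class thresholds, and all numbers are hypothetical and only illustrate the grouping pattern.

```python
import numpy as np
import pandas as pd

# Synthetic metrics and attributes (hypothetical values for illustration)
metrics = pd.DataFrame({
    "watershed_id": ["a", "b", "c", "d", "e", "f"],
    "NSE": [0.8, 0.7, 0.5, 0.4, 0.1, -0.2],
})
attributes = pd.DataFrame({
    "watershed_id": ["a", "b", "c", "d", "e", "f"],
    "aridity": [0.4, 0.6, 1.1, 1.4, 2.5, 3.0],
})

combined = metrics.merge(attributes, on="watershed_id", how="inner")

# Bin basins into three classes; the thresholds here are arbitrary
combined["climate_class"] = pd.cut(
    combined["aridity"],
    bins=[0, 1, 2, np.inf],
    labels=["humid", "intermediate", "arid"],
)

# The median is less sensitive than the mean to a few very poor basins
summary = (
    combined.groupby("climate_class", observed=True)["NSE"]
    .agg(["count", "median"])
)
print(summary)
```

In the tutorial's workflow, `metrics_df` and the merged `watersheds_df` could take the place of the synthetic tables, provided they share a common watershed identifier.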