# CONFLUENCE Tutorial - 8: Large Sample Studies (FLUXNET Multi-Site Analysis)

## Introduction

This tutorial represents the culmination of our CONFLUENCE series, demonstrating the evolution from single-domain modeling to large sample studies. While previous tutorials focused on modeling individual domains across different scales, large sample studies leverage CONFLUENCE's workflow efficiency to systematically analyze hundreds or thousands of sites, enabling comparative hydrology and robust statistical analysis.

### Large Sample Studies in Hydrology

Large sample studies in hydrology involve systematic analysis across many sites, watersheds, or regions to identify patterns in hydrological processes across different environments, test theoretical concepts under diverse conditions, and develop improved model parameterizations based on multi-site evidence. This approach enables researchers to quantify uncertainty, assess model performance and reliability across different settings, and advance comparative hydrology by comparing hydrological responses across climates, landscapes, and scales.

### FLUXNET as a Framework for Large Sample Analysis

The FLUXNET network provides an ideal framework for large sample hydrological analysis due to its global coverage spanning all continents and diverse ecosystems including forests, grasslands, wetlands, and croplands across multiple climate zones from tropical to boreal and arid regions. The network's standardized eddy covariance methodology ensures consistent, quality-controlled data processing with comparable variables across sites at standardized temporal resolution.

FLUXNET data offers significant scientific value for energy balance validation of land surface models, ecosystem-scale process understanding, climate-vegetation interaction studies, and model benchmarking across diverse environmental conditions. This makes it particularly well-suited for systematic multi-site hydrological analysis.

### CONFLUENCE's Large Sample Capabilities

CONFLUENCE's design makes it particularly effective for large sample analysis through workflow automation that reduces manual effort per site while maintaining consistent methodology across all sites to ensure comparability. The system's template-based configuration enables rapid site setup, while complete workflow documentation ensures reproducible science. High-performance computing capabilities allow parallel execution across multiple sites, and standardized output formats facilitate multi-site analysis.

### Technical Implementation

Large sample studies with CONFLUENCE involve several key components: strategic site selection across environmental gradients, automated configuration generation for site-specific setups, efficient batch processing of CONFLUENCE across multiple sites, systematic results aggregation and standardization from all sites, and comprehensive comparative analysis to identify patterns and relationships across sites.
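
As a rough illustration of the site selection component, the sketch below draws a reproducible stratified sample so that each climate class contributes at most a fixed number of sites. The function name `stratified_sample`, the site records, and the per-class quota are assumptions made for this example; only the `ID` and `KG` column names come from the FLUXNET table used in this tutorial.

```python
import random
from collections import defaultdict


def stratified_sample(sites, key, per_group, seed=42):
    """Pick up to `per_group` sites from each class of `key` (e.g. climate type)."""
    groups = defaultdict(list)
    for site in sites:
        groups[site[key]].append(site)

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    selected = []
    for group_name in sorted(groups):  # sorted so the order is deterministic
        members = groups[group_name]
        k = min(per_group, len(members))  # small classes contribute all their sites
        selected.extend(rng.sample(members, k))
    return selected


# Hypothetical site records; real values would come from the FLUXNET site table
sites = [
    {"ID": "site_a", "KG": "Dfb"},
    {"ID": "site_b", "KG": "Dfb"},
    {"ID": "site_c", "KG": "Dfb"},
    {"ID": "site_d", "KG": "Csa"},
    {"ID": "site_e", "KG": "Af"},
]

selected = stratified_sample(sites, key="KG", per_group=2)
print(f"Selected {len(selected)} of {len(sites)} sites")
```

Capping the number of sites per class keeps well-instrumented climates from dominating the sample, which matters when results are later compared across environmental gradients.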

### Learning Objectives

This tutorial will teach you to design large sample experiments with appropriate site selection strategies, automate configuration generation for hundreds of sites, manage batch processing of multiple CONFLUENCE runs, aggregate and analyze results from multi-site experiments, visualize patterns across environmental gradients, and apply statistical methods to understand hydrological controls.

The combination of CONFLUENCE's workflow efficiency with large sample methodologies opens new possibilities for understanding how hydrological processes vary across Earth's diverse environments, from individual flux towers to global patterns.

## Step 1: Large Sample Template Configuration and Experimental Design

Large sample studies are the next step in our CONFLUENCE series. Rather than scaling to larger spatial domains, we now use CONFLUENCE's workflow efficiency to analyze hundreds of sites across global environmental gradients. Using the FLUXNET network, we show how CONFLUENCE extends from single-domain modeling to comparative hydrology and statistical analysis across diverse ecosystems. This step saves a FLUXNET configuration template and examines the site database that drives the rest of the tutorial.

The same CONFLUENCE framework now scales to handle systematic multi-site analysis while maintaining the workflow consistency and scientific rigor established throughout our tutorial series.


In [None]:
import sys
import os
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import subprocess
import yaml
from datetime import datetime
import seaborn as sns
import warnings

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline
confluence_path = Path('../').resolve()

# =============================================================================
# LARGE SAMPLE EXPERIMENTAL DESIGN CONFIGURATION
# =============================================================================

# Set directory paths
CONFLUENCE_CODE_DIR = confluence_path
CONFLUENCE_DATA_DIR = Path('/Users/darrieythorsson/compHydro/data/CONFLUENCE_data')  # ← Update this path
#CONFLUENCE_DATA_DIR = Path('/path/to/your/CONFLUENCE_data') 

# Load the point-scale configuration template
na_config_path = CONFLUENCE_CODE_DIR / '0_config_files' / 'config_point_template.yaml'
with open(na_config_path, 'r') as f:
    config_dict = yaml.safe_load(f)

# Update for tutorial-specific settings
config_updates = {
    'CONFLUENCE_CODE_DIR': str(CONFLUENCE_CODE_DIR),
    'CONFLUENCE_DATA_DIR': str(CONFLUENCE_DATA_DIR),
    'DOMAIN_NAME': 'fluxnet',
    'EXPERIMENT_ID': 'run_1',
    'EXPERIMENT_TIME_START': '2018-01-01 01:00',
    'EXPERIMENT_TIME_END': '2018-03-31 23:00',  # Short for tutorial demonstration
}

config_dict.update(config_updates)

# Save the FLUXNET template configuration
continental_config_path = CONFLUENCE_CODE_DIR / '0_config_files' / 'config_fluxnet_template.yaml'
with open(continental_config_path, 'w') as f:
    yaml.dump(config_dict, f, default_flow_style=False, sort_keys=False)

print(f"✅ FLUXNET template configuration saved: {continental_config_path}")

# =============================================================================
# LOAD AND EXAMINE FLUXNET DATASET
# =============================================================================

print(f"\nLoading FLUXNET Site Database...")

# Load the FLUXNET sites database
try:
    fluxnet_df = pd.read_csv('fluxnet_towers.csv')
    print(f"Successfully loaded FLUXNET database: {len(fluxnet_df)} sites available")
except FileNotFoundError:
    print(f"Error: FLUXNET database not found")
    print(f"Please ensure 'fluxnet_towers.csv' is in the current directory")
    raise

# Display basic dataset information
print(f"\nDataset Overview:")
print(f"  Total sites: {len(fluxnet_df)}")
print(f"  Columns: {len(fluxnet_df.columns)}")
print(f"  Column names: {', '.join(fluxnet_df.columns)}")

# =============================================================================
# EXTRACT SPATIAL COORDINATES
# =============================================================================

print(f"\nExtracting Spatial Information...")

# Parse coordinate information
try:
    coords = fluxnet_df['POUR_POINT_COORDS'].str.split('/', expand=True)
    fluxnet_df['latitude'] = coords[0].astype(float)
    fluxnet_df['longitude'] = coords[1].astype(float)
    
    print(f"Coordinate extraction successful")
    print(f"  Latitude range: {fluxnet_df['latitude'].min():.1f}° to {fluxnet_df['latitude'].max():.1f}°")
    print(f"  Longitude range: {fluxnet_df['longitude'].min():.1f}° to {fluxnet_df['longitude'].max():.1f}°")
except Exception as e:
    print(f"Error extracting coordinates: {e}")
    raise  # later cells need the latitude/longitude columns, so stop here

# =============================================================================
# DATASET CHARACTERISTICS ANALYSIS
# =============================================================================

print(f"\nAnalyzing Dataset Characteristics...")

# Climate classification analysis
if 'KG' in fluxnet_df.columns:
    climate_counts = fluxnet_df['KG'].value_counts()
    print(f"  Climate types (Köppen-Geiger): {len(climate_counts)}")
    print(f"    Most common: {climate_counts.index[0]} ({climate_counts.iloc[0]} sites)")

# Land cover analysis
if 'Dominant_LC' in fluxnet_df.columns:
    landcover_counts = fluxnet_df['Dominant_LC'].value_counts()
    print(f"  Land cover types: {len(landcover_counts)}")
    print(f"    Most common: {landcover_counts.index[0]} ({landcover_counts.iloc[0]} sites)")

# Area analysis
if 'Area_km2' in fluxnet_df.columns:
    area_stats = fluxnet_df['Area_km2'].describe()
    print(f"  Area range: {area_stats['min']:.2f} to {area_stats['max']:.2f} km²")
    print(f"  Mean area: {area_stats['mean']:.2f} km²")

# =============================================================================
# DATASET VISUALIZATION
# =============================================================================

print(f"\nCreating Dataset Overview Visualization...")

# Create comprehensive dataset overview
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Global distribution map
ax1 = axes[0, 0]
ax1.scatter(fluxnet_df['longitude'], fluxnet_df['latitude'], 
           c='blue', alpha=0.6, s=30)
ax1.set_xlabel('Longitude')
ax1.set_ylabel('Latitude')
ax1.set_title(f'Global FLUXNET Site Distribution\n({len(fluxnet_df)} sites)')
ax1.grid(True, alpha=0.3)
ax1.set_xlim(-180, 180)
ax1.set_ylim(-60, 80)

# Climate type distribution
ax2 = axes[0, 1]
if 'KG' in fluxnet_df.columns:
    climate_counts = fluxnet_df['KG'].value_counts()
    bars = ax2.bar(range(len(climate_counts)), climate_counts.values, 
                   color='skyblue', alpha=0.7)
    ax2.set_xticks(range(len(climate_counts)))
    ax2.set_xticklabels(climate_counts.index, rotation=45, ha='right')
    ax2.set_ylabel('Number of Sites')
    ax2.set_title('Climate Type Distribution')
    ax2.grid(True, alpha=0.3, axis='y')

# Land cover distribution
ax3 = axes[1, 0]
if 'Dominant_LC' in fluxnet_df.columns:
    lc_counts = fluxnet_df['Dominant_LC'].value_counts()
    bars = ax3.bar(range(len(lc_counts)), lc_counts.values, 
                   color='lightgreen', alpha=0.7)
    ax3.set_xticks(range(len(lc_counts)))
    ax3.set_xticklabels(lc_counts.index, rotation=45, ha='right')
    ax3.set_ylabel('Number of Sites')
    ax3.set_title('Land Cover Distribution')
    ax3.grid(True, alpha=0.3, axis='y')

# Area distribution
ax4 = axes[1, 1]
if 'Area_km2' in fluxnet_df.columns:
    ax4.hist(fluxnet_df['Area_km2'], bins=20, color='orange', alpha=0.7, edgecolor='black')
    ax4.set_xlabel('Area (km²)')
    ax4.set_ylabel('Number of Sites')
    ax4.set_title('Site Area Distribution')
    ax4.grid(True, alpha=0.3, axis='y')

plt.suptitle('FLUXNET Dataset Overview', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# =============================================================================
# SUMMARY STATISTICS
# =============================================================================

print(f"\nDataset Summary:")
print(f"  Geographic coverage: {len(fluxnet_df)} sites globally")
print(f"  Latitudinal span: {fluxnet_df['latitude'].max() - fluxnet_df['latitude'].min():.0f}°")
print(f"  Environmental diversity:")
if 'KG' in fluxnet_df.columns:
    print(f"    Climate types: {len(fluxnet_df['KG'].unique())}")
if 'Dominant_LC' in fluxnet_df.columns:
    print(f"    Land cover types: {len(fluxnet_df['Dominant_LC'].unique())}")
if 'Area_km2' in fluxnet_df.columns:
    print(f"    Area range: {fluxnet_df['Area_km2'].min():.1f} - {fluxnet_df['Area_km2'].max():.1f} km²")

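# Later cells reference `selected_df` and `experiment_dir`, which this notebook
# does not define elsewhere. Assumption: use every site in the database and
# write analysis outputs to a folder under the data directory (adjust as needed).
selected_df = fluxnet_df.copy()
experiment_dir = CONFLUENCE_DATA_DIR / 'fluxnet_large_sample'  # hypothetical location
(experiment_dir / 'plots').mkdir(parents=True, exist_ok=True)

print(f"\nStep 1 Complete: FLUXNET dataset loaded and analyzed")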
print(f"Ready for large sample processing workflow")

## Step 2: Automated CONFLUENCE Configuration and Batch Processing

Building on the FLUXNET dataset analysis and the template configuration saved in Step 1, this step demonstrates automated large sample processing using the `run_watersheds_fluxnet.py` script. This script performs two key functions:

**Configuration Generation**: The script reads the FLUXNET site database and automatically creates individual CONFLUENCE configuration files for each site. Each configuration is customized with site-specific parameters including domain coordinates, bounding box definitions, and unique identifiers, while maintaining consistent model settings across all sites.
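
To make the idea concrete, here is a minimal sketch of how one site-specific configuration could be derived from a shared template. It is not the code in `run_watersheds_fluxnet.py`. The helper `make_site_config`, the `BOUNDING_BOX_COORDS` key, its `lat_max/lon_min/lat_min/lon_max` ordering, and the buffer size are all assumptions for illustration, so check them against your own template.

```python
import copy


def make_site_config(template, site_id, pour_point, buffer_deg=0.1):
    """Return a copy of `template` customized for one site.

    `pour_point` is a 'lat/lon' string, the format used by POUR_POINT_COORDS.
    """
    lat_str, lon_str = pour_point.split("/")
    lat, lon = float(lat_str), float(lon_str)

    config = copy.deepcopy(template)  # never mutate the shared template
    config["DOMAIN_NAME"] = site_id
    config["POUR_POINT_COORDS"] = pour_point
    # Assumed key and ordering (lat_max/lon_min/lat_min/lon_max)
    config["BOUNDING_BOX_COORDS"] = (
        f"{lat + buffer_deg:.4f}/{lon - buffer_deg:.4f}/"
        f"{lat - buffer_deg:.4f}/{lon + buffer_deg:.4f}"
    )
    return config


# Hypothetical template and site; real values come from the YAML template and site table
template = {"DOMAIN_NAME": "fluxnet", "EXPERIMENT_ID": "run_1"}
site_config = make_site_config(template, "site_a", "45.0/-110.0", buffer_deg=0.5)
print(site_config["BOUNDING_BOX_COORDS"])
```

Each resulting dictionary can be written to its own file with `yaml.dump`, in the same way the template is saved in Step 1. Because only the site-specific keys change, the model settings stay identical across sites.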

**Batch Job Submission**: The script submits SLURM jobs to execute the complete CONFLUENCE workflow for each FLUXNET site in parallel. Each job processes geographic data, prepares meteorological forcing, processes FLUXNET observations, runs the hydrological model, and generates standardized output files.
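
The following sketch shows one way the per-site job scripts might be assembled, as a dry run that only builds the script text and submits nothing. The function `build_slurm_script`, the resource limits, the log path, and in particular the `python CONFLUENCE.py --config ...` launch command are assumptions. Replace them with whatever your cluster and installation require.

```python
from pathlib import Path


def build_slurm_script(site_id, config_path, time_limit="02:00:00", mem="4G"):
    """Return sbatch script text for one site (nothing is submitted)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name=fluxnet_{site_id}",
        f"#SBATCH --output=logs/{site_id}_%j.out",  # %j expands to the job ID
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --mem={mem}",
        "",
        # Hypothetical entry point; replace with however you launch CONFLUENCE
        f"python CONFLUENCE.py --config {config_path}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical site identifiers and config locations
site_ids = ["site_a", "site_b"]
scripts = {
    sid: build_slurm_script(sid, Path("0_config_files") / f"config_{sid}.yaml")
    for sid in site_ids
}
print(scripts["site_a"])
```

Generating the scripts as plain text first makes it easy to inspect a few of them before anything reaches the scheduler, which is worth doing when a single command can queue hundreds of jobs.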

This automated approach scales CONFLUENCE from single-domain modeling to systematic multi-site analysis across hundreds of locations:

In [None]:
def run_fluxnet_script_from_notebook():
    """
    Execute the run_watersheds_fluxnet.py script from within the notebook
    """
    print(f"\n Executing FLUXNET Large Sample Processing Script...")
    
    script_path = "./run_watersheds_fluxnet.py"
    
    if not Path(script_path).exists():
        print(f"❌ Script not found: {script_path}")
        return False
    
    print(f"   Script location: {script_path}")
    print(f"   Target sites: {len(selected_df)} FLUXNET sites")
    print(f"   Processing started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    
    try:
        # Run the script and answer its confirmation prompt automatically.
        # Note: This assumes the script reads the FLUXNET site CSV loaded in Step 1
        
        # sys.executable runs the script with the same interpreter as this notebook
        process = subprocess.Popen(
            [sys.executable, script_path],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True  # text mode; universal_newlines is a redundant alias
        )
        
        # Send 'y' to confirm job submission when prompted
        stdout, stderr = process.communicate(input='y\n')
        
        # Print the output
        if stdout:
            print("📋 Script Output:")
            for line in stdout.split('\n'):
                if line.strip():
                    print(f"   {line}")
        
        if stderr:
            print("⚠️  Script Warnings/Errors:")
            for line in stderr.split('\n'):
                if line.strip():
                    print(f"   {line}")
        
        if process.returncode == 0:
            print(f"✅ FLUXNET processing script completed successfully")
            return True
        else:
            print(f"❌ Script failed with return code: {process.returncode}")
            return False
            
    except Exception as e:
        print(f"❌ Error running script: {e}")
        return False

# Execute the FLUXNET processing script
script_success = run_fluxnet_script_from_notebook()

if script_success:
    print(f"\n✅ Step 2 Complete: Large sample processing initiated")

## Step 3: Multi-Site Output Analysis and ET Validation

Having executed large sample processing, this step demonstrates the analytical power that emerges from systematic multi-site CONFLUENCE results. The analysis showcases comprehensive spatial analysis, statistical comparison, and process validation across diverse environmental gradients.
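
The extraction code in this step multiplies latent heat flux by 0.0353 to obtain ET in mm/day. The short sketch below shows where that factor comes from, assuming a constant latent heat of vaporization of about 2.45 MJ/kg (a common approximation near 20 °C). The function names are illustrative only.

```python
SECONDS_PER_DAY = 86_400
LATENT_HEAT_VAPORIZATION = 2.45e6  # J/kg, assumed constant (approximate, near 20 °C)


def latent_heat_to_et(le_w_m2):
    """Convert latent heat flux (W/m²) to evapotranspiration (mm/day)."""
    # W/m² is J/s/m²; dividing by J/kg gives kg/m²/s, and 1 kg/m² of water is 1 mm
    return le_w_m2 * SECONDS_PER_DAY / LATENT_HEAT_VAPORIZATION


def mass_flux_to_et(flux_kg_m2_s):
    """Convert a water mass flux (kg/m²/s) to evapotranspiration (mm/day)."""
    return flux_kg_m2_s * SECONDS_PER_DAY


factor = latent_heat_to_et(1.0)
print(f"1 W/m² of latent heat is about {factor:.4f} mm/day")
print(f"100 W/m² is about {latent_heat_to_et(100.0):.2f} mm/day")
```

Because the latent heat of vaporization varies slightly with temperature, the factor is an approximation, but the resulting error is small relative to typical model and measurement uncertainty.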

### From Case Studies to Comparative Hydrology

Traditional hydrological analysis relies on individual site interpretation and validation, resulting in site-specific model evaluation with limited generalizability. This approach makes it difficult to distinguish universal processes from local effects and requires manual comparison across disparate studies with limited statistical power for robust pattern identification.

Large sample analysis enables systematic multi-site comparative hydrology through spatial pattern recognition across global environmental gradients. This approach provides sufficient sample sizes for statistical hypothesis testing, allows assessment of process universality by distinguishing general from site-specific patterns, and enables evaluation of model transferability across diverse environmental conditions.

The transition from individual case studies to large sample analysis represents a fundamental shift in hydrological science, moving from descriptive site-specific understanding to predictive process-based knowledge applicable across diverse environments.
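
One simple way to quantify agreement between simulated and observed ET at each site is a small set of paired statistics. The sketch below computes bias, RMSE, and Pearson correlation in plain Python. It assumes the two series are already aligned on common timestamps, and the function name `comparison_metrics` and the sample values are made up for illustration.

```python
import math


def comparison_metrics(sim, obs):
    """Return bias, RMSE and Pearson r for paired values, skipping NaN pairs."""
    pairs = [(s, o) for s, o in zip(sim, obs)
             if not (math.isnan(s) or math.isnan(o))]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid sim/obs pairs")

    sim_mean = sum(s for s, _ in pairs) / n
    obs_mean = sum(o for _, o in pairs) / n
    bias = sim_mean - obs_mean
    rmse = math.sqrt(sum((s - o) ** 2 for s, o in pairs) / n)

    cov = sum((s - sim_mean) * (o - obs_mean) for s, o in pairs)
    sim_var = sum((s - sim_mean) ** 2 for s, _ in pairs)
    obs_var = sum((o - obs_mean) ** 2 for _, o in pairs)
    # Correlation is undefined when either series is constant
    r = cov / math.sqrt(sim_var * obs_var) if sim_var > 0 and obs_var > 0 else float("nan")

    return {"n": n, "bias": bias, "rmse": rmse, "r": r}


# Made-up daily ET values (mm/day); the last observation is missing
sim = [1.0, 2.0, 3.0, 4.0]
obs = [1.5, 2.5, 3.5, float("nan")]
metrics = comparison_metrics(sim, obs)
print(metrics)
```

Computing the same metrics at every site is what makes cross-site comparison possible: the per-site values can then be grouped by climate or land cover to look for systematic model behavior.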

In [None]:
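import glob           # used to find domain directories
import xarray as xr   # used to read netCDF simulation output

def discover_completed_domains():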
    """
    Discover all completed FLUXNET domain directories and their outputs
    """
    print(f"\n🔍 Discovering Completed FLUXNET Domains...")
    
    # Base data directory pattern
    data_dir_pattern = str(CONFLUENCE_DATA_DIR / "domain_*")
    
    # Find all domain directories
    domain_dirs = glob.glob(data_dir_pattern)
    
    print(f"   📁 Found {len(domain_dirs)} total domain directories")
    
    completed_domains = []
    
    for domain_dir in domain_dirs:
        domain_path = Path(domain_dir)
        domain_name = domain_path.name.replace('domain_', '')
        
        # Check if this is a FLUXNET domain (exact match against our selected sites,
        # consistent with the lookups used later in this step)
        if domain_name in set(selected_df['DOMAIN_NAME'].values):
            
            # Check for key output files
            shapefile_path = domain_path / "shapefiles" / "catchment" / f"{domain_name}_HRUs.shp"
            simulation_dir = domain_path / "simulations"
            
            domain_info = {
                'domain_name': domain_name,
                'domain_path': domain_path,
                'has_shapefile': shapefile_path.exists(),
                'shapefile_path': shapefile_path if shapefile_path.exists() else None,
                'has_simulations': simulation_dir.exists(),
                'simulation_path': simulation_dir if simulation_dir.exists() else None,
                'simulation_files': []
            }
            
            # Find simulation output files
            if simulation_dir.exists():
                nc_files = list(simulation_dir.glob("**/*.nc"))
                domain_info['simulation_files'] = nc_files
                domain_info['has_results'] = len(nc_files) > 0
            else:
                domain_info['has_results'] = False
            
            completed_domains.append(domain_info)
    
    print(f"   🎯 FLUXNET domains found: {len(completed_domains)}")
    print(f"   📊 Domains with shapefiles: {sum(1 for d in completed_domains if d['has_shapefile'])}")
    print(f"   📈 Domains with simulation results: {sum(1 for d in completed_domains if d['has_results'])}")
    
    return completed_domains

def create_domain_overview_map(completed_domains):
    """
    Create an overview map showing all domain locations and their completion status
    """
    print(f"\n🗺️  Creating Domain Overview Map...")
    
    # Create figure for overview map
    fig, axes = plt.subplots(2, 2, figsize=(20, 16))
    
    # Map 1: Global overview with completion status
    ax1 = axes[0, 0]
    
    # Plot all selected sites
    ax1.scatter(selected_df['longitude'], selected_df['latitude'], 
               c='lightgray', alpha=0.5, s=30, label='Selected sites', marker='o')
    
    # Plot completed domains with different colors for different completion levels
    for domain in completed_domains:
        domain_name = domain['domain_name']
        
        # Find corresponding site in selected_df
        site_row = selected_df[selected_df['DOMAIN_NAME'] == domain_name]
        
        if not site_row.empty:
            lat = site_row['latitude'].iloc[0]
            lon = site_row['longitude'].iloc[0]
            
            # Color based on completion status
            if domain['has_results']:
                color = 'green'
                label = 'Complete with results'
                marker = 's'
                size = 50
            elif domain['has_shapefile']:
                color = 'orange' 
                label = 'Shapefile only'
                marker = '^'
                size = 40
            else:
                color = 'red'
                label = 'Processing started'
                marker = 'v'
                size = 30
            
            ax1.scatter(lon, lat, c=color, s=size, marker=marker, alpha=0.8,
                       edgecolors='black', linewidth=0.5)
    
    ax1.set_xlabel('Longitude')
    ax1.set_ylabel('Latitude')
    ax1.set_title('FLUXNET Domain Processing Status Overview')
    ax1.grid(True, alpha=0.3)
    ax1.set_xlim(-180, 180)
    ax1.set_ylim(-60, 80)
    
    # Create custom legend from empty proxy artists drawn on ax1
    # (plt.scatter would add them to the current axes, which is the last subplot)
    legend_elements = [
        ax1.scatter([], [], c='green', s=50, marker='s', label='Complete with results'),
        ax1.scatter([], [], c='orange', s=40, marker='^', label='Shapefile generated'),
        ax1.scatter([], [], c='red', s=30, marker='v', label='Processing started'),
        ax1.scatter([], [], c='lightgray', s=30, marker='o', label='Selected sites')
    ]
    ax1.legend(handles=legend_elements, loc='lower left')
    
    # Map 2: Completion statistics by climate type
    ax2 = axes[0, 1]
    
    if 'KG' in selected_df.columns:
        # Count completion by climate type
        climate_completion = {}
        
        for domain in completed_domains:
            domain_name = domain['domain_name']
            site_row = selected_df[selected_df['DOMAIN_NAME'] == domain_name]
            
            if not site_row.empty:
                climate = site_row['KG'].iloc[0]
                
                if climate not in climate_completion:
                    climate_completion[climate] = {'total': 0, 'complete': 0, 'partial': 0}
                
                climate_completion[climate]['total'] += 1
                
                if domain['has_results']:
                    climate_completion[climate]['complete'] += 1
                elif domain['has_shapefile']:
                    climate_completion[climate]['partial'] += 1
        
        # Create stacked bar chart
        climates = list(climate_completion.keys())
        complete_counts = [climate_completion[c]['complete'] for c in climates]
        partial_counts = [climate_completion[c]['partial'] for c in climates]
        pending_counts = [climate_completion[c]['total'] - 
                         climate_completion[c]['complete'] - 
                         climate_completion[c]['partial'] for c in climates]
        
        x_pos = range(len(climates))
        
        ax2.bar(x_pos, complete_counts, label='Complete', color='green', alpha=0.7)
        ax2.bar(x_pos, partial_counts, bottom=complete_counts, 
               label='Partial', color='orange', alpha=0.7)
        ax2.bar(x_pos, pending_counts, 
               bottom=[c+p for c,p in zip(complete_counts, partial_counts)], 
               label='Pending', color='red', alpha=0.7)
        
        ax2.set_xticks(x_pos)
        ax2.set_xticklabels(climates, rotation=45, ha='right')
        ax2.set_ylabel('Number of Sites')
        ax2.set_title('Processing Status by Climate Type')
        ax2.legend()
        ax2.grid(True, alpha=0.3, axis='y')
    
    # Map 3: Domain area distribution
    ax3 = axes[1, 0]
    
    domain_areas = []
    for domain in completed_domains:
        domain_name = domain['domain_name']
        site_row = selected_df[selected_df['DOMAIN_NAME'] == domain_name]
        
        if not site_row.empty and 'Area_km2' in site_row.columns:
            area = site_row['Area_km2'].iloc[0]
            domain_areas.append(area)
    
    if domain_areas:
        ax3.hist(domain_areas, bins=15, color='skyblue', alpha=0.7, edgecolor='black')
        ax3.set_xlabel('Domain Area (km²)')
        ax3.set_ylabel('Number of Domains')
        ax3.set_title('Completed Domain Area Distribution')
        ax3.grid(True, alpha=0.3, axis='y')
        
        # Add statistics
        stats_text = f"Mean: {np.mean(domain_areas):.1f} km²\nMedian: {np.median(domain_areas):.1f} km²"
        ax3.text(0.98, 0.98, stats_text, transform=ax3.transAxes,
                bbox=dict(facecolor='white', alpha=0.8), fontsize=10,
                ha='right', va='top')
    
    # Panel 4: Processing progress summary (counts at each workflow stage)
    ax4 = axes[1, 1]
    
    # Summary statistics
    total_selected = len(selected_df)
    total_discovered = len(completed_domains)
    total_with_shapefiles = sum(1 for d in completed_domains if d['has_shapefile'])
    total_with_results = sum(1 for d in completed_domains if d['has_results'])
    
    categories = ['Selected', 'Processing\nStarted', 'Shapefiles\nGenerated', 'Results\nComplete']
    counts = [total_selected, total_discovered, total_with_shapefiles, total_with_results]
    colors = ['lightblue', 'yellow', 'orange', 'green']
    
    bars = ax4.bar(categories, counts, color=colors, alpha=0.7, edgecolor='black')
    
    # Add value labels on bars
    for bar, count in zip(bars, counts):
        ax4.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.5,
                str(count), ha='center', va='bottom', fontweight='bold')
    
    ax4.set_ylabel('Number of Sites')
    ax4.set_title('Large Sample Processing Progress')
    ax4.grid(True, alpha=0.3, axis='y')
    
    plt.suptitle('FLUXNET Large Sample Study - Domain Overview', 
                 fontsize=16, fontweight='bold')
    plt.tight_layout()
    
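    # Save the overview map (create the plots directory if it does not exist yet)
    overview_path = experiment_dir / 'plots' / 'domain_overview_map.png'
    overview_path.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(overview_path, dpi=300, bbox_inches='tight')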
    plt.show()
    
    print(f"✅ Domain overview map saved: {overview_path}")
    
    return total_selected, total_discovered, total_with_shapefiles, total_with_results

def extract_et_results_from_domains(completed_domains):
    """
    Extract ET simulation results from all completed domains
    """
    print(f"\n📊 Extracting ET Results from Completed Domains...")
    
    et_results = []
    processing_summary = {
        'total_domains': len(completed_domains),
        'domains_with_results': 0,
        'domains_with_et': 0,
        'failed_extractions': 0
    }
    
    for domain in completed_domains:
        if not domain['has_results']:
            continue
            
        domain_name = domain['domain_name']
        processing_summary['domains_with_results'] += 1
        
        try:
            print(f"   🔄 Processing {domain_name}...")
            
            # Find simulation output files
            nc_files = domain['simulation_files']
            
            # Look for daily or monthly output files
            daily_files = [f for f in nc_files if 'day' in f.name.lower()]
            monthly_files = [f for f in nc_files if 'month' in f.name.lower()]
            
            output_file = None
            if daily_files:
                output_file = daily_files[0]
            elif monthly_files:
                output_file = monthly_files[0]
            elif nc_files:
                output_file = nc_files[0]  # Use any available file
            
            if output_file is None:
                print(f"     ❌ No suitable output files found")
                processing_summary['failed_extractions'] += 1
                continue
            
            # Load the netCDF file
            ds = xr.open_dataset(output_file)
            
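            # Look for ET variables by substring match. Caution: the bare 'et' term
            # also matches names containing 'net' (e.g. net radiation), so check
            # the variable name printed below before trusting the result.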
            et_vars = [var for var in ds.data_vars 
                      if any(et_term in var.lower() 
                            for et_term in ['et', 'evap', 'latent', 'latheat'])]
            
            if not et_vars:
                print(f"     ⚠️  No ET variables found in {output_file.name}")
                processing_summary['failed_extractions'] += 1
                continue
            
            # Use the first ET variable found
            et_var = et_vars[0]
            print(f"     📈 Using ET variable: {et_var}")
            
            # Extract ET data
            et_data = ds[et_var]
            
            # Handle multi-dimensional data (take spatial mean if needed)
            if len(et_data.dims) > 1:
                spatial_dims = [dim for dim in et_data.dims if dim != 'time']
                if spatial_dims:
                    et_data = et_data.mean(dim=spatial_dims)
            
            # Convert to pandas Series (loads the values), then release the file handle
            et_series = et_data.to_pandas()
            ds.close()
            
            # Normalize the sign: negative values follow the SUMMA sign convention,
            # so flip the series to report ET as a positive flux
            if et_series.median() < 0:
                et_series = -et_series
            
            # Convert units to mm/day if needed
            if 'latent' in et_var.lower() or 'latheat' in et_var.lower():
                # Assume W/m²: 86400 s/day / 2.45e6 J/kg ≈ 0.0353 mm/day per W/m²
                et_series = et_series * 0.0353
            elif et_series.max() < 1:  # Heuristic: values this small are assumed to be kg/m²/s
                et_series = et_series * 86400  # 1 kg/m² of water = 1 mm, so this gives mm/day
            
            # Get site information
            site_row = selected_df[selected_df['DOMAIN_NAME'] == domain_name]
            
            if site_row.empty:
                print(f"     ⚠️  Site information not found for {domain_name}")
                continue
            
            # Store results
            result = {
                'domain_name': domain_name,
                'site_id': site_row['ID'].iloc[0] if 'ID' in site_row.columns else domain_name,
                'latitude': site_row['latitude'].iloc[0],
                'longitude': site_row['longitude'].iloc[0],
                'climate': site_row['KG'].iloc[0] if 'KG' in site_row.columns else 'Unknown',
                'landcover': site_row['Dominant_LC'].iloc[0] if 'Dominant_LC' in site_row.columns else 'Unknown',
                'et_timeseries': et_series,
                'et_mean': et_series.mean(),
                'et_std': et_series.std(),
                'et_min': et_series.min(),
                'et_max': et_series.max(),
                'data_period': f"{et_series.index.min()} to {et_series.index.max()}",
                'data_points': len(et_series),
                'et_variable': et_var,
                'output_file': str(output_file)
            }
            
            et_results.append(result)
            processing_summary['domains_with_et'] += 1
            
            print(f"     ✅ ET extracted: {result['et_mean']:.2f} ± {result['et_std']:.2f} mm/day")
            
        except Exception as e:
            print(f"     ❌ Error processing {domain_name}: {e}")
            processing_summary['failed_extractions'] += 1
    
    print(f"\n📊 ET Extraction Summary:")
    print(f"   Total domains: {processing_summary['total_domains']}")
    print(f"   Domains with results: {processing_summary['domains_with_results']}")
    print(f"   Successful ET extractions: {processing_summary['domains_with_et']}")
    print(f"   Failed extractions: {processing_summary['failed_extractions']}")
    
    return et_results, processing_summary

def load_fluxnet_observations():
    """
    Load FLUXNET observation data for comparison
    """
    print(f"\n📥 Loading FLUXNET Observation Data...")
    
    fluxnet_obs = {}
    obs_summary = {
        'sites_found': 0,
        'sites_with_et': 0,
        'total_observations': 0
    }
    
    # Look for processed FLUXNET data in domain directories
    for _, site in selected_df.iterrows():
        domain_name = site['DOMAIN_NAME']
        
        # Construct path to processed FLUXNET data
        obs_path = CONFLUENCE_DATA_DIR / f"domain_{domain_name}" / "observations" / "energy_fluxes" / "fluxnet" / "processed" / f"{domain_name}_fluxnet_processed.csv"
        
        if obs_path.exists():
            try:
                print(f"   📊 Loading {domain_name}...")
                
                obs_df = pd.read_csv(obs_path)
                obs_df['timestamp'] = pd.to_datetime(obs_df['timestamp'])
                obs_df.set_index('timestamp', inplace=True)
                
                obs_summary['sites_found'] += 1
                
                # Check for ET data
                if 'ET_from_LE_mm_per_day' in obs_df.columns:
                    et_obs = obs_df['ET_from_LE_mm_per_day'].dropna()
                    
                    if len(et_obs) > 0:
                        fluxnet_obs[domain_name] = {
                            'et_timeseries': et_obs,
                            'et_mean': et_obs.mean(),
                            'et_std': et_obs.std(),
                            'et_min': et_obs.min(),
                            'et_max': et_obs.max(),
                            'data_points': len(et_obs),
                            'data_period': f"{et_obs.index.min()} to {et_obs.index.max()}",
                            'latitude': site['latitude'],
                            'longitude': site['longitude'],
                            'climate': site['KG'] if 'KG' in site else 'Unknown',
                            'landcover': site['Dominant_LC'] if 'Dominant_LC' in site else 'Unknown'
                        }
                        
                        obs_summary['sites_with_et'] += 1
                        obs_summary['total_observations'] += len(et_obs)
                        
                        print(f"     ✅ ET obs: {et_obs.mean():.2f} ± {et_obs.std():.2f} mm/day ({len(et_obs)} points)")
                
            except Exception as e:
                print(f"     ❌ Error loading {domain_name}: {e}")
    
    print(f"\n📊 FLUXNET Observation Summary:")
    print(f"   Sites with observation files: {obs_summary['sites_found']}")
    print(f"   Sites with ET observations: {obs_summary['sites_with_et']}")
    print(f"   Total ET observations: {obs_summary['total_observations']}")
    
    return fluxnet_obs, obs_summary

def create_et_comparison_analysis(et_results, fluxnet_obs):
    """
    Create comprehensive ET comparison analysis between simulated and observed
    """
    print(f"\n📈 Creating ET Comparison Analysis...")
    
    # Find sites with both simulated and observed data
    common_sites = []
    
    for sim_result in et_results:
        domain_name = sim_result['domain_name']
        
        if domain_name in fluxnet_obs:
            # Align time periods
            sim_et = sim_result['et_timeseries']
            obs_et = fluxnet_obs[domain_name]['et_timeseries']
            
            # Find common time period
            common_start = max(sim_et.index.min(), obs_et.index.min())
            common_end = min(sim_et.index.max(), obs_et.index.max())
            
            if common_start < common_end:
                # Resample to daily and align
                sim_daily = sim_et.resample('D').mean().loc[common_start:common_end]
                obs_daily = obs_et.resample('D').mean().loc[common_start:common_end]
                
                # Remove NaN values
                valid_mask = ~(sim_daily.isna() | obs_daily.isna())
                sim_valid = sim_daily[valid_mask]
                obs_valid = obs_daily[valid_mask]
                
                if len(sim_valid) > 10:  # Need minimum data for meaningful comparison
                    
                    # Calculate performance metrics
                    rmse = np.sqrt(((obs_valid - sim_valid) ** 2).mean())
                    bias = (sim_valid - obs_valid).mean()
                    mae = np.abs(obs_valid - sim_valid).mean()
                    
                    # Correlation (an undefined value, e.g. for a constant series, is reported as 0.0)
                    try:
                        correlation = obs_valid.corr(sim_valid)
                        if pd.isna(correlation):
                            correlation = 0.0
                    except Exception:
                        correlation = 0.0
                    
                    # Nash-Sutcliffe Efficiency
                    if obs_valid.var() > 0:
                        nse = 1 - ((obs_valid - sim_valid) ** 2).sum() / ((obs_valid - obs_valid.mean()) ** 2).sum()
                    else:
                        nse = np.nan
                    
                    common_site = {
                        'domain_name': domain_name,
                        'latitude': sim_result['latitude'],
                        'longitude': sim_result['longitude'],
                        'climate': sim_result['climate'],
                        'landcover': sim_result['landcover'],
                        'sim_et': sim_valid,
                        'obs_et': obs_valid,
                        'sim_mean': sim_valid.mean(),
                        'obs_mean': obs_valid.mean(),
                        'rmse': rmse,
                        'bias': bias,
                        'mae': mae,
                        'correlation': correlation,
                        'nse': nse,
                        'n_points': len(sim_valid),
                        'common_period': f"{common_start.date()} to {common_end.date()}"
                    }
                    
                    common_sites.append(common_site)
                    
                    print(f"   ✅ {domain_name}: r={correlation:.3f}, RMSE={rmse:.2f}, Bias={bias:+.2f} ({len(sim_valid)} points)")
    
    print(f"\n📊 ET Comparison Summary:")
    print(f"   Sites with both sim and obs: {len(common_sites)}")
    
    if len(common_sites) == 0:
        print("   ⚠️  No sites with overlapping sim/obs data for comparison")
        return None
    
    # Create comprehensive comparison visualization
    n_sites = len(common_sites)
    
    # Figure 1: Overview comparison plots
    fig1, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # Scatter plot of all sites
    ax1 = axes[0, 0]
    
    all_obs = np.concatenate([site['obs_et'].values for site in common_sites])
    all_sim = np.concatenate([site['sim_et'].values for site in common_sites])
    
    ax1.scatter(all_obs, all_sim, alpha=0.5, s=10, c='blue')
    
    # 1:1 line
    min_val = min(all_obs.min(), all_sim.min())
    max_val = max(all_obs.max(), all_sim.max())
    ax1.plot([min_val, max_val], [min_val, max_val], 'k--', label='1:1 line')
    
    ax1.set_xlabel('Observed ET (mm/day)')
    ax1.set_ylabel('Simulated ET (mm/day)')
    ax1.set_title('All Sites: Simulated vs Observed ET')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Add overall statistics
    overall_corr = np.corrcoef(all_obs, all_sim)[0,1]
    overall_rmse = np.sqrt(np.mean((all_obs - all_sim)**2))
    overall_bias = np.mean(all_sim - all_obs)
    
    stats_text = f'r = {overall_corr:.3f}\nRMSE = {overall_rmse:.2f}\nBias = {overall_bias:+.2f}'
    ax1.text(0.05, 0.95, stats_text, transform=ax1.transAxes,
             bbox=dict(facecolor='white', alpha=0.8), fontsize=10, verticalalignment='top')
    
    # Performance metrics by climate
    ax2 = axes[0, 1]
    
    climate_stats = {}
    for site in common_sites:
        climate = site['climate']
        if climate not in climate_stats:
            climate_stats[climate] = {'correlations': [], 'rmses': [], 'biases': []}
        
        climate_stats[climate]['correlations'].append(site['correlation'])
        climate_stats[climate]['rmses'].append(site['rmse'])
        climate_stats[climate]['biases'].append(site['bias'])
    
    # Plot correlation by climate
    climates = list(climate_stats.keys())
    corr_means = [np.mean(climate_stats[c]['correlations']) for c in climates]
    corr_stds = [np.std(climate_stats[c]['correlations']) for c in climates]
    
    x_pos = range(len(climates))
    ax2.bar(x_pos, corr_means, yerr=corr_stds, capsize=5, alpha=0.7, color='skyblue')
    ax2.set_xticks(x_pos)
    ax2.set_xticklabels(climates, rotation=45, ha='right')
    ax2.set_ylabel('Correlation')
    ax2.set_title('ET Performance by Climate Type')
    ax2.grid(True, alpha=0.3, axis='y')
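    # Extend the lower limit if any climate class has a negative mean correlation
    ax2.set_ylim(min(0.0, min(corr_means)), 1)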
    
    # Bias distribution
    ax3 = axes[1, 0]
    
    all_biases = [site['bias'] for site in common_sites]
    ax3.hist(all_biases, bins=15, color='orange', alpha=0.7, edgecolor='black')
    ax3.axvline(x=0, color='red', linestyle='--', label='Zero bias')
    ax3.set_xlabel('Bias (mm/day)')
    ax3.set_ylabel('Number of Sites')
    ax3.set_title('Distribution of ET Bias')
    ax3.legend()
    ax3.grid(True, alpha=0.3, axis='y')
    
    # RMSE vs site characteristics
    ax4 = axes[1, 1]
    
    site_lats = [site['latitude'] for site in common_sites]
    site_rmses = [site['rmse'] for site in common_sites]
    
    ax4.scatter(site_lats, site_rmses, alpha=0.7, s=30, c='green')
    ax4.set_xlabel('Latitude')
    ax4.set_ylabel('RMSE (mm/day)')
    ax4.set_title('ET Performance vs Latitude')
    ax4.grid(True, alpha=0.3)
    
    plt.suptitle('FLUXNET Large Sample ET Comparison Analysis', 
                 fontsize=14, fontweight='bold')
    plt.tight_layout()
    
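    # Save comparison plot (create the plots directory if it does not exist yet)
    comparison_path = experiment_dir / 'plots' / 'et_comparison_analysis.png'
    comparison_path.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(comparison_path, dpi=300, bbox_inches='tight')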
    plt.show()
    
    print(f"✅ ET comparison analysis saved: {comparison_path}")
    
    # Figure 2: Spatial map with performance metrics
    fig2, axes = plt.subplots(1, 2, figsize=(20, 8))
    
    # Map 1: Correlation map
    ax1 = axes[0]
    
    lats = [site['latitude'] for site in common_sites]
    lons = [site['longitude'] for site in common_sites]
    corrs = [site['correlation'] for site in common_sites]
    
    scatter1 = ax1.scatter(lons, lats, c=corrs, cmap='RdYlBu', s=60, 
                          vmin=0, vmax=1, edgecolors='black', linewidth=0.5)
    
    ax1.set_xlabel('Longitude')
    ax1.set_ylabel('Latitude')
    ax1.set_title('ET Model Performance: Correlation')
    ax1.grid(True, alpha=0.3)
    ax1.set_xlim(-180, 180)
    ax1.set_ylim(-60, 80)
    
    # Add colorbar
    cbar1 = plt.colorbar(scatter1, ax=ax1)
    cbar1.set_label('Correlation')
    
    # Map 2: Bias map
    ax2 = axes[1]
    
    biases = [site['bias'] for site in common_sites]
    max_abs_bias = max(abs(min(biases)), abs(max(biases)))
    
    scatter2 = ax2.scatter(lons, lats, c=biases, cmap='RdBu_r', s=60,
                          vmin=-max_abs_bias, vmax=max_abs_bias, 
                          edgecolors='black', linewidth=0.5)
    
    ax2.set_xlabel('Longitude')
    ax2.set_ylabel('Latitude')
    ax2.set_title('ET Model Performance: Bias (Sim - Obs)')
    ax2.grid(True, alpha=0.3)
    ax2.set_xlim(-180, 180)
    ax2.set_ylim(-60, 80)
    
    # Add colorbar
    cbar2 = plt.colorbar(scatter2, ax=ax2)
    cbar2.set_label('Bias (mm/day)')
    
    plt.suptitle('FLUXNET Large Sample ET Performance - Spatial Distribution', 
                 fontsize=14, fontweight='bold')
    plt.tight_layout()
    
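    # Save spatial analysis (create the plots directory if it does not exist yet)
    spatial_path = experiment_dir / 'plots' / 'et_spatial_performance.png'
    spatial_path.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(spatial_path, dpi=300, bbox_inches='tight')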
    plt.show()
    
    print(f"✅ ET spatial performance map saved: {spatial_path}")
    
    return common_sites

# Execute Step 3 Analysis
print(f"\n🔍 Step 3.1: Domain Discovery and Overview")

# Discover completed domains
completed_domains = discover_completed_domains()

# Create domain overview map
total_selected, total_discovered, total_with_shapefiles, total_with_results = create_domain_overview_map(completed_domains)

print(f"\n📊 Step 3.2: ET Results Extraction")

# Extract ET results from simulations
et_results, et_processing_summary = extract_et_results_from_domains(completed_domains)

# Load FLUXNET observations
fluxnet_obs, obs_summary = load_fluxnet_observations()

print(f"\n📈 Step 3.3: ET Comparison Analysis")

# Create ET comparison analysis
if et_results and fluxnet_obs:
    common_sites = create_et_comparison_analysis(et_results, fluxnet_obs)
else:
    print("   ⚠️  Insufficient data for ET comparison analysis")
    common_sites = None

# Create final summary report
print(f"\n📋 Creating Final Large Sample Summary Report...")

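summary_report_path = experiment_dir / 'reports' / 'large_sample_final_report.txt'
summary_report_path.parent.mkdir(parents=True, exist_ok=True)  # ensure the reports directory exists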

with open(summary_report_path, 'w') as f:
    f.write("FLUXNET Large Sample Study - Final Analysis Report\n")
    f.write("=" * 55 + "\n\n")
    f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
    
    f.write("PROCESSING SUMMARY:\n")
    f.write(f"  Sites selected for analysis: {total_selected}\n")
    f.write(f"  Processing initiated: {total_discovered}\n")
    f.write(f"  Shapefiles generated: {total_with_shapefiles}\n")
    f.write(f"  Simulation results available: {total_with_results}\n")
    f.write(f"  ET extractions successful: {et_processing_summary['domains_with_et']}\n")
    f.write(f"  FLUXNET observations available: {obs_summary['sites_with_et']}\n")
    
    if common_sites:
        f.write(f"  Sites with sim/obs comparison: {len(common_sites)}\n\n")
        
        f.write("ET PERFORMANCE SUMMARY:\n")
        correlations = [site['correlation'] for site in common_sites]
        rmses = [site['rmse'] for site in common_sites]
        biases = [site['bias'] for site in common_sites]
        
        f.write(f"  Mean correlation: {np.mean(correlations):.3f} ± {np.std(correlations):.3f}\n")
        f.write(f"  Mean RMSE: {np.mean(rmses):.2f} ± {np.std(rmses):.2f} mm/day\n")
        f.write(f"  Mean bias: {np.mean(biases):+.2f} ± {np.std(biases):.2f} mm/day\n\n")
        
        f.write("BEST PERFORMING SITES (by correlation):\n")
        sorted_sites = sorted(common_sites, key=lambda x: x['correlation'], reverse=True)
        for rank, site in enumerate(sorted_sites[:5], start=1):
            f.write(f"  {rank}. {site['domain_name']}: r={site['correlation']:.3f}, RMSE={site['rmse']:.2f}\n")

print(f"✅ Final summary report saved: {summary_report_path}")

print(f"\n🎉 Step 3 Complete: Large Sample Output Analysis")
print(f"   📁 Results saved to: {experiment_dir}")
print(f"   🗺️  Domain overview: {total_with_results}/{total_selected} sites with results")
print(f"   📊 ET analysis: {len(common_sites) if common_sites else 0} sites with sim/obs comparison")
if common_sites:
    mean_r = np.mean([s['correlation'] for s in common_sites])
    print(f"   📈 Performance: Mean r = {mean_r:.3f}")
else:
    print("   📈 Performance: Awaiting more results")

print(f"\n✅ Large Sample FLUXNET Analysis Complete!")
print(f"   🌍 Multi-site comparative hydrology achieved")
print(f"   📊 Statistical patterns identified across environmental gradients")

## Summary

This tutorial demonstrated the scaling capabilities of CONFLUENCE, transitioning from single-domain modeling to large sample comparative hydrology. Using the global FLUXNET network as our example, we explored how CONFLUENCE's workflow efficiency enables systematic analysis across hundreds of sites with diverse environmental conditions.

The tutorial progressed through three key phases: comprehensive analysis of the FLUXNET dataset to understand its global coverage and environmental diversity; automated configuration generation and batch processing using the `run_watersheds_fluxnet.py` script to execute CONFLUENCE across multiple sites simultaneously; and multi-site output analysis to extract, compare, and validate evapotranspiration results across environmental gradients.
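
The per-site comparison in the third phase reduces to a handful of metrics computed on the dates that simulation and observation share. The sketch below is a minimal, self-contained illustration of those metrics (bias, MAE, RMSE, correlation, and Nash-Sutcliffe efficiency). The function name `et_metrics` and the synthetic series are assumptions for illustration only, not part of CONFLUENCE.

```python
import numpy as np
import pandas as pd


def et_metrics(sim: pd.Series, obs: pd.Series) -> dict:
    """Compare simulated and observed daily ET (mm/day) on their shared dates."""
    # Keep only dates present in both series, then drop rows with missing values
    paired = pd.concat({"sim": sim, "obs": obs}, axis=1, join="inner").dropna()
    err = paired["sim"] - paired["obs"]
    # Sum of squared deviations of the observations from their mean (NSE denominator)
    ss_obs = float(((paired["obs"] - paired["obs"].mean()) ** 2).sum())
    return {
        "n_points": len(paired),
        "bias": float(err.mean()),
        "mae": float(err.abs().mean()),
        "rmse": float(np.sqrt((err ** 2).mean())),
        "correlation": float(paired["sim"].corr(paired["obs"])),
        # NSE is undefined when the observations have no variance
        "nse": float(1 - (err ** 2).sum() / ss_obs) if ss_obs > 0 else float("nan"),
    }


# Synthetic example: the "simulation" is the observation plus a constant 0.5 mm/day
dates = pd.date_range("2020-01-01", periods=5, freq="D")
obs = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=dates)
sim = obs + 0.5
metrics = et_metrics(sim, obs)
print(metrics)
```

A constant offset leaves the correlation at 1 while lowering the NSE, which is why the analysis reports both: correlation captures timing and variability, whereas bias and NSE also penalize systematic over- or underestimation.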

This large sample approach moves beyond site-specific case studies toward statistical analysis, pattern recognition, and process understanding across many sites at once. Combining CONFLUENCE's standardized workflow with large sample methodologies supports comparative hydrology: researchers can distinguish universal hydrological processes from site-specific effects and evaluate model transferability across diverse climatic and ecological conditions.
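
One simple way to look at transferability is to group per-site metrics by an environmental attribute and compare the group summaries. The sketch below assumes the per-site results have been collected into a table with `domain_name`, `climate`, `correlation`, and `bias` columns (the same keys used for each entry of `common_sites`). The site names and numbers are hypothetical placeholders, not results from this study.

```python
import pandas as pd

# Hypothetical per-site metrics; values are illustrative, not study results
site_metrics = pd.DataFrame(
    {
        "domain_name": ["site_a", "site_b", "site_c", "site_d"],
        "climate": ["Dfc", "Dfc", "Csa", "Csa"],
        "correlation": [0.80, 0.60, 0.90, 0.70],
        "bias": [0.2, -0.4, 0.1, 0.3],
    }
)

# Summarize performance per climate class using named aggregation
by_climate = site_metrics.groupby("climate").agg(
    n_sites=("domain_name", "count"),
    mean_correlation=("correlation", "mean"),
    mean_bias=("bias", "mean"),
)
print(by_climate)
```

Reporting the number of sites alongside each mean matters here, because a class represented by only one or two sites says little about how the model behaves in that climate.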

The systematic processing of FLUXNET sites demonstrates how modern computational infrastructure and standardized modeling workflows can transform our understanding of hydrological processes from local observations to global patterns, advancing both theoretical understanding and practical applications in water resources management and climate science.

### Next Focus: Large Sample Experiments - NorSWE
**Ready to explore large sample snow simulations?** → **[Tutorial 04b: Large Sample Studies - NorSWE](./04b_large_sample_norswe.ipynb)**