# CONFLUENCE Tutorial - 8: Large Sample Studies (FLUXNET Multi-Site Analysis)

## Introduction

This tutorial concludes the CONFLUENCE series with large sample studies. Previous tutorials modeled individual domains, from points to continents; large sample studies use CONFLUENCE's automated workflow to analyze hundreds or thousands of sites systematically. Using the global FLUXNET network as our example, we show how to move from single-domain modeling to comparative hydrology across many sites.

### What are Large Sample Studies?

Large sample studies in hydrology involve systematic analysis across many sites, watersheds, or regions to:

- **Identify patterns**: Understand how hydrological processes vary across different environments
- **Test hypotheses**: Evaluate theoretical concepts across diverse conditions
- **Improve models**: Develop better parameterizations based on multi-site evidence
- **Quantify uncertainty**: Assess model performance and reliability across different settings
- **Enable comparative hydrology**: Compare hydrological responses across climates, landscapes, and scales

### The Scientific Revolution of Large Sample Hydrology

Large sample studies have revolutionized hydrology by moving beyond single-site case studies:

**Traditional Approach**: Intensive study of individual watersheds or sites
- Deep understanding of specific locations
- Limited generalizability
- Difficult to separate site-specific from universal processes

**Large Sample Approach**: Systematic analysis across many sites
- Identifies universal patterns and regional variations
- Enables statistical analysis of hydrological controls
- Supports development of general theories and models
- Quantifies uncertainty across different environments

### Why FLUXNET for Large Sample Studies?

The FLUXNET network provides an ideal framework for large sample hydrological analysis:

**Global Coverage**: 
- 900+ tower sites across all continents
- Diverse ecosystems: forests, grasslands, wetlands, croplands
- Multiple climate zones: tropical, temperate, boreal, arid
- Elevation range: sea level to high mountains

**Standardized Measurements**:
- Consistent eddy covariance methodology
- Quality-controlled data processing
- Standardized temporal resolution
- Comparable variables across sites

**Scientific Value**:
- Energy balance validation for land surface models
- Ecosystem-scale process understanding
- Climate-vegetation interactions
- Model benchmarking across diverse conditions

### From Single Sites to Large Samples

Our tutorial progression has prepared you for large sample studies:

| Tutorial | Scale | Sites | Purpose |
|----------|-------|-------|---------|
| 1-2 | Point | 1 | Process understanding |
| 3-5 | Watershed | 1 | Spatial integration |
| 6-7 | Regional/Continental | 1 | Large-scale hydrology |
| 8 | Multi-site | 100s | Comparative analysis |

### CONFLUENCE's Advantages for Large Sample Studies

CONFLUENCE's design makes it particularly well-suited for large sample analysis:

1. **Workflow Automation**: Standardized workflow reduces manual effort per site
2. **Consistent Methodology**: Same modeling approach across all sites ensures comparability
3. **Scalable Configuration**: Template-based configuration enables rapid site setup
4. **Reproducible Science**: Complete workflow documentation ensures reproducibility
5. **High-Performance Computing**: Parallel execution across multiple sites
6. **Standardized Outputs**: Consistent output formats facilitate multi-site analysis

### Technical Implementation

Large sample studies with CONFLUENCE involve several key components:

- **Site Selection**: Choose representative sites across environmental gradients
- **Configuration Generation**: Automatically create site-specific configurations
- **Batch Processing**: Run CONFLUENCE across multiple sites efficiently
- **Results Aggregation**: Collect and standardize outputs from all sites
- **Comparative Analysis**: Analyze patterns and relationships across sites

### Research Questions Addressed

Large sample studies enable investigation of questions impossible at single sites:

1. **Process Generalization**: Do hydrological processes scale consistently across environments?
2. **Climate Controls**: How do different climate variables control hydrological responses?
3. **Ecosystem Influences**: How do vegetation types affect water and energy balance?
4. **Model Performance**: Where do models perform well vs. poorly, and why?
5. **Parameter Transferability**: Can model parameters be transferred between similar sites?

### Methodological Considerations

Large sample studies require careful methodological choices:

**Site Selection Criteria**:
- Spatial distribution across environmental gradients
- Data quality and availability
- Representativeness of broader regions
- Temporal coverage consistency

**Standardization Approaches**:
- Consistent model configuration across sites
- Standardized evaluation metrics
- Comparable temporal periods
- Unified data processing protocols

**Analysis Strategies**:
- Statistical analysis of multi-site results
- Clustering sites by characteristics
- Regression analysis of controls
- Uncertainty quantification
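
One way to make the "standardized evaluation metrics" item concrete is to compute the same score at every site. The sketch below implements the Kling-Gupta efficiency (KGE), a metric widely used in hydrology. It is an illustration only: the function name `kling_gupta_efficiency` and the sample values are assumptions, not part of CONFLUENCE.

```python
import numpy as np

def kling_gupta_efficiency(sim, obs):
    """Kling-Gupta efficiency (Gupta et al., 2009) for paired series.

    KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2), where r is the
    linear correlation, alpha the ratio of standard deviations (sim/obs), and
    beta the ratio of means (sim/obs). A perfect simulation scores 1.
    """
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)

    # Keep only time steps where both series have data
    mask = ~(np.isnan(sim) | np.isnan(obs))
    sim, obs = sim[mask], obs[mask]

    # Treat the score as undefined for fewer than two points,
    # constant series, or a zero observed mean
    if sim.size < 2 or obs.std() == 0 or sim.std() == 0 or obs.mean() == 0:
        return np.nan

    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Made-up values purely for illustration
observed = [1.0, 2.0, 3.0, 4.0, np.nan]
perfect = kling_gupta_efficiency(observed, observed)
doubled = kling_gupta_efficiency([2.0, 4.0, 6.0, 8.0, 10.0], observed)
print(f"perfect: {perfect:.2f}, doubled: {doubled:.2f}")
```

Applying one such function to every site, over comparable periods, keeps scores comparable across the sample. Which metric suits a given study depends on the variable being evaluated.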

### Expected Outcomes

This tutorial demonstrates several key large sample capabilities:

1. **Multi-Site Configuration**: Automatically generate configurations for hundreds of sites
2. **Batch Execution**: Run CONFLUENCE across multiple sites efficiently
3. **Results Synthesis**: Aggregate and analyze multi-site model outputs
4. **Comparative Visualization**: Create plots showing patterns across sites
5. **Statistical Analysis**: Quantify relationships between site characteristics and model performance

### What You'll Learn

By completing this tutorial, you'll understand how to:

1. **Design large sample experiments** with appropriate site selection
2. **Automate configuration generation** for hundreds of sites
3. **Manage batch processing** of multiple CONFLUENCE runs
4. **Aggregate and analyze results** from multi-site experiments
5. **Visualize patterns** across environmental gradients
6. **Apply statistical methods** to understand hydrological controls

### Tutorial Structure

This tutorial demonstrates the complete large sample workflow:

1. **Experiment Design**: Define objectives and select FLUXNET sites
2. **Configuration Generation**: Create site-specific CONFLUENCE configurations
3. **Batch Processing**: Execute CONFLUENCE across multiple sites
4. **Results Collection**: Aggregate outputs from all successful runs
5. **Comparative Analysis**: Analyze patterns and relationships across sites
6. **Visualization**: Create plots showing multi-site results
7. **Statistical Summary**: Quantify patterns and uncertainties

### Scientific Impact

Large sample studies strengthen hydrological science in several ways:

- **Robust Conclusions**: Findings rest on evidence from many sites rather than a single case study
- **Universal Patterns**: Identify processes that transcend individual sites
- **Model Improvement**: Better parameterizations based on multi-site evidence
- **Uncertainty Quantification**: Understand model reliability across conditions
- **Predictive Capability**: Develop models that work in ungauged locations

### Tutorial Series Culmination

This tutorial represents the culmination of our CONFLUENCE journey:

- **Foundation**: Point-scale process understanding
- **Scaling**: Watershed to continental modeling
- **Application**: Large sample comparative hydrology

By working through large sample studies, you will gain the tools to conduct hydrological research that applies CONFLUENCE across multiple scales and environments, where systematic multi-site analysis supports both theoretical advances and practical applications.

The combination of CONFLUENCE's workflow efficiency with large sample methodologies opens new possibilities for understanding how hydrological processes vary across Earth's diverse environments - from individual flux towers to global patterns.

In [None]:
import sys
import os
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import subprocess
import yaml
from datetime import datetime
import seaborn as sns

# Add CONFLUENCE to path
confluence_path = Path('../').resolve()
sys.path.append(str(confluence_path))

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

print("Setup complete!")

In [None]:
# Configuration for the FLUXNET large sample experiment
experiment_config = {
    'experiment_name': 'fluxnet_tutorial',
    'max_sites': 5,
    'dry_run': False,
    'template_config': '../CONFLUENCE/0_config_files/config_point_template.yaml',
    'config_dir': '../CONFLUENCE/0_config_files/fluxnet',
    'fluxnet_script': '../CONFLUENCE/9_scripts/run_towers_fluxnet.py',
    'fluxnet_csv': 'fluxnet_transformed.csv'
}

# Create experiment directory
experiment_dir = Path(f"./experiments/{experiment_config['experiment_name']}")
experiment_dir.mkdir(parents=True, exist_ok=True)

# Save configuration
with open(experiment_dir / 'experiment_config.yaml', 'w') as f:
    yaml.dump(experiment_config, f)

print(f"Experiment configured: {experiment_config['experiment_name']}")

In [None]:
# Load FLUXNET sites data
fluxnet_df = pd.read_csv(experiment_config['fluxnet_csv'])

print(f"Loaded {len(fluxnet_df)} FLUXNET sites")
print("\nColumns in dataset:")
for col in fluxnet_df.columns:
    print(f"  - {col}")

# Display first few sites
print("\nFirst 5 sites:")
display(fluxnet_df[['ID', 'Watershed_Name', 'KG', 'Dominant_LC', 'Area_km2']].head())

In [None]:
# Extract coordinates from POUR_POINT_COORDS
coords = fluxnet_df['POUR_POINT_COORDS'].str.split('/', expand=True)
fluxnet_df['lat'] = coords[0].astype(float)
fluxnet_df['lon'] = coords[1].astype(float)

# Create global distribution plot
plt.figure(figsize=(15, 8))
plt.scatter(fluxnet_df['lon'], fluxnet_df['lat'], c='red', alpha=0.6)
plt.title('Global Distribution of FLUXNET Sites')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True, alpha=0.3)
plt.xlim(-180, 180)
plt.ylim(-60, 80)
plt.show()
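
The coordinate extraction above assumes every `POUR_POINT_COORDS` entry is a well-formed `lat/lon` string, and `astype(float)` raises on the first malformed row. A more forgiving variant is sketched below. The helper name `parse_pour_points` and the sample strings are hypothetical, chosen only to show the behaviour.

```python
import pandas as pd

def parse_pour_points(series):
    """Split 'lat/lon' strings into numeric columns and flag unusable rows."""
    parts = series.astype(str).str.split('/', expand=True)
    # reindex guarantees two columns even if no entry contains a '/'
    parts = parts.reindex(columns=[0, 1])
    out = pd.DataFrame({
        # errors='coerce' turns unparseable text into NaN instead of raising
        'lat': pd.to_numeric(parts[0], errors='coerce'),
        'lon': pd.to_numeric(parts[1], errors='coerce'),
    }, index=series.index)
    # NaN fails both range checks, so malformed rows end up flagged as invalid
    out['valid'] = out['lat'].between(-90, 90) & out['lon'].between(-180, 180)
    return out

# Made-up entries: two valid, one malformed, one with an impossible latitude
demo_coords = pd.Series(['46.8/9.9', '-35.7/148.2', 'bad', '95.0/10.0'])
parsed = parse_pour_points(demo_coords)
print(parsed)
```

Rows where `valid` is `False` can be reported and dropped before plotting or configuration generation, so one bad entry does not stop the whole batch.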

In [None]:
# Optional cell, disabled by default; remove the surrounding triple quotes to run it.
'''
# Select sites based on criteria: one site per climate type
climate_types = fluxnet_df['KG'].unique()

# Select one site from each climate type (up to max_sites)
selected_sites = []
for climate in climate_types[:experiment_config['max_sites']]:
    site = fluxnet_df[fluxnet_df['KG'] == climate].iloc[0]
    selected_sites.append(site)

selected_df = pd.DataFrame(selected_sites)

print(f"Selected {len(selected_df)} sites for processing:")
display(selected_df[['ID', 'Watershed_Name', 'KG', 'Dominant_LC']])
'''
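
The selection cell above takes the first site of the first `max_sites` climate classes, so the result depends on the row order of the CSV. The sketch below is one alternative that draws one site per class at random with a fixed seed. The function name `select_one_site_per_class` and the demo table are assumptions for illustration; only the `ID` and `KG` column names come from the dataset used above.

```python
import pandas as pd

def select_one_site_per_class(sites, max_sites, class_col='KG', seed=42):
    """Pick at most one site per class, capped at max_sites, reproducibly."""
    # Shuffle once so the site chosen within each class does not depend on file order
    shuffled = sites.sample(frac=1, random_state=seed)
    # head(1) keeps the first row of every class in the shuffled order
    one_per_class = shuffled.groupby(class_col).head(1)
    # With more classes than allowed sites, draw a random subset of classes
    if len(one_per_class) > max_sites:
        one_per_class = one_per_class.sample(n=max_sites, random_state=seed)
    return one_per_class.sort_values(class_col).reset_index(drop=True)

# Made-up site table with four climate classes
demo_sites = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'KG': ['Dfc', 'Dfc', 'Csa', 'Af', 'Af', 'BSk'],
})
selected_demo = select_one_site_per_class(demo_sites, max_sites=3)
all_classes = select_one_site_per_class(demo_sites, max_sites=10)
print(selected_demo)
```

Fixing the seed keeps the experiment reproducible while avoiding a systematic preference for whichever sites happen to be listed first.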

In [None]:
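# Disabled by default; remove the surrounding triple quotes to run this cell.
'''
# Generate configs for every site in the dataset
# (iterate over selected_df instead to restrict the run to the selected sites)
config_dir = Path(experiment_config['config_dir'])
config_dir.mkdir(parents=True, exist_ok=True)

# Directory holding run_towers_fluxnet.py, taken from the experiment configuration
scripts_dir = Path(experiment_config['fluxnet_script']).resolve().parent

# Helper code run in a child interpreter; values arrive via sys.argv,
# so site names containing quotes cannot break the generated code
helper_code = (
    "import sys\n"
    "sys.path.append(sys.argv[1])\n"
    "from run_towers_fluxnet import generate_config_file\n"
    "generate_config_file(*sys.argv[2:])\n"
)

generated_configs = []

for _, site in fluxnet_df.iterrows():
    site_name = site['Watershed_Name']

    # Create config file name
    config_path = config_dir / f"config_{site_name}.yaml"

    # Arguments after scripts_dir follow the order expected by generate_config_file:
    # template, output path, site name, pour point, bounding box
    cmd = [
        sys.executable, '-c', helper_code, str(scripts_dir),
        experiment_config['template_config'], str(config_path), site_name,
        site['POUR_POINT_COORDS'], site['BOUNDING_BOX_COORDS'],
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        generated_configs.append(config_path)
        print(f"Generated config for {site_name}")
    else:
        # Surface the error instead of silently skipping the site
        print(f"Failed to generate config for {site_name}: {result.stderr.strip()}")

print(f"\nGenerated {len(generated_configs)} configuration files")
'''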
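
The idea behind template-based configuration can be shown without the CONFLUENCE script: read a template, overwrite the few site-specific keys, and leave everything else alone. The sketch below is not the `generate_config_file` function used above. `fill_template` is a hypothetical helper, the template text is made up, and `DOMAIN_NAME` is assumed to be the key that holds the site name.

```python
import re

def fill_template(template_text, overrides):
    """Replace the value of each top-level 'KEY: value' line named in overrides."""
    remaining = dict(overrides)
    lines = []
    for line in template_text.splitlines():
        # Only unindented keys are considered; comments and nested entries pass through
        match = re.match(r'^([A-Za-z_][A-Za-z0-9_]*)\s*:', line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            lines.append(f"{key}: {remaining.pop(key)}")
        else:
            lines.append(line)
    # Fail loudly if the template lacks a key we meant to set
    if remaining:
        raise KeyError(f"Keys not found in template: {sorted(remaining)}")
    return "\n".join(lines) + "\n"

# Made-up template and site values, for illustration only
template = (
    "# settings shared by all sites stay untouched\n"
    "DOMAIN_NAME: placeholder\n"
    "POUR_POINT_COORDS: 0.0/0.0\n"
    "BOUNDING_BOX_COORDS: 0.0/0.0/0.0/0.0\n"
    "OTHER_SETTING: unchanged\n"
)
config_text = fill_template(template, {
    'DOMAIN_NAME': 'demo_site',
    'POUR_POINT_COORDS': '46.8/9.9',
    'BOUNDING_BOX_COORDS': '46.9/9.8/46.7/10.0',
})
print(config_text)
```

Working on the text line by line keeps the template's comments and key order intact, at the cost of dropping any inline comment on a replaced line. Raising on a missing key guards against a template that has drifted from what the experiment expects.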

In [None]:
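# Disabled by default; remove the surrounding triple quotes to run this cell.
'''
# Launch CONFLUENCE runs with the same interpreter that runs this notebook
cmd = [sys.executable, experiment_config['fluxnet_script']]

# In dry-run mode the script's confirmation prompt is answered with 'n'
if experiment_config['dry_run']:
    print("DRY RUN MODE - No jobs will be submitted")

print("Launching CONFLUENCE for FLUXNET sites...")

# Execute the script, answering its confirmation prompt on stdin
answer = 'n\n' if experiment_config['dry_run'] else 'y\n'
result = subprocess.run(cmd, input=answer, capture_output=True, text=True)

# Report failures instead of silently discarding stderr
if result.returncode != 0:
    print(f"Script exited with code {result.returncode}:\n{result.stderr[:500]}")

print("\nOutput:")
print((result.stdout[:500] + "...") if len(result.stdout) > 500 else result.stdout)
'''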
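
The launch cell above hands all sites to a single script. For small experiments on a workstation, one process per site can also be run concurrently and the failures collected afterwards. The sketch below shows that pattern under stated assumptions: `run_batch` is a hypothetical helper, and the demo commands are stand-ins that only print or exit, not real CONFLUENCE invocations.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_batch(commands, max_workers=2, timeout=None):
    """Run one command per site concurrently; return {site: (returncode, stderr)}."""
    def run_one(item):
        site, cmd = item
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            return site, (proc.returncode, proc.stderr.strip())
        except subprocess.TimeoutExpired:
            # A return code of None marks a run that exceeded the time limit
            return site, (None, 'timed out')

    # Threads suffice here because each worker only waits on a child process
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, commands.items()))

# Stand-in commands: one succeeds, one exits with a non-zero code
demo_commands = {
    'site_ok': [sys.executable, '-c', 'print("simulation finished")'],
    'site_fail': [sys.executable, '-c', 'import sys; sys.exit(3)'],
}
batch_results = run_batch(demo_commands)
failed_sites = sorted(site for site, (code, _) in batch_results.items() if code != 0)
print(f"Failed sites: {failed_sites}")
```

Recording the return code per site makes it possible to rerun only the failures. On a cluster, a job scheduler would normally take the place of the thread pool.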

In [None]:
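# Find completed FLUXNET simulations
# NOTE: this path is specific to the authors' system; point it at your own CONFLUENCE data directory
confluence_data_dir = Path("/work/comphyd_lab/data/CONFLUENCE_data")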
fluxnet_dir = confluence_data_dir / "fluxnet"

completed = []
if fluxnet_dir.exists():
    for domain_dir in fluxnet_dir.glob("domain_*"):
        site_name = domain_dir.name.replace("domain_", "")
        sim_dir = domain_dir / "simulations"
        
        if sim_dir.exists() and list(sim_dir.rglob("*.nc")):
            completed.append({
                'site_name': site_name,
                'sim_dir': sim_dir
            })

print(f"Completed simulations: {len(completed)}")
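
Counting completed runs does not show which sites are missing or why. The sketch below classifies every expected site as completed, started without output, or not started. It assumes the same `domain_<site>/simulations` layout as the cell above. The helper name `classify_runs` and the temporary directory contents are made up for illustration.

```python
import tempfile
from pathlib import Path

def classify_runs(fluxnet_dir, expected_sites):
    """Map each expected site to 'completed', 'no output', or 'not started'."""
    fluxnet_dir = Path(fluxnet_dir)
    status = {}
    for site in expected_sites:
        sim_dir = fluxnet_dir / f"domain_{site}" / "simulations"
        if not sim_dir.exists():
            status[site] = 'not started'
        elif any(sim_dir.rglob('*.nc')):
            # any() stops at the first NetCDF file instead of listing them all
            status[site] = 'completed'
        else:
            status[site] = 'no output'
    return status

# Build a throwaway directory tree that mimics three sites in different states
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    run_dir = root / 'domain_site_a' / 'simulations' / 'run_1'
    run_dir.mkdir(parents=True)
    (run_dir / 'output_day.nc').touch()
    (root / 'domain_site_b' / 'simulations').mkdir(parents=True)
    run_status = classify_runs(root, ['site_a', 'site_b', 'site_c'])

print(run_status)
```

In the notebook, the list of expected sites could come from the `Watershed_Name` column, provided the domain directory names match those values.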

In [None]:
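# Load and analyze model results
def load_summa_output(sim_dir, variable='scalarSWE'):
    """Return a time/value DataFrame for one variable from the first daily output file, or None."""
    import xarray as xr  # imported lazily so the notebook loads without xarray

    # Sort so the same file is chosen on every run
    output_files = sorted(sim_dir.rglob("*day*.nc"))
    if not output_files:
        return None

    # The context manager closes the file once the values are in memory
    with xr.open_dataset(output_files[0]) as ds:
        if variable not in ds.variables:
            return None
        return pd.DataFrame({
            'time': pd.to_datetime(ds['time'].values),
            # flatten() assumes a single spatial unit (point-scale run)
            'value': ds[variable].values.flatten()
        })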

# Summary Report
if completed:
    print("### FLUXNET Experiment Summary Report ###")
    print(f"Experiment Name: {experiment_config['experiment_name']}")
    print(f"Date: {datetime.now().strftime('%Y-%m-%d')}")
    print(f"Total Sites in Dataset: {len(fluxnet_df)}")
    print(f"Completed Simulations: {len(completed)}")

In [None]:
# Extract model results from all completed simulations and create histogram
if completed:
    # Dictionary to store average values for each site
    site_averages = {}
    
    # Variable to extract (using the one from the function definition as default)
    variable_name = 'scalarSWE'
    
    print(f"Extracting average {variable_name} values from all completed simulations...")
    
    # Loop through all completed simulations
    for site_info in completed:
        site_name = site_info['site_name']
        sim_dir = site_info['sim_dir']
        
        # Load data using the existing function
        data = load_summa_output(sim_dir, variable=variable_name)
        
        if data is not None:
            # Calculate average for this site
            site_avg = data['value'].mean()
            site_averages[site_name] = site_avg
            print(f"  - {site_name}: Average {variable_name} = {site_avg:.2f}")
        else:
            print(f"  - {site_name}: Could not extract {variable_name} data")
    
    # Create dataframe from the averages
    averages_df = pd.DataFrame({
        'site': list(site_averages.keys()),
        'average_value': list(site_averages.values())
    })
    
    # Save to CSV
    averages_csv = experiment_dir / f'site_averages_{variable_name}.csv'
    averages_df.to_csv(averages_csv, index=False)
    print(f"\nSaved site averages to {averages_csv}")
    
    # Create histogram of the averages
    plt.figure(figsize=(12, 6))
    sns.histplot(averages_df['average_value'], kde=True)
    plt.title(f'Distribution of Average {variable_name} Across FLUXNET Sites')
    plt.xlabel(f'Average {variable_name}')
    plt.ylabel('Count')
    plt.grid(alpha=0.3)
    
    # Save plot
    hist_path = experiment_dir / f'histogram_{variable_name}.png'
    plt.savefig(hist_path, dpi=300)
    plt.show()
    
    print(f"Histogram saved to {hist_path}")
    
    # Additional statistical summary
    print("\nStatistical Summary:")
    print(f"Number of sites: {len(site_averages)}")
    print(f"Mean across sites: {averages_df['average_value'].mean():.2f}")
    print(f"Median across sites: {averages_df['average_value'].median():.2f}")
    print(f"Min: {averages_df['average_value'].min():.2f}")
    print(f"Max: {averages_df['average_value'].max():.2f}")
else:
    print("No completed simulations found for analysis.")
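
A histogram pools all sites together, while the comparative questions raised in the introduction need the averages grouped by site characteristics. The sketch below joins per-site averages to site metadata and summarizes them by climate class. `summarize_by_class` is a hypothetical helper and the numbers are made up. It also assumes the `site` values match `Watershed_Name` in the metadata table.

```python
import pandas as pd

def summarize_by_class(averages, sites, site_col='Watershed_Name', class_col='KG'):
    """Join per-site averages to site metadata and summarize them per class."""
    merged = averages.merge(
        sites[[site_col, class_col]],
        left_on='site', right_on=site_col, how='left',
    )
    # Sites missing from the metadata get a NaN class and are dropped by groupby
    return (merged.groupby(class_col)['average_value']
                  .agg(['count', 'mean', 'median'])
                  .reset_index())

# Made-up averages and metadata for three sites in two climate classes
demo_averages = pd.DataFrame({
    'site': ['site_a', 'site_b', 'site_c'],
    'average_value': [10.0, 30.0, 5.0],
})
demo_metadata = pd.DataFrame({
    'Watershed_Name': ['site_a', 'site_b', 'site_c'],
    'KG': ['Dfc', 'Dfc', 'Csa'],
})
class_summary = summarize_by_class(demo_averages, demo_metadata)
print(class_summary)
```

Passing `class_col='Dominant_LC'` groups the sites by land cover instead. With only a handful of sites per class, the counts matter as much as the means when interpreting differences.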