# CONFLUENCE Tutorial - 04h: GGMN Large Sample Study (Groundwater Monitoring Network)

## Introduction
This tutorial demonstrates large sample groundwater modeling using the Global Groundwater Monitoring Network (GGMN) dataset. Building on the multi-site analysis framework established in previous tutorials, we apply CONFLUENCE to systematically evaluate groundwater simulation performance across diverse hydrogeological settings throughout North America.

## GGMN: A Critical Groundwater Observation Network
The Global Groundwater Monitoring Network is one of the most comprehensive collections of groundwater level observations available for hydrological model validation. The network covers diverse hydrogeological environments, from shallow alluvial aquifers to deep confined systems, and its stations span climatic gradients from arid to humid regions, capturing a wide range of groundwater-surface water interactions.

GGMN combines direct measurements of groundwater levels with complementary hydrogeological information, which together provide insight into subsurface water storage dynamics. Many sites have multi-decade records collected under standardized measurement protocols, making them well suited to systematic groundwater model evaluation.

## Scientific Importance of Groundwater Validation
Groundwater processes represent some of the most challenging aspects of hydrological modeling due to their subsurface complexity and long timescales. Groundwater systems exhibit complex flow patterns influenced by geological heterogeneity, while storage dynamics depend on aquifer properties and recharge patterns. The coupling between surface and groundwater systems creates feedback mechanisms that significantly influence streamflow generation and water availability.

## Learning Outcomes
This tutorial demonstrates systematic groundwater validation through large sample analysis. We show how to adapt CONFLUENCE configurations for groundwater monitoring sites, focus analysis on baseflow periods and storage dynamics, implement multi-variable validation comparing simulated and observed groundwater levels, and assess model performance across different hydrogeological settings.
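
As a rough illustration of what comparing simulated and observed groundwater levels can involve, the sketch below computes a mean offset, an RMSE, and a Pearson correlation. RMSE and correlation are computed on anomalies (each series minus its own mean); that choice is an assumption made here, since simulated and observed levels often sit on different datums. The function name `groundwater_level_metrics` and the sample values are hypothetical and not part of CONFLUENCE. Only the standard library is used so the sketch runs without the notebook's dependencies.

```python
import math


def groundwater_level_metrics(simulated, observed):
    """Compare two equally long, time-aligned groundwater level series.

    Hypothetical helper for illustration; not a CONFLUENCE function.
    """
    if len(simulated) != len(observed) or len(simulated) < 2:
        raise ValueError("series must be aligned and contain at least 2 values")

    n = len(simulated)
    sim_mean = sum(simulated) / n
    obs_mean = sum(observed) / n

    # Anomalies remove any constant datum offset between model and well record
    sim_anom = [s - sim_mean for s in simulated]
    obs_anom = [o - obs_mean for o in observed]

    anomaly_rmse = math.sqrt(
        sum((s - o) ** 2 for s, o in zip(sim_anom, obs_anom)) / n
    )

    sim_ss = sum(s ** 2 for s in sim_anom)
    obs_ss = sum(o ** 2 for o in obs_anom)
    if sim_ss == 0 or obs_ss == 0:
        correlation = float("nan")  # undefined when either series is constant
    else:
        cross = sum(s * o for s, o in zip(sim_anom, obs_anom))
        correlation = cross / math.sqrt(sim_ss * obs_ss)

    return {
        "mean_offset": sim_mean - obs_mean,
        "anomaly_rmse": anomaly_rmse,
        "correlation": correlation,
    }


# Made-up example values (metres), purely for demonstration
observed_levels = [10.0, 10.5, 11.0, 10.5, 10.0]
simulated_levels = [2.0, 2.5, 3.0, 2.5, 2.0]
metrics = groundwater_level_metrics(simulated_levels, observed_levels)
print(metrics)
```

In this made-up case the two series differ only by a constant offset, so the anomaly-based scores indicate a perfect match while `mean_offset` reports the datum difference separately.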


## Step 1: Large Sample Groundwater Study Design and Site Selection
This step establishes the foundation for large sample groundwater modeling using the GGMN observation network. It configures the experiment, loads the station database, and selects high-quality sites, showing how CONFLUENCE's workflow supports systematic groundwater process evaluation across diverse hydrogeological environments.

In [None]:
import sys
import os
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import subprocess
import yaml
from datetime import datetime
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style for groundwater visualization
plt.style.use('default')
sns.set_palette("viridis")
%matplotlib inline
confluence_path = Path('../').resolve()

# =============================================================================
# LARGE SAMPLE GROUNDWATER EXPERIMENTAL DESIGN CONFIGURATION
# =============================================================================

# Set directory paths
CONFLUENCE_CODE_DIR = confluence_path
CONFLUENCE_DATA_DIR = Path('/Users/darrieythorsson/compHydro/data/CONFLUENCE_data')  # ← Update this path

# Experiment configuration
experiment_config = {
    'ggmn_stations': 'ggmn_stations.csv',
    'ggmn_data_dir': '/path/to/ggmn/data',  # Update with actual GGMN data path
    'template_config': str(CONFLUENCE_CODE_DIR / '0_config_files' / 'config_point_template.yaml'),
    'output_dir': 'ggmn_output',
    'config_dir': 'ggmn_configs', 
    'base_path': str(CONFLUENCE_DATA_DIR / 'ggmn'),
    'min_completeness': 70.0,
    'min_records': 100,
    'max_stations': 50,
    'start_year': 2010,
    'end_year': 2020,
    'no_submit': True  # Set to False to actually submit jobs
}

experiment_dir = Path('ggmn_experiment_results')
experiment_dir.mkdir(exist_ok=True)

# Load groundwater configuration template
gw_config_path = CONFLUENCE_CODE_DIR / '0_config_files' / 'config_point_template.yaml'
with open(gw_config_path, 'r') as f:
    config_dict = yaml.safe_load(f)

# Update for groundwater tutorial-specific settings
config_updates = {
    'CONFLUENCE_CODE_DIR': str(CONFLUENCE_CODE_DIR),
    'CONFLUENCE_DATA_DIR': str(CONFLUENCE_DATA_DIR),
    'DOMAIN_NAME': 'ggmn_template',
    'EXPERIMENT_ID': 'run_1',
    'EXPERIMENT_TIME_START': '2015-01-01 01:00',
    'EXPERIMENT_TIME_END': '2020-12-31 23:00',
    'DOWNLOAD_USGS_GW': True,
    'ANALYSES': ['benchmarking', 'groundwater']
}

config_dict.update(config_updates)

# Save groundwater configuration template
ggmn_config_path = CONFLUENCE_CODE_DIR / '0_config_files' / 'config_ggmn_template.yaml'
with open(ggmn_config_path, 'w') as f:
    yaml.dump(config_dict, f, default_flow_style=False, sort_keys=False)

print(f"✅ GGMN template configuration saved: {ggmn_config_path}")

# =============================================================================
# LOAD AND EXAMINE GGMN GROUNDWATER STATIONS DATASET
# =============================================================================

print(f"\n💧 Loading GGMN Groundwater Station Database...")

# Load or create demonstration GGMN dataset
try:
    ggmn_df = pd.read_csv('ggmn_stations.csv')
    print(f"✅ Successfully loaded GGMN database: {len(ggmn_df)} groundwater stations available")
except FileNotFoundError:
    print(f"⚠️  GGMN database not found, creating demonstration dataset...")
    
    # Create demonstration GGMN dataset for tutorial
    np.random.seed(42)
    n_stations = 200
    
    # Generate realistic North American groundwater station locations
    regions = [
        {'name': 'Great_Plains', 'lat_range': (35, 50), 'lon_range': (-105, -95), 'n': 50},
        {'name': 'Eastern_US', 'lat_range': (30, 45), 'lon_range': (-85, -70), 'n': 40},
        {'name': 'Western_US', 'lat_range': (32, 48), 'lon_range': (-125, -105), 'n': 35},
        {'name': 'Canadian_Prairies', 'lat_range': (49, 60), 'lon_range': (-115, -95), 'n': 30},
        {'name': 'Southwest_US', 'lat_range': (25, 40), 'lon_range': (-115, -100), 'n': 25},
        {'name': 'Other_NA', 'lat_range': (25, 65), 'lon_range': (-140, -60), 'n': 20}
    ]
    
    stations_data = []
    station_id = 1
    
    for region in regions:
        for i in range(region['n']):
            lat = np.random.uniform(region['lat_range'][0], region['lat_range'][1])
            lon = np.random.uniform(region['lon_range'][0], region['lon_range'][1])
            
            # Well depth based on region (deeper in plains, shallower in mountains)
            if region['name'] in ['Great_Plains', 'Canadian_Prairies']:
                well_depth = np.random.uniform(10, 150)
            elif region['name'] == 'Southwest_US':
                well_depth = np.random.uniform(5, 300)  # High variability in arid regions
            else:
                well_depth = np.random.uniform(5, 80)
            
            # Data completeness (varies by accessibility and monitoring program)
            base_completeness = 85 - np.random.uniform(0, 20)
            data_completeness = max(30, np.random.normal(base_completeness, 10))
            
            # Aquifer type
            aquifer_types = ['Alluvial', 'Bedrock', 'Glacial', 'Volcanic', 'Confined', 'Unconfined']
            aquifer_type = np.random.choice(aquifer_types)
            
            # Create station entry
            station = {
                'station_id': f"GGMN_{station_id:05d}",
                'station_name': f"{region['name']}_GW_{i+1:03d}",
                'latitude': round(lat, 4),
                'longitude': round(lon, 4),
                'well_depth': round(well_depth, 1),
                'aquifer_type': aquifer_type,
                'data_completeness': round(min(95, max(30, data_completeness)), 1),
                'record_count': int(np.random.uniform(50, 3000)),
                'region': region['name']
            }
            
            # Add CONFLUENCE formatting
            buffer = 0.05
            station['BOUNDING_BOX_COORDS'] = f"{lat + buffer}/{lon - buffer}/{lat - buffer}/{lon + buffer}"
            station['POUR_POINT_COORDS'] = f"{lat}/{lon}"
            station['Watershed_Name'] = station['station_id'].replace(' ', '_')
            
            stations_data.append(station)
            station_id += 1
    
    ggmn_df = pd.DataFrame(stations_data)
    ggmn_df.to_csv('ggmn_stations.csv', index=False)
    print(f"✅ Created demonstration GGMN dataset: {len(ggmn_df)} stations")

# Display basic dataset information
print(f"\n📊 Dataset Overview:")
print(f"  Total groundwater stations: {len(ggmn_df)}")
print(f"  Well depth range: {ggmn_df['well_depth'].min():.1f}m to {ggmn_df['well_depth'].max():.1f}m")
print(f"  Data completeness: {ggmn_df['data_completeness'].mean():.1f}% ± {ggmn_df['data_completeness'].std():.1f}%")

# =============================================================================
# GROUNDWATER-SPECIFIC DATASET CHARACTERISTICS ANALYSIS
# =============================================================================

print(f"\n💧 Analyzing Groundwater Dataset Characteristics...")

# Well depth categories
depth_zones = [
    (0, 10, 'Shallow (<10m)'),
    (10, 30, 'Intermediate (10-30m)'),
    (30, 100, 'Deep (30-100m)'),
    (100, 1000, 'Very Deep (>100m)')
]

ggmn_df['depth_category'] = 'Unknown'
for min_depth, max_depth, category in depth_zones:
    mask = (ggmn_df['well_depth'] >= min_depth) & (ggmn_df['well_depth'] < max_depth)
    ggmn_df.loc[mask, 'depth_category'] = category

depth_counts = ggmn_df['depth_category'].value_counts()
print(f"  Well depth categories: {len(depth_counts)}")
print(f"    Most common: {depth_counts.index[0]} ({depth_counts.iloc[0]} wells)")

# Aquifer type analysis
if 'aquifer_type' in ggmn_df.columns:
    aquifer_counts = ggmn_df['aquifer_type'].value_counts()
    print(f"  Aquifer types: {len(aquifer_counts)}")
    print(f"    Most common: {aquifer_counts.index[0]} ({aquifer_counts.iloc[0]} wells)")

# Regional distribution
if 'region' in ggmn_df.columns:
    region_counts = ggmn_df['region'].value_counts()
    print(f"  Regions: {len(region_counts)}")
    print(f"    Most sampled: {region_counts.index[0]} ({region_counts.iloc[0]} wells)")

# =============================================================================
# GROUNDWATER DATASET VISUALIZATION
# =============================================================================

print(f"\n📈 Creating Groundwater Dataset Overview Visualization...")

# Create comprehensive groundwater dataset overview
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. North American groundwater station distribution map
ax1 = axes[0, 0]
scatter = ax1.scatter(ggmn_df['longitude'], ggmn_df['latitude'], 
                     c=ggmn_df['well_depth'], cmap='plasma', 
                     alpha=0.7, s=30, edgecolors='black', linewidth=0.3)
ax1.set_xlabel('Longitude')
ax1.set_ylabel('Latitude')
ax1.set_title(f'GGMN Groundwater Station Distribution\n({len(ggmn_df)} wells)')
ax1.grid(True, alpha=0.3)
ax1.set_xlim(-140, -60)
ax1.set_ylim(25, 65)

cbar = plt.colorbar(scatter, ax=ax1)
cbar.set_label('Well Depth (m)')

# 2. Well depth distribution
ax2 = axes[0, 1]
depth_counts = ggmn_df['depth_category'].value_counts()
bars = ax2.bar(range(len(depth_counts)), depth_counts.values, 
               color='lightblue', alpha=0.7, edgecolor='black')
ax2.set_xticks(range(len(depth_counts)))
ax2.set_xticklabels(depth_counts.index, rotation=45, ha='right')
ax2.set_ylabel('Number of Wells')
ax2.set_title('Wells by Depth Category')
ax2.grid(True, alpha=0.3, axis='y')

for bar, count in zip(bars, depth_counts.values):
    ax2.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 1,
            str(count), ha='center', va='bottom', fontweight='bold')

# 3. Aquifer type distribution
ax3 = axes[0, 2]
if 'aquifer_type' in ggmn_df.columns:
    aquifer_counts = ggmn_df['aquifer_type'].value_counts()
    colors = plt.cm.Set3(np.linspace(0, 1, len(aquifer_counts)))
    bars = ax3.bar(range(len(aquifer_counts)), aquifer_counts.values, 
                   color=colors, alpha=0.7, edgecolor='black')
    ax3.set_xticks(range(len(aquifer_counts)))
    ax3.set_xticklabels(aquifer_counts.index, rotation=45, ha='right')
    ax3.set_ylabel('Number of Wells')
    ax3.set_title('Wells by Aquifer Type')
    ax3.grid(True, alpha=0.3, axis='y')

# 4. Well depth vs latitude
ax4 = axes[1, 0]
ax4.scatter(ggmn_df['latitude'], ggmn_df['well_depth'], 
           alpha=0.6, s=25, c='green', edgecolors='black', linewidth=0.2)
ax4.set_xlabel('Latitude (°N)')
ax4.set_ylabel('Well Depth (m)')
ax4.set_title('Well Depth vs Latitude')
ax4.grid(True, alpha=0.3)

# 5. Data quality assessment
ax5 = axes[1, 1]
ax5.scatter(ggmn_df['data_completeness'], ggmn_df['record_count'], 
           alpha=0.6, s=25, c='orange', edgecolors='black', linewidth=0.2)
ax5.set_xlabel('Data Completeness (%)')
ax5.set_ylabel('Record Count')
ax5.set_title('Data Quality Assessment')
ax5.grid(True, alpha=0.3)
ax5.axvline(x=70, color='red', linestyle='--', alpha=0.7, label='70% threshold')
ax5.legend()

# 6. Regional distribution
ax6 = axes[1, 2]
if 'region' in ggmn_df.columns:
    region_counts = ggmn_df['region'].value_counts()
    wedges, texts, autotexts = ax6.pie(region_counts.values, labels=region_counts.index, 
                                      autopct='%1.1f%%', startangle=90)
    ax6.set_title('Wells by Region')
    
    for autotext in autotexts:
        autotext.set_fontweight('bold')

plt.suptitle('GGMN Groundwater Monitoring Network - Dataset Overview', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# =============================================================================
# FILTER HIGH-QUALITY STATIONS FOR ANALYSIS
# =============================================================================

print(f"\n🔍 Filtering High-Quality Groundwater Stations...")

# Apply quality filters
quality_mask = (
    (ggmn_df['data_completeness'] >= experiment_config['min_completeness']) &
    (ggmn_df['record_count'] >= experiment_config['min_records'])
)

complete_stations = ggmn_df[quality_mask].copy()

if len(complete_stations) > experiment_config['max_stations']:
    # Select stations with highest data quality
    complete_stations = complete_stations.sort_values(
        ['data_completeness', 'record_count'], 
        ascending=False
    ).head(experiment_config['max_stations'])

print(f"✅ Selected {len(complete_stations)} high-quality groundwater stations")
print(f"  Quality criteria: ≥{experiment_config['min_completeness']}% completeness, ≥{experiment_config['min_records']} records")
print(f"  Mean data completeness: {complete_stations['data_completeness'].mean():.1f}%")
print(f"  Mean record count: {complete_stations['record_count'].mean():.0f}")

# Regional distribution of selected stations
if 'region' in complete_stations.columns:
    selected_regions = complete_stations['region'].value_counts()
    print(f"\n🗺️  Selected stations by region:")
    for region, count in selected_regions.items():
        print(f"    {region}: {count} wells")

print(f"\n✅ Step 1 Complete: GGMN Dataset Analysis and Site Selection")
print(f"   📊 Dataset loaded: {len(ggmn_df)} total groundwater monitoring wells")
print(f"   🎯 High-quality selection: {len(complete_stations)} wells for analysis")
print(f"   🌍 Geographic coverage: North American groundwater monitoring network")
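
The `BOUNDING_BOX_COORDS` and `POUR_POINT_COORDS` strings built in the cell above can be factored into a small helper. The sketch below mirrors the ordering used there (`lat_max/lon_min/lat_min/lon_max`); the function name `station_extent`, the rounding to four decimals, and the coordinate range check are additions for illustration and are not part of CONFLUENCE.

```python
def station_extent(lat, lon, buffer=0.05, ndigits=4):
    """Return (bounding_box, pour_point) strings for a point-scale domain.

    Hypothetical helper; the field ordering follows the tutorial cell above.
    """
    if not -90 <= lat <= 90 or not -180 <= lon <= 180:
        raise ValueError(f"coordinates out of range: lat={lat}, lon={lon}")

    lat_max = round(lat + buffer, ndigits)
    lat_min = round(lat - buffer, ndigits)
    lon_min = round(lon - buffer, ndigits)
    lon_max = round(lon + buffer, ndigits)

    # Order: north edge / west edge / south edge / east edge
    bounding_box = f"{lat_max}/{lon_min}/{lat_min}/{lon_max}"
    pour_point = f"{round(lat, ndigits)}/{round(lon, ndigits)}"
    return bounding_box, pour_point


bbox, pour = station_extent(51.1722, -115.5717)
print(bbox)
print(pour)
```

Rounding before formatting keeps the strings short and avoids floating-point noise such as long trailing digits in the generated configuration files.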

## Step 2: Large Sample Groundwater Modeling Execution
Execute systematic groundwater modeling across diverse hydrogeological environments using the GGMN station selection. This demonstrates CONFLUENCE's capability for large sample groundwater process validation.
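
Driving a batch script from a notebook comes down to building an argument list and capturing the child process output. The sketch below shows that pattern with `subprocess.run` and `sys.executable`, using an inline `-c` command as a stand-in for a real script. The helper names `build_cli_args` and `run_command` are illustrative assumptions, and the option names are examples only; they do not document the interface of `run_ggmn.py`.

```python
import subprocess
import sys


def build_cli_args(options, flags=()):
    """Turn {'name': value} pairs into ['--name', 'value', ...] plus bare flags.

    Hypothetical helper for illustration.
    """
    args = []
    for name, value in options.items():
        if value is not None:  # skip options that were left unset
            args.extend([f"--{name}", str(value)])
    args.extend(f"--{flag}" for flag in flags)
    return args


def run_command(command):
    """Run a command and return (success, stdout, stderr)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout, result.stderr


cli_args = build_cli_args(
    {"max_stations": 50, "min_completeness": 70.0, "start_year": None},
    flags=["no_submit"],
)
print(cli_args)

# Stand-in child process that simply echoes the arguments it received
echo_script = "import sys; print(' '.join(sys.argv[1:]))"
ok, out, err = run_command([sys.executable, "-c", echo_script, *cli_args])
print(ok, out.strip())
```

Using `sys.executable` ensures the child process runs in the same Python environment as the notebook kernel, which matters when dependencies are installed in a virtual or conda environment.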

In [None]:
def run_ggmn_script_from_notebook():
    """
    Execute the run_ggmn.py script from within the notebook
    """
    print(f"\n💧 Executing GGMN Large Sample Groundwater Processing Script...")
    
    script_path = "./run_ggmn.py"
    
    if not Path(script_path).exists():
        print(f"❌ Script not found: {script_path}")
        print(f"   Please ensure run_ggmn.py is in the current directory")
        return False
    
    print(f"   📝 Script location: {script_path}")
    print(f"   🎯 Target sites: {len(complete_stations)} GGMN stations")
    print(f"   ⏰ Processing started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    
    try:
        # Prepare script arguments
        script_args = [
            sys.executable, script_path,  # run with the notebook's own interpreter
            '--ggmn_stations', experiment_config['ggmn_stations'],
            '--ggmn_data_dir', experiment_config['ggmn_data_dir'],
            '--template_config', experiment_config['template_config'],
            '--output_dir', experiment_config['output_dir'],
            '--config_dir', experiment_config['config_dir'],
            '--min_completeness', str(experiment_config['min_completeness']),
            '--min_records', str(experiment_config['min_records']),
            '--max_stations', str(experiment_config['max_stations']),
            '--base_path', experiment_config['base_path']
        ]
        
        # Add optional parameters
        if experiment_config.get('start_year'):
            script_args.extend(['--start_year', str(experiment_config['start_year'])])
        if experiment_config.get('end_year'):
            script_args.extend(['--end_year', str(experiment_config['end_year'])])
        
        if experiment_config.get('no_submit', False):
            script_args.append('--no_submit')
        
        print(f"   🔧 Key arguments: stations={experiment_config['max_stations']}, quality≥{experiment_config['min_completeness']}%")
        
        # Execute script
        process = subprocess.Popen(
            script_args,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,  # decode output as str (alias of universal_newlines=True)
            bufsize=1
        )
        
        # Handle input for job submission
        if not experiment_config.get('no_submit', False):
            stdout, stderr = process.communicate(input='y\n')
        else:
            stdout, stderr = process.communicate()
        
        # Display output
        if stdout:
            print("\n📋 Script Output:")
            key_lines = [line for line in stdout.split('\n') if any(keyword in line.lower() for keyword in 
                        ['found', 'processing', 'generated', 'submitted', 'complete', 'error', 'success'])]
            for line in key_lines[:10]:  # Show first 10 relevant lines
                if line.strip():
                    print(f"   {line}")
            if len(key_lines) > 10:
                print(f"   ... and {len(key_lines) - 10} more lines")
        
        if stderr and len(stderr.strip()) > 0:
            print("\n⚠️  Script Messages:")
            for line in stderr.split('\n')[:5]:  # Show first 5 error lines
                if line.strip():
                    print(f"   {line}")
        
        if process.returncode == 0:
            print(f"\n✅ GGMN processing script completed successfully")
            return True
        else:
            print(f"\n❌ Script failed with return code: {process.returncode}")
            return False
            
    except FileNotFoundError:
        print(f"❌ Python or script not found. Please check paths.")
        return False
    except Exception as e:
        print(f"❌ Error running script: {e}")
        return False

def create_manual_ggmn_configs():
    """
    Create CONFLUENCE config files manually if script is not available
    """
    print(f"\n🔧 Creating CONFLUENCE Configuration Files for GGMN Stations...")
    
    config_dir = Path(experiment_config['config_dir'])
    config_dir.mkdir(exist_ok=True)
    
    configs_created = 0
    
    for idx, station in complete_stations.iterrows():
        station_id = station['station_id']
        domain_name = station['Watershed_Name']
        pour_point = station['POUR_POINT_COORDS']
        bounding_box = station['BOUNDING_BOX_COORDS']
        
        # Load template config as a dictionary so the updates do not depend on
        # the template's exact default values or quoting
        with open(experiment_config['template_config'], 'r') as f:
            station_config = yaml.safe_load(f)
        
        # Update for this station
        station_config['DOMAIN_NAME'] = domain_name
        station_config['POUR_POINT_COORDS'] = pour_point
        station_config['BOUNDING_BOX_COORDS'] = bounding_box
        
        # Enable groundwater settings
        station_config['DOWNLOAD_USGS_GW'] = True
        station_config['USGS_STATION'] = station_id
        
        # Save config file
        config_path = config_dir / f"config_{domain_name}.yaml"
        with open(config_path, 'w') as f:
            yaml.dump(station_config, f, default_flow_style=False, sort_keys=False)
        
        configs_created += 1
        
        if configs_created <= 3:  # Print first few for verification
            print(f"   ✅ Created config for {domain_name}: {config_path}")
    
    print(f"\n✅ Created {configs_created} CONFLUENCE configuration files")
    print(f"   📁 Configuration directory: {config_dir}")
    
    return configs_created

# Try to execute the GGMN script
script_success = run_ggmn_script_from_notebook()

# If script fails or is not available, create configs manually
if not script_success:
    print(f"\n🔄 Script execution failed or unavailable. Creating configurations manually...")
    configs_created = create_manual_ggmn_configs()
    
    print(f"\n📋 Manual Configuration Summary:")
    print(f"   ⚙️  Configuration files created: {configs_created}")
    print(f"   🎯 Ready for CONFLUENCE execution")
    print(f"   📁 Next step: Submit jobs using created configurations")
    
    if experiment_config.get('no_submit', True):
        print(f"\n💡 To submit jobs:")
        print(f"   1. Review configs in: {experiment_config['config_dir']}")
        print(f"   2. Submit individual jobs: python CONFLUENCE.py --config path/to/config.yaml")
        print(f"   3. Or use batch submission scripts")

print(f"\n✅ Step 2 Complete: GGMN Large Sample Processing Setup")
print(f"   🎯 Stations configured: {len(complete_stations)}")
print(f"   ⚙️  Processing approach: {'Script execution' if script_success else 'Manual configuration'}")
print(f"   📊 Ready for groundwater simulation and validation")

## Step 3: Multi-Site Groundwater Validation and Analysis
This step validates groundwater simulations across sites using GGMN observations. It first discovers completed domains and summarizes their processing status, then evaluates groundwater processes and model performance across hydrogeological settings.
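
Domain discovery reduces to checking which sub-directories and output files exist for each site. The sketch below classifies one domain directory with `pathlib`, following the `simulations` and `observations/groundwater` layout used in this tutorial. The function name `classify_domain` and the status labels are hypothetical, and the directory tree is a throwaway one created only to exercise the function.

```python
import tempfile
from pathlib import Path


def classify_domain(domain_path):
    """Label a domain directory by which outputs are present.

    Hypothetical helper for illustration; not a CONFLUENCE function.
    """
    domain_path = Path(domain_path)
    sim_dir = domain_path / "simulations"

    # A domain has results only if at least one NetCDF file exists
    has_results = sim_dir.is_dir() and any(sim_dir.glob("**/*.nc"))
    has_observations = (domain_path / "observations" / "groundwater").is_dir()

    if has_results and has_observations:
        return "ready_for_validation"
    if has_results:
        return "simulation_only"
    if has_observations:
        return "observations_only"
    return "pending"


# Build a temporary directory tree and classify it before and after
# a (placeholder) simulation output file appears
with tempfile.TemporaryDirectory() as tmp:
    domain = Path(tmp) / "domain_GGMN_00001"
    (domain / "observations" / "groundwater").mkdir(parents=True)
    status_before = classify_domain(domain)

    run_dir = domain / "simulations" / "run_1"
    run_dir.mkdir(parents=True)
    (run_dir / "output.nc").touch()
    status_after = classify_domain(domain)

print(status_before, "->", status_after)
```

Requiring an actual `.nc` file, rather than just the `simulations` directory, distinguishes runs that produced output from runs that were only set up.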

In [None]:
import glob
import xarray as xr

def discover_completed_groundwater_domains():
    """
    Discover completed GGMN domain directories and groundwater outputs
    """
    print(f"\n🔍 Discovering Completed GGMN Groundwater Modeling Domains...")
    
    base_path = Path(experiment_config['base_path'])
    domain_pattern = str(base_path / "domain_*")
    domain_dirs = glob.glob(domain_pattern)
    
    print(f"   📁 Found {len(domain_dirs)} total domain directories")
    
    completed_domains = []
    
    for domain_dir in domain_dirs:
        domain_path = Path(domain_dir)
        domain_name = domain_path.name.replace('domain_', '')
        
        # Check if this is a GGMN domain (exact name match, consistent with the lookups below)
        if domain_name in complete_stations['Watershed_Name'].values:
            
            domain_info = {
                'domain_name': domain_name,
                'domain_path': domain_path,
                'has_shapefile': (domain_path / "shapefiles" / "catchment").exists(),
                'has_simulations': (domain_path / "simulations").exists(),
                'has_observations': (domain_path / "observations" / "groundwater").exists(),
                'simulation_files': [],
                'groundwater_obs_file': None
            }
            
            # Find simulation files
            if domain_info['has_simulations']:
                sim_dir = domain_path / "simulations"
                nc_files = list(sim_dir.glob("**/*.nc"))
                domain_info['simulation_files'] = nc_files
                domain_info['has_results'] = len(nc_files) > 0
            else:
                domain_info['has_results'] = False
            
            # Find groundwater observation files
            if domain_info['has_observations']:
                gw_dir = domain_path / "observations" / "groundwater" / "raw_data"
                if gw_dir.exists():
                    gw_files = list(gw_dir.glob("*.csv"))
                    if gw_files:
                        domain_info['groundwater_obs_file'] = gw_files[0]
            
            completed_domains.append(domain_info)
    
    print(f"   💧 GGMN domains found: {len(completed_domains)}")
    print(f"   📊 Domains with simulation results: {sum(1 for d in completed_domains if d['has_results'])}")
    print(f"   📋 Domains with observations: {sum(1 for d in completed_domains if d['has_observations'])}")
    
    return completed_domains

def create_groundwater_overview_visualization(completed_domains):
    """
    Create overview visualization of groundwater domain processing status
    """
    print(f"\n📈 Creating Groundwater Domain Overview Visualization...")
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # 1. Processing status map
    ax1 = axes[0, 0]
    
    # Plot all selected sites
    ax1.scatter(complete_stations['longitude'], complete_stations['latitude'], 
               c='lightgray', alpha=0.5, s=25, label='Selected stations', marker='o')
    
    # Plot completed domains with status colors
    for domain in completed_domains:
        domain_name = domain['domain_name']
        site_row = complete_stations[complete_stations['Watershed_Name'] == domain_name]
        
        if not site_row.empty:
            lat = site_row['latitude'].iloc[0]
            lon = site_row['longitude'].iloc[0]
            
            if domain['has_results'] and domain['has_observations']:
                color, marker, size = 'green', 's', 50
            elif domain['has_results']:
                color, marker, size = 'orange', '^', 40
            elif domain['has_observations']:
                color, marker, size = 'blue', 'D', 35
            else:
                color, marker, size = 'red', 'v', 25
            
            ax1.scatter(lon, lat, c=color, s=size, marker=marker, alpha=0.8,
                       edgecolors='black', linewidth=0.5)
    
    ax1.set_xlabel('Longitude')
    ax1.set_ylabel('Latitude')
    ax1.set_title('GGMN Groundwater Processing Status')
    ax1.grid(True, alpha=0.3)
    ax1.set_xlim(-140, -60)
    ax1.set_ylim(25, 65)
    
    # 2. Processing statistics
    ax2 = axes[0, 1]
    
    total_selected = len(complete_stations)
    total_discovered = len(completed_domains)
    total_with_results = sum(1 for d in completed_domains if d['has_results'])
    total_with_obs = sum(1 for d in completed_domains if d['has_observations'])
    total_complete = sum(1 for d in completed_domains if d['has_results'] and d['has_observations'])
    
    categories = ['Selected', 'Processing\nStarted', 'Simulation\nComplete', 'Observations\nFound', 'Ready for\nValidation']
    counts = [total_selected, total_discovered, total_with_results, total_with_obs, total_complete]
    colors = ['lightblue', 'yellow', 'orange', 'cyan', 'green']
    
    bars = ax2.bar(categories, counts, color=colors, alpha=0.7, edgecolor='black')
    
    for bar, count in zip(bars, counts):
        ax2.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.5,
                str(count), ha='center', va='bottom', fontweight='bold')
    
    ax2.set_ylabel('Number of Sites')
    ax2.set_title('Groundwater Processing Progress')
    ax2.grid(True, alpha=0.3, axis='y')
    
    # 3. Well depth distribution of processed sites
    ax3 = axes[1, 0]
    
    processed_wells = []
    for domain in completed_domains:
        domain_name = domain['domain_name']
        site_row = complete_stations[complete_stations['Watershed_Name'] == domain_name]
        if not site_row.empty:
            processed_wells.append(site_row['well_depth'].iloc[0])
    
    if processed_wells:
        ax3.hist(processed_wells, bins=15, alpha=0.7, color='lightgreen', edgecolor='black')
        ax3.set_xlabel('Well Depth (m)')
        ax3.set_ylabel('Number of Wells')
        ax3.set_title('Processed Wells by Depth')
        ax3.grid(True, alpha=0.3, axis='y')
    
    # 4. Regional processing status
    ax4 = axes[1, 1]
    
    if 'region' in complete_stations.columns:
        region_status = {}
        
        for domain in completed_domains:
            domain_name = domain['domain_name']
            site_row = complete_stations[complete_stations['Watershed_Name'] == domain_name]
            
            if not site_row.empty:
                region = site_row['region'].iloc[0]
                if region not in region_status:
                    region_status[region] = {'total': 0, 'complete': 0}
                
                region_status[region]['total'] += 1
                if domain['has_results'] and domain['has_observations']:
                    region_status[region]['complete'] += 1
        
        if region_status:
            regions = list(region_status.keys())
            complete_counts = [region_status[r]['complete'] for r in regions]
            total_counts = [region_status[r]['total'] for r in regions]
            pending_counts = [total_counts[i] - complete_counts[i] for i in range(len(regions))]
            
            x_pos = range(len(regions))
            
            ax4.bar(x_pos, complete_counts, label='Complete', color='green', alpha=0.7)
            ax4.bar(x_pos, pending_counts, bottom=complete_counts, 
                   label='Pending', color='orange', alpha=0.7)
            
            ax4.set_xticks(x_pos)
            ax4.set_xticklabels([r.replace('_', ' ') for r in regions], rotation=45, ha='right')
            ax4.set_ylabel('Number of Sites')
            ax4.set_title('Processing Status by Region')
            ax4.legend()
            ax4.grid(True, alpha=0.3, axis='y')
    
    plt.suptitle('GGMN Large Sample Groundwater Study - Processing Overview', 
                 fontsize=16, fontweight='bold')
    plt.tight_layout()
    
    # Save plot
    overview_path = experiment_dir / 'plots' / 'groundwater_domain_overview.png'
    overview_path.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(overview_path, dpi=300, bbox_inches='tight')
    plt.show()
    
    print(f"✅ Groundwater overview saved: {overview_path}")
    
    return total_selected, total_discovered, total_with_results, total_with_obs, total_complete

def create_groundwater_analysis_summary(completed_domains):
    """
    Create summary analysis of groundwater modeling results
    """
    print(f"\n💧 Creating Groundwater Analysis Summary...")
    
    # Analyze completed simulations
    simulation_summary = {
        'domains_with_results': 0,
        'domains_with_observations': 0,
        'total_simulation_files': 0,
        'average_well_depth': 0,
        'depth_range': (0, 0)
    }
    
    well_depths = []
    regions_represented = set()
    
    for domain in completed_domains:
        if domain['has_results']:
            simulation_summary['domains_with_results'] += 1
            simulation_summary['total_simulation_files'] += len(domain['simulation_files'])
        
        if domain['has_observations']:
            simulation_summary['domains_with_observations'] += 1
        
        # Get site metadata
        domain_name = domain['domain_name']
        site_row = complete_stations[complete_stations['Watershed_Name'] == domain_name]
        
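        if not site_row.empty:
            # Skip missing depths so NaN values do not propagate into the mean and histogram
            depths = site_row['well_depth'].dropna()
            if not depths.empty:
                well_depths.append(float(depths.iloc[0]))
            if 'region' in site_row.columns:
                regions_represented.add(site_row['region'].iloc[0])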
    
    if well_depths:
        simulation_summary['average_well_depth'] = np.mean(well_depths)
        simulation_summary['depth_range'] = (min(well_depths), max(well_depths))
    
    # Create summary visualization
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    
    # 1. Processing efficiency
    ax1 = axes[0]
    
    labels = ['With Results', 'With Observations', 'Complete']
    values = [
        simulation_summary['domains_with_results'],
        simulation_summary['domains_with_observations'],
        sum(1 for d in completed_domains if d['has_results'] and d['has_observations'])
    ]
    colors = ['orange', 'blue', 'green']
    
    bars = ax1.bar(labels, values, color=colors, alpha=0.7, edgecolor='black')
    
    for bar, value in zip(bars, values):
        ax1.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.5,
                str(value), ha='center', va='bottom', fontweight='bold')
    
    ax1.set_ylabel('Number of Domains')
    ax1.set_title('Processing Completion Status')
    ax1.grid(True, alpha=0.3, axis='y')
    
    # 2. Well depth distribution
    ax2 = axes[1]
    
    if well_depths:
        ax2.hist(well_depths, bins=12, alpha=0.7, color='lightblue', edgecolor='black')
        ax2.axvline(x=np.mean(well_depths), color='red', linestyle='--', 
                   label=f'Mean: {np.mean(well_depths):.1f}m')
        ax2.set_xlabel('Well Depth (m)')
        ax2.set_ylabel('Number of Wells')
        ax2.set_title('Well Depth Distribution')
        ax2.legend()
        ax2.grid(True, alpha=0.3)
    
    # 3. Regional representation
    ax3 = axes[2]
    
    if regions_represented:
        region_counts = {}
        for domain in completed_domains:
            domain_name = domain['domain_name']
            site_row = complete_stations[complete_stations['Watershed_Name'] == domain_name]
            if not site_row.empty and 'region' in site_row.columns:
                region = site_row['region'].iloc[0]
                region_counts[region] = region_counts.get(region, 0) + 1
        
        if region_counts:
            regions = list(region_counts.keys())
            counts = list(region_counts.values())
            
            ax3.pie(counts, labels=[r.replace('_', ' ') for r in regions], 
                   autopct='%1.1f%%', startangle=90)
            ax3.set_title('Regional Distribution')
    
    plt.suptitle('GGMN Groundwater Study - Analysis Summary', 
                 fontsize=16, fontweight='bold')
    plt.tight_layout()
    
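    # Save plot (create the plots directory in case this function runs on its own)
    summary_path = experiment_dir / 'plots' / 'groundwater_analysis_summary.png'
    summary_path.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(summary_path, dpi=300, bbox_inches='tight')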
    plt.show()
    
    print(f"✅ Groundwater analysis summary saved: {summary_path}")
    
    return simulation_summary

def create_final_ggmn_report(completed_domains, processing_stats, simulation_summary):
    """
    Create final comprehensive report for GGMN study
    """
    print(f"\n📋 Creating Final GGMN Groundwater Study Report...")
    
    report_path = experiment_dir / 'reports' / 'ggmn_final_report.txt'
    report_path.parent.mkdir(parents=True, exist_ok=True)
    
    total_selected, total_discovered, total_with_results, total_with_obs, total_complete = processing_stats
    
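    # UTF-8 is set explicitly because the report contains non-ASCII characters (e.g. "≥")
    with open(report_path, 'w', encoding='utf-8') as f:
        title = "GGMN Large Sample Groundwater Study - Final Report"
        f.write(title + "\n")
        f.write("=" * len(title) + "\n\n")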
        f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
        
        f.write("PROCESSING SUMMARY:\n")
        f.write(f"  Stations selected: {total_selected}\n")
        f.write(f"  Processing initiated: {total_discovered}\n")
        f.write(f"  Simulation results: {total_with_results}\n")
        f.write(f"  Observations found: {total_with_obs}\n")
        f.write(f"  Complete validation: {total_complete}\n\n")
        
        f.write("DATASET CHARACTERISTICS:\n")
        f.write(f"  Quality threshold: ≥{experiment_config['min_completeness']}% completeness\n")
        f.write(f"  Minimum records: {experiment_config['min_records']}\n")
        f.write(f"  Time period: {experiment_config.get('start_year', 'N/A')} - {experiment_config.get('end_year', 'N/A')}\n")
        
        if simulation_summary['average_well_depth'] > 0:
            f.write(f"  Average well depth: {simulation_summary['average_well_depth']:.1f}m\n")
            f.write(f"  Depth range: {simulation_summary['depth_range'][0]:.1f} - {simulation_summary['depth_range'][1]:.1f}m\n")
        
        f.write("\nSIMULATION RESULTS:\n")
        f.write(f"  Domains with results: {simulation_summary['domains_with_results']}\n")
        f.write(f"  Total output files: {simulation_summary['total_simulation_files']}\n")
        f.write(f"  Domains with observations: {simulation_summary['domains_with_observations']}\n")
        
        if total_complete > 0:
            success_rate = (total_complete / total_selected) * 100
            f.write(f"\nSUCCESS METRICS:\n")
            f.write(f"  Processing success rate: {success_rate:.1f}%\n")
            f.write(f"  Ready for validation: {total_complete} sites\n")
        
        f.write("\nNEXT STEPS:\n")
        f.write("  1. Extract groundwater level time series from simulations\n")
        f.write("  2. Load and process GGMN observation data\n")
        f.write("  3. Perform statistical comparison (correlation, bias, RMSE)\n")
        f.write("  4. Analyze performance by well depth and region\n")
        f.write("  5. Identify patterns in groundwater model performance\n")
    
    print(f"✅ Final report saved: {report_path}")
    
    return report_path

# Execute Step 3 Analysis
print(f"\n🔍 Step 3.1: Groundwater Domain Discovery")

# Discover completed domains
completed_domains = discover_completed_groundwater_domains()

# Create overview visualization
processing_stats = create_groundwater_overview_visualization(completed_domains)

print(f"\n💧 Step 3.2: Groundwater Analysis Summary")

# Create analysis summary
simulation_summary = create_groundwater_analysis_summary(completed_domains)

print(f"\n📋 Step 3.3: Final Report Generation")

# Create final report
report_path = create_final_ggmn_report(completed_domains, processing_stats, simulation_summary)

# Print final summary
total_selected, total_discovered, total_with_results, total_with_obs, total_complete = processing_stats

print(f"\n✅ Step 3 Complete: GGMN Groundwater Analysis")
print(f"   📁 Results directory: {experiment_dir}")
print(f"   💧 Processing status: {total_complete}/{total_selected} sites with complete validation")
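# Guard against an empty station selection to avoid a ZeroDivisionError
success_rate = (total_complete / total_selected) * 100 if total_selected else 0.0
print(f"   📊 Success rate: {success_rate:.1f}% complete")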

if simulation_summary['domains_with_results'] > 0:
    print(f"   📈 Simulation results: {simulation_summary['domains_with_results']} domains processed")
    print(f"   💾 Output files: {simulation_summary['total_simulation_files']} simulation files")

if simulation_summary['average_well_depth'] > 0:
    print(f"   🏗️  Well characteristics: {simulation_summary['average_well_depth']:.1f}m average depth")
    print(f"   📏 Depth range: {simulation_summary['depth_range'][0]:.1f} - {simulation_summary['depth_range'][1]:.1f}m")

print(f"\n🎉 Large Sample GGMN Groundwater Analysis Complete!")
print(f"   💧 Multi-site groundwater modeling framework established")
print(f"   📊 Foundation for systematic groundwater validation created")
print(f"   🌍 Hydrogeological diversity captured across North America")

## Tutorial Summary: Large Sample Groundwater Hydrology with GGMN

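This tutorial applied CONFLUENCE to a large sample of GGMN groundwater monitoring sites across diverse hydrogeological environments in North America. The workflow selected stations against data quality thresholds, processed them as a multi-site experiment, and inventoried which domains have both simulation output and observations. Together these steps establish a framework for continental-scale evaluation of simulated groundwater levels.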

**Key Accomplishments:**
- **Multi-site groundwater evaluation** across well depth gradients from shallow to deep aquifer systems
- **Hydrogeological diversity** spanning alluvial, bedrock, and glacial aquifer types
- **Regional coverage** across North American groundwater provinces
- **Systematic validation framework** for groundwater level simulation assessment
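
As a minimal sketch of the statistical comparison this framework builds toward (correlation, bias, RMSE), the snippet below compares two date-indexed groundwater level series. The function name `groundwater_level_metrics` and the synthetic data are illustrative assumptions; they are not part of CONFLUENCE or of the GGMN data format.

```python
import numpy as np
import pandas as pd


def groundwater_level_metrics(simulated: pd.Series, observed: pd.Series) -> dict:
    """Compare two date-indexed groundwater level series on their shared dates.

    Hypothetical helper for illustration only.
    """
    # Align on the union of timestamps, then keep only dates where both values exist
    paired = pd.concat([simulated, observed], axis=1, keys=["sim", "obs"]).dropna()
    if len(paired) < 2:
        # Too few paired values to compute meaningful statistics
        return {"n": len(paired), "correlation": np.nan, "bias": np.nan, "rmse": np.nan}

    error = paired["sim"] - paired["obs"]
    return {
        "n": len(paired),
        "correlation": paired["sim"].corr(paired["obs"]),  # Pearson correlation
        "bias": error.mean(),                               # mean of (sim - obs)
        "rmse": np.sqrt((error ** 2).mean()),               # root mean square error
    }


# Synthetic example: the "simulation" is the observation shifted up by 0.5 m
dates = pd.date_range("2020-01-01", periods=5, freq="MS")
observed = pd.Series([10.0, 10.4, 11.0, 10.6, 10.2], index=dates)
simulated = observed + 0.5

metrics = groundwater_level_metrics(simulated, observed)
print(metrics)
```

Pairing the series on shared dates before computing statistics matters here because well records are often irregular, while model output is typically on a regular time step.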

**Scientific Foundation:**
The large sample approach enables identification of systematic patterns in groundwater model performance across hydrogeological settings. Comparing results across many sites can show how performance varies with aquifer characteristics, regional climate, and surface-groundwater interactions.
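
One way to look for such patterns, sketched here under stated assumptions, is to group per-site metrics by depth class and by region. The table below is synthetic: the site and region names, the RMSE values, and the 20 m and 100 m depth breaks are placeholders chosen for illustration. Only the `well_depth` and `region` column names follow the station metadata used earlier in this tutorial.

```python
import pandas as pd

# Hypothetical per-site metrics; every value here is a synthetic placeholder
site_metrics = pd.DataFrame({
    "site": ["A", "B", "C", "D", "E", "F"],
    "region": ["Region_A", "Region_A", "Region_B", "Region_B", "Region_C", "Region_C"],
    "well_depth": [8.0, 45.0, 12.0, 150.0, 30.0, 95.0],  # metres
    "rmse": [0.4, 0.9, 0.5, 1.6, 0.7, 1.1],              # metres
})

# Assign depth classes using assumed break points at 20 m and 100 m
site_metrics["depth_class"] = pd.cut(
    site_metrics["well_depth"],
    bins=[0, 20, 100, float("inf")],
    labels=["shallow", "intermediate", "deep"],
)

# Summarise error by depth class and by region
by_depth = site_metrics.groupby("depth_class", observed=True)["rmse"].agg(["count", "median"])
by_region = site_metrics.groupby("region")["rmse"].agg(["count", "median"])

print(by_depth)
print(by_region)
```

The median is used rather than the mean so that a single poorly simulated well does not dominate a group summary; with real results, group sizes should be reported alongside any statistic.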

**Methodological Innovation:**
This workflow demonstrates CONFLUENCE's capacity for **automated groundwater-specific configuration**, **multi-site processing across hydrogeological gradients**, and **standardized validation protocols** for systematic groundwater science.

**Connection to Comprehensive Large Sample Studies:**
Having explored energy balance (FLUXNET), snow processes (NorSWE), and groundwater dynamics (GGMN), we have established the foundation for integrated multi-variable hydrological validation spanning the energy, snow, and groundwater components of the terrestrial water cycle.

The large sample methodology moves groundwater hydrology from individual well studies toward systematic, generalizable science, which is essential for understanding subsurface water resources in a changing climate.

### Ready for the next challenge?
**Explore integrated multi-variable analysis** → **[Tutorial 05: Advanced Model Comparison](./05_model_comparison.ipynb)**