# SYMFLUENCE Tutorial 1a — Point-Scale Workflow (Paradise SNOTEL)

## Introduction

This notebook demonstrates the point-scale modeling workflow in **SYMFLUENCE**, a framework for reproducible and modular computational hydrology. At the point scale, we simulate vertical energy and water fluxes at a single site, independent of routing or lateral flow, to isolate and evaluate model process representations.

Here, we focus on the **Paradise SNOTEL station (ID 679)**, located at 1,630 m elevation in Washington’s Cascade Range. This site represents a transitional snow climate and provides long-term observations of snow water equivalent (SWE) and soil moisture across multiple depths. By reproducing the observed seasonal snow and soil moisture dynamics, this tutorial demonstrates how SYMFLUENCE structures a controlled, transparent, and fully reproducible point-scale experiment.

Through this example, you will see how configuration-driven workflows manage experiment setup, geospatial definition, input data preprocessing, model instantiation, and performance evaluation—building a foundation for more complex distributed modeling studies later in the series.


# Step 1 — Configuration (pick or generate)

We begin by selecting (or programmatically generating) a single configuration file that fully specifies the experiment. This keeps the workflow reproducible and makes initialization a one-liner.


In [None]:
# Step 1 — Create a site-specific configuration for the Paradise SNOTEL example


from pathlib import Path
import yaml
from symfluence.resources import get_config_template

SYMFLUENCE_CODE_DIR = Path.cwd().resolve()

# Path to the default template configuration
config_template = get_config_template()

# Load the base configuration
with open(config_template, "r") as f:
    config = yaml.safe_load(f)

# === Modify key entries for the Paradise SNOTEL point-scale case ===

# Define code directory — ensures relative paths resolve correctly
config['SYMFLUENCE_CODE_DIR'] = str(SYMFLUENCE_CODE_DIR)

# Get SYMFLUENCE_DATA_DIR from template
SYMFLUENCE_DATA_DIR = Path(config.get('SYMFLUENCE_DATA_DIR', 
                                       str(SYMFLUENCE_CODE_DIR.parent / 'data' / 'SYMFLUENCE_data'))).resolve()
config["SYMFLUENCE_DATA_DIR"] = str(SYMFLUENCE_DATA_DIR)

print(f"Using SYMFLUENCE_DATA_DIR: {SYMFLUENCE_DATA_DIR}")

# Restrict the spatial domain to a single site using latitude/longitude bounds
config["DOMAIN_DEFINITION_METHOD"] = "point"
config["DOMAIN_DISCRETIZATION"] = "GRUs"
config["BOUNDING_BOX_COORDS"] = "46.781/-121.751/46.779/-121.749"
config["POUR_POINT_COORDS"] = "46.78/-121.75"

# Enable automatic download of SNOTEL data for this station
config["DOWNLOAD_SNOTEL"] = True

# Specify model and forcing dataset used in this example
config["HYDROLOGICAL_MODEL"] = "SUMMA"
config["FORCING_DATASET"] = "ERA5"

# Define the temporal extent of the experiment
config["EXPERIMENT_TIME_START"] = "2000-01-01 01:00"
config["EXPERIMENT_TIME_END"] = "2002-12-31 23:00"
config['CALIBRATION_PERIOD'] = "2000-10-01, 2001-09-30"
config['EVALUATION_PERIOD'] = "2001-10-01, 2002-09-30"
config['SPINUP_PERIOD'] = "2000-01-01, 2000-09-30"

# Assign a descriptive domain name and experiment ID
config["DOMAIN_NAME"] = "paradise"
config["EXPERIMENT_ID"] = "run_1"

# MAF paths and settings (placeholders; replace with paths valid on your system)
config['DATATOOL_DATASET_ROOT'] = '/path/to/meteorological-data/'
config['GISTOOL_DATASET_ROOT'] = '/path/to/geospatial-data/'
config['TOOL_CACHE'] = '/path/to/cache/dir'
config['CLUSTER_JSON'] = '/path/to/cluster.json'

# Observation sources: SNOTEL for SWE, ISMN (SCAN network) for soil moisture
config['SNOW_DATA_SOURCE'] = 'SNOTEL'
config['SNOW_STATIONS'] = '679'
config['ISMN_NETWORK'] = 'SCAN'
config['ISMN_STATIONS'] = '679'

# Optimization settings
config['OPTIMIZATION_TARGET'] = 'swe'
config['PARAMS_TO_CALIBRATE'] = 'tempCritRain,tempRangeTimestep,frozenPrecipMultip,albedoMax,albedoMinWinter,albedoDecayRate,constSnowDen,mw_exp,k_snow,z0Snow'
config['ITERATIVE_OPTIMIZATION_ALGORITHM'] = 'DDS'
config['OPTIMIZATION_METRIC'] = 'RMSE'
config['CALIBRATION_TIMESTEP'] = 'daily'

# === Save the customized configuration ===
out_config = Path('./config_paradise.yaml')
with open(out_config, "w") as f:
    yaml.dump(config, f, default_flow_style=False, sort_keys=False)

print(f"✅ New configuration written to: {out_config}")
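
Before moving on, it can help to confirm that the period settings are internally consistent. The sketch below is not part of SYMFLUENCE: `parse_period` and `check_periods` are hypothetical helpers that assume the `"start, end"` string format used above. They check that the spin-up, calibration, and evaluation periods fall inside the experiment window and follow each other without overlap.

```python
from datetime import datetime

def parse_period(text):
    """Split a 'YYYY-MM-DD, YYYY-MM-DD' string into (start, end) datetimes."""
    start, end = (datetime.strptime(part.strip(), "%Y-%m-%d") for part in text.split(","))
    if start > end:
        raise ValueError(f"Period starts after it ends: {text!r}")
    return start, end

def check_periods(cfg):
    """Return a list of problems; an empty list means the periods look consistent."""
    exp_start = datetime.strptime(cfg["EXPERIMENT_TIME_START"], "%Y-%m-%d %H:%M")
    exp_end = datetime.strptime(cfg["EXPERIMENT_TIME_END"], "%Y-%m-%d %H:%M")
    problems = []
    previous_end = None
    for key in ("SPINUP_PERIOD", "CALIBRATION_PERIOD", "EVALUATION_PERIOD"):
        start, end = parse_period(cfg[key])
        # Compare on dates only, since the periods carry no time of day
        if start.date() < exp_start.date() or end.date() > exp_end.date():
            problems.append(f"{key} falls outside the experiment window")
        if previous_end is not None and start <= previous_end:
            problems.append(f"{key} overlaps the preceding period")
        previous_end = end
    return problems

# Same values as written to config_paradise.yaml above
example_cfg = {
    "EXPERIMENT_TIME_START": "2000-01-01 01:00",
    "EXPERIMENT_TIME_END": "2002-12-31 23:00",
    "SPINUP_PERIOD": "2000-01-01, 2000-09-30",
    "CALIBRATION_PERIOD": "2000-10-01, 2001-09-30",
    "EVALUATION_PERIOD": "2001-10-01, 2002-09-30",
}

problems = check_periods(example_cfg)
print(problems or "Periods are consistent")
```

To check the file on disk rather than the literal dictionary, pass the `config` dictionary from the cell above to `check_periods`.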

## Step 1a — Download Example Data (Optional)

If you don't have access to MAF-supported HPC resources, you can download pre-processed example data from GitHub releases. Download the archive and extract it into your `SYMFLUENCE_DATA_DIR`.


## Step 1b — Initialize SYMFLUENCE

With the configuration prepared, we now initialize **SYMFLUENCE**.  
This step reads the configuration file, sets up the project directory, and registers all workflow managers (data, domain, model, and evaluation).  


In [None]:
# Step 1b — Initialize SYMFLUENCE
from symfluence import SYMFLUENCE  # adjust if your import path differs

config_path = "./config_paradise.yaml"
symfluence = SYMFLUENCE(config_path)

print("✅ SYMFLUENCE initialized successfully.")
print(f"Configuration loaded from: {config_path}")

## Step 1c — Project structure setup

We now create the standardized project directory and a pour-point feature for the site.  
This anchors the experiment in a clear, reproducible file layout and records the site location for downstream domain and data steps.


In [None]:
# Step 1c — Project structure setup

from pathlib import Path

# 1) Create the standardized project layout (logs, config link, data/output folders, etc.)
project_dir = symfluence.managers['project'].setup_project()

# 2) Create a pour-point feature (the site reference geometry for point-scale workflows)
pour_point_path = symfluence.managers['project'].create_pour_point()

print("✅ Project structure created.")
print(f"Project root: {project_dir}")
print(f"Pour point:   {pour_point_path}")

# 3) Brief top-level directory preview
print("\nTop-level structure:")
for p in sorted(Path(project_dir).iterdir()):
    if p.is_dir():
        print(f"├── {p.name}")

# Step 2 — Domain definition (point-scale GRU)

For the Paradise SNOTEL example, the domain is a **single GRU** representing the site footprint.  
This keeps the workflow strictly point-scale (no routing), aligning the geometry with the pour point created in Step 1.

### Step 2a — Geospatial attribute acquisition (only available on MAF-supported HPCs)

We first acquire site attributes (elevation, land cover, soils, etc.).  
These are model-agnostic inputs used to parameterize vertical energy and water balance at the site.

- If you are using the downloaded example data, copy its attributes, forcing, and observations directories into the domain directory created in Step 1c.

In [None]:
# Step 2a — Acquire attributes (model-agnostic)
# If you are on a MAF-supported HPC, uncomment the line below
# symfluence.managers['data'].acquire_attributes()
print("✅ Attribute acquisition complete")

### Step 2b — Domain definition (point-scale)

With attributes prepared, we define a point-scale domain consistent with the pour point.  
For this example, the domain is a minimal footprint around the Paradise SNOTEL site.

In [None]:
# Step 2b — Define the point-scale domain
watershed_path = symfluence.managers['domain'].define_domain()
print("✅ Domain definition complete")
print(f"Domain file: {watershed_path}")
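
To get a sense of how small this footprint is, the bounding box from Step 1 can be converted to approximate ground distances. This is a rough sketch, not SYMFLUENCE code: `bbox_size_m` and `point_in_bbox` are hypothetical helpers, the `lat_max/lon_min/lat_min/lon_max` ordering is inferred from the values in the configuration, and the Earth is treated as a sphere.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (spherical approximation)

def parse_bbox(bbox):
    """Parse 'lat_max/lon_min/lat_min/lon_max' (assumed ordering) into floats."""
    lat_max, lon_min, lat_min, lon_max = (float(v) for v in bbox.split("/"))
    return lat_max, lon_min, lat_min, lon_max

def bbox_size_m(bbox):
    """Approximate (north-south, east-west) extent of the box in metres."""
    lat_max, lon_min, lat_min, lon_max = parse_bbox(bbox)
    mid_lat = math.radians((lat_max + lat_min) / 2)
    north_south = math.radians(lat_max - lat_min) * EARTH_RADIUS_M
    # Meridians converge towards the poles, so scale by cos(latitude)
    east_west = math.radians(lon_max - lon_min) * EARTH_RADIUS_M * math.cos(mid_lat)
    return north_south, east_west

def point_in_bbox(point, bbox):
    """True if a 'lat/lon' point lies inside the bounding box."""
    lat, lon = (float(v) for v in point.split("/"))
    lat_max, lon_min, lat_min, lon_max = parse_bbox(bbox)
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

bbox = "46.781/-121.751/46.779/-121.749"   # BOUNDING_BOX_COORDS from Step 1
pour_point = "46.78/-121.75"               # POUR_POINT_COORDS from Step 1

north_south_m, east_west_m = bbox_size_m(bbox)
print(f"Footprint: about {north_south_m:.0f} m (N-S) by {east_west_m:.0f} m (E-W)")
print("Pour point inside bounding box:", point_in_bbox(pour_point, bbox))
```

With the values used here the box is roughly 220 m by 150 m, which is why the domain can be treated as a single point-scale unit.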

### Step 2c — Discretization (required even for 1 GRU = 1 HRU)

Discretization writes the **catchment HRU shapefile** and related artifacts required by downstream steps.  
For the point-scale case we set `DOMAIN_DISCRETIZATION: GRUs`, which creates a **single HRU** identical to the GRU while still generating the standardized outputs.


In [None]:
# Step 2c — Discretization (GRUs → HRUs 1:1, but files are still created)
hru_path = symfluence.managers['domain'].discretize_domain()
print("✅ Domain discretization complete")
print(f"HRU file: {hru_path}")

### Step 2d — Verification & inspection (Paradise SNOTEL)

We verify that discretization produced the expected shapefiles in the standardized locations, then plot a minimal GRU–HRU overlay.

**Expected files**
- `domain_dir/shapefiles/river_basins/paradise_riverBasins_point.shp` (GRU)
- `domain_dir/shapefiles/catchment/paradise_HRUs_GRUs.shp` (HRU)


In [None]:
# Step 2d — Verify domain outputs and inspect geometry

from pathlib import Path
import geopandas as gpd
import matplotlib.pyplot as plt
import yaml

# 1) Read config to get data dir and domain name
with open("./config_paradise.yaml") as f:
    cfg = yaml.safe_load(f)

data_dir   = Path(cfg["SYMFLUENCE_DATA_DIR"])
domain_dir = data_dir / f"domain_{cfg['DOMAIN_NAME']}"
shp_dir    = domain_dir / "shapefiles"

# 2) Explicit expected shapefiles for Paradise
gru_fp = shp_dir / "river_basins" / "paradise_riverBasins_point.shp"
hru_fp = shp_dir / "catchment"     / "paradise_HRUs_GRUs.shp"

# 3) Verify presence
for label, path in [("GRU", gru_fp), ("HRU", hru_fp)]:
    if not path.exists():
        raise FileNotFoundError(f"❌ Expected {label} file not found: {path}")
    print(f"✅ {label} file found: {path}")

# 4) Minimal overlay plot
gru = gpd.read_file(gru_fp)
hru = gpd.read_file(hru_fp)
if hru.crs != gru.crs:
    hru = hru.to_crs(gru.crs)

ax = gru.plot(figsize=(6, 6))
hru.plot(ax=ax, facecolor="none")
ax.set_title("Paradise SNOTEL — GRU vs HRU")
ax.set_xlabel("")
ax.set_ylabel("")
ax.set_aspect("equal")
plt.tight_layout()
plt.show()

# Step 3 — Input preprocessing (model-agnostic)

We prepare inputs in three small moves:
1) acquire **meteorological forcings**,  
2) process **observations** (SNOTEL), and  
3) run **model-agnostic preprocessing** to standardize time steps, variables, and units for downstream use.

### Step 3a — Acquire meteorological forcings (ERA5)

Downloads/subsets the forcings for the Paradise domain.


In [None]:
# Step 3a — Forcings
# If you are on a MAF-supported HPC, uncomment the line below
# symfluence.managers['data'].acquire_forcings()
print("✅ Forcing data acquisition complete")


### Step 3b — Process observations (SNOTEL)

Parses site observations (e.g., SWE, soil moisture), applies basic QA/QC, and stores standardized outputs.


In [None]:
# Step 3b — Observations
# If you are on a MAF-supported HPC, uncomment the line below
# symfluence.managers['data'].process_observed_data()
print("✅ Observational data processing complete")

### Step 3c — Model-agnostic preprocessing

Standardizes variable names, units, and time steps (and fills required diagnostics) so multiple models can consume the same inputs consistently.

In [None]:
# Step 3c — Model-agnostic preprocessing
symfluence.managers['data'].run_model_agnostic_preprocessing()
print("✅ Model-agnostic preprocessing complete")
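
As an illustration of the kind of unit standardization involved, consider precipitation. ERA5 reports hourly total precipitation as a depth in metres accumulated over the hour, while SUMMA expects a precipitation rate (`pptrate`) in kg m-2 s-1. The sketch below shows that one conversion in isolation. It is not SYMFLUENCE's implementation, and `accumulation_to_rate` and the sample values are made up for the example.

```python
SECONDS_PER_HOUR = 3600
MM_PER_M = 1000  # 1 mm of water over 1 m^2 has a mass of 1 kg

def accumulation_to_rate(depth_m, step_seconds=SECONDS_PER_HOUR):
    """Convert a precipitation depth in metres accumulated over one time step
    into a mean rate in kg m-2 s-1 (numerically equal to mm s-1)."""
    return depth_m * MM_PER_M / step_seconds

# Hypothetical hourly accumulations in metres (illustrative values only)
hourly_depths_m = [0.0, 0.0005, 0.0018]
rates = [accumulation_to_rate(d) for d in hourly_depths_m]

for depth, rate in zip(hourly_depths_m, rates):
    print(f"{depth * MM_PER_M:5.2f} mm in one hour -> {rate:.6f} kg m-2 s-1")
```

The same pattern (a known source unit, a known target unit, and an explicit time step) applies to the other forcing variables that need conversion.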

### Step 3d — Quick verification

We confirm the expected folders exist and contain files:

- `forcing/raw_data/`
- `forcing/basin_averaged_data/`
- `observations/snow/swe/{raw,processed}/`
- `observations/soil_moisture/ismn/{raw,processed}/`


In [None]:
from pathlib import Path
import yaml

# Derive paths from the config (no hard-coding)
with open("./config_paradise.yaml") as f:
    cfg = yaml.safe_load(f)

data_dir   = Path(cfg["SYMFLUENCE_DATA_DIR"])
domain_dir = data_dir / f"domain_{cfg['DOMAIN_NAME']}"

# Labels mirror the paths that are actually checked
targets = {
    "forcing/raw_data":                          domain_dir / "forcing" / "raw_data",
    "forcing/basin_averaged_data":               domain_dir / "forcing" / "basin_averaged_data",
    "observations/snow/swe/raw":                 domain_dir / "observations" / "snow" / "swe" / "raw",
    "observations/snow/swe/processed":           domain_dir / "observations" / "snow" / "swe" / "processed",
    "observations/soil_moisture/ismn/raw":       domain_dir / "observations" / "soil_moisture" / "ismn" / "raw",
    "observations/soil_moisture/ismn/processed": domain_dir / "observations" / "soil_moisture" / "ismn" / "processed",
}

def count_files(p: Path) -> int:
    return sum(1 for x in p.iterdir() if x.is_file()) if p.exists() else 0

for label, path in targets.items():
    exists = path.exists()
    n = count_files(path)
    status = "✅" if exists and n > 0 else ("⚠️ empty" if exists else "❌ missing")
    suffix = f"({n} files)" if exists else ""
    print(f"{status} {label}  {suffix}")

# Step 4 — Model-specific preprocessing & model run (SUMMA)

We now convert the model-agnostic inputs into **SUMMA-ready inputs**, then instantiate and run the model for the Paradise point-scale case.


### Step 4a — SUMMA-specific preprocessing

Creates the SUMMA input bundle (metadata, parameter tables, forcing links) from the standardized inputs.


In [None]:
# Step 4a — SUMMA-specific preprocessing
symfluence.managers['model'].preprocess_models()
print("✅ Model-specific preprocessing complete")

### Step 4b — Instantiate & run the model

Instantiates the model using the prepared inputs and executes the point-scale simulation.


In [None]:
# Step 4b — Instantiate & run SUMMA
print(f"Running {symfluence.config['HYDROLOGICAL_MODEL']} for point-scale simulation…")
symfluence.managers['model'].run_models()
print("✅ Point-scale model run complete")

### Step 4c - Quick verification

Print where SUMMA inputs and run outputs were written (paths are derived from the configuration).


In [None]:
from pathlib import Path
import yaml

with open("./config_paradise.yaml") as f:
    cfg = yaml.safe_load(f)

data_dir   = Path(cfg["SYMFLUENCE_DATA_DIR"])
domain_dir = data_dir / f"domain_{cfg['DOMAIN_NAME']}"

# Common locations used by the model manager
summa_in   = domain_dir / "forcing" / "SUMMA_input"
results    = domain_dir / "simulations" / cfg['EXPERIMENT_ID'] / 'SUMMA' 

print("SUMMA input dir:", summa_in if summa_in.exists() else "(not found)")
print("Results dir:",    results if results.exists()    else "(not found)")

In [None]:
# Step 4c — SWE only (obs vs sim) with robust NetCDF open + unit auto-detect

from pathlib import Path
import yaml, pandas as pd, numpy as np, xarray as xr
import matplotlib.pyplot as plt
import re

# --- Paths from config ---
with open("./config_paradise.yaml") as f:
    cfg = yaml.safe_load(f)
data_dir   = Path(cfg["SYMFLUENCE_DATA_DIR"])
domain_dir = data_dir / f"domain_{cfg['DOMAIN_NAME']}"

# Find a daily SUMMA output (e.g., *_day.nc) under the domain folder
nc_files = sorted(domain_dir.rglob("*_day.nc"))
if not nc_files:
    # fallback: any .nc under the results folder
    nc_files = sorted((domain_dir / "simulations" / cfg['EXPERIMENT_ID'] / "SUMMA").rglob("*.nc"))
if not nc_files:
    raise FileNotFoundError(f"No netCDF files found under {domain_dir}")
nc_path = nc_files[0]  # sorted, so the choice is deterministic

def open_dataset_safe(path: Path) -> xr.Dataset:
    if not path.exists():
        raise FileNotFoundError(f"Path does not exist: {path}")
    if path.is_dir() or path.suffix.lower() == ".zarr":
        # Zarr store
        return xr.open_zarr(path)
    errs = []
    for eng in ("netcdf4", "scipy"):
        try:
            return xr.open_dataset(path, engine=eng)
        except Exception as e:
            errs.append(f"{eng}: {e}")
    raise ValueError(f"Could not open {path} with engines netcdf4/scipy.\n" + "\n".join(errs))

# --- Helpers ---
def rmse(a, b): 
    d = (a - b).to_numpy(dtype=float)
    return float(np.sqrt(np.nanmean(d**2)))

def bias(a, b):
    return float((a - b).mean())

def first_numeric_col(df):
    cols = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
    return cols[0] if cols else None

def reduce_to_time_series(da: xr.DataArray) -> pd.Series:
    if "time" not in da.dims:
        if "time" in da.coords:
            da = da.swap_dims({list(da.dims)[0]: "time"})
        else:
            raise ValueError("DataArray has no 'time' dimension or coordinate.")
    # pick first layer if present; average any spatial dims (hru/gru)
    for d in [d for d in da.dims if d != "time"]:
        if re.search("layer|soil", d, re.I):
            da = da.isel({d: 0})
        else:
            da = da.mean(d)
    s = da.to_series()
    s.index = pd.to_datetime(s.index)
    return s.sort_index()

def read_obs_csv(path: Path):
    df = pd.read_csv(path, index_col=0)
    if not isinstance(df.index, pd.DatetimeIndex):
        df = pd.read_csv(path)
        date_col = next((c for c in df.columns if re.search("date|time", str(c), re.I)), None)
        if date_col is None:
            raise ValueError(f"No date/time column found in {path}")
        df[date_col] = pd.to_datetime(df[date_col].astype(str).str.strip(), dayfirst=True, errors="raise")
        df = df.set_index(date_col)
    df = df.sort_index()
    col = first_numeric_col(df)
    if not col:
        raise ValueError(f"No numeric SWE column found in {path}")
    return df[col].astype(float), str(col)

def align(a, b):
    idx = a.index.intersection(b.index)
    return a.loc[idx].astype(float), b.loc[idx].astype(float)

# --- Open dataset robustly & handle spinup safely ---
ds = open_dataset_safe(nc_path)

# robust time extraction
time_vals = ds.get("time")
if time_vals is None or np.size(time_vals) == 0:
    raise ValueError("Dataset has no usable 'time' coordinate.")
time_pd = pd.to_datetime(np.array(time_vals.values), errors="coerce")
if pd.isna(time_pd).all():
    # cftime fallback
    time_pd = pd.to_datetime([f"{t.year:04d}-{t.month:02d}-{t.day:02d}" for t in time_vals.values])
# Treat the first simulated calendar year as spin-up and evaluate from the next one
start_year = int(time_pd.min().year) + 1

ds_eval = ds.sel(time=slice(f"{start_year}-01-01", None))
if ds_eval.sizes.get("time", 0) == 0:
    ds_eval = ds  # if skipping a year empties it, use full record

# SWE (simulation) — expected in mm
swe_var_candidates = ["scalarSWE", "scalarSnowWaterEquivalent", "SWE"]
swe_var = next((v for v in swe_var_candidates if v in ds_eval.data_vars), None)
if not swe_var:
    raise KeyError(f"SWE variable not found. Tried: {swe_var_candidates}")
swe_sim = reduce_to_time_series(ds_eval[swe_var])

# --- Load SWE observations (processed; unknown units) ---
swe_obs_csv1 = domain_dir / "observations" / "snow" / "processed" / f"{cfg['DOMAIN_NAME']}_swe_processed.csv"
swe_obs_csv2 = domain_dir / "observations" / "snow" / "swe" / "processed" / f"{cfg['DOMAIN_NAME']}_swe_processed.csv"
swe_obs_csv  = swe_obs_csv1 if swe_obs_csv1.exists() else swe_obs_csv2
swe_obs_raw, swe_obs_col = read_obs_csv(swe_obs_csv)

# --- Unit auto-detection: assume sim is mm; try mm/cm/in for obs ---
candidates = {"mm": 1.0, "cm": 10.0, "in": 25.4}
scores, aligned_examples = {}, {}
for unit, factor in candidates.items():
    obs_scaled = swe_obs_raw * factor
    sim_a, obs_a = align(swe_sim, obs_scaled)
    if len(sim_a) == 0:
        scores[unit] = np.inf
    else:
        scores[unit] = rmse(sim_a, obs_a)
        aligned_examples[unit] = (sim_a, obs_a)

best_unit = min(scores, key=scores.get)
scale = candidates[best_unit]
if not np.isfinite(scores[best_unit]):
    raise ValueError("No overlapping dates between SWE sim and obs.")
swe_sim_a, swe_obs_a = aligned_examples[best_unit]

# --- Metrics & diagnostics ---
def corr(a, b): return float(pd.Series(a).corr(pd.Series(b)))

swe_metrics = dict(RMSE=rmse(swe_sim_a, swe_obs_a),
                   Bias=bias(swe_sim_a, swe_obs_a),
                   r=corr(swe_sim_a, swe_obs_a))

print(f"Opened: {nc_path}")
print(f"SWE sim range: {swe_sim.index.min()} → {swe_sim.index.max()}")
print(f"SWE obs range: {swe_obs_raw.index.min()} → {swe_obs_raw.index.max()}")
print(f"SWE overlap days: {len(swe_sim_a)}")
print(f"Detected obs SWE units: {best_unit} (×{scale})")
print("SWE metrics:", {k: round(v,3) for k,v in swe_metrics.items()})

# --- Plot ---
plt.figure(figsize=(8,4))
plt.plot(swe_obs_a.index, swe_obs_a.values, label=f"Obs (→ mm from {best_unit})")
plt.plot(swe_sim_a.index, swe_sim_a.values, label="Sim (mm)")
plt.title("SWE (obs vs sim)")
plt.legend()
plt.tight_layout()
plt.show()

ds.close()


# Step 5 — Calibration (SUMMA, Differential Evolution)

We enable **iterative calibration** for SUMMA, set the **calibration/evaluation periods**, choose **parameters**, and pick a **single objective** (KGE).  
SYMFLUENCE exposes a one-liner to run calibration once config is set.
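
For reference, the Kling-Gupta efficiency (Gupta et al., 2009) combines correlation, variability, and bias into one score: `KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2)`, where `r` is the linear correlation between simulation and observation, `alpha` is the ratio of their standard deviations, and `beta` is the ratio of their means. A perfect simulation scores 1. The function below is a minimal standalone sketch of that formula, not the implementation SYMFLUENCE uses internally, and the sample values are invented.

```python
import math

def kge(sim, obs):
    """Kling-Gupta efficiency for two equal-length sequences of numbers.

    Undefined (division by zero) if either series is constant or if the
    observed mean is zero; this sketch does not guard against those cases.
    """
    if len(sim) != len(obs) or len(sim) < 2:
        raise ValueError("sim and obs must have the same length (at least 2)")
    n = len(sim)
    mean_s, mean_o = sum(sim) / n, sum(obs) / n
    var_s = sum((s - mean_s) ** 2 for s in sim) / n
    var_o = sum((o - mean_o) ** 2 for o in obs) / n
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(sim, obs)) / n
    std_s, std_o = math.sqrt(var_s), math.sqrt(var_o)
    r = cov / (std_s * std_o)   # correlation
    alpha = std_s / std_o       # variability ratio
    beta = mean_s / mean_o      # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Invented SWE-like values in mm, for illustration only
obs_swe = [0.0, 120.0, 340.0, 510.0, 280.0, 40.0]
sim_swe = [0.0, 100.0, 300.0, 560.0, 330.0, 90.0]
print(f"KGE = {kge(sim_swe, obs_swe):.3f}")
```

Because KGE is maximized while RMSE is minimized, the optimizer has to treat the two metrics with opposite signs; keep that in mind when comparing runs that use different `OPTIMIZATION_METRIC` values.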


## Step 5a — Minimal config (what matters)

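Add or confirm these entries in `config_paradise.yaml`. They replace the DDS/RMSE settings and the snow-focused parameter list written in Step 1, so re-run the initialization cell (Step 1b) afterwards to make sure the running `symfluence` object picks up the changes: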

```yaml
# Enable iterative calibration with DE, use KGE on the calibration window
OPTIMIZATION_METHODS: [iteration]
ITERATIVE_OPTIMIZATION_ALGORITHM: DE      # DE, DDS, PSO, SCE-UA, NSGA-II
OPTIMIZATION_METRIC: KGE                  # KGE, NSE, RMSE, MAE, KGEp

# Parameters to calibrate (point-scale set)
PARAMS_TO_CALIBRATE: tempCritRain,k_soil,vGn_n,theta_sat
```

In [None]:
# Step 5b — Run calibration (DE + KGE)

results_file = symfluence.managers['optimization'].calibrate_model()  
print("Calibration results file:", results_file)
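
To give a feel for what a Differential Evolution search does behind that one-liner, here is a self-contained toy sketch. It is not SYMFLUENCE's optimizer: the `differential_evolution` function, its defaults (`pop_size`, `f`, `cr`), and the quadratic `toy_cost` are illustrative assumptions that follow the classic DE/rand/1/bin scheme. In a real calibration, each cost evaluation would be a full model run scored against observations.

```python
import random

def differential_evolution(cost, bounds, pop_size=20, generations=200,
                           f=0.7, cr=0.9, seed=0):
    """Minimise `cost` over box `bounds` with a basic DE/rand/1/bin loop."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [cost(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct members other than the current target vector
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene always comes from the mutant
            trial = []
            for k, (lo, hi) in enumerate(bounds):
                if k == forced or rng.random() < cr:
                    value = pop[a][k] + f * (pop[b][k] - pop[c][k])
                    value = min(max(value, lo), hi)  # clip to the parameter bounds
                else:
                    value = pop[i][k]
                trial.append(value)
            trial_score = cost(trial)
            if trial_score <= scores[i]:  # greedy selection: keep the better vector
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

# Toy objective: squared distance from a known "true" parameter pair
target = (1.0, 0.5)

def toy_cost(params):
    return sum((p - t) ** 2 for p, t in zip(params, target))

best_params, best_cost = differential_evolution(toy_cost, [(-2.0, 3.0), (0.0, 1.0)])
print("Best parameters:", [round(p, 4) for p in best_params])
print("Best cost:", best_cost)
```

The population size and number of generations set the number of model runs (here 20 initial evaluations plus 20 per generation), which is the main cost driver when each evaluation is a SUMMA simulation.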