# CONFLUENCE Tutorial 1b — Point-Scale Workflow (FLUXNET CA-NS7)

## Introduction
This notebook mirrors the concise, configuration-first style established in **Tutorial 01a** and adapts it for **energy-balance validation at a FLUXNET tower (CA-NS7)**. We simulate point-scale land–atmosphere exchanges and evaluate **latent heat flux (LE)**, expressed as evapotranspiration (ET), and **sensible heat flux (H)** against FLUXNET observations.

The workflow is strictly configuration-driven and fully reproducible:
1) write a minimal config, 2) initialize CONFLUENCE and standard project layout, 3) define the point-scale domain, 4) acquire & preprocess inputs, 5) run **SUMMA**, and 6) evaluate fluxes.


# Step 1 — Configuration (pick or generate)

We start by generating a compact configuration for the **CA-NS7** FLUXNET site using the same pattern as 01a. This keeps initialization a one-liner and the workflow fully reproducible.

In [None]:
# Step 1 — Create a site-specific configuration for the CA-NS7 FLUXNET example
from pathlib import Path
import yaml

# Path to the default template configuration (same pattern as 01a)
config_template = Path("../../0_config_files/config_template.yaml")

# Load the base configuration
with open(config_template, "r") as f:
    config = yaml.safe_load(f)

# === Modify key entries for the CA-NS7 point-scale case ===
# Code & data directories
config["CONFLUENCE_CODE_DIR"] = str(Path("../").resolve())
config["CONFLUENCE_DATA_DIR"] = str(Path("/path/to/CONFLUENCE_data").resolve())

# Point-scale domain settings
config["DOMAIN_DEFINITION_METHOD"] = "point"
config["DOMAIN_DISCRETIZATION"] = "GRUs"  # 1 GRU => 1 HRU
config["DOMAIN_NAME"] = "CA-NS7"
config["POUR_POINT_COORDS"] = "56.6358/-99.9483"  # CA-NS7 coordinates

# Data/forcing & model
config["HYDROLOGICAL_MODEL"] = "SUMMA"
config["FORCING_DATASET"] = "ERA5"  # Used for meteorological inputs
config["DOWNLOAD_FLUXNET"] = True
config["FLUXNET_STATION"] = "CA-NS7"

# Experiment timeline (follows original 01b intent)
config["EXPERIMENT_TIME_START"] = "2001-01-01 01:00"
config["EXPERIMENT_TIME_END"]   = "2005-12-31 23:00"
config["CALIBRATION_PERIOD"]    = "2002-01-01, 2003-12-31"
config["EVALUATION_PERIOD"]     = "2004-01-01, 2005-12-31"
config["SPINUP_PERIOD"]         = "2001-01-01, 2001-12-31"

# (Optional) Paths to institutional data roots — customize if using shared infra
config['DATATOOL_DATASET_ROOT'] = '/path/to/meteorological-data/'
config['GISTOOL_DATASET_ROOT']  = '/path/to/geospatial-data/'
config['TOOL_CACHE']            = '/path/to/cache/dir'
config['CLUSTER_JSON']          = '/path/to/cluster.json'

# Basic optimization knobs if desired (example only)
config['PARAMS_TO_CALIBRATE'] = 'tempCritRain,k_soil,vGn_n,theta_sat'

# Unique experiment ID for outputs
config["EXPERIMENT_ID"] = "run_fluxnet_1"

# === Save the customized configuration ===
out_config = Path("../../0_config_files/config_fluxnet_CA-NS7.yaml")
with open(out_config, "w") as f:
    yaml.dump(config, f, default_flow_style=False, sort_keys=False)

print(f"✅ New configuration written to: {out_config}")
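Before initializing the framework, a quick sanity check can catch typos in the period strings or the coordinates. The sketch below is not part of CONFLUENCE: the helpers `parse_period`, `parse_coords`, and `check_config` are hypothetical names, and the code only assumes the string formats used in the cell above (`"start, end"` for periods and `"lat/lon"` for coordinates).

```python
from datetime import datetime


def parse_period(text):
    """Split a 'YYYY-MM-DD, YYYY-MM-DD' string into two datetimes."""
    start, end = (datetime.strptime(part.strip(), "%Y-%m-%d") for part in text.split(","))
    return start, end


def parse_coords(text):
    """Split a 'lat/lon' string into two floats."""
    lat, lon = (float(part) for part in text.split("/"))
    return lat, lon


def check_config(cfg):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    lat, lon = parse_coords(cfg["POUR_POINT_COORDS"])
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        problems.append(f"coordinates out of range: {lat}, {lon}")

    run_start = datetime.strptime(cfg["EXPERIMENT_TIME_START"], "%Y-%m-%d %H:%M")
    run_end = datetime.strptime(cfg["EXPERIMENT_TIME_END"], "%Y-%m-%d %H:%M")
    for key in ("SPINUP_PERIOD", "CALIBRATION_PERIOD", "EVALUATION_PERIOD"):
        start, end = parse_period(cfg[key])
        if start > end:
            problems.append(f"{key}: start is after end")
        # Compare calendar dates only, since periods carry no time of day
        if start.date() < run_start.date() or end.date() > run_end.date():
            problems.append(f"{key}: outside the experiment window")
    return problems


# Values copied from the configuration cell above
example = {
    "POUR_POINT_COORDS": "56.6358/-99.9483",
    "EXPERIMENT_TIME_START": "2001-01-01 01:00",
    "EXPERIMENT_TIME_END": "2005-12-31 23:00",
    "SPINUP_PERIOD": "2001-01-01, 2001-12-31",
    "CALIBRATION_PERIOD": "2002-01-01, 2003-12-31",
    "EVALUATION_PERIOD": "2004-01-01, 2005-12-31",
}
print(check_config(example) or "no problems found")
```

In the notebook, the same check could be applied to the `config` dictionary just before it is written to disk.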

### Step 1b — Initialize CONFLUENCE
Initialize the framework using the configuration prepared above.

In [None]:
# Step 1b — Initialize CONFLUENCE
import os, sys
sys.path.append(os.path.abspath(os.path.join("..", "..")))
from CONFLUENCE import CONFLUENCE  # adjust if your import path differs

config_path = "../../0_config_files/config_fluxnet_CA-NS7.yaml"
confluence = CONFLUENCE(config_path)

print("✅ CONFLUENCE initialized successfully.")
print(f"Configuration loaded from: {config_path}")

### Step 1c — Project structure setup
Create the standardized project directory and a pour-point feature for the site.

In [None]:
# Step 1c — Project structure setup
from pathlib import Path

# 1) Create the standardized project layout (logs, config link, data/output folders, etc.)
project_dir = confluence.managers['project'].setup_project()

# 2) Create a pour-point feature (site reference geometry for point-scale workflows)
pour_point_path = confluence.managers['project'].create_pour_point()

print("✅ Project structure created.")
print(f"Project root: {project_dir}")
print(f"Pour point:   {pour_point_path}")

# 3) Brief top-level directory preview
print("\nTop-level structure:")
for p in sorted(Path(project_dir).iterdir()):
    if p.is_dir():
        print(f"├── {p.name}")

# Step 2 — Domain definition (point-scale GRU)
The domain is a **single GRU** around the flux tower footprint, ensuring a strictly point-scale (non-routed) experiment.

### Step 2a — Geospatial attribute acquisition

In [None]:
# Step 2a — Acquire attributes (model-agnostic)
confluence.managers['data'].acquire_attributes()
print("✅ Attribute acquisition complete")

### Step 2b — Domain definition (point-scale)
Define a minimal footprint around **CA-NS7** consistent with the pour point.

In [None]:
# Step 2b — Define the point-scale domain
watershed_path = confluence.managers['domain'].define_domain()
print("✅ Domain definition complete")
print(f"Domain file: {watershed_path}")
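To make the idea of a "minimal footprint" concrete, the sketch below computes the bounds of a small square centred on the tower coordinate. This is only an illustration under stated assumptions: `square_footprint` and the 500 m half-width are hypothetical, the Earth is treated as a sphere, and `define_domain()` may well construct its geometry differently.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)


def square_footprint(lat, lon, half_width_m=500.0):
    """Return (lat_min, lat_max, lon_min, lon_max) of a square centred on (lat, lon)."""
    # Angular half-width along a meridian
    dlat = math.degrees(half_width_m / EARTH_RADIUS_M)
    # Parallels shrink with cos(latitude), so the same distance spans more longitude
    dlon = math.degrees(half_width_m / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


# CA-NS7 coordinates in the "lat/lon" format used by POUR_POINT_COORDS
lat, lon = (float(v) for v in "56.6358/-99.9483".split("/"))
lat_min, lat_max, lon_min, lon_max = square_footprint(lat, lon)
print(f"lat: {lat_min:.5f} to {lat_max:.5f}")
print(f"lon: {lon_min:.5f} to {lon_max:.5f}")
```

At this latitude the longitude span is noticeably wider than the latitude span for the same ground distance, which is why the two offsets are computed separately.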

### Step 2c — Discretization (required even for 1 GRU = 1 HRU)
This step creates the **catchment HRU** artifacts required by downstream steps; at point scale the HRU remains 1:1 with the GRU.

In [None]:
# Step 2c — Discretization (GRUs → HRUs 1:1)
hru_path = confluence.managers['domain'].discretize_domain()
print("✅ Domain discretization complete")
print(f"HRU file: {hru_path}")

### Step 2d — Verification & inspection (CA-NS7)
We verify the expected shapefiles in standardized locations, then draw a minimal GRU–HRU overlay.

In [None]:
# Step 2d — Verify domain outputs and inspect geometry
from pathlib import Path
import geopandas as gpd
import matplotlib.pyplot as plt
import yaml

# 1) Read config to derive data & domain paths
with open("../../0_config_files/config_fluxnet_CA-NS7.yaml") as f:
    cfg = yaml.safe_load(f)

data_dir   = Path(cfg["CONFLUENCE_DATA_DIR"])
domain_dir = data_dir / f"domain_{cfg['DOMAIN_NAME']}"
shp_dir    = domain_dir / "shapefiles"

# 2) Explicit expected shapefiles for CA-NS7
gru_fp = shp_dir / "river_basins" / f"{cfg['DOMAIN_NAME']}_riverBasins_point.shp"
hru_fp = shp_dir / "catchment"     / f"{cfg['DOMAIN_NAME']}_HRUs_GRUs.shp"

# 3) Verify presence
for label, path in [("GRU", gru_fp), ("HRU", hru_fp)]:
    if not path.exists():
        raise FileNotFoundError(f"❌ Expected {label} file not found: {path}")
    print(f"✅ {label} file found: {path}")

# 4) Minimal overlay plot
gru = gpd.read_file(gru_fp)
hru = gpd.read_file(hru_fp)
if hru.crs != gru.crs:
    hru = hru.to_crs(gru.crs)
ax = gru.plot(figsize=(6, 6))
hru.plot(ax=ax, facecolor="none")
ax.set_title("CA-NS7 — GRU vs HRU")
ax.set_xlabel("")
ax.set_ylabel("")
ax.set_aspect("equal")
plt.tight_layout()
plt.show()

# Step 3 — Input preprocessing (model-agnostic)
We prepare inputs in three small moves: 1) acquire **meteorological forcings**, 2) process **FLUXNET observations**, and 3) run **model-agnostic preprocessing** to standardize variables and time steps.

### Step 3a — Acquire meteorological forcings (ERA5)

In [None]:
# Step 3a — Forcings
confluence.managers['data'].acquire_forcings()
print("✅ Forcing data acquisition complete")

### Step 3b — Process observations (FLUXNET)

In [None]:
# Step 3b — Observations
confluence.managers['data'].process_observed_data()
print("✅ FLUXNET observational data processing complete")

### Step 3c — Model-agnostic preprocessing

In [None]:
# Step 3c — Model-agnostic preprocessing
confluence.managers['data'].run_model_agnostic_preprocessing()
print("✅ Model-agnostic preprocessing complete")

### Step 3d — Quick verification
Confirm the expected folders exist and contain files (derived from configuration; no hard-coded paths).

In [None]:
from pathlib import Path
import yaml

with open("../../0_config_files/config_fluxnet_CA-NS7.yaml") as f:
    cfg = yaml.safe_load(f)

data_dir   = Path(cfg["CONFLUENCE_DATA_DIR"])
domain_dir = data_dir / f"domain_{cfg['DOMAIN_NAME']}"

targets = {
    "forcing/raw_data":                             domain_dir / "forcing" / "raw_data",
    "forcing/basin_averaged_data":                  domain_dir / "forcing" / "basin_averaged_data",
    "observations/fluxnet/raw_data":                domain_dir / "observations" / "fluxnet" / "raw_data",
    "observations/energy_fluxes/fluxnet/processed": domain_dir / "observations" / "energy_fluxes" / "fluxnet" / "processed",
}

def count_files(p: Path) -> int:
    return sum(1 for x in p.iterdir() if x.is_file()) if p.exists() else 0

for label, path in targets.items():
    exists = path.exists()
    n = count_files(path)
    status = "✅" if exists and n > 0 else ("⚠️ empty" if exists else "❌ missing")
    suffix = f"({n} files)" if exists else ""
    print(f"{status} {label}  {suffix}")

# Step 4 — Model-specific preprocessing & model run (SUMMA)

### Step 4a — SUMMA-specific preprocessing

In [None]:
# Step 4a — SUMMA-specific preprocessing
confluence.managers['model'].preprocess_models()
print("✅ Model-specific preprocessing complete")

### Step 4b — Instantiate & run the model

In [None]:
# Step 4b — Instantiate & run SUMMA
print(f"Running {confluence.config['HYDROLOGICAL_MODEL']} for point-scale simulation…")
confluence.managers['model'].run_models()
print("✅ Point-scale model run complete")

### Step 4c — Quick verification
Print where SUMMA inputs and run outputs were written (paths are derived from the configuration).

In [None]:
from pathlib import Path
import yaml

with open("../../0_config_files/config_fluxnet_CA-NS7.yaml") as f:
    cfg = yaml.safe_load(f)

data_dir   = Path(cfg["CONFLUENCE_DATA_DIR"])
domain_dir = data_dir / f"domain_{cfg['DOMAIN_NAME']}"

# Common locations used by the model manager
summa_in   = domain_dir / "forcing" / "SUMMA_input"
results    = domain_dir / "simulations" / cfg['EXPERIMENT_ID'] / 'SUMMA'

print("SUMMA input dir:", summa_in if summa_in.exists() else "(not found)")
print("Results dir:",    results if results.exists()    else "(not found)")

# Step 5 — ET & H Validation (FLUXNET vs Simulation)
We compute basic metrics and draw quick comparisons between **observed** and **simulated** latent heat/ET and sensible heat. The code is resilient to different variable names in SUMMA outputs.
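The evaluation compares ET in mm/day, so latent heat flux in W/m² is scaled by a constant of about 0.0353. The sketch below shows where that number comes from: 1 W/m² sustained for one day delivers 86 400 J/m², and dividing by the latent heat of vaporization gives kg/m² of water, which equals mm of depth. The value of `LAMBDA_VAP` (2.45 MJ/kg, roughly valid near 20 °C) is an assumption; it varies slightly with temperature, and the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400
LAMBDA_VAP = 2.45e6  # J/kg, assumed latent heat of vaporization (temperature dependent)


def le_to_et_mm_per_day(le_w_m2, lambda_vap=LAMBDA_VAP):
    """Convert a daily-mean latent heat flux (W/m^2) to ET (mm/day).

    W/m^2 * s/day = J/m^2/day; dividing by J/kg gives kg/m^2/day,
    and 1 kg/m^2 of liquid water corresponds to a depth of 1 mm.
    """
    return le_w_m2 * SECONDS_PER_DAY / lambda_vap


factor = le_to_et_mm_per_day(1.0)
print(f"{factor:.4f} mm/day per W/m^2")
print(f"100 W/m^2 is about {le_to_et_mm_per_day(100.0):.2f} mm/day")
```

Because the factor is linear, it can be applied either before or after daily averaging of sub-daily LE without changing the result.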

In [None]:
from pathlib import Path
import yaml, pandas as pd, numpy as np, xarray as xr
import matplotlib.pyplot as plt
import re

# --- Paths from config ---
with open("../../0_config_files/config_fluxnet_CA-NS7.yaml") as f:
    cfg = yaml.safe_load(f)

data_dir   = Path(cfg["CONFLUENCE_DATA_DIR"])
domain_dir = data_dir / f"domain_{cfg['DOMAIN_NAME']}"

# Find a daily SUMMA output (e.g., *_day.nc) for this experiment
sim_dir = domain_dir / "simulations" / cfg["EXPERIMENT_ID"]
nc_files = sorted(sim_dir.rglob("*_day.nc"))
if not nc_files:
    raise FileNotFoundError(f"No daily netCDF found (pattern '*_day.nc') under: {sim_dir}")
nc_path = nc_files[0]  # first match in sorted order

# --- Load simulation (daily) and skip spinup year ---
ds = xr.open_dataset(nc_path)
start_year = int(ds["time"].dt.year.min()) + 1
ds_eval = ds.sel(time=slice(f"{start_year}-01-01", None))

# Helper: first candidate name present in the dataset (None if no match)
def pick_var(candidates, ds_vars):
    return next((v for v in candidates if v in ds_vars), None)

# Helper: collapse singleton dims (e.g., hru/gru) so the series is indexed by time only
def to_daily_series(da):
    s = da.squeeze(drop=True).to_series()
    s.index = pd.to_datetime(s.index).normalize()  # align daily stamps with resampled obs
    return s

# Latent heat flux candidates (W/m^2) or ET (mm/day)
le_candidates = [
    "Qle", "qle", "latentHeatFlux", "scalarLatHeatTotal", "LE", "LE_F_MDS_sim"
]
et_candidates = ["ET", "evapotranspiration", "evspsbl", "tEvap", "et_mm_day"]

# Sensible heat flux candidates (W/m^2)
h_candidates  = ["Qh", "qh", "sensibleHeatFlux", "scalarSenHeatTotal", "H", "H_F_MDS_sim"]

ds_vars = set(ds_eval.data_vars)
le_var = pick_var(le_candidates, ds_vars)
et_var = pick_var(et_candidates, ds_vars)
h_var  = pick_var(h_candidates,  ds_vars)

# NOTE: check that simulated and observed fluxes share a sign convention
# (FLUXNET LE and H are positive upward) before interpreting bias.

# Prefer a native ET variable (assumed to be in mm/day); otherwise convert LE → ET.
# 1 W/m^2 ≈ 0.0353 mm/day (86400 s/day ÷ ~2.45e6 J/kg)
if et_var is not None:
    et_series_sim = to_daily_series(ds_eval[et_var])
elif le_var is not None:
    et_series_sim = to_daily_series(ds_eval[le_var]) * 0.0353
else:
    raise KeyError(f"Could not find ET or LE variables. Tried ET candidates {et_candidates} and LE candidates {le_candidates}")

if h_var is None:
    raise KeyError(f"Could not find sensible heat flux variable. Tried candidates: {h_candidates}")
h_series_sim = to_daily_series(ds_eval[h_var])

# --- Load FLUXNET observations (processed) ---
proc_dir = domain_dir / "observations" / "energy_fluxes" / "fluxnet" / "processed"
cand = sorted(proc_dir.rglob("*.csv"))
if not cand:
    raise FileNotFoundError(f"No processed FLUXNET CSV found under: {proc_dir}")
obs_path = cand[0]
obs = pd.read_csv(obs_path)

# Make a timestamp; try common names (the pattern also matches TIMESTAMP_START)
ts_col = next((c for c in obs.columns if re.search(r"timestamp|time|date", c, re.I)), None)
if ts_col is None:
    raise KeyError("Could not find a timestamp column in processed FLUXNET data.")
if pd.api.types.is_numeric_dtype(obs[ts_col]):
    # FLUXNET-style integer stamps (YYYYMMDDHHMM) need an explicit format;
    # otherwise pandas would read the integers as nanoseconds since the epoch
    obs[ts_col] = pd.to_datetime(obs[ts_col].astype(str), format="%Y%m%d%H%M", errors="coerce")
else:
    obs[ts_col] = pd.to_datetime(obs[ts_col], errors="coerce")

obs = obs.dropna(subset=[ts_col])
obs = obs.set_index(ts_col).sort_index()

# Observation variables: latent heat (LE) and sensible heat (H)
le_obs_candidates = ["LE_F_MDS", "LE", "ET_from_LE_mm_per_day"]
h_obs_candidates  = ["H_F_MDS", "H"]

def first_available(cols, df):
    for c in cols:
        if c in df.columns:
            return c
    return None

le_obs_col = first_available(le_obs_candidates, obs)
h_obs_col  = first_available(h_obs_candidates,  obs)
if le_obs_col is None or h_obs_col is None:
    raise KeyError(f"Missing expected FLUXNET columns. Tried LE {le_obs_candidates} and H {h_obs_candidates}")

# Convert LE (W/m^2) to ET (mm/day) unless the column is already in mm/day
if le_obs_col == "ET_from_LE_mm_per_day":
    obs["ET_obs_mm_day"] = obs[le_obs_col]
else:
    obs["ET_obs_mm_day"] = obs[le_obs_col] * 0.0353

# Daily aggregation for obs (if higher frequency)
obs_daily = obs.resample("1D").mean(numeric_only=True)

et_obs_series = obs_daily["ET_obs_mm_day"].dropna()
h_obs_series  = obs_daily[h_obs_col].dropna()

# Align on the dates present in both series
def align(a, b):
    idx = a.index.intersection(b.index)
    if idx.empty:
        raise ValueError("No overlapping timestamps between simulation and observations.")
    return a.loc[idx].astype(float), b.loc[idx].astype(float)

et_sim_a, et_obs_a = align(et_series_sim, et_obs_series)
h_sim_a,  h_obs_a  = align(h_series_sim,  h_obs_series)

def rmse(a, b):
    d = (a - b).to_numpy(dtype=float)
    return float(np.sqrt(np.nanmean(d**2)))

def bias(a, b):
    return float((a - b).mean())

def corr(a, b):
    return float(pd.Series(a).corr(pd.Series(b)))

et_metrics = dict(RMSE=round(rmse(et_sim_a, et_obs_a), 3), Bias=round(bias(et_sim_a, et_obs_a), 3), r=round(corr(et_sim_a, et_obs_a), 3))
h_metrics  = dict(RMSE=round(rmse(h_sim_a,  h_obs_a ), 3), Bias=round(bias(h_sim_a,  h_obs_a ), 3), r=round(corr(h_sim_a,  h_obs_a ), 3))
print("ET metrics:", et_metrics)
print("H  metrics:", h_metrics)

# Minimal plots
fig, axes = plt.subplots(2, 1, figsize=(9, 6), sharex=True)
axes[0].plot(et_obs_a.index, et_obs_a.values, label="ET obs (mm/d)")
axes[0].plot(et_sim_a.index, et_sim_a.values, label="ET sim (mm/d)")
axes[0].set_ylabel("mm/day")
axes[0].set_title("Evapotranspiration")
axes[0].legend()

axes[1].plot(h_obs_a.index, h_obs_a.values, label="H obs (W/m²)")
axes[1].plot(h_sim_a.index, h_sim_a.values, label="H sim (W/m²)")
axes[1].set_ylabel("W m$^{-2}$")
axes[1].set_title("Sensible Heat Flux")
axes[1].legend()
plt.tight_layout()
plt.show()

# Scatter plots for quick skill view
fig2, axes2 = plt.subplots(1, 2, figsize=(10, 4))
axes2[0].scatter(et_obs_a, et_sim_a, s=6)
axes2[0].set_xlabel("ET obs (mm/d)")
axes2[0].set_ylabel("ET sim (mm/d)")
axes2[0].set_title(f"ET (r={et_metrics['r']})")

axes2[1].scatter(h_obs_a, h_sim_a, s=6)
axes2[1].set_xlabel("H obs (W/m²)")
axes2[1].set_ylabel("H sim (W/m²)")
axes2[1].set_title(f"H (r={h_metrics['r']})")
plt.tight_layout()
plt.show()
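The alignment logic is easier to follow on a toy example. The sketch below uses made-up numbers (not FLUXNET or SUMMA data) to show the three moves the evaluation relies on: averaging half-hourly observations to daily values, dropping the time of day from daily simulation stamps, and computing metrics only on the dates both series share.

```python
import numpy as np
import pandas as pd

# Synthetic half-hourly "observations": three days with constant values 10, 20, 30
obs_index = pd.date_range("2004-06-01", periods=3 * 48, freq="30min")
obs = pd.Series(np.repeat([10.0, 20.0, 30.0], 48), index=obs_index)

# Synthetic daily "simulation" stamped at 12:00 and covering one extra day
sim_index = pd.date_range("2004-06-01 12:00", periods=4, freq="D")
sim = pd.Series([12.0, 18.0, 33.0, 40.0], index=sim_index)

# 1) Daily means of the sub-daily observations (stamped at midnight)
obs_daily = obs.resample("1D").mean()

# 2) Drop the time of day so simulation dates can match observation dates
sim_daily = sim.copy()
sim_daily.index = sim_daily.index.normalize()

# 3) Keep only the dates present in both series
common = sim_daily.index.intersection(obs_daily.index)
sim_a, obs_a = sim_daily.loc[common], obs_daily.loc[common]

# Metrics follow the simulated-minus-observed convention
diff = sim_a - obs_a
rmse = float(np.sqrt((diff ** 2).mean()))
bias = float(diff.mean())
r = float(sim_a.corr(obs_a))
print(f"n={len(common)}  RMSE={rmse:.3f}  Bias={bias:.3f}  r={r:.3f}")
```

Without the `normalize()` step the two indexes would share no timestamps here, and every metric would be computed on an empty selection.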

---
**Notes**
- This notebook follows the same step headers, configuration-first approach, manager calls, and verification style as Tutorial 01a.
- Replace `CONFLUENCE_DATA_DIR` with your actual data path, and set shared infra paths if needed.
- The evaluation block is resilient to variable-name differences; adjust candidate lists if your SUMMA outputs use custom names.