## AI4PEX SINDBAD Tutorial 
## 1. Parameter inversion exercise
A notebook by Sujan Koirala, Xu Shan, Jialiang Zhou and Nuno Carvalhais

---
## SINDBAD
[SINDBAD](http://sindbad-mdi.org/) is a model-data integration framework for terrestrial carbon-water processes [[Koirala et al., in prep.](https://essopenarchive.org/users/551954/articles/1271244)]. It is built in Julia with a view to speed and differentiability, to support the development of process representations and of the responses of ecosystem functioning to meteorological conditions and changes in climate. It builds on the concept of modularity to formally test hypotheses on the representation of processes / models ($f(X,\theta)$), for given observational constraints ($Y$) and drivers ($X$) of the carbon and water dynamics in terrestrial ecosystems. Modularity extends to the initial condition problem ($\text{x}^*_0$), cost functions ($\mathcal{L}(\theta)$) and optimization algorithms ($\mathcal{O}$). SINDBAD integrates machine learning to enhance the representation of processes in mechanistically inspired models (hybrid modeling [Reichstein et al., 2019]) by learning ML-based parameterizations [e.g. Bao et al., 2024], paving the way for process abstraction [Son et al., 2024].

---
## WROASTED: a Simple Coupled Carbon–Water Ecosystem Model
The carbon dynamics, $\frac{dC}{dt}$, are simulated as the difference between gross assimilation and respiratory fluxes:
$$
\frac{dC}{dt} = GPP - R_{ECO}
$$

where ${GPP}$, gross primary productivity, results from photosynthetic activity and $R_{ECO}$, ecosystem respiration, is the sum of autotrophic and heterotrophic respiratory fluxes, namely, $R_{A}$ and $R_{H}$. 

$R_{A}$ integrates both maintenance and growth respiration, $R_{M}$ and $R_{G}$, where $R_{M}$ can be generically written as:

$$
R_{M} =\sum_{i=1}^{N} \tau_i \cdot C_i \cdot f_T 
$$

Here $i$ indexes the vegetation carbon pools ($C_i$: root, wood, leaf and reserves), $\tau_i$ is the turnover rate of pool $i$, and $f_T$ is the temperature dependence of metabolic activity, usually a $Q_{10}$ function. Growth respiration is $R_G=Y_G \cdot GPP$, where $Y_G$ is a constant growth efficiency parameter [see Amthor, 2001].
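The two respiration terms above can be sketched in a few lines of Julia. This is a minimal illustration, not the WROASTED implementation: the function names, the pool sizes, the turnover rates, and the values `q10 = 2.0`, `t_ref = 20.0` and `y_g = 0.25` are all assumptions chosen for readability.

```julia
# Q10 temperature response: the rate is multiplied by q10 for every 10 degC above t_ref.
# q10 and t_ref are illustrative defaults, not WROASTED parameter values.
q10_response(t_air; q10=2.0, t_ref=20.0) = q10^((t_air - t_ref) / 10.0)

# Maintenance respiration: sum over pools of (turnover rate * pool size), scaled by f_T.
function maintenance_respiration(turnover, pools, t_air; q10=2.0, t_ref=20.0)
    f_T = q10_response(t_air; q10=q10, t_ref=t_ref)
    return sum(turnover .* pools) * f_T
end

# Growth respiration as a constant fraction y_g of GPP (y_g is hypothetical here).
growth_respiration(gpp; y_g=0.25) = y_g * gpp

# Hypothetical pools (root, wood, leaf, reserves) and turnover rates.
pools    = [100.0, 2000.0, 50.0, 30.0]
turnover = [0.002, 0.0001, 0.004, 0.001]

r_m = maintenance_respiration(turnover, pools, 30.0)  # air temperature of 30 degC
r_g = growth_respiration(8.0)                          # GPP of 8 units
r_a = r_m + r_g                                        # autotrophic respiration
```

The heterotrophic term $R_H$ has the same structure, with an additional moisture scalar $f_W$ multiplying each pool's contribution.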

$R_H$ results from litter and soil decomposition:
$$
R_{H} =\sum_{i=1}^{N} \tau_i \cdot C_i \cdot f_T \cdot f_W
$$

Here $i$ indexes the heterotrophic carbon pools ($C_i$) in soils (fast and slow litter and organic carbon pools), $\tau_i$ is the turnover rate of pool $i$, and $f_T$ and $f_W$ are the temperature and soil moisture sensitivity functions of decomposition.

Soil moisture dynamics, $\frac{dW}{dt}$:
$$
\frac{dW}{dt} = P_r - E_i - E_s - Q - D - T_r
$$

where $P_r$ is precipitation, $E_i$ interception evaporation, $E_s$ soil evaporation, $Q$ surface runoff, $D$ drainage and $T_r$ plant transpiration.
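As a rough sketch of how this balance could be stepped forward in time, the snippet below applies one explicit daily update. It is only an illustration under stated assumptions: the function name, the named-tuple layout of the fluxes, and the storage capacity `w_max` are hypothetical, and SINDBAD computes each flux with dedicated process models.

```julia
# One explicit step of dW/dt = P_r - E_i - E_s - Q - D - T_r.
# Fluxes in mm per day, storage in mm; w_max is a hypothetical storage capacity.
function step_soil_water(w, fluxes; w_max=200.0, dt=1.0)
    dw = fluxes.p_r - fluxes.e_i - fluxes.e_s - fluxes.q - fluxes.d - fluxes.t_r
    # keep the storage within its physical limits
    return clamp(w + dt * dw, 0.0, w_max)
end

# Hypothetical daily fluxes for a rainy day.
fluxes = (p_r=10.0, e_i=1.0, e_s=0.5, q=2.0, d=1.5, t_r=2.0)
w_next = step_soil_water(100.0, fluxes)
```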

Transpiration is tightly coupled to $GPP$, which is estimated as:

$$
GPP = \min(GPP_D, GPP_S)
$$

where demand-limited $GPP$ is:
$$
GPP_D = \epsilon^* \cdot f\text{APAR} \cdot \text{PAR} \cdot (f_L \cdot f_{CI} \cdot f_T \cdot f_{VPD} \cdot f_W)
$$
This is the product of: maximum light use efficiency, $\epsilon^*$; the incoming photosynthetically active radiation, $\text{PAR}$, and the fraction of it absorbed by leaves, $f\text{APAR}$; and the instantaneous effects of light intensity $f_L$, cloudiness index $f_{CI}$, temperature $f_T$, vapor pressure deficit $f_{VPD}$ and soil moisture $f_W$ [see Bao et al., 2023; 2024].

And supply-limited $GPP$ is:

$$
GPP_S = PAW^{k_{Tr}} \cdot WUE
$$

where $PAW$ is the plant-available water, the exponent $k_{Tr}$ is a model parameter, and the daily variations in water use efficiency, $WUE$, result from changes in $VPD$ and $[CO_2]_{atm}$. After $C$ assimilation by vegetation and deduction of $R_A$ costs, the available carbon is allocated to the different vegetation pools depending on environmental conditions, as inspired by the growing season index (GSI) model [see Koirala et al., in print; Jolly et al., 2005].
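The interplay between the light-driven (demand) and water-driven (supply) estimates of $GPP$ can be sketched as follows. The function names and all numerical values are assumptions made for illustration; the actual stress scalars in WROASTED are computed from the meteorological drivers.

```julia
# Demand-limited GPP: maximum light use efficiency times absorbed PAR,
# reduced by stress scalars that each lie in [0, 1].
gpp_demand(eps_max, fapar, par; f_l=1.0, f_ci=1.0, f_t=1.0, f_vpd=1.0, f_w=1.0) =
    eps_max * fapar * par * (f_l * f_ci * f_t * f_vpd * f_w)

# Supply-limited GPP: plant-available water raised to k_tr, times water use efficiency.
gpp_supply(paw, k_tr, wue) = paw^k_tr * wue

# Actual GPP is the smaller of the two estimates.
gpp(demand, supply) = min(demand, supply)

# Hypothetical values for a single day.
g_d = gpp_demand(2.0, 0.8, 10.0; f_t=0.9, f_vpd=0.5)
g_s = gpp_supply(50.0, 0.5, 1.5)
g   = gpp(g_d, g_s)
```

With these numbers the demand term is the smaller one, so light and atmospheric stress, not water supply, set $GPP$ on that day.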

Overall, WROASTED includes more than 40 parameters controlling the responses of carbon and water dynamics in terrestrial ecosystems. These can be constrained by observations of ecosystem fluxes from eddy covariance, plant phenology from remote sensing (EO) data, and aboveground biomass stocks, where available [see Koirala et al., in print].

---
## A simple LUE-model
A simpler light use efficiency (LUE) model is set up to further test the hybrid modeling setup, where solely:
$$
GPP = \epsilon^* \cdot f\text{APAR} \cdot \text{PAR} \cdot (f_L \cdot f_{CI} \cdot f_T \cdot f_{VPD} \cdot f_W)
$$

where there is no supply limitation of $GPP$, $f_L=f_W=1$, $f_{VPD}$ follows PRELES [REF], $f_T$ follows CASA [Potter et al., 1993], and $f_{CI}$ follows Wang et al. [2015]. $f\text{APAR}$ is a constant.

---
## The challenge
To calibrate and generalize the model parameterization.

---
## Parameter inversion
The goal is to find $\theta$ such that the model predictions $f(X, \theta)$ best match the observed datasets $y$. Here, the terrestrial ecosystem model WROASTED, represented by $f(X, \theta)$, predicts a set of ecosystem carbon and water state and flux variables, $\hat{y}$, at the observed locations, where:
- $X$: meteorological drivers (e.g., temperature, radiation, precipitation, $VPD$);
- $\theta$: parameter vector to be estimated;
- $y$: observations (e.g., $GPP$, $T_r$, evapotranspiration, $R_{ECO}$, aboveground biomass AGB, $f\text{APAR}$).

### Optimization problem
Generically, the problem can be written as:
$$
\theta^*=\arg\min_{\theta \in \Theta} \; \mathcal{L}(\theta)\quad\text{via}\quad\mathcal{O}
$$

Where:
- $\mathcal{L}(\theta)$: cost function quantifying the mismatch between model predictions and observations,
- $\Theta$: feasible parameter space (e.g., bounds or priors on $\theta$),
- $\mathcal{O}$: optimization operator/algorithm (e.g., gradient descent, L-BFGS, CMA-ES).
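To make the three ingredients concrete, here is a deliberately tiny stand-in: a one-parameter toy model, a sum-of-squares cost, and a brute-force grid search playing the role of $\mathcal{O}$ within bounds $\Theta = [0, 2]$. None of this is SINDBAD code; all names are hypothetical, and real inversions use far more capable optimizers such as CMA-ES.

```julia
# Toy model f(X, theta) = theta * X with a single parameter;
# it stands in for the ecosystem model.
toy_model(x, theta) = theta .* x

# Sum of squared errors as a stand-in cost function L(theta).
cost(theta, x, y) = sum(abs2, y .- toy_model(x, theta))

# Brute-force "optimizer": evaluate the cost on a regular grid inside the bounds
# and return the candidate with the lowest cost.
function grid_search(cost_fn, lower, upper; n=1001)
    candidates = range(lower, upper; length=n)
    costs = [cost_fn(t) for t in candidates]
    return candidates[argmin(costs)]
end

x = collect(1.0:10.0)
y = toy_model(x, 0.7)   # synthetic observations generated with a known theta = 0.7
theta_star = grid_search(t -> cost(t, x, y), 0.0, 2.0)
```

Because the observations are synthetic, the search should recover a value close to the known parameter, which is the same logic used with the synthetic site data in this tutorial.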

In the exercise here, for fluxes and phenology time series, the loss function $\mathcal{L}(\theta)$ is set to one minus the normalized Nash-Sutcliffe Efficiency (NNSE), so that a perfect fit gives a loss of zero:
$$
\mathcal{L}(\theta) = 1 - \text{NNSE}(\theta) = 1 - \frac{1}{2-\text{NSE}(\theta)}
$$
$$
\text{NSE}(\theta) = 1 - \frac{\sum_{i=1}^{N} (y_i - f(X_i, \theta))^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
$$

For stocks (AGB), an adjusted normalized mean absolute error (NMAE) is used:
$$
NMAE = \frac{\sum_{i=1}^{N} |y_i - f(X_i, \theta)|}{N \cdot (1+ \bar{y})}
$$
$$
\theta^* = \arg\min_{\theta} \; \mathcal{L}(\theta)
$$
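The metrics above can be written out directly. The sketch below assumes the usual convention $\text{NNSE} = 1/(2-\text{NSE})$, so that the quantity being minimized is $1 - 1/(2-\text{NSE})$. The function names are hypothetical and are not the SINDBAD metric API, and the sketch ignores missing values, which real observation series typically contain.

```julia
# Plain arithmetic mean, defined locally to keep the example dependency-free.
avg(v) = sum(v) / length(v)

# Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 is as good as predicting the mean.
nse(y, yhat) = 1 - sum(abs2, y .- yhat) / sum(abs2, y .- avg(y))

# Normalized NSE maps (-Inf, 1] onto (0, 1].
nnse(y, yhat) = 1 / (2 - nse(y, yhat))

# Loss to minimize: 0 for a perfect fit, approaching 1 for very poor fits.
nnse_loss(y, yhat) = 1 - nnse(y, yhat)

# Adjusted normalized mean absolute error, with (1 + mean) in the denominator.
nmae(y, yhat) = sum(abs, y .- yhat) / (length(y) * (1 + avg(y)))

y            = [1.0, 2.0, 3.0, 4.0]
yhat_perfect = copy(y)
yhat_mean    = fill(avg(y), length(y))

loss_perfect = nnse_loss(y, yhat_perfect)
loss_mean    = nnse_loss(y, yhat_mean)
nmae_mean    = nmae(y, yhat_mean)
```

Predicting the observed mean everywhere gives $\text{NSE} = 0$ and therefore a loss of one half, which is a handy reference point when reading optimization logs.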

---
## Setting up SINDBAD-Tutorials
Navigate to the [SINDBAD-Tutorials for AI4PEX repository](GitHubLink) and follow the installation instructions there. For us, [VS Code](https://code.visualstudio.com/) has been a very fluid host for [Julia](https://julialang.org/) development.

### Get the data for these SINDBAD tutorials
The data can be found [here](https://nextcloud.bgc-jena.mpg.de/s/w2mbH59W4nF3Tcd). We suggest storing it in a child folder of SINDBAD-Tutorials (e.g. SINDBAD-Tutorials/data/).

## Let's go...
### Get packages and goodies to go.

In [None]:
import Pkg
Pkg.activate(".")
Pkg.instantiate()

In [None]:
# ================================== using tools ==================================================
# some of the things we will be using... Julia tools, SINDBAD tools, local code...
using Revise
using SindbadTutorials
using SindbadTutorials.Dates
using SindbadTutorials.Plots
using SindbadTutorials.SindbadVisuals
toggleStackTraceNT()
include("tutorial_helpers.jl")

### Get the data and paths to data setup

In [None]:
# ================================== get data / set paths ========================================= 
# data to be used can be found here: https://nextcloud.bgc-jena.mpg.de/s/w2mbH59W4nF3Tcd
# organizing the paths of data sources and outputs for this experiment
path_input_dir      = getSindbadDataDepot(; env_data_depot_var="SINDBAD_DATA_DEPOT", 
                    local_data_depot="/home/jovyan/data/ellis_jena_2025"); # local data folder; change this path to wherever the data was stored
path_input          = joinpath("$(path_input_dir)","FLUXNET_v2023_12_1D_REPLACED_Noise003_v1.zarr"); # zarr data source containing all the data necessary for the exercise
path_observation    = path_input; # observations (synthetic or otherwise) are included in the same file
path_output         = "";


### That Zarr file contains many eddy covariance sites - synthetic data. Let's select one to invert.

In [None]:
# ================================== selecting a site =============================================
# there is a collection of several sites in the data files site info; #68 is DE-Hai
site_index      = 68;
domain, y_dist  = getSiteInfo(site_index);

### Now, setting up the experiment.
Here, the full experiment is set up via JSON configuration files for the forcing, the model structure and the optimization approach. These files also define the simulation domain and temporal range, the temporal resolution, the simulation data types, precision and parallelization (SINDBAD internals), and the spin-up, output contents and output data structure.
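The idea of overriding individual JSON settings through dotted keys can be illustrated with plain nested dictionaries. This is only a conceptual sketch and is not how SINDBAD implements `replace_info`: the helper `set_nested!` and the example settings are hypothetical.

```julia
# Overwrite one entry of a nested Dict, given a dotted key such as "a.b.c".
function set_nested!(settings::Dict{String,Any}, dotted_key::AbstractString, value)
    key_path = split(dotted_key, ".")
    node = settings
    for k in key_path[1:end-1]
        # descend one level, creating intermediate levels when they are missing
        node = get!(node, String(k), Dict{String,Any}())
    end
    node[String(key_path[end])] = value
    return settings
end

# Hypothetical default settings, as they might look after parsing a JSON file.
settings = Dict{String,Any}(
    "experiment" => Dict{String,Any}(
        "basics" => Dict{String,Any}(
            "name" => "default",
            "time" => Dict{String,Any}("date_begin" => "2000-01-01"),
        ),
    ),
)

# Overrides expressed as dotted keys, in the same spirit as the tutorial's replace_info.
overrides = Dict(
    "experiment.basics.name" => "WROASTED_inversion_CMAES",
    "experiment.basics.time.date_begin" => "1979-01-01",
)
for (k, v) in overrides
    set_nested!(settings, k, v)
end
```

Keeping the defaults in files and the overrides in a small dictionary makes it easy to rerun the same experiment for another site or period without editing the JSON.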

In [None]:
# ================================== setting up the experiment ====================================
# experiment is all set up according to a (collection of) json file(s)
experiment_json     = joinpath(@__DIR__,"settings_WROASTED_HB","experiment_insitu.json");
experiment_name     = "WROASTED_inversion_CMAES";
begin_year          = 1979;
end_year            = 2017;
run_optimization    = true;
isfile(experiment_json) ? nothing : println("Hmmm... does not exist : $(experiment_json)");

# setting up the model spinup sequence : can change according to the site...
spinup_sequence = getSpinupSequenceSite(y_dist, begin_year);

# default setting in experiment_json will be replaced by the "replace_info"
replace_info = Dict("experiment.basics.time.date_begin" => "$(begin_year)-01-01",
    "experiment.basics.domain" => domain,
    "experiment.basics.name" => experiment_name,
    "experiment.basics.time.date_end" => "$(end_year)-12-31",
    "experiment.flags.run_optimization" => run_optimization,
    "experiment.model_spinup.sequence" => spinup_sequence,
    "forcing.default_forcing.data_path" => path_input,
    "forcing.subset.site" => [site_index],
    "experiment.model_output.path" => path_output,
    "optimization.observations.default_observation.data_path" => path_observation,
    );



### A simple forward run

In [None]:
# ================================== forward run ================================================== 
# before running the optimization, check a forward run 
@time out_dflt  = runExperimentForward(experiment_json; replace_info=deepcopy(replace_info)); # full default model

## check the docs for output at: http://sindbad-mdi.org/pages/develop/hybrid_modeling.html and http://sindbad-mdi.org/pages/develop/sindbad_outputs.html

# access some of the internals to do some plots with the forward runs...
info            = getExperimentInfo(experiment_json; replace_info=deepcopy(replace_info)); # note that this will modify information from json with the replace_info
forcing         = getForcing(info); 
run_helpers     = prepTEM(forcing, info); # not needed now
observations    = getObservation(info, forcing.helpers);
obs_array       = [Array(_o) for _o in observations.data]; 
cost_options    = prepCostOptions(obs_array, info.optimization.cost_options);

# plot the default simulations
plotTimeSeriesWithObs(out_dflt,obs_array,cost_options);
println("Outputs of plotting will be here: " * info.output.dirs.figure);



### Inverting the parameters of WROASTED
$\mathcal{O}_{CMA-ES}$ is an expensive approach. For demonstration, `maxfeval` is set to 1500; this can be changed in the optimization algorithm setup file.

In [None]:
# ================================== optimization ================================================= 
# run the optimization according to the settings above... can take some time...
@time out_opti  = runExperimentOpti(experiment_json; replace_info=deepcopy(replace_info), log_level=:info);

# plot the results
plotTimeSeriesWithObs(out_opti);
plotTimeSeriesDebug(out_opti.info, out_opti.output.optimized, out_opti.output.default);
println("Outputs of plotting will be here: " * info.output.dirs.figure);



### Using a simple - and faster - LUE model
Full workflow in one go now...

In [None]:
# ================================== another model ================================================ 
# all of the above with another model...
# only spin up the moisture pools
spinup_sequence = getSpinupSequenceSite();

# just change the model setup and experiment name
experiment_json = joinpath(@__DIR__,"settings_LUE","experiment.json");
experiment_name = "LUE_inversion_CMAES";
replace_info    = Dict("experiment.basics.time.date_begin" => "$(begin_year)-01-01",
    "experiment.basics.domain" => domain,
    "experiment.basics.name" => experiment_name,
    "experiment.basics.time.date_end" => "$(end_year)-12-31",
    "experiment.flags.run_optimization" => run_optimization,
    "experiment.model_spinup.sequence" => spinup_sequence,
    "forcing.default_forcing.data_path" => path_input,
    "forcing.subset.site" => [site_index],
    "experiment.model_output.path" => path_output,
    "optimization.observations.default_observation.data_path" => path_observation,
    );

# #=
@time out_dflt_lue  = runExperimentForward(experiment_json; replace_info=deepcopy(replace_info)); # full default model
# access some of the internals to do some plots with the forward runs...
info            = getExperimentInfo(experiment_json; replace_info=deepcopy(replace_info)); # note that this will modify information from json with the replace_info
forcing         = getForcing(info); 
run_helpers     = prepTEM(forcing, info); # not needed now
observations    = getObservation(info, forcing.helpers);
obs_array       = [Array(_o) for _o in observations.data]; 
cost_options    = prepCostOptions(obs_array, info.optimization.cost_options);
# =#

# plot the default simulations
plotTimeSeriesWithObs(out_dflt_lue,obs_array,cost_options);
println("Outputs of plotting will be here: " * info.output.dirs.figure);

# run the optimization
@time out_lue_opti  = runExperimentOpti(experiment_json; replace_info=deepcopy(replace_info), log_level=:info);

# plot the results
plotTimeSeriesWithObs(out_lue_opti);
println("Outputs of plotting will be here: " * out_lue_opti.info.output.dirs.figure);



In [None]:
# ================================== time for discussion ========================================== 