# Multi-process models in pywatershed

In notebook `00_processes.ipynb`, we looked at how individual Process representations are designed and how they work. In this notebook we learn how to put multiple Processes together into composite models using the `Model` class.

The starting point for the development of `pywatershed` was the National Hydrologic Model (NHM, Regan et al., 2018) configuration of the Precipitation-Runoff Modeling System (PRMS, Regan et al., 2015). In this notebook, we'll first construct a full NHM configuration. The spatial domain we'll use will again be the Delaware River Basin. Once we construct the full NHM, we'll look at how we can also construct sub-models of the NHM.

## Prerequisites

In [None]:
# auto-format the code in this notebook
%load_ext jupyter_black

In [None]:
from copy import deepcopy
import pathlib as pl
from platform import processor
from pprint import pprint
from sys import platform
import yaml

import pydoc

import hvplot.xarray  # noqa
import numpy as np
import pywatershed as pws

# from tqdm.notebook import tqdm
import xarray as xr

from helpers import gis_files

gis_files.download()  # this downloads GIS files

In [None]:
domain_dir = pws.constants.__pywatershed_root__ / "data/drb_2yr"
nb_output_dir = pl.Path("./01_multi-process_models")

## An NHM multi-process model
The 8 conceptual `Process` classes that comprise the NHM are, in order:

In [None]:
nhm_processes = [
    pws.PRMSSolarGeometry,
    pws.PRMSAtmosphere,
    pws.PRMSCanopy,
    pws.PRMSSnow,
    pws.PRMSRunoff,
    pws.PRMSSoilzone,
    pws.PRMSGroundwater,
    pws.PRMSChannel,
]

We'll use this list of classes shortly to construct the NHM.

A multi-process model is assembled by the `Model` class. We can take a quick look at the first 22 lines of help on `Model`:

In [None]:
# this is equivalent to help() but we get the multiline string and just look at part of it
model_help = pydoc.render_doc(pws.Model, "Help on %s")
# the first 22 lines of help(pws.Model)
print("\n".join(model_help.splitlines()[0:22]))
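
The same truncation technique can be tried on any object. The following is a minimal sketch that uses the built-in `dict` as a stand-in for `pws.Model`, so that it runs without `pywatershed`. Passing `renderer=pydoc.plaintext` is an addition not used in the cell above; it only strips terminal bold markup from the text.

```python
import pydoc

# Render the help text as a single string instead of paging it.
# `dict` is only a stand-in here for pws.Model.
help_text = pydoc.render_doc(dict, "Help on %s", renderer=pydoc.plaintext)

# Keep the first 22 lines, as in the cell above.
first_lines = help_text.splitlines()[0:22]
print("\n".join(first_lines))
```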

The `help()` mentions that there are 2 distinct ways of instantiating a `Model`. In this notebook, we focus on the pywatershed-centric instantiation and leave the PRMS-legacy instantiation to the following notebook.

With the pywatershed-centric approach, the first argument is a "model dictionary" which does nearly all the work (the other arguments take their default values). The `help()` describes the model dictionary and provides examples. Please use it for reference and more details. Here we'll give an extended concrete example. The `help()` also describes how a `Model` can be instantiated from a model dictionary contained in a YAML file. First we'll build a model dictionary in memory, then we'll write it out as a YAML file and instantiate our model directly from that file.

## Model dictionary in memory

Because our (pre-existing) parameter files (which come with `pywatershed`) and our `Process` classes are consistently named, we can begin to build the model dictionary quickly.

In [None]:
model_dict = {}

for proc in nhm_processes:
    # this is the class name
    proc_name = proc.__name__
    # the processes can have arbitrary names in the model_dict and
    # an instance should not have capitalized name anyway (according to
    # python convention), so rename from the class name
    proc_rename = "prms_" + proc_name[4:].lower()
    # each process has a dictionary of information
    model_dict[proc_rename] = {}
    # alias to shorten lines below
    proc_dict = model_dict[proc_rename]
    # required key "class" specifies the class
    proc_dict["class"] = proc
    # the "parameters" key provides an instance of Parameters
    proc_param_file = domain_dir / f"parameters_{proc_name}.nc"
    proc_dict["parameters"] = pws.Parameters.from_netcdf(proc_param_file)
    # the "dis" key provides the name of the discretizations
    # which we'll supply shortly to the model dictionary
    if proc_rename == "prms_channel":
        proc_dict["dis"] = "dis_both"
    else:
        proc_dict["dis"] = "dis_hru"
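
The naming rule used in the loop above can be checked in isolation. This is a minimal sketch that uses plain strings in place of the `pws` classes (an assumption made so that it runs without `pywatershed` or the parameter files); it only reproduces the renaming and the choice of discretization name.

```python
# Class names as strings, standing in for pws.PRMSSolarGeometry, etc.
class_names = [
    "PRMSSolarGeometry",
    "PRMSAtmosphere",
    "PRMSCanopy",
    "PRMSSnow",
    "PRMSRunoff",
    "PRMSSoilzone",
    "PRMSGroundwater",
    "PRMSChannel",
]

# Drop the leading "PRMS" (4 characters), lowercase the rest, prefix "prms_".
process_names = ["prms_" + name[4:].lower() for name in class_names]

# Only the channel process uses the combined discretization.
dis_names = {
    name: "dis_both" if name == "prms_channel" else "dis_hru"
    for name in process_names
}

print(process_names)
print(dis_names["prms_channel"], dis_names["prms_snow"])
```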

Let's look at what we have so far in the `model_dict`.

In [None]:
pprint(model_dict, sort_dicts=False)

We have given a name to each process and then supplied the class, its parameters, and its discretization for the full set of processes. Now we'll need to add the discretizations to the model dictionary. They are added at the top level and correspond to the names the processes used. 

In [None]:
model_dict = model_dict | {
    "dis_hru": pws.Parameters.from_netcdf(domain_dir / "parameters_dis_hru.nc"),
    "dis_both": pws.Parameters.from_netcdf(domain_dir / "parameters_dis_both.nc"),
}
pprint(model_dict, sort_dicts=False)
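
Since each process refers to its discretization by name, it can be useful to confirm that every name is actually defined at the top level. The helper below, `undefined_discretizations`, is hypothetical (it is not part of `pywatershed`), and the toy dictionary uses strings in place of classes and `Parameters` objects.

```python
def undefined_discretizations(model_dict):
    """Map process name -> "dis" name for every "dis" that is not a top-level key."""
    return {
        name: entry["dis"]
        for name, entry in model_dict.items()
        if isinstance(entry, dict)
        and "dis" in entry
        and entry["dis"] not in model_dict
    }


# Toy model dictionary: "dis_both" has not been added yet.
toy_model_dict = {
    "prms_snow": {"class": "PRMSSnow", "dis": "dis_hru"},
    "prms_channel": {"class": "PRMSChannel", "dis": "dis_both"},
    "dis_hru": "parameters_dis_hru.nc",
}
missing_before = undefined_discretizations(toy_model_dict)
print(missing_before)

# After adding the missing discretization, nothing is reported.
toy_model_dict["dis_both"] = "parameters_dis_both.nc"
missing_after = undefined_discretizations(toy_model_dict)
print(missing_after)
```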

For the time being, `PRMSChannel` needs to know about both HRUs and segments, so `dis_both` is used. We plan to remove this requirement in the near future by implementing "exchanges" between processes into the model dictionary. Stay tuned.

You may have noticed that we are missing a `Control` object to provide time information to the processes. We'll create it and we'll also create a list of the order that the processes are executed.

Though we have input available to run 2 years of simulation, we'll restrict the model run to the first 6 months for demonstration purposes. (Feel free to increase this to the full 2 years available, if you like.)

In [None]:
control = pws.Control(
    start_time=np.datetime64("1979-01-01T00:00:00"),
    end_time=np.datetime64("1979-07-01T00:00:00"),
    time_step=np.timedelta64(24, "h"),
    options={
        "input_dir": domain_dir,
        "budget_type": None,
        "netcdf_output_dir": nb_output_dir / "nhm_memory",
        "init_vars_from_file": 0,
        "dprst_flag": True,
    },
)
model_order = ["prms_" + proc.__name__[4:].lower() for proc in nhm_processes]
model_dict = model_dict | {"control": control, "model_order": model_order}
pprint(model_dict, sort_dicts=False)

The `model_dict` now specifies a complete model built from multiple processes. The way these processes are connected can be figured out by the `Model` class, because each process fully describes itself (as we saw in the previous notebook). If we instantiate a model from this `model_dict`,

In [None]:
model_mem = pws.Model(model_dict)

we can examine how the `Processes` are all connected using the `ModelGraph` class. We'll bring in the default color scheme for NHM `Processes`.

In [None]:
palette = pws.analysis.utils.colorbrewer.nhm_process_colors(model_mem)
pws.analysis.utils.colorbrewer.jupyter_palette(palette)
show_params = not (platform == "darwin" and processor() == "arm")
try:
    pws.analysis.ModelGraph(
        model_mem,
        hide_variables=False,
        process_colors=palette,
        show_params=show_params,
    ).SVG(verbose=True, dpi=48)
except Exception:
    print("In some cases, dot fails on Mac ARM machines.")

Now we'll initialize NetCDF output and run the model.

In [None]:
%%time
model_mem.run(finalize=True)

We'll take a look at the outputs after we see how to implement this model using a YAML file on disk. 

## Model dictionary YAML file
In many situations it may be preferable to have the model dictionary encoded in a YAML file. Let's do that. Necessarily, the contents of the YAML file will look different from the in-memory model dictionary above.

First we'll need to write the control instance as a YAML file. To do that we need a serializable dictionary in Python.

In [None]:
run_dir = pl.Path(nb_output_dir / "nhm_yaml")
run_dir.mkdir(exist_ok=True)
control_dict = control.options | {
    "start_time": str(control.start_time),
    "end_time": str(control.end_time),
    "time_step": str(control.time_step)[0:2],
    "time_step_units": str(control.time_step)[3:4],
    "netcdf_output_dir": run_dir,
}

pprint(control_dict, sort_dicts=False)
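
The fixed-position slices in the cell above depend on the time step having exactly two digits. As a hedged alternative, the sketch below splits on whitespace instead. It assumes the string form is `"24 hours"`, as the slicing above implies; `split_time_step` is a hypothetical helper, not a `pywatershed` function.

```python
def split_time_step(text):
    """Split a string such as "24 hours" into ("24", "h")."""
    value, unit = text.split()
    return value, unit[0]


print(split_time_step("24 hours"))
print(split_time_step("6 hours"))

# For comparison: fixed-position slicing applied to a one-digit time step
# picks up a trailing space and the wrong unit character.
sliced = ("6 hours"[0:2], "6 hours"[3:4])
print(sliced)
```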

We add the option `netcdf_output_dir` to the control since we assume we won't be able to do so at run time. Note that this option and the `input_dir` option are `pathlib.Path` objects. These are not what we want to write to file; we want their string versions. We could call `str()` on each one by hand, but it is handier to write a small recursive function that does this for a supplied dictionary, since we will need to do the same for the model dictionary we create after the control YAML file.

In [None]:
def dict_pl_to_str(the_dict):
    for key, val in the_dict.items():
        if isinstance(val, dict):
            the_dict[key] = dict_pl_to_str(val)
        elif isinstance(val, pl.Path):
            the_dict[key] = str(val)

    return the_dict


control_dict = dict_pl_to_str(control_dict)
pprint(control_dict, sort_dicts=False)
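
Note that `dict_pl_to_str` modifies the dictionary it is given and does not look inside lists. Where that matters, a non-mutating variant is one option. The sketch below is illustrative only; `paths_to_str` is a hypothetical name, and `PurePosixPath` is used so the printed strings are the same on every platform.

```python
import pathlib as pl


def paths_to_str(obj):
    """Return a copy of obj with every pathlib path replaced by its string."""
    if isinstance(obj, dict):
        return {key: paths_to_str(val) for key, val in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Tuples become lists, which serialize to plain YAML sequences.
        return [paths_to_str(val) for val in obj]
    if isinstance(obj, pl.PurePath):  # pl.Path is a subclass of PurePath
        return str(obj)
    return obj


example = {
    "input_dir": pl.PurePosixPath("data/drb_2yr"),
    "nested": {"files": [pl.PurePosixPath("a.nc"), pl.PurePosixPath("b.nc")]},
    "budget_type": None,
}
converted = paths_to_str(example)
print(converted)
```

The original `example` dictionary still holds path objects afterwards, so it can be reused for other purposes.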

Now we'll create the model dictionary. For the `control` field, we'll need the path to the YAML file to which we will write the information above. For discretization fields, we'll pass paths to NetCDF files instead of instantiated `Parameters` objects. For `model_order` we can supply the same list we used above.

In [None]:
control_yaml_file = run_dir / "control.yml"
model_dict = {
    "control": control_yaml_file.resolve(),
    "dis_hru": domain_dir / "parameters_dis_hru.nc",
    "dis_both": domain_dir / "parameters_dis_both.nc",
    "model_order": model_order,
}

Now, for each process, we'll use an arbitrary name. Then we'll supply the class name (which can be imported from `pws` at the top level), the path to its NetCDF parameter file, and the name of its required discretization.

In [None]:
for proc in nhm_processes:
    proc_name = proc.__name__
    proc_rename = "prms_" + proc_name[4:].lower()
    model_dict[proc_rename] = {}
    proc_dict = model_dict[proc_rename]
    proc_dict["class"] = proc_name
    proc_param_file = domain_dir / f"parameters_{proc_name}.nc"
    proc_dict["parameters"] = proc_param_file
    if proc_rename == "prms_channel":
        proc_dict["dis"] = "dis_both"
    else:
        proc_dict["dis"] = "dis_hru"

Before looking at it, we'll convert `Path` objects to strings:

In [None]:
model_dict = dict_pl_to_str(model_dict)
pprint(model_dict, sort_dicts=False)

A note on paths in the YAML file: because we are using files in two different locations that are not easily described relative to the location of the YAML file, we are using absolute paths. However, one can also describe all paths relative to the location of the YAML file if that better suits your purposes.
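
To illustrate what a relative path would look like, the sketch below computes one and resolves it back. The directory names are hypothetical, and `posixpath` is used so the result is the same on any platform. How relative paths are interpreted when a YAML file is loaded is described in the `Model` help and is not reproduced here.

```python
import posixpath

# Hypothetical absolute locations of the YAML directory and a parameter file.
yaml_dir = "/work/01_multi-process_models/nhm_yaml"
param_file = "/work/data/drb_2yr/parameters_PRMSSnow.nc"

# Path to the parameter file as seen from the YAML file's directory.
rel_path = posixpath.relpath(param_file, start=yaml_dir)
print(rel_path)

# Joining the relative path to the YAML directory recovers the original file.
resolved = posixpath.normpath(posixpath.join(yaml_dir, rel_path))
print(resolved)
```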

Finally, we have the control and model dictionaries ready to write to YAML.

In [None]:
model_dict_yaml_file = run_dir / "model_dict.yml"
# the control yaml file was given above and is in the model_dict
dump_dict = {control_yaml_file: control_dict, model_dict_yaml_file: model_dict}
for key, val in dump_dict.items():
    with open(key, "w") as file:
        documents = yaml.dump(val, file)

We'll use a little notebook magic to examine the written YAML files directly.

In [None]:
! cat 01_multi-process_models/nhm_yaml/control.yml

In [None]:
! cat 01_multi-process_models/nhm_yaml/model_dict.yml

Now we can create a `Model` from these:

In [None]:
model_yml = pws.Model.from_yml(model_dict_yaml_file)
model_yml

In [None]:
show_params = not (platform == "darwin" and processor() == "arm")
try:
    pws.analysis.ModelGraph(
        model_yml,
        hide_variables=False,
        process_colors=palette,
        show_params=show_params,
    ).SVG(verbose=True, dpi=48)
except Exception:
    print("In some cases, dot fails on Mac ARM machines.")

That looks identical to the `model_mem` model constructed previously. Let's run the model.

In [None]:
%%time
model_yml.run()
model_yml.finalize()

## Compare the outputs of the two models
Let's verify that the two approaches constructed and ran the same model.

In [None]:
mem_out_dir = nb_output_dir / "nhm_memory"
yml_out_dir = nb_output_dir / "nhm_yaml"
mem_files = sorted(mem_out_dir.glob("*.nc"))
yml_files = sorted(yml_out_dir.glob("*.nc"))
# We get all the same output files
assert set([ff.name for ff in mem_files]) == set([ff.name for ff in yml_files])
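
When the assertion above fails, it does not say which files are missing. A small helper can report the differences instead. This is a sketch using empty placeholder files in a temporary directory; `compare_file_names` is a hypothetical helper and the file names are only examples.

```python
import pathlib as pl
import tempfile


def compare_file_names(dir_a, dir_b, pattern="*.nc"):
    """Return (names only in dir_a, names only in dir_b), each sorted."""
    names_a = {ff.name for ff in pl.Path(dir_a).glob(pattern)}
    names_b = {ff.name for ff in pl.Path(dir_b).glob(pattern)}
    return sorted(names_a - names_b), sorted(names_b - names_a)


with tempfile.TemporaryDirectory() as tmp:
    dir_a = pl.Path(tmp) / "nhm_memory"
    dir_b = pl.Path(tmp) / "nhm_yaml"
    contents = [
        (dir_a, ["recharge.nc", "seg_outflow.nc"]),
        (dir_b, ["recharge.nc"]),
    ]
    # Create empty placeholder files; no NetCDF content is needed here.
    for directory, names in contents:
        directory.mkdir()
        for name in names:
            (directory / name).touch()
    only_a, only_b = compare_file_names(dir_a, dir_b)

print("only in first:", only_a)
print("only in second:", only_b)
```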

Now compare the values of all variables:

In [None]:
for mf, yf in zip(mem_files, yml_files):
    var = mf.with_suffix("").name
    # print(var)
    mda = xr.open_dataset(mf)[var]
    yda = xr.open_dataset(yf)[var]
    xr.testing.assert_equal(mda, yda)
    mda.close()
    yda.close()

We'll plot the last variable in the loop, `unused_potet`:

In [None]:
mda.hvplot(groupby="nhm_id")

In [None]:
proc_plot = pws.analysis.process_plot.ProcessPlot(gis_files.gis_dir / "drb_2yr")
proc_plot.plot_hru_var(
    var_name=var,
    process=pws.PRMSAtmosphere,
    data=mda.mean(dim="time"),
    data_units=mda.attrs["units"],
    nhm_id=mda["nhm_id"],
)

## Reduce model output to disk
It's worth noting that quite a lot of output was written. In many cases the amount of output can be reduced, which also reduces model run time. Let's show how easily that can be done.

Because we want to reuse the above control dict and model dict for the submodel demonstration in the next section, we'll deep copy them for this purpose.

In [None]:
control_dict_copy = deepcopy(control_dict)
model_dict_copy = deepcopy(model_dict)

In [None]:
run_dir = pl.Path(nb_output_dir / "yml_less_output").resolve()
run_dir.mkdir(exist_ok=True)

control_dict_copy["netcdf_output_dir"] = str(run_dir.resolve())
control_yaml_file = run_dir / "control.yml"
control_dict_copy["netcdf_output_var_names"] = [
    var
    for ll in [
        pws.PRMSGroundwater.get_variables(),
        pws.PRMSChannel.get_variables(),
    ]
    for var in ll
]
pprint(control_dict_copy, sort_dicts=False)

Now we point the copied model dictionary at the new control file. All processes are kept; only the output written to disk changes.

In [None]:
model_dict_copy["control"] = str(control_yaml_file)
model_dict_yaml_file = run_dir / "model_dict.yml"

Now we write both the control and model dictionary to yaml files.

In [None]:
dump_dict = {
    control_yaml_file: control_dict_copy,
    model_dict_yaml_file: model_dict_copy,
}
for key, val in dump_dict.items():
    with open(key, "w") as file:
        documents = yaml.dump(val, file)

And finally we instantiate the model from the model dictionary YAML file. (It is called `submodel` in the code below, although it still contains all processes.)

In [None]:
submodel = pws.Model.from_yml(model_dict_yaml_file)
submodel

In [None]:
%%time
submodel.run(finalize=True)

Reducing the output significantly reduced the run time, in this case (on my machine) from 25s to 15s, that is, to about 60% of the original time.
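
The arithmetic behind the timing figures quoted above, written out as a short sketch (the two timings are the ones reported in the text, not new measurements):

```python
# Wall-clock times reported above, in seconds.
full_output_seconds = 25.0
reduced_output_seconds = 15.0

# Run time with reduced output as a fraction of the original run time.
fraction_of_original = reduced_output_seconds / full_output_seconds

# Equivalent statement as a reduction relative to the original run time.
reduction = 1.0 - fraction_of_original

summary = (
    f"{fraction_of_original:.0%} of the original run time, "
    f"a {reduction:.0%} reduction"
)
print(summary)
```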

## NHM Submodel for the Delaware River Basin 
In many cases, running the full NHM model may not be necessary and it may be advantageous to just run some of the processes in it. Suppose you wanted to change parameters or model process representation in the PRMSSoilzone to better predict streamflow. As the model is 1-way coupled, you can simply run a submodel starting with PRMSSoilzone and running through PRMSChannel.

In [None]:
submodel_processes = [pws.PRMSSoilzone, pws.PRMSGroundwater, pws.PRMSChannel]

This prompts the question: what inputs/forcing data do we need for this submodel? We can ask each individual process for its inputs.

In [None]:
submodel_input_dict = {pp.__name__: pp.get_inputs() for pp in submodel_processes}
pprint(submodel_input_dict)

And which inputs are supplied by variables within this submodel? We ask each process for its variables.

In [None]:
submodel_vars_dict = {pp.__name__: pp.get_variables() for pp in submodel_processes}
pprint(submodel_vars_dict)

We consolidate inputs and variables (each over all processes) and take a set difference of inputs and variables to know what inputs/forcings we need from file. 

In [None]:
submodel_inputs = set([ii for tt in submodel_input_dict.values() for ii in tt])
submodel_variables = set([ii for tt in submodel_vars_dict.values() for ii in tt])
submodel_file_inputs = tuple(submodel_inputs - submodel_variables)
pprint(submodel_file_inputs)
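
The set logic is easier to follow on a toy example. In the sketch below, the process and variable names are invented for illustration and are not actual `pywatershed` names. Inputs that no process produces must come from file, while the rest are supplied internally.

```python
# Invented inputs and variables for three toy processes run in this order.
toy_inputs = {
    "ProcA": ("forcing_1", "forcing_2"),
    "ProcB": ("var_a1", "forcing_3"),
    "ProcC": ("var_a2", "var_b1"),
}
toy_variables = {
    "ProcA": ("var_a1", "var_a2"),
    "ProcB": ("var_b1",),
    "ProcC": ("var_c1",),
}

# Consolidate over all processes.
all_inputs = {name for names in toy_inputs.values() for name in names}
all_variables = {name for names in toy_variables.values() for name in names}

# Inputs not produced by any process must be read from file.
file_inputs = sorted(all_inputs - all_variables)
print(file_inputs)

# Inputs produced inside the model, with the process that supplies each one.
supplier = {var: proc for proc, names in toy_variables.items() for var in names}
internal_inputs = {name: supplier[name] for name in sorted(all_inputs & all_variables)}
print(internal_inputs)
```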

And where will we get these input files? If you pay close attention, you'll notice that these files do not come with the repository. Instead, they were generated when we ran the full NHM model above.

In [None]:
yml_output_dir = pl.Path(control_dict["netcdf_output_dir"])
for ii in submodel_file_inputs:
    input_file = yml_output_dir / f"{ii}.nc"
    assert input_file.exists()
    print(input_file)

Well, that was a lot of work. But, as alluded to above, the `Model` object does all of this so you don't have to. Along the way, you learned how the design enables the flow of information between processes and how individual processes in `pywatershed` can be queried. We can now instantiate the submodel and plot its wiring, just as we plotted the `ModelGraph` of the full model. We'll create the submodel in a new `run_dir` and use outputs from the full model above as its inputs.

In [None]:
run_dir = pl.Path(nb_output_dir / "nhm_sub").resolve()
run_dir.mkdir(exist_ok=True)

# key point: the inputs already exist from the previous full-model run
control_dict["input_dir"] = str(yml_output_dir.resolve())
control_dict["netcdf_output_dir"] = str(run_dir.resolve())
control_yaml_file = run_dir / "control.yml"

Now we will use the existing `model_dict` in memory, tailoring it to the above and keeping only the processes of interest in the submodel.

In [None]:
model_dict["control"] = str(control_yaml_file)
model_dict_yaml_file = run_dir / "model_dict.yml"
keep_procs = ["prms_soilzone", "prms_groundwater", "prms_channel"]
model_dict["model_order"] = keep_procs
for kk in list(model_dict.keys()):
    if isinstance(model_dict[kk], dict) and kk not in keep_procs:
        del model_dict[kk]
pprint(control_dict, sort_dicts=False)
pprint(model_dict, sort_dicts=False)
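
The cell above edits `model_dict` in place. If the full dictionary is still needed afterwards, the same pruning can be written so that it returns a new dictionary. This is a sketch on a toy dictionary; `prune_model_dict` is a hypothetical helper, and like the cell above it assumes that process entries are the only dictionary-valued entries.

```python
def prune_model_dict(model_dict, keep_procs):
    """Return a new model dictionary keeping only the named processes."""
    pruned = {
        key: val
        for key, val in model_dict.items()
        if not isinstance(val, dict) or key in keep_procs
    }
    # Preserve the original execution order for the processes that remain.
    pruned["model_order"] = [
        proc for proc in model_dict["model_order"] if proc in keep_procs
    ]
    return pruned


toy_model_dict = {
    "control": "control.yml",
    "dis_hru": "parameters_dis_hru.nc",
    "model_order": ["prms_runoff", "prms_soilzone", "prms_groundwater"],
    "prms_runoff": {"class": "PRMSRunoff", "dis": "dis_hru"},
    "prms_soilzone": {"class": "PRMSSoilzone", "dis": "dis_hru"},
    "prms_groundwater": {"class": "PRMSGroundwater", "dis": "dis_hru"},
}
toy_submodel_dict = prune_model_dict(
    toy_model_dict, ["prms_soilzone", "prms_groundwater"]
)
print(list(toy_submodel_dict))
print(toy_submodel_dict["model_order"])
```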

Now we write both the control and model dictionary to yaml files.

In [None]:
dump_dict = {control_yaml_file: control_dict, model_dict_yaml_file: model_dict}
for key, val in dump_dict.items():
    with open(key, "w") as file:
        documents = yaml.dump(val, file)

And finally we instantiate the submodel from the model dictionary yaml file. 

In [None]:
submodel = pws.Model.from_yml(model_dict_yaml_file)
submodel

Now to look at the `ModelGraph` for the submodel.

In [None]:
show_params = not (platform == "darwin" and processor() == "arm")
try:
    pws.analysis.ModelGraph(
        submodel,
        hide_variables=False,
        process_colors=palette,
        show_params=show_params,
    ).SVG(verbose=True, dpi=48)
except Exception:
    print("In some cases, dot fails on Mac ARM machines.")

Note that the required inputs to the submodel are quite different: they rely on files that were already output by the full model.

Now we can initialize output and run the submodel.

In [None]:
%%time
submodel.run(finalize=True)

Well, that saved us some time. The run is similar to before, just using fewer processes.

The final time step is still in memory. We can take a look at, say, recharge. Before plotting, let's take a closer look at the data and the metadata for recharge.

In [None]:
pprint(pws.meta.find_variables("recharge"))
print(
    "PRMSSoilzone dimension names: ",
    submodel.processes["prms_soilzone"].dimensions,
)
print("nhru: ", submodel.processes["prms_soilzone"].nhru)
print(
    "PRMSSoilzone recharge shape: ",
    submodel.processes["prms_soilzone"]["recharge"].shape,
)
print(
    "PRMSSoilzone recharge type: ",
    type(submodel.processes["prms_soilzone"]["recharge"]),
)
print(
    "PRMSSoilzone recharge dtype: ",
    submodel.processes["prms_soilzone"]["recharge"].dtype,
)

First we access the metadata on `recharge` and see its description, dimension, type, and units. Then we look at the dimension names of the PRMSSoilzone process in which it is found. We see the length of the `nhru` dimension and that this is the only dimension on `recharge`. We also see that `recharge` is a `numpy.ndarray` with data type `float64`.

So recharge only has spatial dimension. It is written to file with each timestep (or periodically). However, the last timestep is still in memory (even though we've finalized the run) and we can visualize it. This time the data are on the unstructured/polygon grid of Hydrologic Response Units (HRUs) instead of the streamflow network plotted above.

In [None]:
proc_plot = pws.analysis.process_plot.ProcessPlot(gis_files.gis_dir / "drb_2yr")
proc_name = "prms_soilzone"
var_name = "ssr_to_gw"
proc = submodel.processes[proc_name]
display(proc_plot.plot(var_name, proc))

We can easily check the results of our submodel against our full model. This gives us an opportunity to look at the output files. We can start with recharge as our variable of interest. The model NetCDF output can be read in using `xarray`, where we can see all the relevant metadata quickly.

In [None]:
var = "recharge"
nhm_ds = xr.open_dataset(yml_output_dir / f"{var}.nc")
sub_ds = xr.open_dataset(run_dir / f"{var}.nc")

In [None]:
display(nhm_ds)
display(sub_ds)

If you expand the metadata (the little folded page icon on the right side of the recharge line), you get a variable description, dimension, type, and units. 

Now compare all output variables common to both runs, asserting that the two runs gave equal output.

In [None]:
for var in submodel_variables:
    nhm_da = xr.open_dataset(yml_output_dir / f"{var}.nc")[var]
    sub_da = xr.open_dataset(run_dir / f"{var}.nc")[var]
    xr.testing.assert_equal(nhm_da, sub_da)

In [None]:
# var_name = "dprst_seep_hru"
nhm_da = xr.open_dataset(yml_output_dir / f"{var_name}.nc")[var_name]
sub_da = xr.open_dataset(run_dir / f"{var_name}.nc")[var_name]
scat = xr.merge(
    [nhm_da.rename(f"{var_name}_yaml"), sub_da.rename(f"{var_name}_subset")]
)

display(
    scat.hvplot(x=f"{var_name}_yaml", y=f"{var_name}_subset", groupby="nhm_id").opts(
        data_aspect=1
    )
)

scat.hvplot(y=f"{var_name}_subset", groupby="nhm_id")

## References
* Regan, R. S., Markstrom, S. L., Hay, L. E., Viger, R. J., Norton, P. A., Driscoll, J. M., & LaFontaine, J. H. (2018). Description of the National Hydrologic Model for use with the Precipitation-Runoff Modeling System (PRMS) (No. 6-B9). US Geological Survey.
* Regan, R.S., Markstrom, S.L., LaFontaine, J.H., 2022, PRMS version 5.2.1: Precipitation-Runoff Modeling System (PRMS): U.S. Geological Survey Software Release, 02/10/2022.