# Building process models in pywatershed

*James McCreight, USGS/UCAR*

## Introduction 

It is a major scientific challenge to simulate and predict water quantity and quality in the environment. Over the years, many analytical, numerical, and statistical approaches have been developed to address this problem. The choice of modeling approaches often depends on specific real-world problems, resulting in a wide variation of models across different applications.

The United States Geological Survey (USGS), among others, has an important legacy of models for simulating water quantity and quality in the environment. The USGS is also building new models for emerging applications. To enhance their effectiveness, it is crucial to establish interoperability of existing and future models. The Enterprise Capacity (EC) project at the USGS aims to create flexible, reusable, and interoperable models that can support water quantity and quality modeling across a broad range of applications.

Focusing on water quantity, this notebook highlights on-going development to redesign core USGS hydrologic simulation capabilities into a modular and interoperable Python package. The goal of this package is to support flexible representations of conceptual hydrologic process models and the hypothesis testing of model suitability for a given application. 

The USGS National Hydrologic Model (NHM, Regan et al., 2018) is a specific, national-scale instance of physical process models within the Precipitation-Runoff Modeling System (PRMS, Regan et al., 2015). The PRMS concepts used in the NHM have been expressed in the Python package `pywatershed`. In this notebook, we demonstrate modeling of the Delaware River Basin (DRB) subdomain of the NHM for a 2-year period using `pywatershed`.

The goal of this notebook is to demonstrate the use of `pywatershed` for interested users, highlighting its modularity and its self-describing nature. The outline of this notebook is as follows:


* Results
    - *Requirements and prerequisites:* What is needed to get up and running. 
    - *NHM in pywatershed:* We run the full NHM from atmospheric forcings/inputs and show how the model is designed at a conceptual process level.
    - *Submodels in pywatershed:* We demonstrate modularity by isolating and executing a submodel of the full NHM model.
    - *Zooming in:* We delve into greater detail on how to query and control the model and its execution.
* Conclusions


## Results

### Requirements:
The conda/mamba environment specified in `env/pws-env.yml` is required to run this notebook. The `README.md` file describes setting up the environment, including using frozen versions of the environment that are also included in the `env/` directory.

We'll use `jupyter_black` to keep our minds off formatting our code.

In [None]:
%load_ext jupyter_black

Our Python imports for this notebook follow.

In [None]:
from sys import platform
import pathlib as pl
from platform import processor
from pprint import pprint
import urllib.request  # the request submodule must be imported explicitly
import zipfile

import hvplot.pandas  # noqa
import pywatershed as pws
import xarray as xr

We will call the directory that contains the input/output for this notebook `pkg_root_dir`:

In [None]:
pkg_root_dir = pws.constants.__pywatershed_root__

We'll grab the GIS files for our domain from an earlier GitHub release artifact (the repository was previously called `pynhm` when its scope was just replicating the NHM). This only executes if the files are not already local.

In [None]:
gis_dir = pkg_root_dir / "data/pywatershed_gis"
if not gis_dir.exists():
    gis_url = "https://github.com/EC-USGS/pywatershed/releases/download/v2022.0.1/pynhm_gis.zip"
    gis_file = pkg_root_dir / "pynhm_gis.zip"
    urllib.request.urlretrieve(gis_url, gis_file)

    with zipfile.ZipFile(gis_file, "r") as zz:
        zz.extractall(pkg_root_dir)

    (pkg_root_dir / "pynhm_gis").rename(gis_dir)

### Full NHM configuration for the Delaware River Basin

Now we set up the full NHM model (from atmospheric forcing files through stream channel flow) on the DRB subdomain of the NHM. We will see the details of this model shortly.

Specify where our domain and GIS files for the Delaware River Basin 2-year run are located:

In [None]:
domain_dir = pkg_root_dir / "data/drb_2yr"
domain_gis_dir = gis_dir / "drb_2yr"

The atmospheric forcing files for this model are in the `domain_dir`. The NHM input files also include its parameter and control files, which are likewise in the `domain_dir`. The files being loaded here are native NHM files, though these can be re-expressed in other formats. In `pywatershed` we use all parts of the parameter file needed to support the model. We do not make extensive use of the control file, but do use parts of it. The Control object/class manages not only the control data but also the parameter object.

In [None]:
params = pws.parameters.PrmsParameters.load(domain_dir / "myparam.param")
control = pws.Control.load(domain_dir / "control.test")
control.edit_n_time_steps(90)
control.config = control.config | {
    "input_dir": domain_dir,
    "budget_type": "warn",
    "calc_method": "numba",
}

Specify where we'll write model output files for the full model:

In [None]:
output_root_dir = pl.Path("./01_process_models_output/nhm")
nhm_output_dir = output_root_dir / "nhm"

With all the requisite IO above, we can initialize the model. 

In [None]:
nhm = pws.Model(
    [
        pws.PRMSSolarGeometry,
        pws.PRMSAtmosphere,
        pws.PRMSCanopy,
        pws.PRMSSnow,
        pws.PRMSRunoff,
        pws.PRMSSoilzone,
        pws.PRMSGroundwater,
        pws.PRMSChannel,
    ],
    control=control,
    parameters=params,
)

The individual processes listed, from `PRMSSolarGeometry` through `PRMSChannel`, are one-way coupled in the order shown from top to bottom. This model initialization clearly indicates the conceptualization of the working model at a high level. We will see these processes in more fine-grained detail shortly.
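To make the idea of one-way coupling concrete, here is a minimal, self-contained sketch that does not use `pywatershed`. The `ToyProcess` class, the `unmet_inputs` helper, and all process and variable names are hypothetical. The sketch checks that, for a given ordering, each process only consumes values that come from file or from a process earlier in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyProcess:
    """Hypothetical stand-in for a process: names of inputs and variables."""

    name: str
    inputs: tuple
    variables: tuple


def unmet_inputs(processes, file_inputs):
    """Return {process name: inputs not available from file or upstream}."""
    available = set(file_inputs)
    unmet = {}
    for proc in processes:  # top to bottom, as in the list passed to a model
        missing = [ii for ii in proc.inputs if ii not in available]
        if missing:
            unmet[proc.name] = missing
        # Variables become available only to processes further down the list.
        available.update(proc.variables)
    return unmet


# Toy names, not the real NHM variables.
canopy = ToyProcess("canopy", ("precip",), ("net_precip",))
soil = ToyProcess("soil", ("net_precip",), ("recharge",))
groundwater = ToyProcess("groundwater", ("recharge",), ("gw_flow",))

ok_order = unmet_inputs([canopy, soil, groundwater], file_inputs=["precip"])
bad_order = unmet_inputs([soil, canopy, groundwater], file_inputs=["precip"])
print(ok_order)   # {}
print(bad_order)  # {'soil': ['net_precip']}
```

In the second ordering the toy soil process runs before the process that produces its input, which is exactly what a one-way coupled ordering rules out.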

The control and directory to scan for input are passed. We ask for a warning if mass budgets do not balance. Finally, we request the `numba` calculation method. The `numba` Python package is a just-in-time compiler that can take the code written using the `numpy` package and accelerate it by compiling at run time. More on this below.
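The configuration update in the cell above uses the dictionary union operator (`|`, available from Python 3.9). As a small illustration with plain dictionaries and hypothetical values (we assume here, without having checked, that `control.config` behaves like a regular `dict` in this respect):

```python
# Toy defaults with hypothetical values, standing in for an existing config.
defaults = {"input_dir": None, "budget_type": "error", "verbosity": 0}
overrides = {
    "input_dir": "some/dir",
    "budget_type": "warn",
    "calc_method": "numba",
}

# Right-hand values win on shared keys; keys only on the right are added.
merged = defaults | overrides

print(merged)
print(defaults["budget_type"])  # the original dictionary is left unchanged
```

Because `|` returns a new dictionary, the result has to be assigned back, which is why the notebook writes `control.config = control.config | {...}`.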

Before running, we request NetCDF output to the desired output directory.

In [None]:
nhm.initialize_netcdf(output_dir=nhm_output_dir)

We'll time our run and ask the model to finalize at the end.

In [None]:
%%time
nhm.run(finalize=True)

Several things are happening, as can be seen in the output produced. When the model starts, we see a pause while the message "X jit compiling with numba" is printed. For each process listed, this is when the numba just-in-time (jit) compiler compiles the code for the process representation. The processes that benefit from jit compiling are challenging to vectorize and have a loop over space, which is accelerated by compiling. The remaining processes, which are not jit compiled, are all vectorized: PRMSSolarGeometry and PRMSAtmosphere. In fact, because these two processes are strictly pre-processing between input files and the rest of the model (all their input is known in advance), they are both vectorized in space and time. This is why all their variables are written out before the rest of the model runs.

A note on PRMSGroundwater is that it is both vectorized and jit-compiled. The numba, jit-compiled code is only slightly slower so it has not been removed.

Now that we've run our NHM model on the DRB, let's get a flavor for the "ultimate" variable simulated on the domain: streamflow. The streamflow values from the final timestep are still in memory, so we'll plot those on the stream network overlaid on the watershed.

In [None]:
proc_plot = pws.analysis.ProcessPlot(domain_gis_dir)
proc_name = "PRMSChannel"
var_name = "seg_outflow"
proc = nhm.processes[proc_name]
display(proc_plot.plot(var_name, proc))

Above we see the spatial domain, the outline of its extent, the network on which streamflow is calculated, and the final simulated values of streamflow. 

Let us turn to the model structure in more detail: how do atmospheric inputs/forcings result in the simulated streamflow above? We will produce the model graph, which shows the flow of information from the input files through all the process representations, all the way down to the channel streamflow.

First we print a color legend for each represented process in the NHM. Each process is outlined by a box of this color, and values/fluxes flowing from a process have the color of the originating process. Each process's data is placed into one of three categories: inputs (blue), parameters (orange), and variables (green). Finally, a variable outlined (above and on the sides) participates in the mass budget of its process. This diagram gives very specific information about the model conceptualization, how the processes relate to each other, and the complexity of the individual processes.

(Note that the underlying graphviz/dot program that generates the plot is not fully working on Mac ARM/M1, so plots here and below are less detailed if you are using such a machine; the notebooks in the repo will be complete for your reference.)

All of this information is public for each process (indeed, in static methods), so we can produce these plots programmatically without needing to run the Model. The Model object contains all the information needed to generate the plot when it is initialized.
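The point that a graph can be built from static metadata alone can be illustrated with a small sketch that is independent of `pywatershed`. The `toy_meta` dictionary and the `to_dot` helper are hypothetical; the helper emits Graphviz DOT text with one edge per input, from the producing process (or from `file`) to the consuming process.

```python
def to_dot(meta):
    """Build DOT text from {process: {"inputs": [...], "variables": [...]}}."""
    # Map each variable name to the process that produces it.
    producer = {
        var: proc for proc, info in meta.items() for var in info["variables"]
    }
    lines = ["digraph toy_model {"]
    for proc, info in meta.items():
        for inp in info["inputs"]:
            # Inputs that no process produces must come from a file.
            source = producer.get(inp, "file")
            lines.append(f'    "{source}" -> "{proc}" [label="{inp}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical metadata; no model is run to obtain it.
toy_meta = {
    "soil": {"inputs": ["net_precip"], "variables": ["recharge"]},
    "groundwater": {"inputs": ["recharge"], "variables": ["gw_flow"]},
}
dot_text = to_dot(toy_meta)
print(dot_text)
```

The resulting text could be rendered with the `dot` program; nothing in the sketch needs any simulated values.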

In [None]:
palette = pws.analysis.utils.colorbrewer.nhm_process_colors(nhm)
pws.analysis.utils.colorbrewer.jupyter_palette(palette)
show_params = not (platform == "darwin" and processor() == "arm")
try:
    pws.analysis.ModelGraph(
        nhm,
        hide_variables=False,
        process_colors=palette,
        show_params=show_params,
    ).SVG(verbose=True, dpi=55)
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print("In some cases, dot fails on Mac ARM machines.")

### NHM Submodel for the Delaware River Basin 
Suppose you wanted to change parameters or the model process representation in PRMSSoilzone to better predict streamflow. As the model is one-way coupled, you can simply run a submodel starting with PRMSSoilzone and running through PRMSChannel.

In [None]:
submodel_processes = [pws.PRMSSoilzone, pws.PRMSGroundwater, pws.PRMSChannel]

This prompts the question: what inputs/forcing data do we need for this submodel? We can ask each individual process for its inputs.

In [None]:
submodel_input_dict = {
    pp.__name__: pp.get_inputs() for pp in submodel_processes
}
pprint(submodel_input_dict)

And which inputs are supplied by variables within this submodel? We ask each process for its variables.

In [None]:
submodel_vars_dict = {
    pp.__name__: pp.get_variables() for pp in submodel_processes
}
pprint(submodel_vars_dict)

We consolidate inputs and variables (each over all processes) and take a set difference of inputs and variables to know what inputs/forcings we need from file. 

In [None]:
submodel_inputs = set([ii for tt in submodel_input_dict.values() for ii in tt])
submodel_variables = set(
    [ii for tt in submodel_vars_dict.values() for ii in tt]
)
submodel_file_inputs = tuple(submodel_inputs - submodel_variables)
pprint(submodel_file_inputs)

And where will we get these input files? If you pay close attention, you'll notice that these files do not come with the repository. Instead, they were generated when we ran the full NHM model above.
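An existence check of this kind can also report every missing file at once rather than stopping at the first one. The following is a minimal sketch run on a temporary directory; `missing_input_files` and the variable names `var_a` and `var_b` are hypothetical, and only the `<name>.nc` file naming is taken from this notebook.

```python
import pathlib as pl
import tempfile


def missing_input_files(input_dir, var_names):
    """Return the names in var_names with no '<name>.nc' file in input_dir."""
    input_dir = pl.Path(input_dir)
    return sorted(
        vv for vv in var_names if not (input_dir / f"{vv}.nc").exists()
    )


# Demonstrate on a throwaway directory with hypothetical variable names.
with tempfile.TemporaryDirectory() as tmp:
    (pl.Path(tmp) / "var_a.nc").touch()  # pretend a previous run wrote this
    missing = missing_input_files(tmp, ["var_a", "var_b"])

print(missing)  # ['var_b']
```

An empty list means every required input is available from the earlier run.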

In [None]:
for ii in submodel_file_inputs:
    input_file = nhm_output_dir / f"{ii}.nc"
    assert input_file.exists()
    print(input_file)

Well, that was a lot of work. But, as alluded to above, the `Model` object does the above so you don't have to. You just learned something about how the flow of information between processes is enabled by how the classes are designed, and how to query the individual processes in `pywatershed`. But we could just instantiate the submodel and plot this wiring, as we plotted the ModelGraph of the full model.

In [None]:
# params = pws.parameters.PrmsParameters.load(domain_dir / "myparam.param")
control = pws.Control.load(domain_dir / "control.test")
control.edit_n_time_steps(90)
control.config = control.config | {
    "input_dir": nhm_output_dir,
    "budget_type": "warn",
    "calc_method": "numba",
}

In [None]:
submodel = pws.Model(
    submodel_processes,
    control=control,
    parameters=params,
)

pws.analysis.ModelGraph(
    submodel,
    hide_variables=not show_params,
    show_params=show_params,
    process_colors=palette,
).SVG(verbose=True, dpi=60)

Now we can initialize output and run the submodel.

In [None]:
submodel_output_dir = output_root_dir / "submodel"
submodel.initialize_netcdf(output_dir=submodel_output_dir)

In [None]:
%%time
submodel.run(finalize=True)

Well, that saved us some time. The run is similar to before, just using fewer processes.

The final timestep is still in memory. We can take a look at, say, recharge. Before plotting, let's take a closer look at the data and the metadata for recharge.

In [None]:
pprint(pws.meta.find_variables("recharge"))
print(
    "PRMSSoilzone dimension names: ",
    submodel.processes["PRMSSoilzone"].dimensions,
)
print("nhru: ", submodel.processes["PRMSSoilzone"].nhru)
print(
    "PRMSSoilzone recharge shape: ",
    submodel.processes["PRMSSoilzone"]["recharge"].shape,
)
print(
    "PRMSSoilzone recharge type: ",
    type(submodel.processes["PRMSSoilzone"]["recharge"]),
)
print(
    "PRMSSoilzone recharge dtype: ",
    submodel.processes["PRMSSoilzone"]["recharge"].dtype,
)

First we access the metadata on `recharge` and we see its description, dimension, type, and units. Then we look at the dimension names of the PRMSSoilzone process in which it is found. We see the length of the `nhru` dimension and that this is the only dimension on `recharge`. We also see that `recharge` is a `numpy.ndarray` with data type `float64`.

So recharge only has a spatial dimension. It is written to file with each timestep (or periodically). However, the last timestep is still in memory (even though we've finalized the run) and we can visualize it. This time the data are on the unstructured/polygon grid of Hydrologic Response Units (HRUs) instead of the streamflow network plotted above.

In [None]:
proc_plot = pws.analysis.process_plot.ProcessPlot(domain_gis_dir)
proc_name = "PRMSSoilzone"
var_name = "recharge"
proc = submodel.processes[proc_name]
display(proc_plot.plot(var_name, proc))

We can easily check the results of our submodel against our full model. This gives us an opportunity to look at the output files. We can start with recharge as our variable of interest. The model NetCDF output can be read in using `xarray`, where we can see all the relevant metadata quickly.

In [None]:
var = "recharge"
nhm_ds = xr.open_dataset(nhm_output_dir / f"{var}.nc")
sub_ds = xr.open_dataset(submodel_output_dir / f"{var}.nc")

In [None]:
display(nhm_ds)
display(sub_ds)

If you expand the metadata (the little folded page icon on the right side of the recharge line), you get a variable description, dimension, type, and units. 

Now compare all output variables common to both runs, asserting that the two runs gave equal output.
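A comparison of floating-point output needs a stated tolerance. Assuming `xarray.testing.assert_allclose` follows the NumPy convention `|a - b| <= atol + rtol * |b|` (our understanding, not verified here against the xarray source), the criterion can be sketched in pure Python. The `close_enough` helper and the sample values are hypothetical.

```python
def close_enough(a, b, rtol=1.0e-9, atol=1.0e-15):
    """Check |x - y| <= atol + rtol * |y| for every pair in two sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


reference = [1.0, 250.0, 0.0]
tiny_diff = [1.0 + 1.0e-12, 250.0, 0.0]  # relative difference near 1e-12
large_diff = [1.0 + 1.0e-6, 250.0, 0.0]  # relative difference near 1e-6

tiny_ok = close_enough(tiny_diff, reference)
large_ok = close_enough(large_diff, reference)
print(tiny_ok)   # True
print(large_ok)  # False
```

With a relative tolerance of `1.0e-9`, differences at the level of floating-point round-off pass, while a change in the sixth significant digit does not.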

In [None]:
for var in submodel_variables:
    nhm_da = xr.open_dataset(nhm_output_dir / f"{var}.nc")[var]
    sub_da = xr.open_dataset(submodel_output_dir / f"{var}.nc")[var]
    xr.testing.assert_allclose(nhm_da, sub_da, rtol=1.0e-9, atol=1.0e-15)

### Zooming in, in pywatershed

For illustrative purposes, say you are interested in some of the details of the groundwater representation in the above submodel. For a start, you can ask that class for its description. It returns a dictionary with the top-level keys: `class_name`, `inputs`, `mass_budget_terms`, `parameters`, and `variables`. For `inputs`, `parameters`, and `variables`, the corresponding value is a dictionary of metadata about the data in each of those categories. For `mass_budget_terms`, the data involved in the calculation of the budgets are summarized in terms of `inputs`, `outputs`, and `storage_changes`. It is easy to quickly get oriented to the model via this description.
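To show how such a nested description can be navigated, here is a sketch on a toy dictionary that has the top-level keys named above. Every entry in `toy_description` is hypothetical and is not `pywatershed` output; in particular, we assume for illustration that the budget terms are lists of names. The `summarize` helper is likewise hypothetical.

```python
# Hypothetical description with the top-level keys named in the text.
toy_description = {
    "class_name": "ToyGroundwater",
    "inputs": {"inflow": {"units": "inches", "dims": ("nhru",)}},
    "parameters": {"flow_coef": {"units": "fraction/day", "dims": ("nhru",)}},
    "variables": {
        "storage": {"units": "inches", "dims": ("nhru",)},
        "outflow": {"units": "inches", "dims": ("nhru",)},
    },
    "mass_budget_terms": {
        "inputs": ["inflow"],
        "outputs": ["outflow"],
        "storage_changes": ["storage_change"],
    },
}


def summarize(description):
    """Return one line per category listing the names it contains."""
    lines = [description["class_name"]]
    for category in ("inputs", "parameters", "variables"):
        names = ", ".join(sorted(description[category]))
        lines.append(f"  {category}: {names}")
    for term, names in description["mass_budget_terms"].items():
        lines.append(f"  budget {term}: {', '.join(names)}")
    return lines


summary = summarize(toy_description)
print("\n".join(summary))
```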

In [None]:
pprint(pws.PRMSGroundwater.description())

Next, let's say we are interested in looking at certain variables in real time, as the model runs. We'll pull apart the time advance loop.
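As a mental model for what pulling the loop apart means, consider this toy sketch. `ToyReservoir` is a hypothetical linear reservoir, not a `pywatershed` class; only the method names `advance` and `calculate` mirror calls used in this notebook. Between `calculate()` and the next `advance()`, the current state can be inspected.

```python
class ToyReservoir:
    """Hypothetical linear reservoir; not a pywatershed class."""

    def __init__(self, inflows, storage=10.0, coef=0.1):
        self.inflows = inflows
        self.storage = storage
        self.coef = coef
        self.itime = -1  # no timestep taken yet

    def advance(self):
        """Move to the next timestep without computing anything."""
        self.itime += 1

    def calculate(self):
        """Update storage for the current timestep."""
        outflow = self.coef * self.storage
        self.storage += self.inflows[self.itime] - outflow


toy = ToyReservoir(inflows=[1.0, 0.0, 2.0])
storages = []
for _ in range(3):
    toy.advance()
    toy.calculate()
    # State for the current timestep is available for inspection here.
    storages.append(round(toy.storage, 4))

print(storages)  # [10.0, 9.0, 10.1]
```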

In [None]:
submodel_output_dir = pkg_root_dir / "model_runs/submodel"
params = pws.parameters.PrmsParameters.load(domain_dir / "myparam.param")
control = pws.Control.load(domain_dir / "control.test")
control.edit_n_time_steps(90)
control.config = control.config | {
    "input_dir": nhm_output_dir,
    "budget_type": "warn",
    "calc_method": "numba",
}
submodel = pws.Model(
    submodel_processes,
    control=control,
    parameters=params,
)

In [None]:
for tt in list(range(control.n_times))[0:3]:
    print("time step: ", tt)
    submodel.advance()
    print(
        "timestep start time: ", control.current_time
    )  # The daily timestep really means "at the end of the day"
    submodel.calculate()
    print(
        "mean gwres_stor: ",
        submodel.processes["PRMSGroundwater"].gwres_stor.mean(),
    )

Finally, if we want mass balance at the current timestep, we can just print a process's budget. The terms in blue in the ModelGraph are those that participate in the budget.
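The idea behind a mass budget check is that inputs minus outputs should equal the change in storage. The sketch below is a hypothetical helper, `check_budget`, with toy term names. It issues a warning on imbalance when `budget_type="warn"`, in the spirit of the configuration used in this notebook; the `"error"` option and the tolerance are our own assumptions for illustration.

```python
import warnings


def check_budget(inputs, outputs, storage_change, tol=1.0e-9, budget_type="warn"):
    """Return sum(inputs) - sum(outputs) - storage_change; flag if nonzero.

    Hypothetical helper: "warn" issues a warning, "error" raises.
    """
    imbalance = sum(inputs.values()) - sum(outputs.values()) - storage_change
    if abs(imbalance) > tol:
        msg = f"mass budget imbalance: {imbalance:.3e}"
        if budget_type == "error":
            raise ValueError(msg)
        if budget_type == "warn":
            warnings.warn(msg)
    return imbalance


# Toy terms: 1.5 in, 1.25 out, and storage rose by 0.25, so the budget closes.
balanced = check_budget(
    {"inflow": 1.5}, {"outflow": 1.0, "sink": 0.25}, storage_change=0.25
)

# Dropping the "sink" term leaves 0.25 unaccounted for and triggers a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unbalanced = check_budget(
        {"inflow": 1.5}, {"outflow": 1.0}, storage_change=0.25
    )

print(balanced, unbalanced, len(caught))  # 0.0 0.25 1
```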

In [None]:
submodel.processes["PRMSGroundwater"].budget

## Conclusions

This notebook demonstrates the capabilities of `pywatershed` to flexibly model process representations in a way that is friendly to both users and developers. Its design is meant to accelerate interaction of conceptual components implemented in Python and beyond, and to advance hydrologic modeling for novel applications.


## References
* Regan, R. S., Markstrom, S. L., Hay, L. E., Viger, R. J., Norton, P. A., Driscoll, J. M., & LaFontaine, J. H. (2018). Description of the National Hydrologic Model for use with the Precipitation-Runoff Modeling System (PRMS) (No. 6-B9). US Geological Survey.
* Regan, R.S., Markstrom, S.L., LaFontaine, J.H., 2022, PRMS version 5.2.1: Precipitation-Runoff Modeling System (PRMS): U.S. Geological Survey Software Release, 02/10/2022.
