Failure running t-route in ngen worker image #472

robertbartel · 2023-12-18T17:01:27Z

Attempts to run framework-integrated t-route execution are failing. Initially, these were encountering a segmentation fault. After some experimental fix attempts, the errors changed first to a signal 6, then to a signal 7, but t-route still does not run successfully.

The initial suspicion was a problem related to a known NetCDF Python package issue, which is what the early fix attempts tried to address (this may still be the root of what's going on).

hellkite500 · 2023-12-18T18:12:43Z

Can you make a pip list of the runtime python env?

robertbartel · 2023-12-18T20:14:01Z

@hellkite500, sure:

Output of pip list for ngen worker image

[mpi@env4 ngen]$ pip list
WARNING: Ignoring invalid distribution -yarrow (/usr/local/lib64/python3.9/site-packages)
Package            Version
------------------ ------------
attrs              23.1.0
black              23.11.0
blosc2             2.3.2
bmipy              2.0.1
certifi            2023.11.17
cftime             1.6.3
click              8.1.7
click-plugins      1.1.1
cligj              0.7.2
Cython             3.0.6
dbus-python        1.2.18
Deprecated         1.2.14
fiona              1.9.5
geopandas          0.14.1
gpg                1.15.1
importlib-metadata 7.0.0
Jinja2             3.1.2
joblib             1.3.2
libcomps           0.1.18
MarkupSafe         2.1.3
msgpack            1.0.7
mypy-extensions    1.0.0
ndindex            1.7
netCDF4            1.6.3
numexpr            2.8.7
numpy              1.26.2
nwm-routing        0.0.0
packaging          23.2
pandas             2.1.4
pathspec           0.11.2
pip                23.0.1
platformdirs       4.1.0
py-cpuinfo         9.0.0
pyarrow            14.0.1
pyproj             3.6.1
python-dateutil    2.8.2
pytz               2023.3.post1
PyYAML             6.0.1
rpm                4.16.1.3
setuptools         53.0.0
shapely            2.0.2
six                1.15.0
systemd-python     234
tables             3.9.2
tomli              2.0.1
toolz              0.12.0
troute.network     0.0.0
troute.routing     0.0.0
typing_extensions  4.8.0
tzdata             2023.3
wheel              0.42.0
wrapt              1.16.0
xarray             2023.11.0
zipp               3.17.0
WARNING: Ignoring invalid distribution -yarrow (/usr/local/lib64/python3.9/site-packages)
WARNING: Ignoring invalid distribution -yarrow (/usr/local/lib64/python3.9/site-packages)
WARNING: Ignoring invalid distribution -yarrow (/usr/local/lib64/python3.9/site-packages)

[notice] A new release of pip is available: 23.0.1 -> 23.3.2
[notice] To update, run: python3 -m pip install --upgrade pip

hellkite500 · 2023-12-18T20:45:14Z

Can you try with pyarrow 11? Still not sure that underlying issue has been completely addressed upstream.
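
One way to confirm which versions actually end up in the runtime environment is to query the package metadata from Python instead of reading through the full pip list output. This is a minimal sketch using only the standard library; the helper name `installed_major` and the set of packages checked are illustrative assumptions, not part of DMOD or t-route.

```python
from importlib import metadata


def installed_major(package):
    """Return the installed major version of `package`, or None if absent."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    # Take the leading numeric component, e.g. "14.0.1" -> 14.
    head = version.split(".")[0]
    return int(head) if head.isdigit() else None


if __name__ == "__main__":
    # Package names taken from the pip list output above.
    for name in ("pyarrow", "tables", "netCDF4"):
        print(name, installed_major(name))
```

Running this inside the worker image after a rebuild would show whether the pyarrow pin took effect.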

aaraney · 2023-12-19T18:46:23Z

Yeah I suspect it is either pyarrow or tables. How are you installing tables?

robertbartel · 2023-12-20T21:32:11Z

> Yeah I suspect it is either pyarrow or tables. How are you installing tables?

I've tweaked the image to ensure pyarrow 11.0.0 is installed. This was the command to install tables:

env HDF5_DIR=/usr pip3 install --no-cache-dir --no-build-isolation tables

I may be installing t-route incorrectly somehow, as I'm getting this error now. I'll continue looking into it.

FAIL: Unable to import a supported routing module.
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  ModuleNotFoundError: No module named 'troute.config'

At:
  /usr/local/lib/python3.9/site-packages/nwm_routing/input.py(10): <module>
  <frozen importlib._bootstrap>(228): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(850): exec_module
  <frozen importlib._bootstrap>(695): _load_unlocked
  <frozen importlib._bootstrap>(986): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1007): _find_and_load
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(17): <module>
  <frozen importlib._bootstrap>(228): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(850): exec_module
  <frozen importlib._bootstrap>(695): _load_unlocked
  <frozen importlib._bootstrap>(986): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1007): _find_and_load
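
A quick way to check for this kind of gap before launching a full job is to ask the import system whether each required module can be located. This is a hedged sketch: the helper `missing_modules` is hypothetical, and the module names are just the two that appear in the traceback above.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of dotted module names that cannot be located."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except ImportError:
            # Raised when a parent package (e.g. "troute") is itself absent
            # or fails to import.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Module names taken from the traceback in this comment.
    print(missing_modules(["nwm_routing", "troute.config"]))
```

An empty list means both modules were found; anything else names what still needs installing.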

hellkite500 · 2023-12-20T21:37:40Z

There is a new package/step needed with recent versions of t-route.

robertbartel · 2023-12-20T21:37:47Z

As an aside, I still have trouble installing the netCDF4 Python package. I can make the image work with v1.6.3 if I use the binary package, but if I ever try to build it (even going the route of cloning the source tree) the build dependencies won't properly bring in mpi4py.

I don't think at this point that's contributing to the primary error, but it could be an issue later.

hellkite500 · 2023-12-20T21:39:03Z

https://github.com/CIROH-UA/NGIAB-CloudInfra/blob/main/docker%2FDockerfile.t-route#L56

aaraney · 2023-12-20T21:44:44Z

Yeah it looks like troute.config is not being installed by t-route's install script. You can install it with:

pip install "git+https://github.com/noaa-owp/t-route@master#egg=troute_config&subdirectory=src/troute-config"

aaraney · 2023-12-20T22:26:58Z

Sorry, was AFK. Just looked at the install script and it looks like it should be installing troute.config.

aaraney · 2023-12-20T22:27:40Z

@robertbartel, are you checking out a specific commit or branch?

robertbartel · 2023-12-26T16:58:05Z

I may have the issues fixed in the image to get t-route working, though now I am running into some peculiar configuration validation errors:

terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  ValidationError: 5 validation errors for Config
compute_parameters -> data_assimilation_parameters -> streamflow_da -> lastobs_output_folder
  extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> streamflow_da -> wrf_hydro_lastobs_file
  extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> reservoir_da -> gage_lakeID_crosswalk_file
  extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> reservoir_da -> reservoir_persistence_usace
  extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> reservoir_da -> reservoir_persistence_usgs
  extra fields not permitted (type=value_error.extra)

@yuqiong77 provided the original config I was using for testing. I don't have enough experience with t-route to sanity check things beyond ~~not seeing these "extra fields" in the t-route config documentation~~ (correction, they are in the example file ... I'll need to dig some more on that), but they are specific enough for me to remain a bit uncertain.

Regardless, I am at least going to tweak the configuration and run tests until I get a successful job completion.

yuqiong77 · 2024-01-02T14:25:31Z

Happy New Year! I'm pressed for time to complete some multi-year streamflow simulation runs (either within the ngen image Bobby has helped build or as a post-processing step) for my AMS presentation. My sincerest thanks to you all for looking into the t-route issue.

robertbartel · 2024-01-04T17:55:24Z

I'm going to put together at least a draft PR for this to build images for @yuqiong77, but I'm still running into an error. It does appear to be a more t-route-specific problem - perhaps still related to the configuration - and not one with the image.

terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  AttributeError: 'NoneType' object has no attribute 'get'

At:
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(32): read_geopkg
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(154): read_geo_file
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(253): __init__
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(80): main_v04

Doing some limited checking, it looks like this is implying data_assimilation_parameters are None from the start, but there is at least something configured (and uncommented) in that section of the t-route config file I'm using. Again, we've gone outside my expertise, and perhaps the configuration simply needs some adjustment.
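
To illustrate the failure mode being described (this is not t-route's actual code), a chained `.get` on a nested config fails with exactly this AttributeError as soon as one level is None, which also happens when a YAML key is present but its value is null. The function name below is hypothetical; the key names come from the validation errors earlier in this thread.

```python
def streamflow_da_settings(compute_parameters):
    """Fetch the streamflow_da block, tolerating missing or null sections."""
    # `or {}` covers both an absent key and a key whose YAML value is null.
    da = compute_parameters.get("data_assimilation_parameters") or {}
    return da.get("streamflow_da") or {}


if __name__ == "__main__":
    broken = {"data_assimilation_parameters": None}
    try:
        # Chained lookups fail as soon as one level is None.
        broken.get("data_assimilation_parameters").get("streamflow_da")
    except AttributeError as err:
        print("unguarded:", err)
    print("guarded:", streamflow_da_settings(broken))
```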

yuqiong77 · 2024-01-04T18:45:53Z

@robertbartel Thanks. I also suspect the config I used (which was based on an example found in the t-route repository a few weeks ago) may have some issues. The example config file looks quite different from the t-route config files I used back in 2022, which did not have a data assimilation section.

Looking at the DA section of the current config, I think the only line that may cause an issue is the following:

lastobs_output_folder : lastobs/

What if we comment out that line?

robertbartel · 2024-01-05T19:03:27Z

There seem to be at least some t-route problems contributing to this, which I've opened issue NOAA-OWP/t-route#719 to track.

robertbartel · 2024-01-05T22:12:14Z

I think the problems in part are due to using a troute v3.0 config with troute v4.0 execution. If I tweak part of the data_assimilation_parameters config like this:

        reservoir_da:
            #----------
            reservoir_persistence_da:
              reservoir_persistence_usgs  : False
              reservoir_persistence_usace : False

Then I get past the earlier attribute and validation errors, although now I run into this/these:

terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  KeyError: 'downstream'

At:
  /usr/local/lib64/python3.9/site-packages/pandas/core/indexes/base.py(3798): get_loc
  /usr/local/lib64/python3.9/site-packages/pandas/core/frame.py(3893): __getitem__
  /usr/local/lib/python3.9/site-packages/geopandas/geodataframe.py(1474): __getitem__
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(352): preprocess_network
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(269): __init__
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(80): main_v04

/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 8 leaked semlock objects to clean up at shutdown
  warnings.warn(
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 2 leaked folder objects to clean up at shutdown
  warnings.warn(
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:330: UserWarning: resource_tracker: /tmp/joblib_memmapping_folder_22_7605d254530b40fc919513833b8b0a71_79e4edc7c8d34a3cbd66375e0821ed87: FileNotFoundError(2, 'No such file or directory')
  warnings.warn(f"resource_tracker: {name}: {e!r}")
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:330: UserWarning: resource_tracker: /tmp/joblib_memmapping_folder_22_7605d254530b40fc919513833b8b0a71_50b6587dc9f240e98727d944908e48e2: FileNotFoundError(2, 'No such file or directory')
  warnings.warn(f"resource_tracker: {name}: {e!r}")
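
The reservoir_da tweak shown earlier in this comment can also be expressed as a small transformation on an already-loaded config dictionary. This is only a sketch under the assumption that the two persistence flags are the only keys that need to move; the function name is hypothetical, and it does not cover any other differences between v3.0 and v4.0 configs.

```python
import copy

# Keys that the tweak above moves under `reservoir_persistence_da`.
PERSISTENCE_KEYS = ("reservoir_persistence_usgs", "reservoir_persistence_usace")


def nest_reservoir_persistence(config):
    """Return a copy of `config` with the persistence flags nested."""
    updated = copy.deepcopy(config)
    compute = updated.get("compute_parameters") or {}
    da = compute.get("data_assimilation_parameters") or {}
    reservoir_da = da.get("reservoir_da")
    if not isinstance(reservoir_da, dict):
        return updated  # nothing to move
    flags = {k: reservoir_da.pop(k) for k in PERSISTENCE_KEYS if k in reservoir_da}
    if flags:
        nested = reservoir_da.get("reservoir_persistence_da")
        if not isinstance(nested, dict):
            nested = reservoir_da["reservoir_persistence_da"] = {}
        nested.update(flags)
    return updated
```

The input dictionary is left untouched, so the original and adjusted layouts can be compared side by side.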

yuqiong77 · 2024-01-09T00:45:46Z

Hi Bobby,

Thanks for figuring out the mismatch between t-route config and execution. I found a v4 example of the config in the repository:

v4 example

Based on that, I modified my config file on UCS6:

/local/model_as_a_service/yuqiong/data/troute_config.yaml

I now get the following error:

Finished 744 timesteps.
creating supernetwork connections set
2024-01-09 00:37:13,738 INFO [AbstractNetwork.py:489 - create_independent_networks()]: organizing connections into reaches ...
2024-01-09 00:37:13,785 DEBUG [AbstractNetwork.py:518 - create_independent_networks()]: reach organization complete in 0.04627418518066406 seconds.
2024-01-09 00:37:13,785 INFO [AbstractNetwork.py:646 - initial_warmstate_preprocess()]: setting channel initial states ...
2024-01-09 00:37:13,785 DEBUG [AbstractNetwork.py:701 - initial_warmstate_preprocess()]: channel initial states complete in 0.0003256797790527344 seconds.
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): ZeroDivisionError: division by zero

At:
/usr/local/lib64/python3.9/site-packages/troute/AbstractNetwork.py(801): build_forcing_sets
/usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(108): main_v04

Any hint?

robertbartel · 2024-01-09T13:40:00Z

Indeed, I encountered the ZeroDivisionError as well. I made some further modifications to the config - mostly under forcing_parameters - to get to the config I'll attach here. I think at this point the troute config is valid and the Docker image is built properly (with respect to troute). Note that I had to compress it to get it to attach, so you'll need to gunzip it first.

troute_config.yaml.gz

There is still some trouble though. In short, ngen seems to be outputting a bogus line at the end of one of the terminal nexus output files (in particular, ~~the one with the largest numeric feature id~~ edit: my mistake: the trouble was with tnx-1000000099_output.csv). I'm going to work on debugging that some today.

yuqiong77 · 2024-01-09T15:01:33Z

Bobby, which tnx file are you referring to specifically? I opened tnx-1000000687_output.csv (the one with the largest numeric id). The last line looked normal to me.

aaraney · 2024-01-09T15:29:50Z

@yuqiong77, we were having issues with tnx-1000000099_output.csv. There is an extra line with the contents 0, 4.08443 at the end of the file.

yuqiong77 · 2024-01-09T15:36:09Z

Thanks! I see that now. Although the last line in my file tnx-1000000099_output.csv looks a bit different:

743, 2012-10-31 23:00:00, 1.22727
.53858

aaraney · 2024-01-09T15:44:06Z

For sure, @yuqiong77! Well that is odd. I am just jumping back into this thread, so I am not sure if @robertbartel was using a different set of forcing data than you are for your simulations. With the modifications @robertbartel suggested making to the t-route config, were you able to get a full end-to-end run of NextGen working? Or are you still running into the divide-by-zero error?

aaraney · 2024-01-09T16:07:19Z

Probably ignore this, just documenting it because it is related. As @robertbartel found out yesterday, the extra line in the tnx- csv file mentioned in my previous comment is the source of an InvalidIndexError that gets thrown by t-route (see collapsed stack trace).

stack trace

2024-01-08 20:00:06,620    INFO [AbstractNetwork.py:125 -    assemble_forcings()]: Creating a DataFrame of lateral inflow forcings ... 
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  InvalidIndexError: Reindexing only valid with uniquely valued Index objects
At:
  /usr/local/lib64/python3.9/site-packages/pandas/core/indexes/base.py(3875): get_indexer
  /usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(676): get_result
  /usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(393): concat
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(611): build_qlateral_array
  /usr/local/lib64/python3.9/site-packages/troute/AbstractNetwork.py(127): assemble_forcings
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(121): main_v04

In short, t-route is trying to concatenate pandas DataFrames by row. Each DataFrame is indexed by feature_id (so the 1000000099 in tnx-1000000099_output.csv), however because of the added line mentioned above, 1000000099 ends up being an index value twice. Pandas cannot concatenate by row DataFrames with non-unique index values.
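
A pre-check for this condition can be sketched in plain Python, as a stand-in for the pandas index involved. The function name and the way the ids are collected are assumptions for illustration; this is not what t-route does internally.

```python
from collections import Counter


def duplicated_ids(feature_ids):
    """Return the feature ids that occur more than once, in first-seen order."""
    ids = list(feature_ids)
    counts = Counter(ids)
    # dict.fromkeys keeps one entry per id while preserving order.
    return [fid for fid in dict.fromkeys(ids) if counts[fid] > 1]


if __name__ == "__main__":
    # A repeated id is the condition that makes the row-wise concat fail.
    print(duplicated_ids([1000000099, 1000000687, 1000000099]))
```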

yuqiong77 · 2024-01-09T16:57:26Z

@aaraney I just tested with the config file that @robertbartel posted (I think my config had the binary_nexus_file_folder line commented out). The divide-by-zero error is gone. Now I'm getting the same InvalidIndexError error message you posted above.

2024-01-09 16:50:42,859 INFO [AbstractNetwork.py:125 - assemble_forcings()]: Creating a DataFrame of lateral inflow forcings ...
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): InvalidIndexError: Reindexing only valid with uniquely valued Index objects

At:
/usr/local/lib64/python3.9/site-packages/pandas/core/indexes/base.py(3875): get_indexer
/usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(676): get_result
/usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(393): concat
/usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(611): build_qlateral_array
/usr/local/lib64/python3.9/site-packages/troute/AbstractNetwork.py(127): assemble_forcings
/usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(121): main_v04

aaraney · 2024-01-09T17:10:57Z

@yuqiong77, well at least we are having issues in the same place! From the directory where the NextGen output files are, can you please run find . -name "*_output.csv" -exec awk -F ',' 'NF != 3 {print FILENAME}' {} ';'? This should tell you which output files have spurious lines.
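
If awk is not handy, roughly the same check can be done from Python. This is a sketch with a hypothetical function name; like the command above, it walks every `*_output.csv` file under a directory and flags those containing a row that does not have exactly three comma-separated fields, although it reports each file once rather than once per bad line.

```python
import csv
from pathlib import Path


def files_with_bad_rows(root, expected_fields=3):
    """Return sorted paths of *_output.csv files with a row of the wrong width."""
    bad = []
    for path in sorted(Path(root).rglob("*_output.csv")):
        with path.open(newline="") as handle:
            for row in csv.reader(handle):
                if len(row) != expected_fields:
                    bad.append(path)
                    break  # one bad row is enough to flag the file
    return bad


if __name__ == "__main__":
    for path in files_with_bad_rows("."):
        print(path)
```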

ajkhattak · 2024-01-19T04:09:00Z

> > forcing_filename = '.'
> > output_filename = '.'

I don't think they matter if you are running in the nextgen framework. You could even point them to any fake or real file, as they are not read/used (but again, only when running in the framework).

Noah-OM does provide precipitation and potential_ET as inputs to CFE (and topmodel too) via BMI. However, the output_filename is not used; I would guess this file is not even generated when running Noah-OM in the nextgen framework. @SnowHydrology can confirm this or correct me if I said something that does not make sense 😊

SnowHydrology · 2024-01-19T11:56:08Z

@ajkhattak is correct. You could put any string you want in those two entries when running in NextGen. We put compiler directives to skip over the forcing read and output write routines. E.g. https://github.com/NOAA-OWP/noah-owp-modular/blob/04e8ac02532c9a292098f974cdb03aa03bfbfcd6/src/RunModule.f90#L210

yuqiong77 · 2024-01-19T13:40:05Z

@ajkhattak @SnowHydrology That's great to know! I was checking the source code NamelistRead.f90 and noticed that forcing_filename & output_filename inputs were required, but I did not dig deeper to see how those file names get used in other subroutines.

SnowHydrology · 2024-01-19T14:14:03Z

@yuqiong77 Were you ever able to track down the exact time and location (basin ID) of the failure? I'd be interested to see the forcing data corresponding to the failure just in case there is anything interesting in the file.

yuqiong77 · 2024-01-19T14:54:23Z

@SnowHydrology No, I have not been able to track down the exact time and location of the failure. The screen output or error message did not indicate any catchment ids. What makes the debugging difficult is that the error would only occur after 9 to 10 months into the run, which would take close to 20 hours clock time (in the serial mode, since at the moment running in the parallel mode would produce spurious lines in the ngen output). I'll try to dig a bit deeper to see if I can identify the catchment of the failure.

SnowHydrology · 2024-01-19T15:33:55Z

@yuqiong77 that's likely because the error print out is coming from Noah-OM, which doesn't know which catchment it's running in. Maybe the output files can indicate where Noah-OM failed?

SnowHydrology · 2024-01-19T15:43:26Z

Also tagging @GreyEvenson-NOAA here.

The Noah-OM issue is described here originally: #472 (comment)

yuqiong77 · 2024-01-19T16:24:01Z

Some progress on identifying problematic catchments ... For 2933 catchments, the ngen outputs contain nan values from the very first time step, e.g.,

TimeStep,Time,RAIN_RATE,DIRECT_RUNOFF,GIUH_RUNOFF,NASH_LATERAL_RUNOFF,DEEP_GW_TO_CHANNEL_FLUX,Q_OUT,POTENTIAL_ET,ACTUAL_ET,GW_STORAGE,SOIL_STORAGE,SOIL_STORAGE_CHANGE,SURF_RUNOFF_SCHEME
0,2013-10-01 00:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000087867,0.000000000,-nan,0.000000000,0.000000000,1.000000000
1,2013-10-01 01:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000076420,0.000000000,-nan,0.000000000,0.000000000,1.000000000
2,2013-10-01 02:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000065125,0.000000000,-nan,0.000000000,0.000000000,1.000000000
3,2013-10-01 03:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000033762,0.000000000,-nan,0.000000000,0.000000000,1.000000000
4,2013-10-01 04:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000032287,0.000000000,-nan,0.000000000,0.000000000,1.000000000
5,2013-10-01 05:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000051751,0.000000000,-nan,0.000000000,0.000000000,1.000000000
6,2013-10-01 06:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000046923,0.000000000,-nan,0.000000000,0.000000000,1.000000000

I checked the forcing files of these catchments and didn't find anything suspicious ... Will keep digging and report back
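
A scan like this can be scripted so that every catchment output file is checked the same way. The sketch below assumes the CSV layout shown above (a header row with a TimeStep column); the function name is hypothetical. It returns the first timestep at which any numeric column is NaN, along with the affected column names.

```python
import csv
import math


def first_nan_row(path):
    """Return (timestep, [column names]) for the first row with NaN, or None."""
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            bad = []
            for name, value in row.items():
                try:
                    if math.isnan(float(value)):
                        bad.append(name)
                except (TypeError, ValueError):
                    continue  # non-numeric fields such as the Time column
            if bad:
                return row.get("TimeStep"), bad
    return None
```

Note that Python's `float` accepts the `-nan` spelling used in these files, so no special parsing is needed.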

SnowHydrology · 2024-01-19T16:37:56Z

@yuqiong77 are you also saving the Noah-OM outputs? That would help with diagnosing any issues.

yuqiong77 · 2024-01-19T16:47:46Z

With help from @ajkhattak , I think we have found the issue. The parameter values in the CFE config files for those problematic catchments were not set correctly. Likely there was a bug in the script that I used to populate the parameter values from the regionalization.

ajkhattak · 2024-01-19T16:58:05Z

But we still need to dig deeper to investigate it further. Unless I am missing something, I don't think that wrong CFE config file inputs caused the Noah-OM FIRE = forcing%LWDN + energy%FIRA error, as the coupling between CFE and NOM is not two-way.

aaraney · 2024-01-22T15:23:17Z

@yuqiong77 / @ajkhattak was the supposed config issue with the dynamic_veg_option parameter?

yuqiong77 · 2024-01-22T16:50:26Z

@aaraney I doubt dynamic_veg_option would be an issue. I had successful runs before with dynamic_veg_option set to 4 (same as what we're using now).

yuqiong77 · 2024-01-26T15:03:55Z

Hi all, just wanted to let you know that Ahmad has been helping me debug, and he was able to run ngen successfully for a year with my realization config (CFE + Noah-OM) and BMI files for HUC-01 catchments. He was also able to run successfully without CFE. He carried out both runs outside of the container. @ajkhattak if I miscommunicated or missed something, please correct.

But all of my runs (with or without CFE) within the container failed at around 9-10 months, with the same error (negative FIRA) in Noah-OM. So I'm wondering if the issue has something to do with the image that @robertbartel helped build, in particular related to the Noah-OM module contained in that image?

aaraney · 2024-01-26T15:26:01Z

Thanks for reporting back @yuqiong77! I was afraid it would be difficult to diagnose. Unfortunately it could be a myriad of things, from the version of the Noah-OM code @ajkhattak used, to the compiler (gcc vs clang), the optimization level used by the compiler, or even the CPU architecture. @ajkhattak for starters, did you run the experiment on an arm or x86 machine?

SnowHydrology · 2024-01-26T15:27:41Z

@yuqiong77 Thanks for this update. The error message you got is one of the few checks in Noah-OM that will stop the model. Although the error may manifest as emitted longwave <0; skin T may be wrong due to inconsistent input of SHDFAC with LAI, it can be caused by myriad issues.

robertbartel · 2024-01-26T15:35:17Z

@yuqiong77, thank you for the info. Just to confirm, were your runs always with serial ngen, or did you also experience the errors running parallel ngen? If you haven't tried a parallel ngen scenario because of the current issues with that and t-route, could you try your configs in a parallel run (with routing removed of course) and see if the error still occurs?

yuqiong77 · 2024-01-26T15:44:50Z

@robertbartel yes, all my latest runs that failed were in serial mode. My earlier runs in the parallel mode did not go far because of the t-route issue we ran into. I will launch a parallel run without routing for a year and report back.

yuqiong77 · 2024-01-26T16:30:12Z

@robertbartel The parallel version ran pretty fast, but unfortunately it still failed at around 7300 time steps with the same error.

Running timestep 7300
emitted longwave <0; skin T may be wrong due to inconsistent
input of SHDFAC with LAI
2147483647 2147483647 SHDFAC= 0.800000012 parameters%VAI= 4.74794531 TV= 286.935242 TG= -110.499847
LWDN= 366.299988 energy%FIRA= -15046.4004 water%SNOWH= 0.00000000
Exiting ...
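
For reference, a timestep index can be mapped back to a simulation date with a little arithmetic. The sketch below assumes hourly timesteps and a start of 2013-10-01 00:00, as in the CFE output sample posted earlier in this thread; the failing run's actual start time may differ, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def timestep_to_datetime(start, timestep, step_seconds=3600):
    """Map a zero-based timestep index to its simulation time."""
    return start + timedelta(seconds=timestep * step_seconds)


if __name__ == "__main__":
    # Assumed start time and hourly step; adjust to the realization config.
    print(timestep_to_datetime(datetime(2013, 10, 1), 7300))
```

Under those assumptions, timestep 7300 falls on 2014-08-01 04:00, about ten months into the run, which is consistent with the serial failures reported at 9 to 10 months.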

ajkhattak · 2024-01-26T18:08:55Z

Sorry guys, there were some other issues. The use of STOP in Noah-OM terminates the program normally (stops the execution and returns ZERO to the terminal). I replaced STOP with call ABORT so that Noah-OM terminates abnormally; my workflow can then catch this abnormal behavior and report the problematic catchment ID. The first such catchment I see is cat-2573.
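
The general idea of catching an abnormal termination per catchment can be sketched as follows. This is not the actual workflow described above; the function name, catchment ids, and child commands are placeholders, with small Python child processes standing in for model runs that exit cleanly (like STOP) or abort (like call ABORT).

```python
import subprocess
import sys


def failed_catchments(jobs):
    """Return ids whose process ended with a non-zero status or a signal."""
    failed = []
    for catchment_id, command in jobs:
        result = subprocess.run(command, capture_output=True, text=True)
        # A clean exit returns 0; an abort shows up as a non-zero code
        # (negative signal number on POSIX).
        if result.returncode != 0:
            failed.append(catchment_id)
    return failed


if __name__ == "__main__":
    jobs = [
        ("cat-A", [sys.executable, "-c", "import sys; sys.exit(0)"]),
        ("cat-B", [sys.executable, "-c", "import os; os.abort()"]),
    ]
    print(failed_catchments(jobs))
```

A process that exits with status zero is indistinguishable from a successful run at this level, which is why the switch from STOP to ABORT matters for this kind of check.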

I am going to test it on the latest Noah-OM master and see if I can reproduce the error.

@GreyEvenson-NOAA I will reach out to you to discuss the debugging further

sorry for any confusion...

GreyREvenson · 2024-02-01T20:43:10Z

Afternoon all,

I spent some time looking for a problem in the energy balance simulations and the calculation of vegetation temperature and ground (below veg) temperature in EnergyMain and EtFluxModule but didn't find anything.

However, I noticed that in the namelist file that Ahmad gave to me, the soil type is specified as 14, which corresponds to 'water'. The simulation ended successfully -- and with realistic ground temp values -- after changing the soil type to something different (I tried several different non-water soil types). Can someone confirm my observation by changing isltyp to 13 or something else and re-running?
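
To find other catchments with the same setting, the namelist files could be scanned for the soil type value. The sketch below assumes the entry appears as `isltyp = <integer>` on its own line and treats 14 as the water soil type, as described above; the function names are hypothetical, and a full namelist parser would be more robust.

```python
import re

# Soil type 14 is reported as 'water' in the comment above.
WATER_SOIL_TYPE = 14
_ISLTYP = re.compile(r"^\s*isltyp\s*=\s*(\d+)", re.IGNORECASE | re.MULTILINE)


def soil_type(namelist_text):
    """Return the integer isltyp value from namelist text, or None if absent."""
    match = _ISLTYP.search(namelist_text)
    return int(match.group(1)) if match else None


def has_water_soil(namelist_text):
    """True when the namelist sets isltyp to the water soil type."""
    return soil_type(namelist_text) == WATER_SOIL_TYPE
```

Applying `has_water_soil` to the text of each catchment's namelist file would list the candidates to re-run with a different soil type.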

@yuqiong77: Does this catchment need to be simulated with a water soil type? If so, I will look into the matter further as the energy and temperature simulations are partly impacted by the properties of the top soil horizon.

SnowHydrology · 2024-02-07T17:30:54Z

@robertbartel, this issue might be close-able. @ajkhattak and @GreyEvenson-NOAA tracked down the issue in the Noah-OM namelist and we're working on a fix in the hydrofabric.

Actually, I just noticed, this particular issue has had quite the evolution, so I don't know if the original issue has been solved. The Noah-OM error has been.

robertbartel · 2024-02-07T18:11:23Z

Thanks @SnowHydrology. The scope did get pretty broad, but I think you are correct in that this can be closed. To be safe though, I want to outline what has been uncovered and the status of each item:

The original image dependency and build issues, related to NetCDF Python package
- Fixed via Fix ngen image for building t-route #474
Failure running a parallel modeling job with t-route
- Not directly a DMOD issue
- Turned out to be a subtle problem with ngen and how it writes output files that throws off t-route
- Can be worked around by running ngen serially
Failure running any modeling jobs when time range reaches certain length
- Not directly a DMOD issue
- Seems like choice of configured soil type was contributing to unstable behavior of Noah-OM

@aaraney, @yuqiong77, @ajkhattak, is this all correct? Have I missed anything?

hellkite500 · 2024-02-07T19:17:39Z

@ajkhattak would you be willing to document/describe the workflow on this ngen issue? NOAA-OWP/ngen#723
I've been thinking about various ways to catch library exits and propagate errors through the model engine stack to capture additional information, and it sounds like you have done something that may be useful in helping formalize a mechanism in the model engine to provide these details.

robertbartel added the bug (Something isn't working) and maas (MaaS Workstream) labels Dec 18, 2023

robertbartel mentioned this issue Jan 4, 2024

Fix ngen image for building t-route #474

Merged

aaraney mentioned this issue Jan 9, 2024

Consider asserting that index values are unique. NOAA-OWP/t-route#721

Closed

SnowHydrology mentioned this issue Feb 7, 2024

Land cover and soil classification using generic categories NOAA-OWP/hydrofabric#35

Open

SnowHydrology mentioned this issue Feb 7, 2024

Check for consistency in land cover and soil NOAA-OWP/noah-owp-modular#97

Open

SnowHydrology mentioned this issue Feb 8, 2024

Basin and timestep information for error messages and warnings NOAA-OWP/ngen#723

Open

aaraney mentioned this issue Mar 26, 2024

feat: warn if soil or veg type is water but not both NOAA-OWP/ngen-cal#118

Merged

SnowHydrology mentioned this issue Mar 27, 2024

Report catchment ID and timestep on BMI_FAILURE NOAA-OWP/ngen#777

Closed

robertbartel closed this as completed Apr 30, 2024
