Failure running t-route in ngen worker image #472

Closed
robertbartel opened this issue Dec 18, 2023 · 77 comments
Labels
bug Something isn't working maas MaaS Workstream

Comments

@robertbartel
Contributor

robertbartel commented Dec 18, 2023

Attempts to run framework-integrated t-route execution are failing. Initially, these were encountering a segmentation fault. After some experimental fix attempts, the errors changed first to a signal 6, then to a signal 7, but t-route still does not run successfully.

The initial suspicion was a problem related to a known NetCDF Python package issue, which is what the early fix attempts tried to address (this may still be the root of what's going on).

@robertbartel robertbartel added bug Something isn't working maas MaaS Workstream labels Dec 18, 2023
@hellkite500
Member

Can you make a pip list of the runtime python env?

@robertbartel
Contributor Author

@hellkite500, sure:

Output of pip list for ngen worker image
[mpi@env4 ngen]$ pip list
WARNING: Ignoring invalid distribution -yarrow (/usr/local/lib64/python3.9/site-packages)
Package            Version
------------------ ------------
attrs              23.1.0
black              23.11.0
blosc2             2.3.2
bmipy              2.0.1
certifi            2023.11.17
cftime             1.6.3
click              8.1.7
click-plugins      1.1.1
cligj              0.7.2
Cython             3.0.6
dbus-python        1.2.18
Deprecated         1.2.14
fiona              1.9.5
geopandas          0.14.1
gpg                1.15.1
importlib-metadata 7.0.0
Jinja2             3.1.2
joblib             1.3.2
libcomps           0.1.18
MarkupSafe         2.1.3
msgpack            1.0.7
mypy-extensions    1.0.0
ndindex            1.7
netCDF4            1.6.3
numexpr            2.8.7
numpy              1.26.2
nwm-routing        0.0.0
packaging          23.2
pandas             2.1.4
pathspec           0.11.2
pip                23.0.1
platformdirs       4.1.0
py-cpuinfo         9.0.0
pyarrow            14.0.1
pyproj             3.6.1
python-dateutil    2.8.2
pytz               2023.3.post1
PyYAML             6.0.1
rpm                4.16.1.3
setuptools         53.0.0
shapely            2.0.2
six                1.15.0
systemd-python     234
tables             3.9.2
tomli              2.0.1
toolz              0.12.0
troute.network     0.0.0
troute.routing     0.0.0
typing_extensions  4.8.0
tzdata             2023.3
wheel              0.42.0
wrapt              1.16.0
xarray             2023.11.0
zipp               3.17.0
WARNING: Ignoring invalid distribution -yarrow (/usr/local/lib64/python3.9/site-packages)
WARNING: Ignoring invalid distribution -yarrow (/usr/local/lib64/python3.9/site-packages)
WARNING: Ignoring invalid distribution -yarrow (/usr/local/lib64/python3.9/site-packages)

[notice] A new release of pip is available: 23.0.1 -> 23.3.2
[notice] To update, run: python3 -m pip install --upgrade pip

@hellkite500
Member

Can you try with pyarrow 11? Still not sure that underlying issue has been completely addressed upstream.
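
When comparing images, a small check like the one below can report the installed versions of the packages under suspicion. It is only a sketch: the package names are taken from the pip list above, and the helper name installed_versions is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Packages suspected in this thread; adjust the list as needed.
    for name, ver in installed_versions(["pyarrow", "tables", "netCDF4"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this inside the worker image before and after a rebuild makes it easy to confirm that a pin such as pyarrow 11 actually took effect.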

@aaraney
Member

aaraney commented Dec 19, 2023

Yeah I suspect it is either pyarrow or tables. How are you installing tables?

@robertbartel
Contributor Author

> Yeah I suspect it is either pyarrow or tables. How are you installing tables?

I've tweaked the image to ensure pyarrow 11.0.0 is installed. This was the command to install tables:

env HDF5_DIR=/usr pip3 install --no-cache-dir --no-build-isolation tables

I may be installing t-route incorrectly somehow, as I'm getting this error now. I'll continue looking into it.

FAIL: Unable to import a supported routing module.
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  ModuleNotFoundError: No module named 'troute.config'

At:
  /usr/local/lib/python3.9/site-packages/nwm_routing/input.py(10): <module>
  <frozen importlib._bootstrap>(228): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(850): exec_module
  <frozen importlib._bootstrap>(695): _load_unlocked
  <frozen importlib._bootstrap>(986): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1007): _find_and_load
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(17): <module>
  <frozen importlib._bootstrap>(228): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(850): exec_module
  <frozen importlib._bootstrap>(695): _load_unlocked
  <frozen importlib._bootstrap>(986): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1007): _find_and_load

@hellkite500
Member

There is a new package/step needed with recent versions of t-route.

@robertbartel
Contributor Author

As an aside, I still have trouble installing the netCDF4 Python package. I can make the image work with v1.6.3 if I use the binary package, but if I ever try to build it (even going the route of cloning the source tree) the build dependencies won't properly bring in mpi4py.

I don't think at this point that's contributing to the primary error, but it could be an issue later.

@hellkite500
Member

@aaraney
Member

aaraney commented Dec 20, 2023

Yeah it looks like troute.config is not being installed by t-route's install script. You can install it with:

pip install "git+https://github.com/noaa-owp/t-route@master#egg=troute_config&subdirectory=src/troute-config"
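
A quick way to confirm whether the module from the traceback is visible to the runtime interpreter is sketched below. The module names come from the error output above; the helper name is_importable is hypothetical.

```python
import importlib.util


def is_importable(module_name):
    """Return True if module_name can be located by the import system.

    find_spec imports parent packages (e.g. "troute") but does not
    execute the target module itself.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package is itself missing or fails to import.
        return False


if __name__ == "__main__":
    # Module names taken from the traceback in this thread.
    for name in ("nwm_routing", "troute.config"):
        print(f"{name}: {'ok' if is_importable(name) else 'MISSING'}")
```

Note that this should be run with the same Python interpreter that ngen embeds, since a different interpreter may see a different site-packages directory.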

@aaraney
Member

aaraney commented Dec 20, 2023

Sorry, was AFK. Just looked at the install script and it looks like it should be installing troute.config.

@aaraney
Member

aaraney commented Dec 20, 2023

@robertbartel, are you checking out a specific commit or branch?

@robertbartel
Contributor Author

robertbartel commented Dec 26, 2023

I may have the issues fixed in the image to get t-route working, though now I am running into some peculiar configuration validation errors:

terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  ValidationError: 5 validation errors for Config
compute_parameters -> data_assimilation_parameters -> streamflow_da -> lastobs_output_folder
  extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> streamflow_da -> wrf_hydro_lastobs_file
  extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> reservoir_da -> gage_lakeID_crosswalk_file
  extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> reservoir_da -> reservoir_persistence_usace
  extra fields not permitted (type=value_error.extra)
compute_parameters -> data_assimilation_parameters -> reservoir_da -> reservoir_persistence_usgs
  extra fields not permitted (type=value_error.extra)

@yuqiong77 provided the original config I was using for testing. I don't have enough experience with t-route to sanity check things beyond not seeing these "extra fields" in the t-route config documentation (correction, they are in the example file ... I'll need to dig some more on that), but they are specific enough for me to remain a bit uncertain.

Regardless, I am at least going to tweak the configuration and run tests until I get a successful job completion.

@yuqiong77

Happy New Year! I'm pressed for time to complete some multi-year streamflow simulation runs (either within the ngen image Bobby has helped build or as a post-processing step) for my AMS presentation. My sincerest thanks to you all for looking into the t-route issue.

@robertbartel
Contributor Author

I'm going to put together at least a draft PR for this to build images for @yuqiong77, but I'm still running into an error. It does appear to be a more t-route-specific problem - perhaps still related to the configuration - and not one with the image.

terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  AttributeError: 'NoneType' object has no attribute 'get'

At:
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(32): read_geopkg
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(154): read_geo_file
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(253): __init__
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(80): main_v04

Doing some limited checking, it looks like this is implying data_assimilation_parameters are None from the start, but there is at least something configured (and uncommented) in that section of the t-route config file I'm using. Again, we've gone outside my expertise, and perhaps the configuration simply needs some adjustment.

@yuqiong77

@robertbartel Thanks. I also suspect the config I used (which was based on an example found in the t-route repository a few weeks ago) may have some issues. The example config file looks quite different from the t-route config files I used back in 2022, which did not have a data assimilation section.

Looking at the DA section of the current config, I think the only line that may cause an issue is the following:

lastobs_output_folder : lastobs/

What if we comment out that line?

@robertbartel
Contributor Author

There seem to be at least some t-route problems contributing to this, which I've opened issue NOAA-OWP/t-route#719 to track.

@robertbartel
Contributor Author

I think the problems in part are due to using a troute v3.0 config with troute v4.0 execution. If I tweak part of the data_assimilation_parameters config like this:

        reservoir_da:
            #----------
            reservoir_persistence_da:
              reservoir_persistence_usgs  : False
              reservoir_persistence_usace : False

Then I get past the earlier attribute and validation errors, although now I run into this/these:

terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  KeyError: 'downstream'

At:
  /usr/local/lib64/python3.9/site-packages/pandas/core/indexes/base.py(3798): get_loc
  /usr/local/lib64/python3.9/site-packages/pandas/core/frame.py(3893): __getitem__
  /usr/local/lib/python3.9/site-packages/geopandas/geodataframe.py(1474): __getitem__
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(352): preprocess_network
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(269): __init__
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(80): main_v04

/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 8 leaked semlock objects to clean up at shutdown
  warnings.warn(
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:314: UserWarning: resource_tracker: There appear to be 2 leaked folder objects to clean up at shutdown
  warnings.warn(
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:330: UserWarning: resource_tracker: /tmp/joblib_memmapping_folder_22_7605d254530b40fc919513833b8b0a71_79e4edc7c8d34a3cbd66375e0821ed87: FileNotFoundError(2, 'No such file or directory')
  warnings.warn(f"resource_tracker: {name}: {e!r}")
/usr/local/lib/python3.9/site-packages/joblib/externals/loky/backend/resource_tracker.py:330: UserWarning: resource_tracker: /tmp/joblib_memmapping_folder_22_7605d254530b40fc919513833b8b0a71_50b6587dc9f240e98727d944908e48e2: FileNotFoundError(2, 'No such file or directory')
  warnings.warn(f"resource_tracker: {name}: {e!r}")
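
The reservoir_da nesting change described above can be sketched on an already-loaded config dictionary. This is an illustration only: it handles just the two persistence flags shown in the tweak, the function name is made up, and it says nothing about where the other rejected keys belong.

```python
import copy

# Keys that the validator rejected directly under reservoir_da (see the
# validation errors earlier in this thread).
PERSISTENCE_KEYS = ("reservoir_persistence_usgs", "reservoir_persistence_usace")


def nest_reservoir_persistence(reservoir_da):
    """Return a copy with the persistence flags moved under reservoir_persistence_da."""
    updated = copy.deepcopy(reservoir_da)
    # Reuse an existing nested section if present; treat a missing or empty one as {}.
    nested = updated.get("reservoir_persistence_da") or {}
    for key in PERSISTENCE_KEYS:
        if key in updated:
            nested[key] = updated.pop(key)
    updated["reservoir_persistence_da"] = nested
    return updated


old_style = {
    "reservoir_persistence_usgs": False,
    "reservoir_persistence_usace": False,
}
print(nest_reservoir_persistence(old_style))
```

The input dictionary is left untouched, so the original and the restructured layout can be compared side by side.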

@yuqiong77

yuqiong77 commented Jan 9, 2024

Hi Bobby,

Thanks for figuring out the mismatch between t-route config and execution. I found a v4 example of the config in the repository:

v4 example

Based on that, I modified my config file on UCS6:

/local/model_as_a_service/yuqiong/data/troute_config.yaml

I now get the following error:

Finished 744 timesteps.
creating supernetwork connections set
2024-01-09 00:37:13,738 INFO [AbstractNetwork.py:489 - create_independent_networks()]: organizing connections into reaches ...
2024-01-09 00:37:13,785 DEBUG [AbstractNetwork.py:518 - create_independent_networks()]: reach organization complete in 0.04627418518066406 seconds.
2024-01-09 00:37:13,785 INFO [AbstractNetwork.py:646 - initial_warmstate_preprocess()]: setting channel initial states ...
2024-01-09 00:37:13,785 DEBUG [AbstractNetwork.py:701 - initial_warmstate_preprocess()]: channel initial states complete in 0.0003256797790527344 seconds.
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): ZeroDivisionError: division by zero

At:
/usr/local/lib64/python3.9/site-packages/troute/AbstractNetwork.py(801): build_forcing_sets
/usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(108): main_v04

Any hint?

@robertbartel
Contributor Author

robertbartel commented Jan 9, 2024

Indeed, I encountered the ZeroDivisionError as well. I made some further modifications to the config - mostly under forcing_parameters - to get to the config I'll attach here. I think at this point the troute config is valid and the Docker image is built properly (with respect to troute). Note that I had to compress it to get it to attach, so you'll need to gunzip it first.

troute_config.yaml.gz

There is still some trouble though. In short, ngen seems to be outputting a bogus line at the end of one of the terminal nexus output files (edit: I originally said it was the one with the largest numeric feature id, but that was my mistake; the trouble was with tnx-1000000099_output.csv). I'm going to work on debugging that some today.

@yuqiong77

Bobby, which tnx file are you referring to specifically? I opened tnx-1000000687_output.csv (the one with the largest numeric id). The last line looked normal to me.

@aaraney
Member

aaraney commented Jan 9, 2024

@yuqiong77, we were having issues with tnx-1000000099_output.csv. There is an extra line with the contents 0, 4.08443 at the end of the file.

@yuqiong77

Thanks! I see that now. Although the last line in my file tnx-1000000099_output.csv looks a bit different:

743, 2012-10-31 23:00:00, 1.22727
.53858

@aaraney
Member

aaraney commented Jan 9, 2024

For sure, @yuqiong77! Well that is odd. I am just jumping back into this thread, so I am not sure if @robertbartel was using a different set of forcing data than you are for your simulations. With the modifications @robertbartel suggested making to the t-route config, were you able to get a full end-to-end run of NextGen working? Or are you still running into the divide-by-zero error?

@aaraney
Member

aaraney commented Jan 9, 2024

Probably ignore this; I'm just documenting it because it is related. As @robertbartel found out yesterday, the extra line in the tnx- csv file mentioned in my previous comment is the source of an InvalidIndexError that gets thrown by t-route (see collapsed stack trace).

stack trace
2024-01-08 20:00:06,620    INFO [AbstractNetwork.py:125 -    assemble_forcings()]: Creating a DataFrame of lateral inflow forcings ... 
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  InvalidIndexError: Reindexing only valid with uniquely valued Index objects
At:
  /usr/local/lib64/python3.9/site-packages/pandas/core/indexes/base.py(3875): get_indexer
  /usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(676): get_result
  /usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(393): concat
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(611): build_qlateral_array
  /usr/local/lib64/python3.9/site-packages/troute/AbstractNetwork.py(127): assemble_forcings
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(121): main_v04

In short, t-route is trying to concatenate pandas DataFrames by row. Each DataFrame is indexed by feature_id (so the 1000000099 in tnx-1000000099_output.csv), however because of the added line mentioned above, 1000000099 ends up being an index value twice. Pandas cannot concatenate by row DataFrames with non-unique index values.

@yuqiong77

yuqiong77 commented Jan 9, 2024

@aaraney I just tested with the config file that @robertbartel posted (I think my config had the binary_nexus_file_folder line commented out). The divide-by-zero error is gone. Now I'm getting the same InvalidIndexError message you posted above.

2024-01-09 16:50:42,859 INFO [AbstractNetwork.py:125 - assemble_forcings()]: Creating a DataFrame of lateral inflow forcings ...
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): InvalidIndexError: Reindexing only valid with uniquely valued Index objects

At:
/usr/local/lib64/python3.9/site-packages/pandas/core/indexes/base.py(3875): get_indexer
/usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(676): get_result
/usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(393): concat
/usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(611): build_qlateral_array
/usr/local/lib64/python3.9/site-packages/troute/AbstractNetwork.py(127): assemble_forcings
/usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(121): main_v04

@aaraney
Member

aaraney commented Jan 9, 2024

@yuqiong77, well at least we are having issues in the same place! From the directory where the NextGen output files are, can you please run:

find . -name "*_output.csv" -exec awk -F ',' 'NF != 3 {print FILENAME}' {} ';'

This should tell you which output files have spurious lines.
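
For anyone who would rather do the same check from Python, here is a rough equivalent of that find/awk command. It is a sketch under the same assumption the command makes (well-formed lines have exactly three comma-separated fields); the function name find_spurious_lines is made up, and unlike the awk version it also reports the offending line numbers.

```python
from pathlib import Path


def find_spurious_lines(root, expected_fields=3):
    """Return {path: [line numbers]} for *_output.csv lines with an unexpected field count."""
    problems = {}
    for path in sorted(Path(root).rglob("*_output.csv")):
        with path.open() as handle:
            bad = [
                number
                for number, line in enumerate(handle, start=1)
                if len(line.rstrip("\n").split(",")) != expected_fields
            ]
        if bad:
            problems[str(path)] = bad
    return problems


if __name__ == "__main__":
    # Scan the current directory, as the shell command does.
    for name, lines in find_spurious_lines(".").items():
        print(f"{name}: unexpected field count on line(s) {lines}")
```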

@ajkhattak

ajkhattak commented Jan 19, 2024

> > forcing_filename = '.'
> > output_filename = '.'

I don't think they matter if you are running in the nextgen framework, you could even point them to any fake or real file as they are not read/used (but again when running in the framework).

Noah-OM does provide precipitation and potential_ET as inputs to CFE (and topmodel too) via BMI. However, the output_filename is not used; I would guess this file is not even generated when running Noah-OM in the nextgen framework. @SnowHydrology can confirm this or correct me if I said something that does not make sense 😊

@SnowHydrology

@ajkhattak is correct. You could put any string you want in those two entries when running in NextGen. We put compiler directives to skip over the forcing read and output write routines. E.g. https://github.com/NOAA-OWP/noah-owp-modular/blob/04e8ac02532c9a292098f974cdb03aa03bfbfcd6/src/RunModule.f90#L210

@yuqiong77

@ajkhattak @SnowHydrology That's great to know! I was checking the source code NamelistRead.f90 and noticed that forcing_filename & output_filename inputs were required, but I did not dig deeper to see how those file names get used in other subroutines.

@SnowHydrology

SnowHydrology commented Jan 19, 2024

@yuqiong77 Were you ever able to track down the exact time and location (basin ID) of the failure? I'd be interested to see the forcing data corresponding to the failure just in case there is anything interesting in the file.

@yuqiong77

@SnowHydrology No, I have not been able to track down the exact time and location of the failure. The screen output or error message did not indicate any catchment ids. What makes the debugging difficult is that the error would only occur after 9 to 10 months into the run, which would take close to 20 hours clock time (in the serial mode, since at the moment running in the parallel mode would produce spurious lines in the ngen output). I'll try to dig a bit deeper to see if I can identify the catchment of the failure.

@SnowHydrology

@yuqiong77 that's likely because the error print out is coming from Noah-OM, which doesn't know which catchment it's running in. Maybe the output files can indicate where Noah-OM failed?

@SnowHydrology

Also tagging @GreyEvenson-NOAA here.

The Noah-OM issue is described here originally: #472 (comment)

@yuqiong77

yuqiong77 commented Jan 19, 2024

Some progress on identifying problematic catchments ... For 2933 catchments, the ngen outputs contain nan values from the very first time step, e.g.,

TimeStep,Time,RAIN_RATE,DIRECT_RUNOFF,GIUH_RUNOFF,NASH_LATERAL_RUNOFF,DEEP_GW_TO_CHANNEL_FLUX,Q_OUT,POTENTIAL_ET,ACTUAL_ET,GW_STORAGE,SOIL_STORAGE,SOIL_STORAGE_CHANGE,SURF_RUNOFF_SCHEME
0,2013-10-01 00:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000087867,0.000000000,-nan,0.000000000,0.000000000,1.000000000
1,2013-10-01 01:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000076420,0.000000000,-nan,0.000000000,0.000000000,1.000000000
2,2013-10-01 02:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000065125,0.000000000,-nan,0.000000000,0.000000000,1.000000000
3,2013-10-01 03:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000033762,0.000000000,-nan,0.000000000,0.000000000,1.000000000
4,2013-10-01 04:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000032287,0.000000000,-nan,0.000000000,0.000000000,1.000000000
5,2013-10-01 05:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000051751,0.000000000,-nan,0.000000000,0.000000000,1.000000000
6,2013-10-01 06:00:00,0.000000000,0.000000000,0.000000000,0.000000000,-nan,-nan,0.000046923,0.000000000,-nan,0.000000000,0.000000000,1.000000000

I checked the forcing files of these catchments and didn't find anything suspicious ... Will keep digging and report back
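
A check of this kind can be scripted roughly as follows. This is a sketch, not the script actually used: it only inspects the first data row of each file, and the cat-*.csv file name pattern and the function names are assumptions.

```python
import csv
import math
from pathlib import Path


def nan_columns_in_first_row(csv_path):
    """Return the column names whose value in the first data row is NaN."""
    with open(csv_path, newline="") as handle:
        first = next(csv.DictReader(handle), None)
    if first is None:
        return []
    flagged = []
    for column, value in first.items():
        try:
            # float() accepts "nan" and "-nan", so both spellings are caught.
            if math.isnan(float(value)):
                flagged.append(column)
        except (TypeError, ValueError):
            # Non-numeric fields such as the Time column are skipped.
            continue
    return flagged


def catchments_with_nan(output_dir, pattern="cat-*.csv"):
    """Map each matching output file name to its NaN columns (empty results omitted)."""
    report = {}
    for path in sorted(Path(output_dir).glob(pattern)):
        columns = nan_columns_in_first_row(path)
        if columns:
            report[path.name] = columns
    return report
```

Applied to the sample rows above, the first function would flag DEEP_GW_TO_CHANNEL_FLUX, Q_OUT and GW_STORAGE.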

@SnowHydrology

@yuqiong77 are you also saving the Noah-OM outputs? That would help with diagnosing any issues.

@yuqiong77

With help from @ajkhattak , I think we have found the issue. The parameter values in the CFE config files for those problematic catchments were not set correctly. Likely there was a bug in the script that I used to populate the parameter values from the regionalization.

@ajkhattak

We still need to dig deeper, though. Unless I am missing something, I don't think that wrong CFE config file inputs caused the Noah-OM FIRE = forcing%LWDN + energy%FIRA error, as the coupling between CFE and NOM is not two-way.

@aaraney
Member

aaraney commented Jan 22, 2024

@yuqiong77 / @ajkhattak was the supposed config issue with the dynamic_veg_option parameter?

@yuqiong77

@aaraney I doubt dynamic_veg_option would be an issue. I had successful runs before with dynamic_veg_option set to 4 (same as what we're using now).

@yuqiong77

Hi all, just wanted to let you know that Ahmad has been helping me debug, and he was able to run ngen successfully for a year with my realization config (CFE + Noah-OM) and BMI files for HUC-01 catchments. He was also able to run successfully without CFE. He carried out both runs outside of the container. @ajkhattak if I miscommunicated or missed something, please correct me.

But all of my runs (with or without CFE) within the container failed at around 9-10 months, with the same error (negative FIRA) in Noah-OM. So I'm wondering if the issue has something to do with the image that @robertbartel helped build, in particular with the Noah-OM module contained in that image?

@aaraney
Member

aaraney commented Jan 26, 2024

Thanks for reporting back @yuqiong77! I was afraid it would be difficult to diagnose. Unfortunately it could be any of a myriad of things: the version of the Noah-OM code @ajkhattak used, the compiler (gcc vs clang), the optimization level used by the compiler, or even the CPU architecture. @ajkhattak for starters, did you run the experiment on an arm or x86 machine?

@SnowHydrology

@yuqiong77 Thanks for this update. The error message you got is one of the few checks in Noah-OM that will stop the model. Although the error may manifest as emitted longwave <0; skin T may be wrong due to inconsistent input of SHDFAC with LAI, it can be caused by myriad issues.

@robertbartel
Contributor Author

@yuqiong77, thank you for the info. Just to confirm, were your runs always with serial ngen, or did you also experience the errors running parallel ngen? If you haven't tried a parallel ngen scenario because of the current issues with that and t-route, could you try your configs in a parallel run (with routing removed of course) and see if the error still occurs?

@yuqiong77

@robertbartel yes, all my latest runs that failed were in serial mode. My earlier runs in the parallel mode did not go far because of the t-route issue we ran into. I will launch a parallel run without routing for a year and report back.

@yuqiong77

@robertbartel The parallel version ran pretty fast, but unfortunately it still failed at around 7300 time steps with the same error.

Running timestep 7300
emitted longwave <0; skin T may be wrong due to inconsistent
input of SHDFAC with LAI
2147483647 2147483647 SHDFAC= 0.800000012 parameters%VAI= 4.74794531 TV= 286.935242 TG= -110.499847
LWDN= 366.299988 energy%FIRA= -15046.4004 water%SNOWH= 0.00000000
Exiting ...
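
To see why the message fires for these numbers, the relation quoted earlier in this thread (FIRE = forcing%LWDN + energy%FIRA) can be evaluated directly with the logged values. This is just the arithmetic; that the model stops when FIRE is not positive is an assumption based on the wording of the message.

```python
# Values copied from the log output above.
lwdn = 366.299988    # LWDN, downward longwave radiation
fira = -15046.4004   # energy%FIRA

# Relation quoted in this thread: FIRE = forcing%LWDN + energy%FIRA
fire = lwdn + fira
print(f"FIRE = {fire:.4f}")

# Assumed form of the check behind "emitted longwave <0".
if fire <= 0:
    print("emitted longwave is not positive for this timestep")
```

The very large negative FIRA goes together with the ground temperature of about -110 in the same log line, which is what points toward the energy balance inputs rather than the forcing.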

@ajkhattak

Sorry guys, there were some other issues. The use of STOP in Noah-OM terminates the program normally (it stops the execution and returns ZERO to the terminal). I replaced STOP with call ABORT so that Noah-OM terminates abnormally; my workflow can then catch this abnormal behavior and report the problematic catchment ID. The first such catchment I see is cat-2573.

I am going to test it on the latest Noah-OM master and see if I can reproduce the error.

@GreyEvenson-NOAA I will reach out to you to discuss the debugging further

sorry for any confusion...
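
The STOP versus ABORT distinction described above matters to any wrapper that inspects exit status. The following is a minimal sketch of such a wrapper, not the actual workflow used here: the function names and the per-catchment command are hypothetical, and the negative return code convention applies to POSIX systems.

```python
import signal
import subprocess
import sys


def classify_exit(returncode):
    """Describe how a child process ended, based on subprocess's return code."""
    if returncode == 0:
        return "normal exit"
    if returncode < 0:
        # On POSIX, subprocess reports termination by signal N as -N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"error exit (status {returncode})"


def run_catchment(catchment_id, command):
    """Run one model command and report the catchment if it ended abnormally."""
    result = subprocess.run(command, capture_output=True, text=True)
    outcome = classify_exit(result.returncode)
    if result.returncode != 0:
        print(f"{catchment_id}: {outcome}", file=sys.stderr)
    return outcome
```

A process that ends with exit status zero is indistinguishable from a successful run at this level, which is why a model that stops that way on an internal error cannot be caught by a wrapper like this one.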

@GreyREvenson

Afternoon all,

I spent some time looking for a problem in the energy balance simulations and the calculation of vegetation temperature and ground (below veg) temperature in EnergyMain and EtFluxModule but didn't find anything.

However, I noticed that in the namelist file that Ahmad gave to me, the soil type is specified as 14, which corresponds to 'water'. The simulation ended successfully -- and with realistic ground temp values -- after changing the soil type to something different (I tried several different non-water soil types). Can someone confirm my observation by changing isltyp to 13 or something else and re-running?

@yuqiong77: Does this catchment need to be simulated with a water soil type? If so, I will look into the matter further as the energy and temperature simulations are partly impacted by the properties of the top soil horizon.
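
To find other catchments configured the same way, the namelist files could be scanned for the isltyp setting. The sketch below assumes the value appears as a plain "isltyp = N" assignment on its own line; the function names are made up, and the value 14 for water is taken from the observation above.

```python
import re

# Matches assignments such as "isltyp = 14" in namelist text (case-insensitive).
ISLTYP_PATTERN = re.compile(r"^\s*isltyp\s*=\s*(\d+)", re.IGNORECASE | re.MULTILINE)

# Per the comment above, soil type 14 corresponds to water.
WATER_SOIL_TYPE = 14


def soil_type(namelist_text):
    """Return the integer isltyp value found in the namelist text, or None."""
    match = ISLTYP_PATTERN.search(namelist_text)
    return int(match.group(1)) if match else None


def uses_water_soil(namelist_text):
    """True if the namelist text sets isltyp to the water soil type."""
    return soil_type(namelist_text) == WATER_SOIL_TYPE


# Made-up fragment for illustration only.
print(uses_water_soil("  isltyp = 14\n"))
```

Reading each catchment's namelist file and passing its text to uses_water_soil would give a list of catchments to re-check.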

@SnowHydrology

@robertbartel, this issue might be close-able. @ajkhattak and @GreyEvenson-NOAA tracked down the issue in the Noah-OM namelist and we're working on a fix in the hydrofabric.

Actually, I just noticed, this particular issue has had quite the evolution, so I don't know if the original issue has been solved. The Noah-OM error has been.

@robertbartel
Contributor Author

Thanks @SnowHydrology. The scope did get pretty broad, but I think you are correct in that this can be closed. To be safe though, I want to outline what has been uncovered and the status of each item:

  • The original image dependency and build issues, related to NetCDF Python package
  • Failure running a parallel modeling job with t-route
    • Not directly a DMOD issue
    • Turned out to be a subtle problem with ngen and how it writes output files that throws off t-route
    • Can be worked around by running ngen serially
  • Failure running any modeling jobs when time range reaches certain length
    • Not directly a DMOD issue
    • Seems like choice of configured soil type was contributing to unstable behavior of Noah-OM

@aaraney, @yuqiong77, @ajkhattak, is this all correct? Have I missed anything?

@hellkite500
Member

@ajkhattak would you be willing to document/describe the workflow on this ngen issue? NOAA-OWP/ngen#723
I've been thinking about various ways to catch library exits and propagate errors through the model engine stack to capture additional information, and it sounds like you have done something that may be useful in helping formalize a mechanism in the model engine to provide these details.
