crash when writing basin output #56

Closed
mee067 opened this issue May 2, 2024 · 13 comments
Labels
bug: Unexpected or incorrect behaviour observed in a component of MESH
pending-update: The issue is resolved pending a future release

Comments

@mee067

mee067 commented May 2, 2024

Code compiled earlier (maybe a month or two ago) did work. I re-compiled the same code (using Intel 2018 and 2021) and I am now getting strange crashes (segmentation fault). I recompiled with symbols on, but only the Intel 2021 build gives a clue; the 2018 build gave no information on where the issue is. This is the error dump:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
mpi_sa_mesh_2021_  0000000000B6C8DA  Unknown               Unknown  Unknown
libpthread-2.30.s  00001555554270F0  Unknown               Unknown  Unknown
mpi_sa_mesh_2021_  0000000000AC7B0D  save_basin_output         946  save_basin_output.f90
mpi_sa_mesh_2021_  0000000000B44760  MAIN__                   1031  MESH_driver.f90
mpi_sa_mesh_2021_  000000000040CC92  Unknown               Unknown  Unknown
libc-2.30.so       0000155554726E1B  __libc_start_main     Unknown  Unknown
mpi_sa_mesh_2021_  000000000040CBAA  Unknown               Unknown  Unknown

This is a gridded setup. I first thought it had something to do with the LongSimFix, but it does not seem to. Line 946 of save_basin_output.f90 reads:

                    if (WF_RTE_fstflout%fout_hyd) write(iun, 1010, advance = 'no') &
                    fms%stmg%qomeas%val(i), out%d%grid%qo(fms%stmg%meta%rnk(i))

I did a bit of debugging and found that the rank of the third gauge had been set to 1112486707, while the basin has only 3448 active grid cells.
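A minimal sketch of a guard that would report this kind of bad index instead of crashing. The names na, rnk and qo and the first two rank values are hypothetical stand-ins for illustration, not the actual MESH variables; only the third rank and the cell count come from the debugging above.

```fortran
program check_gauge_rank
    implicit none

    ! Hypothetical stand-ins: na is the number of active grid cells,
    ! rnk holds the rank (grid cell index) of each gauge, qo the simulated flow.
    integer, parameter :: na = 3448
    integer :: rnk(3)
    real :: qo(na)
    integer :: i

    ! The first two ranks are made up; the third is the corrupted value seen in the debugger.
    rnk = (/ 120, 2045, 1112486707 /)
    qo = 0.0

    do i = 1, size(rnk)
        if (rnk(i) < 1 .or. rnk(i) > na) then
            ! Report the bad index instead of reading outside the bounds of qo.
            write(*, '(a, i0, a, i0, a, i0)') 'gauge ', i, ': rank ', rnk(i), ' is outside 1..', na
        else
            write(*, '(a, i0, a, f8.3)') 'gauge ', i, ': qo = ', qo(rnk(i))
        end if
    end do
end program check_gauge_rank
```

Building with run-time bounds checking (for example `-check bounds` with ifort or `-fcheck=bounds` with gfortran) should flag the same out-of-range access at the offending line without a manual guard.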

Any ideas?

@mee067
Author

mee067 commented May 3, 2024

This crash is similar to #22.

The same number, 1112486707, got assigned to the rank of gauge #3; not sure where or how.

@dprincz dprincz self-assigned this May 23, 2024
@dprincz dprincz added the bug Unexpected or incorrect behaviour observed in a component of MESH label May 23, 2024
@dprincz
Contributor

dprincz commented May 23, 2024

@mee067 provided sample setup

@dprincz dprincz added the pending-reproducibility-by-sample The issuer has provided a sample to demonstrate the issue label May 24, 2024
@mee067
Author

mee067 commented Jul 2, 2024

Any progress on this issue? It is holding me back from trying anything new. Compiling works, but runs hit the issue very often.

@mee067
Author

mee067 commented Jul 9, 2024

Somehow this seems to be related to resuming. Runs starting with RESUMEFLAG 0 worked fine; runs starting with RESUMEFLAG 6 gave the above issue.

@dprincz dprincz removed the pending-reproducibility-by-sample The issuer has provided a sample to demonstrate the issue label Aug 8, 2024
@dprincz
Contributor

dprincz commented Aug 8, 2024

@mee067 I can't replicate the behavior with the sample provided. I changed the simulation start to 1951/001 based on the start date of the files and find it runs normally with RESUMEFLAG 0, RESUMEFLAG 6, and RESUMEFLAG 6 auto.

Can you provide me a fully-contained example where I shouldn't need to modify anything?

@dprincz dprincz added the pending-reproducibility-by-sample The issuer has provided a sample to demonstrate the issue label Aug 8, 2024
@mee067
Author

mee067 commented Aug 12, 2024

Setup and code sent by email. I see tgz is an accepted format, but the upload still said "we won't support that file type". 12 MB is possibly too large for GitHub.

@dprincz dprincz added pending-update The issue is resolved pending a future release and removed pending-reproducibility-by-sample The issuer has provided a sample to demonstrate the issue labels Aug 13, 2024
@dprincz
Contributor

dprincz commented Aug 13, 2024

One of the changes by MESH-Model/MESH_Code@52c7367 "LongSimFix" was to revise the use of fhr from tracking incremental hours in the simulation to always being equal to 1.

When originally implemented, a temporary fix for observed variables arbitrarily allocated these arrays to 999999 blocks (observation hours). This caused issues for simulations exceeding 999999 hours (approx. 114 years); for example, climate simulations from 1980 to 2100, which would likely cause an index out-of-bounds error in 2094 or later.
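As a quick check of those figures, a small sketch that converts the 999999-hour limit to years. It assumes an average year of 365.25 days, which is an approximation introduced here rather than anything taken from the code.

```fortran
program hours_to_years
    implicit none

    integer, parameter :: max_hours = 999999
    ! Assumption: an average year of 365.25 days.
    real, parameter :: hours_per_year = 24.0*365.25
    real :: years

    years = real(max_hours)/hours_per_year

    ! Length of the limit in years, then the calendar year a 1980 start reaches it.
    write(*, '(a, f6.1, a)') '999999 hours is about ', years, ' years'
    write(*, '(a, i0)') 'a run starting in 1980 reaches the limit around ', 1980 + int(years)
end program hours_to_years
```

This gives roughly 114.1 years, so a simulation starting in 1980 would exceed the allocation around 2094.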

Changes by @mee067 to address this reduced the allocation of these variables from 999999 to (:, 1), with the current record fhr always equal to 1. However, the change only initialized fhr = 1 when resuming from seq format resume files. Without that condition, fhr was not properly initialized, which seems to have contributed to the corruption of data used by the module, i.e., the record of observations indexed by gauge ID, causing the indexing issue observed with the streamflow observations printed by save_basin_output.

Adding the same assignment, fhr = 1, during initialization resolves this issue.

!> Set fhr to 1 (the counting of incremental hours in the routing simulation is not used like it is in standalone WATROUTE).
fhr = 1

Additional modifications were also made to resuming from the seq format resume files, for the case where the file was written by a previous version of the code in which fhr /= 1.

real(kind = 4), dimension(:, :), allocatable :: lake_elv_temp
...

if (fms%rsvr%n > 0) then
    allocate(lake_elv_temp(noresv, fhr))
    read(iun) lake_elv_temp(:, fhr)
    lake_elv(:, 1) = lake_elv_temp(:, fhr)
else
    read(iun)
end if
...

!> Set fhr to 1 (the counting of incremental hours in the routing simulation is not used like it is in standalone WATROUTE).
fhr = 1

@dprincz
Contributor

dprincz commented Aug 13, 2024

These changes are incorporated in dprincz/MESH-Dev@5fc2d3b.

@dprincz dprincz closed this as completed Aug 13, 2024