
FATES P32x2 test appears to fail reliably when using I2000FatesCruRsGs compset #1518

Closed
glemieux opened this issue Oct 12, 2021 · 6 comments
Assignees: glemieux
Labels: bug (something is working incorrectly); priority: high (High priority to fix/merge soon, e.g., because it is a problem in important configurations)

@glemieux
Collaborator

glemieux commented Oct 12, 2021

Brief summary of bug

Trying to run ERP_D_P32x2_Ld3.f19_g17.I2000Clm50FatesCruRsGs.cheyenne_intel.clm-FatesColdDef results in an odd RUN failure during deallocation of a patch. However, if the test is run using I2000Clm50FatesCru instead, it seems to run more reliably.

General bug information

CTSM version you are using: ctsm5.1.dev058-36-g8b6c8f0c

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected:

Details of bug

This test builds successfully, but fails the run with the following error in cesm.log:

282 0:MCT::m_Router::initp_: GSMap indices not increasing...Will correct
283 0:MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
284 0:MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
285 0:MCT::m_Router::initp_: GSMap indices not increasing...Will correct
286 27:forrtl: severe (153): allocatable array or pointer is not allocated
287 27:Image              PC                Routine            Line        Source
288 27:cesm.exe           0000000004500843  Unknown               Unknown  Unknown
289 27:cesm.exe           000000000163DA65  edpatchdynamicsmo        2615  EDPatchDynamicsMod.F90
290 27:cesm.exe           000000000163A14A  edpatchdynamicsmo        2400  EDPatchDynamicsMod.F90
291 27:cesm.exe           00000000015D2F36  edmainmod_mp_ed_e         276  EDMainMod.F90
292 27:cesm.exe           0000000000A016E2  clmfatesinterface         924  clmfates_interfaceMod.F90
293 27:cesm.exe           000000000098F232  clm_driver_mp_clm        1111  clm_driver.F90
294 27:libiomp5.so        00002B4309015CC3  __kmp_invoke_micr     Unknown  Unknown
295 27:libiomp5.so        00002B4308F9B283  Unknown               Unknown  Unknown
296 27:libiomp5.so        00002B4308F9A24E  Unknown               Unknown  Unknown
297 27:libiomp5.so        00002B430901619C  Unknown               Unknown  Unknown
298 27:libdplace.so.0.0.  00002B4301A391AD  Unknown               Unknown  Unknown
299 27:libpthread-2.22.s  00002B43099D4734  Unknown               Unknown  Unknown
300 27:libc-2.22.so       00002B430AA92D3D  clone                 Unknown  Unknown
301 11:forrtl: severe (153): allocatable array or pointer is not allocated
302 11:Image              PC                Routine            Line        Source
303 11:cesm.exe           0000000004500843  Unknown               Unknown  Unknown
304 11:cesm.exe           000000000163DA65  edpatchdynamicsmo        2615  EDPatchDynamicsMod.F90
305 11:cesm.exe           000000000163A14A  edpatchdynamicsmo        2400  EDPatchDynamicsMod.F90
306 11:cesm.exe           00000000015D2F36  edmainmod_mp_ed_e         276  EDMainMod.F90
307 11:cesm.exe           0000000000A016E2  clmfatesinterface         924  clmfates_interfaceMod.F90
308 11:cesm.exe           000000000098F232  clm_driver_mp_clm        1111  clm_driver.F90
309 11:libiomp5.so        00002AE37CB1CCC3  __kmp_invoke_micr     Unknown  Unknown
310 11:libiomp5.so        00002AE37CAA2283  Unknown               Unknown  Unknown
311 11:libiomp5.so        00002AE37CAA124E  Unknown               Unknown  Unknown
312 11:libiomp5.so        00002AE37CB1D19C  Unknown               Unknown  Unknown
313 11:libdplace.so.0.0.  00002AE3755401AD  Unknown               Unknown  Unknown
314 11:libpthread-2.22.s  00002AE37D4DB734  Unknown               Unknown  Unknown
315 11:libc-2.22.so       00002AE37E599D3D  clone                 Unknown  Unknown
316 -1:MPT ERROR: MPI_COMM_WORLD rank 27 has terminated without calling MPI_Finalize()
317 -1:     aborting job
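For context, Intel Fortran raises `forrtl: severe (153)` when `deallocate` is called on an allocatable or pointer that is not currently allocated. A minimal sketch of that failure mode (the type and variable names below are illustrative only, not the actual code at line 2615 of EDPatchDynamicsMod.F90):

```fortran
program dealloc_153_sketch
  implicit none

  ! Hypothetical stand-in loosely echoing a "patch" with an allocatable member.
  type :: patch_t
     real, allocatable :: area(:)
  end type patch_t

  type(patch_t) :: currentPatch

  ! Deallocating a component that was never allocated (or was already freed)
  ! aborts with "forrtl: severe (153): allocatable array or pointer is not
  ! allocated" under Intel Fortran; gfortran reports a similar runtime error.
  deallocate(currentPatch%area)

  ! A defensive guard avoids the abort (though it can hide the logic error):
  ! if (allocated(currentPatch%area)) deallocate(currentPatch%area)
end program dealloc_153_sketch
```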

This was discovered in the course of PR #1275, in an attempt to reduce testing turnaround time by removing compsets that use MOSART with most FATES testmods. @rgknox and I believe we have seen this issue before, but it has usually cleared up upon resubmission, so we had typically chalked it up to machine instability.

Note that this is the only multi-threading test in the FATES test suite.

@glemieux glemieux added the tests additions or changes to tests label Oct 12, 2021
@billsacks billsacks added bug something is working incorrectly and removed tests additions or changes to tests labels Oct 12, 2021
@billsacks
Member

@glemieux I'm changing the type from tests to bug because this looks likely to be a real issue to me, not just something test-specific.

@glemieux
Collaborator Author

Discussing this a little more with @rgknox, he suggested that the issue could be the result of a race condition, since the failure occurs during an attempted deallocation of a patch. Perhaps this hypothesis gains some weight from the fact that the faster-running river-stub version of this test hits the race condition reliably, whereas the slower-running MOSART version hits it only rarely?

What would be a good method for debugging a potential race condition?

@billsacks
Member

Given that you said this is the only FATES threading test, my first suspicion would be threading. You could see if wrapping the relevant code in an OMP CRITICAL block solves the problem. If that solves the problem, it could indicate that there are issues with two threads trying to execute this block of code nearly simultaneously. If it doesn't solve the problem, though, it could suggest a more fundamental logic issue related to the management of these data structures with multiple threads.
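As a minimal, self-contained sketch of that kind of experiment (the program and variable names below are made up for illustration and are not the FATES patch code):

```fortran
! Sketch: serializing a shared test-and-deallocate with an OMP CRITICAL
! section so only one thread at a time can check and free the buffer.
program omp_critical_sketch
  implicit none
  real, allocatable :: shared_buf(:)
  integer :: i

  allocate(shared_buf(100))

  !$OMP PARALLEL DO
  do i = 1, 8
     ! Without the critical section, two threads could both pass the
     ! allocated() check and then both call deallocate; the second call
     ! would fail with "severe (153)".
     !$OMP CRITICAL (buf_dealloc)
     if (allocated(shared_buf)) deallocate(shared_buf)
     !$OMP END CRITICAL (buf_dealloc)
  end do
  !$OMP END PARALLEL DO
end program omp_critical_sketch
```

The named critical section makes the allocated() check and the deallocate atomic with respect to other threads, which is the behavior the experiment is probing for.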

I would probably also take this opportunity to increase your testing with multiple threads. It's really easy to introduce an accidental threading issue in CTSM, and I wonder if the same is true of FATES. And (like a lot of issues, but maybe more so) it is a lot easier to catch a threading issue when it is first introduced rather than trying to find it months or years later.

@ekluzek ekluzek added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Oct 14, 2021
@billsacks billsacks removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Oct 14, 2021
@billsacks billsacks added the priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations label Nov 29, 2021
@billsacks billsacks added this to Needs triage in CTSM: High priority via automation Nov 29, 2021
@billsacks
Member

@adrifoster ran into this (possibly in a slightly different way) in her recent testing. Since this is not super common, our feeling (from discussion with @adrifoster, @ekluzek, and @glemieux) is that, when we run into this, we'll try rerunning the test to get it to pass.

What @adrifoster saw in test ERP_D_P32x2_Ld3.f19_g17.I2000Clm50FatesCru.cheyenne_intel.clm-FatesColdDef was:

14:forrtl: severe (153): allocatable array or pointer is not allocated

without any stack backtrace.

@billsacks billsacks moved this from Needs triage to To do in CTSM: High priority Dec 2, 2021
@billsacks billsacks moved this from To do to Next in CTSM: High priority Dec 2, 2021
@glemieux glemieux self-assigned this Mar 24, 2022
@glemieux
Collaborator Author

glemieux commented Mar 24, 2022

Since #1275, the P32x2 multi-threading test has been working reliably after being changed to use I2000Clm50FatesCru via 282920a. @ekluzek @billsacks I feel like we can probably close this out for now.

@ekluzek
Collaborator

ekluzek commented Mar 24, 2022

Thanks for checking that @glemieux. There is currently a FATES threading test, ERP_D_P32x2_Ld3.f19_g17.I2000Clm50FatesCru.cheyenne_intel.clm-FatesColdDef, that is passing. PR #1275, mentioned above, went in at ctsm5.1.dev021. There is no mention of this in the expected fails list, so I'm closing this.

@ekluzek ekluzek closed this as completed Mar 24, 2022
CTSM: High priority automation moved this from Next to Done Mar 24, 2022