Hang in dp_coupling::d_p_coupling with newer module versions and compilers (GNU version 12.3)
#6451
After adding a temporary work-around for the GNU issue noted above, I can now run with a GNU-built exe, and I see that it suffers the same fate: it hangs in what looks like the same place. I can also still see the hang without the test modifier, for both intel and gnu.
Without DEBUG, the test completes (for both intel and gnu).
Since there appears to be a difference in behavior between DEBUG and OPT, I'm trying a few different things. If I stay with DEBUG but simplify the flags to only use
I also tried running with OPT, but without
Adjusting compiler flags, I was able to get a stack trace, which may or may not be the same issue.
With a slightly different flag variation, I see this error:
I've been adjusting compiler flags in an attempt to debug. With
I'm not sure what has changed, but when I try the test again with the updated repo, it completes for intel (using the updated version). I can check again with the updated GNU compiler as well, but this may be enough for me to make a PR to at least update the intel compiler version.
With new Slingshot software (s2.2 h11.0.1) now installed on Perlmutter, there were some hangs in init for certain cases at higher node counts. Using the environment variable FI_MR_CACHE_MONITOR=kdreg2 avoids any issues so far.
kdreg2 is another option for memory cache monitoring: a Linux kernel module with open-source licensing. It comes with the HPE Slingshot host software distribution (optionally installed) and may one day be the default.
Regarding performance, it seems about the same. For one HR F-case at 256 nodes, using kdreg2 was about 1% slower.
Fixes #6655
I also found some older issues (some with lower node counts) that this fixes:
Fixes #6516
Fixes #6451
Fixes #6521
[bfb]
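The workaround above amounts to setting one environment variable before the job starts. A minimal sketch in Python, assuming the run is launched as a child process; `env_with_kdreg2` is a hypothetical helper for illustration, and how the machine configuration actually applies this setting on Perlmutter is not shown in this thread:

```python
import os


def env_with_kdreg2(base_env=None):
    """Return a copy of an environment with the kdreg2 cache monitor selected.

    Hypothetical helper for illustration only; not part of any E3SM tooling.
    """
    # Copy so the caller's mapping (or os.environ) is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # Variable name and value are taken from the description above.
    env["FI_MR_CACHE_MONITOR"] = "kdreg2"
    return env


if __name__ == "__main__":
    env = env_with_kdreg2()
    print(env["FI_MR_CACHE_MONITOR"])
```

The returned mapping can be passed as the `env=` argument of `subprocess.run` when launching the job command, so the setting applies only to that run rather than to the whole shell session.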
I originally reported this issue with the newer Intel compiler, but as of Sep 7, 2024, I no longer see the issue with intel and have created a PR to upgrade (#6596). I still see the issue with GNU. I might close this issue and open a fresh one for GNU only, but for now I'm leaving the text below as-is:
I'm trying to update module versions on pm-cpu, but I have hit a few issues. One with intel is that this test hangs in init:
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_intel.allactive-wcprodssp
I'm noting the hang in HOMME, but as I don't know the root cause, it may not actually be an issue there.
The test works with the current intel version (intel/2023.1.0), and what I'd like to use is the new default for the machine (intel/2023.2.0).
We see this in cpl.log (to indicate still in init):
Looking at where the stack is on the compute node:
Above, I pasted results from running on muller-cpu, but I can see the same behavior on pm-cpu (I just need to update the module versions).
I made a copy of the case on PSCRATCH in case someone wanted to look at logs:
I would like to try this test with other compilers, but we currently have a segfault with GNU (#6428).