slurmstepd: error: _is_a_lwp: open() on cori-knl #3138
Noting that over the last several days, I'm still running into this issue. It happens about 5-10% of the time with the lowres coupled cases I'm trying and (I think) 0% of the time with other jobs (other repos and smaller node counts). I'm certainly not yet convinced it's an issue triggered by the cases I'm running. No new information from NERSC -- I was hoping they could verify whether or not it is for sure an issue on the system.
I also hit this same error (3 different times) running an F case with a recent master on 43 nodes and on 81 nodes. This makes me think it might not matter which repo/case is used, but that the error may have a higher chance of happening with more nodes.
I had already suspected these errors could be related to the cray hugepages module, but I'm still not sure. Here is the last reply from NERSC:

```
Some investigation into system logs suggests that the [...] One possibility
based on some curious messages in the system logs is that E3SM has a bad
interaction with hugepages. If you unload the [...]
```

I did try this, but I still see some errors. It's no longer the same error message, so I might close this issue and create another one.
Closing, as that error message appears not to stop the code, nor to be a solid indicator that there is an issue. I can't recall exactly, but I don't think I've seen the message in a while.
Several of my jobs have failed on cori-knl with errors like so:
I can resubmit the job and it will run (or it hits the same error, but then runs on the third try).
It's happening with lowres coupled cases as I try different PE layouts, but I'm not yet sure if it has anything to do with what I'm trying. It is true that many other (smaller node count) jobs in other repos have not hit this error. I reported it to NERSC and the response was:

```
this is indeed a Slurm error - it's something of a race condition but it
triggers only very rarely. If you resubmit your job I believe it should run
without any problems. If you resubmit and see this error again, please let
me know.
```
I replied that it does not seem to be very rare: it started happening on the 14th, and maybe 5-10% of my jobs hit it.