slurmstepd: error: _is_a_lwp: open() on cori-knl #3138

Closed
ndkeen opened this issue Aug 16, 2019 · 4 comments

@ndkeen
Contributor

ndkeen commented Aug 16, 2019

Several of my jobs have failed on cori-knl with errors like so:

   0: slurmstepd: error: *** STEP 23893144.0 ON nid02518 CANCELLED AT 2019-08-15T19:17:26 ***
2304: slurmstepd: error: _is_a_lwp: open() /proc/83665/status failed: No such file or directory
2496: slurmstepd: error: _is_a_lwp: open() /proc/37943/status failed: No such file or directory
2736: slurmstepd: error: _is_a_lwp: open() /proc/27304/status failed: No such file or directory

I can resubmit the job and it will run (or hit the same error again, but then run on the third try).
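For tallying how often these messages show up across job logs, a minimal Python sketch along the following lines could help. It assumes the leading number on each line is a task label and that the lines follow the format shown above; the names `classify`, `LINE_RE`, and `LWP_RE` are made up for this illustration and are not part of Slurm or E3SM.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log excerpt above:
#   "<label>: slurmstepd: error: <message>"
LINE_RE = re.compile(r"^\s*(?P<label>\d+):\s+slurmstepd: error: (?P<msg>.*)$")
LWP_RE = re.compile(r"_is_a_lwp: open\(\) /proc/(?P<pid>\d+)/status failed")


def classify(lines):
    """Count slurmstepd error lines by kind: 'lwp', 'cancelled', or 'other'."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a slurmstepd error line
        msg = m.group("msg")
        if LWP_RE.search(msg):
            counts["lwp"] += 1
        elif "CANCELLED" in msg:
            counts["cancelled"] += 1
        else:
            counts["other"] += 1
    return counts


sample = [
    "   0: slurmstepd: error: *** STEP 23893144.0 ON nid02518 CANCELLED AT 2019-08-15T19:17:26 ***",
    "2304: slurmstepd: error: _is_a_lwp: open() /proc/83665/status failed: No such file or directory",
    "2496: slurmstepd: error: _is_a_lwp: open() /proc/37943/status failed: No such file or directory",
    "2736: slurmstepd: error: _is_a_lwp: open() /proc/27304/status failed: No such file or directory",
]
print(classify(sample))
```

Counting the `_is_a_lwp` lines separately from everything else makes it easier to see whether a failed job also logged some other error.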

It's happening with lowres coupled cases as I try different PE layouts, but I'm not yet sure if it has anything to do with what I'm trying. It is true that many other (smaller node count) jobs in other repos have not hit this error. I reported to NERSC and the response was:

> this is indeed a Slurm error - it's something of a race condition but it triggers only very rarely. If you resubmit your job I believe it should run without any problems. If you resubmit and see this error again, please let me know.

I replied that it does not seem to be very rare. It started happening on the 14th and maybe 5-10% of my jobs hit this.
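To make the "race condition" remark concrete, here is a small Python sketch of the general pattern: a monitor remembers a PID, the process exits, and a later lookup of `/proc/<pid>/status` finds nothing. This is only an illustration of that kind of race on Linux, not Slurm's actual code; the helper `read_proc_status` is a hypothetical name.

```python
import subprocess
import sys


def read_proc_status(pid):
    """Return the text of /proc/<pid>/status, or None if the entry is gone."""
    try:
        with open(f"/proc/{pid}/status") as f:
            return f.read()
    except (FileNotFoundError, ProcessLookupError):
        # The process exited before (or while) we looked at it.
        return None


# Start a short-lived child and remember its PID, as a monitor might.
child = subprocess.Popen([sys.executable, "-c", "pass"])
pid = child.pid
child.wait()  # the child has now exited and been reaped

# Looking the PID up afterwards is the analogue of the failed open() in the log.
status = read_proc_status(pid)
print("status after exit:", "missing" if status is None else "present")
```

A lookup that tolerates the missing entry, as `read_proc_status` does here, treats the vanished process as a normal outcome rather than an error.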

@ndkeen
Contributor Author

ndkeen commented Aug 18, 2019

Noting that over the last several days, I'm still running into this issue. It happens about 5-10% of the time with the lowres coupled cases I'm trying and (I think) 0% of the time with other jobs (other repos and smaller node counts). I'm certainly not yet convinced it's an issue triggered by the cases I'm running. No new information from NERSC -- I was hoping they could verify if it is for sure an issue on the system or not.

@ndkeen
Contributor Author

ndkeen commented Aug 19, 2019

I also hit this same error (3 different times) with a recent master running an F case on 43 nodes and on 81 nodes. This makes me think it might not matter which repo/case is used, but that there may be a higher chance of it happening with more nodes.

@ndkeen
Contributor Author

ndkeen commented Aug 23, 2019

I had already suspected these errors could be related to the Cray hugepages module, but I'm still not sure. Here is the last reply from NERSC:

> Some investigation into system logs suggests that the `__is_a_lwp` error is in fact a red herring - apparently that function is an internal Slurm function used purely for its own accounting, and should not cause any program to crash, even if the `__is_a_lwp` function fails. So the error messages you see are likely misleading - the code is crashing independently of `__is_a_lwp`.
>
> One possibility based on some curious messages in the system logs is that E3SM has a bad interaction with hugepages. If you unload the craype-hugepages2M module before you run e3sm.exe, do you still get the error? (You shouldn't need to recompile the code without the hugepages module - I believe simply unloading it at run time should be sufficient to disable them).

I did try this, but I still see some errors. It's no longer the same error message, so I might close this issue and create another one.
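When testing with and without hugepages, it can help to confirm from inside the job environment whether the module is actually loaded. The Python sketch below assumes the module system exports a colon-separated `LOADEDMODULES` environment variable (as Environment Modules does); the function name `loaded_hugepages_modules` and the example string are hypothetical.

```python
import os


def loaded_hugepages_modules(loadedmodules=None):
    """Return loaded module names that start with 'craype-hugepages'."""
    if loadedmodules is None:
        # Assumption: the module system sets LOADEDMODULES in the environment.
        loadedmodules = os.environ.get("LOADEDMODULES", "")
    names = [m for m in loadedmodules.split(":") if m]
    return [m for m in names if m.startswith("craype-hugepages")]


# Hypothetical value, for illustration only.
example = "modules/3.2.11.1:craype-hugepages2M:craype-mic-knl"
print(loaded_hugepages_modules(example))
```

Calling `loaded_hugepages_modules()` with no argument inspects the current environment; an empty list would suggest no hugepages module is loaded.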

@ndkeen
Contributor Author

ndkeen commented Dec 4, 2019

Closing, as that error message does not appear to stop the code, nor is it a solid indicator that there is an issue. I can't recall for sure, but I don't think I've seen the message in a while.
