Intermittent failures on busy systems - 2s timeout may be too low #92

Closed
lars-t-hansen opened this issue Aug 17, 2023 · 2 comments

Labels: enhancement (New feature or request)

Comments

@lars-t-hansen
Collaborator

On some of our heavily loaded ML nodes I see intermittent failures when running the nvidia-smi command. The error (here on ml8, which has been overly busy since the end of vacation) is

[2023-08-17T03:00:04Z ERROR sonar::ps] GPU (Nvidia) process listing failed: "Hung(\"COMMAND:\\nnvidia-smi pmon -c 1 -s mu\")"

which is raised either in response to a timeout or to a SIGTERM exit. It would be useful to distinguish the two, so that's one tweak to implement. But it is most likely a timeout. It's possible that the timeout should be longer, or that the default of 2s should be overridable on the command line, to let sonar adapt more easily to its environment.
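For the "distinguish the two" tweak, here is a minimal sketch of running the subprocess with a caller-supplied deadline and separating the timeout case from a signal exit. This is not sonar's actual command runner: the `run_with_timeout` helper, the `CmdError` variants, and the use of the `wait-timeout` crate are assumptions for illustration only; the hard-coded default and any command-line override would plug in where `timeout_secs` is supplied.

```rust
// Sketch only: run_with_timeout, CmdError, and the wait-timeout crate
// are illustrative assumptions, not sonar's code.
use std::os::unix::process::ExitStatusExt;
use std::process::{Command, Stdio};
use std::time::Duration;

use wait_timeout::ChildExt;

#[derive(Debug)]
enum CmdError {
    TimedOut(String),       // deadline expired before the child finished
    Signalled(String, i32), // child was killed by a signal (e.g. SIGTERM)
    Failed(String, String), // child could not be spawned or exited nonzero
}

fn run_with_timeout(cmdline: &[&str], timeout_secs: u64) -> Result<(), CmdError> {
    let cmd = cmdline.join(" ");
    let mut child = Command::new(cmdline[0])
        .args(&cmdline[1..])
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()
        .map_err(|e| CmdError::Failed(cmd.clone(), e.to_string()))?;

    match child.wait_timeout(Duration::from_secs(timeout_secs)) {
        Ok(Some(status)) if status.success() => Ok(()),
        Ok(Some(status)) => match status.code() {
            Some(code) => Err(CmdError::Failed(cmd, format!("exit code {code}"))),
            // On Unix, a missing exit code means the child died from a signal.
            None => Err(CmdError::Signalled(cmd, status.signal().unwrap_or(0))),
        },
        // No status within the deadline: kill the child and report a timeout,
        // which is now distinct from the signal case above.
        Ok(None) => {
            let _ = child.kill();
            let _ = child.wait();
            Err(CmdError::TimedOut(cmd))
        }
        Err(e) => Err(CmdError::Failed(cmd, e.to_string())),
    }
}

fn main() {
    // The 2s default would instead come from a command-line option.
    let timeout_secs = 2;
    match run_with_timeout(&["nvidia-smi", "pmon", "-c", "1", "-s", "mu"], timeout_secs) {
        Ok(()) => println!("nvidia-smi finished in time"),
        Err(e) => eprintln!("GPU (Nvidia) process listing failed: {e:?}"),
    }
}
```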

@lars-t-hansen
Collaborator Author

It's a timeout. nvidia-smi can be slow to respond when the GPUs are very busy; I'm observing this live now:

...,cpu%=2186.9,cpukib=164278284,"gpus=3,1,0,2",gpu%=390,gpumem%=299,gpukib=20684800,cputime_sec=817224,rolledup=36

@lars-t-hansen
Collaborator Author

(Running an experiment with 5s now. I got flooded with timeouts from ML3 overnight; ML3 is at 100% GPU and >100% RAM use.)

@bast closed this as completed in 66fe3e0 on Oct 4, 2023