Intermittent failures on busy systems - 2s timeout may be too low #92

Closed
lars-t-hansen opened this issue Aug 17, 2023 · 2 comments

Labels: enhancement (New feature or request)

Comments

@lars-t-hansen
Collaborator

On some of our heavily loaded ML nodes I see intermittent failures when running the nvidia-smi command. The error (here on ml8, which has been overly busy since the end of vacation) is

[2023-08-17T03:00:04Z ERROR sonar::ps] GPU (Nvidia) process listing failed: "Hung(\"COMMAND:\\nnvidia-smi pmon -c 1 -s mu\")"

which is raised either in response to a timeout or to a SIGTERM exit. It would be useful to distinguish the two, so that's one tweak to implement. But it is most likely a timeout. It's possible that the timeout should be longer, or that the default of 2s should be overridable on the command line, to let sonar adapt more easily to its environment.
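For the "distinguish the two" tweak, here is a minimal sketch of running the subprocess with a caller-supplied deadline and separating the timeout case from a signal exit. This is not sonar's actual command runner: the `run_with_timeout` helper, the `CmdError` variants, and the use of the `wait-timeout` crate are assumptions for illustration only; the hard-coded default and any command-line override would plug in where `timeout_secs` is supplied.

```rust
// Sketch only: run_with_timeout, CmdError, and the wait-timeout crate
// are illustrative assumptions, not sonar's code.
use std::os::unix::process::ExitStatusExt;
use std::process::{Command, Stdio};
use std::time::Duration;

use wait_timeout::ChildExt;

#[derive(Debug)]
enum CmdError {
    TimedOut(String),       // deadline expired before the child finished
    Signalled(String, i32), // child was killed by a signal (e.g. SIGTERM)
    Failed(String, String), // child could not be spawned or exited nonzero
}

fn run_with_timeout(cmdline: &[&str], timeout_secs: u64) -> Result<(), CmdError> {
    let cmd = cmdline.join(" ");
    let mut child = Command::new(cmdline[0])
        .args(&cmdline[1..])
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()
        .map_err(|e| CmdError::Failed(cmd.clone(), e.to_string()))?;

    match child.wait_timeout(Duration::from_secs(timeout_secs)) {
        Ok(Some(status)) if status.success() => Ok(()),
        Ok(Some(status)) => match status.code() {
            Some(code) => Err(CmdError::Failed(cmd, format!("exit code {code}"))),
            // On Unix, a missing exit code means the child died from a signal.
            None => Err(CmdError::Signalled(cmd, status.signal().unwrap_or(0))),
        },
        // No status within the deadline: kill the child and report a timeout,
        // which is now distinct from the signal case above.
        Ok(None) => {
            let _ = child.kill();
            let _ = child.wait();
            Err(CmdError::TimedOut(cmd))
        }
        Err(e) => Err(CmdError::Failed(cmd, e.to_string())),
    }
}

fn main() {
    // The 2s default would instead come from a command-line option.
    let timeout_secs = 2;
    match run_with_timeout(&["nvidia-smi", "pmon", "-c", "1", "-s", "mu"], timeout_secs) {
        Ok(()) => println!("nvidia-smi finished in time"),
        Err(e) => eprintln!("GPU (Nvidia) process listing failed: {e:?}"),
    }
}
```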

@lars-t-hansen
Collaborator Author

It's a timeout. nvidia-smi can be slow to respond when the GPUs are very busy; I'm observing this live now:

...,cpu%=2186.9,cpukib=164278284,"gpus=3,1,0,2",gpu%=390,gpumem%=299,gpukib=20684800,cputime_sec=817224,rolledup=36

@lars-t-hansen
Collaborator Author

(Running an experiment with 5s now. I got flooded with timeouts from ML3 overnight; ML3 is at 100% GPU and >100% RAM use.)

@bast closed this as completed in 66fe3e0 on Oct 4, 2023