nvidia monitoring #40

Merged: 9 commits merged into NordicHPC:main from nvidia_smi on Jun 16, 2023

Conversation

lars-t-hansen
Collaborator

This adds GPU monitoring for nvidia cards, as well as zombie hunting functionality for the same. It uses the nvidia-smi tool for (relative) simplicity, but needs to invoke it twice and the output is documented as not being stable, so it may be better to consider going directly to the monitoring libraries. Also, the nvidia code has been folded into ps.rs for now but we could discuss whether it's worth disentangling.
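
For context, here is a minimal sketch of the kind of nvidia-smi pmon invocation and parsing described above. The flags, column layout, and names such as GpuProcSample and nvidia_pmon are illustrative assumptions, not the code in this PR.

```rust
// Illustrative sketch only: collecting per-process GPU data by shelling out
// to `nvidia-smi pmon`.  The flags, column positions, and the names below
// are assumptions for illustration, not this PR's actual code.
use std::process::Command;

#[derive(Debug)]
struct GpuProcSample {
    device: u32,  // GPU index
    pid: u32,     // owning process
    gpu_pct: f64, // SM utilization, percent
    mem_pct: f64, // memory utilization, percent
}

fn nvidia_pmon() -> Option<Vec<GpuProcSample>> {
    // `pmon -c 1` emits one sample of per-process utilization and exits.
    let out = Command::new("nvidia-smi").args(["pmon", "-c", "1"]).output().ok()?;
    if !out.status.success() {
        return None;
    }
    let mut samples = Vec::new();
    for line in String::from_utf8_lossy(&out.stdout).lines() {
        // Header and unit lines start with '#'; the data columns are roughly
        // "gpu pid type sm mem enc dec command", but the layout is documented
        // as unstable, which is the concern raised above.
        if line.starts_with('#') {
            continue;
        }
        let f: Vec<&str> = line.split_whitespace().collect();
        if f.len() < 5 {
            continue;
        }
        // Idle devices print '-' in the pid column; skip those rows.
        let (Ok(device), Ok(pid)) = (f[0].parse::<u32>(), f[1].parse::<u32>()) else {
            continue;
        };
        samples.push(GpuProcSample {
            device,
            pid,
            gpu_pct: f[3].parse().unwrap_or(0.0),
            mem_pct: f[4].parse().unwrap_or(0.0),
        });
    }
    Some(samples)
}
```

The text-parsing step is the fragile part, since the column layout may change between driver versions; going directly to the monitoring libraries (NVML) would avoid that, at the cost of extra complexity.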

lars-t-hansen changed the title from "Nvidia smi" to "nvidia monitoring" on Jun 14, 2023
@bast
Member

bast commented Jun 15, 2023

Thanks! I read through the change and will also test it locally. One thing I am thinking now is that we should perhaps check whether the node is GPU-capable before issuing the NVIDIA_PMON_COMMAND. But I will test how this behaves on computers and nodes that do not have that capability.

@lars-t-hansen
Copy link
Collaborator Author

> Thanks! I read through the change and will also test it locally. One thing I am thinking now is that we should perhaps check whether the node is GPU-capable before issuing the NVIDIA_PMON_COMMAND. But I will test how this behaves on computers and nodes that do not have that capability.

It works locally on my laptop (no nvidia) and on a local HPC system with only AMD cards, but in both cases only because nvidia-smi is not found. That feels a little brittle, and it might be good to have a better test than that. Also, I probably want to expand the functionality to encompass AMD cards.
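
One possible shape for such a pre-test, sketched here as an assumption rather than anything in this PR: look for driver artifacts on the node before running any nvidia-smi command, instead of relying on the binary being absent.

```rust
// Sketch of a possible GPU pre-test (not this PR's code): decide whether to
// issue NVIDIA_PMON_COMMAND at all by looking for artifacts that only exist
// on nodes with a working NVIDIA driver stack.
use std::path::Path;

fn node_has_nvidia_gpu() -> bool {
    // /sys/module/nvidia appears when the NVIDIA kernel module is loaded,
    // and /dev/nvidiactl when the driver has created its device nodes.
    Path::new("/sys/module/nvidia").exists() && Path::new("/dev/nvidiactl").exists()
}

fn collect_gpu_info() {
    if !node_has_nvidia_gpu() {
        // On laptops and AMD-only nodes we never even try to run nvidia-smi,
        // instead of depending on "command not found".
        return;
    }
    // ... run nvidia-smi here ...
}
```

An analogous check for the planned AMD support could look for /sys/module/amdgpu before running the corresponding tooling.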

@bast
Member

bast commented Jun 15, 2023

I confirm that it works fine on an nvidia-unaware laptop. Thanks for the work! But I agree that it feels a bit unsafe. I am mostly worried about processes starting to hang somewhere and piling up, so we could think about a robust pre-test.

@lars-t-hansen
Collaborator Author

Will the timeout in safe_command not prevent us from hanging?

@bast
Member

bast commented Jun 15, 2023

> Will the timeout in safe_command not prevent us from hanging?

Ideally/hopefully, yes. I hope I tested it well, but I am still really worried. Ideally, though, this should prevent it.
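
For reference, here is a sketch of one way a subprocess timeout can be enforced with only the standard library; the actual safe_command may be implemented quite differently, and run_with_timeout is an illustrative name.

```rust
// Illustrative only: polling-based subprocess timeout using std.  The real
// safe_command may use another mechanism (an external timeout, a crate, ...).
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Run `program args...`, killing it if it has not exited within `timeout`.
/// Returns captured stdout on success, None on error, failure, or timeout.
fn run_with_timeout(program: &str, args: &[&str], timeout: Duration) -> Option<String> {
    let mut child = Command::new(program)
        .args(args)
        .stdout(Stdio::piped())
        .spawn()
        .ok()?;
    let deadline = Instant::now() + timeout;
    loop {
        match child.try_wait().ok()? {
            Some(status) if status.success() => {
                let mut out = String::new();
                child.stdout.take()?.read_to_string(&mut out).ok()?;
                return Some(out);
            }
            Some(_) => return None, // exited with a nonzero status
            None if Instant::now() >= deadline => {
                // Timed out: kill and reap the child so nothing hangs around
                // or piles up, which is exactly the worry discussed above.
                let _ = child.kill();
                let _ = child.wait();
                return None;
            }
            None => thread::sleep(Duration::from_millis(50)),
        }
    }
}
```

One caveat with this sketch: a very chatty command could fill the stdout pipe buffer and block before exiting, so a production version would drain the pipe concurrently rather than only after the child exits.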

bast marked this pull request as ready for review June 16, 2023 08:24
bast merged commit 28db125 into NordicHPC:main Jun 16, 2023
@bast
Member

bast commented Jun 16, 2023

Thanks!

lars-t-hansen deleted the nvidia_smi branch June 23, 2023 10:44