nvidia monitoring #40

Merged: 9 commits merged into NordicHPC:main from nvidia_smi on Jun 16, 2023

Conversation

lars-t-hansen
Collaborator

This adds GPU monitoring for nvidia cards, as well as zombie hunting functionality for the same. It uses the nvidia-smi tool for (relative) simplicity, but needs to invoke it twice and the output is documented as not being stable, so it may be better to consider going directly to the monitoring libraries. Also, the nvidia code has been folded into ps.rs for now but we could discuss whether it's worth disentangling.
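
For context, here is a minimal sketch of the kind of nvidia-smi pmon invocation and parsing described above. The flags, column layout, and names such as GpuProcSample and nvidia_pmon are illustrative assumptions, not the code in this PR.

```rust
// Illustrative sketch only: collecting per-process GPU data by shelling out
// to `nvidia-smi pmon`.  The flags, column positions, and the names below
// are assumptions for illustration, not this PR's actual code.
use std::process::Command;

#[derive(Debug)]
struct GpuProcSample {
    device: u32,  // GPU index
    pid: u32,     // owning process
    gpu_pct: f64, // SM utilization, percent
    mem_pct: f64, // memory utilization, percent
}

fn nvidia_pmon() -> Option<Vec<GpuProcSample>> {
    // `pmon -c 1` emits one sample of per-process utilization and exits.
    let out = Command::new("nvidia-smi").args(["pmon", "-c", "1"]).output().ok()?;
    if !out.status.success() {
        return None;
    }
    let mut samples = Vec::new();
    for line in String::from_utf8_lossy(&out.stdout).lines() {
        // Header and unit lines start with '#'; the data columns are roughly
        // "gpu pid type sm mem enc dec command", but the layout is documented
        // as unstable, which is the concern raised above.
        if line.starts_with('#') {
            continue;
        }
        let f: Vec<&str> = line.split_whitespace().collect();
        if f.len() < 5 {
            continue;
        }
        // Idle devices print '-' in the pid column; skip those rows.
        let (Ok(device), Ok(pid)) = (f[0].parse::<u32>(), f[1].parse::<u32>()) else {
            continue;
        };
        samples.push(GpuProcSample {
            device,
            pid,
            gpu_pct: f[3].parse().unwrap_or(0.0),
            mem_pct: f[4].parse().unwrap_or(0.0),
        });
    }
    Some(samples)
}
```

The text-parsing step is the fragile part, since the column layout may change between driver versions; going directly to the monitoring libraries (NVML) would avoid that, at the cost of extra complexity.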

lars-t-hansen changed the title from "Nvidia smi" to "nvidia monitoring" on Jun 14, 2023
@bast
Member

bast commented Jun 15, 2023

Thanks! I read through the change and will also test it locally. One thing I am thinking now is that we should perhaps check whether the node is GPU-capable before issuing the NVIDIA_PMON_COMMAND. But I will test how this behaves on computers and nodes that do not have that capability.

@lars-t-hansen
Copy link
Collaborator Author

> Thanks! I read through the change and will also test it locally. One thing I am thinking now is that we should perhaps check whether the node is GPU-capable before issuing the NVIDIA_PMON_COMMAND. But I will test how this behaves on computers and nodes that do not have that capability.

It works locally on my laptop (no nvidia) and on a local HPC system with only AMD cards, but in both cases only because nvidia-smi is not found. That feels a little brittle, and it might be good to have a better test than that. Also, I probably want to expand the functionality to encompass AMD cards.
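
One possible shape for such a pre-test, sketched here as an assumption rather than anything in this PR: look for driver artifacts on the node before running any nvidia-smi command, instead of relying on the binary being absent.

```rust
// Sketch of a possible GPU pre-test (not this PR's code): decide whether to
// issue NVIDIA_PMON_COMMAND at all by looking for artifacts that only exist
// on nodes with a working NVIDIA driver stack.
use std::path::Path;

fn node_has_nvidia_gpu() -> bool {
    // /sys/module/nvidia appears when the NVIDIA kernel module is loaded,
    // and /dev/nvidiactl when the driver has created its device nodes.
    Path::new("/sys/module/nvidia").exists() && Path::new("/dev/nvidiactl").exists()
}

fn collect_gpu_info() {
    if !node_has_nvidia_gpu() {
        // On laptops and AMD-only nodes we never even try to run nvidia-smi,
        // instead of depending on "command not found".
        return;
    }
    // ... run nvidia-smi here ...
}
```

An analogous check for the planned AMD support could look for /sys/module/amdgpu before running the corresponding tooling.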

@bast
Member

bast commented Jun 15, 2023

I confirm that it works fine on an nvidia-unaware laptop. Thanks for the work! But I agree that it feels a bit unsafe. I am mostly worried about processes starting to hang somewhere and piling up, so we could think about a robust pre-test.

@lars-t-hansen
Collaborator Author

Will the timeout in safe_command not prevent us from hanging?

@bast
Member

bast commented Jun 15, 2023

> Will the timeout in safe_command not prevent us from hanging?

Ideally/hopefully, yes. I hope I tested it well, but I am still really worried. Ideally, though, this should prevent it.
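
For reference, here is a sketch of one way a subprocess timeout can be enforced with only the standard library; the actual safe_command may be implemented quite differently, and run_with_timeout is an illustrative name.

```rust
// Illustrative only: polling-based subprocess timeout using std.  The real
// safe_command may use another mechanism (an external timeout, a crate, ...).
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Run `program args...`, killing it if it has not exited within `timeout`.
/// Returns captured stdout on success, None on error, failure, or timeout.
fn run_with_timeout(program: &str, args: &[&str], timeout: Duration) -> Option<String> {
    let mut child = Command::new(program)
        .args(args)
        .stdout(Stdio::piped())
        .spawn()
        .ok()?;
    let deadline = Instant::now() + timeout;
    loop {
        match child.try_wait().ok()? {
            Some(status) if status.success() => {
                let mut out = String::new();
                child.stdout.take()?.read_to_string(&mut out).ok()?;
                return Some(out);
            }
            Some(_) => return None, // exited with a nonzero status
            None if Instant::now() >= deadline => {
                // Timed out: kill and reap the child so nothing hangs around
                // or piles up, which is exactly the worry discussed above.
                let _ = child.kill();
                let _ = child.wait();
                return None;
            }
            None => thread::sleep(Duration::from_millis(50)),
        }
    }
}
```

One caveat with this sketch: a very chatty command could fill the stdout pipe buffer and block before exiting, so a production version would drain the pipe concurrently rather than only after the child exits.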

bast marked this pull request as ready for review June 16, 2023 08:24
bast merged commit 28db125 into NordicHPC:main Jun 16, 2023
@bast
Member

bast commented Jun 16, 2023

Thanks!

lars-t-hansen deleted the nvidia_smi branch June 23, 2023 10:44