nvidia monitoring #40
Conversation
Thanks! I read through the change and will also test it locally. One thing I am thinking now is that we should perhaps somehow check whether the node is GPU-able/aware before issuing the NVIDIA_PMON_COMMAND. But I will test how this behaves on computers and nodes that do not have such capability.
It works locally on my laptop (no nvidia) and on a local HPC system with only AMD cards, but the reason it works in both cases is that nvidia-smi is not found. That feels a little brittle, and it might be good to have a better test than that. Also, I probably want to expand the functionality to encompass AMD cards.
I confirm that it works fine on an nvidia-unaware laptop. Thanks for the work! But I agree that it feels a bit unsafe. I am mostly worried that the process could start hanging somewhere and piling up, so we should think about a robust pre-test.
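One possible pre-test, sketched below (the function name is illustrative, not the project's actual API): instead of relying on nvidia-smi being absent from PATH, check for files that only the NVIDIA kernel driver creates before issuing the NVIDIA_PMON_COMMAND.

```rust
use std::path::Path;

/// Hypothetical pre-test: only attempt nvidia-smi if the NVIDIA kernel
/// driver appears to be loaded on this node. /dev/nvidiactl and
/// /proc/driver/nvidia are created by the driver itself, so their absence
/// is a cheap signal that any nvidia-smi invocation would be pointless.
fn nvidia_gpu_present() -> bool {
    Path::new("/dev/nvidiactl").exists() || Path::new("/proc/driver/nvidia").is_dir()
}
```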
Will the timeout in safe_command not prevent us from hanging?
Ideally/hopefully yes. I hope I tested it well, but I am still rather worried; ideally, though, the timeout should prevent it.
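For reference, a minimal sketch of what a timeout-bounded invocation could look like using only the standard library. This is not the project's actual safe_command, just an illustration of why a wedged nvidia-smi would be killed and reaped rather than left hanging and piling up.

```rust
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Run `cmd` with `args`, returning its stdout, or None if it did not
/// finish within `timeout` (in which case the child is killed and reaped).
fn run_with_timeout(cmd: &str, args: &[&str], timeout: Duration) -> Option<String> {
    let mut child = Command::new(cmd)
        .args(args)
        .stdout(Stdio::piped())
        .stderr(Stdio::null())
        .spawn()
        .ok()?;

    let deadline = Instant::now() + timeout;
    loop {
        match child.try_wait() {
            // Finished in time; stop polling and collect the output below.
            Ok(Some(_)) => break,
            Ok(None) if Instant::now() >= deadline => {
                // Deadline passed: kill and reap so nothing is left hanging.
                let _ = child.kill();
                let _ = child.wait();
                return None;
            }
            Ok(None) => thread::sleep(Duration::from_millis(50)),
            Err(_) => return None,
        }
    }
    // Output is assumed to fit in the pipe buffer (true for nvidia-smi's
    // short tabular output); otherwise stdout should be drained while polling.
    let output = child.wait_with_output().ok()?;
    String::from_utf8(output.stdout).ok()
}
```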
Thanks!
This adds GPU monitoring for nvidia cards, as well as zombie-hunting functionality for the same. It uses the nvidia-smi tool for (relative) simplicity, but it needs to be invoked twice, and its output is documented as not being stable, so it may be better to consider going directly to the monitoring libraries. Also, the nvidia code has been folded into ps.rs for now, but we could discuss whether it is worth disentangling.
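For concreteness, here is a rough sketch (not the code in this PR) of how one `nvidia-smi pmon -c 1` sample could be parsed into per-process utilisation figures. The column layout assumed here is exactly the kind of thing that is documented as unstable, which is the argument for eventually going to the monitoring libraries instead.

```rust
use std::process::Command;

/// Take one pmon sample and return (pid, sm%, mem%) per GPU process.
/// Assumed field order: gpu pid type sm mem enc dec command; idle values
/// are printed as '-' and are treated as zero here.
fn sample_gpu_processes() -> Vec<(u32, f64, f64)> {
    let output = match Command::new("nvidia-smi").args(["pmon", "-c", "1"]).output() {
        Ok(o) if o.status.success() => o,
        _ => return Vec::new(), // no nvidia-smi, or it failed: report nothing
    };
    String::from_utf8_lossy(&output.stdout)
        .lines()
        .filter(|line| !line.starts_with('#')) // drop the header lines
        .filter_map(|line| {
            let fields: Vec<&str> = line.split_whitespace().collect();
            if fields.len() < 5 {
                return None;
            }
            let pid = fields[1].parse().ok()?;
            let sm = fields[3].parse().unwrap_or(0.0);
            let mem = fields[4].parse().unwrap_or(0.0);
            Some((pid, sm, mem))
        })
        .collect()
}
```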