Questions about enabling remote telemetry permissions #179

Open
MarcelFerrari opened this issue Jul 16, 2024 · 0 comments

Hi all,

I have been working on a tool that gives an overview of the performance metrics for all GPUs associated with a SLURM job, similar to what the top command does for processes. It reads the list of hosts associated with a job from SLURM and establishes a remote connection to the nv-hostengine daemon running on each of those hosts.
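The host discovery step can be sketched roughly as follows. This is a minimal sketch, not the tool's actual code: it assumes `squeue` and `scontrol` are on the PATH, that each hostengine listens on port 5555 (the nv-hostengine default), and the function names are hypothetical.

```python
import subprocess

# Assumption: hostengines listen on the nv-hostengine default port.
DCGM_DEFAULT_PORT = 5555


def parse_hostnames(scontrol_output: str) -> list[str]:
    """Turn the one-hostname-per-line output of
    `scontrol show hostnames` into a list, skipping blank lines."""
    return [line.strip() for line in scontrol_output.splitlines() if line.strip()]


def hostengine_targets(hostnames: list[str], port: int = DCGM_DEFAULT_PORT) -> list[str]:
    """Build host:port strings for each node of the job."""
    return [f"{host}:{port}" for host in hostnames]


def job_hostnames(job_id: str) -> list[str]:
    """Ask SLURM for the compact node list of a job, then expand it."""
    # %N prints the job's node list in compact form, e.g. node[01-03].
    nodelist = subprocess.run(
        ["squeue", "-j", job_id, "-h", "-o", "%N"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    # `scontrol show hostnames` expands the compact form, one host per line.
    expanded = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_hostnames(expanded)
```

Calling `hostengine_targets(job_hostnames(job_id))` would then give the list of addresses to connect to, one per node in the allocation.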

According to this GitHub issue, for the nv-hostengine process to accept connections from remote hosts, one must specify the -b flag when starting the hostengine. I modified the systemd script for the nv-hostengine to include the -b ALL flag and it works great. However, I would like a more permanent and robust solution, so I have a few questions:

  1. Is there any documentation on how to set up the nv-hostengine to support remote telemetry? I searched for quite a while but could not find anything except the previously mentioned issue.
  2. Is there a DCGM config file where I can enable this feature, other than the systemd script?
  3. Is there a way to set up permissions that limit which users can connect remotely to a hostengine? One problem with simply accepting any connection is that users could read the performance metrics of nodes allocated to other users. Is there a way to ensure that users can only connect to hostengines running on hosts allocated to them as part of their SLURM jobs?
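For anyone reproducing the -b ALL setup described above, a quick way to confirm that a node accepts remote connections is a plain TCP probe. This is a minimal sketch: the port 5555 is assumed to be the nv-hostengine default, and the function name is hypothetical. A successful connect only shows that something is listening on that port, not that it is a hostengine.

```python
import socket

# Assumption: the hostengine uses the nv-hostengine default port.
DCGM_DEFAULT_PORT = 5555


def hostengine_reachable(host: str, port: int = DCGM_DEFAULT_PORT,
                         timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This does not speak the DCGM protocol; it only checks that the
    port accepts connections from this machine.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False
```

If the hostengine is bound only to the loopback interface, a probe from another node should fail, while the same probe run on the node itself against 127.0.0.1 should succeed.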

Many thanks in advance,
Marcel
