
Request for Default GPU Utilization Metrics #281

Open · CoderTH opened this issue Apr 19, 2024 · 7 comments

Comments

CoderTH (Contributor) commented Apr 19, 2024

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

After assigning vGPU to a pod, we can retrieve detailed information about the pod's vGPU usage through metrics, such as memory usage and computational power. However, if the pod does not actually use the GPU, these metrics will be absent. Is it possible to add default values to these metrics? This way, when a pod is allocated vGPU, corresponding metrics will be exposed regardless of actual GPU utilization.

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g. /etc/docker/daemon.json)
  • The vgpu-device-plugin container logs
  • The vgpu-scheduler container logs
  • The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug (a small collection sketch follows this list):

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
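
For convenience, here is a minimal sketch for gathering the items above in one go. The commands are the ones listed in this template; the output file name, the tail lengths, and the assumption that you run it directly on the GPU node are mine.

```sh
# Sketch: collect the diagnostics listed above into a single file to attach.
# Run on the GPU node; adjust the output path and tail lengths to taste.
{
  echo "== nvidia-smi -a =="
  nvidia-smi -a
  echo "== docker version =="
  docker version
  echo "== uname -a =="
  uname -a
  echo "== dmesg (tail) =="
  dmesg | tail -n 200
  echo "== kubelet logs (tail) =="
  sudo journalctl -r -u kubelet -n 200 --no-pager
} > hami-issue-diagnostics.txt 2>&1
```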

Hi @CoderTH,
Thanks for opening an issue!
We will look into it as soon as possible.


CoderTH (Contributor, Author) commented Apr 19, 2024

/kind/feature

@CoderTH
Command kind/feature is not found

CoderTH (Contributor, Author) commented Apr 19, 2024

@archlitchi cc

chaunceyjiang (Contributor) commented

Yes, I'm implementing this feature. Ref to #258

CoderTH (Contributor, Author) commented Apr 22, 2024

> Yes, I'm implementing this feature. Ref to #258

Even if the pod does not use the GPU, will there be a default value?

archlitchi (Collaborator) commented

@CoderTH There are two monitors:

{scheduler node ip}:31993/metrics reports the 'allocated resources' of each container, regardless of whether the container actually uses the GPU.

{gpu node ip}:31992/metrics reports the real-time GPU resource usage of each container. These metrics are read from an mmap cache file that HAMi-core generates only when the container accesses the GPU-related CUDA and NVML interfaces. So if the pod never actually touches the GPU, that file is never created and these metrics are absent.
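
For illustration, a minimal sketch of scraping the two endpoints (the node IPs are placeholders, and I'm assuming the NodePorts are reachable over plain HTTP from wherever you run this):

```sh
# Placeholders -- substitute your own scheduler node and GPU node addresses.
SCHEDULER_NODE_IP=10.0.0.10
GPU_NODE_IP=10.0.0.11

# Allocated vGPU resources per container -- present even for idle pods.
curl -s "http://${SCHEDULER_NODE_IP}:31993/metrics"

# Real-time per-container usage -- only populated after the container has
# touched the CUDA/NVML interfaces and HAMi-core has written its cache file.
curl -s "http://${GPU_NODE_IP}:31992/metrics"
```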

It's hard to add a default value here. I suggest either reading {scheduler node ip}:31993/metrics to get the information about these idle containers, or adding an nvidia-smi call to the container's entrypoint so the cache file is generated when the container starts. A sketch of the latter follows.
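
A minimal sketch of that entrypoint workaround, as a shell wrapper (the script name and how your image invokes it are placeholders; it assumes nvidia-smi is available on the PATH inside the container):

```sh
#!/bin/sh
# entrypoint.sh -- hypothetical wrapper, not part of HAMi itself.
# Touch the GPU once so HAMi-core creates its mmap cache file and the
# per-container usage metrics appear even before the app uses the GPU.
# Ignore failures so the container still starts if nvidia-smi is unavailable.
nvidia-smi > /dev/null 2>&1 || true

# Hand off to the real application command.
exec "$@"
```

You would then set this wrapper as the container's entrypoint and pass the original command as its arguments.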
