
Request for Default GPU Utilization Metrics #281

Open · CoderTH opened this issue Apr 19, 2024 · 7 comments

Comments

CoderTH (Contributor) commented Apr 19, 2024

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

After assigning vGPU to a pod, we can retrieve detailed information about the pod's vGPU usage through metrics, such as memory usage and computational power. However, if the pod does not actually use the GPU, these metrics will be absent. Is it possible to add default values to these metrics? This way, when a pod is allocated vGPU, corresponding metrics will be exposed regardless of actual GPU utilization.

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g. /etc/docker/daemon.json)
  • The vgpu-device-plugin container logs
  • The vgpu-scheduler container logs
  • The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug (a small collection sketch follows this list):

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
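
For convenience, here is a minimal sketch for gathering the items above in one go. The commands are the ones listed in this template; the output file name, the tail lengths, and the assumption that you run it directly on the GPU node are mine.

```sh
# Sketch: collect the diagnostics listed above into a single file to attach.
# Run on the GPU node; adjust the output path and tail lengths to taste.
{
  echo "== nvidia-smi -a =="
  nvidia-smi -a
  echo "== docker version =="
  docker version
  echo "== uname -a =="
  uname -a
  echo "== dmesg (tail) =="
  dmesg | tail -n 200
  echo "== kubelet logs (tail) =="
  sudo journalctl -r -u kubelet -n 200 --no-pager
} > hami-issue-diagnostics.txt 2>&1
```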

Hi @CoderTH,
Thanks for opening an issue!
We will look into it as soon as possible.


CoderTH (Contributor, Author) commented Apr 19, 2024

/kind/feature

@CoderTH
Command kind/feature is not found

CoderTH (Contributor, Author) commented Apr 19, 2024

@archlitchi cc

chaunceyjiang (Contributor) commented

Yes, I'm implementing this feature. Ref to #258

CoderTH (Contributor, Author) commented Apr 22, 2024

> Yes, I'm implementing this feature. Ref to #258

Even if the pod does not use the GPU, will there be a default value?

archlitchi (Collaborator) commented

@CoderTH There are two monitors:

{scheduler node ip}:31993/metrics reports the 'allocated resources' of each container, regardless of whether the container actually uses the GPU.

{gpu node ip}:31992/metrics reports the real-time GPU resource usage of each container. These metrics are read from an mmap cache file that HAMi-core generates only when the container accesses the GPU-related CUDA and NVML interfaces. So if the pod never actually touches the GPU, that file is never created and these metrics are absent.
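
For illustration, a minimal sketch of scraping the two endpoints (the node IPs are placeholders, and I'm assuming the NodePorts are reachable over plain HTTP from wherever you run this):

```sh
# Placeholders -- substitute your own scheduler node and GPU node addresses.
SCHEDULER_NODE_IP=10.0.0.10
GPU_NODE_IP=10.0.0.11

# Allocated vGPU resources per container -- present even for idle pods.
curl -s "http://${SCHEDULER_NODE_IP}:31993/metrics"

# Real-time per-container usage -- only populated after the container has
# touched the CUDA/NVML interfaces and HAMi-core has written its cache file.
curl -s "http://${GPU_NODE_IP}:31992/metrics"
```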

It's hard to add a default value here. I suggest either reading {scheduler node ip}:31993/metrics to get the information about these idle containers, or adding an nvidia-smi call to the container's entrypoint so the cache file is generated when the container starts. A sketch of the latter follows.
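
A minimal sketch of that entrypoint workaround, as a shell wrapper (the script name and how your image invokes it are placeholders; it assumes nvidia-smi is available on the PATH inside the container):

```sh
#!/bin/sh
# entrypoint.sh -- hypothetical wrapper, not part of HAMi itself.
# Touch the GPU once so HAMi-core creates its mmap cache file and the
# per-container usage metrics appear even before the app uses the GPU.
# Ignore failures so the container still starts if nvidia-smi is unavailable.
nvidia-smi > /dev/null 2>&1 || true

# Hand off to the real application command.
exec "$@"
```

You would then set this wrapper as the container's entrypoint and pass the original command as its arguments.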
