
exported_pod causes issue with query -> every sample is a different metric #340

Open
amir-bialek opened this issue Jun 9, 2024 · 3 comments
Labels
question Further information is requested

Comments

@amir-bialek

Ask your question

Running dcgm-exporter on Kubernetes, installed via the Helm chart with default values.
The cluster has 1 master and 1 worker; only the worker has a GPU exposed as a resource.

Running a simple query:
DCGM_FI_DEV_GPU_TEMP

Returns:

DCGM_FI_DEV_GPU_TEMP{DCGM_FI_DRIVER_VERSION="545.29.06", Hostname="worker1", UUID="GPU-longgui", container="exporter", device="nvidia0", endpoint="metrics", exported_container="ai-artifactory-control", exported_namespace="default", exported_pod="pod1", gpu="0", instance="someip:9400", job="dcgm-exporter-dev", modelName="NVIDIA GeForce RTX 4090", namespace="monitoring", pod="dcgm-exporter-dev-78z52", service="dcgm-exporter-dev"}
DCGM_FI_DEV_GPU_TEMP{DCGM_FI_DRIVER_VERSION="545.29.06", Hostname="worker1", UUID="GPU-longgui", container="exporter", device="nvidia0", endpoint="metrics", exported_container="somecontainer", exported_namespace="default", exported_pod="pod2", gpu="0", instance="someip:9400", job="dcgm-exporter-dev", modelName="NVIDIA GeForce RTX 4090", namespace="monitoring", pod="dcgm-exporter-dev-78z52", service="dcgm-exporter-dev"}
DCGM_FI_DEV_GPU_TEMP{DCGM_FI_DRIVER_VERSION="545.29.06", Hostname="worker1", UUID="GPU-longgui", container="exporter", device="nvidia0", endpoint="metrics", exported_container="runner", exported_namespace="somenamespace", exported_pod="pod3", gpu="0", instance="someip:9400", job="dcgm-exporter-dev", modelName="NVIDIA GeForce RTX 4090", namespace="monitoring", pod="dcgm-exporter-dev-78z52", service="dcgm-exporter-dev"}
DCGM_FI_DEV_GPU_TEMP{DCGM_FI_DRIVER_VERSION="545.29.06", Hostname="worker1", UUID="GPU-longgui", container="exporter", device="nvidia0", endpoint="metrics", exported_container="runner", exported_namespace="somenamespace", exported_pod="pod4", gpu="0", instance="someip:9400", job="dcgm-exporter-dev", modelName="NVIDIA GeForce RTX 4090", namespace="monitoring", pod="dcgm-exporter-dev-78z52", service="dcgm-exporter-dev"}
...

However, since there is only 1 GPU, I would like to receive only 1 result.
To explain better, setting up the dashboard in Grafana gives me:

[Grafana dashboard screenshot]

And what we would like to get is:

[Grafana dashboard screenshot]

@amir-bialek added the question (Further information is requested) label on Jun 9, 2024
@nvvfedorov
Collaborator

@amir-bialek, Labels with the "exported_" prefix come from the DCGM-exporter. From the metric values that you shared with us, I see:

  1. Kubernetes mode is enabled - the DCGM exporter maps GPU metrics to pods. That's why we see exported_container, exported_namespace, and exported_pod. Regarding the "exported" prefix, please refer to https://prometheus.io/docs/prometheus/latest/configuration/configuration/ (a minimal scrape-config sketch is shown after this list).

  2. GPU=0 was used by the following containers: ai-artifactory-control, somecontainer, and two instances of the runner container.

This behavior is expected for the DCGM-exporter with Kubernetes mode enabled.
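
For reference, the renaming itself happens on the Prometheus side: the exporter exposes pod, namespace, and container labels, and when those collide with the labels Prometheus attaches from service discovery, the scraped ones are renamed with the exported_ prefix. A minimal sketch of a plain scrape_configs entry, assuming vanilla Prometheus rather than the Operator's ServiceMonitor:

scrape_configs:
  - job_name: dcgm-exporter
    # With the default honor_labels: false, colliding scraped labels
    # (pod, namespace, container) are renamed to exported_pod,
    # exported_namespace, exported_container.
    honor_labels: false
    kubernetes_sd_configs:
      - role: endpoints

Note that honor_labels: true would only change which side of the collision wins; it does not reduce the number of series per GPU.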

@amir-bialek
Author

Hey @nvvfedorov, thank you for the answer.

So if I have several pods sharing the same GPU via time slicing, how can I solve this issue?

@nvvfedorov
Collaborator

Today, time-slicing is not supported by DCGM and DCGM-exporter. However, if you run a few containers and each one uses the same GPU, you will see multiple metric series associated with that GPU.
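
That said, if one series per GPU is all the dashboard needs, a PromQL aggregation sketch (label names taken from the samples above; since every per-pod series reports the same temperature for the GPU, max, min, or avg give the same result) would be:

max by (gpu, UUID, Hostname) (DCGM_FI_DEV_GPU_TEMP)

This collapses the exported_pod / exported_container dimensions and leaves a single temperature series per physical GPU.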
