
exported_pod causes issue with query -> every sample is a different metric #340

Open
amir-bialek opened this issue Jun 9, 2024 · 3 comments
Labels
question Further information is requested

Comments

@amir-bialek

Ask your question

Running dcgm-exporter on Kubernetes, installed via the Helm chart with default values.
The cluster has 1 master and 1 worker; only the worker has a GPU exposed as a resource.

Running a simple query:
DCGM_FI_DEV_GPU_TEMP

Returns:

DCGM_FI_DEV_GPU_TEMP{DCGM_FI_DRIVER_VERSION="545.29.06", Hostname="worker1", UUID="GPU-longgui", container="exporter", device="nvidia0", endpoint="metrics", exported_container="ai-artifactory-control", exported_namespace="default", exported_pod="pod1", gpu="0", instance="someip:9400", job="dcgm-exporter-dev", modelName="NVIDIA GeForce RTX 4090", namespace="monitoring", pod="dcgm-exporter-dev-78z52", service="dcgm-exporter-dev"}
DCGM_FI_DEV_GPU_TEMP{DCGM_FI_DRIVER_VERSION="545.29.06", Hostname="worker1", UUID="GPU-longgui", container="exporter", device="nvidia0", endpoint="metrics", exported_container="somecontainer", exported_namespace="default", exported_pod="pod2", gpu="0", instance="someip:9400", job="dcgm-exporter-dev", modelName="NVIDIA GeForce RTX 4090", namespace="monitoring", pod="dcgm-exporter-dev-78z52", service="dcgm-exporter-dev"}
DCGM_FI_DEV_GPU_TEMP{DCGM_FI_DRIVER_VERSION="545.29.06", Hostname="worker1", UUID="GPU-longgui", container="exporter", device="nvidia0", endpoint="metrics", exported_container="runner", exported_namespace="somenamespace", exported_pod="pod3", gpu="0", instance="someip:9400", job="dcgm-exporter-dev", modelName="NVIDIA GeForce RTX 4090", namespace="monitoring", pod="dcgm-exporter-dev-78z52", service="dcgm-exporter-dev"}
DCGM_FI_DEV_GPU_TEMP{DCGM_FI_DRIVER_VERSION="545.29.06", Hostname="worker1", UUID="GPU-longgui", container="exporter", device="nvidia0", endpoint="metrics", exported_container="runner", exported_namespace="somenamespace", exported_pod="pod4", gpu="0", instance="someip:9400", job="dcgm-exporter-dev", modelName="NVIDIA GeForce RTX 4090", namespace="monitoring", pod="dcgm-exporter-dev-78z52", service="dcgm-exporter-dev"}
...

However, since there is only 1 GPU, I would like to receive only 1 result.
To explain better, setting up the dashboard in Grafana gives me:

[Grafana dashboard screenshot]

And what we would like to get is:

[Grafana dashboard screenshot]

@amir-bialek added the question (Further information is requested) label on Jun 9, 2024
@nvvfedorov
Collaborator

@amir-bialek, Labels with the "exported_" prefix come from the DCGM-exporter. From the metric values that you shared with us, I see:

  1. Kubernetes mode is enabled - the DCGM exporter maps GPU metrics to pods. That's why we see exported_container, exported_namespace, and exported_pod. Regarding the "exported" prefix, please refer to https://prometheus.io/docs/prometheus/latest/configuration/configuration/ (a minimal scrape-config sketch is shown after this list).

  2. GPU=0 was used by the following containers: ai-artifactory-control, somecontainer, and two instances of the runner container.

This behavior is expected for the DCGM-exporter with Kubernetes mode enabled.
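
For reference, the renaming itself happens on the Prometheus side: the exporter exposes pod, namespace, and container labels, and when those collide with the labels Prometheus attaches from service discovery, the scraped ones are renamed with the exported_ prefix. A minimal sketch of a plain scrape_configs entry, assuming vanilla Prometheus rather than the Operator's ServiceMonitor:

scrape_configs:
  - job_name: dcgm-exporter
    # With the default honor_labels: false, colliding scraped labels
    # (pod, namespace, container) are renamed to exported_pod,
    # exported_namespace, exported_container.
    honor_labels: false
    kubernetes_sd_configs:
      - role: endpoints

Note that honor_labels: true would only change which side of the collision wins; it does not reduce the number of series per GPU.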

@amir-bialek
Author

Hey @nvvfedorov, thank you for the answer.

So if I have several pods sharing the same GPU via time slicing, how can I solve this issue?

@nvvfedorov
Collaborator

Today, time-slicing is not supported by DCGM and DCGM-exporter. However, if you run a few containers and each one uses the same GPU, you will see multiple metric series associated with that GPU.
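
That said, if one series per GPU is all the dashboard needs, a PromQL aggregation sketch (label names taken from the samples above; since every per-pod series reports the same temperature for the GPU, max, min, or avg give the same result) would be:

max by (gpu, UUID, Hostname) (DCGM_FI_DEV_GPU_TEMP)

This collapses the exported_pod / exported_container dimensions and leaves a single temperature series per physical GPU.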
