
feat: add per-pod GPU SM utilization metrics for time-slicing workloads#638

Open
zbennett10 wants to merge 7 commits into NVIDIA:main from zbennett10:feat/per-pod-gpu-util-time-slicing

Conversation

@zbennett10

Closes #587

Summary

When CUDA time-slicing is active, multiple pods share a single physical GPU. Standard DCGM per-device metrics (dcgm_fi_dev_gpu_util) report aggregate utilization for the whole device — you cannot tell how much of the GPU the proxy, embeddings, or inference pods are each consuming.

This PR adds an opt-in ProcessPodCollector that attributes GPU SM utilization to individual pods by joining:

  1. NVML nvmlDeviceGetProcessUtilization() — per-PID SM utilization sampled directly from the CUDA driver
  2. Kubelet pod-resources gRPC API — maps GPU UUIDs to (pod, namespace, container) tuples
  3. /proc/<pid>/cgroup — links NVML PIDs back to container identities

New metric

# HELP dcgm_fi_dev_sm_util_per_pod SM utilization attributed to a pod (time-slicing)
# TYPE dcgm_fi_dev_sm_util_per_pod gauge
dcgm_fi_dev_sm_util_per_pod{gpu="0",uuid="GPU-abc123",pod="synapse-proxy-...",namespace="synapse-staging",container="proxy"} 42

One gauge is emitted per (pod, namespace, container, gpu_uuid) tuple. The value is the NVML SM utilization percentage (0–100).

Enabling

Standalone DaemonSet

spec:
  template:
    spec:
      hostPID: true
      containers:
        - name: dcgm-exporter
          env:
            - name: DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL
              value: "true"
          volumeMounts:
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
      volumes:
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
            type: Directory

With GPU Operator (v24.x+) ClusterPolicy

spec:
  dcgmExporter:
    perPodGPUUtil:
      enabled: true

See the companion PR in NVIDIA/gpu-operator that wires this through ClusterPolicy: NVIDIA/gpu-operator#2178

Test plan

  • Unit tests pass (go test ./internal/pkg/collector/... -run TestProcessPodCollector -v) — 10 test cases, all PASS, no GPU required
  • go build ./... on ubuntu-latest (CI)
  • Integration tested in our AWS environment: verified dcgm_fi_dev_sm_util_per_pod emitted with correct pod/namespace/container labels when time-slicing is active

Commits

Closes NVIDIA#587

## Problem

When CUDA time-slicing is active (multiple pods sharing one physical GPU),
`dcgm_fi_dev_gpu_util` reports aggregate device utilization — you cannot
tell how much of the GPU the proxy, embeddings, or inference pods are each consuming.

## Solution

Add opt-in `ProcessPodCollector` that attributes SM utilization to
individual pods by joining:

1. NVML `nvmlDeviceGetProcessUtilization()` — per-PID SM util from driver
2. Kubelet pod-resources gRPC API — maps GPU UUID → (pod, ns, container)
3. /proc/<pid>/cgroup — links NVML PIDs back to container identities

## New metric

    dcgm_fi_dev_sm_util_per_pod{
      gpu="0", uuid="GPU-abc123",
      pod="synapse-proxy-...", namespace="prod", container="proxy"
    } 42

## Enabling

    --enable-per-pod-gpu-util=true
    # or: DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true

Requires hostPID: true so the exporter can resolve the driver-reported PIDs under the host's /proc (auto-set when using the GPU Operator integration).

## Files changed

- internal/pkg/collector/process_pod_collector.go — new collector
- internal/pkg/collector/process_pod_collector_test.go — unit tests (10 cases, no GPU needed)
- internal/pkg/collector/collector_factory.go — register new collector
- internal/pkg/appconfig/types.go — EnablePerPodGPUUtil flag
- internal/pkg/counters/const.go — DCGM_EXP_SM_UTIL_PER_POD counter name
- pkg/cmd/app.go — --enable-per-pod-gpu-util CLI flag
- docs/per-pod-gpu-metrics.md — usage documentation

Run TestProcessPodCollector tests in CI to validate the new
process_pod_collector.go on ubuntu-latest (Linux/x86_64 — the only
supported build platform for this CGo-dependent project).

Also upgrade checkout/setup-go action versions and add apt gcc dep.

Three compilation errors:

1. Remove 'os' stdlib import from collector_factory.go — the collector
   package already has 'var os osinterface.OS' in variables.go for
   testable os.Exit calls; the stdlib 'os' import conflicts with it.
   (My previous session incorrectly added the stdlib import.)

2. Alias stdlib 'os' to 'stdos' in process_pod_collector.go —
   ReadFile is the only stdlib os usage; the package-level 'os'
   variable must not be shadowed.

3. Add '...grpc.CallOption' to podResourcesClient.List() —
   kubelet's PodResourcesListerClient.List() includes variadic
   CallOption; our interface must match to satisfy the type.

- Remove go:generate mockgen directives for nvmlDevice/nvmlLib/
  podResourcesClient — mockgen cannot generate mocks for unexported
  interfaces from external packages, causing compilation errors.
  Tests use hand-coded fakes in process_pod_collector_test.go instead.

- Scope 'unit-tests' CI job to TestProcessPodCollector only, excluding
  packages that require libdcgm.so.4 / libnvidia-ml (integration_test,
  nvmlprovider, server) which are pre-existing DCGM-only failures
  unrelated to this PR.

Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>

DCGM_EXP_SM_UTIL_PER_POD is a synthetic NVML-driven metric and does
not correspond to a real DCGM field, so it cannot appear in the stock
dcp-metrics-included.csv.

Previously NewProcessPodCollector would fail with:
  collector 'DCGM_EXP_SM_UTIL_PER_POD' cannot be initialized;
  err: counter not found in counter list

Fix: define the counter inline with sensible defaults. Allow the user
to override via their metrics CSV if they want custom help text.

Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>

Add GetName() to nvmlDevice interface and populate GPU, UUID (label key),
GPUDevice, and GPUModelName fields on the ProcessPodCollector metric.

Previously the Prometheus output had empty label keys (="GPU-UUID...")
and empty gpu/device/modelName fields. Now emits:
  DCGM_EXP_SM_UTIL_PER_POD{gpu="0",UUID="GPU-...",device="nvidia0",
    modelName="NVIDIA A10G",...,pod="...",namespace="...",container="..."} 42

Update fakeNVMLDevice test double with modelName field and GetName() stub.
Update TestProcessPodCollector_EmitsMetricForSinglePod assertions.

Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>

Development

Successfully merging this pull request may close these issues.

--kubernetes-virtual-gpus exports identical values for all pods instead of per-pod utilization
