Add DCGM exporter pod metadata enrichment API #2406

Merged
karthikvetrivel merged 1 commit into NVIDIA:main from karthikvetrivel:fix/dcgm-exporter-pod-enrichment
May 12, 2026

@karthikvetrivel
Member

Description

Resolves #2009.

Previously, there was no way to enable pod metrics when using DCGM exporter through GPU Operator. This fix introduces first-class Helm values that provision RBAC and allow pod metrics to include the pod UID and pod labels.

Design Choices:

  • Exposed pod metrics enablement as first-class Helm values, instead of having arbitrary env vars provision RBAC, to match the upstream DCGM metrics Helm chart.
  • I intentionally did not add resourceSlices RBAC. Standalone dcgm-exporter Helm includes resourceSlices because it also supports DRA-related Kubernetes enrichment. GPU Operator is only exposing pod metadata enrichment here because it does not support DRA yet. I am open to changing this.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Added unit tests.
Confirmed the operator reconciled the expected DCGM exporter resources:

  • nvidia-dcgm-exporter DaemonSet rolled out successfully.
  • DCGM exporter container received:
    - DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS=true
    - DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID=true
    - DCGM_EXPORTER_KUBERNETES_POD_LABEL_ALLOWLIST_REGEX=^gpu_test_label$
  • Pod template had automountServiceAccountToken: true.
  • nvidia-dcgm-exporter-read-pods ClusterRole and ClusterRoleBinding were created.
  • Exporter service account could get, list, and watch pods cluster-wide.
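
For readers who want to see how the settings relate to the container environment listed above, here is a minimal Go sketch. It is not the operator's code: the `podEnrichment` struct, its field names, and both helper functions are hypothetical, and the rule that RBAC is needed whenever either toggle is on is an assumption. Only the three env var names come from the test results above.

```go
package main

import (
	"fmt"
	"sort"
)

// podEnrichment is a hypothetical stand-in for the new settings.
// The real field names are not shown in this PR description.
type podEnrichment struct {
	EnablePodLabels     bool
	EnablePodUID        bool
	LabelAllowlistRegex string
}

// exporterEnv maps the hypothetical settings to the env vars that the
// DCGM exporter container received during testing.
func exporterEnv(cfg podEnrichment) map[string]string {
	env := map[string]string{}
	if cfg.EnablePodLabels {
		env["DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS"] = "true"
	}
	if cfg.EnablePodUID {
		env["DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID"] = "true"
	}
	if cfg.LabelAllowlistRegex != "" {
		env["DCGM_EXPORTER_KUBERNETES_POD_LABEL_ALLOWLIST_REGEX"] = cfg.LabelAllowlistRegex
	}
	return env
}

// needsPodReadRBAC reports whether pod read access would be required.
// Assumption: either toggle requires the exporter to read pods.
func needsPodReadRBAC(cfg podEnrichment) bool {
	return cfg.EnablePodLabels || cfg.EnablePodUID
}

func main() {
	cfg := podEnrichment{
		EnablePodLabels:     true,
		EnablePodUID:        true,
		LabelAllowlistRegex: "^gpu_test_label$",
	}
	env := exporterEnv(cfg)

	// Sort keys so the printed order is stable across runs.
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, env[k])
	}
	fmt.Println("needs pod read RBAC:", needsPodReadRBAC(cfg))
}
```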

Scraped DCGM exporter metrics from a labeled GPU pod and confirmed that the metrics included gpu_test_label="live-test" and pod_uid="...".
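
The effect of the allowlist regex used in this test can be illustrated with a short Go sketch. This is not dcgm-exporter's implementation: `filterLabels` is a hypothetical helper, the `app` label is made up for contrast, and matching the pattern against the label key is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// filterLabels keeps only the labels whose key matches the allowlist pattern.
// Hypothetical helper; the exporter's real filtering logic may differ.
func filterLabels(labels map[string]string, pattern string) (map[string]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid allowlist regex %q: %w", pattern, err)
	}
	kept := make(map[string]string)
	for key, value := range labels {
		if re.MatchString(key) {
			kept[key] = value
		}
	}
	return kept, nil
}

func main() {
	// Labels on a hypothetical GPU pod; only gpu_test_label comes from the test above.
	labels := map[string]string{
		"gpu_test_label": "live-test",
		"app":            "trainer",
	}
	kept, err := filterLabels(labels, `^gpu_test_label$`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for key, value := range kept {
		fmt.Printf("%s=%q\n", key, value)
	}
}
```

With the anchored pattern `^gpu_test_label$`, only the exact key `gpu_test_label` passes the filter in this sketch, which is consistent with the scraped metrics carrying that single pod label.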

Comment thread controllers/object_controls.go Outdated
Comment thread controllers/object_controls.go Outdated
Contributor

@rahulait rahulait left a comment

LGTM

@karthikvetrivel karthikvetrivel force-pushed the fix/dcgm-exporter-pod-enrichment branch 4 times, most recently from aa28ce9 to ab5a207 Compare May 4, 2026 14:46
Comment thread assets/state-dcgm-exporter/0210_clusterrole.yaml Outdated
Comment thread controllers/object_controls_test.go Outdated
@karthikvetrivel karthikvetrivel force-pushed the fix/dcgm-exporter-pod-enrichment branch from ab5a207 to 39fb8d8 Compare May 4, 2026 16:06
Comment thread api/nvidia/v1/clusterpolicy_types.go Outdated
// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.x-descriptors="urn:alm:descriptor:com.tectonic.ui:advanced"
HPCJobMapping *DCGMExporterHPCJobMappingConfig `json:"hpcJobMapping,omitempty"`

// Enable Kubernetes pod labels in metrics. Requires cluster-level read access to pods.
Contributor

This feels a bit too terse. Can we expand on this a bit more? Let's also clarify that this setting adds the pod label as a label dimension to the DCGM exporter Prometheus metrics.

Also, the "Requires cluster-level read access to pods" is a bit unnecessary IMO. If we do want to mention this, can we enclose it in parentheses?

Member Author

Yep, expanded on the label and enclosed "Requires cluster-level read access to pods" in parentheses.

@karthikvetrivel karthikvetrivel force-pushed the fix/dcgm-exporter-pod-enrichment branch from 39fb8d8 to 5e5530f Compare May 5, 2026 14:58
@karthikvetrivel karthikvetrivel requested a review from tariq1890 May 6, 2026 13:09
Comment thread assets/state-dcgm-exporter/0210_clusterrole.yaml
Contributor

@rajathagasthya rajathagasthya left a comment

LGTM.

Comment thread assets/state-dcgm-exporter/0800_daemonset.yaml
Comment thread controllers/object_controls.go Outdated
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
@karthikvetrivel karthikvetrivel force-pushed the fix/dcgm-exporter-pod-enrichment branch from 5e5530f to 146f17f Compare May 12, 2026 16:08
Comment thread controllers/object_controls.go
@karthikvetrivel karthikvetrivel merged commit b2b4d1f into NVIDIA:main May 12, 2026
19 checks passed
rahulait added a commit that referenced this pull request May 21, 2026
This is required so that newer versions of dcgm-exporter can have the correct RBAC permissions if needed.

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
@tariq1890
Contributor

/cherry-pick release-26.3

@github-actions
Contributor

🤖 Backport PR created for release-26.3: #2473 ⚠️ (has conflicts)

Development

Successfully merging this pull request may close these issues.

Pod Label not visible in DCGM Exporter Metrics

4 participants