
[Feature]: Allow specific annotation to be set on dcgm-exporter pod's daemonset #2292

Merged
tariq1890 merged 1 commit into NVIDIA:main from MadJlzz:add-extra-annotation-dcgm-exporter-ds
Apr 22, 2026

Conversation

Contributor

@MadJlzz MadJlzz commented Apr 13, 2026

Description

Add support for setting extra annotations on dcgm-exporter only.
Solves #2271

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Added multiple unit tests for now:

  • add annotations on dcgm-exporter when daemonsets.annotations is empty
  • combine annotations of dcgm-exporter and daemonsets.annotations when both are set
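
To make the second test case concrete, here is a minimal sketch of a ClusterPolicy in which both fields are set. It assumes the new field sits at `spec.dcgmExporter.annotations`, mirroring the `dcgmExporter.annotations` Helm key discussed in this thread. The annotation keys and values are hypothetical, and the sketch deliberately uses non-overlapping keys because the thread does not state which side wins on a conflict.

```yaml
# Sketch of the "both set" scenario; annotation keys and values are hypothetical.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  daemonsets:
    annotations:
      example.com/team: gpu-platform          # common to operand daemonsets
  dcgmExporter:
    annotations:
      example.com/scrape-profile: detailed    # dcgm-exporter only (assumed field path)
```

With this input, the tests described above would expect the dcgm-exporter daemonset to carry both annotations, while other daemonsets carry only the common one.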


copy-pr-bot Bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor Author

MadJlzz commented Apr 13, 2026

Hello there maintainers 👋🏼

This is a naive approach to adding extra annotations to the dcgm-exporter daemonset as well as its pod. I have no objection to revisiting the entire logic - this is my first time touching a Kubernetes operator.

@MadJlzz MadJlzz force-pushed the add-extra-annotation-dcgm-exporter-ds branch from 0758246 to 34c82f0 Compare April 13, 2026 21:53
@rahulait
Contributor

Thanks @MadJlzz. You'll also need to update the Helm chart templates to render the newly added annotations field.
See here:

  dcgmExporter:
    enabled: {{ .Values.dcgmExporter.enabled }}
    {{- if .Values.dcgmExporter.repository }}
    repository: {{ .Values.dcgmExporter.repository }}
    {{- end }}
    {{- if .Values.dcgmExporter.image }}
    image: {{ .Values.dcgmExporter.image }}
    {{- end }}
    {{- if .Values.dcgmExporter.version }}
    version: {{ .Values.dcgmExporter.version | quote }}
    {{- end }}
    {{- if .Values.dcgmExporter.imagePullPolicy }}
    imagePullPolicy: {{ .Values.dcgmExporter.imagePullPolicy }}
    {{- end }}
    {{- if .Values.dcgmExporter.imagePullSecrets }}
    imagePullSecrets: {{ toYaml .Values.dcgmExporter.imagePullSecrets | nindent 6 }}
    {{- end }}
    {{- if .Values.dcgmExporter.resources }}
    resources: {{ toYaml .Values.dcgmExporter.resources | nindent 6 }}
    {{- end }}
    {{- if .Values.dcgmExporter.env }}
    env: {{ toYaml .Values.dcgmExporter.env | nindent 6 }}
    {{- end }}
    {{- if .Values.dcgmExporter.args }}
    args: {{ toYaml .Values.dcgmExporter.args | nindent 6 }}
    {{- end }}
    {{- if and (.Values.dcgmExporter.config) (.Values.dcgmExporter.config.name) }}
    config:
      name: {{ .Values.dcgmExporter.config.name }}
    {{- end }}
    {{- if .Values.dcgmExporter.serviceMonitor }}
    serviceMonitor: {{ toYaml .Values.dcgmExporter.serviceMonitor | nindent 6 }}
    {{- end }}
    {{- if .Values.dcgmExporter.service }}
    service: {{ toYaml .Values.dcgmExporter.service | nindent 6 }}
    {{- end }}
    {{- if .Values.dcgmExporter.hostPID }}
    hostPID: {{ .Values.dcgmExporter.hostPID }}
    {{- end }}
    {{- if .Values.dcgmExporter.hostNetwork }}
    hostNetwork: {{ .Values.dcgmExporter.hostNetwork }}
    {{- end }}
    {{- if .Values.dcgmExporter.hpcJobMapping }}
    hpcJobMapping: {{ toYaml .Values.dcgmExporter.hpcJobMapping | nindent 6 }}
    {{- end }}
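
Following the pattern of the optional fields in the template above, the new field could be rendered along these lines. This is a sketch under the assumption that the values key is `dcgmExporter.annotations` (the key used later in this thread); it is not copied from the merged diff, and the indentation assumes the same nesting as the neighbouring entries.

```yaml
    # Sketch only: mirrors the existing conditional blocks in the dcgmExporter section.
    {{- if .Values.dcgmExporter.annotations }}
    annotations: {{ toYaml .Values.dcgmExporter.annotations | nindent 6 }}
    {{- end }}
```

Wrapping the field in an `if` keeps it out of the rendered ClusterPolicy when no annotations are configured, which matches how the other optional map-valued fields are handled.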

Then update the dcgmExporter values in values.yaml with an example:

dcgmExporter:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 4.5.1-4.8.0-distroless
  imagePullPolicy: IfNotPresent
  env: []
  resources: {}
  hostPID: false
  hostNetwork: false
  # HPC job mapping configuration for correlating GPU metrics with HPC workload manager jobs
  # This is used by HPC workload managers like Slurm to label GPU metrics with job IDs
  # hpcJobMapping:
  #   enabled: true
  #   directory: /var/lib/dcgm-exporter/job-mapping
  service:
    internalTrafficPolicy: Cluster
  serviceMonitor:
    enabled: false
    interval: 15s
    honorLabels: false
    additionalLabels: {}
    relabelings: []
    # - source_labels:
    #     - __meta_kubernetes_pod_node_name
    #   regex: (.*)
    #   target_label: instance
    #   replacement: $1
    #   action: replace
  # DCGM Exporter configuration
  # This block is used to configure DCGM Exporter to emit a customized list of metrics.
  # Use "name" to either point to an existing ConfigMap or to create a new one with a
  # list of configurations (i.e with create=true).
  # When pointing to an existing ConfigMap, the ConfigMap must exist in the same namespace as the release.
  # The metrics are expected to be listed under a key called `dcgm-metrics.csv`.
  # Use "data" to build an integrated ConfigMap from a set of custom metrics as
  # part of the chart. An example of some custom metrics are shown below. Note that
  # the contents of "data" must be in CSV format and be valid DCGM Exporter metric configurations.
  # config:
  #   name: custom-dcgm-exporter-metrics
  #   create: true
  #   data: |-
  #     Format
  #     If line starts with a '#' it is considered a comment
  #     DCGM FIELD, Prometheus metric type, help message
  #     Clocks
  #     DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
  #     DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
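
As an illustration of the requested values.yaml example, an entry along these lines could sit alongside the existing dcgmExporter keys. It is a sketch rather than the text that was merged: the empty default and the comment wording are assumptions, and the commented-out annotation key is hypothetical.

```yaml
dcgmExporter:
  # Extra annotations applied only to the dcgm-exporter daemonset
  # (sketch; an empty default is assumed here).
  annotations: {}
  # annotations:
  #   example.com/scrape-profile: detailed   # hypothetical key/value
```

Keeping a commented example next to the default follows the style already used for hpcJobMapping and config in this file.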

@MadJlzz MadJlzz force-pushed the add-extra-annotation-dcgm-exporter-ds branch from d475d77 to bef9213 Compare April 14, 2026 07:45
@rahulait
Contributor

/ok to test bef9213

@rahulait
Contributor

Thanks @MadJlzz. You can test these changes and see if the built image works fine for you.

Steps to verify:

  1. Check out this branch.
  2. Create a custom values.yaml that includes dcgmExporter with annotations and also overrides the gpu-operator image:
operator:
  repository: ghcr.io/nvidia
  image: gpu-operator
  version: bef9213a

dcgmExporter:
  annotations:
    <your annotations here>
  3. Helm install:
helm install gpu-operator ./deployments/gpu-operator -n gpu-operator -f values.yaml

@rahulait rahulait self-assigned this Apr 15, 2026
Contributor Author

MadJlzz commented Apr 16, 2026

It looks like everything is working as expected.

[image]

Comment thread .gitignore Outdated
@rahulait
Contributor

Overall, the PR looks good to me. I'll let other maintainers take a look as well and merge as required.

Comment thread .gitignore Outdated
@MadJlzz MadJlzz force-pushed the add-extra-annotation-dcgm-exporter-ds branch from 45057fc to c2e0c1f Compare April 20, 2026 21:37
Comment thread controllers/object_controls.go Outdated
@rahulait
Contributor

/ok to test c2e0c1f

@MadJlzz MadJlzz force-pushed the add-extra-annotation-dcgm-exporter-ds branch from c2e0c1f to d238bb1 Compare April 22, 2026 20:21
…ter daemonset

Signed-off-by: Julien Klaer <klaer.julien@gmail.com>
@MadJlzz MadJlzz force-pushed the add-extra-annotation-dcgm-exporter-ds branch from d238bb1 to 7255ce0 Compare April 22, 2026 20:24
@tariq1890
Contributor

/ok to test 7255ce0

Contributor

@tariq1890 tariq1890 left a comment

Thank you @MadJlzz !

@tariq1890 tariq1890 merged commit eaa68e3 into NVIDIA:main Apr 22, 2026
19 checks passed
@tariq1890 tariq1890 added this to the v26.7 milestone Apr 22, 2026
3 participants