
Refresh DCGM exporter #1348

Merged
michael-balint merged 1 commit into master from dholt/dcgm-exporter-refresh on May 28, 2026

Conversation

Contributor

@dholt dholt commented May 28, 2026

Summary

  • Update DCGM Exporter from stale Ubuntu 20.04-era version pins to NVIDIA/dcgm-exporter tag 4.5.3-4.8.2.
  • Refresh bundled default counters from the upstream tag.
  • Adjust the Kubernetes DaemonSet manifest for the distroless image entrypoint and tighten container security context fields.
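
The tag 4.5.3-4.8.2 is easy to misread as a version range. A minimal sketch of how it can be split, assuming the upstream naming convention of `<DCGM version>-<exporter version>[-<image variant>]`; `ExporterTag` and `parse_exporter_tag` are hypothetical helper names for illustration and are not part of this change:

```python
from typing import NamedTuple


class ExporterTag(NamedTuple):
    """Components of a dcgm-exporter image tag (assumed naming convention)."""

    dcgm_version: str
    exporter_version: str
    variant: str  # for example "distroless"; empty string when absent


def parse_exporter_tag(tag: str) -> ExporterTag:
    """Split a tag such as '4.5.3-4.8.2-distroless' into its parts.

    Assumes the layout <DCGM version>-<exporter version>[-<variant>].
    """
    # Split at most twice so a variant containing hyphens stays intact.
    parts = tag.split("-", 2)
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"unexpected dcgm-exporter tag: {tag!r}")
    variant = parts[2] if len(parts) == 3 else ""
    return ExporterTag(parts[0], parts[1], variant)
```

Under that assumption, `parse_exporter_tag("4.5.3-4.8.2-distroless")` yields DCGM 4.5.3, exporter 4.8.2, and the `distroless` variant.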

Validation

  • Public GitHub Actions on final head a6242f6a4696d1c06304db988a123d22b457267c: PASS
  • ansible-lint roles/ via the DeepOps role lint target: PASS
  • ansible-playbook --syntax-check playbooks/slurm-cluster/nvidia-dcgm-exporter.yml: PASS
  • YAML parse of modified defaults and DaemonSet manifests: PASS
  • git diff --check: PASS
  • docker manifest inspect nvcr.io/nvidia/k8s/dcgm-exporter:4.5.3-4.8.2-distroless: PASS for an OCI index with linux/amd64 and linux/arm64 manifests
  • Refreshed CSV files compare exactly to upstream etc/default-counters.csv at tag 4.5.3-4.8.2
  • Static OS compatibility audit: DCGM monitoring high findings reduced from 3 to 0
  • Live single-node Ubuntu 22.04 GPU-server deployment: PASS
    • Ran ansible-playbook -i <inventory> playbooks/slurm-cluster/nvidia-dcgm-exporter.yml -e hostlist=all --flush-cache from a pinned DeepOps Ansible 10.7.0 controller virtualenv.
    • Play recap: ok=94 changed=31 unreachable=0 failed=0 skipped=30.
    • Post-run checks: docker.dcgm-exporter.service active/enabled, container image nvcr.io/nvidia/k8s/dcgm-exporter:4.5.3-4.8.2-distroless running, nvidia-smi healthy on two GPUs, and curl http://127.0.0.1:9400/metrics returned GPU metrics including clocks, temperature, power, utilization, and framebuffer gauges.
  • PR body sanitization: PASS
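
The post-run curl check above could be scripted roughly as follows. This is a sketch, not the check that was actually run: the metric names in `EXPECTED_METRICS` are assumed to correspond to the clocks, temperature, power, utilization, and framebuffer gauges mentioned above, and the real set depends on the counters CSV in use. The function names are hypothetical.

```python
import re
from urllib.request import urlopen

# Assumed metric names; adjust to match the counters CSV actually deployed.
EXPECTED_METRICS = (
    "DCGM_FI_DEV_SM_CLOCK",
    "DCGM_FI_DEV_GPU_TEMP",
    "DCGM_FI_DEV_POWER_USAGE",
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_FB_USED",
)

# Matches a sample line: metric name, optional {labels}, then a value.
# Simplification: label values containing '}' are not handled.
_SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+\S+")


def metric_names(exposition_text):
    """Return the set of metric names that have at least one sample line."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, expected=EXPECTED_METRICS):
    """Return the expected metric names absent from the scrape, in order."""
    present = metric_names(exposition_text)
    return [name for name in expected if name not in present]


def fetch_metrics(url="http://127.0.0.1:9400/metrics", timeout=10):
    """Fetch the exporter's metrics page as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

On the deployed node, `missing_metrics(fetch_metrics())` returning an empty list would indicate that every assumed gauge is being exported.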

Notes

@dholt dholt marked this pull request as ready for review May 28, 2026 15:35
@dholt dholt requested a review from michael-balint May 28, 2026 15:35
@michael-balint michael-balint merged commit e6553c4 into master May 28, 2026
31 checks passed
@dholt dholt deleted the dholt/dcgm-exporter-refresh branch May 28, 2026 15:39
