feat: add HPA pod autoscaling evidence for CNCF AI Conformance #191

Merged: yuanchen8911 merged 2 commits into NVIDIA:main on Feb 23, 2026

Conversation
Add HPA test manifest, evidence collection, and fix GPU metrics pipeline for faster and correct HPA autoscaling based on custom GPU metrics.

Changes:
- Add hpa-gpu-test.yaml: Deployment with gpu-burn + HPA targeting gpu_utilization at 50% threshold
- Add collect_hpa section to collect-evidence.sh
- Fix DCGM ServiceMonitor: enable honorLabels so Prometheus preserves workload namespace/pod labels (required for per-pod HPA metrics)
- Reduce ServiceMonitor scrape interval from 60s to 30s
- Fix prometheus-adapter: use last_over_time(...[1m]) instead of avg_over_time(...[2m]) for faster metric response (~60s vs ~4min)
- Un-deprecate collect-evidence.sh (needed for behavioral tests)
- Update evidence index with pod_autoscaling: PASS

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Contributor (Author):
Good feedback! Addressed them. PTAL.
Re: KSV-0125 (…)
mchmarny (Member) approved these changes on Feb 23, 2026 and left a comment:
9 of 10 review items addressed — nice cleanup. The capture path-stripping fix, tightened verdict logic, and split avg_over_time / last_over_time strategy are all well done.
One remaining nit: The PR description, collect-evidence.sh (line 722: "Deployment running gpu-burn"), and pod-autoscaling.md (line 17: same) still reference "gpu-burn" while the manifest actually uses nvcr.io/nvidia/k8s/cuda-sample:nbody-*. Not a blocker, but worth a quick text fix for consistency.
LGTM otherwise.
Summary

Add an HPA pod autoscaling test and evidence collection for the CNCF AI Conformance `pod_autoscaling` requirement, and fix the GPU metrics pipeline for correct and faster HPA scaling.

Motivation / Context
CNCF AI Conformance requires demonstrating that HPA functions correctly for pods utilizing accelerators, including scaling based on custom GPU metrics. This was the last behavioral test missing from our evidence collection.
During testing, we found two issues preventing HPA from reading GPU metrics:
- `honorLabels` unset: Prometheus overwrote workload namespace/pod labels with the exporter's own labels
- `avg_over_time(...[2m])` in prometheus-adapter caused a ~4 minute delay before metrics reflected actual GPU utilization

Fixes: N/A
Related: N/A
Type of Change

Component(s) Affected

- `recipes/components/gpu-operator`
- `recipes/components/prometheus-adapter`
- `docs/conformance/cncf`

Implementation Notes
GPU Operator (`gpu-operator/values.yaml`):
- `serviceMonitor.honorLabels: true` — Prometheus preserves the DCGM exporter's workload `namespace`/`pod` labels instead of overwriting them with the exporter pod's labels. Required for per-pod HPA metrics.
- `serviceMonitor.interval: 30s` — faster scrape for quicker metric availability
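The ServiceMonitor change maps to a values fragment roughly like the following; the `dcgmExporter.serviceMonitor` nesting is assumed from the GPU Operator chart and may differ from the exact file in this PR:

```yaml
# Sketch of the relevant gpu-operator/values.yaml fragment (assumed layout).
dcgmExporter:
  serviceMonitor:
    enabled: true
    # Preserve the workload's namespace/pod labels emitted by the DCGM
    # exporter instead of letting Prometheus overwrite them with the
    # exporter pod's own labels; required for per-pod HPA metrics.
    honorLabels: true
    # Scrape every 30s (down from 60s) so metrics surface faster.
    interval: 30s
```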
Prometheus Adapter (`prometheus-adapter/values.yaml`):
- `avg_over_time(<<.Series>>[2m])` → `last_over_time(<<.Series>>[1m])` for all custom metric rules. Takes the most recent value instead of averaging over 2 minutes, reducing metric response time from ~4 minutes to ~60 seconds.
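A minimal adapter rule using the new query shape might look like this; the series name, metric rename, and resource overrides are illustrative assumptions (DCGM's GPU utilization gauge is `DCGM_FI_DEV_GPU_UTIL`), not copied from the PR:

```yaml
# Illustrative prometheus-adapter custom-metric rule (assumed names).
rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        as: "gpu_utilization"
      # last_over_time over a 1m window returns the most recent sample
      # instead of a 2m average, cutting metric latency to ~60s.
      metricsQuery: 'last_over_time(<<.Series>>{<<.LabelMatchers>>}[1m])'
```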
HPA Test (`hpa-gpu-test.yaml`):
- `oguzpastirmaci/gpu-burn` with `-d 600` to generate sustained 100% GPU utilization
- HPA targets `gpu_utilization` at a 50% threshold, scales 1→4 replicas
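The HPA half of that manifest could be sketched as follows. Only the metric name, the 50% threshold, and the 1→4 replica range come from the description; the object and Deployment names are hypothetical:

```yaml
# Sketch of the HPA portion of hpa-gpu-test.yaml (assumed names).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-burn-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-burn              # hypothetical Deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization # served via prometheus-adapter
        target:
          type: AverageValue
          averageValue: "50"    # scale out above 50% average GPU utilization
```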
Evidence Collection (`collect-evidence.sh`):
- `collect_hpa` section with prometheus-adapter health, custom metrics API, GPU stress test, HPA scaling verification
- … (`aicr validate --evidence-dir`)

Testing
Verified on the `ktsetfavua-dgxc-k8s-aws-use1-non-prod` EKS cluster:
- `gpu_utilization: 100` via custom metrics API

Risk Assessment
Rollout notes: Existing bundles need to be regenerated.
Checklist
- `make test` (with `-race`)
- Signed commits (`git commit -S`)