
feat: add HPA pod autoscaling evidence for CNCF AI Conformance #191

Merged
yuanchen8911 merged 2 commits into NVIDIA:main from yuanchen8911:feat/hpa-gpu-autoscaling-evidence
Feb 23, 2026

Conversation

@yuanchen8911 (Contributor)

Summary

Add HPA pod autoscaling test and evidence collection for CNCF AI Conformance pod_autoscaling requirement. Fix GPU metrics pipeline for correct and faster HPA scaling.

Motivation / Context

CNCF AI Conformance requires demonstrating that HPA functions correctly for pods utilizing accelerators, including scaling based on custom GPU metrics. This was the last behavioral test missing from our evidence collection.

During testing, we found two issues preventing HPA from reading GPU metrics:

  1. DCGM ServiceMonitor without honorLabels caused Prometheus to overwrite workload namespace/pod labels with the exporter's own labels
  2. avg_over_time(...[2m]) in prometheus-adapter caused ~4 minute delay before metrics reflected actual GPU utilization

Fixes: N/A
Related: N/A

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Other: recipes/components/gpu-operator, recipes/components/prometheus-adapter, docs/conformance/cncf

Implementation Notes

GPU Operator (gpu-operator/values.yaml):

  • serviceMonitor.honorLabels: true — Prometheus preserves DCGM exporter's workload namespace/pod labels instead of overwriting them with the exporter pod's labels. Required for per-pod HPA metrics.
  • serviceMonitor.interval: 30s — faster scrape for quicker metric availability
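
As a sketch, the two settings above would sit in the GPU Operator Helm values roughly as follows (the `dcgmExporter` nesting is an assumption based on the chart's usual layout; only `honorLabels` and `interval` come from this PR):

```yaml
# Sketch of gpu-operator/values.yaml — the dcgmExporter nesting is assumed;
# only honorLabels and interval are the changes described in this PR.
dcgmExporter:
  serviceMonitor:
    enabled: true
    # Preserve the workload namespace/pod labels exposed by the DCGM exporter
    # instead of overwriting them with the exporter pod's own labels.
    # Without this, per-pod HPA metrics cannot be attributed to the right pod.
    honorLabels: true
    # Scrape every 30s (down from 60s) so fresh GPU metrics reach the HPA sooner,
    # at the cost of slightly more Prometheus scrape load.
    interval: 30s
```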

Prometheus Adapter (prometheus-adapter/values.yaml):

  • Changed avg_over_time(<<.Series>>[2m]) → last_over_time(<<.Series>>[1m]) for all custom metric rules. Takes the most recent value instead of averaging over 2 minutes, reducing metric response time from ~4 minutes to ~60 seconds.
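
A hedged sketch of what one such custom rule might look like after the change — the `seriesQuery`, the `DCGM_FI_DEV_GPU_UTIL` source metric, and the `gpu_utilization` name mapping are illustrative assumptions; only the `last_over_time(...[1m])` query is from this PR (the `<<.Series>>`/`<<.LabelMatchers>>` placeholders are standard prometheus-adapter template syntax):

```yaml
# Illustrative prometheus-adapter custom metric rule.
rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "gpu_utilization"
      # Most recent sample within the last minute, rather than a 2-minute average:
      metricsQuery: 'last_over_time(<<.Series>>{<<.LabelMatchers>>}[1m])'
```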

HPA Test (hpa-gpu-test.yaml):

  • Deployment running oguzpastirmaci/gpu-burn with -d 600 to generate sustained 100% GPU utilization
  • HPA targeting gpu_utilization at 50% threshold, scales 1→4 replicas
  • Verified: HPA scaled from 1→2 replicas when GPU utilization exceeded threshold
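
The HPA half of the test could be sketched as below. Object names are assumptions, as is encoding the "50% threshold" as an `AverageValue` target of 50 — for Pods-type custom metrics the autoscaling/v2 API takes an absolute per-pod target rather than a percentage, and GPU utilization is typically reported on a 0–100 scale:

```yaml
# Sketch of the HPA portion of hpa-gpu-test.yaml (names and threshold encoding assumed).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-gpu-test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-gpu-test
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization   # served via the custom metrics API
        target:
          type: AverageValue
          averageValue: "50"      # GPU utilization on a 0-100 scale
```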

Evidence Collection (collect-evidence.sh):

  • Added collect_hpa section with prometheus-adapter health, custom metrics API, GPU stress test, HPA scaling verification
  • Un-deprecated script (needed for behavioral tests alongside aicr validate --evidence-dir)

Testing

Verified on ktsetfavua-dgxc-k8s-aws-use1-non-prod EKS cluster:

  • gpu-burn drove GPU to 100% utilization / 575W power draw
  • HPA read gpu_utilization: 100 via custom metrics API
  • HPA scaled deployment from 1→2 replicas with event: "New size: 2; reason: pods metric gpu_utilization above target"

Risk Assessment

  • Low — Metrics config changes improve responsiveness, no breaking changes

Rollout notes: Existing bundles need to be regenerated.

Checklist

  • Tests pass locally (make test with -race)
  • I did not skip/disable tests to make CI green
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 force-pushed the feat/hpa-gpu-autoscaling-evidence branch 3 times, most recently from 5e963b1 to 872c859 on February 23, 2026 at 20:39
@yuanchen8911 force-pushed the feat/hpa-gpu-autoscaling-evidence branch from 872c859 to 462126c on February 23, 2026 at 21:01
@yuanchen8911 force-pushed the feat/hpa-gpu-autoscaling-evidence branch 2 times, most recently from 18973e2 to 9e8dbb7 on February 23, 2026 at 21:15
@mchmarny — This comment was marked as resolved.

Add HPA test manifest, evidence collection, and fix GPU metrics pipeline
for faster and correct HPA autoscaling based on custom GPU metrics.

Changes:
- Add hpa-gpu-test.yaml: Deployment with gpu-burn + HPA targeting
  gpu_utilization at 50% threshold
- Add collect_hpa section to collect-evidence.sh
- Fix DCGM ServiceMonitor: enable honorLabels so Prometheus preserves
  workload namespace/pod labels (required for per-pod HPA metrics)
- Reduce ServiceMonitor scrape interval from 60s to 30s
- Fix prometheus-adapter: use last_over_time(...[1m]) instead of
  avg_over_time(...[2m]) for faster metric response (~60s vs ~4min)
- Un-deprecate collect-evidence.sh (needed for behavioral tests)
- Update evidence index with pod_autoscaling: PASS

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 force-pushed the feat/hpa-gpu-autoscaling-evidence branch from 9e8dbb7 to d3a7e18 on February 23, 2026 at 21:30
@yuanchen8911 (Contributor, Author) commented Feb 23, 2026

Good feedback! Addressed all of it. PTAL.

  1. License header — Added Apache 2.0 header to hpa-gpu-test.yaml
  2. PR description vs implementation — Manifest uses nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1-ubuntu18.04 (not gpu-burn)
  3. kubectl exec -l bug — Fixed to select specific pod name via jsonpath
  4. Namespace deletion race — Changed sleep 5 to kubectl wait --for=delete --timeout=60s
  5. Verdict too lenient — Now requires actual scaling (hpa_scaled=true) for PASS
  6. No early exit for unhealthy pods — Added pod phase check (Failed/CrashLoopBackOff) in wait loop
  7. Local filesystem paths — Fixed capture function to strip REPO_ROOT from command display
  8. last_over_time scope — Kept for custom (per-pod HPA) metrics only; restored avg_over_time for external (dashboard/alerting) metrics with comment
  9. Scrape interval comment — Added tradeoff note for 30s vs 60s
  10. Evidence regenerated — Clean evidence with no leaked paths, correct nvidia-smi output
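
Item 8's split strategy could be sketched as two rule groups in the adapter config — the queries and placeholders are illustrative; only the last_over_time-for-custom / avg_over_time-for-external split and its rationale come from the discussion above:

```yaml
# Illustrative split of prometheus-adapter rules (other rule fields omitted).
rules:
  custom:
    # Per-pod HPA metrics: take the freshest sample so scaling reacts in ~60s.
    - metricsQuery: 'last_over_time(<<.Series>>{<<.LabelMatchers>>}[1m])'
  external:
    # Dashboard/alerting metrics: keep the 2-minute average to smooth transient spikes.
    - metricsQuery: 'avg_over_time(<<.Series>>{<<.LabelMatchers>>}[2m])'
```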

Re: KSV-0125 (nvcr.io untrusted registry) — opened discussion in comment. nvcr.io is NVIDIA's official container registry and should be trusted.

@mchmarny (Member) left a comment

9 of 10 review items addressed — nice cleanup. The capture path-stripping fix, tightened verdict logic, and split avg_over_time / last_over_time strategy are all well done.

One remaining nit: The PR description, collect-evidence.sh (line 722: "Deployment running gpu-burn"), and pod-autoscaling.md (line 17: same) still reference "gpu-burn" while the manifest actually uses nvcr.io/nvidia/k8s/cuda-sample:nbody-*. Not a blocker, but worth a quick text fix for consistency.

LGTM otherwise.

@mchmarny (Member) left a comment

/lgtm

@yuanchen8911 merged commit 811643c into NVIDIA:main on Feb 23, 2026
30 checks passed