feat: enhance conformance evidence with gateway conditions, webhook test, and HPA scale-down by yuanchen8911 · Pull Request #205 · NVIDIA/aicr

yuanchen8911 · 2026-02-24T21:04:15Z

Summary

Enhance the evidence collection script and regenerate all evidence with additional checks, manifest embedding, and documentation updates.

Motivation / Context

Review of the Go-based conformance validator (PR #185/#187) identified additional checks that strengthen our evidence. This PR ports those improvements to the evidence collection script and regenerates all evidence.

Fixes: N/A
Related: #192

Type of Change

New feature (non-breaking change that adds functionality)
Documentation update

Component(s) Affected

Docs/examples (docs/, examples/)

Implementation Notes

Script enhancements (collect-evidence.sh):

Manifest embedding — Test manifests (DRA, gang scheduling, HPA) are now included inline in evidence docs for self-contained review
Gateway conditions — Verify GatewayClass.Accepted=True and Gateway.Programmed=True, not just resource existence
Webhook rejection test — Submit invalid DynamoGraphDeployment (empty spec), verify webhook denies it
HPA scale-down — After scale-up, cleanly delete GPU workload, deploy idle container, verify HPA scales back to minReplicas
HPA fix — Eliminate pod Error status during scale-down by deleting deployment before replacing
Path sanitization — capture function strips REPO_ROOT from command display
Namespace cleanup — Use kubectl wait --for=delete instead of sleep 5
Strict HPA verdict — Require actual scaling for PASS, fail fast on unhealthy pods
DRA manifests — Remove readOnlyRootFilesystem (blocks CDI device injection)
Naming — Replace gpu-burn references with CUDA N-Body Simulation
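
The gateway-condition and namespace-cleanup items above can be sketched roughly as follows. This is a minimal illustration, not the code in collect-evidence.sh: the function names, arguments, and default timeouts are hypothetical, and only the `kubectl wait` conditions named in the notes (Accepted, Programmed, delete) come from this PR.

```shell
#!/usr/bin/env bash
# Sketch only: function names, arguments, and timeouts are assumptions.

# check_gateway_ready succeeds only when the GatewayClass reports Accepted=True
# and the Gateway reports Programmed=True; resource existence alone is not enough.
check_gateway_ready() {
  local gateway_class="$1" gateway="$2" namespace="$3" timeout="${4:-60s}"
  kubectl wait --for=condition=Accepted "gatewayclass/${gateway_class}" \
    --timeout="${timeout}" &&
    kubectl wait --for=condition=Programmed "gateway/${gateway}" \
      -n "${namespace}" --timeout="${timeout}"
}

# delete_namespace_and_wait requests deletion, then blocks until the namespace
# is actually gone instead of sleeping for a fixed interval.
delete_namespace_and_wait() {
  local namespace="$1" timeout="${2:-120s}"
  kubectl delete namespace "${namespace}" --wait=false
  kubectl wait --for=delete "namespace/${namespace}" --timeout="${timeout}"
}
```

Waiting on a condition or on deletion ties the script to observed cluster state, so a slow controller lengthens the wait rather than producing a false result.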

Evidence regenerated (all 8 PASS):

Consistent format, no leaked paths, no sensitive info
Test manifests embedded inline in DRA, gang scheduling, and HPA evidence
Sanitized AMI ID in cluster-autoscaling evidence
Fixed broken table in index.md
Updated README with cluster-autoscaling in directory structure, usage, and comparison table
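
The "no leaked paths" property depends on the capture function stripping REPO_ROOT before a command is displayed. A minimal sketch of that idea, assuming a helper named `sanitize_cmd` and a `$ ` prompt prefix (both hypothetical; the real capture function may differ):

```shell
#!/usr/bin/env bash
# Sketch only: sanitize_cmd and the "$ " display format are assumptions.

# REPO_ROOT is assumed to hold the absolute path of the local checkout.
REPO_ROOT="${REPO_ROOT:-$(pwd)}"

# sanitize_cmd prints its arguments with every literal occurrence of
# "$REPO_ROOT/" removed, so absolute local paths become repo-relative.
sanitize_cmd() {
  local display="$*"
  local prefix="${REPO_ROOT%/}/"
  # Quoting the pattern makes bash treat it as a literal string, not a glob.
  printf '%s\n' "${display//"${prefix}"/}"
}

# capture shows the sanitized command line, then runs the real command unchanged.
capture() {
  printf '$ %s\n' "$(sanitize_cmd "$@")"
  "$@"
}
```

Only the displayed command is rewritten; the command itself still runs with its original absolute paths.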

Testing

Full evidence collection run on ktsetfavua EKS cluster. All 8 sections completed successfully.

Risk Assessment

Low — Documentation and evidence script changes only

Checklist

I did not skip/disable tests to make CI green
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

…and HPA scale-down tests

Enhance the evidence collection script and regenerate all evidence with additional checks inspired by the Go-based conformance validator:

Script enhancements:
- Gateway: verify GatewayClass Accepted and Gateway Programmed conditions (not just existence)
- Robust operator: add webhook rejection test (submit invalid CR, verify webhook denies it)
- HPA: add scale-down verification after scale-up (replace GPU workload with idle container, verify HPA scales back to minReplicas)
- HPA: fix pod Error status during scale-down by deleting deployment cleanly before creating idle replacement
- Fix capture function to strip absolute paths from command display
- Fix namespace deletion race with kubectl wait --for=delete
- Tighten HPA verdict to require actual scaling for PASS
- Add early exit for unhealthy pods in HPA wait loop
- Remove readOnlyRootFilesystem from DRA test manifests (blocks CDI device injection)
- Replace gpu-burn references with CUDA N-Body Simulation
- Sanitize AMI ID in cluster-autoscaling evidence

Evidence regenerated:
- All 8 conformance requirements: PASS
- No leaked local paths or sensitive information
- Consistent format across all evidence documents

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
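
The tightened HPA verdict could look something like the sketch below. It is illustrative only: `hpa_verdict` and `wait_for_replicas` are hypothetical names, the attempt count and polling interval are placeholders, and the early exit for unhealthy pods mentioned in the commit message is not shown.

```shell
#!/usr/bin/env bash
# Sketch only: names, arguments, and defaults are assumptions.

# hpa_verdict MIN PEAK FINAL prints PASS only if the replica count actually
# rose above MIN during scale-up and returned to MIN after scale-down.
hpa_verdict() {
  local min="$1" peak="$2" final="$3"
  if (( peak > min && final == min )); then
    echo "PASS"
  else
    echo "FAIL"
    return 1
  fi
}

# wait_for_replicas polls the HPA's reported currentReplicas until it equals
# the wanted value, giving up after a bounded number of attempts.
wait_for_replicas() {
  local hpa="$1" namespace="$2" want="$3" attempts="${4:-30}" interval="${5:-10}"
  local current i
  for (( i = 0; i < attempts; i++ )); do
    current="$(kubectl get hpa "${hpa}" -n "${namespace}" \
      -o jsonpath='{.status.currentReplicas}')"
    if [[ "${current}" == "${want}" ]]; then
      return 0
    fi
    sleep "${interval}"
  done
  return 1
}
```

Requiring `peak > min` means a run in which the HPA never scaled cannot be reported as PASS just because the final replica count happens to equal minReplicas.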

dims

LGTM

yuanchen8911 requested a review from a team as a code owner February 24, 2026 21:04

yuanchen8911 added the enhancement, area/tests, and area/docs labels Feb 24, 2026

github-actions bot added size/XL and removed area/tests labels Feb 24, 2026

yuanchen8911 requested a review from dims February 24, 2026 21:09

dims previously approved these changes Feb 24, 2026

yuanchen8911 dismissed dims’s stale review via ce0704e February 24, 2026 21:49

yuanchen8911 requested a review from a team as a code owner February 24, 2026 21:49

yuanchen8911 force-pushed the feat/enhance-evidence-collection branch from ce0704e to 4d0af90 Compare February 24, 2026 22:03

yuanchen8911 requested a review from dims February 24, 2026 23:49

yuanchen8911 force-pushed the feat/enhance-evidence-collection branch 4 times, most recently from ba9755a to 1b7c997 Compare February 25, 2026 00:13

yuanchen8911 force-pushed the feat/enhance-evidence-collection branch from 1b7c997 to dfe36cc Compare February 25, 2026 00:15

yuanchen8911 requested a review from mchmarny February 25, 2026 00:17

Merge branch 'main' into feat/enhance-evidence-collection

ae44654

yuanchen8911 enabled auto-merge (squash) February 25, 2026 00:39

dims approved these changes Feb 25, 2026

yuanchen8911 merged commit 1f1758a into NVIDIA:main Feb 25, 2026
15 checks passed

yuanchen8911 merged 2 commits into NVIDIA:main from yuanchen8911:feat/enhance-evidence-collection
