feat: enhance conformance evidence with gateway conditions, webhook test, and HPA scale-down #205

Merged
yuanchen8911 merged 2 commits into NVIDIA:main from yuanchen8911:feat/enhance-evidence-collection on Feb 25, 2026

@yuanchen8911 yuanchen8911 commented Feb 24, 2026

Summary

Enhance the evidence collection script and regenerate all evidence with additional checks, manifest embedding, and documentation updates.

Motivation / Context

Review of the Go-based conformance validator (PR #185/#187) identified additional checks that strengthen our evidence. This PR ports those improvements to the evidence collection script and regenerates all evidence.

Fixes: N/A
Related: #192

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Documentation update

Component(s) Affected

  • Docs/examples (docs/, examples/)

Implementation Notes

Script enhancements (collect-evidence.sh):

  • Manifest embedding — Test manifests (DRA, gang scheduling, HPA) are now included inline in evidence docs for self-contained review
  • Gateway conditions — Verify GatewayClass.Accepted=True and Gateway.Programmed=True, not just resource existence
  • Webhook rejection test — Submit invalid DynamoGraphDeployment (empty spec), verify webhook denies it
  • HPA scale-down — After scale-up, cleanly delete GPU workload, deploy idle container, verify HPA scales back to minReplicas
  • HPA fix — Eliminate pod Error status during scale-down by deleting deployment before replacing
  • Path sanitization — capture function strips REPO_ROOT from command display
  • Namespace cleanup — Use kubectl wait --for=delete instead of sleep 5
  • Strict HPA verdict — Require actual scaling for PASS, fail fast on unhealthy pods
  • DRA manifests — Remove readOnlyRootFilesystem (blocks CDI device injection)
  • Naming — Replace gpu-burn references with CUDA N-Body Simulation
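
As a rough illustration of the gateway-conditions check, the sketch below reads a named status condition with `kubectl get ... -o jsonpath` and passes only when that condition is `True`. The helper name `check_condition` and the argument order are assumptions made for this example; the logic in collect-evidence.sh may be structured differently.

```shell
# Hypothetical helper (not taken from collect-evidence.sh).
# Usage: check_condition KIND NAME CONDITION_TYPE [NAMESPACE]
check_condition() {
  local kind="$1" name="$2" type="$3" ns="${4:-}"
  # JSONPath filter: select the condition whose .type matches, print its .status
  local jp="{.status.conditions[?(@.type==\"${type}\")].status}"
  local status
  if [ -n "$ns" ]; then
    status="$(kubectl get "$kind" "$name" -n "$ns" -o jsonpath="$jp" 2>/dev/null)"
  else
    status="$(kubectl get "$kind" "$name" -o jsonpath="$jp" 2>/dev/null)"
  fi
  if [ "$status" = "True" ]; then
    echo "PASS: ${kind}/${name} ${type}=True"
  else
    # An empty status means the resource or the condition was not found
    echo "FAIL: ${kind}/${name} ${type}=${status:-<missing>}"
    return 1
  fi
}
```

With placeholder resource names, a cluster-scoped GatewayClass would be checked as `check_condition gatewayclass my-class Accepted`, and a namespaced Gateway as `check_condition gateway my-gateway Programmed my-namespace`. A resource that exists but has not reached the condition fails, which is the difference from an existence-only check.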

Evidence regenerated (all 8 PASS):

  • Consistent format, no leaked paths, no sensitive info
  • Test manifests embedded inline in DRA, gang scheduling, and HPA evidence
  • Sanitized AMI ID in cluster-autoscaling evidence
  • Fixed broken table in index.md
  • Updated README with cluster-autoscaling in directory structure, usage, and comparison table
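
The "no leaked paths" property depends on the capture function stripping REPO_ROOT from the displayed command. A minimal sketch of that idea follows. The function bodies and the `sanitize_cmd` helper are hypothetical stand-ins, not the script's actual code, and only the echoed command line is sanitized here, not the command's output.

```shell
# Hypothetical stand-in for the script's capture helper.
REPO_ROOT="${REPO_ROOT:-$(pwd)}"

# Print the arguments as one line, with every occurrence of REPO_ROOT replaced by "."
sanitize_cmd() {
  local rest="$*" out=""
  # Nothing to strip if REPO_ROOT is empty (also avoids looping forever)
  if [ -z "$REPO_ROOT" ]; then
    printf '%s\n' "$rest"
    return 0
  fi
  while :; do
    case "$rest" in
      *"$REPO_ROOT"*)
        out="${out}${rest%%"$REPO_ROOT"*}."   # text before the first match, then "."
        rest="${rest#*"$REPO_ROOT"}"          # continue after the first match
        ;;
      *) break ;;
    esac
  done
  printf '%s\n' "${out}${rest}"
}

# Show the sanitized command, then run the real one and record its output
capture() {
  printf '$ %s\n' "$(sanitize_cmd "$@")"
  "$@" 2>&1
}
```

The command still runs with its real absolute paths; only the line recorded in the evidence document is rewritten relative to the repository root.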

Testing

Ran the full evidence collection on the ktsetfavua EKS cluster. All 8 sections completed successfully.

Risk Assessment

  • Low — Documentation and evidence script changes only

Checklist

  • I did not skip/disable tests to make CI green
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner February 24, 2026 21:04
@yuanchen8911 yuanchen8911 requested a review from dims February 24, 2026 21:09
dims previously approved these changes Feb 24, 2026
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner February 24, 2026 21:49
@yuanchen8911 yuanchen8911 force-pushed the feat/enhance-evidence-collection branch from ce0704e to 4d0af90 Compare February 24, 2026 22:03
@yuanchen8911 yuanchen8911 requested a review from dims February 24, 2026 23:49
@yuanchen8911 yuanchen8911 force-pushed the feat/enhance-evidence-collection branch 4 times, most recently from ba9755a to 1b7c997 Compare February 25, 2026 00:13
…and HPA scale-down tests

Enhance the evidence collection script and regenerate all evidence with
additional checks inspired by the Go-based conformance validator:

Script enhancements:
- Gateway: verify GatewayClass Accepted and Gateway Programmed conditions
  (not just existence)
- Robust operator: add webhook rejection test (submit invalid CR, verify
  webhook denies it)
- HPA: add scale-down verification after scale-up (replace GPU workload
  with idle container, verify HPA scales back to minReplicas)
- HPA: fix pod Error status during scale-down by deleting deployment
  cleanly before creating idle replacement
- Fix capture function to strip absolute paths from command display
- Fix namespace deletion race with kubectl wait --for=delete
- Tighten HPA verdict to require actual scaling for PASS
- Add early exit for unhealthy pods in HPA wait loop
- Remove readOnlyRootFilesystem from DRA test manifests (blocks CDI
  device injection)
- Replace gpu-burn references with CUDA N-Body Simulation
- Sanitize AMI ID in cluster-autoscaling evidence

Evidence regenerated:
- All 8 conformance requirements: PASS
- No leaked local paths or sensitive information
- Consistent format across all evidence documents

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
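
The tightened HPA verdict described in the commit message above can be sketched as a small decision function: PASS requires an observed scale-up and a return to minReplicas. The name `hpa_verdict` and its four arguments are assumptions for this example, not the script's actual interface.

```shell
# Hypothetical helper: decide the HPA verdict from observed replica counts.
# Usage: hpa_verdict BASELINE PEAK FINAL MIN_REPLICAS
hpa_verdict() {
  local baseline="$1" peak="$2" final="$3" min_replicas="$4"
  # Require actual scaling: the peak must exceed the starting replica count
  if [ "$peak" -le "$baseline" ]; then
    echo "FAIL: no scale-up observed (baseline=${baseline}, peak=${peak})"
    return 1
  fi
  # Require scale-down: the final count must be back at minReplicas
  if [ "$final" -ne "$min_replicas" ]; then
    echo "FAIL: did not scale back to minReplicas (final=${final}, minReplicas=${min_replicas})"
    return 1
  fi
  echo "PASS: scaled ${baseline} -> ${peak} -> ${final}"
}
```

In a real run, the replica counts might be sampled with something like `kubectl get hpa <name> -o jsonpath='{.status.currentReplicas}'` at each stage; how the script actually samples them is not shown in this PR description.
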
@yuanchen8911 yuanchen8911 force-pushed the feat/enhance-evidence-collection branch from 1b7c997 to dfe36cc Compare February 25, 2026 00:15
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) February 25, 2026 00:39
@dims dims left a comment

LGTM

@yuanchen8911 yuanchen8911 merged commit 1f1758a into NVIDIA:main Feb 25, 2026
15 checks passed