feat(validator): self-contained DRA conformance check with EKS overlays #182
Merged
dims merged 2 commits into NVIDIA:main from Feb 22, 2026
Add conformance validation checks to EKS overlays:
- eks.yaml: 5 base checks (platform-health, gpu-operator-health, dra-support, accelerator-metrics, ai-service-metrics)
- h100-eks-ubuntu-inference-dynamo.yaml: 10 checks (adds inference-gateway, robust-controller, secure-accelerator-access, pod-autoscaling, cluster-autoscaling)
- h100-eks-ubuntu-training.yaml: 9 checks (adds gang-scheduling, robust-controller, pod-autoscaling, cluster-autoscaling)

Add EKS test cases to conformance recipe invariant tests with conditional DRA constraint assertion (EKS training overlay omits version constraint).
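The check counts in this commit message can be pictured as a base set plus per-overlay additions. The sketch below is only an illustration of that arithmetic (5 + 5 = 10, 5 + 4 = 9). It assumes the leaf overlays extend the `eks.yaml` base set, which the counts suggest but the message does not state outright, and `resolveChecks`, `baseChecks`, and `overlayExtras` are hypothetical names, not identifiers from the repository.

```go
package main

import "fmt"

// baseChecks mirrors the five checks listed for eks.yaml in the commit message.
var baseChecks = []string{
	"platform-health", "gpu-operator-health", "dra-support",
	"accelerator-metrics", "ai-service-metrics",
}

// overlayExtras lists what each leaf overlay adds on top of the base set.
// Assumption: leaf overlays inherit the base checks rather than restating them.
var overlayExtras = map[string][]string{
	"h100-eks-ubuntu-inference-dynamo.yaml": {
		"inference-gateway", "robust-controller", "secure-accelerator-access",
		"pod-autoscaling", "cluster-autoscaling",
	},
	"h100-eks-ubuntu-training.yaml": {
		"gang-scheduling", "robust-controller",
		"pod-autoscaling", "cluster-autoscaling",
	},
}

// resolveChecks (hypothetical helper) returns the base checks followed by the
// overlay's extras and rejects unknown overlays or duplicate check names.
func resolveChecks(overlay string) ([]string, error) {
	extras, ok := overlayExtras[overlay]
	if !ok {
		return nil, fmt.Errorf("unknown overlay %q", overlay)
	}
	seen := make(map[string]bool)
	var out []string
	for _, group := range [][]string{baseChecks, extras} {
		for _, c := range group {
			if seen[c] {
				return nil, fmt.Errorf("overlay %q lists check %q twice", overlay, c)
			}
			seen[c] = true
			out = append(out, c)
		}
	}
	return out, nil
}

func main() {
	for _, name := range []string{
		"h100-eks-ubuntu-inference-dynamo.yaml",
		"h100-eks-ubuntu-training.yaml",
	} {
		checks, err := resolveChecks(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %d checks\n", name, len(checks))
	}
}
```

An invariant test over the real recipes would presumably assert the same totals against the parsed overlays rather than a hard-coded map.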
Rewrite CheckSecureAcceleratorAccess to programmatically create DRA test resources instead of expecting pre-deployed pods. The check now creates a namespace, ResourceClaim, and GPU test pod, waits for completion, validates DRA access patterns, and cleans up.
- Create dra-test namespace, ResourceClaim, and Pod programmatically
- Poll for pod terminal state with 5-minute timeout (image pull)
- Validate: resourceClaims present, no device plugin, no hostPath, ResourceClaim exists, pod succeeded
- Cleanup: delete pod and claim (skip namespace to avoid finalizer hangs)
- Add DRATestPodTimeout constant to pkg/defaults
- Expand ClusterRole RBAC: create/delete for namespaces, pods, resourceclaims
- Update unit tests with fake client reactors for self-contained flow

Verified on live EKS cluster with H100 GPU: TestSecureAcceleratorAccess PASS (3.60s)
xref: #141
Summary
- Rewrite the `secure-accelerator-access` check to be self-contained: it programmatically creates a DRA test namespace, ResourceClaim, and GPU pod, waits for completion, validates DRA access patterns, and cleans up
- Add a `DRATestPodTimeout` constant (5 min) to handle image pull latency

Test plan
- `go test -v ./pkg/validator/checks/conformance/... -run TestCheckSecureAcceleratorAccess`
- `go test -v ./pkg/recipe/... -run TestConformanceRecipeInvariants`
- `make test` (73.5% coverage)
- `TestSecureAcceleratorAccess PASS (3.60s)` on H100 node with DRA