feat(validator): self-contained DRA conformance check with EKS overlays #182

Merged
dims merged 2 commits into NVIDIA:main from dims:dims/self-contained-dra-conformance-check
Feb 22, 2026

dims commented Feb 22, 2026

Summary

  • Add conformance validation checks to EKS overlays (eks, h100-eks-ubuntu-inference-dynamo, h100-eks-ubuntu-training)
  • Add EKS test cases to conformance recipe invariant tests
  • Rewrite secure-accelerator-access check to be self-contained: programmatically creates DRA test namespace, ResourceClaim, and GPU pod, waits for completion, validates DRA access patterns, and cleans up
  • Expand ClusterRole RBAC permissions for DRA check (create/delete namespaces, pods, resourceclaims)
  • Add DRATestPodTimeout constant (5min) to handle image pull latency

Test plan

  • Unit tests pass: go test -v ./pkg/validator/checks/conformance/... -run TestCheckSecureAcceleratorAccess
  • Recipe invariant tests pass: go test -v ./pkg/recipe/... -run TestConformanceRecipeInvariants
  • Full test suite passes: make test (73.5% coverage)
  • Live EKS cluster validation: TestSecureAcceleratorAccess PASS (3.60s) on H100 node with DRA

Add conformance validation checks to EKS overlays:
- eks.yaml: 5 base checks (platform-health, gpu-operator-health, dra-support,
  accelerator-metrics, ai-service-metrics)
- h100-eks-ubuntu-inference-dynamo.yaml: 10 checks (adds inference-gateway,
  robust-controller, secure-accelerator-access, pod-autoscaling,
  cluster-autoscaling)
- h100-eks-ubuntu-training.yaml: 9 checks (adds gang-scheduling,
  robust-controller, pod-autoscaling, cluster-autoscaling)
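
The check counts above can be sanity-checked with a small sketch. It assumes each workload overlay inherits the five base checks from eks.yaml and appends its own; whether the real overlays inherit or list every check explicitly is not stated here. The names `baseChecks`, `overlayAdds`, and `effectiveChecks` are hypothetical and not taken from the repository.

```go
package main

import "fmt"

// baseChecks mirrors the five checks listed for eks.yaml.
var baseChecks = []string{
	"platform-health",
	"gpu-operator-health",
	"dra-support",
	"accelerator-metrics",
	"ai-service-metrics",
}

// overlayAdds lists what each workload overlay adds on top of the base.
// ASSUMPTION: overlays inherit the base list; this is an illustration only.
var overlayAdds = map[string][]string{
	"h100-eks-ubuntu-inference-dynamo": {
		"inference-gateway",
		"robust-controller",
		"secure-accelerator-access",
		"pod-autoscaling",
		"cluster-autoscaling",
	},
	"h100-eks-ubuntu-training": {
		"gang-scheduling",
		"robust-controller",
		"pod-autoscaling",
		"cluster-autoscaling",
	},
}

// effectiveChecks (hypothetical helper) returns the base checks followed by
// the overlay's additions, de-duplicated with order preserved.
func effectiveChecks(overlay string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, list := range [][]string{baseChecks, overlayAdds[overlay]} {
		for _, c := range list {
			if !seen[c] {
				seen[c] = true
				out = append(out, c)
			}
		}
	}
	return out
}

func main() {
	for _, name := range []string{"eks", "h100-eks-ubuntu-inference-dynamo", "h100-eks-ubuntu-training"} {
		fmt.Printf("%s: %d checks\n", name, len(effectiveChecks(name)))
	}
}
```

Under that assumption the totals come out to 5, 10, and 9, matching the counts in the commit message.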

Add EKS test cases to conformance recipe invariant tests with conditional
DRA constraint assertion (EKS training overlay omits version constraint).

dims force-pushed the dims/self-contained-dra-conformance-check branch 3 times, most recently from a93f3d6 to 2a1c576 on February 22, 2026 21:29

Rewrite CheckSecureAcceleratorAccess to programmatically create DRA test
resources instead of expecting pre-deployed pods. The check now creates a
namespace, ResourceClaim, and GPU test pod, waits for completion, validates
DRA access patterns, and cleans up.
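
The "waits for completion" step can be sketched with the standard library alone. This is not the PR's implementation: `phaseFunc` is a hypothetical stand-in for fetching the pod from the API server, and `waitForTerminal` and `runScripted` are illustrative names. In the real check the timeout would presumably be the 5-minute `DRATestPodTimeout`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type podPhase string

const (
	phasePending   podPhase = "Pending"
	phaseSucceeded podPhase = "Succeeded"
	phaseFailed    podPhase = "Failed"
)

// phaseFunc is a hypothetical stand-in for a pod lookup against the cluster.
type phaseFunc func(ctx context.Context) (podPhase, error)

// waitForTerminal polls get until the pod reaches Succeeded or Failed,
// or until timeout elapses, whichever comes first.
func waitForTerminal(ctx context.Context, get phaseFunc, interval, timeout time.Duration) (podPhase, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		phase, err := get(ctx)
		if err != nil {
			return "", err
		}
		if phase == phaseSucceeded || phase == phaseFailed {
			return phase, nil
		}
		select {
		case <-ctx.Done():
			return phase, fmt.Errorf("pod still %s: %w", phase, ctx.Err())
		case <-ticker.C:
		}
	}
}

// runScripted replays a fixed, non-empty phase sequence (repeating the last
// entry once exhausted) through waitForTerminal with a 1ms poll interval.
func runScripted(phases []podPhase, timeout time.Duration) (podPhase, error) {
	i := 0
	get := func(context.Context) (podPhase, error) {
		p := phases[i]
		if i < len(phases)-1 {
			i++
		}
		return p, nil
	}
	return waitForTerminal(context.Background(), get, time.Millisecond, timeout)
}

func main() {
	phase, err := runScripted([]podPhase{phasePending, phasePending, phaseSucceeded}, time.Second)
	fmt.Println(phase, err)

	// A pod that never leaves Pending trips the timeout instead.
	_, err = runScripted([]podPhase{phasePending}, 20*time.Millisecond)
	fmt.Println(err)
}
```

Wrapping `ctx.Err()` with `%w` lets a caller tell a timeout apart from an API error using `errors.Is(err, context.DeadlineExceeded)`.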

- Create dra-test namespace, ResourceClaim, and Pod programmatically
- Poll for pod terminal state with 5-minute timeout (image pull)
- Validate: resourceClaims present, no device plugin, no hostPath,
  ResourceClaim exists, pod succeeded
- Cleanup: delete pod and claim (skip namespace to avoid finalizer hangs)
- Add DRATestPodTimeout constant to pkg/defaults
- Expand ClusterRole RBAC: create/delete for namespaces, pods, resourceclaims
- Update unit tests with fake client reactors for self-contained flow

Verified on live EKS cluster with H100 GPU: TestSecureAcceleratorAccess PASS (3.60s)
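
The five validations listed above can be illustrated on a simplified pod model. This sketch deliberately avoids the Kubernetes API types: `podView`, `container`, `volume`, and `validateDRAAccess` are hypothetical names, and treating a `nvidia.com/gpu` resource limit as the sign of device-plugin usage is an assumption about how "no device plugin" is detected.

```go
package main

import "fmt"

// container and volume are simplified stand-ins for the pod spec fields
// the check cares about. They are not the Kubernetes API types.
type container struct {
	Name   string
	Limits map[string]string // resource name -> quantity
}

type volume struct {
	Name     string
	HostPath string // empty when the volume is not a hostPath
}

type podView struct {
	ResourceClaims []string
	Containers     []container
	Volumes        []volume
	Phase          string
}

// validateDRAAccess returns one message per violated expectation.
// An empty result means the pod reached its GPU through DRA only.
func validateDRAAccess(p podView, claimExists bool) []string {
	var problems []string
	if len(p.ResourceClaims) == 0 {
		problems = append(problems, "pod declares no resourceClaims")
	}
	for _, c := range p.Containers {
		// ASSUMPTION: a nvidia.com/gpu limit indicates device-plugin allocation.
		if _, ok := c.Limits["nvidia.com/gpu"]; ok {
			problems = append(problems, fmt.Sprintf("container %q requests a device-plugin GPU", c.Name))
		}
	}
	for _, v := range p.Volumes {
		if v.HostPath != "" {
			problems = append(problems, fmt.Sprintf("volume %q mounts hostPath %s", v.Name, v.HostPath))
		}
	}
	if !claimExists {
		problems = append(problems, "ResourceClaim not found")
	}
	if p.Phase != "Succeeded" {
		problems = append(problems, fmt.Sprintf("pod phase is %q, want Succeeded", p.Phase))
	}
	return problems
}

func main() {
	good := podView{
		ResourceClaims: []string{"gpu"},
		Containers:     []container{{Name: "gpu-test"}},
		Phase:          "Succeeded",
	}
	fmt.Println(len(validateDRAAccess(good, true)))

	bad := podView{
		Containers: []container{{Name: "gpu-test", Limits: map[string]string{"nvidia.com/gpu": "1"}}},
		Volumes:    []volume{{Name: "dev", HostPath: "/dev"}},
		Phase:      "Failed",
	}
	for _, msg := range validateDRAAccess(bad, false) {
		fmt.Println(msg)
	}
}
```

Collecting every problem instead of returning on the first one gives a single run enough detail to diagnose a misconfigured pod.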

dims force-pushed the dims/self-contained-dra-conformance-check branch from 2a1c576 to c6dbeb7 on February 22, 2026 21:48


dims commented Feb 22, 2026

xref: #141

dims merged commit 770d132 into NVIDIA:main on Feb 22, 2026
33 checks passed