feat(validator): self-contained gang scheduling conformance check #184

Merged
dims merged 1 commit into NVIDIA:main from dims:dims/gang-scheduling-conformance on Feb 23, 2026

dims (Collaborator) commented Feb 23, 2026

Summary

  • Make the gang-scheduling conformance check fully self-contained by programmatically creating test resources (PodGroup + 2 GPU pods with DRA ResourceClaims) instead of relying on pre-deployed manifests
  • Add GPU availability pre-flight check via ResourceSlices/ResourceClaims to fail fast when fewer than 2 GPUs are free, avoiding a 5-minute timeout
  • Add countAvailableGPUs() shared helper in conformance/helpers.go for reuse by other GPU-dependent checks
  • Add PodGroup create/delete RBAC for the validator service account
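
The pre-flight idea in the second and third bullets can be sketched as follows. This is a minimal illustration, not the PR's code: the actual signature of countAvailableGPUs() is not shown on this page, so the Device type, countFreeGPUs, and preflight below are hypothetical stand-ins, and the device lists are plain slices rather than ResourceSlice/ResourceClaim objects fetched from the API server. The assumption is that a GPU counts as free when a ResourceSlice advertises it and no ResourceClaim allocation references it.

```go
package main

import "fmt"

// Device identifies one GPU by driver, pool, and device name.
// Hypothetical, simplified stand-in for the fields carried by
// ResourceSlice devices and ResourceClaim allocation results.
type Device struct {
	Driver string
	Pool   string
	Name   string
}

// countFreeGPUs returns how many advertised devices are not allocated.
// advertised: devices listed across all ResourceSlices.
// allocated:  devices referenced by ResourceClaim allocations.
func countFreeGPUs(advertised, allocated []Device) int {
	inUse := make(map[Device]struct{}, len(allocated))
	for _, d := range allocated {
		inUse[d] = struct{}{}
	}
	seen := make(map[Device]struct{}, len(advertised))
	free := 0
	for _, d := range advertised {
		if _, dup := seen[d]; dup {
			continue // count each device once even if listed twice
		}
		seen[d] = struct{}{}
		if _, busy := inUse[d]; !busy {
			free++
		}
	}
	return free
}

// preflight fails fast when fewer than `required` GPUs are free,
// instead of letting the test pods sit unschedulable until a timeout.
func preflight(advertised, allocated []Device, required int) error {
	if free := countFreeGPUs(advertised, allocated); free < required {
		return fmt.Errorf("insufficient GPUs: need %d, found %d free", required, free)
	}
	return nil
}

func main() {
	// Illustrative data only; driver and pool names are assumptions.
	advertised := []Device{
		{"gpu.nvidia.com", "node-a", "gpu-0"},
		{"gpu.nvidia.com", "node-a", "gpu-1"},
		{"gpu.nvidia.com", "node-b", "gpu-0"},
	}
	allocated := []Device{
		{"gpu.nvidia.com", "node-a", "gpu-1"},
	}
	fmt.Println("free GPUs:", countFreeGPUs(advertised, allocated))
	if err := preflight(advertised, allocated, 2); err != nil {
		fmt.Println("pre-flight failed:", err)
		return
	}
	fmt.Println("pre-flight passed")
}
```

Keying the maps on the full (driver, pool, name) triple avoids treating two devices that share a name on different nodes as the same GPU.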

Test plan

  • Unit tests pass with race detector (8 test cases including insufficient GPUs, pod failure, missing deployments/CRDs)
  • Full conformance test suite passes
  • CI: lint, unit tests, build
  • CI: GPU workflow validates gang scheduling end-to-end

dims requested a review from a team as a code owner February 23, 2026 00:55
dims force-pushed the dims/gang-scheduling-conformance branch from 044d0a1 to d43c901 February 23, 2026 01:01
dims requested a review from a team as a code owner February 23, 2026 01:01
dims force-pushed the dims/gang-scheduling-conformance branch 2 times, most recently from ede7e01 to fffb7ce February 23, 2026 01:15
Make the gang-scheduling conformance check fully self-contained by
programmatically creating test resources instead of relying on
pre-deployed manifests. The check now:

1. Verifies KAI scheduler deployments and CRDs (unchanged)
2. Pre-flight: counts free GPUs via ResourceSlices/ResourceClaims
   and fails fast if fewer than 2 are available
3. Creates a PodGroup with 2 GPU test pods using DRA ResourceClaims
4. Waits for all pods to reach terminal state
5. Validates gang scheduling patterns (kai-scheduler, PodGroup labels,
   DRA resource claims, pod success)
6. Cleans up all test resources

Adds countAvailableGPUs() helper to conformance/helpers.go for reuse
by other GPU-dependent checks.

Remove redundant "Deploy gang scheduling test" and cleanup steps from
the GPU training CI workflow since the conformance check now handles
this end-to-end.
dims force-pushed the dims/gang-scheduling-conformance branch from fffb7ce to 23d3fa0 February 23, 2026 01:32
dims merged commit 2acf5d0 into NVIDIA:main Feb 23, 2026
16 checks passed