Add gpu-e2e workflow by dims · Pull Request #69 · NVIDIA/nvkind

dims · 2026-04-20T15:45:59Z

First slice of GPU end-to-end CI for nvkind. One workflow, two matrix jobs on NVIDIA self-hosted runners:

- linux-amd64-gpu-t4-latest-1 (T4, amd64)
- linux-arm64-gpu-l4-latest-1 (L4, arm64)

Three scenarios per matrix job:

- S1: default cluster lifecycle. Runs nvkind cluster create, then checks the node count, RuntimeClass presence, and nvkind cluster print-gpus.
- S2: GPU Operator (minimal mode) + nvidia-smi pod. Installs nvidia/gpu-operator with driver/toolkit/DCGM disabled and NFD enabled, waits for the device-plugin daemonset rollout, confirms that nvidia.com/gpu capacity is advertised, and runs a pod that execs nvidia-smi.
- S3: DRA driver + ResourceClaim + nvidia-smi pod. amd64 only. Installs nvidia/nvidia-dra-driver-gpu v25.12.0 into a cluster configured via hack/ci/templates/dra.yaml.tmpl (DynamicResourceAllocation feature gate on control-plane + kubelet, enable_cdi in containerd), applies a ResourceClaimTemplate, and runs a pod that execs nvidia-smi via the claim.
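The S2 capacity check could be sketched roughly as follows. This is a minimal illustration and not the workflow's actual step: it assumes the input is the output of `kubectl get nodes -o json`, and the helper name `gpu_capacity_by_node` is hypothetical.

```python
import json


def gpu_capacity_by_node(nodes_json: str) -> dict[str, int]:
    """Map each node name to its advertised nvidia.com/gpu capacity (0 if absent)."""
    nodes = json.loads(nodes_json).get("items", [])
    result = {}
    for node in nodes:
        name = node["metadata"]["name"]
        capacity = node.get("status", {}).get("capacity", {})
        # Extended-resource quantities are serialized as strings, e.g. "1".
        result[name] = int(capacity.get("nvidia.com/gpu", "0"))
    return result


def any_gpu_advertised(nodes_json: str) -> bool:
    """True if at least one node advertises a non-zero GPU capacity."""
    return any(count > 0 for count in gpu_capacity_by_node(nodes_json).values())
```

A CI step would typically poll a check like this until it passes or a timeout expires, since the device plugin registers the resource shortly after its daemonset becomes ready.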

Triggers:

- push: main, pull-request/<N> (copy-pr-bot mirror), gpu-ci/**
- pull_request [labeled] with run-gpu-tests
- schedule: 06:00 UTC daily
- workflow_dispatch

A check-paths job protects GPU runner minutes: the GPU jobs run only when the PR diff against main touches workflow, CI, or source paths.
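The path gate can be pictured as a simple pattern match over the list of changed files. The sketch below is an assumption-laden illustration: the pattern list and the function name `should_run_gpu_jobs` are hypothetical, and the real check-paths job may watch a different set of paths. The changed-file list is assumed to come from something like `git diff --name-only origin/main...HEAD`.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; the actual check-paths job may use a different set.
WATCHED_PATTERNS = (
    ".github/workflows/gpu-e2e.yaml",
    "hack/ci/*",
    "*.go",
    "go.mod",
    "go.sum",
)


def should_run_gpu_jobs(changed_files, patterns=WATCHED_PATTERNS) -> bool:
    """Return True if any changed file matches one of the watched patterns."""
    # fnmatchcase's "*" also matches "/", so "*.go" covers files in nested packages.
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )
```

With these illustrative patterns, a documentation-only diff skips the GPU jobs, while any Go source or hack/ci change triggers them.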

cdesiniotis

Thanks @dims! This is great. I left some questions for my understanding.

Add a label-gated GPU e2e workflow for nvkind that runs on both NVIDIA self-hosted runner pools (linux-amd64-gpu-t4-latest-1 and linux-arm64-gpu-l4-latest-1).

Three scenarios per matrix job:

S1 default cluster lifecycle: nvkind cluster create, node count, RuntimeClass presence, and a set-equality check between the UUIDs reported by `nvkind cluster print-gpus` (JSON) and `nvidia-smi --query-gpu=uuid` on the host.

S2 GPU Operator (minimal mode) + nvidia-smi pod: installs `nvidia/gpu-operator` pinned to v26.3.1 with driver/toolkit/DCGM disabled and NFD enabled (matches aicr's proven stack), waits for the nvidia-device-plugin daemonset rollout, confirms `nvidia.com/gpu` capacity is advertised, and runs a pod that execs `nvidia-smi`.

S3 DRA driver + ResourceClaim + nvidia-smi pod: amd64 only. Installs `nvidia/nvidia-dra-driver-gpu` v25.12.0 into a cluster configured via hack/ci/templates/dra.yaml.tmpl (DynamicResourceAllocation feature gate across control-plane and kubelet; enable_cdi in containerd), runs a pod backed by a ResourceClaimTemplate, asserts the pod sees exactly one GPU from `nvidia-smi -L`, and asserts DRA actually engaged via `pod.status.resourceClaimStatuses[name=="gpu"].resourceClaimName` (the claim itself is pod-scoped and GC'd after Succeeded, so checking the status is more reliable than querying the claim).

Triggers:
- push: main, pull-request/<N> (copy-pr-bot mirror), gpu-ci/**
- pull_request [labeled] with `run-gpu-tests`
- schedule: 06:00 UTC daily
- workflow_dispatch

A `check-paths` gate protects GPU runner minutes by only running when workflow/CI/source paths change in the PR diff against main.

Artifacts collected on every run: kind export logs, per-cluster pod list + events, docker daemon.json, nvidia-ctk config. Retained 7 days.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
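The S1 UUID comparison and the S3 assertions described in the commit message could look roughly like the sketch below. It is illustrative only and rests on several assumptions: the JSON schema of `nvkind cluster print-gpus` is not shown in this PR, so the code simply collects every value stored under a key named `uuid` (case-insensitive); the host side is assumed to be invoked as `nvidia-smi --query-gpu=uuid --format=csv,noheader`, giving one UUID per line; and all function names are hypothetical.

```python
import json


def collect_uuids(obj) -> set[str]:
    """Recursively gather string values stored under any key named 'uuid'."""
    found = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key.lower() == "uuid" and isinstance(value, str):
                found.add(value.strip())
            else:
                found |= collect_uuids(value)
    elif isinstance(obj, list):
        for item in obj:
            found |= collect_uuids(item)
    return found


def uuids_match(print_gpus_json: str, smi_output: str) -> bool:
    """S1: set equality between cluster-reported and host-reported GPU UUIDs."""
    cluster = collect_uuids(json.loads(print_gpus_json))
    host = {line.strip() for line in smi_output.splitlines() if line.strip()}
    # An empty host set would make the comparison vacuous, so treat it as failure.
    return bool(host) and cluster == host


def claim_name(pod: dict, entry: str = "gpu"):
    """S3: return resourceClaimName for the named pod claim entry, or None."""
    for status in pod.get("status", {}).get("resourceClaimStatuses", []):
        if status.get("name") == entry:
            return status.get("resourceClaimName")
    return None


def gpu_count(smi_list_output: str) -> int:
    """S3: count the 'GPU <n>: ...' lines printed by nvidia-smi -L."""
    return sum(1 for line in smi_list_output.splitlines() if line.startswith("GPU "))
```

Reading the claim name from the pod status, as the commit message notes, keeps working after the pod-scoped claim has been garbage-collected, because the pod object still records which claim was generated for it.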

dims force-pushed the gpu-e2e-phase-1-v2 branch from 7a38e5b to c9e4224 on April 20, 2026 17:12

dims changed the title ~~.github/workflows: add gpu-e2e workflow (phase 1)~~ .github/workflows: add gpu-e2e workflow Apr 20, 2026

dims changed the title ~~.github/workflows: add gpu-e2e workflow~~ add gpu-e2e workflow Apr 20, 2026

dims changed the title ~~add gpu-e2e workflow~~ Add gpu-e2e workflow Apr 20, 2026

dims force-pushed the gpu-e2e-phase-1-v2 branch from 47e81ee to b6e0c2b on April 20, 2026 18:34

ArangoGutierrez previously approved these changes Apr 20, 2026

cdesiniotis reviewed Apr 20, 2026

Comment thread .github/workflows/gpu-e2e.yaml Outdated

Comment thread .github/workflows/gpu-e2e.yaml

Comment thread .github/workflows/gpu-e2e.yaml Outdated

Comment thread .github/workflows/gpu-e2e.yaml

Comment thread hack/ci/smi-pod.yaml Outdated

dims dismissed ArangoGutierrez’s stale review via 2f3a756 on April 20, 2026 21:45

dims force-pushed the gpu-e2e-phase-1-v2 branch 2 times, most recently from 2f3a756 to b990da4 on April 20, 2026 21:48

dims force-pushed the gpu-e2e-phase-1-v2 branch from b990da4 to 5741e80 on April 20, 2026 21:49

cdesiniotis approved these changes Apr 20, 2026

dims merged commit 78a0a51 into NVIDIA:main Apr 20, 2026
3 checks passed

dims deleted the gpu-e2e-phase-1-v2 branch April 20, 2026 22:20

dims merged 1 commit into NVIDIA:main from dims:gpu-e2e-phase-1-v2


3 participants
