Add gpu-e2e workflow #69

Merged: dims merged 1 commit into NVIDIA:main from dims:gpu-e2e-phase-1-v2 on Apr 20, 2026

dims (Collaborator) commented Apr 20, 2026

First slice of GPU end-to-end CI for nvkind. One workflow, two matrix jobs on NVIDIA self-hosted runners:

  • linux-amd64-gpu-t4-latest-1 (T4, amd64)
  • linux-arm64-gpu-l4-latest-1 (L4, arm64)

Three scenarios per matrix job:

  • S1 — default cluster lifecycle. nvkind cluster create, node count, RuntimeClass presence, nvkind cluster print-gpus.
  • S2 — GPU Operator (minimal mode) + nvidia-smi pod. Installs nvidia/gpu-operator with driver/toolkit/DCGM disabled and NFD enabled, waits for the device-plugin daemonset rollout, confirms nvidia.com/gpu capacity is advertised, runs a pod that execs nvidia-smi.
  • S3 — DRA driver + ResourceClaim + nvidia-smi pod. amd64 only. Installs nvidia/nvidia-dra-driver-gpu v25.12.0 into a cluster configured via hack/ci/templates/dra.yaml.tmpl (DynamicResourceAllocation feature gate on control-plane + kubelet, enable_cdi in containerd), applies a ResourceClaimTemplate, runs a pod that execs nvidia-smi via the claim.
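
The S2 step "confirms nvidia.com/gpu capacity is advertised" can be illustrated with a small sketch. This is not the workflow's actual code (which is not shown here); it assumes the input is the output of `kubectl get nodes -o json`, and the function names are hypothetical.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_capacity_by_node(node_list_json: str) -> dict:
    """Map node name -> advertised nvidia.com/gpu capacity (0 if absent).

    Assumes `node_list_json` is a Kubernetes NodeList, e.g. the output of
    `kubectl get nodes -o json`.
    """
    nodes = json.loads(node_list_json).get("items", [])
    result = {}
    for node in nodes:
        name = node["metadata"]["name"]
        capacity = node.get("status", {}).get("capacity", {})
        # Extended-resource quantities are serialized as strings, e.g. "1".
        result[name] = int(capacity.get(GPU_RESOURCE, "0"))
    return result


def assert_gpu_advertised(node_list_json: str) -> int:
    """Raise if no node advertises a GPU; otherwise return the cluster total."""
    per_node = gpu_capacity_by_node(node_list_json)
    total = sum(per_node.values())
    if total < 1:
        raise AssertionError(f"no {GPU_RESOURCE} capacity advertised: {per_node}")
    return total
```

A CI step would typically poll this until the device plugin has registered, since capacity appears only after the daemonset rollout completes.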

Triggers:

  • push: main, pull-request/<N> (copy-pr-bot mirror), gpu-ci/**
  • pull_request [labeled] with run-gpu-tests
  • schedule: 06:00 UTC daily
  • workflow_dispatch

A check-paths job protects GPU runner minutes: the GPU jobs run only when the PR diff against main touches workflow, CI, or source paths.
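
As a rough illustration of the check-paths idea, the sketch below decides whether GPU jobs should run from a list of changed files. The watched patterns are assumptions made for the example, not the patterns used by the workflow, and the changed-file list is assumed to come from something like `git diff --name-only origin/main...HEAD`.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; the real workflow's path list is not shown here.
WATCHED_PATTERNS = (
    ".github/workflows/*",
    "hack/ci/*",
    "*.go",
    "go.mod",
    "go.sum",
)


def should_run_gpu_jobs(changed_files, patterns=WATCHED_PATTERNS) -> bool:
    """Return True if any changed file matches a watched pattern.

    fnmatch-style `*` also matches `/`, so "hack/ci/*" covers nested paths
    such as "hack/ci/templates/dra.yaml.tmpl".
    """
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )
```

A docs-only change such as `["README.md"]` would skip the GPU jobs under these assumed patterns, while any change under `hack/ci/` would trigger them.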

dims force-pushed the gpu-e2e-phase-1-v2 branch from 7a38e5b to c9e4224 on April 20, 2026 17:12
dims changed the title from ".github/workflows: add gpu-e2e workflow (phase 1)" to ".github/workflows: add gpu-e2e workflow" on Apr 20, 2026
dims changed the title from ".github/workflows: add gpu-e2e workflow" to "add gpu-e2e workflow" on Apr 20, 2026
dims changed the title from "add gpu-e2e workflow" to "Add gpu-e2e workflow" on Apr 20, 2026
dims force-pushed the gpu-e2e-phase-1-v2 branch from 47e81ee to b6e0c2b on April 20, 2026 18:34

cdesiniotis (Contributor) left a comment

Thanks @dims! This is great. I left some questions for my understanding.

Comment thread .github/workflows/gpu-e2e.yaml Outdated
Comment thread .github/workflows/gpu-e2e.yaml
Comment thread .github/workflows/gpu-e2e.yaml Outdated
Comment thread .github/workflows/gpu-e2e.yaml
Comment thread hack/ci/smi-pod.yaml Outdated
dims force-pushed the gpu-e2e-phase-1-v2 branch 2 times, most recently from 2f3a756 to b990da4 on April 20, 2026 21:48
Add a label-gated GPU e2e workflow for nvkind that runs on both
NVIDIA self-hosted runner pools (linux-amd64-gpu-t4-latest-1 and
linux-arm64-gpu-l4-latest-1).

Three scenarios per matrix job:

  S1  default cluster lifecycle — nvkind cluster create, node count,
      RuntimeClass presence, and a set-equality check between the
      UUIDs reported by `nvkind cluster print-gpus` (JSON) and
      `nvidia-smi --query-gpu=uuid` on the host.
  S2  GPU Operator (minimal mode) + nvidia-smi pod — installs
      `nvidia/gpu-operator` pinned to v26.3.1 with driver/toolkit/
      DCGM disabled and NFD enabled (matches aicr's proven stack),
      waits for the nvidia-device-plugin daemonset rollout, confirms
      `nvidia.com/gpu` capacity is advertised, and runs a pod that
      execs `nvidia-smi`.
  S3  DRA driver + ResourceClaim + nvidia-smi pod — amd64 only.
      Installs `nvidia/nvidia-dra-driver-gpu` v25.12.0 into a cluster
      configured via hack/ci/templates/dra.yaml.tmpl
      (DynamicResourceAllocation feature gate across control-plane
      and kubelet; enable_cdi in containerd), runs a pod backed by a
      ResourceClaimTemplate, asserts the pod sees exactly one GPU
      from `nvidia-smi -L`, and asserts DRA actually engaged via
      `pod.status.resourceClaimStatuses[name=="gpu"].resourceClaimName`
      (the claim itself is pod-scoped and GC'd after Succeeded, so
      checking the status is more reliable than querying the claim).

Triggers:
  - push: main, pull-request/<N> (copy-pr-bot mirror), gpu-ci/**
  - pull_request [labeled] with `run-gpu-tests`
  - schedule: 06:00 UTC daily
  - workflow_dispatch

A `check-paths` gate protects GPU runner minutes by only running when
workflow/CI/source paths change in the PR diff against main.

Artifacts collected on every run: kind export logs, per-cluster pod
list + events, docker daemon.json, nvidia-ctk config. Retained 7 days.

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
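
The two assertions the commit message adds (the S1 UUID set-equality check and the S3 check on `resourceClaimStatuses`) can be sketched as follows. This is an illustration only, under stated assumptions: the JSON schema of `nvkind cluster print-gpus` is not shown in this PR, so the sketch simply collects every string that starts with `GPU-`; the host side assumes `nvidia-smi --query-gpu=uuid --format=csv,noheader` (one UUID per line); and the pod input is assumed to be `kubectl get pod <name> -o json`. All function names are hypothetical.

```python
import json
from typing import Optional


def host_gpu_uuids(nvidia_smi_output: str) -> set:
    """Parse one-UUID-per-line nvidia-smi output into a set."""
    return {line.strip() for line in nvidia_smi_output.splitlines() if line.strip()}


def collect_gpu_uuids(value) -> set:
    """Recursively collect every string starting with 'GPU-' from parsed JSON.

    Schema-agnostic on purpose: the print-gpus output layout is an unknown here.
    """
    if isinstance(value, str):
        return {value} if value.startswith("GPU-") else set()
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, list):
        found = set()
        for item in value:
            found |= collect_gpu_uuids(item)
        return found
    return set()


def assert_same_gpus(print_gpus_json: str, nvidia_smi_output: str) -> None:
    """S1: the cluster and the host must report exactly the same UUID set."""
    cluster = collect_gpu_uuids(json.loads(print_gpus_json))
    host = host_gpu_uuids(nvidia_smi_output)
    if cluster != host:
        raise AssertionError(
            f"only in cluster: {sorted(cluster - host)}; "
            f"only on host: {sorted(host - cluster)}"
        )


def claim_name_from_pod(pod_json: str, claim_ref: str = "gpu") -> Optional[str]:
    """S3: return status.resourceClaimStatuses[name==claim_ref].resourceClaimName.

    Returns None when the entry is missing, i.e. DRA did not engage for that claim.
    """
    pod = json.loads(pod_json)
    for entry in pod.get("status", {}).get("resourceClaimStatuses", []):
        if entry.get("name") == claim_ref:
            return entry.get("resourceClaimName")
    return None
```

Reading the claim name from the pod status keeps the check valid after the pod has reached Succeeded, which is the reason the commit message gives for not querying the ResourceClaim object directly.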
dims force-pushed the gpu-e2e-phase-1-v2 branch from b990da4 to 5741e80 on April 20, 2026 21:49
dims merged commit 78a0a51 into NVIDIA:main on Apr 20, 2026
3 checks passed
dims deleted the gpu-e2e-phase-1-v2 branch on April 20, 2026 22:20

3 participants