Wait for mdev parents in vGPU manager validation by karthikvetrivel · Pull Request #2502 · NVIDIA/gpu-operator

karthikvetrivel · 2026-05-28T17:35:39Z

Description

Fixes #2365. On Tesla T4 (Turing) hosts running with gpuWorkloadConfig=vm-vgpu, the sandbox-validator's vgpu-manager-validation initContainer enters its readiness wait and polls for SR-IOV Virtual Functions that the driver will never create.

T4 silicon advertises sriov_totalvfs=16 but the driver uses mdev, never creating VFs, so the wait hangs. Now, the wait succeeds when nvmdev.GetAllParentDevices() returns at least one device, or when all SR-IOV VFs are enabled. The poll loop is restructured around two predicates, mdevParentDevicesExist() and vfsExist(), and the function is renamed to waitForParentDevices since the readiness signal is no longer VF-specific. Works for both T4 (PF is the mdev parent) and Ampere+ SR-IOV vGPU (VFs are the mdev parents).

Design tradeoff

The other alternative I considered was using nvmlDeviceGetHostVgpuMode(), which returns a typed enum (SRIOV / NON_SRIOV / NONE) directly from the driver via ioctl and is strictly more authoritative than checking for an mdev sysfs directory.

I chose the mdev sysfs path because nvmdev is already vendored and imported in this file, no new go.mod entry is required, the validator container does not need libnvidia-ml.so on LD_LIBRARY_PATH or /dev/nvidiactl mounted, and there is no NVML version-skew handling to write.

Checklist

No secrets, sensitive information, or unrelated changes
Lint checks passing (make lint)
Generated assets in-sync (make validate-generated-assets)
Go mod artifacts in-sync (make validate-modules)
Test cases are added for new code paths

Testing

Verified end-to-end on two single-node Ubuntu 22.04 clusters with vGPU host driver vgpu-manager:580.65.05: a Tesla T4 and an A100.

T4 with fix — pod Running 1/1, 0 restarts, 42s:

time="2026-06-02T18:24:48Z" level=info msg="Waiting for parent devices to be available..."
time="2026-06-02T18:24:48Z" level=info msg="found 1 mdev parent devices"

A100 with fix — pod Running 1/1, 0 restarts, 14s:

time="2026-06-02T18:25:29Z" level=info msg="found 16 mdev parent devices"

T4 without fix (validator reverted to upstream ghcr.io/nvidia/gpu-operator:3a85e9eb) — pod CrashLoopBackOff after 5-min timeout:

time="2026-05-27T17:59:07Z" level=info msg="Waiting for VFs to be available..."
time="2026-05-27T17:59:07Z" level=info msg="Waiting for VFs: 0/16 enabled across 1 GPU(s)"
... (repeats every 5s) ...
time="2026-05-27T17:59:52Z" level=info msg="Waiting for VFs: 0/16 enabled across 1 GPU(s)"
time="2026-05-27T17:59:52Z" level=info msg="Error: error validating vGPU Manager installation: vGPU Manager VFs not ready: context deadline exceeded"

cdesiniotis · 2026-05-28T21:34:07Z

> The other alternative I considered was using nvmlDeviceGetHostVgpuMode(), which returns a typed enum (SRIOV / NON_SRIOV / NONE) directly from the driver via ioctl and is strictly more authoritative than checking for an mdev sysfs directory.

Have you confirmed that this NVML function returns NON_SRIOV for the T4 GPU that has totalvfs != 0? If yes (which suggests this NVML function gives us the information we need), I slightly prefer going with this approach (based on what I know at the moment). Note, we do have /host and /run/nvidia/driver mounted in the vgpu-manager-validation container, so we should be able to discover the path to libnvidia-ml.so.1 and instantiate a go-nvml instance.

The below pseudocode is what I envision:

waitForVFs() {
  . . .
  for _, gpu := range gpus {
    sriovInfo := gpu.SriovInfo
    if !sriovInfo.IsPF() {
      continue
    }
    if nvmlDevice.GetHostVgpuMode() != "SRIOV" {
      continue
    }
    pfCount++
    totalExpected += sriovInfo.PhysicalFunction.TotalVFs
    totalEnabled += sriovInfo.PhysicalFunction.NumVFs
  }
  . . .
}
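
The filtering idea in the pseudocode above can be made concrete without NVML. The following is a hedged sketch only: hostVgpuMode and gpu are hypothetical stand-ins for the go-nvml mode value and the nvpci device record, and the mode is assumed to have been queried elsewhere. It shows only the counting rule, in which physical functions not reported as SR-IOV are skipped.

```go
package main

// hostVgpuMode mirrors the three states named in the discussion
// (SRIOV / NON_SRIOV / NONE). Hypothetical type, not the go-nvml enum.
type hostVgpuMode int

const (
	modeNone hostVgpuMode = iota
	modeNonSRIOV
	modeSRIOV
)

// gpu is a hypothetical record combining PCI SR-IOV data with the mode the
// driver reports for that device.
type gpu struct {
	isPF     bool
	mode     hostVgpuMode
	totalVFs uint64
	numVFs   uint64
}

// countSRIOVVFs sums expected and enabled VFs over physical functions whose
// driver-reported mode is SR-IOV. All other devices are ignored.
func countSRIOVVFs(gpus []gpu) (pfCount int, totalExpected, totalEnabled uint64) {
	for _, g := range gpus {
		if !g.isPF || g.mode != modeSRIOV {
			continue
		}
		pfCount++
		totalExpected += g.totalVFs
		totalEnabled += g.numVFs
	}
	return pfCount, totalExpected, totalEnabled
}
```

With this rule, a device that advertises 16 total VFs but reports a non-SR-IOV mode contributes nothing to totalExpected, so it can no longer hold the wait open.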

cdesiniotis · 2026-06-02T17:21:46Z

 	nvpciLib := nvpci.New()
+	nvmdevLib := nvmdev.New()

 	return wait.PollUntilContextTimeout(ctx, pollInterval, timeout, true, func(ctx context.Context) (bool, error) {


What about the below suggestion?

	return wait.PollUntilContextTimeout(ctx, pollInterval, timeout, true, func(ctx context.Context) (bool, error) {
		if mdevParentDevicesExist() || vfsExist() {
			return true, nil
		}
		return false, nil
	})

// mdevParentDevicesExist reports whether at least one mdev parent device is registered.
func mdevParentDevicesExist() bool {
	nvmdevLib := nvmdev.New()
	mdevParentDevices, err := nvmdevLib.GetAllParentDevices()
	if err != nil {
		log.Warnf("could not get mdev parent devices: %v", err)
		return false
	}
	if len(mdevParentDevices) == 0 {
		log.Infof("found 0 mdev parent devices")
		return false
	}
	log.Infof("found %d mdev parent devices", len(mdevParentDevices))
	return true
}

// vfsExist reports whether every SR-IOV VF advertised by the GPU PFs has been enabled.
func vfsExist() bool {
	nvpciLib := nvpci.New()
	gpus, err := nvpciLib.GetGPUs()
	if err != nil {
		log.Warnf("error getting GPUs: %v", err)
		return false
	}

	var totalExpected, totalEnabled uint64
	var pfCount int
	for _, gpu := range gpus {
		sriovInfo := gpu.SriovInfo
		if sriovInfo.IsPF() {
			pfCount++
			totalExpected += sriovInfo.PhysicalFunction.TotalVFs
			totalEnabled += sriovInfo.PhysicalFunction.NumVFs
		}
	}

	if totalExpected == 0 {
		log.Info("No SR-IOV capable GPUs found")
		return false
	}
	if totalEnabled == totalExpected {
		log.Infof("All %d VF(s) enabled on %d NVIDIA GPU(s)", totalEnabled, pfCount)
		return true
	}
	log.Infof("Not all VFs have been created. %d/%d enabled across %d GPU(s)", totalEnabled, totalExpected, pfCount)
	return false
}

karthikvetrivel · 2026-06-02T18:46:06Z

@cdesiniotis I implemented your suggestion but have a concern about your last comment. I'm concerned that mdevParentDevicesExist() || allVFsReady() could declare readiness prematurely on Ampere+ SR-IOV vGPU, where the mdev parents are the VFs and the driver registers each one as it brings up the corresponding VF. In that case the OR would return true on the first parent, before totalEnabled == totalExpected. Is my understanding correct here?

For now, I added a third helper, driverUsingSRIOV(), as a gate: when it reports SR-IOV, readiness is decided by allVFsReady() rather than by the presence of mdev parents.
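
The difference between the plain OR and the gated check can be shown on a snapshot of one poll iteration. This is a sketch under stated assumptions: the state struct and its sriovMode field are hypothetical, how driverUsingSRIOV() actually detects SR-IOV is not shown in this thread, and the incremental registration of mdev parents is the hypothesis raised in the comment above, not a confirmed driver behaviour.

```go
package main

// state is a hypothetical snapshot of what the validator observes in a
// single poll iteration.
type state struct {
	sriovMode   bool   // assumed output of a driverUsingSRIOV()-style check
	mdevParents int    // mdev parent devices currently registered
	vfsExpected uint64 // sum of TotalVFs across PFs
	vfsEnabled  uint64 // sum of NumVFs across PFs
}

func (s state) mdevParentsExist() bool { return s.mdevParents > 0 }

func (s state) allVFsReady() bool {
	return s.vfsExpected > 0 && s.vfsEnabled == s.vfsExpected
}

// readyOR is the plain disjunction from the review suggestion.
func readyOR(s state) bool {
	return s.mdevParentsExist() || s.allVFsReady()
}

// readyGated consults the VF count only in SR-IOV mode and the mdev parent
// count otherwise.
func readyGated(s state) bool {
	if s.sriovMode {
		return s.allVFsReady()
	}
	return s.mdevParentsExist()
}
```

For a partially initialised SR-IOV host (one parent registered, 1 of 16 VFs enabled), readyOR is already true while readyGated keeps waiting. For the T4 case (one parent, no SR-IOV) both agree.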

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

karthikvetrivel requested review from cdesiniotis, rahulait, rajathagasthya, shivamerla and tariq1890 as code owners May 28, 2026 17:35

karthikvetrivel force-pushed the fix/sandbox-validator-mdev-skip branch from 0a31691 to ad5f5e0 Compare May 28, 2026 17:59

cdesiniotis reviewed May 28, 2026


Comment thread cmd/nvidia-validator/main.go Outdated

cdesiniotis reviewed May 28, 2026


Comment thread cmd/nvidia-validator/main.go Outdated

karthikvetrivel marked this pull request as draft May 28, 2026 21:18

karthikvetrivel force-pushed the fix/sandbox-validator-mdev-skip branch 2 times, most recently from b5f02b4 to 9306209 Compare June 2, 2026 16:33

karthikvetrivel marked this pull request as ready for review June 2, 2026 16:41

cdesiniotis reviewed Jun 2, 2026


karthikvetrivel force-pushed the fix/sandbox-validator-mdev-skip branch from 9306209 to 71c0e97 Compare June 2, 2026 18:27

karthikvetrivel changed the title ~~Skip mdev-mode GPUs in waitForVFs~~ Wait for mdev parents in vGPU manager validation Jun 2, 2026

karthikvetrivel force-pushed the fix/sandbox-validator-mdev-skip branch 2 times, most recently from d8c9536 to 0584143 Compare June 2, 2026 18:45

karthikvetrivel requested a review from cdesiniotis June 2, 2026 18:46

skip mdev-mode GPUs in waitForVFs

14d74a1

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

karthikvetrivel force-pushed the fix/sandbox-validator-mdev-skip branch from 0584143 to 14d74a1 Compare June 2, 2026 18:47

karthikvetrivel wants to merge 1 commit into NVIDIA:main from karthikvetrivel:fix/sandbox-validator-mdev-skip
