
Wait for mdev parents in vGPU manager validation #2502

Open
karthikvetrivel wants to merge 1 commit into NVIDIA:main from karthikvetrivel:fix/sandbox-validator-mdev-skip

@karthikvetrivel (Member) commented May 28, 2026

Description

Fixes #2365. On Tesla T4 (Turing) hosts running with gpuWorkloadConfig=vm-vgpu, the sandbox-validator's vgpu-manager-validation initContainer enters its readiness wait and polls for SR-IOV Virtual Functions that the driver will never create.

T4 silicon advertises sriov_totalvfs=16, but on this GPU the driver exposes vGPUs through mdev and never creates VFs, so the wait can never succeed and the initContainer times out. With this change, the wait succeeds when nvmdev.GetAllParentDevices() returns at least one device, or when all SR-IOV VFs are enabled. The poll loop is restructured around two predicates, mdevParentDevicesExist() and vfsExist(), and the function is renamed to waitForParentDevices because the readiness signal is no longer VF-specific. This works for both T4 (the PF is the mdev parent) and Ampere+ SR-IOV vGPU (the VFs are the mdev parents).
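
To illustrate why a VF-only check never converges on a T4-like device, here is a minimal standalone sketch. It is not the validator's code: it reads the standard PCI sysfs attributes sriov_totalvfs and sriov_numvfs and checks for an mdev_supported_types directory in a fake device directory built under a temp dir. The deviceState type and the helpers readUint, inspect, vfsReady, and ready are hypothetical names, and treating the presence of mdev_supported_types as "is an mdev parent" is an assumption about what the nvmdev lookup amounts to.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readUint reads a decimal sysfs attribute such as sriov_totalvfs.
// A missing or malformed attribute is reported as 0.
func readUint(path string) uint64 {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0
	}
	n, err := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		return 0
	}
	return n
}

// deviceState is a hypothetical summary of one PCI device directory.
type deviceState struct {
	TotalVFs   uint64 // value of sriov_totalvfs
	NumVFs     uint64 // value of sriov_numvfs
	MdevParent bool   // true if mdev_supported_types exists (assumption)
}

// inspect collects the attributes from a device directory.
func inspect(devDir string) deviceState {
	info, err := os.Stat(filepath.Join(devDir, "mdev_supported_types"))
	return deviceState{
		TotalVFs:   readUint(filepath.Join(devDir, "sriov_totalvfs")),
		NumVFs:     readUint(filepath.Join(devDir, "sriov_numvfs")),
		MdevParent: err == nil && info.IsDir(),
	}
}

// vfsReady is the VF-only signal: every advertised VF is enabled.
func (s deviceState) vfsReady() bool {
	return s.TotalVFs > 0 && s.NumVFs == s.TotalVFs
}

// ready accepts either signal: an mdev parent or fully enabled VFs.
func (s deviceState) ready() bool {
	return s.MdevParent || s.vfsReady()
}

func main() {
	// Build a fake T4-like device: 16 VFs advertised, none enabled, mdev present.
	dir, err := os.MkdirTemp("", "pci-dev-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	must := func(err error) {
		if err != nil {
			panic(err)
		}
	}
	must(os.WriteFile(filepath.Join(dir, "sriov_totalvfs"), []byte("16\n"), 0o644))
	must(os.WriteFile(filepath.Join(dir, "sriov_numvfs"), []byte("0\n"), 0o644))
	must(os.Mkdir(filepath.Join(dir, "mdev_supported_types"), 0o755))

	s := inspect(dir)
	fmt.Printf("vfsReady=%v ready=%v\n", s.vfsReady(), s.ready())
}
```

For the fake device above, the VF-only check stays false no matter how long it is polled, while the combined check is satisfied by the mdev directory alone.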

Design tradeoff

The other alternative I considered was using nvmlDeviceGetHostVgpuMode(), which returns a typed enum (SRIOV / NON_SRIOV / NONE) directly from the driver via ioctl and is strictly more authoritative than checking for an mdev sysfs directory.

I chose the mdev sysfs path because nvmdev is already vendored and imported in this file, no new go.mod entry is required, the validator container does not need libnvidia-ml.so on LD_LIBRARY_PATH or /dev/nvidiactl mounted, and there is no NVML version-skew handling to write.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Verified end-to-end on two single-node Ubuntu 22.04 clusters with vGPU host driver vgpu-manager:580.65.05: a Tesla T4 and an A100.

T4 with fix — pod Running 1/1, 0 restarts, 42s:

time="2026-06-02T18:24:48Z" level=info msg="Waiting for parent devices to be available..."
time="2026-06-02T18:24:48Z" level=info msg="found 1 mdev parent devices"

A100 with fix — pod Running 1/1, 0 restarts, 14s:

time="2026-06-02T18:25:29Z" level=info msg="found 16 mdev parent devices"

T4 without fix (validator reverted to upstream ghcr.io/nvidia/gpu-operator:3a85e9eb) — pod CrashLoopBackOff after 5-min timeout:

time="2026-05-27T17:59:07Z" level=info msg="Waiting for VFs to be available..."
time="2026-05-27T17:59:07Z" level=info msg="Waiting for VFs: 0/16 enabled across 1 GPU(s)"
... (repeats every 5s) ...
time="2026-05-27T17:59:52Z" level=info msg="Waiting for VFs: 0/16 enabled across 1 GPU(s)"
time="2026-05-27T17:59:52Z" level=info msg="Error: error validating vGPU Manager installation: vGPU Manager VFs not ready: context deadline exceeded"
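
The "context deadline exceeded" suffix in the last log line is what a bounded poll loop reports when its condition never becomes true. The sketch below reproduces that shape using only the standard library. It is a stand-in for, not a copy of, the wait.PollUntilContextTimeout call used by the validator: pollUntil and runNeverReady are hypothetical names, and the interval and timeout are shortened so the example finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollUntil calls cond immediately and then once per interval until cond
// returns true, cond returns an error, or the timeout elapses.
func pollUntil(ctx context.Context, interval, timeout time.Duration,
	cond func(context.Context) (bool, error)) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		done, err := cond(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			// The deadline passed before cond ever returned true.
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// runNeverReady polls a condition that never succeeds, like waiting for VFs
// on a GPU whose driver will not create any.
func runNeverReady() error {
	never := func(context.Context) (bool, error) { return false, nil }
	return pollUntil(context.Background(), 10*time.Millisecond, 50*time.Millisecond, never)
}

func main() {
	err := runNeverReady()
	// Wrap the error the same way the log line above is shaped.
	fmt.Println(fmt.Errorf("vGPU Manager VFs not ready: %w", err))
}
```

Because the condition cannot change on a host where no VFs will ever appear, lengthening the timeout only delays the failure, which is why the fix changes the condition rather than the deadline.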

@karthikvetrivel marked this pull request as draft May 28, 2026 21:18
@cdesiniotis (Contributor) commented:

> The other alternative I considered was using nvmlDeviceGetHostVgpuMode(), which returns a typed enum (SRIOV / NON_SRIOV / NONE) directly from the driver via ioctl and is strictly more authoritative than checking for an mdev sysfs directory.

Have you confirmed that this NVML function returns NON_SRIOV for the T4 GPU that has totalvfs != 0? If yes (which suggests this NVML function gives us the information we need), I slightly prefer going with this approach (based on what I know at the moment). Note, we do have /host and /run/nvidia/driver mounted in the vgpu-manager-validation container, so we should be able to discover the path to libnvidia-ml.so.1 and instantiate a go-nvml instance.

The below pseudocode is what I envision:

waitForVFs() {
	. . .
	for _, gpu := range gpus {
		sriovInfo := gpu.SriovInfo
		// Skip devices that are not SR-IOV physical functions.
		if !sriovInfo.IsPF() {
			continue
		}
		// Skip PFs that the driver does not operate in SR-IOV mode.
		if nvmlDevice.GetHostVgpuMode() != "SRIOV" {
			continue
		}
		pfCount++
		totalExpected += sriovInfo.PhysicalFunction.TotalVFs
		totalEnabled += sriovInfo.PhysicalFunction.NumVFs
	}
	. . .
}
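
A self-contained version of the counting step in the pseudocode above might look like the following sketch. It does not call NVML: the hostVgpuMode type, its constants, and the gpu struct are hypothetical stand-ins for what the driver and nvpci would report. It also assumes a T4 reports a non-SR-IOV mode, which is the open question raised above and has not been confirmed in this thread.

```go
package main

import "fmt"

// hostVgpuMode is a hypothetical stand-in for the mode reported by the driver.
type hostVgpuMode int

const (
	modeNone hostVgpuMode = iota
	modeNonSRIOV
	modeSRIOV
)

// gpu is a simplified, hypothetical view of one PCI function.
type gpu struct {
	Name     string
	IsPF     bool
	TotalVFs uint64
	NumVFs   uint64
	Mode     hostVgpuMode
}

// countVFs tallies expected and enabled VFs, counting only physical
// functions that the driver operates in SR-IOV mode.
func countVFs(gpus []gpu) (pfCount int, totalExpected, totalEnabled uint64) {
	for _, g := range gpus {
		if !g.IsPF {
			continue
		}
		if g.Mode != modeSRIOV {
			continue
		}
		pfCount++
		totalExpected += g.TotalVFs
		totalEnabled += g.NumVFs
	}
	return pfCount, totalExpected, totalEnabled
}

func main() {
	gpus := []gpu{
		// Assumed: the T4 advertises 16 VFs but reports a non-SR-IOV mode.
		{Name: "T4", IsPF: true, TotalVFs: 16, NumVFs: 0, Mode: modeNonSRIOV},
		{Name: "A100", IsPF: true, TotalVFs: 16, NumVFs: 16, Mode: modeSRIOV},
	}
	pfCount, expected, enabled := countVFs(gpus)
	fmt.Printf("%d PF(s), %d/%d VFs enabled\n", pfCount, enabled, expected)
}
```

With the mode filter in place, the T4's 16 advertised VFs no longer contribute to totalExpected, so they cannot hold the wait open.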

@karthikvetrivel force-pushed the fix/sandbox-validator-mdev-skip branch 2 times, most recently from b5f02b4 to 9306209 on June 2, 2026 16:33
@karthikvetrivel marked this pull request as ready for review June 2, 2026 16:41
Comment thread cmd/nvidia-validator/main.go (outdated)
nvpciLib := nvpci.New()
nvmdevLib := nvmdev.New()

return wait.PollUntilContextTimeout(ctx, pollInterval, timeout, true, func(ctx context.Context) (bool, error) {

What about the below suggestion?

return wait.PollUntilContextTimeout(ctx, pollInterval, timeout, true, func(ctx context.Context) (bool, error) {
	// Ready as soon as either readiness signal is observed.
	return mdevParentDevicesExist() || vfsExist(), nil
})

// mdevParentDevicesExist reports whether at least one mdev parent device is present.
func mdevParentDevicesExist() bool {
	nvmdevLib := nvmdev.New()
	mdevParentDevices, err := nvmdevLib.GetAllParentDevices()
	if err != nil {
		log.Warnf("could not get mdev parent devices: %v", err)
		return false
	}

	if len(mdevParentDevices) == 0 {
		log.Info("found 0 mdev parent devices")
		return false
	}

	log.Infof("found %d mdev parent devices", len(mdevParentDevices))
	return true
}

// vfsExist reports whether every VF advertised by the SR-IOV capable GPUs is enabled.
func vfsExist() bool {
	nvpciLib := nvpci.New()
	gpus, err := nvpciLib.GetGPUs()
	if err != nil {
		log.Warnf("error getting GPUs: %v", err)
		return false
	}

	var totalExpected, totalEnabled uint64
	var pfCount int
	for _, gpu := range gpus {
		sriovInfo := gpu.SriovInfo
		if sriovInfo.IsPF() {
			pfCount++
			totalExpected += sriovInfo.PhysicalFunction.TotalVFs
			totalEnabled += sriovInfo.PhysicalFunction.NumVFs
		}
	}

	if totalExpected == 0 {
		log.Info("No SR-IOV capable GPUs found")
		return false
	}

	if totalEnabled == totalExpected {
		log.Infof("All %d VF(s) enabled on %d NVIDIA GPU(s)", totalEnabled, pfCount)
		return true
	}

	log.Infof("Not all VFs have been created. %d/%d enabled across %d GPU(s)", totalEnabled, totalExpected, pfCount)
	return false
}

@karthikvetrivel force-pushed the fix/sandbox-validator-mdev-skip branch from 9306209 to 71c0e97 on June 2, 2026 18:27
@karthikvetrivel changed the title from "Skip mdev-mode GPUs in waitForVFs" to "Wait for mdev parents in vGPU manager validation" on Jun 2, 2026
@karthikvetrivel force-pushed the fix/sandbox-validator-mdev-skip branch 2 times, most recently from d8c9536 to 0584143 on June 2, 2026 18:45
@karthikvetrivel (Member, Author) commented:

@cdesiniotis I implemented your suggestion but have a concern about your last comment. I suspect that mdevParentDevicesExist() || allVFsReady() could declare readiness prematurely on Ampere+ SR-IOV vGPU, where the mdev parents are the VFs and the driver registers each one as it brings up the corresponding VF. The OR would return true on the first parent, before totalEnabled == totalExpected. Is my understanding correct here?

For now, I added a third helper, driverUsingSRIOV(), as a gate to check allVFsReady() first.
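
The concern can be made concrete with a small simulation. This is only a sketch under stated assumptions: the snapshot struct is hypothetical, the helper names are borrowed from this thread but take a snapshot instead of querying the host, and readyGated is one possible reading of the gate described above, not the code in the PR. It also assumes that mdev parents appear one by one as VFs come up, which is the unconfirmed premise of the question.

```go
package main

import "fmt"

// snapshot is a hypothetical point-in-time view of the host.
type snapshot struct {
	SRIOVMode   bool   // whether the driver operates the GPUs in SR-IOV mode
	TotalVFs    uint64 // VFs advertised across all PFs
	EnabledVFs  uint64 // VFs currently enabled
	MdevParents int    // mdev parent devices currently registered
}

func mdevParentDevicesExist(s snapshot) bool { return s.MdevParents > 0 }

func allVFsReady(s snapshot) bool {
	return s.TotalVFs > 0 && s.EnabledVFs == s.TotalVFs
}

// readyOR is the plain disjunction of the two signals.
func readyOR(s snapshot) bool {
	return mdevParentDevicesExist(s) || allVFsReady(s)
}

// readyGated requires all VFs first when the driver is in SR-IOV mode,
// then falls through to the mdev parent check.
func readyGated(s snapshot) bool {
	if s.SRIOVMode && !allVFsReady(s) {
		return false
	}
	return mdevParentDevicesExist(s)
}

func main() {
	cases := []struct {
		name string
		s    snapshot
	}{
		{"T4 (mdev on PF)", snapshot{SRIOVMode: false, TotalVFs: 16, EnabledVFs: 0, MdevParents: 1}},
		{"Ampere+, 3 of 16 VFs up", snapshot{SRIOVMode: true, TotalVFs: 16, EnabledVFs: 3, MdevParents: 3}},
		{"Ampere+, all VFs up", snapshot{SRIOVMode: true, TotalVFs: 16, EnabledVFs: 16, MdevParents: 16}},
	}
	for _, c := range cases {
		fmt.Printf("%-24s OR=%-5v gated=%v\n", c.name, readyOR(c.s), readyGated(c.s))
	}
}
```

In the partially initialized Ampere+ case the plain OR reports ready while the gated variant keeps waiting, and the two agree on the T4 and fully initialized cases.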

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
@karthikvetrivel force-pushed the fix/sandbox-validator-mdev-skip branch from 0584143 to 14d74a1 on June 2, 2026 18:47
Development

Successfully merging this pull request may close these issues.

[Bug]: nvidia-sandbox-validator pods crash when card doesn't support SR-IOV
