Wait for mdev parents in vGPU manager validation #2502
Have you confirmed that this NVML function returns The below pseudocode is what I envision: |
nvpciLib := nvpci.New()
nvmdevLib := nvmdev.New()

return wait.PollUntilContextTimeout(ctx, pollInterval, timeout, true, func(ctx context.Context) (bool, error) {
What about the below suggestion?
return wait.PollUntilContextTimeout(ctx, pollInterval, timeout, true, func(ctx context.Context) (bool, error) {
	if mdevParentDevicesExist() || vfsExist() {
		return true, nil
	}
	return false, nil
})

// mdevParentDevicesExist reports whether at least one mdev parent device is present.
func mdevParentDevicesExist() bool {
	nvmdevLib := nvmdev.New()
	mdevParentDevices, err := nvmdevLib.GetAllParentDevices()
	if err != nil {
		log.Warnf("could not get mdev parent devices: %v", err)
		return false
	}
	if len(mdevParentDevices) == 0 {
		log.Infof("found 0 mdev parent devices")
		return false
	}
	log.Infof("found %d mdev parent devices", len(mdevParentDevices))
	return true
}

// vfsExist reports whether every SR-IOV capable GPU has all of its advertised VFs enabled.
func vfsExist() bool {
	nvpciLib := nvpci.New()
	gpus, err := nvpciLib.GetGPUs()
	if err != nil {
		log.Warnf("error getting GPUs: %v", err)
		return false
	}
	var totalExpected, totalEnabled uint64
	var pfCount int
	for _, gpu := range gpus {
		sriovInfo := gpu.SriovInfo
		if sriovInfo.IsPF() {
			pfCount++
			totalExpected += sriovInfo.PhysicalFunction.TotalVFs
			totalEnabled += sriovInfo.PhysicalFunction.NumVFs
		}
	}
	if totalExpected == 0 {
		log.Info("No SR-IOV capable GPUs found")
		return false
	}
	if totalEnabled == totalExpected {
		log.Infof("All %d VF(s) enabled on %d NVIDIA GPU(s)", totalEnabled, pfCount)
		return true
	}
	log.Infof("Not all VFs have been created. %d/%d enabled across %d GPU(s)", totalEnabled, totalExpected, pfCount)
	return false
}
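
To make the control flow of the suggestion easier to reason about without a cluster, here is a minimal, standard-library-only sketch of an "immediate" poll loop that succeeds as soon as any predicate is true. The names pollUntil, anyOf, and demo are hypothetical stand-ins invented for this illustration; they are not part of the PR or of the wait package, and the fake predicates only simulate the mdev and VF checks.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// condition has the same shape as the callback passed to the poll helper above.
type condition func(ctx context.Context) (bool, error)

// pollUntil is a hypothetical stand-in for an immediate poll-with-timeout helper:
// it evaluates cond right away, then once per interval, until cond reports true,
// cond returns an error, or the timeout expires.
func pollUntil(ctx context.Context, interval, timeout time.Duration, cond condition) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		done, err := cond(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// anyOf builds a condition that is satisfied when at least one predicate is true.
func anyOf(preds ...func() bool) condition {
	return func(context.Context) (bool, error) {
		for _, p := range preds {
			if p() {
				return true, nil
			}
		}
		return false, nil
	}
}

// demo simulates a host where mdev parents appear on the third poll and VFs never do.
func demo() (ready bool, polls int) {
	mdevReady := func() bool { polls++; return polls >= 3 }
	vfsReady := func() bool { return false }
	err := pollUntil(context.Background(), 5*time.Millisecond, time.Second, anyOf(mdevReady, vfsReady))
	return err == nil, polls
}

func main() {
	ready, polls := demo()
	fmt.Println("ready:", ready, "polls:", polls)

	// If no readiness signal ever appears, the wait ends with a deadline error.
	never := func() bool { return false }
	err := pollUntil(context.Background(), 5*time.Millisecond, 30*time.Millisecond, anyOf(never))
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

Keeping the predicates as plain func() bool values, as the suggestion does, means a failed lookup is logged and retried on the next tick rather than aborting the whole wait.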
@cdesiniotis I implemented your suggestion but have a concern about your last comment. I'm not sure if For now, I added a third helper, |
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
Description
Fixes #2365. On Tesla T4 (Turing) hosts running with gpuWorkloadConfig=vm-vgpu, the sandbox-validator's vgpu-manager-validation initContainer enters its readiness wait and polls for SR-IOV Virtual Functions that the driver will never create.

T4 silicon advertises sriov_totalvfs=16 but the driver uses mdev, never creating VFs, so the wait hangs. Now, the wait succeeds when nvmdev.GetAllParentDevices() returns at least one device, or when all SR-IOV VFs are enabled. The poll loop is restructured around two predicates, mdevParentDevicesExist() and vfsExist(), and the function is renamed to waitForParentDevices since the readiness signal is no longer VF-specific. Works for both T4 (PF is the mdev parent) and Ampere+ SR-IOV vGPU (VFs are the mdev parents).

Design tradeoff

The other alternative I considered was using nvmlDeviceGetHostVgpuMode(), which returns a typed enum (SRIOV/NON_SRIOV/NONE) directly from the driver via ioctl and is strictly more authoritative than checking for an mdev sysfs directory.

I chose the mdev sysfs path because nvmdev is already vendored and imported in this file, no new go.mod entry is required, the validator container does not need libnvidia-ml.so on LD_LIBRARY_PATH or /dev/nvidiactl mounted, and there is no NVML version-skew handling to write.

Checklist

make lint
make validate-generated-assets
make validate-modules

Testing

Verified end-to-end on two single-node Ubuntu 22.04 clusters with vGPU host driver vgpu-manager:580.65.05: a Tesla T4 and an A100.

T4 with fix: pod Running 1/1, 0 restarts, 42s.

A100 with fix: pod Running 1/1, 0 restarts, 14s.

T4 without fix (validator reverted to upstream ghcr.io/nvidia/gpu-operator:3a85e9eb): pod CrashLoopBackOff after 5-min timeout.
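
To illustrate why the VF-only check can never succeed on a T4 while the combined rule does, the following sketch applies both rules to simulated host snapshots. The pf and host types, the allVFsEnabled and ready helpers, and every number other than the T4's sriov_totalvfs=16 are assumptions made up for this example; the real validator reads these values through nvpci and nvmdev.

```go
package main

import "fmt"

// pf is a simplified, hypothetical view of one SR-IOV physical function.
type pf struct {
	totalVFs uint64 // value advertised via sriov_totalvfs
	numVFs   uint64 // VFs currently enabled
}

// host is a hypothetical snapshot of what the validator would observe.
type host struct {
	name        string
	pfs         []pf
	mdevParents int // number of mdev parent devices found
}

// allVFsEnabled mirrors the VF-only rule: every advertised VF must be enabled.
func allVFsEnabled(h host) bool {
	var expected, enabled uint64
	for _, p := range h.pfs {
		expected += p.totalVFs
		enabled += p.numVFs
	}
	return expected > 0 && enabled == expected
}

// ready mirrors the combined rule: mdev parents exist, or all VFs are enabled.
func ready(h host) bool {
	return h.mdevParents > 0 || allVFsEnabled(h)
}

func main() {
	hosts := []host{
		// T4-like: VFs are advertised but never created; the PF is the mdev parent.
		{name: "T4-like", pfs: []pf{{totalVFs: 16, numVFs: 0}}, mdevParents: 1},
		// SR-IOV vGPU with all VFs enabled (illustrative counts).
		{name: "SR-IOV, VFs enabled", pfs: []pf{{totalVFs: 16, numVFs: 16}}, mdevParents: 16},
		// SR-IOV vGPU still coming up: neither signal is present yet.
		{name: "SR-IOV, VFs pending", pfs: []pf{{totalVFs: 16, numVFs: 0}}, mdevParents: 0},
	}
	for _, h := range hosts {
		fmt.Printf("%-20s vf-only=%-5v combined=%v\n", h.name, allVFsEnabled(h), ready(h))
	}
}
```

In the T4-like row the VF-only rule stays false indefinitely, which matches the timeout seen without the fix, while the combined rule is satisfied by the single mdev parent.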