fix: nvidia-container-toolkit 1.18.0 JIT-CDI mode, add nvidia-cdi-refresh is-enabled validator for all AKS GPU SKUs #7963
Conversation
Contributor
Pull request overview
Adds an E2E validator to confirm NVIDIA’s new nvidia-cdi-refresh systemd units are enabled and the last run succeeded, in response to NVIDIA Container Toolkit’s shift toward JIT-generated CDI specs.
Changes:
- Introduces `ValidateNvidiaCdiRefreshServiceRunning` in `e2e/validators.go`.
- Wires the new validator into multiple GPU scenarios (Ubuntu 22.04/24.04 GRID & GPU, Azure Linux v3 GPU) and the GPU NPD scenario helper.
- Refactors `ValidateNvidiaDevicePluginServiceRunning` into a dedicated function block (no behavior change intended).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `e2e/validators.go` | Adds the nvidia-cdi-refresh systemd validation and refactors the device-plugin validator block. |
| `e2e/test_helpers.go` | Adds the new CDI refresh validation to the GPU NPD scenario helper validator chain. |
| `e2e/scenario_test.go` | Adds the new CDI refresh validation to multiple GPU scenario validators. |
Comments suppressed due to low confidence (2)
e2e/validators.go:1461
- Indentation in this newly added function uses spaces instead of the gofmt-standard tabs used throughout the file. Please run gofmt (or otherwise format this block) to keep repository formatting consistent and avoid CI/lint diffs.
```go
func ValidateNvidiaDevicePluginServiceRunning(ctx context.Context, s *Scenario) {
	s.T.Helper()
	s.T.Logf("validating that NVIDIA device plugin systemd service is running")
	command := []string{
		"set -ex",
		"systemctl is-active nvidia-device-plugin.service",
		"systemctl is-enabled nvidia-device-plugin.service",
	}
	execScriptOnVMForScenarioValidateExitCode(ctx, s, strings.Join(command, "\n"), 0, "NVIDIA device plugin systemd service should be active and enabled")
}
```
e2e/scenario_test.go:1010
- The PR title/description says this validator should apply to all AKS GPU SKUs, but this change only wires the new validation into scenario_test.go and the GPUNPD helper. GPU managed-experience scenarios in e2e/scenario_gpu_managed_experience_test.go (e.g. Ubuntu2404/Ubuntu2204/AzureLinux3 NvidiaDevicePluginRunning) still won't exercise ValidateNvidiaCdiRefreshServiceRunning, so coverage is incomplete unless those are updated or the scope/title is narrowed.
```go
Validator: func(ctx context.Context, s *Scenario) {
	// Ensure nvidia-modprobe install does not restart kubelet and temporarily cause node to be unschedulable
	ValidateNvidiaModProbeInstalled(ctx, s)
	ValidateKubeletHasNotStopped(ctx, s)
	ValidateServicesDoNotRestartKubelet(ctx, s)
	ValidateNvidiaCdiRefreshServiceRunning(ctx, s)
},
```
What this PR does / why we need it:
NVIDIA Container Toolkit 1.17.x -> 1.18.x includes a breaking change. Per the release notes, v1.18.0 is a feature release with the following high-level changes:
- The default mode of the NVIDIA Container Runtime has been updated to use a just-in-time-generated CDI specification instead of defaulting to the legacy mode.
- A systemd unit was added to generate CDI specifications for available devices automatically. This allows native CDI support in container engines such as Docker and Podman to be used without additional steps.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.18.0/release-notes.html
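Because a spec-refresh unit like this is typically a oneshot service, "running" is not the right condition to assert; what matters is whether its last run succeeded. A hedged sketch of that pass/fail logic — on a real node the argument would come from `systemctl show -p Result --value nvidia-cdi-refresh.service`, and the unit name and oneshot assumption follow the release notes rather than anything verified here:

```shell
#!/bin/sh
# check_result mimics the success condition for a oneshot unit.
# On a GPU node the argument would come from:
#   systemctl show -p Result --value nvidia-cdi-refresh.service
check_result() {
    if [ "$1" = "success" ]; then
        echo "nvidia-cdi-refresh: last run succeeded"
    else
        echo "nvidia-cdi-refresh: last run failed (Result=$1)" >&2
        return 1
    fi
}

# Simulate a healthy node.
check_result success
```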
However, we don't even invoke nvidia-cdi-hook in the AKS-GPU repo: https://github.com/Azure/aks-gpu/blob/main/config.sh#L4
(nvidia-cdi-hook ships inside the nvidia-container-toolkit-base package.)
This PR adds a validator that checks nvidia-cdi-refresh is enabled for all AKS GPU SKUs.
Without Azure/aks-gpu#136, all the non-managed AKS GPU related tests should fail.
Which issue(s) this PR fixes:
Fixes #