CI: allow specifying custom driver versions in test matrix #2176
Draft
leofang wants to merge 4 commits into
Extends the DRIVER field in ci/test-matrix.yml beyond 'latest'/'earliest' to accept an explicit version string (e.g. '580.65.06').

For Linux, ci/tools/install_gpu_driver.sh (adapted from nv-gha-runners/vm-images PR NVIDIA#256) swaps the driver in-job via nsenter when the row uses a custom version; for Windows, ci/tools/install_gpu_driver.ps1 is split into install + configure_driver_mode, with the install step gated on the DRIVER value and the mode step always running.

The matrix row is routed to a 'latest' runner image when the DRIVER is a custom version (the install scripts perform the swap themselves). Container privileges on Linux (--privileged --pid=host) are added only on rows with a custom DRIVER. Custom DRIVER + FLAVOR=wsl is rejected eagerly in the compute-matrix step.

Two existing nightly-numba-cuda rows exercise the new path:
- Linux amd64 / 13.3.0 / l4 -> 580.65.06
- Windows amd64 / 13.3.0 / l4 -> 610.47

Closes NVIDIA#293
Closes NVIDIA#1265
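The routing rules in this commit message can be sketched as a few shell helpers. This is only an illustration under stated assumptions: the function names (`is_image_driver`, `runner_driver`, `container_options`, `validate_row`) are hypothetical, and the actual compute-matrix step may implement the same rules with a different mechanism.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative and not taken from the PR.

# Succeeds when DRIVER names a pre-built runner image rather than a version.
is_image_driver() {
  case "$1" in
    latest|earliest) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the runner image to request: DRIVER itself for latest/earliest,
# otherwise 'latest' (the install script then swaps the driver in-job).
runner_driver() {
  if is_image_driver "$1"; then
    printf '%s\n' "$1"
  else
    printf 'latest\n'
  fi
}

# Prints the extra Linux container options; empty unless DRIVER is custom.
container_options() {
  if is_image_driver "$1"; then
    printf '\n'
  else
    printf '%s\n' '--privileged --pid=host'
  fi
}

# Rejects the unsupported combination of a custom DRIVER with FLAVOR=wsl.
validate_row() {
  local driver="$1" flavor="${2:-}"
  if ! is_image_driver "$driver" && [ "$flavor" = "wsl" ]; then
    echo "error: custom DRIVER '$driver' is not supported with FLAVOR=wsl" >&2
    return 1
  fi
}
```

With these helpers, `runner_driver 580.65.06` prints `latest`, `container_options latest` prints an empty line, and `validate_row 580.65.06 wsl` fails with an error on stderr.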
/ok to test b1b6070
….yml dispatch

- install_gpu_driver.sh: pipe the script body to the host-side bash via stdin (bash -s < "$0") instead of re-execing "$0". The script lives in the GH workspace mount (container-only), so the relative path doesn't resolve after nsenter switches the mount namespace. The < "$0" fd is opened before nsenter and survives the flip.
- test-matrix.yml: Windows nightly-numba-cuda row 610.47 -> 596.36 (610.47 isn't published on the CDN; install hit 404).
- ci.yml: add workflow_dispatch: trigger so the pipeline can be re-run manually. The existing should-skip / detect-changes gates already handle non-PR events.
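The stdin trick from the first bullet can be demonstrated without root privileges. In the sketch below, `run_via_stdin` is a hypothetical helper and `cd /` is only a stand-in for the mount-namespace switch that nsenter performs in the PR; the point it illustrates is that the redirection is opened before the child changes how paths resolve.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a script by feeding its body to a fresh bash on stdin.
# The redirection on the subshell is opened in the caller's context first, so a
# relative path still resolves. Only afterwards does the child move elsewhere
# ("cd /" stands in for the namespace flip done by nsenter in the PR).
run_via_stdin() {
  local script="$1"
  shift
  # "bash -s" reads commands from stdin; arguments after "--" become $1, $2, ...
  ( cd / && exec bash -s -- "$@" ) < "$script"
}
```

One consequence of this design is that the re-executed script cannot also read user input from stdin, since stdin is carrying the script body itself.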
/ok to test 3e016b5
leofang commented Jun 7, 2026
# nightly-numba-cuda
- { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', DRIVER_MODE: 'TCC' }
- { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest', DRIVER_MODE: 'TCC' }
- { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: '596.36', DRIVER_MODE: 'TCC' }
leofang commented Jun 7, 2026
# nightly-numba-cuda
- { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '12.9.1', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
- { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: 'latest' }
- { MODE: 'nightly-numba-cuda', ARCH: 'amd64', PY_VER: '3.12', CUDA_VER: '13.3.0', LOCAL_CTK: '0', GPU: 'l4', GPU_COUNT: '1', DRIVER: '580.65.06' }
So nvidia-smi validates the post-install driver state on custom-DRIVER rows. Windows test-wheel + coverage already use Install -> Configure -> Ensure; this brings the Linux test-wheel job into line.
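One way a post-install check could go beyond simply running nvidia-smi is to compare the reported driver version with the matrix row's DRIVER value. The helper below is a hypothetical sketch, not something the PR is known to do; it takes the text printed by `nvidia-smi --query-gpu=driver_version --format=csv,noheader` as an argument so the comparison can be exercised without a GPU.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeeds only if every GPU reports the expected version.
#   $1 = expected driver version (a custom DRIVER value such as 580.65.06)
#   $2 = text as printed by:
#        nvidia-smi --query-gpu=driver_version --format=csv,noheader
driver_version_matches() {
  local expected="$1" reported="$2" line seen=0
  while IFS= read -r line; do
    line="${line//[[:space:]]/}"      # drop stray whitespace and carriage returns
    [ -z "$line" ] && continue        # skip blank lines
    seen=1
    [ "$line" = "$expected" ] || return 1
  done <<< "$reported"
  [ "$seen" -eq 1 ]                   # fail if no GPU reported anything
}
```

Such a comparison only makes sense on custom-DRIVER rows, since `latest` and `earliest` do not name a specific version to compare against.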
Exercises the custom-driver install path on every PR (not just nightly). Both rows are amd64 / 13.3.0 / local-CTK, on l4 and rtxpro6000 -- both in the 'open' kernel-module flavor (only Volta needs 'legacy').
/ok to test 4a23b23
rwgk reviewed Jun 7, 2026
Comment on lines +34 to +39
$nvidia_devices = Get-PnpDevice -Class Display -FriendlyName "NVIDIA*"
foreach ($device in $nvidia_devices) {
    Write-Output "Restarting device: $($device.FriendlyName) ($($device.InstanceId))"
    pnputil /disable-device "$($device.InstanceId)"
    pnputil /enable-device "$($device.InstanceId)"
}
It seems this function now cycles display devices on every Windows job, even when no driver was installed and nvidia-smi -fdm reports that the mode was already correct (e.g. "Driver model is already set to MCDM for GPU 00000000:0A:00.0." in the log). Is that intentional? In the h100/MCDM CI failure, NVML becomes unavailable right after the pnputil cycle here, so I’m wondering if the device restart should only happen after an install or an actual mode change.
Description
closes #293
closes #1265
Extends the `DRIVER` field in `ci/test-matrix.yml` to accept an explicit driver version string (e.g. `580.65.06`) in addition to the existing `latest`/`earliest`.

- `ci/tools/install_gpu_driver.sh` swaps the driver in-job. It is adapted nearly verbatim from the runner team's `nvgha-driver` CLI (copied so we don't depend on its rollout schedule). The script `nsenter`s onto the host for the install and refreshes the toolkit bind mounts back inside the test container. Because our script lives in the GH workspace mount (container-only), the host-side re-exec reads the script from stdin via `bash -s < "$0"` rather than running `"$0"` directly (the relative path doesn't resolve after the mount-namespace flip).
- `ci/tools/install_gpu_driver.ps1` is split into two scripts: `install_gpu_driver.ps1` (now install-only, reads `DRIVER` from env, errors on `latest`/`earliest`) and a new `ci/tools/configure_driver_mode.ps1` (driver-mode + `pnputil` device cycle, runs on every job). This also fixes a long-standing wart: the previous script unconditionally installed a hardcoded `581.15` even when the matrix row used a `latest`/`earliest` runner that already carried the right driver.

Matrix wiring (in both `test-wheel-linux.yml` and `test-wheel-windows.yml`):

- `compute-matrix` adds a new `RUNNER_DRIVER` field per row, equal to `DRIVER` for `latest`/`earliest`, otherwise `latest`.
- `runs-on:` is keyed on `RUNNER_DRIVER` so custom-DRIVER rows land on the most recent pre-installed runner image (the install scripts perform the actual swap).
- `container.options` only adds `--privileged --pid=host` for custom-DRIVER rows (required by the nsenter dance).
- A custom `DRIVER` combined with `FLAVOR=wsl` is rejected eagerly in `compute-matrix`; the in-container swap doesn't work under WSL.
- The Ensure step (`nvidia-smi`) now runs after the install / configure step in every workflow, so it validates the post-install driver state on custom-DRIVER rows.
- `coverage.yml` (Windows path, hardcoded `DRIVER: latest`) was updated alongside since it was the other caller of the old combined script.

Matrix rows flipped to exercise the new code path:

- `amd64 / 3.13 / 13.3.0 / local-CTK / rtxpro6000` → `DRIVER: '610.43.02'`
- `amd64 / 3.14 / 13.3.0 / local-CTK / l4` → `DRIVER: '610.43.02'`
- `amd64 / 3.12 / 13.3.0 / l4` → `DRIVER: '580.65.06'` (Linux nightly-numba-cuda row)
- `amd64 / 3.12 / 13.3.0 / l4` → `DRIVER: '596.36'` (Windows nightly-numba-cuda row)

Also enables `workflow_dispatch:` on `ci.yml` so the main CI pipeline can be re-run manually from the Actions UI (no inputs: the workflow already builds every wheel it tests, and the existing `should-skip`/`detect-changes` gates handle non-PR events correctly).

All other matrix rows continue to use `DRIVER: latest` and are unaffected (same runners, no install step, no privileged container).

Checklist