CI: Test different driver versions on Windows #1265

@leofang

Now that #1242 is merged, we have a nice test matrix for different Windows configurations. However, every entry is currently locked to the same driver version, "latest":

{ "ARCH": "amd64", "PY_VER": "3.10", "CUDA_VER": "12.9.1", "LOCAL_CTK": "0", "GPU": "rtx2080", "DRIVER": "latest", "DRIVER_MODE": "WDDM" },
{ "ARCH": "amd64", "PY_VER": "3.10", "CUDA_VER": "13.0.2", "LOCAL_CTK": "1", "GPU": "rtxpro6000", "DRIVER": "latest", "DRIVER_MODE": "TCC" },
{ "ARCH": "amd64", "PY_VER": "3.11", "CUDA_VER": "12.9.1", "LOCAL_CTK": "1", "GPU": "v100", "DRIVER": "latest", "DRIVER_MODE": "MCDM" },
{ "ARCH": "amd64", "PY_VER": "3.11", "CUDA_VER": "13.0.2", "LOCAL_CTK": "0", "GPU": "rtx4090", "DRIVER": "latest", "DRIVER_MODE": "WDDM" },
{ "ARCH": "amd64", "PY_VER": "3.12", "CUDA_VER": "12.9.1", "LOCAL_CTK": "0", "GPU": "l4", "DRIVER": "latest", "DRIVER_MODE": "MCDM" },
{ "ARCH": "amd64", "PY_VER": "3.12", "CUDA_VER": "13.0.2", "LOCAL_CTK": "1", "GPU": "a100", "DRIVER": "latest", "DRIVER_MODE": "TCC" },
{ "ARCH": "amd64", "PY_VER": "3.13", "CUDA_VER": "12.9.1", "LOCAL_CTK": "1", "GPU": "l4", "DRIVER": "latest", "DRIVER_MODE": "TCC" },
{ "ARCH": "amd64", "PY_VER": "3.13", "CUDA_VER": "13.0.2", "LOCAL_CTK": "0", "GPU": "rtxpro6000", "DRIVER": "latest", "DRIVER_MODE": "MCDM" },
{ "ARCH": "amd64", "PY_VER": "3.14", "CUDA_VER": "12.9.1", "LOCAL_CTK": "0", "GPU": "v100", "DRIVER": "latest", "DRIVER_MODE": "TCC" },
{ "ARCH": "amd64", "PY_VER": "3.14", "CUDA_VER": "13.0.2", "LOCAL_CTK": "1", "GPU": "l4", "DRIVER": "latest", "DRIVER_MODE": "MCDM" },
{ "ARCH": "amd64", "PY_VER": "3.14t", "CUDA_VER": "12.9.1", "LOCAL_CTK": "1", "GPU": "l4", "DRIVER": "latest", "DRIVER_MODE": "TCC" },
{ "ARCH": "amd64", "PY_VER": "3.14t", "CUDA_VER": "13.0.2", "LOCAL_CTK": "0", "GPU": "a100", "DRIVER": "latest", "DRIVER_MODE": "MCDM" }

On closer inspection, however, the DRIVER label is not used on Windows at all. Instead, the version is hard-wired in the installer script:
# Set the correct URL, filename, and arguments to the installer
# This driver is picked to support Windows 11 & CUDA 13.0
$version = '581.15'

On Linux we need the DRIVER label because the driver is pre-installed in the runner VMs (and maintained by the runner team), so the label is used to compute the runner name. On Windows, due to technical challenges, we have to install the driver ourselves as part of the CI jobs.

This does, however, give us a unique opportunity to do something that we cannot do on Linux runners today: select the driver versions that we intend to cover.

I think the DRIVER label on Windows could be repurposed to specify the user-mode driver (UMD) version, with the test matrix expanded as follows:

| CTK version | UMD version | Purpose |
| --- | --- | --- |
| prev major, 12.x | 12.0 | test CUDA minor version compatibility |
| prev major, 12.x | 13.x | test CUDA backward compatibility |
| curr major, 13.0 | 13.0 | |
| curr major, 13.0 | 13.x | test CUDA backward compatibility |
| curr major, 13.x | 13.0 | test CUDA minor version compatibility |
| curr major, 13.x | 13.x | |

We would also need a way to map the UMD version (e.g. 13.0) to the kernel-mode driver (KMD) version (e.g. 581.15). Currently there is no public way to do this mapping, so we would need to hard-code a small table.
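To make the idea concrete, here is a minimal Python sketch of the hard-coded table and of how a (CTK, UMD) pair relates to the purposes listed above. Everything in it is an assumption for illustration: the names `UMD_TO_KMD`, `kmd_for`, and `purpose` are hypothetical, the "baseline" and "unsupported" labels are mine, and the only table row taken from this issue is 13.0 -> 581.15. The real CI would more likely do the lookup in the installer script.

```python
# Hypothetical lookup table: UMD (CUDA) version -> KMD (driver package) version.
# Only the 13.0 -> 581.15 pair comes from the installer script quoted above;
# any further rows would have to be filled in by hand.
UMD_TO_KMD = {
    "13.0": "581.15",
}


def parse(version: str) -> tuple[int, int]:
    """Reduce 'major.minor[.patch]' to a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def kmd_for(umd_ver: str) -> str:
    """Return the driver version to install, failing loudly on unknown UMDs."""
    try:
        return UMD_TO_KMD[umd_ver]
    except KeyError:
        raise ValueError(f"no KMD version recorded for UMD {umd_ver!r}") from None


def purpose(ctk_ver: str, umd_ver: str) -> str:
    """Classify what a (CTK, UMD) combination would exercise."""
    ctk, umd = parse(ctk_ver), parse(umd_ver)
    if umd == ctk:
        return "baseline"  # same major.minor on both sides
    if umd > ctk:
        return "backward compatibility"  # newer driver, older toolkit
    if umd[0] == ctk[0]:
        return "minor version compatibility"  # older driver, same major
    return "unsupported"  # driver from an older major than the toolkit


# A matrix entry whose DRIVER field has been repurposed to carry the UMD version.
entry = {"CUDA_VER": "12.9.1", "DRIVER": "13.0"}
print(purpose(entry["CUDA_VER"], entry["DRIVER"]), kmd_for(entry["DRIVER"]))
```

Raising on an unknown UMD version, rather than silently falling back to a default driver, would keep a stale table from masking which driver a job actually tested.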

Perhaps this could be added to the nightly runs (#294)?

Labels: CI/CD, enhancement, triage
