
Conversation

@leofang (Member) commented Oct 3, 2025

Description

Found during local debugging with Andy.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@leofang leofang added this to the cuda.core beta 7 milestone Oct 3, 2025
@leofang leofang requested a review from Andy-Jost October 3, 2025 20:16
@leofang leofang self-assigned this Oct 3, 2025
@leofang leofang added the bug ("Something isn't working"), P0 ("High priority - Must do!"), and cuda.core ("Everything related to the cuda.core module") labels Oct 3, 2025
copy-pr-bot bot (Contributor) commented Oct 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang (Member, Author) commented Oct 3, 2025

/ok to test 4e19418

        if m:
            return m.group(1).split(".")[0]
-    except FileNotFoundError:
+    except (FileNotFoundError, subprocess.CalledProcessError):
Review comment (Contributor):

TODO: use shutil.which, since CalledProcessError can be literally any error that comes from running the command.

Not blocking the review though!
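
The shutil.which suggestion could look something like the following. This is a hedged sketch, not the repo's actual code; in particular, the nvidia-smi query flags (--query-gpu=driver_version --format=csv,noheader) are an assumption about how the version string would be obtained, and parse_driver_major is a hypothetical helper mirroring the m.group(1).split(".")[0] logic in the diff above.

```python
from __future__ import annotations

import re
import shutil
import subprocess


def parse_driver_major(text: str) -> str | None:
    """Extract the major component of a dotted version string, e.g. '535.104.05' -> '535'."""
    m = re.search(r"(\d+(?:\.\d+)*)", text)
    if m:
        return m.group(1).split(".")[0]
    return None


def nvidia_smi_driver_major() -> str | None:
    """Return the driver major version via nvidia-smi, or None if unavailable."""
    exe = shutil.which("nvidia-smi")  # check availability up front instead of catching FileNotFoundError
    if exe is None:
        return None
    try:
        proc = subprocess.run(
            [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except subprocess.CalledProcessError:
        # nvidia-smi exists but failed (e.g. no driver loaded)
        return None
    return parse_driver_major(proc.stdout)
```

With the which-check, CalledProcessError is the only subprocess failure mode left to catch, which narrows the overly broad handling the TODO points out.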


@kkraus14 (Collaborator) commented Oct 3, 2025

Using nvidia-smi isn't the right answer here regardless. It's 100% valid to build cuda.core with CUDA 12.x libraries on a machine with a CUDA 13+ driver.

@cpcloud (Contributor) commented Oct 3, 2025

In theory, one could also not have nvidia-smi available and that shouldn't matter.

@rwgk (Collaborator) commented Oct 3, 2025

Purely as a bug fix this PR seems fine to me.

But bigger picture:

For building we don't actually need a GPU, so the driver version and nvidia-smi don't meaningfully matter (what @kkraus14 and @cpcloud said already).

We already have a hard requirement that CUDA_HOME (or CUDA_PATH) is set:

@functools.cache
def get_cuda_paths():
    CUDA_PATH = os.environ.get("CUDA_PATH", os.environ.get("CUDA_HOME", None))
    if not CUDA_PATH:
        raise RuntimeError("Environment variable CUDA_PATH or CUDA_HOME is not set")
    CUDA_PATH = CUDA_PATH.split(os.pathsep)
    print("CUDA paths:", CUDA_PATH)
    return CUDA_PATH

That defines conclusively what CUDA version the build is for.

We are also sure that we need the headers for the build. So in the given context this should always work:

$ grep '#define\s\s*CUDA_VERSION' $CUDA_HOME/include/cuda.h
#define CUDA_VERSION 13000

Maybe a better fix is to integrate something like this?

from __future__ import annotations
import re
from pathlib import Path

def get_cuda_version_macro(cuda_home: str | Path) -> int | None:
    """
    Given CUDA_HOME, try to extract the CUDA_VERSION macro from include/cuda.h.

    Example line in cuda.h:
        #define CUDA_VERSION 13000

    Returns the integer (e.g. 13000) or None if not found / on error.
    """
    try:
        cuda_h = Path(cuda_home) / "include" / "cuda.h"
        if not cuda_h.is_file():
            return None
        text = cuda_h.read_text(encoding="utf-8", errors="ignore")
        m = re.search(r"^\s*#define\s+CUDA_VERSION\s+(\d+)", text, re.MULTILINE)
        if m:
            return int(m.group(1))
    except Exception:
        pass
    return None
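
As a usage note on the sketch above: by the convention in cuda.h, the CUDA_VERSION macro encodes major * 1000 + minor * 10 (e.g. 13000 is CUDA 13.0, 12040 is CUDA 12.4), so the returned integer can be decoded like this (split_cuda_version is an illustrative helper, not existing repo code):

```python
def split_cuda_version(v: int) -> tuple[int, int]:
    """Decode a CUDA_VERSION macro value into (major, minor)."""
    return v // 1000, (v % 1000) // 10


# split_cuda_version(13000) -> (13, 0); split_cuda_version(12040) -> (12, 4)
```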

@leofang (Member, Author) commented Oct 3, 2025

Yes, switching the check to the major version derived from CUDA_VERSION in the header would have been my last-minute task had I been able to wrap up today, but it looks like I am still hunting down a naughty file descriptor with Andy. @rwgk, if you want to rewrite this check, please feel free (either push to my branch or create a new PR, and I'll close this one)!

@leofang (Member, Author) commented Oct 3, 2025

Well, let me take a step back. There is a reason we want nvidia-smi to play a role. For local development, if the user does not already have cuda-bindings installed (thus triggering the later checks), we want to ensure we build a cuda.core that uses a cuda.bindings version runnable on the user's driver. We do not yet have a way to inject extra run-time dependencies based on build-time information, so this is not fully doable, but at least we can think about how to approach it and see the value of checking driver versions.

@rwgk (Collaborator) commented Oct 3, 2025

> For local development, if the user does not already have cuda-bindings installed (thus triggering the latter checks) we want to ensure we build a cuda.core that uses cuda.bindings whose version is runnable on the user's driver.

I looked into that too (earlier already); I'm attaching a POC implementation for Linux. I believe Windows will work similarly.

The reasoning behind it:

  • libcuda.so is installed with the driver.
  • It is needed at boot time, therefore we can count on it being found via the system dynamic library search.
  • ctypes-based Python code to call cuDriverGetVersion is almost trivial (attached POC, LLM-generated in a second or two).
  • If the driver is actually installed, we can rely on the Python code to work.

from __future__ import annotations

import ctypes
import os
from typing import Optional


def cuda_driver_version() -> Optional[int]:
    """
    Linux-only. Try to load `libcuda.so` via standard dynamic library lookup
    and call `CUresult cuDriverGetVersion(int* driverVersion)`.

    Returns:
        int  : driver version (e.g., 12040 for 12.4), if successful.
        None : on any failure (load error, missing symbol, non-success CUresult).
    """
    # CUresult success code
    CUDA_SUCCESS = 0

    try:
        # Use system search paths only; do not provide an absolute path.
        # Make symbols globally available to any dependent libraries.
        mode = os.RTLD_NOW | os.RTLD_GLOBAL
        lib = ctypes.CDLL("libcuda.so", mode=mode)
    except OSError:
        return None

    try:
        cuDriverGetVersion = lib.cuDriverGetVersion
    except AttributeError:
        # Symbol not found in the loaded library.
        return None

    # int cuDriverGetVersion(int* driverVersion);
    cuDriverGetVersion.restype = ctypes.c_int  # CUresult
    cuDriverGetVersion.argtypes = [ctypes.POINTER(ctypes.c_int)]

    out = ctypes.c_int(0)
    try:
        rc = cuDriverGetVersion(ctypes.byref(out))
    except Exception:
        return None

    if rc != CUDA_SUCCESS:
        return None

    return int(out.value)


if __name__ == "__main__":
    print(cuda_driver_version())
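
For the Windows case mentioned above, the driver library is conventionally nvcuda.dll rather than libcuda.so; a cross-platform sketch of the same cuDriverGetVersion call might look like the following. This is a hedged variant of the POC, not tested on Windows; the library-name and loader selection per platform are the only assumptions beyond the Linux code above.

```python
from __future__ import annotations

import ctypes
import sys


def cuda_driver_version_any_os() -> int | None:
    """Cross-platform variant of the Linux POC: same ctypes call to
    cuDriverGetVersion, with the driver library chosen per platform."""
    if sys.platform == "win32":
        name, loader = "nvcuda.dll", ctypes.WinDLL
    else:
        name, loader = "libcuda.so", ctypes.CDLL
    try:
        lib = loader(name)
    except OSError:
        # No driver installed, or not findable via the system search paths.
        return None
    try:
        fn = lib.cuDriverGetVersion
    except AttributeError:
        return None
    fn.restype = ctypes.c_int  # CUresult
    fn.argtypes = [ctypes.POINTER(ctypes.c_int)]
    out = ctypes.c_int(0)
    if fn(ctypes.byref(out)) != 0:  # CUDA_SUCCESS == 0
        return None
    return int(out.value)
```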

@rwgk (Collaborator) commented Oct 3, 2025

For this PR, I'd say just merge, it's definitely an improvement, and it exists already.

I'll work on another PR to integrate the get_cuda_version_macro() code I posted before and we can continue the discussion there.

@leofang leofang merged commit b09d7ed into NVIDIA:main Oct 4, 2025
74 checks passed
@leofang leofang deleted the fix_err branch October 4, 2025 00:00
github-actions bot commented Oct 4, 2025

Doc Preview CI
Preview removed because the pull request was closed or merged.

rwgk pushed a commit to rwgk/cuda-python that referenced this pull request Oct 4, 2025