
Conversation

@leofang (Member) commented Oct 3, 2025

Description

Found during local debugging with Andy.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@leofang leofang added this to the cuda.core beta 7 milestone Oct 3, 2025
@leofang leofang requested a review from Andy-Jost October 3, 2025 20:16
@leofang leofang self-assigned this Oct 3, 2025
@leofang leofang added the bug ("Something isn't working"), P0 ("High priority - Must do!"), and cuda.core ("Everything related to the cuda.core module") labels Oct 3, 2025
copy-pr-bot bot (Contributor) commented Oct 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang (Member, Author) commented Oct 3, 2025

/ok to test 4e19418

        if m:
            return m.group(1).split(".")[0]
-    except FileNotFoundError:
+    except (FileNotFoundError, subprocess.CalledProcessError):
Review comment (Contributor):

TODO: use shutil.which, since CalledProcessError can be literally any error that comes from running the command.

Not blocking the review though!
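
The shutil.which suggestion could look something like the following. This is a hedged sketch, not the repo's actual code; in particular, the nvidia-smi query flags (--query-gpu=driver_version --format=csv,noheader) are an assumption about how the version string would be obtained, and parse_driver_major is a hypothetical helper mirroring the m.group(1).split(".")[0] logic in the diff above.

```python
from __future__ import annotations

import re
import shutil
import subprocess


def parse_driver_major(text: str) -> str | None:
    """Extract the major component of a dotted version string, e.g. '535.104.05' -> '535'."""
    m = re.search(r"(\d+(?:\.\d+)*)", text)
    if m:
        return m.group(1).split(".")[0]
    return None


def nvidia_smi_driver_major() -> str | None:
    """Return the driver major version via nvidia-smi, or None if unavailable."""
    exe = shutil.which("nvidia-smi")  # check availability up front instead of catching FileNotFoundError
    if exe is None:
        return None
    try:
        proc = subprocess.run(
            [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except subprocess.CalledProcessError:
        # nvidia-smi exists but failed (e.g. no driver loaded)
        return None
    return parse_driver_major(proc.stdout)
```

With the which-check, CalledProcessError is the only subprocess failure mode left to catch, which narrows the overly broad handling the TODO points out.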


@kkraus14 (Collaborator) commented Oct 3, 2025

Using nvidia-smi isn't the right answer here regardless. It's 100% valid to build cuda.core with CUDA 12.x libraries on a machine with a CUDA 13+ driver.

@cpcloud (Contributor) commented Oct 3, 2025

In theory, one could also not have nvidia-smi available and that shouldn't matter.

@rwgk (Collaborator) commented Oct 3, 2025

Purely as a bug fix this PR seems fine to me.

But bigger picture:

For building we don't actually need a GPU, so the driver version and nvidia-smi don't meaningfully matter (what @kkraus14 and @cpcloud said already).

We already have a hard requirement that CUDA_HOME (or CUDA_PATH) is set:

@functools.cache
def get_cuda_paths():
    CUDA_PATH = os.environ.get("CUDA_PATH", os.environ.get("CUDA_HOME", None))
    if not CUDA_PATH:
        raise RuntimeError("Environment variable CUDA_PATH or CUDA_HOME is not set")
    CUDA_PATH = CUDA_PATH.split(os.pathsep)
    print("CUDA paths:", CUDA_PATH)
    return CUDA_PATH

That defines conclusively what CUDA version the build is for.

We are also sure that we need the headers for the build. So in the given context this should always work:

$ grep '#define\s\s*CUDA_VERSION' $CUDA_HOME/include/cuda.h
#define CUDA_VERSION 13000

Maybe a better fix is to integrate something like this?

from __future__ import annotations
import re
from pathlib import Path

def get_cuda_version_macro(cuda_home: str | Path) -> int | None:
    """
    Given CUDA_HOME, try to extract the CUDA_VERSION macro from include/cuda.h.

    Example line in cuda.h:
        #define CUDA_VERSION 13000

    Returns the integer (e.g. 13000) or None if not found / on error.
    """
    try:
        cuda_h = Path(cuda_home) / "include" / "cuda.h"
        if not cuda_h.is_file():
            return None
        text = cuda_h.read_text(encoding="utf-8", errors="ignore")
        m = re.search(r"^\s*#define\s+CUDA_VERSION\s+(\d+)", text, re.MULTILINE)
        if m:
            return int(m.group(1))
    except Exception:
        pass
    return None
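
As a usage note on the sketch above: by the convention in cuda.h, the CUDA_VERSION macro encodes major * 1000 + minor * 10 (e.g. 13000 is CUDA 13.0, 12040 is CUDA 12.4), so the returned integer can be decoded like this (split_cuda_version is an illustrative helper, not existing repo code):

```python
def split_cuda_version(v: int) -> tuple[int, int]:
    """Decode a CUDA_VERSION macro value into (major, minor)."""
    return v // 1000, (v % 1000) // 10


# split_cuda_version(13000) -> (13, 0); split_cuda_version(12040) -> (12, 4)
```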

@leofang (Member, Author) commented Oct 3, 2025

Yes, switching the check to the major version derived from CUDA_VERSION in the header would have been my last-minute task had I been able to wrap up today, but it looks like I am still hunting down a naughty file descriptor with Andy. @rwgk, if you want to rewrite this check, please feel free (either push to my branch or create a new PR, and I'll close this one)!

@leofang (Member, Author) commented Oct 3, 2025

Well, let me take a step back. There is a reason we want nvidia-smi to play a role. For local development, if the user does not already have cuda-bindings installed (thus triggering the later checks), we want to ensure we build a cuda.core that uses a cuda.bindings version runnable on the user's driver. We do not yet have a way to inject extra run-time dependencies based on build-time information, so this is not fully doable, but at least we can think about how to approach it and see the value of checking driver versions.

@rwgk (Collaborator) commented Oct 3, 2025

> For local development, if the user does not already have cuda-bindings installed (thus triggering the latter checks) we want to ensure we build a cuda.core that uses cuda.bindings whose version is runnable on the user's driver.

I looked into that too (earlier already); I'm attaching a POC implementation for Linux. I believe Windows will work similarly.

The reasoning behind it:

  • libcuda.so is installed with the driver.
  • It is needed at boot time, therefore we can count on it being found via the system dynamic library search.
  • ctypes-based Python code to call cuDriverGetVersion is almost trivial (attached POC, LLM-generated in a second or two).
  • If the driver is actually installed, we can rely on the Python code to work.

from __future__ import annotations

import ctypes
import os
from typing import Optional


def cuda_driver_version() -> Optional[int]:
    """
    Linux-only. Try to load `libcuda.so` via standard dynamic library lookup
    and call `CUresult cuDriverGetVersion(int* driverVersion)`.

    Returns:
        int  : driver version (e.g., 12040 for 12.4), if successful.
        None : on any failure (load error, missing symbol, non-success CUresult).
    """
    # CUresult success code
    CUDA_SUCCESS = 0

    try:
        # Use system search paths only; do not provide an absolute path.
        # Make symbols globally available to any dependent libraries.
        mode = os.RTLD_NOW | os.RTLD_GLOBAL
        lib = ctypes.CDLL("libcuda.so", mode=mode)
    except OSError:
        return None

    try:
        cuDriverGetVersion = lib.cuDriverGetVersion
    except AttributeError:
        # Symbol not found in the loaded library.
        return None

    # int cuDriverGetVersion(int* driverVersion);
    cuDriverGetVersion.restype = ctypes.c_int  # CUresult
    cuDriverGetVersion.argtypes = [ctypes.POINTER(ctypes.c_int)]

    out = ctypes.c_int(0)
    try:
        rc = cuDriverGetVersion(ctypes.byref(out))
    except Exception:
        return None

    if rc != CUDA_SUCCESS:
        return None

    return int(out.value)


if __name__ == "__main__":
    print(cuda_driver_version())
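
For the Windows case mentioned above, the driver library is conventionally nvcuda.dll rather than libcuda.so; a cross-platform sketch of the same cuDriverGetVersion call might look like the following. This is a hedged variant of the POC, not tested on Windows; the library-name and loader selection per platform are the only assumptions beyond the Linux code above.

```python
from __future__ import annotations

import ctypes
import sys


def cuda_driver_version_any_os() -> int | None:
    """Cross-platform variant of the Linux POC: same ctypes call to
    cuDriverGetVersion, with the driver library chosen per platform."""
    if sys.platform == "win32":
        name, loader = "nvcuda.dll", ctypes.WinDLL
    else:
        name, loader = "libcuda.so", ctypes.CDLL
    try:
        lib = loader(name)
    except OSError:
        # No driver installed, or not findable via the system search paths.
        return None
    try:
        fn = lib.cuDriverGetVersion
    except AttributeError:
        return None
    fn.restype = ctypes.c_int  # CUresult
    fn.argtypes = [ctypes.POINTER(ctypes.c_int)]
    out = ctypes.c_int(0)
    if fn(ctypes.byref(out)) != 0:  # CUDA_SUCCESS == 0
        return None
    return int(out.value)
```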

@rwgk (Collaborator) commented Oct 3, 2025

For this PR, I'd say just merge, it's definitely an improvement, and it exists already.

I'll work on another PR to integrate the get_cuda_version_macro() code I posted before and we can continue the discussion there.

@leofang leofang merged commit b09d7ed into NVIDIA:main Oct 4, 2025
74 checks passed
@leofang leofang deleted the fix_err branch October 4, 2025 00:00
github-actions bot commented Oct 4, 2025

Doc Preview CI
Preview removed because the pull request was closed or merged.

rwgk pushed a commit to rwgk/cuda-python that referenced this pull request Oct 4, 2025