-
Notifications
You must be signed in to change notification settings - Fork 212
Fix skipping the check for nvidia-smi #1084
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/ok to test 4e19418 |
if m: | ||
return m.group(1).split(".")[0] | ||
except FileNotFoundError: | ||
except (FileNotFoundError, subprocess.CalledProcessError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: use shutil.which
, since CalledProcessError
can be literally any error that comes from running the command.
Not blocking the review though!
This comment has been minimized.
This comment has been minimized.
Using |
In theory, one could also not have |
Purely as a bug fix this PR seems fine to me. But bigger picture: For building we don't actually need a GPU, so the driver version and nvidia-smi don't meaningfully matter (what @kkraus14 and @cpcloud said already). We already have a hard requirement that cuda-python/cuda_core/build_hooks.py Lines 76 to 83 in 872c68e
That defines conclusively what CUDA version the build is for. We are also sure that we need the headers for the build. So in the given context this should always work:
Maybe a better fix is to integrate something like this? from __future__ import annotations
import re
from pathlib import Path
def get_cuda_version_macro(cuda_home: str | Path) -> int | None:
"""
Given CUDA_HOME, try to extract the CUDA_VERSION macro from include/cuda.h.
Example line in cuda.h:
#define CUDA_VERSION 13000
Returns the integer (e.g. 13000) or None if not found / on error.
"""
try:
cuda_h = Path(cuda_home) / "include" / "cuda.h"
if not cuda_h.is_file():
return None
text = cuda_h.read_text(encoding="utf-8", errors="ignore")
m = re.search(r"^\s*#define\s+CUDA_VERSION\s+(\d+)", text, re.MULTILINE)
if m:
return int(m.group(1))
except Exception:
pass
return None |
Yes switching to check the major version based on |
Well let me take a step back. There is a reason that we want nvidia-smi to play a role. For local development, if the user does not already have cuda-bindings installed (thus triggering the latter checks) we want to ensure we build a cuda.core that uses cuda.bindings whose version is runnable on the user's driver. Now, we don't have a way to inject extra run-time dependencies through the build time info yet, so this is not fully doable, but at least we can think about how this can be approached and see the value of checking driver versions. |
I looked into that, too (before already), I'm attaching the POC implementation for Linux; I believe Windows will work similarly. The reasoning behind it:
from __future__ import annotations
import ctypes
import os
from typing import Optional
def cuda_driver_version() -> Optional[int]:
"""
Linux-only. Try to load `libcuda.so` via standard dynamic library lookup
and call `CUresult cuDriverGetVersion(int* driverVersion)`.
Returns:
int : driver version (e.g., 12040 for 12.4), if successful.
None : on any failure (load error, missing symbol, non-success CUresult).
"""
# CUDA_SUCCESS = 0
CUDA_SUCCESS = 0
try:
# Use system search paths only; do not provide an absolute path.
# Make symbols globally available to any dependent libraries.
mode = os.RTLD_NOW | os.RTLD_GLOBAL
lib = ctypes.CDLL("libcuda.so", mode=mode)
except OSError:
return None
try:
cuDriverGetVersion = lib.cuDriverGetVersion
except AttributeError:
# Symbol not found in the loaded library.
return None
# int cuDriverGetVersion(int* driverVersion);
cuDriverGetVersion.restype = ctypes.c_int # CUresult
cuDriverGetVersion.argtypes = [ctypes.POINTER(ctypes.c_int)]
out = ctypes.c_int(0)
try:
rc = cuDriverGetVersion(ctypes.byref(out))
except Exception:
return None
if rc != CUDA_SUCCESS:
return None
return int(out.value)
if __name__ == "__main__":
print(cuda_driver_version()) |
For this PR, I'd say just merge, it's definitely an improvement, and it exists already. I'll work on another PR to integrate the |
|
Description
Found during local debugging with Andy.
Checklist