nvbug-6193808: Work around mojibake in nvml.system_get_process_name on WSL #2118
mdboom wants to merge 5 commits into
rwgk
left a comment
I used Cursor GPT-5.5 1M Extra High
GPT Findings
- Medium: cuda_bindings/docs/source/release/13.2.0-notes.rst and
  cuda_bindings/docs/source/release/13.3.0-notes.rst give an incomplete
  raw-NVML workaround. The PR and nvbug context indicate the corruption happens
  when nvmlDeviceGetComputeRunningProcesses_v3 first primes the PID cache, so
  setting the locale to "C" only before nvml.system_get_process_name can be
  too late if the cache was already populated. The notes should say to prime and
  read under "C", matching the cuda.core workaround.
- Medium: cuda_core/cuda/core/system/_system.pyx now makes
  get_process_name() depend on successfully enumerating compute processes on
  every device. Any unrelated device-level NVML failure now breaks a per-PID
  lookup, including on non-WSL where this is a behavior change. Consider trying
  the direct read first on non-WSL, or making priming failures narrower.
- Low: cuda_core/docs/source/release/1.1.0-notes.rst says the WSL workaround
  may hold a global lock, but the implementation uses POSIX per-thread locale
  APIs and no global lock is present. That note looks stale or misleading.
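As a rough illustration of what "making priming failures narrower" in the second finding could mean, here is a minimal sketch. The function name, the prime_device and read_name callables, and the DeviceQueryError class are hypothetical stand-ins invented for this example; they are not the cuda.core or NVML API.

```python
from typing import Callable, Iterable


class DeviceQueryError(Exception):
    """Hypothetical stand-in for a device-level NVML failure."""


def get_process_name_best_effort(
    pid: int,
    devices: Iterable[object],
    prime_device: Callable[[object], None],
    read_name: Callable[[int], str],
) -> str:
    """Prime every device that can be primed, then do the per-PID read."""
    for device in devices:
        try:
            prime_device(device)
        except DeviceQueryError:
            # An unrelated failure on one device should not break the lookup;
            # the final read still decides whether the PID is known.
            continue
    return read_name(pid)
```

With this shape, a failure while priming one device is skipped, and only the final per-PID read can fail the call.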
Lightweight Thread-Safety Recommendation
The new locale switching implementation is reasonably thread-safe because it
uses POSIX newlocale/uselocale/freelocale, which scope the "C" locale to
the calling OS thread rather than mutating process-global locale state.
The remaining race is around NVML's process-name cache, which appears to be
process/global driver state. On WSL, another thread could call a cache-priming
path such as nvmlDeviceGetComputeRunningProcesses_v3 under a non-"C" locale
between get_process_name()'s prime and read steps, reintroducing corrupted
cached data.
A lightweight improvement would be to add a module-private Python
threading.RLock shared by the cuda.core paths most likely to touch this
cache. Hold it around:
- system.get_process_name()'s WSL c_locale_guard() + prime + read sequence.
- Device.compute_running_processes on WSL, since that is the main in-package
  path that can prime the process-name cache.
This would not protect raw cuda.bindings.nvml.* calls or external users, but
it would cover the most likely cuda.core race without globally serializing all
NVML access. A Python RLock is sufficient here: even if Cython releases the GIL
inside NVML wrappers, the lock remains held until the surrounding Python with
block exits.
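The recommendation above could look roughly like the following sketch. The lock name, the function signatures, and the callables passed in are assumptions made for illustration. c_locale_guard is the name used in this review, but how it is called here is a guess.

```python
import threading

# Hypothetical module-private lock shared by the paths that touch NVML's
# process-name cache. An RLock lets the same thread re-acquire it when one
# guarded path calls another.
_process_name_cache_lock = threading.RLock()


def compute_running_processes(enumerate_processes):
    """Guarded enumeration: the main in-package path that primes the cache."""
    with _process_name_cache_lock:
        return enumerate_processes()


def get_process_name(pid, prime, read, c_locale_guard):
    """Hold the lock across the whole guard + prime + read sequence."""
    with _process_name_cache_lock:
        with c_locale_guard():
            prime()           # may itself call compute_running_processes()
            return read(pid)  # no other guarded path can re-prime in between
```

The reentrant lock matters if the prime step goes through the guarded enumeration path on the same thread. A plain threading.Lock would deadlock in that case.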
* Updating from older versions (v12.6.2.post1 and below) via ``pip install -U cuda-python`` might not work. Please do a clean re-installation by uninstalling ``pip uninstall -y cuda-python`` followed by installing ``pip install cuda-python``.
* ``nvml.system_get_process_name`` on WSL can return incorrect values. To work around this, set the locale to "C" before calling ``nvml.device_get_compute_running_processes_v3`` (which sets the process names) and before calling ``nvml.system_get_process_name``. ``cuda_core`` does this automatically, but users of the raw NVML API will need to do this manually.
Why is this in both 13.2 and 13.3 release notes?
It's a compatibility note that's in line with the existing v12.6.2.post1 and below note, which we've been carrying forward since 12.8.0, I believe in every single release since then:
$ git grep 'v12.6.2.post1 and below' | cut -d/ -f-4 | uniq -c
16 cuda_bindings/docs/source/release
15 cuda_python/docs/source/release
CUDA_BINDINGS_NVML_IS_COMPATIBLE: bool
cdef bint _detect_wsl():
There is already a WSL test helper with broader detection logic: cuda_python_test_helpers._detect_wsl checks /proc/sys/kernel/osrelease and also falls back to WSL_DISTRO_NAME / WSL_INTEROP. Because the test helpers are used by cuda.bindings tests and should not depend on cuda.core, we unfortunately probably need to keep this small functionality duplicated. Could we pull the env-var fallback into this runtime helper too, so the two helpers have the same detection coverage?
I looked at the git history: cuda_python_test_helpers._detect_wsl was added in #1045, and that PR introduced the WSL_DISTRO_NAME / WSL_INTEROP fallback. To me it looks like purely defensive AI-generated code of the kind I often saw from older agents; even with newer ones I still sometimes need to clean out defensive code. I quizzed GPT-5.5 about it; it told me:
So yes: those env vars look mostly defensive. ... env vars can be inherited into containers/subprocesses, so the fallback is slightly more false-positive-prone than the /proc check.
I'd argue for removing the fallback, although definitely not in this PR.
Fair enough. We should just keep the logic consistent if possible. Ideally we'd only have one implementation, but because we want the test helpers to not depend on cuda-core and we don't want to ship this as part of cuda-bindings / cuda-pathfinder I don't think it's worth a new utilities package or anything of the like.
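For reference, the detection logic discussed in this thread can be sketched as follows. This is a minimal illustration, not the code of either helper: the function name and parameters are hypothetical, and the substring test on the osrelease contents is an assumption about what "checks /proc/sys/kernel/osrelease" means.

```python
import os


def detect_wsl(osrelease_path="/proc/sys/kernel/osrelease",
               environ=None, use_env_fallback=True):
    """Return True if the process appears to be running under WSL."""
    env = os.environ if environ is None else environ
    try:
        with open(osrelease_path, encoding="utf-8", errors="replace") as f:
            release = f.read().lower()
    except OSError:
        release = ""
    # Assumption: WSL kernels advertise themselves in the release string.
    if "microsoft" in release or "wsl" in release:
        return True
    if use_env_fallback:
        # These variables can be inherited into containers and subprocesses,
        # so this branch is more prone to false positives than the /proc check.
        return bool(env.get("WSL_DISTRO_NAME") or env.get("WSL_INTEROP"))
    return False
```

Passing use_env_fallback=False gives the stricter behaviour argued for above, and keeping both behaviours behind one flag is one way to keep the two helpers consistent.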
Summary
cuda.core.system.get_process_name(pid) raises UnicodeDecodeError under WSL whenever the calling process has a non-C locale (which is the default state for any CPython process, since the interpreter calls setlocale(LC_CTYPE, "") at startup). This is reproducible by running the cuda_core test suite with any seed that schedules tests/system/test_system_device.py::test_compute_running_processes before tests/system/test_system_system.py::test_get_process_name.
The underlying defect is in NVML's WSL implementation (see Root cause below). This PR adds a scoped, defensive workaround in cuda_core so the public API returns a correct value on WSL. It also fixes a latent issue where get_process_name was effectively unusable from a fresh process because it never primed NVML's per-PID name cache.
Root cause: the WSL mojibake
NVML's nvmlSystemGetProcessName on WSL takes a different code path depending on the process's current locale. With the default "C" locale, the function returns the basename portion of /proc/<pid>/exe correctly. With any other locale (including the typical en_US.UTF-8), it instead walks an internal UTF-16LE buffer holding the executable path but uses a 4-byte stride (as if the buffer were UTF-32LE). Each "code point" it pulls is therefore two adjacent ASCII characters packed into the low and high bytes of a single 24-bit value. That value is then emitted as an extended 5-byte UTF-8 sequence (the 0xF8-prefixed encoding used to represent code points beyond U+10FFFF).
The net result for, say, a Python process whose /proc/<pid>/exe resolves to:
is that the returned buffer looks like ~180 bytes of f8 … chunks followed by the correctly-encoded trailing basename, e.g.:
Decoding the first chunk illustrates the pattern: f8 9a 80 80 af decodes as the extended-UTF-8 code point 0x68002F, which has 'h' (0x68) in the high byte and '/' (0x2F) in the low byte, i.e. the source ASCII bytes /h read as a little-endian 32-bit value padded with zeros.
Every chunk has this structure; together they spell out the prefix /home/mdboom/.local/share/uv/python/cpython-3.14.0-linux-x86_64-gnu/bin/ two characters at a time. The trailing /python3.14 is unaffected because of where the buggy stride leaves the cursor.
Why the workaround needs to "re-prime"
nvmlSystemGetProcessName is cache-driven: the per-PID name is populated the first time NVML enumerates compute processes that include the PID (typically via nvmlDeviceGetComputeRunningProcesses_v3). Critically:
- Once an entry has been cached under a non-"C" locale, switching to the "C" locale and re-reading does not unscramble it; the cache survives the locale flip.
- Re-priming under the "C" locale overwrites the cached entry with the correct UTF-8 string. Subsequent reads (in any locale) then return correctly.
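The two observations above can be captured in a toy model. This is purely illustrative: ToyNameCache is a made-up class that mimics the described behaviour and says nothing about how NVML is implemented internally.

```python
class ToyNameCache:
    """Toy model of the cache behaviour described above; not NVML itself."""

    def __init__(self):
        self.locale = "en_US.UTF-8"
        self._names = {}

    def prime(self, pid, name):
        # The entry is produced at prime time, using whatever locale is in
        # effect at that moment.
        self._names[pid] = name if self.locale == "C" else "<garbled>" + name

    def read(self, pid):
        # Reads return the cached entry as-is; the current locale is ignored.
        return self._names[pid]


cache = ToyNameCache()
cache.prime(1234, "python3.14")   # primed under a non-"C" locale
cache.locale = "C"
stale = cache.read(1234)          # still garbled: the cache survives the flip
cache.prime(1234, "python3.14")   # re-prime under "C" overwrites the entry
cache.locale = "en_US.UTF-8"
fixed = cache.read(1234)          # correct from now on, in any locale
```

In this model, only a prime performed under "C" ever produces a clean entry.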
So the workaround must do prime + read together under "C".
Behaviour after this PR
- get_process_name now primes the NVML cache automatically. This makes it usable from a fresh process (previously a caller had to have manually queried device.compute_running_processes first or accept NotFoundError).
- On WSL, the prime and read both run under the "C" locale, so the returned name is the correct UTF-8 string regardless of the caller's locale.
Discussion
cuda_bindings layer and fix it for all cuda_bindings users? In that case I guess cuda_core should raise an exception from get_process_name if on WSL and the cuda_bindings installed is too old?
generally be avoided.