nvbug-6193808: Work around mojibake in nvml.system_get_process_name on WSL by mdboom · Pull Request #2118 · NVIDIA/cuda-python

mdboom · 2026-05-20T19:20:07Z

Summary

cuda.core.system.get_process_name(pid) raises UnicodeDecodeError under
WSL whenever the calling process has a non-C locale (which is the default
state for any CPython process, since the interpreter sets LC_CTYPE to the
user's preferred locale at startup). This is reproducible by running the
cuda_core test suite with any seed that schedules
tests/system/test_system_device.py::test_compute_running_processes before
tests/system/test_system_system.py::test_get_process_name.

The underlying defect is in NVML's WSL implementation (see Root cause
below). This PR adds a scoped, defensive workaround in cuda_core so the
public API returns a correct value on WSL. It also fixes a latent issue
where get_process_name was effectively unusable from a fresh process
because it never primed NVML's per-PID name cache.

Root cause: the WSL mojibake

NVML's nvmlSystemGetProcessName on WSL takes a different code path
depending on the process's current locale. With the default "C" locale,
the function returns the basename portion of /proc/<pid>/exe correctly.
With any other locale (including the typical en_US.UTF-8), it instead
walks an internal UTF-16LE buffer holding the executable path but uses a
4-byte stride (as if the buffer were UTF-32LE). Each "code point" it
pulls is therefore two adjacent ASCII characters packed into a single
24-bit value: the first character in bits 0-7 and the second in bits
16-23. That value is then emitted as an extended 5-byte UTF-8 sequence
(the 0xF8-prefixed form from the original UTF-8 definition, which covers
values from 0x200000 to 0x3FFFFFF and is not valid in modern UTF-8).
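
To make the stride mismatch concrete, the sketch below models the behaviour described above in pure Python. It is only an illustration of the description, not NVML's actual code, and the function names are made up for this example.

```python
import struct


def encode_extended_utf8(value: int) -> bytes:
    """Encode a value in 0x200000..0x3FFFFFF as a 5-byte, 0xF8-prefixed sequence."""
    if not 0x200000 <= value <= 0x3FFFFFF:
        raise ValueError(f"value {value:#x} does not need the 5-byte form")
    return bytes(
        [
            0xF8 | (value >> 24),           # lead byte carries the top 2 bits
            0x80 | ((value >> 18) & 0x3F),  # four continuation bytes, 6 bits each
            0x80 | ((value >> 12) & 0x3F),
            0x80 | ((value >> 6) & 0x3F),
            0x80 | (value & 0x3F),
        ]
    )


def simulate_mojibake(text: str) -> bytes:
    """Read UTF-16LE data with a 4-byte stride and re-encode each 32-bit value."""
    raw = text.encode("utf-16-le")
    out = bytearray()
    for offset in range(0, len(raw) - 3, 4):
        # Two 16-bit code units are misread as one little-endian 32-bit value.
        (value,) = struct.unpack_from("<I", raw, offset)
        out += encode_extended_utf8(value)
    return bytes(out)


print(simulate_mojibake("/home/md").hex(" "))
# f8 9a 80 80 af f8 9b 90 81 af f8 8b b0 81 a5 f8 99 80 81 ad
```

Feeding it the first eight characters of the example path yields the same four leading chunks as the dump quoted in this description.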

The net result for, say, a Python process whose /proc/<pid>/exe resolves
to:

/home/mdboom/.local/share/uv/python/cpython-3.14.0-linux-x86_64-gnu/bin/python3.14

is that the returned buffer looks like ~180 bytes of f8 … chunks
followed by the correctly-encoded trailing basename, e.g.:

f8 9a 80 80 af  f8 9b 90 81 af  f8 8b b0 81 a5  f8 99 80 81 ad  …  /python3.14\0

Decoding the first chunk illustrates the pattern:

- The bytes f8 9a 80 80 af decode as the extended-UTF-8 code point 0x68002F.
- That 24-bit value packs 'h' (0x68) in the high byte and '/' (0x2F)
  in the low byte, i.e. the UTF-16LE code units for /h (bytes
  2f 00 68 00) read as a single little-endian 32-bit value.

Every chunk has this structure; together they spell out the prefix
/home/mdboom/.local/share/uv/python/cpython-3.14.0-linux-x86_64-gnu/bin/
two characters at a time. The trailing /python3.14 is unaffected because
of where the buggy stride leaves the cursor.
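
The chunk structure can be checked mechanically. Here is a minimal sketch, with hypothetical helper names, that unpacks each 5-byte chunk back into the two characters it carries:

```python
def decode_chunk(chunk: bytes) -> str:
    """Decode one 0xF8-prefixed 5-byte sequence into its two packed characters."""
    if len(chunk) != 5 or chunk[0] & 0xFC != 0xF8:
        raise ValueError("expected a 5-byte sequence starting with 0xF8..0xFB")
    value = chunk[0] & 0x03
    for byte in chunk[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    # Low 16 bits hold the first character, the next 16 bits the second.
    return chr(value & 0xFFFF) + chr(value >> 16)


def decode_prefix(data: bytes) -> str:
    """Decode a run of back-to-back 5-byte chunks."""
    return "".join(decode_chunk(data[i:i + 5]) for i in range(0, len(data), 5))


sample = bytes.fromhex("f89a8080af f89b9081af f88bb081a5 f8998081ad")
print(decode_prefix(sample))  # /home/md
```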

Why the workaround needs to "re-prime"

nvmlSystemGetProcessName is cache-driven: the per-PID name is populated
the first time NVML enumerates compute processes that include the PID
(typically via nvmlDeviceGetComputeRunningProcesses_v3). Critically:

- The mojibake is produced during the prime call, not during the read.
- Once a buggy entry is in NVML's cache, switching to the "C" locale and
  re-reading does not unscramble it: the cache survives the locale
  flip.
- Re-running the prime call under the "C" locale overwrites the cached
  entry with the correct UTF-8 string. Subsequent reads (in any locale)
  then return correctly.

So the workaround must do prime + read together under "C".
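
The shape of that sequence can be sketched as follows. This is an illustration only: `prime` and `read` are hypothetical callables standing in for the NVML enumerate and name-lookup calls, and the sketch uses Python's `locale.setlocale`, which changes the locale for the whole process rather than for a single thread.

```python
import contextlib
import locale


@contextlib.contextmanager
def c_locale():
    """Temporarily switch the process to the "C" locale, then restore it."""
    saved = locale.setlocale(locale.LC_ALL)  # query only, changes nothing
    locale.setlocale(locale.LC_ALL, "C")
    try:
        yield
    finally:
        locale.setlocale(locale.LC_ALL, saved)


def get_name_primed(pid, prime, read):
    """Run the prime and the read back to back under the "C" locale."""
    with c_locale():
        prime()           # hypothetical: repopulates the per-PID name cache
        return read(pid)  # hypothetical: returns the cached name for pid
```

Because `locale.setlocale` is process-wide, this form is not safe if other threads depend on the locale while the block runs; a per-thread mechanism avoids that.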

Behaviour after this PR

- Native Linux / Windows: behaviour is identical to before, except that
  get_process_name now primes the NVML cache automatically. This makes
  it usable from a fresh process (previously a caller had to have
  manually queried device.compute_running_processes first or accept
  NotFoundError).
- WSL: the locale flip is applied around the prime + read sequence,
  so the returned name is the correct UTF-8 string regardless of the
  caller's locale.

Discussion

Should we instead fix this at the cuda_bindings layer, so that it is fixed
for all cuda_bindings users? In that case I guess cuda_core should raise
an exception from get_process_name if running on WSL and the installed
cuda_bindings is too old?
I do plan to file the underlying bug with NVML. Locale-sensitive APIs should
generally be avoided.

github-actions · 2026-05-20T19:43:21Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-2118/
https://nvidia.github.io/cuda-python/pr-preview/pr-2118/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-2118/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-2118/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

rwgk

I used Cursor GPT-5.5 1M Extra High

GPT Findings

- Medium: cuda_bindings/docs/source/release/13.2.0-notes.rst and
  cuda_bindings/docs/source/release/13.3.0-notes.rst give an incomplete
  raw-NVML workaround. The PR and nvbug context indicate the corruption happens
  when nvmlDeviceGetComputeRunningProcesses_v3 first primes the PID cache, so
  setting the locale to "C" only before nvml.system_get_process_name can be
  too late if the cache was already populated. The notes should say to prime and
  read under "C", matching the cuda.core workaround.
- Medium: cuda_core/cuda/core/system/_system.pyx now makes
  get_process_name() depend on successfully enumerating compute processes on
  every device. Any unrelated device-level NVML failure now breaks a per-PID
  lookup, including on non-WSL where this is a behavior change. Consider trying
  the direct read first on non-WSL, or making priming failures narrower.
- Low: cuda_core/docs/source/release/1.1.0-notes.rst says the WSL workaround
  may hold a global lock, but the implementation uses POSIX per-thread locale
  APIs and no global lock is present. That note looks stale or misleading.
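
The "direct read first" suggestion in the second finding could look roughly like the sketch below. All names are hypothetical: `read` and `prime` stand in for the NVML lookup and enumerate calls, and `NotFoundError` is a local stand-in for the error raised when no cached name exists. The locale handling needed on WSL is left out to keep the control flow visible.

```python
class NotFoundError(Exception):
    """Stand-in for the error raised when no name is cached for a PID."""


def get_process_name_direct_first(pid, read, prime, on_wsl):
    """Try the cheap read first off WSL; prime only when that read fails."""
    if not on_wsl:
        try:
            return read(pid)
        except NotFoundError:
            pass  # cache not populated yet, fall through to priming
    # On WSL always re-prime, because an existing cache entry may be corrupted.
    prime()
    return read(pid)
```

With this ordering, a failure while enumerating devices only affects callers whose PID is not already cached.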

Lightweight Thread-Safety Recommendation

The new locale switching implementation is reasonably thread-safe because it
uses POSIX newlocale/uselocale/freelocale, which scope the "C" locale to
the calling OS thread rather than mutating process-global locale state.

The remaining race is around NVML's process-name cache, which appears to be
process/global driver state. On WSL, another thread could call a cache-priming
path such as nvmlDeviceGetComputeRunningProcesses_v3 under a non-"C" locale
between get_process_name()'s prime and read steps, reintroducing corrupted
cached data.

A lightweight improvement would be to add a module-private Python
threading.RLock shared by the cuda.core paths most likely to touch this
cache. Hold it around:

- system.get_process_name()'s WSL c_locale_guard() + prime + read sequence.
- Device.compute_running_processes on WSL, since that is the main in-package
  path that can prime the process-name cache.

This would not protect raw cuda.bindings.nvml.* calls or external users, but
it would cover the most likely cuda.core race without globally serializing all
NVML access. A Python RLock is sufficient here: even if Cython releases the GIL
inside NVML wrappers, the lock remains held until the surrounding Python with
block exits.
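
A minimal sketch of that locking pattern, assuming hypothetical wrapper names and with the NVML calls passed in as plain callables:

```python
import threading

# Module-private lock shared by every path that can touch the name cache.
_process_name_cache_lock = threading.RLock()


def guarded_get_process_name(pid, prime, read):
    """Hold the lock across prime + read so no other prime can interleave."""
    with _process_name_cache_lock:
        prime()
        return read(pid)


def guarded_compute_running_processes(enumerate_processes):
    """Hold the same lock around the enumerate call that primes the cache."""
    with _process_name_cache_lock:
        return enumerate_processes()
```

An RLock rather than a plain Lock lets the prime step go through `guarded_compute_running_processes` on the same thread without deadlocking.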

kkraus14 · 2026-05-21T02:42:07Z

 ------------

 * Updating from older versions (v12.6.2.post1 and below) via ``pip install -U cuda-python`` might not work. Please do a clean re-installation by uninstalling ``pip uninstall -y cuda-python`` followed by installing ``pip install cuda-python``.
+* ``nvml.system_get_process_name`` on WSL can return incorrect values.  To work around this, set the locale to "C" before calling ``nvml.device_get_compute_running_processes_v3`` (which sets the process names) and before calling ``nvml.system_get_process_name``. ``cuda_core`` does this automatically, but users of the raw NVML API will need to do this manually.


Why is this in both 13.2 and 13.3 release notes?

It's a compatibility note that's in line with the existing v12.6.2.post1 and below note, which we've been carrying forward since 12.8.0, I believe in every single release since then:

$ git grep 'v12.6.2.post1 and below' | cut -d/ -f-4 | uniq -c
     16 cuda_bindings/docs/source/release
     15 cuda_python/docs/source/release

kkraus14 · 2026-05-21T02:48:07Z

 CUDA_BINDINGS_NVML_IS_COMPATIBLE: bool

+
+cdef bint _detect_wsl():


There is already a WSL test helper with broader detection logic: cuda_python_test_helpers._detect_wsl checks /proc/sys/kernel/osrelease and also falls back to WSL_DISTRO_NAME / WSL_INTEROP. Because the test helpers are used by cuda.bindings tests and should not depend on cuda.core, we unfortunately probably need to keep this small functionality duplicated. Could we pull the env-var fallback into this runtime helper too, so the two helpers have the same detection coverage?

I looked at the git history: cuda_python_test_helpers._detect_wsl was added in #1045, and that PR introduced the WSL_DISTRO_NAME / WSL_INTEROP fallback. To me it looks like the purely defensive AI-generated code that I often saw from older agents; even with newer ones I still sometimes need to clean out defensive code. I quizzed GPT-5.5 about it; it told me:

So yes: those env vars look mostly defensive. ... env vars can be inherited into containers/subprocesses, so the fallback is slightly more false-positive-prone than the /proc check.

I'd argue for removing the fallback, although definitely not in this PR.

Fair enough. We should just keep the logic consistent if possible. Ideally we'd only have one implementation, but because we want the test helpers to not depend on cuda-core and we don't want to ship this as part of cuda-bindings / cuda-pathfinder I don't think it's worth a new utilities package or anything of the like.
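
For reference, the two detection strategies discussed in this thread can be sketched as below. This is not the code from either helper: the function names and the `use_env_fallback` switch are hypothetical, and matching on the substring "microsoft" in the kernel release string is an assumption about what the /proc check looks for.

```python
import os
from pathlib import Path


def release_indicates_wsl(release: str) -> bool:
    """Assumed check: WSL kernels report 'microsoft' in their release string."""
    return "microsoft" in release.lower()


def detect_wsl(osrelease_path="/proc/sys/kernel/osrelease", environ=None,
               use_env_fallback=True) -> bool:
    env = os.environ if environ is None else environ
    try:
        release = Path(osrelease_path).read_text()
    except OSError:
        release = ""  # file missing, e.g. on non-Linux platforms
    if release_indicates_wsl(release):
        return True
    # Env vars can be inherited by containers and subprocesses, so this
    # fallback is more prone to false positives than the /proc check.
    if use_env_fallback:
        return bool(env.get("WSL_DISTRO_NAME") or env.get("WSL_INTEROP"))
    return False
```

Keeping the fallback behind a single switch would make it easy to keep both helpers consistent, whichever way the fallback question is settled.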

nvbug-6193808: Work around mojibake in nvml.system_get_process_name on WSL

06a2bc8

mdboom added this to the cuda.core next milestone May 20, 2026

mdboom self-assigned this May 20, 2026

mdboom added the bug (Something isn't working), P1 (Medium priority - Should do), and cuda.core (Everything related to the cuda.core module) labels May 20, 2026

github-actions Bot added the cuda.bindings (Everything related to the cuda.bindings module) label May 20, 2026

mdboom added 2 commits May 20, 2026 15:23

Merge remote-tracking branch 'upstream/main' into fix-get-process-name-on-wsl

aa31400

Re-enable test

bfb518e

Move POSIX-only functionality to a separate module

22fd00c

rwgk reviewed May 20, 2026

Address comments in the PR

960659c

mdboom requested a review from rwgk May 20, 2026 23:47

rwgk approved these changes May 21, 2026

kkraus14 reviewed May 21, 2026

mdboom wants to merge 5 commits into NVIDIA:main from mdboom:fix-get-process-name-on-wsl
