WIP: windows-already-loaded-dll-detection #1904

Closed
rwgk wants to merge 2 commits into NVIDIA:main from rwgk:windows-already-loaded-dll-detection

@rwgk rwgk commented Apr 14, 2026

WIP-WIP-WIP

This is to address CI failures observed under (e.g.) PRs #1856 and #1902.

They seem to be triggered by the CTK 13.2.1 release.

Fall back to enumerating loaded modules when basename-based GetModuleHandleW lookups miss an already loaded DLL so pathfinder can recognize newer Windows CUPTI loads consistently and keep the regression covered.
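The fallback described above could look roughly like the following ctypes sketch. It is not the code from this PR: the function names `find_loaded_by_basename` and `enumerate_loaded_module_paths` are hypothetical, and only the Win32 calls (`GetCurrentProcess`, `EnumProcessModules`, `GetModuleFileNameW`) are real APIs. The matching logic is kept separate from the enumeration so it can be exercised on any platform.

```python
import ctypes
import ntpath
import sys


def find_loaded_by_basename(module_paths, wanted_basenames):
    """Return the first module path whose basename matches (case-insensitive)."""
    wanted = {name.lower() for name in wanted_basenames}
    for path in module_paths:
        # ntpath splits on backslashes regardless of the host OS.
        if ntpath.basename(path).lower() in wanted:
            return path
    return None


def enumerate_loaded_module_paths():
    """List absolute paths of modules loaded in this process (empty off Windows)."""
    if sys.platform != "win32":
        return []
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    psapi = ctypes.WinDLL("psapi", use_last_error=True)

    kernel32.GetCurrentProcess.restype = wintypes.HANDLE
    psapi.EnumProcessModules.argtypes = [
        wintypes.HANDLE,
        ctypes.POINTER(wintypes.HMODULE),
        wintypes.DWORD,
        ctypes.POINTER(wintypes.DWORD),
    ]
    psapi.EnumProcessModules.restype = wintypes.BOOL
    kernel32.GetModuleFileNameW.argtypes = [
        wintypes.HMODULE,
        wintypes.LPWSTR,
        wintypes.DWORD,
    ]
    kernel32.GetModuleFileNameW.restype = wintypes.DWORD

    process = kernel32.GetCurrentProcess()
    capacity = 256
    while True:
        handles = (wintypes.HMODULE * capacity)()
        needed = wintypes.DWORD()
        if not psapi.EnumProcessModules(
            process, handles, ctypes.sizeof(handles), ctypes.byref(needed)
        ):
            raise ctypes.WinError(ctypes.get_last_error())
        count = needed.value // ctypes.sizeof(wintypes.HMODULE)
        if count <= capacity:
            break
        capacity = count  # buffer was too small; retry with the reported size

    paths = []
    buf = ctypes.create_unicode_buffer(32768)
    for handle in handles[:count]:
        if kernel32.GetModuleFileNameW(handle, buf, len(buf)):
            paths.append(buf.value)
    return paths
```

A caller would try the basename lookup first and only fall back to `find_loaded_by_basename(enumerate_loaded_module_paths(), known_names)` on a miss. Note that this still depends on `known_names` containing the DLL that was actually loaded.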

Made-with: Cursor
@rwgk rwgk added this to the cuda.pathfinder next milestone Apr 14, 2026
@rwgk rwgk self-assigned this Apr 14, 2026
@rwgk rwgk added the bug, P0, and cuda.pathfinder labels Apr 14, 2026

copy-pr-bot bot commented Apr 14, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


rwgk commented Apr 14, 2026

Cursor GPT-5.4 Extra High Fast


What I know

  • In cuda_pathfinder/cuda/pathfinder/_dynamic_libs/dynamic_lib_subprocess.py, the first load_nvidia_dynamic_lib("cupti") succeeds as a fresh load, so the child process is loading CUPTI itself rather than finding it preloaded.
  • For that first load, cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_nvidia_dynamic_lib.py follows the normal found-on-disk path and ends up in cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_dl_windows.py via load_with_abs_path(...).
  • The failure happens on the second uncached call, where check_if_already_loaded_from_elsewhere(...) is supposed to rediscover the same loaded DLL before trying to load it again.
  • That rediscovery used only GetModuleHandleW(dll_name). In the failing jobs, it returns no match, so loaded_dl_no_cache.was_already_loaded_from_elsewhere is false and the subprocess raises.
  • The correlation is strong with the new Windows CUPTI build: the failing jobs show cupti64_2026.1.0.dll, while older Windows jobs using cupti64_2025.2.1.dll and cupti64_2025.3.1.dll pass.

What I do not know

  • I do not have proof of why GetModuleHandleW("cupti64_2026.1.0.dll") misses after that DLL has been loaded.
  • I cannot yet say whether that is a change in CUPTI itself, a Windows loader quirk exposed by this CUPTI build, or some interaction between the two.
  • I have not found a public NVIDIA note or Windows API note that specifically explains this exact cupti64_2026.1.0.dll behavior.

Capture the first load result, basename probe results, and relevant enumerated modules so we can determine why cupti reload detection still fails on real Windows systems with CTK 13.2.1.
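As a rough illustration of what such a capture might record, here is a minimal sketch. Everything in it is hypothetical (the `ReloadDiagnostics` record, `collect_diagnostics`, and the `probe` callable standing in for a `GetModuleHandleW` wrapper); it is not the diagnostics code from this PR.

```python
import json
import ntpath
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ReloadDiagnostics:
    """Hypothetical record of one load-then-rediscover attempt."""

    libname: str
    first_load_abs_path: Optional[str]
    basename_probes: dict = field(default_factory=dict)
    relevant_modules: list = field(default_factory=list)


def collect_diagnostics(libname, first_load_abs_path, known_basenames, probe, loaded_module_paths):
    """Gather probe results and matching modules into one record.

    `probe` is any callable taking a DLL basename and returning a truthy
    value on a hit (for example a thin wrapper around GetModuleHandleW).
    """
    diag = ReloadDiagnostics(libname, first_load_abs_path)
    for name in known_basenames:
        diag.basename_probes[name] = bool(probe(name))
    needle = libname.lower()
    # Keep only enumerated modules whose basename mentions the library name.
    diag.relevant_modules = [
        path for path in loaded_module_paths if needle in ntpath.basename(path).lower()
    ]
    return diag


def format_diagnostics(diag):
    """Render the record as JSON so it can be read straight from a CI log."""
    return json.dumps(asdict(diag), indent=2, sort_keys=True)
```

Printing `format_diagnostics(...)` from the subprocess would put the loaded path and the probed names side by side, which makes a basename mismatch visible at a glance.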

Made-with: Cursor

rwgk commented Apr 14, 2026

The fix is simply #1906

Closing this PR. I'll work on more robust (to new DLL names) support under a new PR.

@rwgk rwgk closed this Apr 14, 2026
@rwgk rwgk deleted the windows-already-loaded-dll-detection branch April 14, 2026 18:39

cpcloud commented Apr 14, 2026

Root cause analysis

The win-64 test_load_nvidia_dynamic_lib[cupti] failures on #1856 / #1902 (and main's scheduled CI) are caused by a new cupti DLL version not being in the descriptor catalog.

Timeline:

  • nvidia-cuda-cupti 13.2.23 (Mar 9) shipped cupti64_2026.1.0.dll: in the catalog, tests pass
  • nvidia-cuda-cupti 13.2.75 (Apr 13) shipped cupti64_2026.1.1.dll: not in the catalog, tests fail

Mechanism:

  1. see_what_works step runs before extra wheels are installed → cupti not available → "not found" → passes
  2. Install cuda.pathfinder extra wheels for testing installs cuda-toolkit[cupti]==13.* → pulls nvidia-cuda-cupti 13.2.75 with cupti64_2026.1.1.dll
  3. all_must_work step runs → cupti loads fine via abs path → but GetModuleHandleW iterates known DLL names (newest is cupti64_2026.1.0.dll) → no match → was_already_loaded_from_elsewhere returns False → RuntimeError

On this PR's approach: The EnumProcessModules fallback still matches against desc.windows_dlls (the known basename set). If cupti64_2026.1.1.dll isn't in that tuple, the enumeration won't find it either — same name set, different lookup method. The catalog needs "cupti64_2026.1.1.dll" added regardless.
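The "same name set, different lookup method" point can be shown with a small pure-Python simulation. The tuple below is an illustrative subset built from the DLL names mentioned in this thread, not the real contents of desc.windows_dlls, and the probe functions are stand-ins rather than pathfinder code. The pattern-based variant at the end is only one conceivable way to tolerate new version suffixes; it is an assumption, not what the follow-up PR does.

```python
import ntpath
import re

# Illustrative subset of known basenames (assumption, not the real catalog).
KNOWN_CUPTI_DLLS = (
    "cupti64_2025.2.1.dll",
    "cupti64_2025.3.1.dll",
    "cupti64_2026.1.0.dll",
)


def probe_by_basename(loaded_paths, known_names):
    """Stand-in for calling GetModuleHandleW once per known name."""
    loaded = {ntpath.basename(path).lower() for path in loaded_paths}
    for name in known_names:
        if name.lower() in loaded:
            return name
    return None


def probe_by_enumeration(loaded_paths, known_names):
    """Stand-in for an EnumProcessModules scan filtered by the same names."""
    known = {name.lower() for name in known_names}
    for path in loaded_paths:
        if ntpath.basename(path).lower() in known:
            return path
    return None


# Hypothetical alternative: match the versioned naming scheme, not a fixed list.
CUPTI_PATTERN = re.compile(r"^cupti64_\d+\.\d+\.\d+\.dll$", re.IGNORECASE)


def probe_by_pattern(loaded_paths, pattern):
    """Return the first loaded path whose basename matches the pattern."""
    for path in loaded_paths:
        if pattern.match(ntpath.basename(path)):
            return path
    return None
```

With only cupti64_2026.1.1.dll loaded, both name-set probes return None, while the pattern probe finds it. That is why adding the new basename to the catalog fixes the failure regardless of which lookup method is used.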

Verified by inspecting the actual wheel contents:

# 13.2.23 (old, working)
nvidia/cu13/bin/x86_64/cupti64_2026.1.0.dll

# 13.2.75 (new, breaking)
nvidia/cu13/bin/x86_64/cupti64_2026.1.1.dll
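A check like the one above can be reproduced with the standard library, since a wheel is a zip archive. This is a minimal sketch: the helper name `list_dlls_in_wheel` is made up for illustration, and the wheel file name to pass in depends on what was downloaded.

```python
import zipfile


def list_dlls_in_wheel(wheel_file):
    """Return the sorted archive paths of all .dll members in a wheel.

    `wheel_file` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(wheel_file) as wheel:
        return sorted(
            name for name in wheel.namelist() if name.lower().endswith(".dll")
        )
```

Running it against the two nvidia-cuda-cupti wheels and comparing the results would show the basename change directly.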
