Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug build DLL hangs on LoadLibrary (Windows) #1298

Open
blinkfrog opened this issue Dec 23, 2023 · 17 comments
Open

Debug build DLL hangs on LoadLibrary (Windows) #1298

blinkfrog opened this issue Dec 23, 2023 · 17 comments
Labels
discussion General discussion about something

Comments

@blinkfrog
Copy link
Contributor

I'm encountering an issue where a DLL built in Debug configuration hangs when loaded using LoadLibrary. Dependency Walker also freezes when attempting to load the DLL. This issue does not occur in Release mode.

The hang occurs during the DLL loading process, specifically when using the SYCL parallel_for construct. Removing this construct allows the DLL to load correctly.

Minimal Reproducible Example:

#include <sycl/sycl.hpp>

extern "C" __declspec(dllexport) void acpp_dll_function()
{
    auto d_selector{ sycl::cpu_selector_v };
    sycl::queue q(d_selector);
    q.submit([&](sycl::handler& h)
    {
        h.parallel_for(sycl::range<1>(1), [=](auto i) {});
    });
}

This simple DLL hangs on load in Debug mode.

Build:

Using command line:
python "c:/adaptivecpp/bin/acpp" -g -D_DEBUG -shared -o dynamiclib.dll dynamiclib.cpp --acpp-targets="omp" -D_MT -lmsvcrtd -Xlinker /NODEFAULTLIB:libcmt

Using cmake:
CMakeLists.txt:

cmake_minimum_required(VERSION 3.5)
project(dynamiclib)

set(AdaptiveCpp_DIR "C:/AdaptiveCpp/lib/cmake/AdaptiveCpp")
set(ACPP_TARGETS "omp")

set(CMAKE_C_COMPILER "C:/llvm/bin/clang.exe")
set(CMAKE_CXX_COMPILER "C:/llvm/bin/clang++.exe")

find_package(AdaptiveCpp CONFIG REQUIRED)

include_directories(${PROJECT_BINARY_DIR} ${PROJECT_SOURCE_DIR})

add_library(dynamiclib SHARED "dynamiclib.cpp")
add_sycl_to_target(TARGET dynamiclib SOURCES "dynamiclib.cpp")

cmake .. -G Ninja -DCMAKE_C_COMPILER=clang.exe -DCMAKE_CXX_COMPILER=clang++.exe -DCMAKE_BUILD_TYPE=Debug
ninja

Trying to load DLL:

HMODULE acpp_dll = LoadLibrary(L"dynamiclib.dll"); // hangs here

Dependency Walker also freezes while trying to load this DLL.

Any insights or suggestions would be greatly appreciated.

@blinkfrog blinkfrog added the discussion General discussion about something label Dec 23, 2023
@blinkfrog
Copy link
Contributor Author

blinkfrog commented Dec 25, 2023

Some update.

I've made some progress in investigating the DLL loading issue and wanted to share the latest findings.

Using Process Monitor alongside Dependency Walker, I've confirmed that there are no issues with dependency resolution. All necessary DLLs are found, opened, and then closed without any errors, just before the hang occurs.

I used gflags and WinDbg to analyze the call stack at the point of the hang. The last output line before the hang is:

LdrpInitializeNode - INFO: Calling init routine 00007FFB70A50E90 for DLL "C:\Users\Daniil\source\repos\Test\LoadLibraryTest\x64\Debug\dynamiclib.dll"

The call stack suggests a hang related to a lock, specifically involving C++ standard library mutex operations. This indicates that the issue is happening during the dynamic initialization of the DLL:

00 000000a5`1737f248 00007ffb`c0627975     ntdll!NtWaitForAlertByThreadId+0x14
01 000000a5`1737f250 00007ffa`e8072793     ntdll!RtlAcquireSRWLockExclusive+0x165
02 000000a5`1737f2c0 00007ffa`e80724e5     MSVCP140D!mtx_do_lock+0xb3 [D:\a\_work\1\s\src\vctools\crt\github\stl\src\mutex.cpp @ 95] 
03 000000a5`1737f320 00007ffb`707c86d0     MSVCP140D!_Mtx_lock+0x15 [D:\a\_work\1\s\src\vctools\crt\github\stl\src\mutex.cpp @ 164] 

As my code is extremely simple and doesn't contain any global objects, this indicates that the hang is likely occurring within the AdaptiveCpp runtime initialization.

I am using Release build of AdaptiveCpp. I plan to build AdaptiveCpp with Debug configuration to include debug symbols. This will probably provide greater visibility into the library's initialization code during the DLL loading process.

@tfiner
Copy link

tfiner commented Dec 30, 2023

This page has a lot of good information on what you are allowed to do in DllMain: https://learn.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-best-practices

Try Application Verifier, it might add more information, but it sounds like a loader lock problem.

@blinkfrog
Copy link
Contributor Author

Thank you for the information. However, I have no DllMain. The code I provided is a full reproducer. Providing DllMain in a Windows DLL is not strictly necessary. The compiler and linker automatically generate an entry point for the DLL. However, I tried to add DllMain for debug purposes, and this didn't help, breakpoint in this function doesn't hit, which indicates that the problem may occur in other dependency libraries.

@illuhad
Copy link
Collaborator

illuhad commented Dec 30, 2023

Hi @blinkfrog. I'm not really a Windows person so I'm afraid I cannot help much. I am just curious, you said the hang occurs if you remove the parallel_for(). Does that mean that it works if you leave the construction of the queue in the code? That would be surprising, because the construction of a queue should already trigger the initialization of the AdaptiveCpp runtime and its backends.

Your reproducer is missing a q.wait() (or some other form of synchronization) at the end to avoid UB. But if you say that the issue is triggered at startup, then this is probably not related to your problem :(

@blinkfrog
Copy link
Contributor Author

Hi @illuhad. No, it's the opposite: hang occurs only when parallel_for() is present.

As for missing q.wait(). Please correct me if I am wrong, but In SYCL, using q.wait() isn't strictly necessary. It can be needed in some cases though. In this particular reproducer, SYCL runtime anyway should wait for all tasks to be finished when deconstructing queue (and buffers, if they would be used here).

@tfiner
Copy link

tfiner commented Dec 30, 2023

Are there any globals (or static data structures) that are calling OS code in the library? Those get initialized and code run in them at load. I can't tell you how many times I have been burnt by this particular issue.

@illuhad
Copy link
Collaborator

illuhad commented Dec 30, 2023

No, it's the opposite: hang occurs only when parallel_for() is present.

That's interesting. I would not expect any additional libraries being loaded just for a parallel_for. Except perhaps some internal libraries of backends that we do not control. Does it happen for both CUDA and OpenMP backends? Does it also happen if you just compile for CUDA, and then force it to run on OpenMP (e.g. using ACPP_VISIBILITY_MASK="omp")?

As for missing q.wait(). Please correct me if I am wrong, but In SYCL, using q.wait() isn't strictly necessary. It can be needed in some cases though. In this particular reproducer, SYCL runtime anyway should wait for all tasks to be finished when deconstructing queue (and buffers, if they would be used here).

No, this is incorrect. Buffers have their own synchronization rules, so if you were using buffers they might block in their destructor. But for the queue itself, this is not the case. From https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:interface.queue.class

A SYCL queue may be destroyed even when there are uncompleted commands that have been submitted to the queue. Doing so does not block. Instead, any commands that have been submitted to the queue begin execution when their requisites are satisfied, just as they would had the queue not been destroyed. Any event objects for those commands are signaled in the normal manner when the command completes. Resources associated with the queue will be freed by the time the last command completes.

Queue destructor does not block. Normally this is not an issue, if the task does not have any observable side effect, as in your case. However, problems can arise if there is also no synchronization before exiting main(). In this case, the AdaptiveCpp runtime might try to finish the enqueued tasks while the program is already starting to shut down. This can cause all kinds of backend-specific issues. E.g. compute drivers might already have unloaded at this point.
I could imagine that similar issues occur if you start unloading your library withour prior synchronization.

@illuhad
Copy link
Collaborator

illuhad commented Dec 30, 2023

Are there any globals (or static data structures) that are calling OS code in the library? Those get initialized and code run in them at load. I can't tell you how many times I have been burnt by this particular issue.

Almost all of the runtime is initialized "at first use", so loading the runtime itself should not be an issue. What it does do however which departs from that pattern is that it registers the kernels that are contained in the binary with the kernel cache using constructors of global objects. I'm not sure to what extent this process includes OS functions in the call graph. In principle, it is not complex logic. It might check some environment variables for user settings that it is affected by.
But if this is connected to the problem, it might explain why there is only a problem if the code actually includes a kernel.

@blinkfrog
Copy link
Contributor Author

That's interesting. I would not expect any additional libraries being loaded just for a parallel_for. Except perhaps some internal libraries of backends that we do not control. Does it happen for both CUDA and OpenMP backends? Does it also happen if you just compile for CUDA, and then force it to run on OpenMP (e.g. using ACPP_VISIBILITY_MASK="omp")?

Yes, it happens for both CUDA and OpenMP backends. Just have tried also this mixed variant - compiled for CUDA and forced to run on OpenMP, but result is the same - it hangs.

From the last lines of the call stack I only can conclude that the hang occurs during the dynamic initialization of global objects. The lock might be part of the initialization of a global or static object, possibly a singleton.

00 000000a5`1737f248 00007ffb`c0627975     ntdll!NtWaitForAlertByThreadId+0x14
01 000000a5`1737f250 00007ffa`e8072793     ntdll!RtlAcquireSRWLockExclusive+0x165
02 000000a5`1737f2c0 00007ffa`e80724e5     MSVCP140D!mtx_do_lock+0xb3 [D:\a\_work\1\s\src\vctools\crt\github\stl\src\mutex.cpp @ 95] 
03 000000a5`1737f320 00007ffb`707c86d0     MSVCP140D!_Mtx_lock+0x15 [D:\a\_work\1\s\src\vctools\crt\github\stl\src\mutex.cpp @ 164] 
04 000000a5`1737f350 00007ffb`707c85a8     dynamiclib!acpp_dll_function+0x7680

No, this is incorrect. Buffers have their own synchronization rules, so if you were using buffers they might block in their destructor. But for the queue itself, this is not the case. From https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:interface.queue.class (...) Queue destructor does not block. (...)

Thanks, this is very useful to know (although I never use SYCL without buffers and I usually work with accessors that way so I don't need to call q.wait() at all, as all tasks are synchronized internally. I use it mainly when benchmarking performance).

@illuhad
Copy link
Collaborator

illuhad commented Jan 1, 2024

@blinkfrog
Copy link
Contributor Author

Can you try commenting out this line here?

Sure! After commenting this line out, my test app loads dll fine now without hanging on LoadLibrary.

@blinkfrog
Copy link
Contributor Author

blinkfrog commented Jan 9, 2024

Some update regarding this issue (using original unmodified AdaptiveCpp).

The hang occurs not only when loading the DLL but also when starting an executable built in Debug mode. The executable hangs at the same point – acquiring a mutex lock, which suggests a common underlying problem.
The hang is specifically occurring within the acpp-rt.dll

I have a questions.

In my current setup, the Debug build of my program uses the AdaptiveCpp runtime built in Release configuration. Could this be a potential cause of the hang issue? Is it generally advisable or acceptable to use a Release build of AdaptiveCpp with a Debug build of a program?

I also would like to know if such a configuration (Debug program with Release AdaptiveCpp runtime) is commonly used or tested on Linux.

Thank you for your time and assistance.

@illuhad
Copy link
Collaborator

illuhad commented Jan 9, 2024

Hi, this is not systematically tested, but I remember having occasionally used such configurations on Linux when changing build types. I am not aware of any issues nor why this could cause problems. But I don't know about Windows (are there ABI changes on Windows when switching between debug and release builds?).

@fodinabor
Copy link
Collaborator

Jup, don't mix Debug and Release on windows, they have different abi and when linking normally (I.e. not at runtime), you will be notified about this by erroring out with "iterator level x in file X doesn't match level y in file Y"

@blinkfrog
Copy link
Contributor Author

blinkfrog commented Jan 9, 2024

Thank you very much. I am new to building opensource programs from source, and using cmake in overall. Could you please suggest a proper way to build and install debug version of llvm and AdaptiveCpp in case when I need to have both Debug and Release versions installed? Should I install them into a different directories, or this isn't needed - may be binaries will have d suffix added?
Installing them to different directories but having the same binaries names can lead to a problem when required dll can't be properly selected when its path is resolved via Environment PATH variable - in this case the same dll will be selected in all configurations.

@blinkfrog
Copy link
Contributor Author

Is Debug build of LLVM required for debug build of AdaptiveCpp? I can't build Debug LLVM: I get this error: LINK : fatal error LNK1189: library limit of 65535 objects exceeded

@blinkfrog
Copy link
Contributor Author

Closing the issue as it is obviously because of mixing Release build of AdaptiveCpp/Clang and Debug build of binaries I build. I should use Debug build of Clang which I can't compile :( and AdaptiveCpp for debug purposes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion General discussion about something
Projects
None yet
Development

No branches or pull requests

4 participants