Modifying struct containing CuArray fails in threads in 5.0.0 and 5.1.0 #2171
Comments
I can confirm the behaviour on an A100 and on my NVIDIA GeForce RTX 3060 Laptop GPU: the code runs with CUDA.jl 4.3.2 and fails with 5.1.0. |
I also reproduced this on a different system with newer Julia
|
The first type of error comes from ordinary code, which crashes while waiting for a lock:
It is surprising to me that ReentrantLock wouldn't be thread-safe... cc @vtjnash and @vchuravy for some ideas. The other issues are due to taking a lock from a finalizer. I could try working around this, but it fundamentally requires JuliaLang/julia#39529. |
All of those are caused by the attempt to use non-finalizer-safe locks from inside a finalizer (cf. JuliaLang/julia#39529) |
In the first example, isn't this being called from non-finalizer context, or am I misinterpreting the trace?
|
Is |
Yes it is, sorry, I misunderstood the Let me try and come up with a lockless alternative... |
You can use Base.ThreadSynchronizer locks instead, just make sure to keep your critical region tiny and fast |
Ah, TIL, thanks for the suggestion! I switched from a Dict + ReentrantLock to a simple linear scan over a vector, and would be putting that in the critical section (https://github.com/JuliaGPU/CUDA.jl/pull/2202/files#diff-542b246c46716c47119426a68a1db0365f1fae91d27740b6e06a1f19937420deR88-R110). That should be fine, I guess? @lpawela @ArturPrzybysz Can you try out #2202? |
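The approach described in the comment above (a small vector scanned linearly inside a short critical section guarded by a Base.ThreadSynchronizer) could be sketched roughly as follows. This is only an illustration, not the code from #2202: TinyRegistry, register! and lookup are hypothetical names, and Base.ThreadSynchronizer is an unexported Base type whose behaviour may differ between Julia versions.

```julia
# Hypothetical sketch; these names are illustrative and are not CUDA.jl's API.
# A tiny key => value registry protected by a spin-lock-backed synchronizer.
struct TinyRegistry{K,V}
    lock::Base.ThreadSynchronizer      # unexported Base type wrapping a spin lock
    entries::Vector{Pair{K,V}}         # small vector, scanned linearly
end

TinyRegistry{K,V}() where {K,V} =
    TinyRegistry{K,V}(Base.ThreadSynchronizer(), Pair{K,V}[])

# Insert or overwrite an entry; the critical section is kept short.
function register!(r::TinyRegistry{K,V}, key::K, value::V) where {K,V}
    Base.@lock r.lock begin
        for i in eachindex(r.entries)
            if r.entries[i].first == key
                r.entries[i] = key => value
                return r
            end
        end
        push!(r.entries, key => value)
    end
    return r
end

# Linear scan under the lock; returns `nothing` when the key is absent.
function lookup(r::TinyRegistry{K,V}, key::K) where {K,V}
    Base.@lock r.lock begin
        for (k, v) in r.entries
            k == key && return v
        end
        return nothing
    end
end

reg = TinyRegistry{Int,String}()
register!(reg, 1, "one")
register!(reg, 2, "two")
register!(reg, 1, "uno")   # overwrites the entry for key 1
```

A linear scan only makes sense here on the assumption that the registry stays small; the work done while holding the lock should remain tiny and fast, as suggested above.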
Seems to be solved with #2202. After many iterations and relaunches I cannot replicate the errors from my MWE. |
Great, thanks for confirming. |
Describe the bug
When running the MWE code I get the following error messages in v5.0.0 and v5.1.0. This does not happen in v4.3.2 and v4.4.1. What is really strange is that if I decrease the number of iterations in do_stuff and replace_stuff, the code sometimes runs without issues and sometimes fails.
To reproduce
The Minimal Working Example (MWE) for this bug:
Manifest.toml
Expected behavior
The examples finish without issue, as they do in CUDA.jl 4.
Version info
Details on Julia:
Details on CUDA: