Some possible bug in multi-threaded update of pools #406
Comments
Reduced reproducer:

using DataFrames, CategoricalArrays, Test, Random
Random.seed!(35)
df = DataFrame(x=rand(1:10, 100),
               y=categorical(rand(10:15, 100)),
               z=categorical(rand(0:20, 100)))
df.y2 = reverse(df.y) # Same levels
gd = groupby(df, :x)
combine(gd, [:x, :y, :z] => ((x, y, z) -> x[1] <= 5 ? y[1] : z[1]) => :res)

The problem is that I can easily fix this using a lock (tested locally). But it's harder to define a more general policy regarding thread safety. Taking a lock for expensive operations is fine, but it's unacceptable for

EDIT: Adding new levels while doing write operations from another thread is also unsafe as the
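To make the lock-based idea concrete, here is a minimal sketch of what guarding level insertion with a lock could look like. It is not the CategoricalArrays.jl implementation: `LockedPool` and `getref!` are hypothetical names invented for illustration, and the real pool type has more fields and invariants. The sketch also takes the lock on every call, which is exactly the cost that is hard to accept for cheap operations.

```julia
# Hypothetical stand-in for a categorical pool; NOT the CategoricalArrays.jl type.
struct LockedPool{T}
    levels::Vector{T}        # level values, indexed by reference code
    invindex::Dict{T,Int}    # level value => reference code
    lock::ReentrantLock      # guards both fields above
end

LockedPool{T}() where {T} = LockedPool{T}(T[], Dict{T,Int}(), ReentrantLock())

# Return the reference code for `x`, adding it as a new level if needed.
# The whole lookup-or-insert runs under the lock, so no other caller of
# `getref!` can observe the Dict while it is being resized.
function getref!(pool::LockedPool{T}, x::T) where {T}
    lock(pool.lock) do
        get!(pool.invindex, x) do
            push!(pool.levels, x)
            length(pool.levels)
        end
    end
end

# Usage: many threads (start Julia with e.g. `-t 4`) add levels concurrently.
pool = LockedPool{Int}()
vals = rand(1:50, 10_000)
refs = Vector{Int}(undef, length(vals))
Threads.@threads for i in eachindex(vals)
    refs[i] = getref!(pool, vals[i])
end
```

In this sketch every access goes through `getref!`; code that reads `invindex` directly from another thread would still race with the insertion.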
Thank you for looking into this. I will also think about what could be done.
I thought a bit about it. The possible solution is as follows:
Yes, that would be OK if we can manage to make at least a reasonable set of operations thread-safe. That's already the case for read operations, but it's hard to achieve for writes that may add levels. This seems problematic in particular due to

It seems silly not to make CategoricalArrays mostly thread-safe, as in theory that should be possible without sacrificing performance in all cases except when reordering existing levels. Indeed, apart from this, the only problem is when adding levels (at the end) and the dict needs to be resized, which puts it temporarily in an inconsistent state. But this situation is relatively rare so it would even be acceptable performance-wise to create a new dict instead to ensure that another thread can access a valid dict (old or new) at any moment. Unfortunately, I don't see a clean way to check whether the dict is going to be resized before adding a new key. (A super-simple solution would be to call

We could use

That would still not fix
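The "create a new dict instead" idea can be sketched as a copy-on-write index. This is only an illustration under stated assumptions, not code from CategoricalArrays.jl: `COWIndex`, `lookup` and `add!` are hypothetical names, the sketch needs Julia 1.7 or later for `@atomic` fields, and it copies the dict on every new level rather than only when a resize would occur (since, as noted above, there is no clean way to detect an upcoming resize).

```julia
# Hypothetical copy-on-write index; requires Julia 1.7+ for @atomic fields.
mutable struct COWIndex{T}
    @atomic dict::Dict{T,Int}   # published Dict, never mutated after publication
    writelock::ReentrantLock    # serializes writers only
end

COWIndex{T}() where {T} = COWIndex{T}(Dict{T,Int}(), ReentrantLock())

# Lock-free read: returns the code for `x`, or 0 if `x` is not a level yet.
function lookup(idx::COWIndex, x)
    d = @atomic idx.dict
    return get(d, x, 0)
end

# Add `x` as a new level at the end (if absent) and return its code.
# Writers build a fresh Dict and swap it in atomically, so a reader
# always sees either the complete old Dict or the complete new one.
function add!(idx::COWIndex{T}, x::T) where {T}
    lock(idx.writelock) do
        old = @atomic idx.dict
        haskey(old, x) && return old[x]
        new = copy(old)
        new[x] = length(new) + 1
        @atomic idx.dict = new
        return new[x]
    end
end

# Usage: concurrent writers (start Julia with e.g. `-t 4`).
idx = COWIndex{Int}()
Threads.@threads for i in 1:1000
    add!(idx, mod(i, 20))
end
```

The trade-off is that each new level costs a full copy of the dict, which is only reasonable because adding levels is rare compared to reads. Reordering existing levels is not covered by this sketch.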
That's interesting. Indeed if
Yeah, it would make sense to add some traits to ArrayInterface.jl to detect whether reads and/or writes are thread-safe. Then
Found in: https://github.com/JuliaData/DataFrames.jl/actions/runs/8635821381/job/23674443172
I can reproduce it by starting Julia with
julia --check-bounds=yes -t 4
and running the following code:
Note that, unfortunately, the error does not always happen.
The relevant part of the error is: