@MasonProtter
Contributor

@MasonProtter MasonProtter commented Jan 28, 2025

This is my proposal to close #14948

Any time Threads.@threads constructs a closure that contains a Core.Box, that should be a clear sign of a race condition, since the order in which that Box is read from or written to is undefined. This is a really easy, surprising, and hidden footgun for users to run into.

I suspect that a large chunk of existing cases out in the wild have not been caught because in many cases, the bugs caused by this have a low probability of being seen (though exist nonetheless).

Basically, all I'm doing here is walking through the fieldtypes of the closure produced by @threads and if that closure has a Core.Box in it, we throw an error that suggests what went wrong and how it can potentially be fixed. This is a pretty aggressive approach to the problem in #14948, but I think it's a serious enough problem that it's warranted.
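
A minimal sketch of the kind of check described above, assuming the closure is available as an ordinary value. The helper name `has_boxed_capture` and the demo function `make_closures` are hypothetical and are not the PR's actual implementation; only `fieldtypes` and `Core.Box` come from Julia itself, and `Core.Box` is an internal detail of lowering.

```julia
# Hypothetical helper: true if any captured variable of the closure `f`
# is stored in a Core.Box.
has_boxed_capture(f) = any(T -> T === Core.Box, fieldtypes(typeof(f)))

function make_closures()
    a = 1
    reassigns = () -> (a += 1)  # `a` is assigned in two places, so lowering boxes it
    b = 2
    only_reads = () -> b + 1    # `b` is assigned once, so it is captured by value
    return reassigns, only_reads
end

reassigns, only_reads = make_closures()
has_boxed_capture(reassigns)   # expected to be true
has_boxed_capture(only_reads)  # expected to be false
```

A check like this only looks at the closure's own fields, so it would not see a Box captured one layer deeper.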

@MasonProtter MasonProtter added multithreading Base.Threads and related functionality 😃🍕 and other emoji and removed 😃🍕 and other emoji labels Jan 28, 2025
@adienes
Member

adienes commented Jan 28, 2025

is it really true that this is a "clear sign of a race condition" ? wouldn't it be more accurate to say it's a "clear sign that the compiler was unable to prove the absence of a race condition"

@MasonProtter
Contributor Author

MasonProtter commented Jan 28, 2025

No. The mere presence of a Box turns access to that variable into a race condition. Here's an example you can run yourself to see how boxing causes a race condition (a version of this is in the tests I added):

julia> let A = 1 # Some outer-local-variable that gets boxed
           function wrong()
               out = zeros(Int, 10)
               Threads.@threads for i in 1:10
                   A = i # Redefine A, this causes A.contents to be mutated
                   sleep(rand()/10)
                   out[i] = A # Reference A, reading from the boxed A.contents that has since been mutated on another thread
               end
               out
           end
           function not_wrong()
               out = zeros(Int, 10)
               Threads.@threads for i in 1:10
                   local A = i
                   sleep(rand()/10)
                   out[i] = A
               end
               out
           end

           @info "Surprise!" wrong()' not_wrong()'
       end
┌ Info: Surprise!
│   (wrong())' =
│    1×10 adjoint(::Vector{Int64}) with eltype Int64:
│     4  8  9  6  8  6  2  6  9  8
│   (not_wrong())' =
│    1×10 adjoint(::Vector{Int64}) with eltype Int64:
└     1  2  3  4  5  6  7  8  9  10

@MasonProtter
Contributor Author

The idea of turning lexically mutated variables into a mutable Box is only correct for sequential programs. It's wildly incorrect / undefined if you have parallelism / concurrency.

@mbauman
Member

mbauman commented Jan 28, 2025

Isn't section 1.3 in https://discourse.julialang.org/t/unclear-allocation-behaviour-with-built-in-sum/125304 an example of a Box in a @threads that's not a race condition?

I mean, it's pretty terrible, but it's not incorrect (I don't think).

Edit: I suppose that example is one layer deeper than this introspects, but I think #15276 could rear its head similarly at the @threads closure.

@MasonProtter
Contributor Author

MasonProtter commented Jan 28, 2025

Ugh, you're right, that case probably isn't incorrect per se (so far as I can see). I think if we were designing @threads right now though, we'd probably not want to include that.

Could we at least stick a deprecation warning in here instead of an error maybe? (with the intention of making it an error in a hypothetical v2)

@bbrehm

bbrehm commented Jan 30, 2025

A canonical example of non-racy boxes would be

julia> function badBox(inc::Union{Nothing, Int}, A::Vector{Int}, B::Vector{Int})
       length(A)==length(B) || throw("merde")
       if inc === nothing 
           inc = 0
       end
       Threads.@threads for i=1:length(A)
            A[i] = inc + B[i]
       end
       nothing
       end

This is a very typical example that users are running into.

Note that this is very nice julia code, apart from the @threads: local variables have no types, SSA values do, so the code is obviously type-stable (later references to inc mean the phi-node after the if/else, and that has a known, stable type!).

Many other languages force a style of

julia> function badBox(inc0::Union{Nothing, Int}, A::Vector{Int}, B::Vector{Int})
       length(A)==length(B) || throw("merde")
       inc = if inc0 === nothing 
           0
       else
           inc0
       end
       Threads.@threads for i=1:length(A)
            A[i] = inc + B[i]
       end
       nothing
       end

It is a breath of fresh air that julia doesn't care and doesn't force the user to do the SSA transform by hand.

The only issue is that lowering doesn't understand the control-flow, and if you use closures, then users must do the SSA transform by hand, which violates the spirit of all other parts of the language :(

PS. Maybe even more poignantly,

julia> function badBox(inc, A::Vector{Int}, B::Vector{Int})
       length(A)==length(B) || throw("merde")
       inc += 1
       Threads.@threads for i=1:length(A)
           A[i] = inc + B[i]
       end
       nothing
       end

If captures behaved like the rest of the language, this would lower identically to

julia> function badBox(inc, A::Vector{Int}, B::Vector{Int})
       length(A)==length(B) || throw("merde")
       inc1 = inc + 1
       Threads.@threads for i=1:length(A)
           A[i] = inc1 + B[i]
       end
       nothing
       end

even before any compiler optimizations.

@jakobnissen
Member

jakobnissen commented Jan 30, 2025

This PR would be a great improvement for usability. I've been bitten by this bug before, and it was among the worst I've had to debug, because it's nondeterministic, requires esoteric knowledge of Julia to understand (arguably even knowledge of an esoteric bug, namely the slow closure bug), and goes against the general behaviour of the language, where we expect that reassigning a variable does not turn it into a mutable reference behind your back.

However, a straight up error is probably too breaking. Perhaps it could instead throw a warning?

@jariji jariji mentioned this pull request Mar 10, 2025
@liuyxpp

liuyxpp commented Mar 11, 2025

Now I know where my mysterious bug comes from! It is really important that this issue be addressed ASAP.

@DilumAluthge
Member

If this is considered too breaking to do by default, perhaps it's a good candidate for "strict mode".

@KristofferC
Member

requires esoteric knowledge of Julia

I am not sure this is true. Take the original example:

let A = 1 # Some outer-local-variable that gets boxed
           function wrong()
               out = zeros(Int, 10)
               Threads.@threads for i in 1:10
                   A = i # Redefine A, this causes A.contents to be mutated
                   sleep(rand()/10)
                   out[i] = A # Reference A, reading from the boxed A.contents that has since been mutated on another thread
               end
               out
           end
end

A is an outer local so it is clearly not allowed to be written to concurrently?

@MasonProtter
Contributor Author

The barrier for what is considered esoteric can vary a lot from user to user.

In this case, I strongly suspect that knowing this will cause a race condition requires an above average understanding of julia's scoping rules.

@KristofferC
Member

The barrier for what is considered esoteric can vary a lot from user to user.

My point is that you don't need to know about closures or Core.Boxes or anything. Just that concurrently writing to a variable is a race condition. Is that unreasonable to expect when writing multithreaded code?

@MasonProtter
Contributor Author

I think the problem here is that a lot of users are not thinking of it that way. When they set A = i at the start of the loop, they think of A as local to the loop iteration, rather than as a mutation of some shared definition of A.

At least, I've encountered multiple people who got tripped up by exactly this, and I think it's not such a crazy mistake to make if you don't understand the scoping rules well.

@jakobnissen
Member

jakobnissen commented Mar 11, 2025

I would also say that one of the first things we learn when programming is the distinction between mutation (writing to a variable) and assignment. That's something I teach early when I teach programming, and it's also explicitly mentioned early in the Julia manual. What is so confusing here is that something that really looks like assignment really is mutation.

in particular, we learn early that assigning a variable inside a function will not cause the variable to change outside the function. That is, there is usually a difference between:

A = [1]
foo(x) = (x[1] = 2)
foo(A) # mutation

and

A = 1
foo(x) = (x = 2)
foo(A) # assignment, no mutation
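
For contrast, a small sketch of the closure case this thread is about, where the same assignment syntax does change the outer variable. The function names are made up for illustration.

```julia
function assign_in_closure()
    A = 1
    set_A() = (A = 2)   # `A` is already a local of the enclosing function, so this rebinds it
    set_A()
    return A            # 2: the assignment inside the closure is visible here
end

function assign_to_argument()
    A = 1
    foo(x) = (x = 2)    # `x` is foo's own argument, so rebinding it does not touch `A`
    foo(A)
    return A            # 1
end
```

The body of a `Threads.@threads` loop behaves like `set_A` here, since the macro wraps the loop body in a closure.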

@adienes
Member

adienes commented Mar 11, 2025

also extra-confusing is that A gets boxed even if the assignment in the parent local scope happens lexically AFTER the threaded loop

let
    function wrong()
        out = zeros(Int, 10)
        Threads.@threads for i in 1:10
            A = i # define A for the first time (lexically)
            sleep(rand()/10)
            out[i] = A # user is trying to reference local A only
        end
        out
    end
    A = 1 # boxed! this hoists "A" to the same variable as in `wrong` but presumably the user wanted a new one
end
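
The same hoisting can be sketched without any threading, assuming the scoping rule described in this comment (a local belongs to its whole enclosing scope, wherever the assignment appears). The function name `late_assignment` is hypothetical.

```julia
function late_assignment()
    set_A() = (A = 10)  # reads like the first definition of a closure-local `A`
    A = 1               # this later assignment makes `A` a local of `late_assignment`
    set_A()             # so the closure rebinds that same `A`
    return A
end
```

Under that rule, `late_assignment()` returns 10 rather than 1, because the closure and the later `A = 1` refer to one shared variable.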

@MasonProtter
Contributor Author

Thanks @adienes, yes I should have constructed my example with that ordering to make it extra clear just how confusing this can actually be sometimes.

@KristofferC
Member

in particular, I (and I suspect most people) learn that assigning a variable inside a function will not cause the variable to change outside the function.

The example in particular wraps the A in a let to force it to be local. If it were a normal variable there would be no issue.

@KristofferC
Member

This would be a good idea for a linter though.

@mbauman
Member

mbauman commented Mar 11, 2025

something that really looks like assignment really is mutation

I mean, this really is still assignment — or more specifically, deciding you have a better use for an existing name. The trouble is that it's reusing a name from a shared scope.

But perhaps that's precisely the level at which this should be tackled. Instead of an error or warning, what if @threads for ... lowered to an explicitly local-only scope? That is, if it explicitly added a let x=x, y=y, etc=etc block around the closure for all its identifiers contained within — effectively applying the #15276 let workaround for you? As a bonus it fixes #15276 for @threads :)

Most updates to outer scope names are already racy and broken. This would just make them no-ops instead of races. And it's a very easy rule to document. I imagine there could still be some "safe" patterns of this in use in the wild, however, where folks are carefully guarding such updates with a mutex or some such... but that seems quite unlikely.
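
As a rough illustration of the let workaround applied by hand (not of any change to @threads itself), here is a sketch adapted from the `badBox` example earlier in the thread. The name `add_inc!` is hypothetical, and the sketch only covers the read-only case where the loop body never assigns to the captured name.

```julia
# Hypothetical function, adapted from the `badBox` example earlier in the thread.
function add_inc!(inc::Union{Nothing, Int}, A::Vector{Int}, B::Vector{Int})
    length(A) == length(B) || throw(DimensionMismatch("A and B must have the same length"))
    if inc === nothing
        inc = 0          # reassignment: without the `let` below, `inc` would be boxed
    end
    # Rebind `inc` to a fresh variable that is assigned exactly once, so the
    # closure created by @threads captures it by value.
    let inc = inc
        Threads.@threads for i in 1:length(A)
            A[i] = inc + B[i]
        end
    end
    return nothing
end

A = zeros(Int, 3)
add_inc!(nothing, A, [1, 2, 3])
A  # [1, 2, 3]
```

The proposal above would amount to @threads inserting the `let` line automatically for every identifier used in the loop body.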

@adienes
Member

adienes commented Mar 11, 2025

lowered to an explicitly local-only scope?

this will lead to some very fun and frequent discussions about the differences between scopes

  • local (soft, interactive)
  • local (soft, non-interactive)
  • local (hard)
  • local (extra-hard for @threads)

In seriousness though, for my tastes I find that proposal to be a bit too clever. I don't dispute that it would fix the majority of cases where this pattern is problematic, but it's also introducing a pattern I haven't seen precedent for in any other macro or language construct. it's also breaking, as it removes legitimate functionality (albeit probably rarely used)

despite the fact that I do agree that this behavior is super counterintuitive and easily leads to bugs, I do think the right solution is a combination of

  • strong language in documentation warning users that multithreaded code is hard to write correctly
  • tools to write multithreaded code correctly (OhMyThreads.jl was an enormous step forward here and I think it would be pretty reasonable to recommend it in Base docs)
  • catch the issue in this PR via a linter with a red squiggle and a clippy pop up saying "don't do that"

@MasonProtter
Contributor Author

MasonProtter commented Mar 15, 2025

Even though this PR will not be going forward (at least in its current form), I am implementing a similar change in OhMyThreads.jl in this PR: JuliaFolds2/OhMyThreads.jl#141

I would appreciate any thoughts or feedback there from people who had thoughts on this PR. The philosophy in that PR is that boxed captures are assumed to be erroneous by default, but I am also including a (ScopedValue-based) interface to disable this check. So if you really wish, you can write stuff like

julia> @allow_boxed_captures let
           v = tmap(1:10) do i
               A = i
               sleep(rand())
               A
           end
           A = 1 # oops, now everything is a race condition!
           v
       end
10-element Vector{Int64}:
 4
 2
 8
 4
 3
 6
 6
 2
 6
 6

Labels

multithreading Base.Threads and related functionality

Development

Successfully merging this pull request may close these issues.

Race condition caused by variable scope getting lifted from a multithreaded context

8 participants