Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New blogpost: PSA: Thread-local state is no longer recommended; Common misconceptions about threadid() and nthreads() #1904

Merged
merged 25 commits into from Jul 6, 2023

Conversation

MasonProtter
Copy link
Contributor

@MasonProtter MasonProtter commented Jun 5, 2023

This is a blogpost some of us have been working on initially on hackmd, spurred by the somewhat surprising discovery of just how many packages out in the wild are using this dangerous pattern.

I feel myself being strongly pulled towards expanding this into a rather long form discussion of multithreading, and also wanting to wait for PRs to replace @threads with a pmapreduce or whatever, but I think the most important short term thing is to keep this blog post somewhat short, and focused on the specific problem of people indexing into storage using threadid() (or other common invalid usages of threadid()), and offering fixes without lowering the signal to noise ratio too much -- we need people to actually read and understand this!

Feedback, fixes, and bikeshedding all appreciated and welcome.


TODO's before publication

  • Update master & 1.9 docs to guide against misuse of threadid() e.g.
    • add docs on task-specific buffering using @threads #48542
    • add docs on task migration #50047
    • Add !!1compat notes to threadid/nthreads/maxthreadid
    • Add note to recommend against @async
  • Make sure that ^ 1.9 version is released so docs are discoverable
  • Get signoff by core multithreading developers
  • More detail in the Packages section, perhaps with examples (or remove it)
  • Bikeshead the title.

@MasonProtter MasonProtter changed the title Create PSA-dont-use-threadid.md New blogpost: PSA: don't use states[threadid()] Jun 5, 2023
@DilumAluthge DilumAluthge marked this pull request as draft June 5, 2023 18:38

MultiThreadedCaches.jl on the other hand attempts to make the `states[threadid()]`-like pattern safer by using lock mechanisms to stop data races. We think this is not an ideal way to proceed, but it may make transitioning to safer code easier and require fewer rewrites, such as in cases where a package must manage state which may be concurrently written to by a user, and the package cannot control how the user structures their code.

[^historynote]: ### Concurrency & Parallelism
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a big footnote

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was originally added to the main body, but I moved it to a footnote because I wanted the main thrust of the article to be more focused, but I also couldn't bring myself to delete it.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am torn about it as well. I originally just wanted to define the terms, so that everyone understood them, but then it grew into this historical perspective.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make it an Appendix section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference?

MasonProtter and others added 3 commits June 5, 2023 13:33
Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
MasonProtter and others added 4 commits June 6, 2023 11:43
Co-authored-by: Sukera <11753998+Seelengrab@users.noreply.github.com>
Co-authored-by: Sukera <11753998+Seelengrab@users.noreply.github.com>
Co-authored-by: Sukera <11753998+Seelengrab@users.noreply.github.com>
Co-authored-by: Sukera <11753998+Seelengrab@users.noreply.github.com>
Alt titles:
- PSA: Multithreading with `states[threadid()]` is unsafe
- PSA: Multithreading with `buffers[threadid()]` is unsafe
- PSA: Don't assume `threadid()` is stable within a task
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps just this?

Suggested change
- PSA: Don't assume `threadid()` is stable within a task
- PSA: Don't assume `threadid()` is stable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this title rather misleadning, because the stability of threadid() is not the actual problem here. The problem is race conditions in updating the buffer at a specific threadid().

E.g. in the single threaded example in the blogpost, the threadid() is perfectly stable, it's always 1, but the problem persists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A technically correct (but horrible UX) title could be

PSA: Don't use non-threadsafe, volatile, shared state to attempt to coordinate concurrent task execution

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I want is to catch the attention of people who are only dimly aware of this stuff and what it means.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"You are doing it wrong! Common misconceptions about threadid() and nthreads()"

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last part of that is good

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From @vchuravy's comment below, with a slight tweak:

PSA: Thread-local state is no longer recommended; Common misconceptions about threadid() and nthreads()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay I've changed the title-in-progress to that one. I'm still not really convinced it's any better since people don't really know what that means, but I also don't have strong feelings about it and the old title was also not perfect.

@non-Jedi
Copy link

non-Jedi commented Jun 6, 2023

This blogpost should probably acknowledge the previous blog post teaching this pattern. It would also be good to include a commit in this PR adding a disclaimer to that same blog post.

@vchuravy
Copy link
Sponsor Member

vchuravy commented Jun 7, 2023

This blogpost should probably acknowledge the previous blog post teaching this pattern. It would also be good to include a commit in this PR adding a disclaimer to that same blog post.

Oh my! That didn't age very well.

This gives a natural title: "We no longer recommend thread-local state; Common misconceptions about threadid() and nthreads()"

do_something(states)
```

The above code is **incorrect** because the tasks spawned by `@threads` are allowed to yield to other tasks during their execution.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that be clearer?

Suggested change
The above code is **incorrect** because the tasks spawned by `@threads` are allowed to yield to other tasks during their execution.
The above code is **incorrect** because the tasks spawned by `@threads` are neither guaranteed to have a threadid exclusively during their lifetime nor to keep the same threadid during their execution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think the "have a threadid exclusively during their lifetime" part is a bit easy to misunderstand. I think the sentence you're wanting to replace here should not be seen as an explanation, but as a definition at the start of an explanation, which is in the following couple of sentences.

I think it's useful to motivate or define here what "yielding" is

MasonProtter and others added 2 commits June 8, 2023 10:02
Co-authored-by: Klaus Crusius <KlausC@users.noreply.github.com>
Co-Authored-By: Mason Protter <mason.protter@icloud.com>
Co-Authored-By: Thibaut Lienart <tlienart@me.com>
@IanButterworth
Copy link
Sponsor Member

Formatting fixed and pushed here, thanks to @tlienart https://julialang.netlify.app/previews/pr1914/blog/2023/06/psa-dont-use-threadid/

@IanButterworth
Copy link
Sponsor Member

Also the two docs PRs called out up top (JuliaLang/julia#48542 JuliaLang/julia#50047) are due to land in 1.9.2

Copy link
Sponsor Member

@timholy timholy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An excellent and much-needed post, thanks!

do_something(states)
```

The above code is **incorrect** because the tasks spawned by `@threads` are allowed to yield to other tasks during their execution. This means that between reading `old_val` and storing `new_val` in the storage, the task could be paused and a new task running on the same thread with the same `threadid()` could concurrently write to `states[tid]`, causing a race condition and thus work being lost.
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For developers who may be less familiar with race conditions, would it be easier to delay explaining what can go wrong until the f(i) = (sleep(0.001); i) example below? I'm imagining you could do something like this:

    tid = Threads.threadid()   # each task gets `tid = 1`
    old_var = state[tid]       # each task reads the current value, which for all is 0 (!) because... 
    new_var = old_var + f(i)   # ...the `sleep` in `f` causes all tasks to pause *simultaneously* here (all loop iterations start, but do not yet finish)
    state[tid] = new_var       # after being released from the `sleep`, each task sets `state[1]` to 1

My hope is that the comments walk readers through the logic more clearly than the short description above.

blog/2023/06/PSA-dont-use-threadid.md Outdated Show resolved Hide resolved
Copy link
Sponsor Contributor

@tlienart tlienart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style note: for readability's sake it might be better to put # 1, # 2, # 3, ... and then put the comments in a numbered list below the code as otherwise this will y-overflow on devices and it's a bit awkward to read code having to scroll horizontally over it 🤷

edit: hmm looks like I didn't do a good job at commenting the changes, I was referring to d3b24f2

@IanButterworth
Copy link
Sponsor Member

IanButterworth commented Jul 5, 2023

@MasonProtter
Copy link
Contributor Author

So I tried a couple things to get the comments into a list like @tlienart suggested but I wasn't really happy with any of them and personally find it not so bad to scroll horizontally if the upside is that the comment is right beside the code.

Like, if the comments aren't going to be in-line with the code then I'm not sure I really see the point because we already have a paragraph explaining what's going on anyways.

Does anyone have suggestions on how to improve the comments in the code for more legibility? I'd really like to get this published.

@MasonProtter
Copy link
Contributor Author

Maybe lets get a new preview built so that we can see what it looks like?

@IanButterworth
Copy link
Sponsor Member

Co-authored-by: Mosè Giordano <giordano@users.noreply.github.com>
MasonProtter and others added 2 commits July 6, 2023 10:48
Co-authored-by: Ian Butterworth <i.r.butterworth@gmail.com>
@MasonProtter
Copy link
Contributor Author

Okay, I think this is ready to merge!

@IanButterworth IanButterworth merged commit c6c4bc6 into JuliaLang:main Jul 6, 2023
1 check passed
@xwuupb
Copy link

xwuupb commented Apr 30, 2024

I do not understand the logic in this post (https://julialang.org/blog/2023/07/PSA-dont-use-threadid/).

  • in the first sample you are using the @threads-pattern for incorrect code, which is hardly reproducible.
  • in the second sample you are using the @sync-@spawn-pattern to demonstrate the incorrectness, which is surely reproducible.

Could you please give an easily reproducible sample using the @threads-pattern? Thank!

@Seelengrab
Copy link
Contributor

@threads is more or less a thin wrapper around @sync @spawn. Neither is necessarily reproducible (in the sense that they end up with the same results every time) because of the potential presence of a race condition.

Could you clarify which part your thinking of when you say that the example is not reproducible?

@xwuupb
Copy link

xwuupb commented Apr 30, 2024

Could you clarify which part your thinking of when you say that the example is not reproducible?

In the section "The buggy pattern" in this Blog (https://julialang.org/blog/2023/07/PSA-dont-use-threadid/) it reads

"The general template that this incorrect code follows is something like the following:" (then the @threads-pattern code).

The following Julia code using the general template (as mentioned in the Blog) is repeatedly tested via julia -t 7 threads.jl in a bash for-loop. However, the output does not change in different runs (at least in my tests).

using Base.Threads: nthreads, @threads, threadid

f(i) = (sleep(0.001); i)

states = [ 0 for _ in 1:nthreads() ]
N = 100
@threads for x in 1:N
  tid = threadid()
  old_val = states[tid]
  new_val = old_val + f(x)
  states[tid] = new_val
end
@show sum(states)

For the @sync-@spawn-pattern (the 2nd example in the Blog): the following Julia code is repeatedly tested via julia -t 7 spawn.jl in a bash for-loop. The output does change very frequently in different runs.

using Base.Threads: nthreads, @spawn, threadid

f(i) = (sleep(0.001); i)

let states = [ 0 for _ in 1:nthreads() ], N = 100
  @sync for i in 1:N
    @spawn begin
      tid = threadid()
      old_val = states[tid]
      new_val = old_val + f(i)
      states[tid] = new_val
    end
  end
  @show sum(states)
end

Thus, I mean the general template in the Blog is not reproducible. The @threads-pattern code above always gives correct results in my tests. More info about my tests:

  • CPU: AMD EPYC 7763 64-Core
  • OS: Red Hat Enterprise Linux 8.8
  • Julia: version 1.10.2

@giordano
Copy link
Contributor

@xwuupb please, if you have any questions ask them on Discourse https://discourse.julialang.org/, a long merged pull request isn't a good place to have this discussions.

@xwuupb
Copy link

xwuupb commented Apr 30, 2024

@xwuupb please, if you have any questions ask them on Discourse https://discourse.julialang.org/, a long merged pull request isn't a good place to have this discussions.

Now, the topic is moved here: https://discourse.julialang.org/t/psa-thread-local-state-is-no-longer-recommended-common-misconceptions-about-threadid-and-nthreads/101274/54

@MasonProtter MasonProtter deleted the MasonProtter-PSA branch May 1, 2024 11:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet