Commit c09813a

Threadsafe evaluation (#667)
1 parent c04bd63 commit c09813a

File tree

2 files changed: +272 -0 lines changed

_quarto.yml

Lines changed: 2 additions & 0 deletions
@@ -82,6 +82,7 @@ website:
 - usage/tracking-extra-quantities/index.qmd
 - usage/predictive-distributions/index.qmd
 - usage/mode-estimation/index.qmd
+- usage/threadsafe-evaluation/index.qmd
 - usage/performance-tips/index.qmd
 - usage/sampler-visualisation/index.qmd
 - usage/dynamichmc/index.qmd
@@ -215,6 +216,7 @@ usage-probability-interface: usage/probability-interface
 usage-sampler-visualisation: usage/sampler-visualisation
 usage-sampling-options: usage/sampling-options
 usage-submodels: usage/submodels
+usage-threadsafe-evaluation: usage/threadsafe-evaluation
 usage-tracking-extra-quantities: usage/tracking-extra-quantities
 usage-troubleshooting: usage/troubleshooting

Lines changed: 270 additions & 0 deletions
@@ -0,0 +1,270 @@
---
title: Threadsafe Evaluation
engine: julia
julia:
  exeflags:
    - "--threads=4"
---

A common technique to speed up Julia code is to use multiple threads to run computations in parallel.
The Julia manual [has a section on multithreading](https://docs.julialang.org/en/v1/manual/multi-threading), which is a good introduction to the topic.

We assume that the reader is familiar with some threading constructs in Julia, and with the general concept of data races.
This page specifically discusses Turing's support for threadsafe model evaluation.

:::{.callout-note}
Please note that this is a rapidly-moving topic, and things may change in future releases of Turing.
If you are ever unsure about what works and what doesn't, please don't hesitate to ask on [Slack](https://julialang.slack.com/archives/CCYDC34A0) or [Discourse](https://discourse.julialang.org/c/domain/probprog/48).
:::
## MCMC sampling

For complete clarity, this page has nothing to do with parallel sampling of MCMC chains using

```julia
sample(model, sampler, MCMCThreads(), N, nchains)
```

That parallelisation exists outside of the model evaluation, and is thus independent of the model contents.
This page only discusses threading _inside_ Turing models.
## Threading in Turing models

Given that Turing models mostly contain 'plain' Julia code, one might expect that all threading constructs, such as `Threads.@threads` or `Threads.@spawn`, can be used inside Turing models.

This is, to some extent, true: you can use threading constructs to speed up deterministic computations.
For example, here we use parallelism to speed up a transformation of `x`:

```julia
@model function f(y)
    x ~ dist
    x_transformed = similar(x)
    Threads.@threads for i in eachindex(x)
        x_transformed[i] = some_expensive_function(x[i])
    end
    y ~ some_likelihood(x_transformed)
end
```
In general, for code that does not involve tilde-statements (`x ~ dist`), threading works exactly as it does in regular Julia code.

**However, extra care must be taken when using tilde-statements (`x ~ dist`) inside threaded blocks.**
The reason is that tilde-statements modify the internal VarInfo object used for model evaluation.
Essentially, `x ~ dist` expands to something like

```julia
x, __varinfo__ = DynamicPPL.tilde_assume!!(..., __varinfo__)
```

and writing into `__varinfo__` is, _in general_, not threadsafe.
Thus, parallelising tilde-statements can lead to data races [as described in the Julia manual](https://docs.julialang.org/en/v1/manual/multi-threading/#Using-@threads-without-data-races).
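To see the kind of failure this can cause, here is a self-contained sketch in plain Julia (no Turing involved) of the same pattern: many iterations performing an unsynchronised read-modify-write on shared state, versus computing per-task partial results and combining them afterwards. The function names here are ours, purely for illustration.

```julia
using Base.Threads

# Unsynchronised read-modify-write on shared state, analogous to many
# tilde-statements all writing into the same `__varinfo__`:
function racy_sum(n)
    total = Ref(0.0)
    @threads for _ in 1:n
        total[] += 1.0  # data race: read-modify-write is not atomic
    end
    return total[]  # often less than n when run with multiple threads
end

# Safe alternative: each task computes its own partial result, and the
# partials are combined afterwards. The same idea underlies DynamicPPL's
# threadsafe evaluation.
function safe_sum(n; ntasks = nthreads())
    chunks = Iterators.partition(1:n, cld(n, ntasks))
    tasks = [Threads.@spawn sum(1.0 for _ in chunk) for chunk in chunks]
    return sum(fetch.(tasks))
end
```

Only the second version is guaranteed to return `n` regardless of how many threads Julia was started with.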
## Threaded tilde-statements

**As of version 0.41, Turing only supports the use of tilde-statements inside threaded blocks when these are observations (i.e., likelihood terms).**

This means that the following code is safe to use:

```{julia}
using Turing

@model function threaded_obs(N)
    x ~ Normal()
    y = Vector{Float64}(undef, N)
    Threads.@threads for i in 1:N
        y[i] ~ Normal(x)
    end
end

N = 100
y = randn(N)
model = threaded_obs(N) | (; y = y)
```
Evaluating this model is threadsafe, in that Turing guarantees to provide the correct result in functions such as:

```{julia}
logjoint(model, (; x = 0.0))
```

(We can compare this with the true value:)

```{julia}
logpdf(Normal(), 0.0) + sum(logpdf.(Normal(0.0), y))
```

When sampling, you must disable model checking, but otherwise the results will be correct:

```{julia}
sample(model, NUTS(), 100; check_model=false, progress=false)
```
::: {.callout-warning}
## Upcoming changes

Starting from DynamicPPL 0.39, if you use tilde-statements or `@addlogprob!` inside threaded blocks, you will have to declare this upfront using:

```julia
model = threaded_obs() | (; y = randn(N))
threadsafe_model = setthreadsafe(model, true)
```

You can then sample from `threadsafe_model` as before.

The reason for this change is that threadsafe evaluation comes with a performance cost, which can sometimes be substantial.
In the past, threadsafe evaluation was always enabled, i.e., this cost was *always* incurred whenever Julia was launched with more than one thread.
However, the number of threads alone is not an appropriate way to determine whether threadsafe evaluation is needed!
:::
**On the other hand, parallelising the sampling of latent values is not supported.**
Attempting to do this will either error or give wrong results.

```{julia}
#| error: true
@model function threaded_assume_bad(N)
    x = Vector{Float64}(undef, N)
    Threads.@threads for i in 1:N
        x[i] ~ Normal()
    end
    return x
end

model = threaded_assume_bad(100)

# This will throw an error (and probably a different error
# each time it's run...)
model()
```
**Note, in particular, that this means that you cannot currently use `predict` to sample new data in parallel.**

:::{.callout-note}
## Threaded `predict`

Support for threaded `predict` will be added in DynamicPPL 0.39 (see [this pull request](https://github.com/TuringLang/DynamicPPL.jl/pull/1130)).
:::
That is, even for `threaded_obs`, where `y` was originally an observed term, you _cannot_ do:

```{julia}
#| error: true
model = threaded_obs(N) | (; y = y)
chn = sample(model, NUTS(), 100; check_model=false, progress=false)

pmodel = threaded_obs(N) # don't condition on data
predict(pmodel, chn)
```
## Alternatives to threaded observation

An alternative to using threaded observations is to manually calculate the log-likelihood term (which can be parallelised using any of Julia's standard mechanisms), and then, _outside_ of the threaded block, [add it to the model using `@addlogprob!`]({{< meta usage-modifying-logprob >}}).

For example:

```{julia}
# Note that `y` has to be passed as an argument; you can't
# condition on it because otherwise `y[i]` won't be defined.
@model function threaded_obs_addlogprob(N, y)
    x ~ Normal()

    # Instead of this:
    # Threads.@threads for i in 1:N
    #     y[i] ~ Normal(x)
    # end

    # Do this instead:
    lls = map(1:N) do i
        Threads.@spawn begin
            logpdf(Normal(x), y[i])
        end
    end
    @addlogprob! sum(fetch.(lls))
end
```
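One practical note on the example above: it spawns one task per observation, which can add noticeable scheduling overhead when `N` is large. A common refinement is to spawn one task per chunk of observations instead. Here is a minimal, self-contained sketch of that pattern; we hand-write the normal logpdf so that the snippet runs without Turing, and the function names are ours, not part of any library.

```julia
# One task per chunk instead of one task per observation.
# `normal_logpdf` is a hand-written stand-in for `logpdf(Normal(mu), y)`.
normal_logpdf(y, mu) = -0.5 * (y - mu)^2 - 0.5 * log(2pi)

function chunked_loglik(ys, mu; ntasks = Threads.nthreads())
    chunks = Iterators.partition(ys, cld(length(ys), ntasks))
    # Each task sums the log-likelihood of its own chunk:
    tasks = [Threads.@spawn sum(normal_logpdf(y, mu) for y in chunk)
             for chunk in chunks]
    return sum(fetch.(tasks))
end
```

Inside a model, you would then write something like `@addlogprob! chunked_loglik(y, x)` in place of the `map`/`@spawn` loop.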
In a similar way, you can also use your favourite parallelism package, such as `FLoops.jl` or `OhMyThreads.jl`.
See [this Discourse post](https://discourse.julialang.org/t/parallelism-within-turing-jl-model/54064/9) for some examples.

We make no promises about the use of tilde-statements _with_ these packages (indeed, it will most likely error), but as long as you use them only to parallelise regular Julia code (i.e., not tilde-statements), they will work as intended.

The main downsides of this approach are:

1. You can't use conditioning syntax to provide data; it has to be passed as an argument or otherwise included inside the model.
2. You can't use `predict` to sample new data.
On the other hand, one benefit of rewriting the model this way is that sampling from it with `MCMCThreads()` will always be reproducible.

```{julia}
using Random
N = 100
y = randn(N)
model = threaded_obs_addlogprob(N, y)
nuts_kwargs = (check_model=false, progress=false, verbose=false)

chain1 = sample(Xoshiro(468), model, NUTS(), MCMCThreads(), 1000, 4; nuts_kwargs...)
chain2 = sample(Xoshiro(468), model, NUTS(), MCMCThreads(), 1000, 4; nuts_kwargs...)
mean(chain1[:x]), mean(chain2[:x]) # should be identical
```
In contrast, the original `threaded_obs` (which used tilde-statements inside `Threads.@threads`) is not reproducible when using `MCMCThreads()`.
(In principle, we would like to fix this bug, but we haven't yet investigated where it stems from.)

```{julia}
model = threaded_obs(N) | (; y = y)
nuts_kwargs = (check_model=false, progress=false, verbose=false)
chain1 = sample(Xoshiro(468), model, NUTS(), MCMCThreads(), 1000, 4; nuts_kwargs...)
chain2 = sample(Xoshiro(468), model, NUTS(), MCMCThreads(), 1000, 4; nuts_kwargs...)
mean(chain1[:x]), mean(chain2[:x]) # oops!
```
## AD support

Finally, if you are [using Turing with automatic differentiation]({{< meta usage-automatic-differentiation >}}), you also need to keep track of which AD backends support threadsafe evaluation.

ForwardDiff is the only AD backend that we find to work reliably with threaded model evaluation.

In particular:

- ReverseDiff sometimes gives the right results, but quite often gives incorrect gradients.
- Mooncake [currently does not support multithreading at all](https://github.com/chalk-lab/Mooncake.jl/issues/570).
- Enzyme [mostly gives the right result, but sometimes gives incorrect gradients](https://github.com/TuringLang/DynamicPPL.jl/issues/1131).
## Under the hood

:::{.callout-note}
This part will likely only be of interest to DynamicPPL developers and the very curious user.
:::
### Why is VarInfo not threadsafe?

As alluded to above, the issue with threaded tilde-statements stems from the fact that these tilde-statements modify the VarInfo object used for model evaluation, leading to potential data races.

Traditionally, VarInfo objects contain both *metadata* and *accumulators*.
Metadata is where information about the random variables' values is stored.
It is a Dict-like structure, and pushing to it from multiple threads is therefore not threadsafe (Julia's `Dict` has similar limitations).

Accumulators, on the other hand, are used to store outputs of the model, such as log-probabilities.
The way DynamicPPL's threadsafe evaluation works is to create one set of accumulators per thread, and then combine the results at the end of model evaluation.

In this way, any function call that _solely_ involves accumulators can be made threadsafe.
For example, this is why observations are supported: there is no need to modify metadata, and only the log-likelihood accumulator needs to be updated.

However, `assume` tilde-statements always modify the metadata, and thus cannot currently be made threadsafe.
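The split-and-combine strategy described above can be sketched in a few lines of plain Julia. This is only an illustration of the idea; the types and function names below are ours, not DynamicPPL's actual API.

```julia
# A toy "accumulator" holding a running log-probability, plus a way to
# merge two of them: the two operations that per-task accumulation needs.
struct ToyLogProbAcc
    logp::Float64
end
combine(a::ToyLogProbAcc, b::ToyLogProbAcc) = ToyLogProbAcc(a.logp + b.logp)

# Evaluate `loglike(i)` for i in 1:n, giving each task its own
# accumulator, then combine the per-task results at the end:
function accumulate_threaded(loglike, n; ntasks = Threads.nthreads())
    chunks = Iterators.partition(1:n, cld(n, ntasks))
    tasks = [Threads.@spawn ToyLogProbAcc(sum(loglike, chunk)) for chunk in chunks]
    return reduce(combine, fetch.(tasks)).logp
end
```

Because each task only ever touches its own `ToyLogProbAcc`, no synchronisation is needed until the final `reduce`, which runs on a single task.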
### OnlyAccsVarInfo

As it happens, much of what DynamicPPL needs can be constructed so that it relies *only* on accumulators.

For example, as long as there is no need to *sample* new values of random variables, it is actually fine to omit the metadata object completely.
This is the case for `LogDensityFunction`: since the variables' values are provided in the input vector, there is no need to store them in metadata.
We need only calculate the associated log-prior probability, which is stored in an accumulator.
Thus, starting from DynamicPPL v0.39, `LogDensityFunction` itself will in fact be completely threadsafe.

Technically speaking, this is achieved using `OnlyAccsVarInfo`, which is a subtype of `VarInfo` that contains only accumulators, and no metadata at all.
It implements enough of the `VarInfo` interface to be used in model evaluation, but will error if any function attempts to modify or read its metadata.

There is currently an ongoing push to use `OnlyAccsVarInfo` in as many settings as we possibly can.
For example, this is why `predict` will be threadsafe in DynamicPPL v0.39: instead of modifying metadata to store the predicted values, we store them inside a `ValuesAsInModelAccumulator` and combine them at the end of evaluation.

However, propagating these changes up to Turing will require a substantial amount of additional work, since there are many places in Turing which currently rely on a full VarInfo (with metadata).
See, e.g., [this PR](https://github.com/TuringLang/DynamicPPL.jl/pull/1154) for more information.
