Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tasks: don't advance task RNG on task spawn #49110

Merged
merged 1 commit into from
Mar 31, 2023
Merged

Conversation

StefanKarpinski
Copy link
Sponsor Member

@StefanKarpinski StefanKarpinski commented Mar 22, 2023

Previously we had this unfortunate behavior:

julia> Random.seed!(123)
TaskLocalRNG()

julia> randn()
-0.6457306721039767

julia> Random.seed!(123)
TaskLocalRNG()

julia> fetch(@async nothing)

julia> randn()
0.4922456865251828

In other words: the mere act of spawning a child task affects the parent task's RNG (by advancing it four times). This PR preserves the desirable parts of the previous situation: when seeded, the parent and child RNG streams are reproducible. Moreover, it fixes the undesirable behavior:

julia> Random.seed!(123)
TaskLocalRNG()

julia> randn()
-0.6457306721039767

julia> Random.seed!(123)
TaskLocalRNG()

julia> fetch(@async nothing)

julia> randn()
-0.6457306721039767

In other words: the parent RNG is unaffected by spawning a child.

The design is based on the SplitMix 1 and DotMix 2 RNGs, but with some simplifications based on observing that the task tree is always binary and therefore we only need binary pedigree coefficients: when a task forks a child, zero is appended to its pedigree, and the child's pedigree is the same with a one in the last place. Thus all the pedigree coefficients are binary. How does this help matters? In the proof of collision resistance in the DotMix paper, working in a prime modulus is only necessary to guarantee that the difference between pedigree coordinates has a multiplicative inverse. If the coefficients are binary, then the difference is always 1, which means we can work in any modulus, including Z/2^64, which lets us use native integer arithmetic.

Similar to SplitMix, instead of explicitly storing pedigree coordinates, we store each task's dot product and derive each child task's dot product by adding to the parent dot product, the random weight coefficeint for the current tree depth. No multiplication is required, we just add the random weight for the current tree depth to the parent's dot product to get the child's dot product.

Pseudorandom weights for the SplitMix dot product are generated using an internal PCG RNG, specifically the PCG-RXS-M-XS variant. We chose this RNG because we need something small and fast but unlikely to have artifacts which might sabbotage the collision resistance of SplitMix. Since PCG-RXS-M-XS passes BigCrush with only 36 bits of state and we use 64 bits of state, it fits the bill quite well. The major caveat of this RNG is that it produces each value once, which makes it insecure, but this is actually beneficial here: we don't want any repeated weights. We also don't care that this RNG is insecure and invertible since it's only used internally to seed forked child tasks. Rather than the classic LCG multiplier inherited from Knuth, we use the best full width multiplier found by Steele & Vigna 3 searching for high spectrum constants.

We reuse the PCG-RXS-M-XS output function for mixing the bits of the SplitMix dot product, instead of the MurmurHash3 mixing function that it normally uses since we're already using for the internal PCG generator. The choice of mixing output function for SplitMix is essentially arbitrary and only serves to cascade bit differences in dot products.

Copy link
Sponsor Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing!

stdlib/Random/src/Xoshiro.jl Outdated Show resolved Hide resolved
@StefanKarpinski
Copy link
Sponsor Member Author

Just pushed a couple of new commits.

The first one increases instruction-level parallelism in the task spawn by using the previous LCG state, which is perfectly good, to update the SplitMix dot product and do the LCG update in parallel, writing the new value back to both the parent and child tasks. This is a straight improvement, no reason not to do this.

The second commit is speculative: it makes the seeding of child tasks independent of the usage of the parent's main xosiro256++ RNG; instead it only depends on the rngState4 and rngState5 seed values that are also set when the main RNG is seeded. In practice what this means is the following:

julia> Random.seed!(123)
TaskLocalRNG()

julia> randn()
-0.6457306721039767

julia> fetch(@async randn())
-0.8761314978436953

julia> Random.seed!(123)
TaskLocalRNG()

julia> fetch(@async randn())
-0.8761314978436953

julia> randn()
-0.6457306721039767

In other words, using the RNG in the parent doesn't affect the RNG stream of the child—the only thing that determines the childs RNG stream is how the root task was seeded and the task structure, i.e. where in the binary task tree the child is.

I've discussed a bit with @vtjnash and we're not sure about this change. It makes RNG streams and child tasks easier to reason about and less brittle—using the RNG in the parent won't cascade changes into all the children—which seems good. The potential downside is that if a SplitMix dot product collision does happen, then those two tasks will have identical RNG output, whereas if the SplitMix dot product is mixed with the xoshiro256 state, then that will only happen if there is a dot product collision and the xoshiro256 state started out the same, which is extremely unlikely.

@StefanKarpinski
Copy link
Sponsor Member Author

StefanKarpinski commented Mar 23, 2023

Spurred, somewhat ironically, by thinking about the version of this where the child's main RNG seeding is independent of the parent's main RNG state, I've come up with a further improvement to the original version where the parent's RNG does affect the child's RNG.

One observation is that SplitMix collision avoidance actually only helps in cases where the the xoshiro state is identical in two tasks to start with. In cases where the xoshiro state is not identical, different SplitMix hashes can theoretically put you into identical xoshiro states and nothing about the SplitMix construction helps to prevent that, so SplitMix really only helps when the xoshiro states start out the same.

I also noted that if we focus on consecutive children of the same task with no intervening samples from the main RNG perturbing the xoshiro state, then we would be better off using PCG output to disturb the xoshiro state than using the SplitMix dot product to perturb it: PCG output is guaranteed to be collision free since it produces each possible output once per full RNG cycle (2^64), whereas the SplitMix dot product merely makes collisions unlikely and the birthday paradox tells us that if with a 64-bit hash, if we want the chances of collision to be less than 1 in a billion, then we're in trouble if we start 200k tasks. If we perturb xoshiro's state with the output of PCG, then we could start 2^64 (> 10^19) immediate child tasks without collision.

Looking at my original code, the SplitMix dot product is perturbed via addition by the PCG output, but what the above suggests is: What if we were to perturb the xoshiro state directly with the PCG output? That guarantees that each immediate child gets a unique state. Which to: Why not use each xoshiro state register as a SplitMix dot product? Why not, indeed. Assuming no one advances their xoshiro RNG, parent and sibling tasks are guaranteed not to collide (by PCG invertibility), and cousin tasks differ from their greatest common ancestor by SplitMix dot product, which gives the same collision resistance as before—but now we have 256 bits of dot product hash instead of just 64! In other words, we gain multiple benefits:

  1. Parent and sibling tasks are guaranteed not to collide.
  2. We don't have to allocate space in each task for a SplitMix dot product—we save 64 bits per task.
  3. Instead we get to use all 256 bits of xoshiro256 state as a SplitMix dot product, which gives astronomically better collision avoidance for tasks that are not directly related.

There are some subtleties:

  • The weights you perturb each xoshiro register have to be effectively independent or having four of them isn't actually helping your collision avoidance—i.e. you have to make sure a collision in one register doesn't make a collision in another register more likely.
  • PCG generates zero sometimes, which would make the child identical to the parent. That's fine for one register, but you want to make sure that you don't perturb all the registers by zero at the same time.

One obvious option is to draw each of the four weights from the same PCG, which guarantees that they are all distinct and should be unrelated enough to pass statistical muster with BigCrush and the like. There's an argument to be made that the set of possible weights that gives the collision probability isn't really 2^256, but rather 2^64 since that's the size of the PCG state. I'll have to think on that.

Without further ado, here's the improved task split function:

void jl_rng_split(uint64_t dst[5], uint64_t src[5) JL_NOTSAFEPOINT
{
    uint64_t lcg = src[4]; // load internal PCG's LCG state
    dst[0] = src[0] + pcg_out(lcg); lcg = lcg * LCG_MUL + 1;
    dst[1] = src[1] + pcg_out(lcg); lcg = lcg * LCG_MUL + 1;
    dst[2] = src[2] + pcg_out(lcg); lcg = lcg * LCG_MUL + 1;
    dst[3] = src[3] + pcg_out(lcg); lcg = lcg * LCG_MUL + 1;
    dst[4] = lcg;
}

Yep, that's it. Note that we do away with one word of RNG state (the dot product).

@oscardssmith
Copy link
Member

That's awesome!

@StefanKarpinski
Copy link
Sponsor Member Author

Ok, I pushed that as a commit, but then replaced it with an approach that's easier to vectorize and that seems better. We're back to the child RNG being affected by the parent RNG because it lets us use the main RNG's state for SplitMix dot products, which gives really excellent collision resistance.

@StefanKarpinski StefanKarpinski force-pushed the sk/rng_split branch 2 times, most recently from 0b075f8 to 8d3d550 Compare March 27, 2023 18:33
@StefanKarpinski StefanKarpinski marked this pull request as ready for review March 27, 2023 18:34
@StefanKarpinski
Copy link
Sponsor Member Author

I'm going to squash this and add a comment explaining how it works, but I want to keep history around for posterity, so here's a gist of the commit log with diffs.

@StefanKarpinski StefanKarpinski force-pushed the sk/rng_split branch 2 times, most recently from 66147d0 to ca53644 Compare March 28, 2023 15:31
@StefanKarpinski StefanKarpinski added the backport 1.9 Change should be backported to release-1.9 label Mar 28, 2023
@StefanKarpinski StefanKarpinski removed the backport 1.9 Change should be backported to release-1.9 label Mar 28, 2023
@StefanKarpinski
Copy link
Sponsor Member Author

This is ready to go from my POV, it has tests and a long comment explaining how (and why) it works. The CI situation seems like a trash fire but I don't think that's due to this PR.

@oscardssmith
Copy link
Member

What are the odds we can combine this with your idea for a 192 bit RNG so we don't have to increase the task size?

@StefanKarpinski
Copy link
Sponsor Member Author

I'd prefer to make those changes independently; they're quite orthogonal. I'm also not fully sold on that whereas this is definitely better. Also, this rounds the task size up from 15 words to 16, which seems like a nicer size.

@JeffBezanson
Copy link
Sponsor Member

Plus the type tag though...

@StefanKarpinski
Copy link
Sponsor Member Author

Woo! Got the check tests passing.

Previously we had this unfortunate behavior:

    julia> Random.seed!(123)
    TaskLocalRNG()

    julia> randn()
    -0.6457306721039767

    julia> Random.seed!(123)
    TaskLocalRNG()

    julia> fetch(@async nothing)

    julia> randn()
    0.4922456865251828

In other words: the mere act of spawning a child task affects the parent
task's RNG (by advancing it four times). This PR preserves the desirable
parts of the previous situation: when seeded, the parent and child RNG
streams are reproducible. Moreover, it fixes the undesirable behavior:

    julia> Random.seed!(123)
    TaskLocalRNG()

    julia> randn()
    -0.6457306721039767

    julia> Random.seed!(123)
    TaskLocalRNG()

    julia> fetch(@async nothing)

    julia> randn()
    -0.6457306721039767

In other words: the parent RNG is unaffected by spawning a child. The
design is documented in detail in a comment preceding the jl_rng_split
function.
@StefanKarpinski
Copy link
Sponsor Member Author

Ok, fixed the build on 32-bit platforms! The issue was that I was using the hash function to generate the fifth seed value in setseed! which produces a UInt32 on 32-bit platforms, so calling the function fails. This causes a crash at build time since we call setseed! during the build. I'm using a simple ad hoc mixing of the seed now.

@StefanKarpinski StefanKarpinski merged commit 7618e64 into master Mar 31, 2023
@StefanKarpinski StefanKarpinski deleted the sk/rng_split branch March 31, 2023 23:22
@giordano giordano added the domain:randomness Random number generation and the Random stdlib label Apr 1, 2023
@KristofferC
Copy link
Sponsor Member

Would be good with a news entry for this.

@StefanKarpinski
Copy link
Sponsor Member Author

Will do.

Xnartharax pushed a commit to Xnartharax/julia that referenced this pull request Apr 19, 2023
vtjnash pushed a commit that referenced this pull request Sep 15, 2023
This PR adds an optional field to the existing `Xoshiro` struct to be
able to faithfully copy the task-local RNG state.

Fixes #51255
Redo of #51271

Background context: #49110 added an additional state to the task-local
RNG. However, before this PR `copy(default_rng())` did not include this
extra state, causing subtle errors in `Test` where `copy(default_rng())`
is assumed to contain the full task-local RNG state.
NHDaly pushed a commit that referenced this pull request Sep 20, 2023
This PR adds an optional field to the existing `Xoshiro` struct to be
able to faithfully copy the task-local RNG state.

Fixes #51255
Redo of #51271

Background context: #49110 added an additional state to the task-local
RNG. However, before this PR `copy(default_rng())` did not include this
extra state, causing subtle errors in `Test` where `copy(default_rng())`
is assumed to contain the full task-local RNG state.
KristofferC pushed a commit that referenced this pull request Oct 3, 2023
This PR adds an optional field to the existing `Xoshiro` struct to be
able to faithfully copy the task-local RNG state.

Fixes #51255
Redo of #51271

Background context: #49110 added an additional state to the task-local
RNG. However, before this PR `copy(default_rng())` did not include this
extra state, causing subtle errors in `Test` where `copy(default_rng())`
is assumed to contain the full task-local RNG state.

(cherry picked from commit 41b41ab)
nalimilan pushed a commit that referenced this pull request Nov 5, 2023
This PR adds an optional field to the existing `Xoshiro` struct to be
able to faithfully copy the task-local RNG state.

Fixes #51255
Redo of #51271

Background context: #49110 added an additional state to the task-local
RNG. However, before this PR `copy(default_rng())` did not include this
extra state, causing subtle errors in `Test` where `copy(default_rng())`
is assumed to contain the full task-local RNG state.

(cherry picked from commit 41b41ab)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:randomness Random number generation and the Random stdlib
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants