Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better default RNG in the future? #27614

Open
PallHaraldsson opened this issue Jun 16, 2018 · 70 comments
Labels
RNG

Comments

@PallHaraldsson
Copy link
Contributor

@PallHaraldsson PallHaraldsson commented Jun 16, 2018

EDIT: "August 2018 Update: xoroshiro128+ fails PractRand very badly. Since this article was published, its authors have supplanted it with xoshiro256**. It has essentially the same performance, but better statistical properties. xoshiro256 is now my preferred PRNG**." I previously quoted: "The clear winner is xoroshiro128+"

https://nullprogram.com/blog/2017/09/21/

"Nobody ever got fired for choosing Mersenne Twister. It’s the classical choice for simulations, and is still usually recommended to this day. However, Mersenne Twister’s best days are behind it. ..

Ultimately I would never choose Mersenne Twister for anything anymore. This was also not surprising."

@StefanKarpinski

This comment has been minimized.

Copy link
Member

@StefanKarpinski StefanKarpinski commented Jun 16, 2018

Too late for this release but the default RNG stream is explicitly documented to not be stable so we can change it in 1.x versions. Another contender is the PCG family of RNGs.

@PallHaraldsson

This comment has been minimized.

Copy link
Contributor Author

@PallHaraldsson PallHaraldsson commented Jun 16, 2018

Ok, I thought last change (people assume stable). It's 7 lines of C code.

@PallHaraldsson

This comment has been minimized.

Copy link
Contributor Author

@PallHaraldsson PallHaraldsson commented Jun 16, 2018

I'm on smartphone, I at least can't locate "(un)stable" at

https://docs.julialang.org/en/latest/stdlib/Random/

@PallHaraldsson

This comment has been minimized.

Copy link
Contributor Author

@PallHaraldsson PallHaraldsson commented Jun 16, 2018

MT, Julia's current default, is "between 1/4 and 1/5 the speed of xoroshiro128+." and MT has "one statistical failure (dab_filltree2)."

"PCG only generates 32-bit integers despite using 64-bit operations. To properly generate a 64-bit value we’d need 128-bit operations, which would need to be implemented in software."

http://xoshiro.di.unimi.it

64-bit Generators
xoshiro256** (XOR/shift/rotate) is our all-purpose, rock-solid generator (not a cryptographically secure generator, though, like all PRNGs in these pages). It has excellent (sub-ns) speed, a state space (256 bits) that is large enough for any parallel application, and it passes all tests we are aware of.

If, however, one has to generate only 64-bit floating-point numbers (by extracting the upper 53 bits) xoshiro256+ is a slightly (≈15%) faster generator with analogous statistical properties. For general usage, one has to consider that its lowest bits have low linear complexity and will fail linearity tests; however, low linear complexity can have hardly any impact in practice, and certainly has no impact at all if you generate floating-point numbers using the upper bits (we computed a precise estimate of the linear complexity of the lowest bits).

If you are tight on space, xoroshiro128** (XOR/rotate/shift/rotate) and xoroshiro128+ have the same speed and use half of the space; the same comments apply. They are suitable only for low-scale parallel applications; moreover, xoroshiro128+ exhibits a mild dependency in Hamming weights that generates a failure after 8 TB of output in our test. We believe this slight bias cannot affect any application.

Finally, if for any reason (which reason?) you need more state, we provide in the same vein xoshiro512** / xoshiro512+ and xoroshiro1024** / xoroshiro1024* (see the paper).

@andreasnoack

This comment has been minimized.

Copy link
Member

@andreasnoack andreasnoack commented Jun 16, 2018

It looks like the timings are based on his own implementation of MT. The one we use is probably quite a bit faster.

Update: The Dieharder test suite is also considered outdated

@PallHaraldsson

This comment has been minimized.

Copy link
Contributor Author

@PallHaraldsson PallHaraldsson commented Jun 16, 2018

Also intriguing-just use AES:

Speed
The speed reported in this page is the time required to emit 64 random bits, and the number of clock cycles required to generate a byte (thanks to the PAPI library). If a generator is 32-bit in nature, we glue two consecutive outputs. Note that we do not report results using GPUs or SSE instructions: for that to be meaningful, we should have implementations for all generators. Otherwise, with suitable hardware support we could just use AES in counter mode and get 64 secure bits in 1.12ns.

@JeffreySarnoff

This comment has been minimized.

Copy link
Contributor

@JeffreySarnoff JeffreySarnoff commented Jun 17, 2018

Extensive testing by others (Daniel Lemire, John D. Cook) favors PCG and notes a few problems with the xor__s. One or the other or another -- it will require Julia-centric testing and benchmarking.

@Nosferican

This comment has been minimized.

Copy link
Contributor

@Nosferican Nosferican commented Jun 18, 2018

Random is a stdlib so it could have breaking changes and a new release before the next release of the language, no?

@StefanKarpinski

This comment has been minimized.

Copy link
Member

@StefanKarpinski StefanKarpinski commented Jun 18, 2018

Yes, it could have breaking changes before Julia 2.0.

@staticfloat

This comment has been minimized.

Copy link
Member

@staticfloat staticfloat commented Jun 19, 2018

Just for fun, I coded up a prototype implementation:

# A quick test to test a new RNG for Julia
using BenchmarkTools

function xorshift1024(N, s)
    x = Array{UInt64}(undef, N)
    p = 0
    for i = 1:N
        # One-based indexing strikes again
        s0 = s[p + 1]
        p = (p + 1)%16
        s1 = s[p + 1]

        # Make sure you use xor() instead of ^
        s1 = xor(s1, s1 << 31)
        s1 = xor(s1, s1 >> 11)
        s0 = xor(s0, s0 >> 30)
        s[p + 1] = xor(s0, s1)

        # I love magic constants
        x[i] = s[p + 1] * 1181783497276652981
    end
    return x
end

function xoroshiro128plus(N, s)
    x = Array{UInt64}(undef, N)
    @inbounds for i = 1:N
        # copy state to temp variables
        s1, s2 = (s[1], s[2])
        
        # Calculate result immediately
        x[i] = s1 + s2
        
        # Mash up state
        s2 = xor(s2, s1)
        s1 = xor(((s1 << 55) | (s1 >> 9)), s2, (s2 << 14))
        s2 = (s2 << 36) | (s2 >> 28)
        
        # Save new state
        s[1], s[2] = (s1, s2)
    end
    return x
end

# Collect some performance data
for N in round.(Int64, 2 .^ range(12, stop=20, length=3))
    println("Testing generation of $N 64-bit numbers")
    local s = rand(UInt64, 16)

    # compare xorshift against rand()
    @btime xorshift1024($N, $s)
    @btime xoroshiro128plus($N, $s)
    @btime rand(UInt64, $N)
end

This gives surprisingly good timings, considering MT has been optimized for many a year:

Testing generation of 4096 64-bit numbers
  13.889 μs (2 allocations: 32.08 KiB)
  8.136 μs (2 allocations: 32.08 KiB)
  6.733 μs (2 allocations: 32.08 KiB)
Testing generation of 65536 64-bit numbers
  219.095 μs (2 allocations: 512.08 KiB)
  130.786 μs (2 allocations: 512.08 KiB)
  100.082 μs (2 allocations: 512.08 KiB)
Testing generation of 1048576 64-bit numbers
  4.658 ms (2 allocations: 8.00 MiB)
  3.245 ms (2 allocations: 8.00 MiB)
  3.118 ms (2 allocations: 8.00 MiB)

I note that @inbounds actually makes xorshift1024 slower, which is why that's not included in the code above.

@StefanKarpinski

This comment has been minimized.

Copy link
Member

@StefanKarpinski StefanKarpinski commented Jun 19, 2018

I don't think it's all that surprising—MT is pretty complex. One of the huge advantages of xoroshiro and PCG is that they're much simpler and easier to generate really complex code for.

@JeffreySarnoff

This comment has been minimized.

Copy link
Contributor

@JeffreySarnoff JeffreySarnoff commented Jun 19, 2018

they put the suit in pseutdorandom

@Godisemo

This comment has been minimized.

Copy link

@Godisemo Godisemo commented Jun 19, 2018

The package RandomNumbers.jl implements many different RNGs, including xoroshiro128+ which is quite a bit faster than the standard RNG from base according to the benchmarks in the documentation.

@PallHaraldsson

This comment has been minimized.

Copy link
Contributor Author

@PallHaraldsson PallHaraldsson commented Jun 19, 2018

Is Float rand more important than Int rand? It seems it may matter:

@JeffreySarnoff thanks for the links; still "favors PCG and notes a few problems with the xor__s." seems not conclusive nor comparing to latest variants.

Java's shiftmix64 is also interesting.

Also Google's
https://github.com/google/randen
as per AES comment above.

@PallHaraldsson

This comment has been minimized.

Copy link
Contributor Author

@PallHaraldsson PallHaraldsson commented Jun 19, 2018

sunoru/RandomNumbers.jl#1 (comment)

@sunoru "Yes, I have considered accelerating some RNGs with GPU"

I guess AES is not available on GPUs so that may be a factor to NOT choose AES as default on CPUs? I.e. would we want same default on GPU and CPU?

@StefanKarpinski

This comment has been minimized.

Copy link
Member

@StefanKarpinski StefanKarpinski commented Jun 19, 2018

In general, it's fine (and often good) to drop some extra output bits. In PCG, IIRC, you split the internal output into external output bits and permutation bits in a way that can be adjusted, raising the possibility that we could use different output steps based on the same state to generate 64 bits of integer output versus 53 bits of floating-point output. Another approach would be to use different RNG streams for different bit sizes, which is appealing for being able to generate random 128-bit values, for example. (A pair of 128-bit streams may not be as good as a single 128-bit stream, for example, unless the two streams have coprime periods which is hard to arrange, so it can be quite helpful to have different sized streams that take advantage of their increased size to produce better streams.)

@PallHaraldsson PallHaraldsson changed the title Better default rand for 1.0. Xoroshiro256? Better default rand for 1.0. Xo(ro)shiro256**? Jun 19, 2018
@StefanKarpinski StefanKarpinski changed the title Better default rand for 1.0. Xo(ro)shiro256**? Better default RNG in the future? Jun 19, 2018
@PallHaraldsson

This comment has been minimized.

Copy link
Contributor Author

@PallHaraldsson PallHaraldsson commented Jun 19, 2018

"The clear winner is xoroshiro128+" (seems outdated) @staticfloat thanks for coding it. Actually "xoshiro256**" seems to be their best; note dropped letters, I first thought a typo.

@sunoru

This comment has been minimized.

Copy link
Contributor

@sunoru sunoru commented Jun 19, 2018

I'm sorry that I haven't updated that package for long. But maybe this is worth a look: http://sunoru.github.io/RandomNumbers.jl/latest/man/benchmark/

My personal choice is xoroshiro128+. Since some RNGs in PCG don't pass the big crush as its paper says, I don't know if PCG is as reliable as it sounds.

@StefanKarpinski

This comment has been minimized.

Copy link
Member

@StefanKarpinski StefanKarpinski commented Jun 19, 2018

PCG not passing big crush is indeed an issue. That seems like a strange disconnect.

@staticfloat

This comment has been minimized.

Copy link
Member

@staticfloat staticfloat commented Jun 19, 2018

@sunoru I was just looking at your RandomNumbers.jl page; that is some really nice work you've put together. IMO, discussion about a better RNG should be informed by your work as you've put in the time and effort to actually compare against quite a number of different PRNGs. And yes, it does appear true that xoroshiro128+ is the winner in all aspects so far; however I am not sure what the test for quality is nowadays; I've heard of BigCrush, DieHarder, TestU01, etc..... I'm sure Andreas can answer which is the test to pay attention to these days. :)

@JeffreySarnoff

This comment has been minimized.

Copy link
Contributor

@JeffreySarnoff JeffreySarnoff commented Jun 19, 2018

The real issue with PCG is that it is not being well-tended. I don't know anything about why it got crushed -- if there were more effort behind that family, a remedy modification may have been made but there is none. So, going with new information: I still like PCG .. to use it for what it does best .. and with the failure, that's not this.

@chethega

This comment has been minimized.

Copy link
Contributor

@chethega chethega commented Jun 26, 2018

If AES-CTR turns out to be fast enough on most architectures, then I think this would be absolutely fantastic as the default RNG.

This eradicates an entire class of security bugs and relieves people from the mental burden of considering the quality of the random stream and how to use it. And people who need something faster can still switch this out.

The only big isssue is architectures without hardware support for AES (then it would almost surely be too slow). So the bitter pill to swallow is that the random stream would become architecture dependent (e.g. older raspberry pi don't have aes acceleration and should fall back on something else).

@ViralBShah

This comment has been minimized.

Copy link
Member

@ViralBShah ViralBShah commented Jun 26, 2018

I would be ok with having a different one on different architectures.

@chethega

This comment has been minimized.

Copy link
Contributor

@chethega chethega commented Jun 28, 2018

After looking at http://www.cs.tufts.edu/comp/150FP/archive/john-salmon/parallel-random.pdf, I am more convinced that AES would be a really good default on supporting archs: Fast enough (faster than MT on a lot of CPUs, and does not eat as much L1), practically perfect quality, and no possibility of people abusing rand(UInt64) to generate session cookies and getting pwned for it.

@PallHaraldsson

This comment has been minimized.

Copy link
Contributor Author

@PallHaraldsson PallHaraldsson commented Dec 27, 2018

@ViralBShah (on xoroshiro128** or you mean on their older RNG?) "I thought those RNG don't pass the RNG testsuite." I googled, and assume you mean:

https://github.com/andreasnoackjensen/RNGTest.jl/ that I found here: #1391

I think the discussion should continue here, not the startup time issue, while I'm not sure the default RNG needs to pass all tests (I'm not sure it does NOT do that), if you can easily opt into a better RNG.

At http://xoshiro.di.unimi.it/

there's both e.g. xoroshiro128** and xoroshiro128+ to look into.

and I only see a minor flaw on the latter:

Quality

This is probably the more elusive property of a PRNG. Here quality is measured using the powerful BigCrush suite of tests. BigCrush is part of TestU01, a monumental framework for testing PRNGs developed by Pierre L'Ecuyer and Richard Simard (“TestU01: A C library for empirical testing of random number generators”, ACM Trans. Math. Softw. 33(4), Article 22, 2007).
[..]
Beside BigCrush, we analyzed our generators using a test for Hamming-weight dependencies described in our paper. As we already remarked, our only generator failing the test (but only after 5 TB of output) is xoroshiro128+. [..]

@jdchn

This comment has been minimized.

Copy link

@jdchn jdchn commented Dec 29, 2018

An architecture-dependent default PRNG might be a little awkward since one of the primary benefits of PRNG is reproducibility.

@mbauman mbauman added the RNG label Dec 31, 2018
@vigna

This comment has been minimized.

Copy link

@vigna vigna commented Oct 29, 2019

This is a separate concern, but did you consider the fact that using the [1..2) technique you get only 52 instead of 53 significant bits ? That's a tradeoff that probably the user should be aware of. Also, in my tests on recent hardware I could measure no real difference (provided you precompute 1 / 2^53 and use exactly 53 bits).

BTW, for floats xoshiro256+ is perfectly fine (using the upper 53 bits).

I'm a bit surprised by your second benchmark—in principle, the conversion to float should have the same cost for everybody, so the speed comparison in an integer test should give approximately the same results of the comparison on doubles. I wouldn't mind having a look at the code if that's fine for you.

Presently I'm in favor of xoshiro256++.

@rfourquet

This comment has been minimized.

Copy link
Contributor

@rfourquet rfourquet commented Oct 29, 2019

This is a separate concern, but did you consider the fact that using the [1..2) technique you get only 52 instead of 53 significant bits

Indeed, but this was to benchmark equivalent things. There has been quite a number of discussions in Julia's dicourse / github to change that.

BTW, for floats xoshiro256+ is perfectly fine (using the upper 53 bits).

Yes, I would personally not mind including different variants, as they are so simple to implement.

in principle, the conversion to float should have the same cost for everybody

In general yes, but in this case there is a native array generation method for MT, i.e. the "kernel" (dSFMT) writes directly into the array, instead of writing into a cache array within MersenneTwister struct, and then fetching one value at a time from this cache into the destination array (this is the implementation detail of MersenneTwister in Julia!).

But there is good news, by writing naïve array generation for xoshiro256++, the array versions becomes 25% faster for UInt64, and about 15% slower for Float64 (down from 40%). This would be enough for me to adopt xoshiro256++ as the default generator in Julia, and even more so as xoshiro+ might give even better results for Float64.

BTW, thanks for having put you C code for the xoshiro family in the public domain, that makes trying it out very easy!

@vigna

This comment has been minimized.

Copy link

@vigna vigna commented Oct 29, 2019

Ahhh OK you didn't tell me you were using the dSFMT. Yes, I acknowledge that in my shootout page:

xoshiro256+ is ≈20% slower than the dSFMT, but it has a doubled range of output values, does not need any extra SSE instruction (can be programmed in Java, etc.), has a much smaller footprint, and does not fail any test.

If you're using extended instructions and you don't mind being able to generate just half of the possible numbers, that's OK. But remember that the dSFMT fails linearity tests, and that bias can filter through computations (see my note above) in very unexpected ways.

@rfourquet

This comment has been minimized.

Copy link
Contributor

@rfourquet rfourquet commented Oct 29, 2019

Ahhh OK you didn't tell me you were using the dSFMT.

Ah sorry! I'm used to just writing MersenneTwister in this github to mean the Julia's builtin version, which uses dSFMT.

So yes we are aware that it's not ideal to use the technique used be dSFMT to produce Float64. And regularly a user on discourse will point at this fact and be surprised about it. I don't think it's a big problem for most purposes, but better to improve that if possible.

So I tested xoshiro+, and for the array version, it becomes only about 7% slower than MT (again, with using the same technique, with half-range).

I personally would prefer the "full-range" float generation at the cost of a bit of performance, but I don't know if this is a shared feeling.

@rfourquet

This comment has been minimized.

Copy link
Contributor

@rfourquet rfourquet commented Oct 29, 2019

Anyway, I vote for making xoshiro256++ the new Julia default RNG :)

And by the way it's really nice to have you chime in @vigna !

@rfourquet

This comment has been minimized.

Copy link
Contributor

@rfourquet rfourquet commented Oct 29, 2019

Plus, I forgot to mention earlier, the code to implement the jump functionality is also available, and is quite short (should be about 20 lines of code), when you compare with the equivalent for MT (which is about 150 lines of code, including simple implementation of polynomials over GF(2) !)

@ViralBShah

This comment has been minimized.

Copy link
Member

@ViralBShah ViralBShah commented Oct 29, 2019

Does our RNG policy allows us to replace the default generator? I think so, but just checking. If so, this sounds like a good idea.

@vigna

This comment has been minimized.

Copy link

@vigna vigna commented Oct 29, 2019

To be true, I believe the Mersenne Twister comes with a generic jump implementation that makes it possible to jump ahead any number of steps: for that, you need computation of polinomials over Z/2Z modulo the characteristic polynomial, which needs code.

For speed and convenience, I provide a method for long jumps and one for large jumps. The idea is that you use large jumps to get different streams for multiple computational units, but then you can short jump and get a large number of streams in each unit (e.g., threads). Polynomials are precomputed and stored as bit arrays, but the technique is the same.

@vigna

This comment has been minimized.

Copy link

@vigna vigna commented Oct 29, 2019

So I tested xoshiro+, and for the array version, it becomes only about 7% slower than MT (again, with using the same technique, with half-range).

It would be interesting to see whether interleaving carefully computation and conversion can help there. Something like, you first fill the array and then you do a pass to convert to double (but you'll need C-style unions to do that). Or, even, having a temporary variable for the last result and at each iteration convert to double the previous result and store it while you compute the next state, and then store it in the temporary variable. Breaking dependency chains can have very surprising results sometimes.

@rfourquet

This comment has been minimized.

Copy link
Contributor

@rfourquet rfourquet commented Oct 29, 2019

To be true, I believe the Mersenne Twister comes with a generic jump implementation that makes it possible to jump ahead any number of steps

Ah right, I forgot about that.

It would be interesting to see whether interleaving carefully computation and conversion can help there.

Actually, there is no cost in Julia to convert a UInt64 to Float64. I still tried you suggestion to first do the bit-twidling work for the whole array in a loop, and then to substract 1.0 from all positions, but the performance are quite worse. I also indeed tried a bit to see how to "break dependency chains", but without success in this case. Basically at each iteration, bit-twidling and one store in the array must happen.

But surprisingly for me, it's not slower to use the "full-range" generation, i.e., given a random UInt64 value u, using (u >> 11) * 2.0^-53 instead of reinterpret(Float64, 0x3ff0000000000000 | (u >> 12)) - 1.0.

For me, it's totally fine as is and potential improvements could be addressed later, but let's see what other people think.

@rfourquet

This comment has been minimized.

Copy link
Contributor

@rfourquet rfourquet commented Oct 29, 2019

Does our RNG policy allows us to replace the default generator?

I would believe so. We explicitly documented that the generated streams of numbers can change, and we don't say anything about the stability of the type of the default rng. The only semi-official thing was to use Random.GLOBAL_RNG in the cases when you implement new functions which need to take an optional RNG, but even that got its type changed in 1.3 (it's now of type Random._GLOBAL_RNG instead of MersenneTwister).

@vigna

This comment has been minimized.

Copy link

@vigna vigna commented Oct 29, 2019

But surprisingly for me, it's not slower to use the "full-range" generation, i.e., given a random UInt64 value u, using (u >> 11) * 2.0^-53 instead of reinterpret(Float64, 0x3ff0000000000000 | (u >> 12)) - 1.0.

Yes, that was what I was telling you. I think that FPUs have a very very fast execution path if you just have the significant bits and you shift them in. It might be slower, for example, if you don't do the >>11 shift and divide by 2.0^-64, because I think that in that case there's some rounding going on.

@StefanKarpinski

This comment has been minimized.

Copy link
Member

@StefanKarpinski StefanKarpinski commented Oct 30, 2019

Actually, there is no cost in Julia to convert a UInt64 to Float64.

This is kind of true: the reinterpret itself is a no-op, but in order to do most integer operations on floating-point values in hardware, it's necessary to move the value from a floating-point register to an integer register (and back if you want to continue using as floating-point), which does take a cycle in each direction, so two cycles to go back and forth. So it's not free in practice.

@chethega

This comment has been minimized.

Copy link
Contributor

@chethega chethega commented Oct 30, 2019

I don't get how you people got that speed between xoshiro256++ and dsfmt is even comparable. With proper simd / avx2, xoshoro256++ is almost 4x faster than dsfmt for integer bulk generation, on my setup. No way in hell is the int->float conversion going to turn that into a loss.

On my machine I get down to ~0.14 cpb:

julia dsfmt 1024 x UInt64
  2.089 μs (0 allocations: 0 bytes)
ref impl, 1 x interleaved, 1024 x UInt64
  1.579 μs (0 allocations: 0 bytes)
ref impl, 4 x interleaved, 1024 x UInt64
  2.150 μs (0 allocations: 0 bytes)
julia 1 x interleaved 1 x SIMD, 1024 x UInt64
  4.336 μs (0 allocations: 0 bytes)
  1.838 μs (0 allocations: 0 bytes)
julia 1 x interleaved 2 x SIMD, 1024 x UInt64
  2.709 μs (0 allocations: 0 bytes)
  1.384 μs (0 allocations: 0 bytes)
julia 1 x interleaved 4 x SIMD, 1024 x UInt64
  1.487 μs (0 allocations: 0 bytes)
  709.241 ns (0 allocations: 0 bytes)
julia 1 x interleaved 8 x SIMD, 1024 x UInt64
  927.750 ns (0 allocations: 0 bytes)
  604.148 ns (0 allocations: 0 bytes)
julia 1 x interleaved 16 x SIMD, 1024 x UInt64

with beautiful code (allmost no branches, everything 256 bit wide; if I had a AVX512 machine, this would be even better).

CF https://gist.github.com/chethega/344a8fe464a4c1cade20ae101c49360a

Can anyone link what xoshiro code you benchmarked? Maybe I just suck at C?

@vigna

This comment has been minimized.

Copy link

@vigna vigna commented Oct 30, 2019

I don't get how you people got that speed between xoshiro256++ and dsfmt is even comparable. With proper simd / avx2, xoshoro256++ is almost 4x faster than dsfmt for integer bulk generation, on my setup. No way in hell is the int->float conversion going to turn that into a loss.

Wow, now you're making me shiver. 0.14 cycle/B? The vectorization is performed by the compiler or did you handcraft it (sorry, totally not proficient with Julia)?

@chethega

This comment has been minimized.

Copy link
Contributor

@chethega chethega commented Oct 30, 2019

I'm running an Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz, i.e. a pretty anemic laptop broadwell.

The vectorization was by hand, the code is in the gist, but small enough to post here:

using SIMD, BenchmarkTools
mutable struct xstate{N}
    s0::Vec{N, UInt64}
    s1::Vec{N, UInt64}
    s2::Vec{N, UInt64}
    s3::Vec{N, UInt64}
end
macro new(T)
    return Expr(:new, T)
end

function init(xs::xstate)
    r=rand(UInt8, sizeof(xs))
    GC.@preserve xs, r
        ccall(:memcpy, Cvoid, (Ptr{Nothing}, Ptr{Nothing}, Csize_t), pointer_from_objref(xs), pointer(r), sizeof(xs))
    nothing
end

@inline rotl(a, k) = (a<<k) | (a>>(64-k))

@inline function next_nostore((s0, s1, s2, s3))
    res = rotl(s0 + s3, 23) + s0
    t = s1 << 17
    s2 ⊻= s0
    s3 ⊻= s1
    s1 ⊻= s2
    s0 ⊻= s3
    s2 ⊻= t
    s3 = rotl(s3, 45)
    return (res, (s0, s1, s2, s3))
end

function XfillSimd_nostore(dst, xs::xstate{N}, num) where N
    lane = VecRange{N}(0)
    s = xs.s0, xs.s1, xs.s2, xs.s3
    @inbounds for i=1:N:num
        dst[lane+i], s = next_nostore(s)
    end
    xs.s0, xs.s1, xs.s2, xs.s3 = s
    nothing
end
N=1024; dst=zeros(UInt64, N);
xs = @new xstate{8}; init(xs)
@btime XfillSimd_nostore($dst, $xs, $N)

Note that this uses 8 independent RNGs for bulk generation: advance 8 states, generate 8 outputs, write 8 outputs to destination, repeat.

This is no issue, just needs some extra seeding logic (seed in bulk). In the julia code, this looks like 1 rng whose state consists of four <8 x UInt64> vectors.

PS. I also implemented the 5 lines of xoshiro256** in the gist. It is much slower, presumably due to the giant latency of the multiplication. The generated inner loop for 4 x xoshiro256++ is a beauty:

L144:
	vpsrlq	$41, %ymm4, %ymm6
	vpsllq	$23, %ymm4, %ymm4
	vpor	%ymm4, %ymm6, %ymm4
	vpaddq	%ymm0, %ymm4, %ymm4
	vmovdqa	%ymm5, %ymm0
	movq	(%rdi), %rcx
	vmovdqu	%ymm4, -8(%rcx,%rdx,8)
	cmpq	%rdx, %rax
	je	L266
	addq	$4, %rdx
	vpaddq	%ymm0, %ymm2, %ymm4
	vpsllq	$17, %ymm1, %ymm6
	vpxor	%ymm0, %ymm3, %ymm3
	vpxor	%ymm1, %ymm2, %ymm2
	vpxor	%ymm1, %ymm3, %ymm1
	vpxor	%ymm0, %ymm2, %ymm5
	vpxor	%ymm6, %ymm3, %ymm3
	vpsllq	$45, %ymm2, %ymm6
	vpsrlq	$19, %ymm2, %ymm2
	vpor	%ymm6, %ymm2, %ymm2
	jns	L144
@chethega

This comment has been minimized.

Copy link
Contributor

@chethega chethega commented Oct 31, 2019

So, after consideration:

The proposed setup, if we wanted to switch to xoshiro256++, would be to reserve 5 cache-lines of RNG state (ach thread has its own completely independent generator). One CL is for sequential generation, and laid out like xstate{2}, with an implementation that only advances the first of the RNGs for 1-8 byte outputs, and advances both for 16-byte outputs. This is used in code that uses rand(some_small_bitstype).

Then, there are 4 cache-lines for bulk generations, laid out like xstate{8}.

These 4 cachelines aren't hot for code that doesn't use bulk generation, so nobody cares about state-size.

There sequential generator uses xstate{2} layout because I have no better idea what to do with the remaining half of the line, so we may well use it to speed up 16byte-wide generation; and there is no reason to compactify the layout of the primary generator (sequential generation of <= 8 byte numbers), since we can't use wide SIMD loads anyway.

The sequential generator only exists because: (1) sequential generation should only fill in a single cache-line of state, and (2) bulk generation wants to use wide vector loads/stores for the state. Code that uses both sequential and bulk generation will therefore fill 5 instead of 4 lines; doesn't matter, who cares.

The sequential generator should be properly inlined (verify that, and possibly add @inline). Unfortunately I see no good way of getting the compiler to optimize out the stores to xstate in tight loops; I assume that this is in order to ensure correct behaviour when the store traps? But we don't deal with segfaults anyway, so maybe there is some magic llvm annotation that tells llvm to not care about correct xstate after trapping? I am not opposed to writing the whole RNG in a giant llvmcall.

There is no reason to implement jump. We only need seed! and output generation.

How do we tell the allocator to give us 64-byte aligned mutable struct?

In total, I think this would be a good move, both for speed and in order to cut out the dependency. One would need to port a variant of my code over to Base-julia (i.e. without SIMD.jl) and verify that the resulting @code_native is OK on all supported archs, and that the 8 x wide SIMD doesn't hurt perf on weaker machines (i.e. check that the compiler doesn't spill registers). Someone with access to AVX512, POWER and various ARM should chime in at some point. Worst case we can stick with 4 x wide simd and save 2 cache lines. Downside is slightly lower perf with AVX2 and presumably massive perf degradation with AVX512. SIMD widths are increasing all the time, so being future-proof without breaking random stream would be nice.

On the other hand, if we change the default RNG at all, I would still prefer a cryptographic choice; if that is rejected and if scientific consensus says that xoshiro gives "good enough" streams, then that's good enough for me.

@rfourquet

This comment has been minimized.

Copy link
Contributor

@rfourquet rfourquet commented Oct 31, 2019

I don't get how you people got that speed between xoshiro256++ and dsfmt is even comparable.

Hey, I just ported naïvely @vigna's code to Julia, I wanted to have an actualized performance comparison compared to last year and with xoshiro256++ specifically. It was good enough for me, and I currently don't have the skills to write simd-optimized code! :D

What you did is amazing, with your benchmark above, this gets 2.6 times faster than my array generation routine.
If this is what you asked, here is my code:

using Random

mutable struct Xoropp <: AbstractRNG
    s0::UInt64
    s1::UInt64
    s2::UInt64
    s3::UInt64
end

rotl(x::UInt64, k::UInt) = (x << k) | (x >> ((64 % UInt) - k))

function Base.rand(rng::Xoropp, ::Random.SamplerType{UInt64})
    s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3

    result = rotl(s0 + s3, UInt(23)) + s0

    t = s1 << 17

    s2 ⊻= s0
    s3 ⊻= s1
    s1 ⊻= s2
    s0 ⊻= s3

    s2 ⊻= t

    s3 = rotl(s3, UInt(45))

    rng.s0, rng.s1, rng.s2, rng.s3 = s0, s1, s2, s3

    result
end


# array generation code: nothing magic, but somehow gets quite faster than the default
# code for array, which does `rand(rng, UInt64)` in a loop
function Random.rand!(rng::Xoropp, A::AbstractArray{UInt64},
                      ::Random.SamplerType{UInt64})
    s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3

    @inbounds for i = eachindex(A)
        A[i] = rotl(s0 + s3, UInt(23)) + s0

        t = s1 << 17

        s2 ⊻= s0
        s3 ⊻= s1
        s1 ⊻= s2
        s0 ⊻= s3

        s2 ⊻= t

        s3 = rotl(s3, UInt(45))
    end
    rng.s0, rng.s1, rng.s2, rng.s3 = s0, s1, s2, s3
    A
end

# for Float64 generation
Random.rng_native_52(::Xoropp) = UInt64

I can't really comment on your "proposed setup" above, as I have not much ideas above cache lines and such.

@StefanKarpinski

This comment has been minimized.

Copy link
Member

@StefanKarpinski StefanKarpinski commented Oct 31, 2019

@chethega, if I'm following your description, it sounds like there would be two separate generator states in that situation—sequential and array. Does that mean that calling rand() would not advance the array generator state and calling rand(1000) would not advance the sequential generator state?

@chethega

This comment has been minimized.

Copy link
Contributor

@chethega chethega commented Oct 31, 2019

@chethega, if I'm following your description, it sounds like there would be two separate generator states in that situation—sequential and array. Does that mean that calling rand() would not advance the array generator state and calling rand(1000) would not advance the sequential generator state?

That is correct!

In even more detail, each thread would have 10 independent xoshiro instances. 1-64 bit sequential user requests would advance rng_1, 65+ bit sequential user requests would advance rng_1 and rng_2, and all bulk user requests would advance rng_3-rng_10. rng_1 and rng_2 would be laid out together in an xstate{2}, and the bulk generators would be laid out in an xstate{8}.

Hey, I just ported naïvely

Sorry, I didn't want to impugn your code. You do awesome work in keeping julia's random infrastructure together, and your APIs rock!

What you did is amazing

That's too much praise ;) I just read the assembly, noticed that it sucks, had the audacity to propose 10 interleaved xoshiro instances instead of a single one, and passed a Vec{8, UInt64} instead of a single UInt64 into the RNG function. Praise belongs more to @vigna for xoshiro and @eschnett for SIMD.jl. I guess we should also port the bulk generator to C, such that @vigna can use it as well. Would be a fun exercise in intel intrinsics (alternative: copy-paste the generated julia code and make it inline assembly. While at it, we can also clean it up: I see no reason that the hot loop has two conditional branches when we could do with a single one, probably some weird size minimization).

PS:

array generation code: nothing magic, but somehow gets quite faster than the default
code for array, which does rand(rng, UInt64) in a loop

That is because you hoisted writing and reading the state out of the loop. This is a super important optimization, and is the main property of my nostore variant. I feel fine with this for things with a pointer, but am unsure about AbstractArray: If the setindex! calls rand again, then we would generate duplicate random numbers. We don't know the setindex! implementation: Maybe the AbstractArray is really backed by a very complicated sparse array that internally uses randomized algorithms?

@rfourquet

This comment has been minimized.

Copy link
Contributor

@rfourquet rfourquet commented Nov 1, 2019

That is because you hoisted writing and reading the state out of the loop

I assumed so, but was surprised that the compiler doesn't do this seemingly easy optimization by itself. But your point about setindex! being allowed to interleave a call to the same generator helps me understand why this is not trivial.
(The faster version which assumes no weird setindex! should probably be made available to array implementations in some way).

Also, I realized that in my benchmarks above, I compared xoshiro++ to a local instance of MersenneTwister, not to the global RNG. In 1.3, with threading, its performance got a serious hit (by more than 2x), so it would be interesting to test xoshiro++ as the default global RNG.

@vigna

This comment has been minimized.

Copy link

@vigna vigna commented Nov 1, 2019

So, please help me: why won't this be vectorized by gcc or clang? Here XOSHIRO256_UNROLL=8 and I'm using -O3 -mavx2...

static inline uint64_t next(void) {
    uint64_t t[XOSHIRO256_UNROLL];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) result[i] = rotl(s0[i] + s3[i], R) + s0[i];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) t[i] = s1[i] >> A;

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s2[i] ^= s0[i];
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s3[i] ^= s1[i];
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s1[i] ^= s2[i];
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s0[i] ^= s3[i];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s2[i] ^= t[i];

    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s3[i] = rotl(s3[i], B);
    // This is just to avoid dead-code elimination
    return t[0] ^ t[XOSHIRO256_UNROLL-1];
}
@staticfloat

This comment has been minimized.

Copy link
Member

@staticfloat staticfloat commented Nov 1, 2019

Try running clang with -Rpass-missed=loop-vectorize or -Rpass-analysis=loop-vectorize. That should print out debugging information showing which loops were not vectorized, which statements stopped vectorization, etc...

Reference: https://llvm.org/docs/Vectorizers.html#diagnostics

@vigna

This comment has been minimized.

Copy link

@vigna vigna commented Nov 1, 2019

I got this from clang:

./harness.c:83:11: remark: loop not vectorized: value that could not be identified as reduction is used outside the loop [-Rpass-analysis=loop-vectorize]
uint64_t e = -1;

But this is the external benchmarking loop. There's never a warning saying it's not optimizing the loops above. :(

From what I've read the problem would be that I'm using memory and not registers (i.e., variables aren't local).

@staticfloat

This comment has been minimized.

Copy link
Member

@staticfloat staticfloat commented Nov 1, 2019

Have you tried with -Rpass=loop-vectorize to see if it actually does think it's vectorizing the loop?

@chethega

This comment has been minimized.

Copy link
Contributor

@chethega chethega commented Nov 2, 2019

@vigna I don't know why your code didn't vectorize, but I am quite happy with clang's output for the following. The generated assembly is even nicer to read than julia's. Feel free to add this to your reference implementation, no attribution required.

#include <stdint.h>

#define XOSHIRO256_UNROLL 4

static inline uint64_t rotl(const uint64_t x, int k) {
	return (x << k) | (x >> (64 - k));
}

typedef struct {uint64_t s[4][XOSHIRO256_UNROLL];} xstate;

xstate global_state;

#define next_macro(result, s) { \
    uint64_t t[XOSHIRO256_UNROLL]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) result[i] = rotl(s[0][i] + s[3][i], 23) + s[0][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) t[i] = s[1][i] << 17; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[2][i] ^= s[0][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[3][i] ^= s[1][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[1][i] ^= s[2][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[0][i] ^= s[3][i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[2][i] ^= t[i]; \
    for(int i = 0; i < XOSHIRO256_UNROLL; i++) s[3][i] = rotl(s[3][i], 45); \
} \

void rngbulk(uint64_t* restrict dst, xstate* restrict state, unsigned int len){
	xstate slocal = *state;
	uint64_t result[XOSHIRO256_UNROLL];
	unsigned int len_trunc = len - (len % XOSHIRO256_UNROLL);
	for(int j = 0; j < len_trunc; j += XOSHIRO256_UNROLL){
		next_macro( (result) , (slocal.s) );
		for(int k=0; k<XOSHIRO256_UNROLL; k++) { dst[j+k] = result[k]; }
	}
	
	if(len_trunc != len){
		next_macro( (result) , (slocal.s) );
		for(int k = 0; k < len-len_trunc; k++){ dst[len_trunc + k] = result[k];	}
	}
	*state = slocal;
}

void rngbulk_globstate(uint64_t* restrict dst, unsigned int len){ rngbulk(dst, &global_state, len);}

edit: removed memory corruption if len is not divisble by XOSHIRO256_UNROLL.

@vigna

This comment has been minimized.

Copy link

@vigna vigna commented Nov 2, 2019

Ah OK! Just switching to s[4][XOSHIRO256_UNROLL] instead of s0[XOSHIRO256_UNROLL], etc. made my code vectorize with gcc 💪🏻.

Now I have to understand why clang does yours and not mine, but we're getting there...

@vigna

This comment has been minimized.

Copy link

@vigna vigna commented Nov 2, 2019

Please note that the >> A should be << A. 🤦🏻‍♂️

@vigna

This comment has been minimized.

Copy link

@vigna vigna commented Nov 2, 2019

And, most fundamental, it will not happen with -O3. It has to be -O2 -ftree-vectorize. There must be some other optimization that jumps in and prevents vectorization. Weird.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.