
A new RNG for Julia? #8786

Closed
andreasnoack opened this issue Oct 23, 2014 · 50 comments
Labels
domain:randomness Random number generation and the Random stdlib

Comments

@andreasnoack
Member

It has been discussed whether we should consider this RNG (xorshift*) alongside, or instead of, dSFMT.

The RNG we are using now is a Mersenne Twister implementation adapted to use SIMD, but we don't really get anything from SIMD right now because of #6573.

In addition, dSFMT actually consistently fails one of the tests in BigCrush in contrast to the proposed XORShift* algorithm.

Finally, XORShift* is very simple and can be written in Julia. See some implementations with various performance tweaks here:

https://gist.github.com/andreasnoack/9f3f63d7f1d00a07359b
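For readers skimming the thread, the whole algorithm family is tiny. Here is a minimal sketch of one xorshift* step in Python rather than Julia (the 12/25/27 shift constants and the multiplier are the ones from Vigna's xorshift64* variant; the masking emulates UInt64 wraparound):

```python
MASK = (1 << 64) - 1  # emulate 64-bit unsigned wraparound

def xorshift64star(state):
    """One step of xorshift64*: three shift/xor steps, then a multiply.
    Returns (new_state, output)."""
    x = state
    x ^= x >> 12
    x ^= (x << 25) & MASK
    x ^= x >> 27
    return x, (x * 2685821657736338717) & MASK

state = 1  # any nonzero seed works
state, out = xorshift64star(state)
```

The shift/xor state update alone is plain xorshift; the final multiply by a constant is what upgrades it to the xorshift* family discussed here.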

@ivarne
Sponsor Member

ivarne commented Oct 23, 2014

Adding new RNG implementations would be much less disruptive if we got rid of the notion of a global RNG state that can be initialized with srand(::Uint32), and instead started passing RNG objects to all functions that need a repeatable sequence of random numbers.

@stevengj
Member

@ivarne, retrofitting existing code to get repeatable random numbers would then be a big chore. A common use case is to use non-repeatable random numbers (with global RNG state), and only use srand when you are trying to reproduce some case for debugging. It is much easier just to call rand without passing an RNG object for most uses.

@simonbyrne
Contributor

If we can get roughly the same performance and randomness guarantees, then we should definitely do this. I agree with @ivarne that passing around RNG objects makes the most sense, particularly if we want this to be useful for parallel computing.

@JeffBezanson
Sponsor Member

Of course, at the very least you should be able to pass around an RNG object and have everything work.

@simonbyrne
Contributor

I think a reasonable model to copy is the IO interface. We can still have a global RNG object, but if you are writing a library function that uses random numbers, then you should allow it to accept an RNG argument (with the default being the global RNG)
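In Python terms (purely to illustrate the IO-style pattern, not the actual Julia API), the "explicit argument with a global default" shape looks like this:

```python
import random

GLOBAL_RNG = random.Random()  # module-level default, playing the role of the global RNG

def random_walk(steps, rng=GLOBAL_RNG):
    """A library function that consumes randomness: it accepts an explicit
    RNG argument, defaulting to the global one."""
    pos = 0
    for _ in range(steps):
        pos += 1 if rng.random() < 0.5 else -1
    return pos

casual = random_walk(10)                   # casual use: global state, non-repeatable
a = random_walk(10, random.Random(1234))   # explicit RNG: repeatable
b = random_walk(10, random.Random(1234))
assert a == b
```

Casual callers never see the RNG object, while tests and parallel code can pass their own; that is exactly the trade-off being argued for above.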

@ViralBShah
Member

We can certainly replace dSFMT, and it is good to remove yet another binary dependency. dSFMT could still be useful if someone uses the array interface directly, warts and all, but for that purpose it can easily live in a package.

@stevengj
Member

No one is arguing that you should not be able to pass around RNG objects; the question is whether you should be required to do so. The latter would be disruptive and awkward to use, in my opinion.

I agree that the model of optional IO arguments is a good pattern to follow here. (Especially since an RNG object can be thought of as a read-only stream.) Note that even our rand function does not fully follow this model: you can do rand(Int) but not rand(rng, Int), for example. However, before we embark on this, it would be good to have a better sense of whether users actually want this. Right now, it seems to be mainly the people implementing RNG support who want to pass around explicit RNG state, which is not the same thing as a real need for this in applications.

@stevengj
Member

But on the present topic, what is the downside to replacing dSFMT as the default RNG (assuming the xorshift* algorithm is as fast and good as the authors claim)? It doesn't seem likely that there are many users who are heavily reliant on a particular sequence of pseudorandom numbers (as opposed to just the ability to have a reproducible sequence). The only disruption I can imagine is in regression tests where someone hard-coded the expected result from a particular random sequence.

@stevengj
Member

Marsaglia certainly has a good reputation in the random-number field, and xorshift* seems to have undergone enough testing that it would be reasonable to trust. The only question in my mind is the performance of a pure-Julia implementation compared to dSFMT, but the implementation seems quite easy and straightforward.

@stevengj
Member

A quick (untested) transcription of the C 1024-bit xorshift* algorithm:

import Base: rand, rand!
type Xorshift1024 <: AbstractRNG
    s::Vector{Uint64}
    p::Uint64
    Xorshift1024() = new(rand(Uint64, 16), 1)
end
function rand(rng::Xorshift1024)
    p = rng.p
    s = rng.s
    @inbounds s0 = s[p]
    p = (p & 15) + 1 # wrap 1-based index (parenthesized for clarity; & binds tighter than + in Julia)
    s1 = s[p]
    s1 $= s1 << 31 # a
    s1 $= s1 >> 11 # b
    s0 $= s0 >> 30 # c
    rng.p = p
    @inbounds val = s[p] = s0 $ s1
    return val * 1181783497276652981
end
function rand!(rng::Xorshift1024, a::AbstractArray{Uint64})
    for i = 1:length(a)
        @inbounds a[i] = rand(rng)
    end
    return a
end
rand(rng::Xorshift1024, n::Int) = rand!(rng, Array(Uint64, n))

The result is about 30% slower than our current rand in the following test:

rng = Xorshift1024()
a = Array(Uint64, 10^6)
@time rand!(rng, a);
@time rand!(a);

Am I missing something?

Ideally we should use a fixed-length vector rather than a Vector in the type; that should speed things up. I'd like to just have fields s1::Uint64 .... s16::Uint64, but I don't know if I can access those efficiently by index.
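As a sanity check of the transcription above against the reference C code, the same recurrence can be written in Python (0-based state index instead of Julia's 1-based; masking stands in for UInt64 wraparound; the a=31, b=11, c=30 shift constants match):

```python
MASK = (1 << 64) - 1  # emulate 64-bit unsigned wraparound

class Xorshift1024Star:
    def __init__(self, seed_words):
        assert len(seed_words) == 16 and any(seed_words)
        self.s = [w & MASK for w in seed_words]
        self.p = 0  # 0-based index into the 16-word state

    def next(self):
        s0 = self.s[self.p]
        self.p = (self.p + 1) & 15
        s1 = self.s[self.p]
        s1 ^= (s1 << 31) & MASK  # a
        s1 ^= s1 >> 11           # b
        s0 ^= s0 >> 30           # c
        self.s[self.p] = s0 ^ s1
        return (self.s[self.p] * 1181783497276652981) & MASK

rng = Xorshift1024Star(range(1, 17))  # arbitrary nonzero seed for illustration
out = rng.next()
```

Running both against the same 16-word seed is an easy way to confirm a Julia port produces the intended sequence.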

@JeffBezanson
Sponsor Member

In Andreas' gist above @simonster has a thoroughly hacked version that's faster. Can probably be faster still if we special-case filling an array.

@stevengj
Member

Ah, I didn't see that one. Okay, the trick is to use getfield and unsafe_store! with a hacked pointer.

@ViralBShah
Member

@stevengj I don't think there is any downside to replacing the current RNG. In fact we should do so, and in case someone needs dSFMT, we can provide the current functionality in a package.

@StefanKarpinski
Sponsor Member

Here's my version, based on @simonster's, which unrolls the loop when filling arrays:

module XORShift

import Base: rand, rand!

type XORShiftStar1024 <: AbstractRNG
    x00::Uint64
    x01::Uint64
    x02::Uint64
    x03::Uint64
    x04::Uint64
    x05::Uint64
    x06::Uint64
    x07::Uint64
    x08::Uint64
    x09::Uint64
    x10::Uint64
    x11::Uint64
    x12::Uint64
    x13::Uint64
    x14::Uint64
    x15::Uint64
    p::Uint64
end
XORShiftStar() = XORShiftStar1024(rand(Uint64, 16)..., 0)

const globalXORShiftStar1024 = XORShiftStar()

fromUint64(::Type{Uint64}, u::Uint64) = u
fromUint64(::Type{Float64}, u::Uint64) = 5.421010862427522e-20 * u

function rand(rng::XORShiftStar1024, ::Type{Uint64})
    @inbounds begin
        p  = rng.p
        s0 = getfield(rng, reinterpret(Int, p)+1)
        p = (p + 1) % 16
        s1 = getfield(rng, reinterpret(Int, p)+1)
        s1 $= s1 << 31
        s1 $= s1 >> 11
        s0 $= s0 >> 30
        s0 $= s1
        unsafe_store!(convert(Ptr{Uint64}, pointer_from_objref(rng))+8,
                      s0, reinterpret(Int, p)+1)
        rng.p = p
    end
    return s0 * 1181783497276652981
end

rand{T<:Number}(rng::XORShiftStar1024, ::Type{T}) = fromUint64(T, rand(rng, Uint64))

@eval function rand!{T}(rng::XORShiftStar1024, a::AbstractArray{T})
    $(Expr(:block, [
        let x = symbol(@sprintf("x%02d", p % 16))
            :($x = rng.$x)
        end
    for p = 0:15 ]...))
    @inbounds for i = 1:16:length(a)
        $(Expr(:block, [
            let x0 = symbol(@sprintf("x%02d", p % 16)),
                x1 = symbol(@sprintf("x%02d", (p + 1) % 16))
                quote
                    s0 = $x0
                    s1 = $x1
                    s1 $= s1 << 31
                    s1 $= s1 >> 11
                    s0 $= s0 >> 30
                    s0 $= s1
                    $x1 = s0
                    a[i + $p] = fromUint64(T, s0 * 1181783497276652981)
                end
            end
        for p = 0:15 ]...))
    end
    $(Expr(:block, [
        let x = symbol(@sprintf("x%02d", p % 16))
            :(rng.$x = $x)
        end
    for p = 0:15 ]...))
    return a
end

end # module

This version seems to be significantly faster than dSFMT at array filling for both Float64 (28% faster) and Uint64 (74% faster). Note that this version is not correctly generalized yet and only handles arrays with lengths that are a multiple of 16 and assumes that rng.p is 0 at the beginning of the call.

Also, this may require a fix to base/random.jl to work (pending).

@GunnarFarneback
Contributor

As a curiosity, it can be noted that a 32-bit xorshift generator is included in https://github.com/JuliaLang/julia/blob/master/test/perf/kernel/go_benchmark.jl

Somewhat ironically, in light of this discussion, it was greeted with: "Julia also ships with a pretty high-end RNG which you could use instead of this xor-based one."
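For context, a plain 32-bit xorshift is just the shift/xor core with no output multiplier. Marsaglia's classic (13, 17, 5) triple in Python (illustrative; the constants in go_benchmark.jl may differ):

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned wraparound

def xorshift32(x):
    """One step of Marsaglia's 32-bit xorshift with the (13, 17, 5) triple.
    The state itself is the output."""
    x ^= (x << 13) & MASK32
    x ^= x >> 17
    x ^= (x << 5) & MASK32
    return x

x = xorshift32(1)  # -> 270369
```

This is the weakest member of the family (it fails several statistical tests), which is why the thread focuses on the starred variants with a larger state and an output multiplier.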

@StefanKarpinski
Sponsor Member

@GunnarFarneback – hah, that is rather funny.

@rfourquet
Member

Given the beginning of the thread, allow me to point out that PR #8380 implements more rand/rand! methods that allow passing an RNG object.

@rfourquet
Member

I think that Julia's dSFMT-based RNG can be made much faster than it currently is (I observed up to 3x); see #8808.

@StefanKarpinski
Sponsor Member

Fully general version of the above:

@eval function rand!{T}(rng::XORShiftStar1024, a::AbstractArray{T})
    p = rng.p
    n = length(a)
    q = min((16 - p) % 16, n)
    @inbounds for i = 1:q
        a[i] = rand(rng, T)
    end
    $(Expr(:block, [
        let x = symbol(@sprintf("x%02d", k % 16))
            :($x = rng.$x)
        end
    for k = 0:15 ]...))
    r = q + (n-q) >> 4 << 4
    @inbounds for i = q+1:16:r
        $(Expr(:block, [
            let x0 = symbol(@sprintf("x%02d", k % 16)),
                x1 = symbol(@sprintf("x%02d", (k + 1) % 16))
                quote
                    s0 = $x0
                    s1 = $x1
                    s1 $= s1 << 31
                    s1 $= s1 >> 11
                    s0 $= s0 >> 30
                    s0 $= s1
                    $x1 = s0
                    a[i + $k] = fromUint64(T, s0 * 1181783497276652981)
                end
            end
        for k = 0:15 ]...))
    end
    $(Expr(:block, [
        let x = symbol(@sprintf("x%02d", k % 16))
            :(rng.$x = $x)
        end
    for k = 0:15 ]...))
    rng.p = 0
    @inbounds for i = r+1:n
        a[i] = rand(rng, T)
    end
    return a
end

We should also strongly consider @rfourquet's patches to dSFMT, which might be even faster.
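The head/middle/tail index arithmetic in the general version above can be checked in isolation. A few lines of Python mirroring it (note that Julia's `>>`/`<<` bind tighter than `+`, so explicit parentheses are needed here):

```python
def fill_chunks(n, p):
    """Split an n-element fill, starting at state offset p (0..15), into:
    a head that realigns p to 0, an unrolled 16-at-a-time middle,
    and a scalar tail -- mirroring the index math in rand! above."""
    q = min((16 - p) % 16, n)        # head: elements 1..q
    r = q + ((n - q) >> 4 << 4)      # middle: q+1..r, in blocks of 16
    return q, r                      # tail: r+1..n

assert fill_chunks(100, 0) == (0, 96)
assert fill_chunks(100, 3) == (13, 93)
assert fill_chunks(5, 14) == (2, 2)  # too short for any 16-block
```

The invariants are that the three pieces tile 1..n exactly and the middle is a multiple of 16, which is what lets the unrolled loop assume `rng.p == 0`.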

@ViralBShah
Member

We should see if we can use the tricks used by dSFMT to have a SIMD version of this algorithm. That may be non-trivial work, but if possible, doing so would put us on par with or even ahead of dSFMT, since dSFMT does not go beyond SSE2, IIRC.

@GunnarFarneback
Contributor

I think this is a good opportunity to realize the SIMD potential within Julia that was first promised in #4042. Here's a small start with the easy part, the xor operation.

julia> typealias Uint64x2 NTuple{2, Uint64}
(Uint64,Uint64)

julia> typealias Uint64x4 NTuple{4, Uint64}
(Uint64,Uint64,Uint64,Uint64)

julia> function ($)(x::Uint64x2, y::Uint64x2)
           Base.llvmcall("""%3 = xor <2 x i64> %1, %0
                            ret <2 x i64> %3""",
                         Uint64x2, (Uint64x2, Uint64x2), x, y)
       end
$ (generic function with 34 methods)

julia> function ($)(x::Uint64x4, y::Uint64x4)
           Base.llvmcall("""%3 = xor <4 x i64> %1, %0
                            ret <4 x i64> %3""",
                         Uint64x4, (Uint64x4, Uint64x4), x, y)
       end
$ (generic function with 35 methods)

julia> (0x0123456789abcdef,0x0123456789abcdef)$(0x0011223344556677,0x8899aabbccddeeff)
(0x01326754cdfeab98,0x89baefdc45762310)

julia> code_native($, (Uint64x2, Uint64x2))
        .text
Filename: none
Source line: 2
        push    RBP
        mov     RBP, RSP
        vxorps  XMM0, XMM1, XMM0
Source line: 2
        pop     RBP
        ret

julia> code_native($, (Uint64x2, Uint64x2, Uint64x2))
        .text
Filename: operators.jl
Source line: 82
        push    RBP
        mov     RBP, RSP
        vxorps  XMM0, XMM1, XMM0
        vxorps  XMM0, XMM0, XMM2
Source line: 82
        pop     RBP
        ret

julia> code_native($, (Uint64x4, Uint64x4))
        .text
Filename: none
Source line: 2
        push    RBP
        mov     RBP, RSP
        vxorps  YMM0, YMM1, YMM0
Source line: 2
        pop     RBP
        ret

The Uint64x2 xor only requires SSE2. The Uint64x4 needs something later, possibly AVX; in any case it is available on my Sandy Bridge processor.

Getting sensible shift instructions out of this is beyond my LLVM skills however. But load, store, xor, multiplication, left shift, and right shift are all the components needed for a xorshift* generator. This should be within reach from Julia.

@toivoh
Contributor

toivoh commented Oct 26, 2014

With SIMD vectorization, we could produce multiple streams from different xorshift seeds in parallel. I guess that's ok, even though it increases the size of the RNG state.

What I find more troublesome is that the optimal number of parallel streams depends on the width of the SIMD instructions available (and probably also on their latency). But if we interleave multiple streams, it should always be the same number, so that we produce the same random sequence independent of the machine where the random number generation is done.

I guess that we would want the smallest number of parallel streams that seems future proof in that it will be possible to do the evaluation in parallel on future processors, which might have wider SIMD instructions. But I'm not sure how easy it is to find a figure that is both sensible now and future-proof.

Looking at the xorshift1024* algorithm, it seems to have an inherent parallelization bottleneck in that the previous 64-bit result s0 is needed to form the next one, so it seems that no matter how much cleverness we throw at evaluating one stream in parallel, the gains will be limited.
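The "fixed number of interleaved streams" idea is easy to prototype with numpy's fixed-width integers, where wraparound is automatic. A sketch with 8 independent streams advanced in lockstep, borrowing the xorshift64* constants purely for illustration (this is not the 1024-bit generator from the thread):

```python
import numpy as np

N_STREAMS = 8  # fixed count, so the interleaved output is machine-independent

def step_streams(states):
    """Advance N independent xorshift64*-style streams in lockstep
    (the SIMD-friendly layout) and return one 64-bit output per stream."""
    x = states
    x = x ^ (x >> np.uint64(12))
    x = x ^ (x << np.uint64(25))
    x = x ^ (x >> np.uint64(27))
    states[:] = x  # write the updated states back
    return x * np.uint64(2685821657736338717)  # uint64 multiply wraps silently

streams = np.arange(1, N_STREAMS + 1, dtype=np.uint64)  # nonzero seeds
outputs = step_streams(streams)
```

Each stream evolves independently, so there is no sequential s0 dependency across lanes; the trade-off, as noted above, is 8x the state and a stream count baked into the output format.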

@toivoh
Contributor

toivoh commented Oct 26, 2014

Now, if there were a xorshift1024* algorithm where the next output depends on the output from e.g. 15 and 16 steps ago, there would have been a lot more room for parallelization...

@toivoh
Contributor

toivoh commented Oct 28, 2014

Looking at the papers, it seems that the validation that the proposed generators have received was to verify the maximum period property and to evaluate a hundred seeds with BigCrush.

This is bordering on original research, but I wrote some code to check the maximum period property, see this gist. It takes around half an hour to verify the property on my somewhat dated laptop.

If someone were to come up with a variation on xorshift1024 with better parallelization properties that passes the test, and BigCrush likes it, we could consider using it for an RNG.

Edit: Changed the order of loops in the matrix multiply, and some other optimizations. Now it takes a little more than a minute to verify the maximum period property of xorshift1024 on my machine.
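The matrix check works because the state update (everything except the final multiply) is linear over GF(2): each shift and each xor is linear, so their composition is too. A quick property test in Python, using the 64-bit variant's constants as a small stand-in for the 1024-bit state:

```python
import random

MASK = (1 << 64) - 1

def linear_part(x):
    """The shift/xor state update only -- no output multiply."""
    x ^= x >> 12
    x ^= (x << 25) & MASK
    x ^= x >> 27
    return x

rnd = random.Random(0)
for _ in range(1000):
    a = rnd.getrandbits(64)
    b = rnd.getrandbits(64)
    # Linearity over GF(2): f(a xor b) == f(a) xor f(b)
    assert linear_part(a ^ b) == linear_part(a) ^ linear_part(b)
assert linear_part(0) == 0  # linear maps fix zero
```

Linearity is what reduces the full-period question to matrix algebra on the bit matrix of the update; the output multiply is irrelevant to the period, since it is a bijection applied after the state transition.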

@ViralBShah
Member

Note that with @rfourquet's commit 06fb0a3, we are now back to using the fast array generators from dSFMT.

@ViralBShah
Member

And now with 84b6aec the rand() performance is also back to the original levels.

@ViralBShah
Member

I think we should have XORShift as another RNG in Julia, with dSFMT continuing to be the default for now.

@simonbyrne
Contributor

Michael Stumpf gave a great talk at the Julia London meetup last night; as someone with a lot of experience in large-scale Monte Carlo simulations, he does not hold a very high opinion of Mersenne Twister RNGs. I'm sure he would be keen to provide some input on this, particularly for working with parallel streams.

He did have good things to say about this work (free link available here), based on cryptographic functions of simple transitions.

@andreasnoack
Member Author

Was his major concern lack of independence between streams, i.e. in our situation the MT with different seed values?

@StefanKarpinski has mentioned the Random123 library before and I have tried to wrap it, but I had difficulties in getting high speed when calling the library from Julia. It's a header-only library, so you'd have to write a little wrapper to make a dynamic library, and somehow this step made the speed drop.

There are several RNGs in the library and some of them are quite simple to write in Julia which I did at some point. Maybe it would be competitive in terms of speed if @toivoh's Julia SIMD tricks could be applied to the Julia translation.

@simonbyrne
Contributor

Was his major concern lack of independence between streams, i.e. in our situation the MT with different seed values?

That was certainly a major problem, though I think he also viewed it as relatively slow given its unreliability (i.e. failing BigCrush). I've sent him an email to ask for input.

@StefanKarpinski
Sponsor Member

The parallelization issue is another big problem. But it's not really just parallelism – it's about the fact that the determinism of sequential RNGs is inherently dynamically scoped and brittle. Consider, e.g., if some function you call decides to use a randomized algorithm in the future – if both codes are using a default global RNG, then this changes the random stream your code sees. This is ameliorated by explicitly passing RNG state around, but that does not address parallelism. The indexed RNG approach of Random123 is totally immune to this – the stream depends only on the indexes you use. Fast-forwarding and parallelism are completely trivial. You may need to use some label at each location to ensure that different streams are used in each lexical place; I'd be interested in hearing how that's done. One approach is to generate random values at code-writing time, as we've done for the hashing stuff.
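To make the indexed idea concrete: a counter-based generator is just a good mixing bijection applied to (key, counter). The sketch below uses a splitmix64-style finalizer as a stand-in for Random123's actual mixing functions (Threefry/Philox), and the key/counter combine step is hypothetical, for illustration only:

```python
MASK = (1 << 64) - 1
GOLDEN = 0x9E3779B97F4A7C15  # odd constant (splitmix64's increment)

def mix64(z):
    """splitmix64-style finalizer: a bijective 64-bit mixer, standing in
    for Random123's cryptographic round functions."""
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK
    return z ^ (z >> 31)

def indexed_rand(key, counter):
    """The i-th draw depends only on (key, i): fast-forward, replay, and
    parallel streams are all trivial -- there is no sequential state."""
    return mix64(mix64(key) ^ ((counter * GOLDEN) & MASK))

stream = [indexed_rand(42, i) for i in range(4)]
jumped = indexed_rand(42, 2)  # "fast-forwarding" is just indexing
assert jumped == stream[2]
```

The lexical-label problem mentioned above maps onto the key: two call sites with different keys get independent streams regardless of evaluation order.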

@toivoh
Contributor

toivoh commented Nov 19, 2014

The Random123 algorithms look very nice. If there were some pure Julia implementations, I might very well be able to speed them up by making good use of SIMD instructions and such. It also seems that @ArchRobison has come a long way in #6271 to make Julia/LLVM do a lot of such vectorization automatically.

@andreasnoack
Member Author

I just found a Julia implementation I did some time ago and made a gist

julia> include("random123.jl")
julia> [Random123.threefly(uint(24), uint(1), uint(1), uint(1), uint(i)) for i = 1:10]
10-element Array{(Uint64,Uint64),1}:
 (0x501ed512ad55f8be,0xd9bc3571774244b0)
 (0x9ca695f69a0bdf3d,0x59505d30f78315dd)
 (0x4f0228d7b1510136,0x8f47d7e3e872c7aa)
 (0x74aac2ff7c7547a2,0xa26b29ebf8bd140d)
 (0xc5882e14e65fe8cc,0x3983634edc30c114)
 (0x8e741ef59f274f8d,0x7a97339b59c2a967)
 (0x73ca5f166b815646,0x7f0edadba29c719a)
 (0x0ec2396e74e60cf0,0xf7f77cd48b3b14a1)
 (0x0f86717815d8a0bd,0x40b192d616af79f0)

@StefanKarpinski
Sponsor Member

We could also potentially use llvmcall in case we can't get LLVM to do sufficiently clever optimizations.

@toivoh
Contributor

toivoh commented Nov 19, 2014

Yes, that is one of the tools that I have been using :)

@toivoh
Contributor

toivoh commented Nov 20, 2014

@andreasnoack: I looked a bit at your code. It already generates very straightforward machine code (basically plus, rotate, xor, then repeat), especially when I parameterized on Nrounds to get rid of the branches. It even uses the x86 rol instruction.

The options for parallelization are basically to process more input points at the same time, or use a wider version of the generator (i.e. 64x8). For normal rng purposes each might be similarly useful, but in general I think that the former will be more interesting.

I didn't get it to produce any SIMD code yet for evaluating multiple points in parallel, but I might try some more.

@ViralBShah added the domain:randomness (Random number generation and the Random stdlib) label Nov 22, 2014
@simonbyrne
Contributor

Another resource that looks like it could be interesting is Agner Fog's page (also known for his tables of instruction latencies).

@tkelman
Contributor

tkelman commented Mar 3, 2015

Anyone familiar with this work? Looks interesting http://www.pcg-random.org/pdf/toms-oneill-pcg-family-v1.02.pdf

@ViralBShah
Member

Should we try and include XORShift in Julia by default, or should we have a package? For now, I am thinking it is a good idea to have it in a package, which will also help clean up the API to make it easy to add other RNGs.

@toivoh
Contributor

toivoh commented May 12, 2015 via email

ViralBShah pushed a commit to JuliaAttic/XORShiftRNG.jl that referenced this issue May 13, 2015
@ViralBShah
Member

I have put the code from here into: https://github.com/JuliaLang/XORShiftRNG.jl

Ideally one would just provide rand(rng::SomeRNG, ::Type{Float64}) and everything else would just work. I think we are pretty close to that, and if that can be achieved with this package, and some reorganization in Base, we can even register it.

@tkelman
Contributor

tkelman commented May 13, 2015

I was also hoping the new pcg-random RNGs would have been relicensed to MIT by now, so that a more direct port of the source code (instead of going from the paper) could avoid Apache, but no update there.

@ivarne
Sponsor Member

ivarne commented May 13, 2015

I rather think rand(rng::SomeRNG, ::Type{Int64}) should be the basis, because rand(rng::SomeRNG, ::Type{Float64}) only offers 52 bits of entropy. @rfourquet also complained that a fully generic pluggable system creates ambiguity problems, but I can't find the reference.
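For reference, the bit budget comes from the Float64 significand. The common generic recipe for turning a 64-bit integer into a uniform Float64 keeps only the top bits and scales (this is the 53-bit scheme; dSFMT's native path instead produces 52 random bits directly):

```python
def uint64_to_float(u):
    """Keep the top 53 bits (a Float64 has a 52-bit stored significand
    plus the implicit leading bit) and scale into [0, 1)."""
    return (u >> 11) * 2.0 ** -53

assert uint64_to_float(0) == 0.0
assert uint64_to_float((1 << 64) - 1) == 1.0 - 2.0 ** -53  # largest value, strictly < 1
```

Whichever direction is the "basis", the other is lossy: integer-to-float discards low bits, while float-to-integer can never recover a full 64 bits from one draw.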

@ViralBShah
Member

Yes, I recollect that too, but it is worth revisiting. @rfourquet may have a PR for this as well. Int64 is certainly a better basis, but that won't work for dSFMT, although it will work for SFMT. Perhaps we can have either one.

@rfourquet
Member

I wouldn't say that "a fully generic pluggable system [necessarily] creates ambiguity problems" (I may have expressed my concerns unclearly in this comment and above), only that it must be general enough to allow the library writer to define the specializations she wants without getting ambiguity warnings. I hope rand(rng::SomeRNG, ::Type{Int64}) won't be hardcoded as the only possible "basis": more kinds of engines should be "pluggable", even though the full generality of the C++ random engines/distributions system may not be needed at this point. This is certainly doable with e.g. @generated functions.

@blakejohnson
Contributor

@tkelman I've recently stumbled upon the PCG family of random number generators again. The algorithm seems simple enough that it might be straightforward to create an original implementation from the paper. I'll put it on my TODO list.

@blakejohnson
Contributor

It's just a first pass, but this is an attempt to reproduce PCG-XSH-RR from the paper:

# bitwise rotate right
function rotate{T <: Unsigned}(x::T, n::T)
    mask = 8*sizeof(x) - 1
    return (x >> n) | (x << ((-n) & mask))
end

# multiplier chosen from table 4 in:
# P. L'Ecuyer, ``Tables of Linear Congruential Generators of Different Sizes and
# Good Lattice Structure'', Mathematics of Computation, 68, 225 (1999), 249--260.
const multiplier = UInt64(3935559000370003845)
# L'Ecuyer only says that you need an odd increment
const increment = UInt64(204209821)

let state::UInt64 = 0x23cc2170ddd0af6e
    global pcg_rand
    function pcg_rand()
        state = state * multiplier + increment;
        rotate(((state $ (state >> 18)) >> 27) % UInt32,
               (state >> 59) % UInt32)
    end
end
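A Python transcription of the same step, useful for cross-checking a Julia port (masking emulates the unsigned wraparound; like the let-block above, it advances the state before producing output):

```python
MASK64 = (1 << 64) - 1
MULTIPLIER = 3935559000370003845  # L'Ecuyer multiplier, as above
INCREMENT = 204209821             # any odd increment works

def rotr32(x, n):
    """Bitwise rotate right on 32 bits."""
    n &= 31
    return ((x >> n) | (x << ((-n) & 31))) & 0xFFFFFFFF

def pcg_xsh_rr(state):
    """One PCG-XSH-RR step: advance the 64-bit LCG state, then output a
    permuted 32-bit value. Returns (new_state, output)."""
    state = (state * MULTIPLIER + INCREMENT) & MASK64
    xorshifted = ((state ^ (state >> 18)) >> 27) & 0xFFFFFFFF
    return state, rotr32(xorshifted, state >> 59)

state = 0x23cc2170ddd0af6e  # same seed as the let block
state, out = pcg_xsh_rr(state)
```

The permutation (xorshift then data-dependent rotate) is what the paper credits for passing BigCrush despite the plain-LCG state transition.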

@pao
Member

pao commented Apr 12, 2016

Forwarding this on, as promised, from art4711/random-double#1:

I just remembered this conversation. You probably won't get this (I have no idea how GitHub notifies people), but in case you do, I did a thing: https://github.com/art4711/randbench (regarding the "CSPRNG is slow" meme) and https://github.com/art4711/unpredictable. With the new compiler that will be in Go 1.7, I'm down to 25ns per 64-bit number, which is just 2.5x slower than the default PRNG in Go.

The conversation started ages ago, but seeing as there was a follow-up and benchmarks (though in Go), I figured I should pass it along. I don't agree that "just 2.5x slower" is acceptable for random sampling applications.

@ViralBShah
Member

I wonder if this is worth revisiting given the interesting developments in libm.jl. Having a good implementation in Julia is probably desirable compared to dSFMT, which does not leverage modern, wider vector units. I am not sure if this should be xorshift or MT, or perhaps both.

Reducing the dependency on dSFMT will also give us faster RNGs on non-Intel platforms.

@simonbyrne
Contributor

I should mention the GSOC project by @sunoru:
https://github.com/sunoru/RNG.jl

Some benchmarks are here:
http://sunoru.github.io/RNG.jl/latest/man/benchmark/

(note, however, that these were for generating UInt64s: dSFMT was designed to generate 52 bits at a time for Float64s, and thus requires 2 draws to get a full UInt64).
