randn performance #1348

JeffBezanson opened this Issue · 16 comments

4 participants


randn might be slower than it should be in some environments. I see:

julia> [@time (rand(1,10000000); gc()) for x = 1:10];
elapsed time: 0.030134916305541992 seconds
elapsed time: 0.030981063842773438 seconds
elapsed time: 0.030447959899902344 seconds

julia> [@time (randn(1,10000000); gc()) for x = 1:10];
elapsed time: 0.211439847946167 seconds
elapsed time: 0.21141505241394043 seconds
elapsed time: 0.21072912216186523 seconds

Should be investigated. See the julia-dev thread: !searchin/julia-dev/randn/julia-dev/ImhGsqX_IHc/QOCM4a9dmqUJ


The thing is that randmtzig originally used MT, which generated 32-bit ints. Now we are using dSFMT, which generates double-precision random numbers and converts them to 32-bit random integers; randmtzig consumes a pair of these to construct a 53-bit random integer, which it then uses to generate normally distributed numbers.

The dSFMT manual in fact warns against using it to generate 32-bit random integers except for occasional use. Fixing this up might give us the performance we want. The holy grail, of course, is a ziggurat implemented in Julia that is just as fast. We already have a ziggurat implementation in perf2, but we still need to close the gap with C before we can use it. (#1211)


I do agree with Viral that generating double-precision random uniforms and then converting them to 32-bit ints is somewhat of a waste. However, in my timings I'm getting a factor of less than 3, which is rather good considering the inefficiency in the algorithm.

julia> [@elapsed (rand(10000000); gc()) for i in 1:10]
10-element Float64 Array:

julia> [@elapsed (randn(10000000); gc()) for i in 1:10]
10-element Float64 Array:

AMD Athlon(tm) II X4 635 Processor running Ubuntu 12.04 amd_64


Since the initial scaled sample is accepted 99.3% of the time, it would be nice to see the performance ratio consistently under 2 or so.


But currently it is necessary to draw two random uniforms, convert them to 32-bit integers, and do a certain amount of integer arithmetic and indexing. It seems optimistic to expect a ratio consistently under 2, doesn't it?


Yes, but dSFMT now natively generates 64 bits of entropy at a time, so I believe the two calls to generate 32-bit ints are doing redundant work.


I believe dSFMT does generate 64 bits of entropy, but I think it reuses that to generate the second 32-bit int. We just need to clean up randmtzig so that it is compatible with the way dSFMT works naturally.


I note a 5x difference between rand and randn in Julia. However, Julia's randn is still faster than Matlab R2011a's.

julia> @time rand(1,10000000);
elapsed time: 0.02201104164123535 seconds

julia> @time randn(1,10000000);
elapsed time: 0.10574984550476074 seconds

matlab >> tic; rand(1,10000000); toc
Elapsed time is 0.167736 seconds.

matlab >> tic; randn(1,10000000); toc
Elapsed time is 0.196783 seconds.

Excellent! Then when we get it down to 2x slower than our rand, we'll really be trouncing them. The fact that Matlab's randn is only 17% slower than its rand shows that Gaussian sampling can get pretty close to uniform sampling (unless they are doing something bizarre).


I should also note that even though the 32-bit int generation with dSFMT seems to entail extra work, it has yet to be shown that avoiding it will speed things up. Here are some experiments in C, which show that generating 2n uint32 random numbers is about 50% slower than generating n floating-point random numbers.

./randmtzig 10000000
Uniform fill (n): 0.049514
Uniform (n): 0.038494
Uniform 32-bit ints (2*n): 0.104497
Normal (n): 0.102242

Basically, what we are seeing is that Julia takes 0.08 seconds for the normal RNG (excluding the uniform RNG time), while Matlab takes 0.03 seconds for the same. So yes, there should be some scope for improvement.


All good points. This is a case where the overhead is additive, not multiplicative.


One major source of the difference is that when we generate rand(n), we are able to use dSFMT's fill_array routines, which are much more efficient than the routines that generate one random number at a time. This alone accounts for about a 2-3x performance difference.


It also seems that a significant part of the time goes in the tail computation, which gets invoked about 2% of the time.


I could get about a 20% speedup in my C implementation by avoiding multiple calls to genrand_uint32 and directly generating a 52-bit integer from the mantissa of genrand_close_open, but that version also had some inlining enabled that we don't do in librandom. Plugging these improvements into Julia's librandom showed no performance improvement. The whole thing is quite delicate, with small changes leading to large differences. For now, I am not sure what we can do to quickly improve randn performance.

@ViralBShah ViralBShah closed this issue from a commit
@ViralBShah ViralBShah Update randmtzig.c to use dsfmt's double precision RNG.
Earlier, randmtzig used the int32 RNG from dSFMT, which is
not the preferred way to use dSFMT. 52 bits of randomness
are now taken from the mantissa of the doubles generated by
dSFMT. This is two fewer bits than the original code, but
much faster.

This passes the TestU01 Crush test suite when the output of
randn() is mapped through the inverse CDF into the uniform
distribution. The test suite is at

Detailed discussion in #1372
Close #1348.