
randn performance #1348

Closed
JeffBezanson opened this Issue October 07, 2012 · 16 comments

4 participants

Jeff Bezanson Viral B. Shah Douglas Bates Stefan Karpinski
Jeff Bezanson
Owner

randn might be slower than it should be in some environments. I see:

julia> [@time (rand(1,10000000); gc()) for x = 1:10];
elapsed time: 0.030134916305541992 seconds
elapsed time: 0.030981063842773438 seconds
elapsed time: 0.030447959899902344 seconds
...

julia> [@time (randn(1,10000000); gc()) for x = 1:10];
elapsed time: 0.211439847946167 seconds
elapsed time: 0.21141505241394043 seconds
elapsed time: 0.21072912216186523 seconds
...

Should be investigated.

https://groups.google.com/forum/#!searchin/julia-dev/randn/julia-dev/ImhGsqX_IHc/QOCM4a9dmqUJ

Viral B. Shah
Owner

The thing is that randmtzig originally used MT, which generated 32-bit ints. Now we are using dSFMT, which generates double precision random numbers and converts them to 32-bit random integers; randmtzig consumes a pair of those to construct a 53-bit random integer, which it then uses to generate normally distributed random numbers.

The dSFMT manual in fact warns against using it to generate 32-bit random integers, apart from occasional use. Fixing this up might just give us the performance we want. The holy grail, of course, is to implement a ziggurat in Julia that is just as fast. We already have a ziggurat implementation in perf2, but we still need to close the gap with C to use it. (#1211)

Douglas Bates
Collaborator

I do agree with Viral that generating double precision random uniforms and then converting them to 32-bit ints is somewhat wasteful. However, in my timings I'm getting a factor of less than 3, which is rather good considering the inefficiency in the algorithm.

julia> [@elapsed (rand(10000000); gc()) for i in 1:10]
10-element Float64 Array:
 0.070298 
 0.0667231
 0.069932 
 0.0685332
 0.0698729
 0.0664868
 0.0696168
 0.0679629
 0.0744951
 0.0732141

julia> [@elapsed (randn(10000000); gc()) for i in 1:10]
10-element Float64 Array:
 0.202207
 0.201848
 0.205695
 0.209753
 0.209389
 0.208542
 0.203538
 0.209986
 0.206832
 0.209801
Douglas Bates
Collaborator

AMD Athlon(tm) II X4 635 Processor running Ubuntu 12.04 amd_64

Jeff Bezanson
Owner

Since the initial scaled sample is accepted 99.3% of the time, it would be nice to see the performance ratio consistently under 2 or so.

Douglas Bates
Collaborator

But currently it is necessary to draw two random uniforms, convert them to 32-bit integers, and do a certain amount of integer arithmetic and indexing. It seems optimistic to expect a ratio consistently under 2, doesn't it?

Jeff Bezanson
Owner

Yes, but dSFMT now natively generates 64 bits of entropy at a time, so I believe the two calls to generate 32-bit ints are doing redundant work.

Viral B. Shah
Owner

I believe dSFMT does generate 64 bits of entropy, but I think that it reuses that to generate the second 32-bit int. We just need to clean up randmtzig so that it is compatible with the way dSFMT works naturally.

Viral B. Shah
Owner

I note a 5x difference between rand and randn in julia. However, the julia randn is still faster than Matlab R2011a.

julia> @time rand(1,10000000);
elapsed time: 0.02201104164123535 seconds

julia> @time randn(1,10000000);
elapsed time: 0.10574984550476074 seconds

matlab >> tic; rand(1,10000000); toc
Elapsed time is 0.167736 seconds.

matlab >> tic; randn(1,10000000); toc
Elapsed time is 0.196783 seconds.
Stefan Karpinski

Excellent! Then when we get it down to 2x slower than our rand, we'll really be trouncing them. The fact that Matlab's randn is only 17% slower than its rand shows that you can get Gaussian sampling pretty close to uniform sampling (unless they are doing something bizarre).

Viral B. Shah
Owner

I should also note that even though it seems that the 32-bit int generation with dsfmt leads to extra work, it has yet to be shown that avoiding it will speed things up. Here are some experiments in C, which show that generating 2n uint32 random numbers is about 50% slower than generating n floating point random numbers.

./randmtzig 10000000
Uniform fill (n): 0.049514
Uniform (n): 0.038494
Uniform 32-bit ints (2*n): 0.104497
Normal (n): 0.102242
Viral B. Shah
Owner

Basically, what we are seeing is that julia takes 0.08 seconds for the normal rng (excluding uniform rng time), and that matlab takes 0.03 seconds for the same. So yes, there should be some scope for improvement.

Stefan Karpinski

All good points. This is a case where the overhead is additive, not multiplicative.

Viral B. Shah
Owner

One major source of this performance difference is that when we generate rand(n), we are able to use dSFMT's fill_array routines, which are much more efficient than the routines that generate one random number at a time. This alone accounts for about a 2-3x performance difference.

Viral B. Shah
Owner

It also seems that a significant part of the time goes into the tail computation, which gets invoked about 2% of the time.

Viral B. Shah
Owner

I could get about a 20% speedup in my C implementation by avoiding multiple calls to genrand_uint32 and directly generating a 52-bit integer from the mantissa of genrand_close_open, but that also had some inlining enabled that we don't do in librandom. Plugging these improvements in julia's librandom showed no performance improvement. The whole thing is quite delicate, with small changes leading to large differences. As for now, I am not sure what we can do to quickly improve randn performance.

Viral B. Shah ViralBShah closed this issue from a commit October 14, 2012
Viral B. Shah Update randmtzig.c to use dsfmt's double precision RNG.
Earlier, randmtzig used the int32 RNG from dsfmt, which is
not the preferred way to use dsfmt. 52 bits of randomness
are now used from the mantissa of the doubles generated by
dsfmt. This is two fewer bits than the original code, but
much faster.

This passes the TestU01 Crush test suite, when the output of
randn() is mapped through the inverse CDF into the uniform
distribution. The testsuite is at
https://github.com/andreasnoackjensen/rngtest.jl

Detailed discussion in #1372
Close #1348.
c6089ee
Viral B. Shah ViralBShah closed this in c6089ee October 14, 2012