randn performance #1348

Closed
JeffBezanson opened this Issue Oct 8, 2012 · 16 comments

@JeffBezanson
Member

JeffBezanson commented Oct 8, 2012

randn might be slower than it should be in some environments. I see:

julia> [@time (rand(1,10000000); gc()) for x = 1:10];
elapsed time: 0.030134916305541992 seconds
elapsed time: 0.030981063842773438 seconds
elapsed time: 0.030447959899902344 seconds
...

julia> [@time (randn(1,10000000); gc()) for x = 1:10];
elapsed time: 0.211439847946167 seconds
elapsed time: 0.21141505241394043 seconds
elapsed time: 0.21072912216186523 seconds
...

Should be investigated.

https://groups.google.com/forum/#!searchin/julia-dev/randn/julia-dev/ImhGsqX_IHc/QOCM4a9dmqUJ

@ViralBShah


Member

ViralBShah commented Oct 8, 2012

The thing is that randmtzig originally used MT, which generated 32-bit ints. Now we are using dSFMT, which generates double-precision random numbers and converts them to 32-bit random integers, of which randmtzig consumes a couple to construct a 53-bit random integer, which it then uses to generate normally distributed random numbers.

The dSFMT manual in fact warns against using dSFMT to generate 32-bit random integers, apart from occasional use. Fixing this up might give us the performance we want. The holy grail, of course, is to implement a ziggurat in Julia that is just as fast. We already have an implementation of the ziggurat in perf2, but we still need to close the gap with C to use it. (#1211)
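For reference, a minimal sketch of the bit assembly described above, in current Julia syntax. Here rand_uint32 is a hypothetical stand-in for the dSFMT 32-bit call, and the 27/26-bit split is illustrative of the randmtzig family, not a copy of our librandom code:

# Hypothetical stand-in for the 32-bit integer routine randmtzig calls.
rand_uint32() = rand(UInt32)

# One way to assemble a 53-bit random integer from two 32-bit draws
# (combining 27 bits of one draw with 26 bits of another):
function rand53()
    a = UInt64(rand_uint32() >> 5)    # 27 random bits
    b = UInt64(rand_uint32() >> 6)    # 26 random bits
    (a << 26) | b                     # a 53-bit random integer
end

# Scaling by 2.0^-53 gives a uniform double in [0, 1).
u = rand53() * 2.0^-53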

@dmbates

Member

dmbates commented Oct 9, 2012

I do agree with Viral that generating double-precision random uniforms and then converting them to 32-bit ints is somewhat of a waste. However, in my timings I'm getting a factor of less than 3, which is rather good considering the inefficiency in the algorithm.

julia> [@elapsed (rand(10000000); gc()) for i in 1:10]
10-element Float64 Array:
 0.070298 
 0.0667231
 0.069932 
 0.0685332
 0.0698729
 0.0664868
 0.0696168
 0.0679629
 0.0744951
 0.0732141

julia> [@elapsed (randn(10000000); gc()) for i in 1:10]
10-element Float64 Array:
 0.202207
 0.201848
 0.205695
 0.209753
 0.209389
 0.208542
 0.203538
 0.209986
 0.206832
 0.209801
@StefanKarpinski


Member

StefanKarpinski commented Oct 9, 2012

Doug, what kind of machine are you on? Architecture seems to make a huge difference for these sorts of things – primarily that more superscalar CPUs tend to hide inefficiencies much better.


@dmbates


Member

dmbates commented Oct 9, 2012

AMD Athlon(tm) II X4 635 Processor running Ubuntu 12.04 amd64

@JeffBezanson


Member

JeffBezanson commented Oct 9, 2012

Since the initial scaled sample is accepted 99.3% of the time, it would be nice to see the performance ratio consistently under 2 or so.
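To put a number on it: with a 99.3% acceptance rate, the expected cost per sample is essentially the fast-path cost. A back-of-the-envelope check in Julia (the 20x fallback cost is an illustrative assumption, not a measurement):

p = 0.993                # fast-path acceptance rate (from above)
fast, slow = 1.0, 20.0   # relative costs; the 20x fallback is an assumption
expected = p * fast + (1 - p) * slow   # ≈ 1.13, so the fast path dominates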

@dmbates


Member

dmbates commented Oct 10, 2012

But currently it is necessary to draw two random uniforms, convert them to 32-bit integers, and do a certain amount of integer arithmetic and indexing. It seems optimistic to expect a ratio consistently under 2, doesn't it?

@JeffBezanson


Member

JeffBezanson commented Oct 10, 2012

Yes, but dSFMT now natively generates 64 bits of entropy at a time, so I believe the two calls to generate 32-bit ints are doing redundant work.
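A sketch of the redundancy Jeff describes (illustrative, not the actual librandom code): one native 64-bit draw already contains both 32-bit halves, so a second generator step is wasted work.

w  = rand(UInt64)         # one native 64-bit draw
lo = w % UInt32           # low 32 bits
hi = (w >> 32) % UInt32   # high 32 bits: no second generator step needed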

@ViralBShah


Member

ViralBShah commented Oct 11, 2012

I believe dSFMT does generate 64 bits of entropy, but I think it reuses them to generate the second 32-bit int. We just need to clean up randmtzig so that it is compatible with the way dSFMT works naturally.

@ViralBShah

Member

ViralBShah commented Oct 11, 2012

I note a 5x difference between rand and randn in Julia. However, Julia's randn is still faster than Matlab R2011a's.

julia> @time rand(1,10000000);
elapsed time: 0.02201104164123535 seconds

julia> @time randn(1,10000000);
elapsed time: 0.10574984550476074 seconds

matlab >> tic; rand(1,10000000); toc
Elapsed time is 0.167736 seconds.

matlab >> tic; randn(1,10000000); toc
Elapsed time is 0.196783 seconds.
@StefanKarpinski


Member

StefanKarpinski commented Oct 11, 2012

Excellent! Then when we get it down to 2x slower than our rand, we'll really be trouncing them. The fact that Matlab's randn is only 17% slower than Matlab's rand shows that Gaussian sampling can get pretty close to uniform sampling (unless they are doing something bizarre).

@ViralBShah

Member

ViralBShah commented Oct 11, 2012

I should also note that even though the 32-bit int generation with dSFMT seems to lead to extra work, it has yet to be shown that avoiding it will speed things up. Here are some experiments in C: generating 2n uint32 random numbers (0.104 s) takes about twice as long as filling an array with n floating-point random numbers (0.050 s).

./randmtzig 10000000
Uniform fill (n): 0.049514
Uniform (n): 0.038494
Uniform 32-bit ints (2*n): 0.104497
Normal (n): 0.102242
@ViralBShah


Member

ViralBShah commented Oct 11, 2012

Basically, what we are seeing is that Julia takes about 0.08 seconds for the normal RNG alone (0.106 s for randn minus 0.022 s for the uniform rand), while Matlab takes about 0.03 seconds for the same (0.197 s minus 0.168 s). So yes, there should be some scope for improvement.

@StefanKarpinski


Member

StefanKarpinski commented Oct 11, 2012

All good points. This is a case where the overhead is additive, not multiplicative.

@ViralBShah


Member

ViralBShah commented Oct 11, 2012

One major source of the difference is that when we generate rand(n), we are able to use dSFMT's fill_array routines, which are much more efficient than the routines that generate one random number at a time. This alone accounts for about a 2-3x performance difference.
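The same contrast is visible at the Julia level; a rough sketch in current syntax (the scalar loop stands in for the one-number-at-a-time C routines, not for what rand(n) actually does):

using Random

n = 10_000_000
A = Vector{Float64}(undef, n)

rand!(A)            # batched: one call lets dSFMT fill the whole array

for i in 1:n        # scalar: n separate calls, each with per-call overhead
    A[i] = rand()
end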

@ViralBShah


Member

ViralBShah commented Oct 11, 2012

It also seems that a significant part of the time goes into the tail computation, which gets invoked about 2% of the time.

@ViralBShah


Member

ViralBShah commented Oct 12, 2012

I could get about a 20% speedup in my C implementation by avoiding multiple calls to genrand_uint32 and directly generating a 52-bit integer from the mantissa of genrand_close_open, but that version also had some inlining enabled that we don't do in librandom. Plugging these improvements into Julia's librandom showed no performance improvement. The whole thing is quite delicate, with small changes leading to large differences. For now, I am not sure what we can do to quickly improve randn performance.
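For reference, the mantissa trick he mentions looks roughly like this in current Julia syntax (rand() stands in for dSFMT's genrand_close_open; dSFMT natively derives its doubles from values in [1, 2), which is where the trick comes from):

# A double in [1, 2) has a fixed exponent, so its 52 mantissa bits are
# exactly the random bits; one reinterpret and a mask recovers them.
u   = 1.0 + rand()                                  # uniform in [1, 2)
r52 = reinterpret(UInt64, u) & 0x000fffffffffffff   # the 52 mantissa bits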
