Speed and accuracy improvements to java benchmark #14229

Closed
@ryanhamilton
  • Increase the number of iterations to let the JIT compiler kick in. (This still isn't optimal: some calls are so quick, and the clocks so coarse, that our measurements remain inaccurate.)
  • Refactor to call PerfPure static functions from PerfBlas rather than repeating code in two places. Ideally all similar code should also be moved.
  • Use parseUnsignedInt rather than valueOf; it's more accurate and faster.
  • Remove the custom quicksort routine and use Arrays.sort. It's faster, does the same thing, and is more idiomatic.
  • Change printf to behave more like the Julia code, i.e. use printf rather than string concatenation.
  • Replace the recursive Fibonacci with a loop; it's much faster.
  • Add OS detection to allow the tests to run on Windows (see the sketch after this list). I recommend not using /dev/null at all, since it is special-cased on most platforms, so the test only shows performance for /dev/null rather than for files in general.
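As a rough illustration of the OS-detection idea, a minimal sketch (the class and method names NullSink and devNull are made up for illustration and are not code from this PR):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;

public class NullSink {
    // Pick a writable "discard" file per OS: NUL on Windows, /dev/null elsewhere.
    static PrintStream devNull() throws IOException {
        boolean windows = System.getProperty("os.name").toLowerCase().contains("win");
        return new PrintStream(new FileOutputStream(new File(windows ? "NUL" : "/dev/null")));
    }

    public static void main(String[] args) throws IOException {
        try (PrintStream out = devNull()) {
            out.printf("fib(20) = %d%n", 6765); // discarded on either platform
        }
    }
}
```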
jiahao (Member) commented Dec 2, 2015

Thanks for the improvements. However, the entire point of the Fibonacci benchmark is to have some idea of the cost of recursion.
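For context, the doubly recursive form that this benchmark is meant to exercise looks roughly like this (a sketch of the standard algorithm, not the exact benchmark source):

```java
public class Fib {
    // Deliberately naive, doubly recursive Fibonacci: the benchmark times the
    // recursive calls themselves, not the fastest way to compute the answer.
    static int fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        System.out.println(fib(20)); // 6765
    }
}
```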

vchuravy (Member) commented Dec 2, 2015

W.r.t. the recursive fib, I think the idea is to implement the same algorithm in Julia and across languages; the micro-benchmark algorithm in Julia uses recursion, so you are comparing apples with apples. Same thing for quicksort.

StefanKarpinski (Member) commented Dec 2, 2015

Ditto with the quicksort. We want to know how fast user code that shuffles array elements is, not how fast the hyper-optimized system quicksort is.

StefanKarpinski (Member) commented Dec 2, 2015

The parseUnsignedInt change is also invalid since other languages parse the numbers as signed integers.

Keno (Member) commented Dec 2, 2015

Thanks for working on this @ryanhamilton. Having fair benchmarks is really important.

StefanKarpinski (Member) commented Dec 2, 2015

Yes, thanks for the improvements, which appear to be these:

  • PerfPure/PerfBlas code deduplication
  • printf change
  • Windows portability

The number of JIT iterations is a bit debatable, but arguably not wrong. This is a tricky issue since some languages like C and Fortran require no JIT, Julia has a first-time-only JIT, and other systems like Java and JavaScript have JITs that kick in after an indeterminate amount of time. How many iterations are fair?
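One common, if still somewhat arbitrary, Java pattern is to warm up first and then report the best of several timed runs. A minimal sketch; the iteration counts are illustrative and not values proposed in this PR:

```java
public class Timing {
    // Warm up so the JIT compiles the hot path, then time repeated runs with
    // System.nanoTime and keep the minimum as the least-noisy estimate.
    static double bestTimeMillis(Runnable benchmark) {
        for (int i = 0; i < 1_000; i++) benchmark.run();   // warm-up, untimed
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < 100; i++) {
            long t0 = System.nanoTime();
            benchmark.run();
            best = Math.min(best, (System.nanoTime() - t0) / 1e6);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.printf("%.3f ms%n", bestTimeMillis(() -> {
            double s = 0;
            for (int i = 1; i <= 1_000_000; i++) s += 1.0 / i;
            if (s < 0) System.out.println(s);   // keep the work from being optimized away
        }));
    }
}
```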

timholy (Member) commented Dec 2, 2015

> How many iterations are fair?

Agreed it's tricky, but I think the only clean answers are "the first iteration" or "in the asymptotic limit." Anything else feels pretty arbitrary.

Besides, I imagine julia might go that way someday to decrease the cost of run-time compilation 😉.

ryanhamilton commented Dec 2, 2015

@StefanKarpinski I would say let it run 50 times and calculate the time taken per run, then do it again for 70 runs and see whether you get a similar result, i.e. within a 5% change. Keep going until you have a known confidence level (see the sketch below).
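A rough sketch of that adaptive scheme; the growth factor and the 5% threshold are illustrative, not values taken from the benchmarks:

```java
public class AdaptiveTiming {
    // Time a batch of runs, grow the batch (50, 70, 98, ...), and stop once two
    // successive per-run estimates agree to within roughly 5%.
    static double perRunNanos(Runnable benchmark) {
        double previous = Double.NaN;
        for (int runs = 50; runs <= 1_000_000; runs = runs * 7 / 5) {
            long t0 = System.nanoTime();
            for (int i = 0; i < runs; i++) benchmark.run();
            double perRun = (System.nanoTime() - t0) / (double) runs;
            if (!Double.isNaN(previous) && Math.abs(perRun - previous) / previous < 0.05) {
                return perRun;
            }
            previous = perRun;
        }
        return previous;   // estimates never stabilized; return the last one
    }

    public static void main(String[] args) {
        System.out.printf("%.1f ns per run%n", perRunNanos(() -> Integer.parseInt("1234")));
    }
}
```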

ryanhamilton commented Dec 2, 2015

I mean this in the most productive way for the Julia project... What is the goal of these benchmarks?
Is it to benchmark how code written by a naive programmer in a Julia style would perform in other languages?

Because as a technical user, when I'm asked to evaluate and decide whether we should use Julia for part of our solution, I am going to dig into the technical details, and if I find the technical parts lacking I will be unimpressed.

If you don't have benchmarks for such a purpose, I advise making them one of your priorities so that:

  1. You know the effect of your changes on real-world performance. This system for PyPy is a good example: http://speed.pypy.org/timeline/ It gives me confidence that those guys have put thought into studying their progress at making their system faster.
  2. You can convince tech leads in companies to use Julia.

StefanKarpinski (Member) commented Dec 2, 2015

There's a paragraph right after the benchmarks table on the home page explaining their point:

These benchmarks, while not comprehensive, do test compiler performance on a range of common code patterns, such as function calls, string parsing, sorting, numerical loops, random number generation, and array operations. It is important to note that these benchmark implementations are not written for absolute maximal performance (the fastest code to compute fib(20) is the constant literal 6765). Rather, all of the benchmarks are written to test the performance of specific algorithms, expressed in a reasonable idiom in each language. In particular, all languages use the same algorithm: the Fibonacci benchmarks are all recursive while the pi summation benchmarks are all iterative; the “algorithm” for random matrix multiplication is to call LAPACK, except where that’s not possible, such as in JavaScript. The point of these benchmarks is to compare the performance of specific algorithms across language implementations, not to compare the fastest means of computing a result, which in most high-level languages relies on calling C code.

If you find that unconvincing or uninteresting, that's fine – there are lots of examples of real world use cases that we didn't design where Julia is close to C and Fortran in performance.

Keno (Member) commented Dec 2, 2015

Regarding the benchmarks for performance tracking, we used to have an installation of PyPy's speed center, but it proved somewhat unreliable and not very useful, so it was discontinued. @jrevels is currently working on an improved version of that to get performance-tracking CI back.

ryanhamilton commented Dec 2, 2015

@StefanKarpinski That text is extremely upfront and detailed. I'm impressed. I should also read better :)

That just leaves:

  • The issue of iterations
  • /dev/null for the printf tests. It's a special case that I think is handled differently on certain platforms compared to standard files.

nalimilan (Contributor) commented Dec 2, 2015

Objections about the purpose of the benchmarks keep being raised. Maybe we should put that paragraph before the benchmarks, with a warning in bold and red like "Please read the following disclaimer before interpreting these benchmarks." I'm afraid some people will conclude the Julia team is cheating, as has been claimed in some blog posts already.

jrevels (Member) commented Dec 2, 2015

> @jrevels is currently working on an improved version of that to get performance-tracking CI back.

Just for reference, the issue tracking the development of our CI performance testing system is #13893

ryanhamilton commented Dec 2, 2015

Closing as unlikely to get through. Will make smaller PRs instead.


ryanhamilton commented Dec 16, 2015

The benchmark issue popped up on Hacker News: https://news.ycombinator.com/item?id=10735840


bestsss commented Dec 16, 2015

Speaking of Java:
The quicksort's (hi+low)/2 doesn't account for integer overflow.
Fib: Java doesn't have tail call optimizations (and likely won't have them, as the stack trace is needed for the security manager), so using the naive fib in Java is OK if you wish to test recursion alone.
The parseInt benchmark is actually dominated by the int->String conversion, NOT by parsing the int.
A lot of timings depend on Random.nextXXX, which is thread-safe and involves a CAS on x86. Using ThreadLocalRandom is the preferred way, but that means losing determinism.

Overall the method body is way too big and the JVM will give up on many optimizations. I would not consider most of the code idiomatic.
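For reference, a small standalone illustration of the midpoint overflow (not code from the benchmark; the values are chosen only to trigger the wrap):

```java
public class Midpoint {
    public static void main(String[] args) {
        int lo = 2_000_000_000, hi = 2_100_000_000;    // lo + hi exceeds Integer.MAX_VALUE
        System.out.println((lo + hi) / 2);             // -97483648: the signed sum wrapped around
        System.out.println((lo + hi) >>> 1);           // 2050000000: unsigned shift repairs the wrap
        System.out.println(lo + (hi - lo) / 2);        // 2050000000: avoids the overflowing sum entirely
    }
}
```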

StefanKarpinski (Member) commented Dec 16, 2015

> Speaking of Java:
> The quicksort's (hi+low)/2 doesn't account for integer overflow.

We could change this to use >>> instead of integer division. I don't think it matters, however, since any compiler worth its salt will optimize integer division by 2 to an arithmetic right shift by one, which is just as fast.

> Fib: Java doesn't have tail call optimizations (and likely won't have them, as the stack trace is needed for the security manager), so using the naive fib in Java is OK if you wish to test recursion alone.

Testing recursion is the explicit purpose of that benchmark. "Tail call optimization" is not usually an optimization at all – it's often slower than just pushing a stack frame. Also, since that algorithm is doubly recursive, you can't eliminate the recursive calls entirely.

> The parseInt benchmark is actually dominated by the int->String conversion, NOT by parsing the int.

All of the languages are doing both, so that's fine.

> A lot of timings depend on Random.nextXXX, which is thread-safe and involves a CAS on x86. Using ThreadLocalRandom is the preferred way, but that means losing determinism.

If that avoids locking overhead, then we should do it. Why would using this make it non-deterministic?

> Overall the method body is way too big and the JVM will give up on many optimizations. I would not consider most of the code idiomatic.

What method body? This is all straightforward, purely static code using primitive data types and hardly any objects, so a compiler for a static language like Java should have no difficulty optimizing it.
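On the determinism question, a small illustration of what bestsss appears to mean (an assumption for illustration, not benchmark code): java.util.Random can be seeded, while ThreadLocalRandom cannot be reseeded.

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

public class RandomSeeding {
    public static void main(String[] args) {
        // java.util.Random can be seeded, so benchmark inputs are reproducible,
        // but its shared seed is updated with a CAS on every call.
        Random seeded = new Random(42);
        System.out.println(seeded.nextDouble());        // same value on every run

        // ThreadLocalRandom avoids that contention, but it cannot be reseeded:
        // calling setSeed after initialization throws UnsupportedOperationException.
        ThreadLocalRandom tlr = ThreadLocalRandom.current();
        System.out.println(tlr.nextDouble());           // different value on every run
    }
}
```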
