
Speed and accuracy improvements to java benchmark #14229

Closed
wants to merge 1 commit

Conversation

@ryanhamilton commented Dec 2, 2015

  • Increase the number of iterations to let the JIT compiler kick in. (This still isn't optimal: some calls are so quick, and the clocks so inaccurate, that our measurements are unreliable.)
  • Refactor PerfBlas to call the PerfPure static functions rather than repeating code in two places. Ideally all similar code should also be moved.
  • Use parseUnsignedInt rather than valueOf; it's more accurate and faster.
  • Remove the custom quicksort routine and use Arrays.sort. It's faster, does the same thing, and is more idiomatic.
  • Change printf to behave closer to the Julia code, i.e. use printf rather than string concatenation.
  • Replace the recursive Fibonacci with a loop; it's much faster.
  • Add OS detection so the tests can run on Windows. I recommend not using /dev/null at all: it's a special case on most platforms, so the test only shows performance for /dev/null, not files in general.
@jiahao (Member) commented Dec 2, 2015

Thanks for the improvements. However, the entire point of the Fibonacci benchmark is to have some idea for the cost of recursion.

@vchuravy (Member) commented Dec 2, 2015

W.r.t. the recursive fib, I think the idea is to implement the same algorithm in Julia and across languages, and the micro-benchmark in Julia uses recursion, so that you are comparing apples with apples. Same thing for quicksort.

@StefanKarpinski (Member) commented Dec 2, 2015

Ditto with the quicksort. We want to know how fast user code that shuffles array elements is, not how fast the hyper-optimized system quicksort is.

@StefanKarpinski (Member) commented Dec 2, 2015

The parseUnsignedInt change is also invalid since other languages parse the numbers as signed integers.
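For illustration, the behavioral difference is easy to see. A minimal sketch (class and method names here are mine, not from the benchmark code):

```java
class ParseSketch {
    // Integer.parseInt accepts negative input; Integer.parseUnsignedInt
    // rejects it, so swapping one for the other changes what is benchmarked.
    static int signed(String s) {
        return Integer.parseInt(s); // "-42" -> -42
    }

    static boolean unsignedRejectsNegative(String s) {
        try {
            Integer.parseUnsignedInt(s);
            return false;
        } catch (NumberFormatException e) {
            return true; // negative input is not a valid unsigned int
        }
    }
}
```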

@Keno (Member) commented Dec 2, 2015

Thanks for working on this @ryanhamilton. Having fair benchmarks is really important.

@StefanKarpinski (Member) commented Dec 2, 2015

Yes, thanks for the improvements, which appear to be these:

  • PerfPure/PerfBlas code deduplication
  • printf change
  • Windows portability

The number of JIT iterations is a bit debatable, but arguably not wrong. This is a tricky issue since some languages like C and Fortran require no JIT, Julia has a first-time-only JIT, and other systems like Java and JavaScript have JITs that kick in after an indeterminate amount of time. How many iterations are fair?

@timholy (Member) commented Dec 2, 2015

How many iterations are fair?

Agreed it's tricky, but I think the only clean answers are "the first iteration" or "in the asymptotic limit." Anything else feels pretty arbitrary.

Besides, I imagine Julia might go that way someday to decrease the cost of run-time compilation 😉.

@ryanhamilton (Author) commented Dec 2, 2015

@StefanKarpinski I would say: let it run 50 times and calculate the time taken per run; do it again for 70 runs and check whether you get a similar result, i.e. within 5%. Keep going until you reach a known confidence level.
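That scheme could be sketched roughly like this (the names, the 7/5 growth factor, and the run-count cap are my own choices, not anything from the benchmark suite): keep growing the run count until two consecutive per-run estimates agree within 5%.

```java
class WarmupSketch {
    // Time a task over `runs` iterations and return nanoseconds per run.
    static double timePerRun(Runnable task, int runs) {
        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - t0) / (double) runs;
    }

    // Grow the run count (50, 70, 98, ...) until two consecutive
    // estimates agree within 5%, or a cap is reached so this terminates.
    static double stableTimePerRun(Runnable task) {
        int runs = 50;
        double prev = timePerRun(task, runs);
        while (runs < 1_000_000) {
            runs = runs * 7 / 5;
            double cur = timePerRun(task, runs);
            if (prev > 0 && Math.abs(cur - prev) / prev < 0.05) {
                return cur;
            }
            prev = cur;
        }
        return prev;
    }
}
```

This still leaves the question open of whether the JIT has finished compiling by the time the estimate stabilizes; it only guarantees the measurement itself has settled.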

@ryanhamilton (Author) commented Dec 2, 2015

I mean this in the most productive way for the Julia project: what is the goal of these benchmarks?
Is it to benchmark how code written by a naive programmer in a Julia style would perform in other languages?

Because, as a technical user, when I'm asked to evaluate whether we should use Julia for part of our solution, I am going to dig into the technical details, and if I find them lacking I will be unimpressed.

If you don't have benchmarks for such a purpose, I advise making them one of your priorities so that:

  1. You know the effect of your changes on real-world performance. This system for PyPy is a good example: http://speed.pypy.org/timeline/ It gives me confidence that those guys have put thought into studying their progress at making their system faster.
  2. You can convince tech leads in companies to use Julia.

@StefanKarpinski (Member) commented Dec 2, 2015

There's a paragraph right after the benchmarks table on the home page explaining their point:

These benchmarks, while not comprehensive, do test compiler performance on a range of common code patterns, such as function calls, string parsing, sorting, numerical loops, random number generation, and array operations. It is important to note that these benchmark implementations are not written for absolute maximal performance (the fastest code to compute fib(20) is the constant literal 6765). Rather, all of the benchmarks are written to test the performance of specific algorithms, expressed in a reasonable idiom in each language. In particular, all languages use the same algorithm: the Fibonacci benchmarks are all recursive while the pi summation benchmarks are all iterative; the “algorithm” for random matrix multiplication is to call LAPACK, except where that’s not possible, such as in JavaScript. The point of these benchmarks is to compare the performance of specific algorithms across language implementations, not to compare the fastest means of computing a result, which in most high-level languages relies on calling C code.

If you find that unconvincing or uninteresting, that's fine – there are lots of examples of real world use cases that we didn't design where Julia is close to C and Fortran in performance.

@Keno (Member) commented Dec 2, 2015

Regarding the benchmarks for performance tracking, we used to have an installation of PyPy's speed center, but it proved somewhat unreliable and not very useful, so it was discontinued. @jrevels is currently working on an improved version of that to get performance-tracking CI back.

@ryanhamilton (Author) commented Dec 2, 2015

@StefanKarpinski That text is extremely upfront and detailed. I'm impressed. I should also read better :)

That just leaves:

  • The issue of iterations
  • /dev/null for the printf tests. It's a special case that I think is handled differently on certain platforms compared to standard files.

@nalimilan (Contributor) commented Dec 2, 2015

Objections about the purpose of the benchmarks keep being raised. Maybe we should put that paragraph before the benchmarks, with a warning in bold and red like "Please read the following disclaimer before interpreting these benchmarks." I'm afraid some people conclude the Julia team is cheating, as has already been claimed in some blog posts.

@jrevels (Member) commented Dec 2, 2015

@jrevels Is currently working on an improved version of that to get performance-tracking CI back.

Just for reference, the issue tracking the development of our CI performance testing system is #13893

@ryanhamilton (Author) commented Dec 2, 2015

Closing, as this is unlikely to get through. I'll make smaller PRs instead.

@ryanhamilton (Author) commented Dec 16, 2015

The benchmark issue popped up on Hacker News: https://news.ycombinator.com/item?id=10735840

@bestsss commented Dec 16, 2015

Speaking of Java:
The quicksort's (hi+low)/2 doesn't account for integer overflow.
Fib: Java doesn't have tail call optimization (and likely won't, as the stack trace is needed for the security manager). So using the naive fib in Java is fine if you wish to test recursion alone.
The parseInt benchmark is actually dominated by the int->String conversion, NOT by parsing the int.
A lot of the timings depend on Random.nextXXX, which is thread-safe and involves a CAS on x86. ThreadLocalRandom is the preferred way, but that means losing determinism.

Overall the method body is way too big, so the JVM will give up on many optimizations. I would not consider most of the code idiomatic.

@StefanKarpinski (Member) commented Dec 16, 2015

Speaking of Java:
quicksort (hi+low)/2 doesn't account for integer overflow.

We could change this to use >>> instead of integer division. I don't think it matters, however, since any compiler worth its salt will optimize integer division by 2 to an arithmetic right shift by one, which is just as fast.
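Concretely, the two midpoint forms look like this (a sketch with illustrative names, not the benchmark's actual code):

```java
class MidpointSketch {
    // (lo + hi) / 2 overflows when the sum exceeds Integer.MAX_VALUE,
    // producing a negative "midpoint".
    static int midOverflowing(int lo, int hi) {
        return (lo + hi) / 2;
    }

    // >>> 1 reinterprets the wrapped sum as unsigned, so the midpoint
    // stays correct for any non-negative lo <= hi.
    static int midSafe(int lo, int hi) {
        return (lo + hi) >>> 1;
    }
}
```

Both agree for small inputs; they only diverge once lo + hi wraps around.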

Fib: Java doesn't have tail call optimization (and likely won't, as the stack trace is needed for the security manager). So using the naive fib in Java is fine if you wish to test recursion alone.

Testing recursion is the explicit purpose of that benchmark. "Tail call optimization" is not usually an optimization at all – it's often slower than just pushing a stack frame. Also, since that algorithm is doubly recursive, you can't eliminate the recursive calls entirely.
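The doubly recursive shape in question is along these lines (a sketch of the algorithm the benchmark mandates, not the benchmark file itself):

```java
class FibSketch {
    // Two recursive calls per invocation: the first is not in tail
    // position at all, so TCO could not remove the recursion without
    // restructuring the algorithm into something else entirely.
    static int fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }
}
```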

The parseInt benchmark is actually dominated by the int->String conversion, NOT by parsing the int.

All of the languages are doing both, so that's fine.

A lot of the timings depend on Random.nextXXX, which is thread-safe and involves a CAS on x86. ThreadLocalRandom is the preferred way, but that means losing determinism.

If that avoids locking overhead, then we should do it. Why would using this make it non-deterministic?
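On the determinism point: a seeded java.util.Random replays the same sequence across runs, while ThreadLocalRandom exposes no usable seeding (its setSeed throws), so its stream differs from run to run. A minimal sketch (names are mine):

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

class RandomSketch {
    // A seeded Random replays the same sequence, which keeps benchmark
    // inputs identical across runs.
    static long firstDraw(long seed) {
        return new Random(seed).nextLong();
    }

    // ThreadLocalRandom avoids the shared-state CAS in Random, but it
    // cannot be seeded, so the input stream is no longer reproducible.
    static double tlrDraw() {
        return ThreadLocalRandom.current().nextDouble();
    }
}
```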

Overall the method body is way too big, so the JVM will give up on many optimizations. I would not consider most of the code idiomatic.

What method body? This is all straightforward purely static code using primitive data types, hardly any objects, so a compiler for a static language like Java should have no difficulty optimizing it.
