
Improve structure of join benchmarks #2707

Merged
merged 4 commits into JuliaData:main on Apr 13, 2021

Conversation

cjalmeida (Contributor)

Simple patch to improve the way the join benchmarks are structured.

They should run faster, and the new structure makes them easier to run in the REPL for profiling and improvement.
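
For a rough idea of the intended structure, here is a minimal sketch (not the actual benchmark code; the function name and data sizes are hypothetical) of a join benchmark wrapped as a plain function that can be called and profiled from the REPL:

```julia
using DataFrames, Random

# Hypothetical example of one benchmark as a callable function.
function innerjoin_benchmark(n::Integer)
    Random.seed!(1)
    df1 = DataFrame(id=rand(1:n, n), x=rand(n))
    df2 = DataFrame(id=rand(1:n, n), y=rand(n))
    @time innerjoin(df1, df2, on=:id)
end

# From the REPL, e.g.:
#   julia> using Profile
#   julia> innerjoin_benchmark(10^6)            # time interactively
#   julia> @profile innerjoin_benchmark(10^6)   # profile, re-run after edits via Revise
```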

bkamins (Member) commented Apr 10, 2021

Thank you for looking into this!

If I understand correctly, the key part of your proposal is that you want to run these tests in one Julia process. Is this correct?

This was unfortunately problematic when I developed the tests. I had to run each test in a fresh Julia session, as otherwise GC time influences the benchmarks too much. In particular, later tests get skewed timings because GC slows down as it has to sweep through millions of very small objects.

I initially had roughly what you propose, but had to change it because it was too unstable. Still, I used my approach during development (when there were many moving parts), and maybe now it would be OK to switch to what you propose. Does your code produce timings comparable to my code on the same machine for all tests? (And what machine are you using to run the comparisons? The amount of RAM might affect the design and results.)
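
For reference, the fresh-session workflow described here boils down to something like the sketch below (the file names are hypothetical; the actual driver is the run.sh script mentioned later in this thread). Each benchmark file runs in its own Julia process, so GC pressure from one test cannot skew the timings of the next.

```julia
# Sketch of running each benchmark file in a fresh Julia process
# (hypothetical file names).
for file in ("innerjoin_perf.jl", "leftjoin_perf.jl")
    run(`julia --project=. $file`)
end
```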

bkamins added this to the 1.x milestone on Apr 10, 2021
cjalmeida (Contributor, Author)

Hi @bkamins, I ran both the original and the new benchmark suite, and indeed the new suite is faster than the old one, but not by as much as I would expect (2h9m vs. 2h41m). Nevertheless, the new structure allows me to benchmark more easily from the REPL using Profile and Revise for quick interaction, and that's a win in my workflow :)

The test timings were roughly the same (as expected), but I did run them on a pretty beefy machine (128 GB RAM) that I have access to at work.

Since GC is an issue even after the multiple GC.gc calls, I would suspect RAM swapping to disk. Maybe try running it without swap?

bkamins (Member) commented Apr 12, 2021

Since GC is an issue even after the multiple GC.gc calls

This can be unstable, especially if one works with String columns (where you have millions of small objects). Unfortunately, running GC.gc multiple times does not guarantee a clean sweep. The crucial issue is that I did not find a good solution to the following problem described in the GC.gc docstring:

A full collection (default) sweeps all objects, which makes the next GC scan much slower, while an incremental collection may only sweep so-called young objects.

If you do not have enough RAM (I tested on smaller machines), then GC may get triggered within the join* calls, and then GC often takes more time than the join itself, because of how GC.gc works as described above.
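
As a small illustration of this effect (not code from the benchmark suite; the data sizes are arbitrary): @timed reports how much of a join's runtime went to collections triggered mid-call, and GC.gc(true) forces a full collection beforehand.

```julia
using DataFrames

# String keys mean millions of small heap objects, which is the hard case for GC.
df1 = DataFrame(id=string.(1:10^6), x=rand(10^6))
df2 = DataFrame(id=string.(1:10^6), y=rand(10^6))

GC.gc(true)                                  # full collection before timing
stats = @timed innerjoin(df1, df2, on=:id)
println("total: ", stats.time, " s, of which GC: ", stats.gctime, " s")
```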

Let me propose the following:

  1. keep the old code (for testing on smaller machines)
  2. add your code in a separate folder (for testing on bigger machines)

bkamins (Member) commented Apr 12, 2021

The point of the current design is to avoid GC as much as possible when timing the join* calls.

cjalmeida (Contributor, Author)

Makes sense. I'll keep the old behavior as the default and add a "--single-process" flag to runtests.jl to run all tests in a single process. Does that work?
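
A minimal sketch of what that branch in runtests.jl could look like (the file list and structure here are hypothetical; the actual implementation in the PR may differ):

```julia
# Hypothetical list of benchmark files.
const BENCHMARK_FILES = ["innerjoin_perf.jl", "leftjoin_perf.jl"]

if "--single-process" in ARGS
    # New option: run every benchmark inside this one Julia process.
    foreach(include, BENCHMARK_FILES)
else
    # Default: old behavior, a fresh Julia session per benchmark file.
    foreach(f -> run(`julia --project=. $f`), BENCHMARK_FILES)
end
```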

bkamins (Member) commented Apr 12, 2021

Whatever is easiest for you, but please keep the run.sh file for the old workflow as is. OK?

cjalmeida (Contributor, Author)

Done!

bkamins (Member) commented Apr 13, 2021

Thank you!

bkamins merged commit 9a2c4d2 into JuliaData:main on Apr 13, 2021