Longer running benchmarks #220

Open
kmcdermo opened this issue May 3, 2019 · 6 comments


kmcdermo commented May 3, 2019

As demonstrated in the test throughput studies from @makortel, running with a larger number of events reduces edge effects and improves parallel throughput performance.

Quoting Matti on the chat:

so on phi3 with 32 threads or jobs, the throughput of multithreading vs. multiprocessing is

  • 20 events/thread: 75 %
  • 120 events/thread: 94 %

with 64 threads or jobs, the same fractions are

  • 20 events/thread: 67 %
  • 120 events/thread: 92 %

It seems that it may be beneficial to rewrite part of the benchmarking scripts to use more events per thread to achieve higher parallel utilization. The question is: is this solely "forConf", to have our "best" results on display, or should we be doing this with every PR as well?

The case for every PR (although it will lengthen the time to run the benchmarking) is that compute performance gains and losses could be hiding behind this under-utilization in some systematic way. I should mention that we partially account for this when running the standard benchmarks: we drop the first event from the average build time, since we have seen that its time per event is in fact an order of magnitude different from the average. The question is whether, even after dropping this first event, the average time per event improves when processing more events.
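A minimal sketch of the amortization argument above, using a hypothetical timing model (the startup and per-event numbers are illustrative, not measured): with a fixed per-thread startup cost, the average time per event approaches the steady-state per-event time as the event count grows, which is why 120 events/thread utilizes the machine better than 20.

```python
# Hypothetical timing model: each thread pays a one-time startup cost,
# then a constant time per event. Numbers are illustrative, not measured.
def avg_time_per_event(n_events, startup=10.0, per_event=1.0):
    """Average build time per event, including the startup overhead."""
    return (startup + n_events * per_event) / n_events

# With more events, the startup cost is amortized away:
short_run = avg_time_per_event(20)   # (10 + 20) / 20  = 1.5
long_run = avg_time_per_event(120)   # (10 + 120) / 120 ≈ 1.08
assert long_run < short_run
```

Dropping the first event removes part of this overhead from the average, but any residual per-thread ramp-up still shrinks only as the event count grows.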

Let me know what you think (and who might want to tackle this).


slava77 commented May 3, 2019

What is the history of using 20 events?
Was this fine-tuned to KNL (256 threads, and our file size of 5K events back then)? Also, KNL is pretty slow per thread compared to phi3.
On phi3 we can do more in less time. Given the rewind capability, the total is no longer limited.


kmcdermo commented May 3, 2019

I think that, way back in the KNC days, this was a compromise between wall-clock time and throughput. Then, when we introduced the MEIF tests, it was, as you say, to fit within the 5K events of the binary file / 256 threads for KNL.

So, indeed, we can test what makes sense to do on phi3 (and also whether we want to use dd to keep the files in memory).


srlantz commented Aug 9, 2019

It may be interesting to see if the edge effects are now reduced, or even go away, with the new "performance" scaling governor setting on phi3 (issues #232 and #233).


slava77 commented Feb 20, 2020

We can now enable looping over the same file multiple times. This should remove the constraint from the total number of available input events.
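A minimal sketch of what such looping amounts to, under the assumption that "looping" simply wraps the requested event index back onto the file (the function and parameter names here are hypothetical, not from the actual code):

```python
# Hypothetical sketch of looping over the same input file: an event index
# past the end of the file wraps around, so a benchmark can request more
# events than the file actually contains.
def file_event_index(requested_event, file_nevents):
    """Map a requested event number onto a file holding file_nevents events."""
    return requested_event % file_nevents

# e.g. with a 5000-event file, requested event 7500 re-reads event 2500
assert file_event_index(7500, 5000) == 2500
```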


cerati commented Feb 20, 2020

In my CHEP19 area I have two different options in addition to the default:

  1. used nevents_meif=120.
  2. multiply nevents by nthreads (${nevents} => ${nevents} * ${nth}).

Both are trivial changes to xeon_scripts/benchmark-cmssw-ttbar-fulldet-build.sh.
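A rough sketch of how the two options might look inside xeon_scripts/benchmark-cmssw-ttbar-fulldet-build.sh; the ${nevents} / ${nth} names come from the comment above, while the concrete values and the echo are placeholders, not the script's actual logic:

```shell
#!/bin/bash
# Sketch only: variable names follow the ${nevents} / ${nth} notation above;
# the surrounding benchmark-script logic is assumed, not quoted.

nth=32        # thread count for this benchmark point
nevents=20    # default events per thread

# Option 1: fix the MEIF event count at 120
nevents_meif=120

# Option 2: scale the total event count with the thread count
nevents=$(( nevents * nth ))

echo "running with nth=${nth} nevents=${nevents} nevents_meif=${nevents_meif}"
```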

I should probably test these options and make a PR...

I am not sure if we need to loop over the same file multiple times.


osschar commented Feb 21, 2020

That's what I did for CHEP18 (N_events = 20 * N_threads, ignore N_meif): 84be84f
