Longer running benchmarks #220

Open
kmcdermo opened this issue May 3, 2019 · 6 comments


kmcdermo commented May 3, 2019

As demonstrated in the test throughput studies from @makortel, running with a larger number of events reduces edge effects and improves parallel throughput performance.

Quoting Matti on the chat:

so on phi3 with 32 threads or jobs, the throughput of multithreading vs. multiprocessing is

  • 20 events/thread: 75 %
  • 120 events/thread: 94 %

with 64 threads or jobs, the same fractions are

  • 20 events/thread: 67 %
  • 120 events/thread: 92 %

It seems that it may be beneficial to rewrite part of the benchmarking scripts to use more events per thread to achieve higher parallel utilization. The question is: is this solely "forConf", to have our "best" results on display, or should we be doing this with every PR as well?

The case for every PR (although it will lengthen the time to run the benchmarking) is that compute performance gains and losses could be hiding behind this under-utilization in some systematic way. I should mention that we partially account for this when running the standard benchmarks: we drop the first event from the average build time, since we have seen that its time per event is in fact an order of magnitude different from the average. The question is whether, even after dropping this first event, the average time per event improves when processing more events.
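A minimal sketch of the amortization argument above, using a hypothetical timing model (the startup and per-event numbers are illustrative, not measured): with a fixed per-thread startup cost, the average time per event approaches the steady-state per-event time as the event count grows, which is why 120 events/thread utilizes the machine better than 20.

```python
# Hypothetical timing model: each thread pays a one-time startup cost,
# then a constant time per event. Numbers are illustrative, not measured.
def avg_time_per_event(n_events, startup=10.0, per_event=1.0):
    """Average build time per event, including the startup overhead."""
    return (startup + n_events * per_event) / n_events

# With more events, the startup cost is amortized away:
short_run = avg_time_per_event(20)   # (10 + 20) / 20  = 1.5
long_run = avg_time_per_event(120)   # (10 + 120) / 120 ≈ 1.08
assert long_run < short_run
```

Dropping the first event removes part of this overhead from the average, but any residual per-thread ramp-up still shrinks only as the event count grows.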

Let me know what you think (and who might want to tackle this).


slava77 commented May 3, 2019

What is the history of using 20 events?
Was this fine-tuned to KNL (256 threads, and our file size of 5K events back then)? Also, KNL is pretty slow per thread compared to phi3.
On phi3 we can do more in less time. Given the rewind capability, the total is no longer limited.


kmcdermo commented May 3, 2019

I think that, way back in the KNC days, this was a compromise between wall-clock time and throughput. Then, when we introduced the MEIF tests, it was, as you say, to fit within the 5K events of the binary file / 256 threads for KNL.

So, indeed, we can test what makes sense to do on phi3 (and also whether we want to use dd to keep the files in memory).


srlantz commented Aug 9, 2019

It may be interesting to see if the edge effects are now reduced, or even go away, with the new "performance" scaling governor setting on phi3 (issues #232 and #233).


slava77 commented Feb 20, 2020

We can now enable looping over the same file multiple times. This should remove the constraint from the total number of available input events.
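A minimal sketch of what such looping amounts to, under the assumption that "looping" simply wraps the requested event index back onto the file (the function and parameter names here are hypothetical, not from the actual code):

```python
# Hypothetical sketch of looping over the same input file: an event index
# past the end of the file wraps around, so a benchmark can request more
# events than the file actually contains.
def file_event_index(requested_event, file_nevents):
    """Map a requested event number onto a file holding file_nevents events."""
    return requested_event % file_nevents

# e.g. with a 5000-event file, requested event 7500 re-reads event 2500
assert file_event_index(7500, 5000) == 2500
```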


cerati commented Feb 20, 2020

In my CHEP19 area I have two different options in addition to the default:

  1. used nevents_meif=120.
  2. multiply nevents by nthreads (${nevents} => ${nevents} * ${nth}).

Both are trivial changes to xeon_scripts/benchmark-cmssw-ttbar-fulldet-build.sh.
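A rough sketch of how the two options might look inside xeon_scripts/benchmark-cmssw-ttbar-fulldet-build.sh; the ${nevents} / ${nth} names come from the comment above, while the concrete values and the echo are placeholders, not the script's actual logic:

```shell
#!/bin/bash
# Sketch only: variable names follow the ${nevents} / ${nth} notation above;
# the surrounding benchmark-script logic is assumed, not quoted.

nth=32        # thread count for this benchmark point
nevents=20    # default events per thread

# Option 1: fix the MEIF event count at 120
nevents_meif=120

# Option 2: scale the total event count with the thread count
nevents=$(( nevents * nth ))

echo "running with nth=${nth} nevents=${nevents} nevents_meif=${nevents_meif}"
```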

I should probably test these options and make a PR...

I am not sure if we need to loop over the same file multiple times.


osschar commented Feb 21, 2020

That's what I did for CHEP18 (N_events = 20 * N_threads, ignore N_meif): 84be84f
