Efficient benchmarking strategy #2

Closed
e-kayrakli opened this issue Jun 15, 2017 · 5 comments

Comments

@e-kayrakli
Collaborator

Opened this issue just to capture some ideas as we discuss them.

My random "in an ideal world" thoughts on this are as follows:

  • Whatever script you develop will most likely be used only by you for the rest of the project, although you may reuse parts of it later. Therefore, you need to find a sweet spot between the time you spend on this script and the time it will save you. You can start small and add features as you go.
  • A script to run at least all of the microbenchmarks is a must. I think we should set a data size (or workload size, or whatever term makes sense) and stick with it, so as to avoid comparing apples and oranges.
  • It'd be nice if this script could log performance metrics along with the commit SHAs of your queue and of Chapel, so that we can revert if we notice a performance degradation (see the sketch after this list). This implies that you should use master and not the release. When we see a degradation, we can do an ad hoc test with the release to tell whether the issue is due to your changes or to a regression on master.
  • Plots help in understanding the results but are definitely not a must. If you want them, very basic pyplot charts should do the job and shouldn't take a lot of time.
  • Writing your script in a somewhat flexible fashion will help when/if we add new benchmarks.
  • In general, when I do something like this, I add support (if necessary) for running it locally and use it for correctness testing before performance testing. It helps save time by pruning segfaults, races, etc. before running on an actual machine.
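
For the logging item above, here is a minimal sketch of the kind of driver I mean (Python, since we may end up using pyplot anyway). Everything specific in it is an assumption on my part: the benchmark binary names, the `--n` workload flag, the idea that each benchmark prints a single elapsed-seconds number, and the use of `$CHPL_HOME` to locate the Chapel checkout. `-nl` is the usual Chapel flag for the number of locales; adjust the rest to whatever your binaries actually look like.

```python
#!/usr/bin/env python3
"""Rough sketch of a benchmark driver that logs results with commit SHAs.

Assumptions: each microbenchmark is an already-compiled Chapel binary that
prints a single elapsed-seconds value, the queue repo is the current
directory, and $CHPL_HOME points at the Chapel checkout we want to record.
"""
import csv, os, subprocess, time

BENCHMARKS = ["enqueueDequeue", "workStealing"]   # hypothetical binary names
WORKLOAD   = 1_000_000                            # fixed workload size, per the discussion
LOGFILE    = "perf_log.csv"

def git_sha(repo_dir):
    """Return the current commit SHA of a git checkout."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True).strip()

def run_benchmark(binary, num_locales):
    """Run one microbenchmark and return its reported elapsed time."""
    cmd = ["./" + binary, "-nl", str(num_locales), "--n={}".format(WORKLOAD)]
    out = subprocess.check_output(cmd, text=True)
    return float(out.strip())      # assumes the benchmark prints one number

def main():
    queue_sha  = git_sha(".")
    chapel_sha = git_sha(os.environ["CHPL_HOME"])
    new_file = not os.path.exists(LOGFILE)
    with open(LOGFILE, "a", newline="") as f:
        w = csv.writer(f)
        if new_file:
            w.writerow(["date", "queue_sha", "chapel_sha",
                        "benchmark", "locales", "seconds"])
        for bench in BENCHMARKS:
            for nl in (1, 2, 4, 8):
                secs = run_benchmark(bench, nl)
                w.writerow([time.strftime("%Y-%m-%d"), queue_sha,
                            chapel_sha, bench, nl, secs])

if __name__ == "__main__":
    main()
```

With a CSV like this, plotting later is just a matter of grouping rows by benchmark and SHA, so the plotting part stays optional.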
@e-kayrakli
Collaborator Author

To be clear: the list is not exhaustive. OTOH, some of the items are optional and relatively more time-consuming than others.

We can use this issue to discuss all of this in a more structured way.

@LouisJenkinsCS
Owner

All of the above definitely seem like great ideas for the script, especially the SHA-hash one. One addition I was thinking of: vary not only the number of locales but also the number of cores per locale. Currently I was planning on ramping up from 1 to 44 locales (with some variation in the step sizes in between), but I now think we should also test scalability as we add more CPU cores. The rationale is that there are really two bottlenecks here: communication from work stealing, and contention from concurrent use of the queue. If we used a simple two-lock sync-variable queue, increasing the number of cores wouldn't affect performance at all, but with CCLock it would, and that would be reflected in the benchmark. What do you think?
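
Roughly, I'm picturing the two sweep dimensions like this (the concrete values are placeholders; the core counts depend on what the machine exposes per node):

```python
# Tentative sweep dimensions; the exact values are guesses for now.
LOCALE_COUNTS    = [1, 2, 4, 8, 16, 32, 44]   # ramp up to the 44 nodes I expect to have
CORES_PER_LOCALE = [1, 2, 4, 8, 16, 22]       # hypothetical; depends on the hardware

configs = [(nl, c) for nl in LOCALE_COUNTS for c in CORES_PER_LOCALE]
print(len(configs), "configurations per benchmark")   # 7 * 6 = 42
```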

@e-kayrakli
Collaborator Author

Makes sense.

Just to keep our benchmarking overhead at bay, do you think it makes sense to run the varying-intranode-parallelism tests on a single locale only?

Another way to put it: is there anything new we can learn by running, say, a 2 threads/node, 44-locale test as opposed to a 2 threads/core, single-locale test?

@LouisJenkinsCS
Owner

Oh, that's one of the things I was thinking about: having, as you put it, intranode parallelism, where we'd have, say...

1 thread/core/node (Maximum parallelism)
2 threads/core/node (Oversubscribing)
4 threads/core/node (Over-Oversubscribing)

Furthermore, I was thinking of varying the actual number of cores per node as well:

1 core/node
2 core/node
4 core/node
8 core/node
...

This way it'd test all forms of scalability: do we gain performance by adding more cores to a node (sync-variable queue: no, CCLock: yes)? Do we gain performance by adding locales (both: yes)? Do we gain performance by oversubscribing? Etc.
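
To make the combinatorics concrete, here is a rough sketch of how I'd enumerate the matrix. The `CHPL_RT_NUM_THREADS_PER_LOCALE` part is my assumption about how to express cores/node and threads/core to the Chapel runtime (we'd want to double-check how it pins threads before trusting the oversubscription numbers), and the binary name is just a placeholder:

```python
# Enumerate the (locales, cores/node, threads/core) matrix and show how a
# single configuration might be launched. Everything here is tentative.
import itertools, os, subprocess

LOCALES          = [1, 2, 4, 8, 16, 32, 44]
CORES_PER_NODE   = [1, 2, 4, 8, 16, 22]
THREADS_PER_CORE = [1, 2, 4]      # 1 = max parallelism, 2/4 = oversubscribed

def run_config(binary, nl, cores, tpc):
    # Assumption: total threads per locale = cores/node * threads/core,
    # passed to the runtime via CHPL_RT_NUM_THREADS_PER_LOCALE.
    env = dict(os.environ, CHPL_RT_NUM_THREADS_PER_LOCALE=str(cores * tpc))
    cmd = ["./" + binary, "-nl", str(nl)]     # -nl = number of locales
    return subprocess.check_output(cmd, env=env, text=True)

if __name__ == "__main__":
    combos = list(itertools.product(LOCALES, CORES_PER_NODE, THREADS_PER_CORE))
    print(len(combos), "configurations per benchmark")   # 7 * 6 * 3 = 126
    # for nl, cores, tpc in combos:
    #     run_config("enqueueDequeue", nl, cores, tpc)    # hypothetical binary
```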

Did I answer the question?

@e-kayrakli
Collaborator Author

1 core/node
2 core/node
4 core/node
8 core/node

If you combine this test with an increasing number of nodes, it may be a bit redundant. But that's just a hunch; I may well be wrong.
