2D thread distribution for multi-threaded GEMMs #1320

Merged: 3 commits, Oct 8, 2017

timmoon10
Contributor

This is a refinement on #1316 that attempts to maximize thread utilization when a matrix dimension is small. In particular, all threads will be active if M is small and N is large (or vice versa).

Below are some scaling experiments on a 24-core system with hyper-threading (2 x Intel Xeon E5-2695v2, gcc 4.9.3, TARGET=SANDYBRIDGE USE_OPENMP=1 NUM_THREADS=48 INTERFACE64=0). "Unpatched" refers to 00c42dc, "Patch v1" to a89d671, and "Patch v2" to 30486a3.

Varying N: [figure: n_sgemm]

Varying M: [figure: m_sgemm]

Tim Moon added 3 commits October 3, 2017 13:43
@martin-frbg
Collaborator

Impressive, though it is somewhat intriguing to see the first sgemm test "lose" a quarter of its just-gained (v1) performance again at N=130 (and also at N=65 and N=32/33). I wonder if it would make sense to add special cases for such hardware-specific sweet spots at some point?

@timmoon10
Contributor Author

For v1, the number of threads is N/4. It seems we get the best performance when the number of threads is a power of 2 and we take a hit when we cross that boundary. For v2, we use all available threads from the get-go, so we don't see this effect.
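
A minimal sketch of the v1 behaviour described above, for illustration only. It assumes the thread count is N/4 with integer (floor) division, clamped to the available threads; the exact rounding used in the patch is not stated in this thread. The helper names `v1_thread_count` and `is_power_of_two` are hypothetical and are not OpenBLAS functions.

```c
/* Hypothetical helper: v1-style thread count.
 * Assumption: threads = n / 4 (floor), clamped to [1, max_threads]. */
static int v1_thread_count(int n, int max_threads) {
    int t = n / 4;
    if (t < 1) t = 1;
    if (t > max_threads) t = max_threads;
    return t;
}

/* Returns 1 when t is a positive power of two, 0 otherwise. */
static int is_power_of_two(int t) {
    return t > 0 && (t & (t - 1)) == 0;
}
```

Under these assumptions, `v1_thread_count(128, 48)` gives 32 threads (a power of two), while `v1_thread_count(132, 48)` gives 33, just past the boundary. Once N/4 reaches the thread limit, the count stays at 48.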

@timmoon10
Contributor Author

We could probably add some hardware-dependent macros to determine how we partition the computation. Right now, I use SWITCH_RATIO to determine partitions in the M and N dimensions, but it may be worthwhile experimenting with separate values.
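
To make the idea of separate values concrete, here is a rough sketch of a 2D thread-grid heuristic with one ratio per dimension. It is a simplified illustration, not the logic in this patch: `SWITCH_RATIO_M`, `SWITCH_RATIO_N`, `thread_grid`, and `choose_grid` are hypothetical names, and the default value of 4 is a placeholder rather than a tuned setting.

```c
/* Hypothetical tuning knobs. The patch uses a single SWITCH_RATIO for both
 * dimensions; splitting it in two is the experiment suggested above.
 * The value 4 is a placeholder, not a measured optimum. */
#ifndef SWITCH_RATIO_M
#define SWITCH_RATIO_M 4
#endif
#ifndef SWITCH_RATIO_N
#define SWITCH_RATIO_N 4
#endif

typedef struct {
    int threads_m; /* partitions along M */
    int threads_n; /* partitions along N */
} thread_grid;

/* Choose a threads_m x threads_n grid with threads_m * threads_n <= nthreads.
 * Assumes nthreads >= 1. */
static thread_grid choose_grid(int m, int n, int nthreads) {
    thread_grid g = {1, 1};

    /* M: start from all threads and halve until every partition
     * has at least SWITCH_RATIO_M rows (or only one partition is left). */
    int tm = nthreads;
    while (tm > 1 && m < tm * SWITCH_RATIO_M) {
        tm /= 2;
    }
    g.threads_m = tm;

    /* N: hand the remaining threads to N, but keep at least
     * SWITCH_RATIO_N columns per partition. */
    int tn = nthreads / tm;
    int max_tn = n / SWITCH_RATIO_N;
    if (max_tn < 1) max_tn = 1;
    if (tn > max_tn) tn = max_tn;
    g.threads_n = tn;

    return g;
}
```

With 48 threads, `choose_grid(8, 4096, 48)` yields a 1 x 48 grid and `choose_grid(4096, 8, 48)` yields 48 x 1, so all threads stay busy when one dimension is small. Because the M count is found by halving, some valid intermediate counts are skipped; a hardware-specific variant could search differently.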

@brada4
Contributor

brada4 commented Oct 4, 2017

If you look at early marketing diagrams, it is not n^2 but clusters of 4 cores with somewhat more adjacent / shareable caches.
