First attempt at multithreading: thread loop 3 only #17
Conversation
Force-pushed from da3a48f to 706578d.
Codecov Report

```diff
@@            Coverage Diff            @@
##            master       #17   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files            9        10    +1
  Lines          109       122   +13
=========================================
+ Hits           109       122   +13
```

Continue to review full report at Codecov.
Actually, it looks like some of the CI jobs fail even if we only multithread loop 3.
@chriselrod @MasonProtter What do you think? Is this the right approach?
src/matmul.jl (Outdated)
```julia
Bblock = PointerMatrix(Bptr, (ksize, nsize))
unsafe_copyto_avx!(Bblock, Bview)
matmul_loop3!(C, Aptr, A, Bblock, α, β, ksize, nsize, M, k, n, Mc)
```
Unless I'm missing something, Bptr here is the same across threads, meaning each thread will be trying to read from and write to the same block of memory.
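For illustration, here is a minimal sketch of the kind of race being described; the names and sizes are illustrative, not the PR's actual code. Every task "packs" into the same shared buffer, so a later task can overwrite data an earlier task is still using:

```julia
using Base.Threads

# Sketch only: one buffer shared by every task, as with the shared Bptr above.
function shared_buffer_race(ntasks = nthreads())
    buffer = Vector{Float64}(undef, 8)   # single packing buffer shared by all tasks
    @sync for k in 1:ntasks
        @spawn begin
            fill!(buffer, k)             # "pack" this task's block
            sleep(0.01)                  # give other tasks time to repack over it
            all(==(k), buffer) || @warn "task $k read data packed by another task"
        end
    end
end

shared_buffer_race()
```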
That would explain the failure when you multithread loop 5, but I'm not sure why you'd have a problem when threading loop 3.

There are three issues I'm currently trying to address with multithreading matmul in PaddedMatrices: […]

A little more detail on each:

Packing: […]
That sounds like it probably would be the simplest approach. And in this case, it would never be the case that two different tasks are trying to write to the same memory, right?
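A hedged sketch of that "each task owns its buffer" idea, with the packing step stubbed out and hypothetical names rather than the PR's API:

```julia
using Base.Threads

# Each spawned task allocates and owns its buffer, so no two tasks
# ever write to the same memory.
function per_task_buffers(nblocks = 8, blocklen = 64)
    @sync for j in 1:nblocks
        @spawn begin
            buf = Vector{Float64}(undef, blocklen)   # task-local packing buffer
            fill!(buf, j)                            # "pack" block j
            # ...this task's multiply would read only from its own `buf`...
        end
    end
end

per_task_buffers()
```

In a real implementation you would presumably preallocate one buffer per thread or task rather than allocating inside the hot loop; the point here is only the ownership.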
What bugs me about this PR is that tests are failing even when we only thread loop 3.
Also, working under the "one task per thread" idea, we should think about how we handle the oversubscription case. On Julia master, if I set the […]

Of course, the converse case is that if I set the […]

So, maybe, we want to define […]
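One possible shape for that, sketched under stated assumptions (the constant and helper names are hypothetical, and `Sys.CPU_THREADS` counts logical CPUs, not physical cores, so a real implementation might query Hwloc.jl or VectorizationBase.jl instead):

```julia
using Base.Threads

# Placeholder assumption: treat logical CPUs as the core count.
const NUM_PHYSICAL_CORES = Sys.CPU_THREADS

# Never spawn more packing tasks than we have blocks, Julia threads, or cores.
ntasks_for(nblocks) = min(nblocks, nthreads(), NUM_PHYSICAL_CORES)

ntasks_for(16)   # number of tasks to actually spawn for 16 blocks
```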
Unless we made a mistake implementing the algorithm. ;)
Yes, but it'd be better to use […]

Note that we only have one L2 cache per physical core. So if we wanted to use more threads than we have L2 caches, we would have to reduce the size of our packed A matrix (meaning we'd have to reduce […])
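A rough sketch of that cache argument (the 512 KiB L2 figure is an assumption, not a measured value): the packed block of A is sized to fit in L2, and there is one L2 per physical core, so threads sharing a core must also share, i.e. shrink, that block.

```julia
const L2_BYTES = 512 * 1024   # assumed per-core L2 size; hardware-dependent

# How many elements of the packed A block fit per thread sharing one L2.
packed_A_elems(::Type{T}, threads_per_core) where {T} =
    (L2_BYTES ÷ threads_per_core) ÷ sizeof(T)

packed_A_elems(Float64, 1)   # full-size packed block: one thread per core
packed_A_elems(Float64, 2)   # half-size block: two threads sharing one L2
```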
Oops, same mistake as with the packed-B. We define […]

I'd just move that definition (and the […])
Force-pushed from e97e5cc to 4c905f6.
Force-pushed from d275395 to b16a2ea.
Hmmm, I've done that, but I am still getting test failures.
Force-pushed from b16a2ea to 5e0387c.
The code is currently still threading loop 5, so you still have the […]
Force-pushed from 5e0387c to d70a6c4.
Should hopefully be working now.
Force-pushed from d70a6c4 to 1a38653.
Maybe we should make some of the test matrices a little smaller again 😂
This PR currently only threads loop 3.
https://github.com/flame/blis/blob/master/docs/Multithreading.md
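For reference, here is a rough, hedged sketch of the BLIS-style five-loop nest described in that document, with only loop 3 (the Mc-blocking loop over A) parallelized, as in this PR. It is not the package's actual implementation: the block sizes are illustrative defaults, the packing is just an array copy, and loops 2, 1 and the microkernel are collapsed into a plain matmul. The sketch assumes C already holds the β·C contribution (here it is zero-initialized), so it only accumulates.

```julia
using Base.Threads

function gemm_loop3_threaded!(C, A, B; Mc = 72, Kc = 256, Nc = 4080)
    M, K, N = size(A, 1), size(A, 2), size(B, 2)
    for jc in 1:Nc:N                          # loop 5: Nc-wide column panels of B and C
        nc = min(Nc, N - jc + 1)
        for pc in 1:Kc:K                      # loop 4: "pack" a kc × nc panel of B
            kc = min(Kc, K - pc + 1)
            Bpanel = B[pc:pc+kc-1, jc:jc+nc-1]
            @sync for ic in 1:Mc:M            # loop 3: threaded over Mc-tall blocks of A
                @spawn begin
                    mc = min(Mc, M - ic + 1)
                    Ablock = A[ic:ic+mc-1, pc:pc+kc-1]   # task-local "packed" A block
                    # loops 2, 1 and the microkernel collapsed into one matmul;
                    # distinct ic ranges mean no two tasks touch the same rows of C.
                    C[ic:ic+mc-1, jc:jc+nc-1] .+= Ablock * Bpanel
                end
            end
        end
    end
    return C
end

A, B = rand(200, 150), rand(150, 180)
C = zeros(200, 180)
gemm_loop3_threaded!(C, A, B)
C ≈ A * B    # true, up to floating-point rounding
```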