
Conversation

@DilumAluthge (Member) commented Dec 31, 2020

@codecov (bot) commented Dec 31, 2020

Codecov Report

Merging #17 (1a38653) into master (5975d5c) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##            master       #17   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            9        10    +1     
  Lines          109       122   +13     
=========================================
+ Hits           109       122   +13     
Impacted Files        Coverage Δ
src/Octavian.jl       100.00% <ø> (ø)
src/block_sizes.jl    100.00% <ø> (ø)
src/init.jl           100.00% <100.00%> (ø)
src/matmul.jl         100.00% <100.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@DilumAluthge (Member, Author)

Actually, looks like some of the CI jobs fail even if we only multithread loop 3.

@DilumAluthge (Member, Author) commented Dec 31, 2020

@chriselrod @MasonProtter What do you think? Is this the right approach?

src/matmul.jl Outdated
Comment on lines 50 to 52
Bblock = PointerMatrix(Bptr, (ksize,nsize))
unsafe_copyto_avx!(Bblock, Bview)
matmul_loop3!(C, Aptr, A, Bblock, α, β, ksize, nsize, M, k, n, Mc)
Collaborator

Unless I'm missing something, Bptr here is the same across threads. Meaning each thread will be trying to read from and write to the same block of memory.

Collaborator

That would explain the failure when you multithread loop 5, but I'm not sure why you'd have a problem when threading loop 3.

@chriselrod (Collaborator)

There are three issues I'm currently trying to address with multithreading matmul in PaddedMatrices:

  1. Packing B.
  2. Good utilization.
  3. Overhead.

A little more detail on each:

Packing B

At the time I'm writing this, there's a bug in the PR where multiple threads will try to read from and write to the same block of memory when packing B.
One solution is to just give each thread a separate, offset, pointer so that the regions they will be reading/writing do not overlap.
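A minimal sketch of that per-thread-offset idea (the buffer, block sizes, and the packing step below are illustrative stand-ins, not this PR's actual code):

using Base.Threads

function pack_blocks_threaded!(buffer, ksize, nsize)
    GC.@preserve buffer begin
        baseptr = pointer(buffer)
        @threads for tid in 1:nthreads()
            # each iteration gets a disjoint region, so reads and writes never overlap
            Bptr_tid = baseptr + (tid - 1) * ksize * nsize * sizeof(Float64)
            unsafe_store!(Bptr_tid, Float64(tid))   # stand-in for the real packing copy
        end
    end
    return buffer
end

ksize, nsize = 128, 64                                       # illustrative block sizes
buffer = Vector{Float64}(undef, ksize * nsize * nthreads())  # one region per thread
pack_blocks_threaded!(buffer, ksize, nsize)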

However, this is less than ideal.

The packed B is supposed to fit in the L3 cache. The L3 cache is normally shared among multiple cores. The most common situation is a single L3 cache shared by all the cores, although on Zen 2 and Zen 3 CPUs an L3 cache is shared by clusters of 4 and 8 cores, respectively. And obviously a dual-socket system will have a separate L3 for each socket.

But in each of these, the L3 is still shared by some number of cores.
Meaning, if we were to give each of these cores a separate region of memory, that wouldn't mean giving each of them its own Bblock; it would mean dividing the Bblock into pieces and giving each of them a much smaller Bblock.
If the L3 is shared by 4 cores, N_c would be divided by 4; if it's shared by 18, it'd be divided by 18.

The number of times we execute the M and K loops, which includes packing A, would then increase by the same factor.
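As a rough worked example of that cost (the numbers are illustrative, not Octavian's tuned block sizes):

N_c_full     = 4096                      # packed-B width if one block gets the whole L3
cores_per_L3 = 8
N_c_split    = N_c_full ÷ cores_per_L3   # each core's much smaller Bblock: 512 columns
N            = 32_768                    # illustrative problem size along N
cld(N, N_c_full)                         # 8  passes over the M and K loops
cld(N, N_c_split)                        # 64 passes -- packing A happens 8x as often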

A better solution would be to share the packed Bblock among the threads.
It would make sense for every thread on a different block of the M partition to use the same full-sized Bblock.
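Sketched minimally (this is just the shape of the idea, with a plain mul! standing in for the real packing and microkernel, not Octavian's actual loop structure):

using Base.Threads, LinearAlgebra

function matmul_loop3_threaded!(C, A, Bblock, Mc)
    M = size(A, 1)
    @sync for m in 1:Mc:M
        mrange = m:min(m + Mc - 1, M)
        @spawn begin
            # every task reads the same shared Bblock,
            # but writes its own disjoint row-block of C
            @views mul!(C[mrange, :], A[mrange, :], Bblock)
        end
    end
    return C
end

C = zeros(256, 96); A = rand(256, 100); Bblock = rand(100, 96);
matmul_loop3_threaded!(C, A, Bblock, 72)
C ≈ A * Bblock   # true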

Of course, that's what you already get when only threading loop 3, which brings us to

Good utilization

This means we should favor loop 3 over loop 5. But of course, if loop 3 is too short relative to how many cores the CPU has, it'd probably be faster to thread loop 5 too.
As an extreme example, say we have

using Octavian
A = rand(64,100);
B = rand(100,96);
C = Matrix{Float64}(undef, size(A,1), size(B,2));
Octavian.matmul!(C, A, B)

on a CPU with AVX2 and 8 cores.
These matrices are small enough that we probably do not want to pack at all.

We probably have M_c = 72 and N_c > 1000. So does that mean we should evaluate it with a single thread?
Or that, if A had 144 rows instead, we'd want to use just 2 threads?

No, in this example we may want to partition A into 16x100 blocks, and B into 100x24 blocks, and use all 8 threads.
(Unless threading overhead becomes problematic.)

Alternatively, what if A and B were each 10_000x10_000?
Would we want to spawn >1000 tasks (10_000 / 72 * 10_000 / 1000 ≈ 1_389 -- or many more, if we consider the reduced N_c we get by splitting up Bpacked)?
No, that'd incur a lot of scheduling overhead.
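Putting numbers on those two extremes (using the illustrative M_c = 72 and N_c = 1000 from above):

M_c, N_c = 72, 1000
ntasks(M, N) = cld(M, M_c) * cld(N, N_c)   # tasks if every (M_c, N_c) tile were its own task
ntasks(64, 96)                             # 1    -- too coarse to keep 8 cores busy
ntasks(10_000, 10_000)                     # 1390 -- far too many tasks to spawn one by one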

Overhead

Instead, we probably want just 1 task per thread, and to figure out what each of these tasks should be doing based on the dimensions of our inputs.
If they're small, we want to cut them into little pieces. If they're big, we probably only want to parallelize loop 3, and to have just 1 task per thread that itself iterates over the M_c-sized chunks.
Ideally, we also wouldn't have to spawn new tasks for every N_c iteration, and would instead synchronize using spinlocks, for example.
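A minimal sketch of that one-task-per-thread partitioning (no packing and no spinlock synchronization here; mul! stands in for the real blocked kernel):

using Base.Threads, LinearAlgebra

function matmul_one_task_per_thread!(C, A, B, Mc)
    M  = size(A, 1)
    nt = nthreads()
    @sync for tid in 1:nt
        # contiguous slice of the M dimension owned by this task
        lo = div((tid - 1) * M, nt) + 1
        hi = div(tid * M, nt)
        @spawn for m in lo:Mc:hi
            mrange = m:min(m + Mc - 1, hi)
            # each task iterates over its own M_c-sized chunks
            @views mul!(C[mrange, :], A[mrange, :], B)
        end
    end
    return C
end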

@DilumAluthge (Member, Author)

Instead, we probably want just 1 task per thread, and to figure out what each of these tasks should be doing based on the dimensions of our inputs.

That sounds like it probably would be the simplest approach. And in this case, it would never be the case that two different tasks are trying to write to the same memory, right?

@DilumAluthge (Member, Author)

Of course, that's what you already get when only threading loop 3, which brings us to

What bugs me about this PR is that tests are failing even when we only thread loop 3.

@DilumAluthge (Member, Author)

Also, working under the "one task per thread" idea, we should think about how we handle the oversubscription case.

On Julia master, if I set the JULIA_NUM_THREADS variable to 99, it will give me 99 threads even if the CPU only has 8 CPU threads.

Of course, the converse case is that if I set the JULIA_NUM_THREADS variable to 1, it will give me one thread even if the CPU has 8 threads.

So, maybe, we want to define N = max(Threads.nthreads(), NUM_OF_CPU_THREADS). And then we spawn N tasks. Would that make sense?
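For reference, the two quantities being compared are available as (NUM_OF_CPU_THREADS above is a placeholder name, not a real constant):

Threads.nthreads()   # number of Julia threads, controlled by JULIA_NUM_THREADS
Sys.CPU_THREADS      # number of logical CPU threads the machine reports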

@chriselrod (Collaborator) commented Dec 31, 2020

That sounds like it probably would be the simplest approach. And in this case, it would never be the case that two different tasks are trying to write to the same memory, right?

Unless we made a mistake implementing the algorithm. ;)

Also, working under the "one task per thread" idea, we should think about how we handle the oversubscription case.

On Julia master, if I set the JULIA_NUM_THREADS variable to 99, it will give me 99 threads even if the CPU only has 8 CPU threads.

So, maybe, we want to define N = max(Threads.nthreads(), NUM_OF_CPU_THREADS). And then we spawn N tasks. Would that make sense?

Yes, but it'd be better to use VectorizationBase.NUM_CORES than Sys.CPU_THREADS.

Note that we only have one L2 cache per physical core. So if we wanted to use more threads than we have L2 caches, we would have to reduce the size of our packed A matrix (meaning we'd have to reduce M_c * K_c).
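Rough arithmetic for why (the cache and block sizes below are illustrative, not values queried from any particular CPU):

L2_bytes      = 512 * 1024                   # a typical per-core L2
M_c, K_c      = 72, 512                      # illustrative packed-A block dimensions
packedA_bytes = M_c * K_c * sizeof(Float64)  # 294_912 bytes ≈ 288 KiB: fits in one L2
2 * packedA_bytes > L2_bytes                 # true -- two tasks sharing this L2 would not
                                             # fit, so M_c * K_c would have to shrink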

@chriselrod (Collaborator) commented Dec 31, 2020

What bugs me about this PR is that tests are failing even when we only thread loop 3.

Oops, same mistake as with the packed-B. We define Amem at the top, so we also have only 1 for all the threads, making them try to read/write to it at the same time, just like the bug with packed-B.

I'd just move that definition (and the GC.@preserve) inside the @spawn of the 3rd loop.
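A hedged sketch of that fix (the block sizes and the packing step are illustrative stand-ins for the PR's actual code):

using Base.Threads

function loop3_with_task_local_Amem(M, Mc, Kc)
    @sync for m in 1:Mc:M
        @spawn begin
            # the A-packing buffer now lives inside the task, so no two tasks share it
            Amem = Vector{Float64}(undef, Mc * Kc)
            GC.@preserve Amem begin
                Aptr = pointer(Amem)
                unsafe_store!(Aptr, Float64(m))   # stand-in for packing A through Aptr
            end
        end
    end
end

loop3_with_task_local_Amem(256, 64, 100)   # illustrative sizes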

@DilumAluthge DilumAluthge force-pushed the dpa/multithreading branch 2 times, most recently from e97e5cc to 4c905f6 on January 1, 2021 at 00:48
@DilumAluthge DilumAluthge changed the title from "First attempt at multithreading" to "First attempt at multithreading: thread loop 3 only" on Jan 1, 2021
@DilumAluthge DilumAluthge force-pushed the dpa/multithreading branch 2 times, most recently from d275395 to b16a2ea on January 1, 2021 at 00:59
@DilumAluthge (Member, Author)

What bugs me about this PR is that tests are failing even when we only thread loop 3.

Oops, same mistake as with the packed-B. We define Amem at the top, so we also have only 1 for all the threads, making them try to read/write to it at the same time, just like the bug with packed-B.

I'd just move that definition (and the GC.@preserve) inside the @spawn of the 3rd loop.

Hmmm, I've done that, but I am still getting test failures.

@chriselrod (Collaborator) commented Jan 1, 2021

The code is currently still threading loop 5, so you still have the Bptr problem.

@DilumAluthge (Member, Author)

Should hopefully be working now.

@DilumAluthge DilumAluthge marked this pull request as ready for review January 1, 2021 01:41
@DilumAluthge DilumAluthge enabled auto-merge (squash) January 1, 2021 01:41
@DilumAluthge (Member, Author)

[image attachment]

@DilumAluthge (Member, Author)

Maybe we should make some of the test matrices a little smaller again 😂

@DilumAluthge DilumAluthge disabled auto-merge January 1, 2021 03:27
@DilumAluthge DilumAluthge merged commit 510659f into master Jan 1, 2021
@DilumAluthge DilumAluthge deleted the dpa/multithreading branch January 1, 2021 03:27