
Conversation

@DilumAluthge (Member) commented Dec 31, 2020

@codecov (bot) commented Dec 31, 2020

Codecov Report

Merging #17 (1a38653) into master (5975d5c) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##            master       #17   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            9        10    +1     
  Lines          109       122   +13     
=========================================
+ Hits           109       122   +13     
Impacted Files        Coverage Δ
src/Octavian.jl       100.00% <ø> (ø)
src/block_sizes.jl    100.00% <ø> (ø)
src/init.jl           100.00% <100.00%> (ø)
src/matmul.jl         100.00% <100.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@DilumAluthge (Member, Author)

Actually, looks like some of the CI jobs fail even if we only multithread loop 3.

@DilumAluthge (Member, Author) commented Dec 31, 2020

@chriselrod @MasonProtter What do you think? Is this the right approach?

src/matmul.jl Outdated
Comment on lines 50 to 52
Bblock = PointerMatrix(Bptr, (ksize,nsize))
unsafe_copyto_avx!(Bblock, Bview)
matmul_loop3!(C, Aptr, A, Bblock, α, β, ksize, nsize, M, k, n, Mc)
Collaborator

Unless I'm missing something, Bptr here is the same across threads. Meaning each thread will be trying to read from and write to the same block of memory.

Collaborator

That would explain the failure when you multithread loop 5, but I'm not sure why you'd have a problem when threading loop 3.

@chriselrod (Collaborator)

There are three issues I'm currently trying to address with multithreading matmul in PaddedMatrices:

  1. Packing B.
  2. Good utilization.
  3. Overhead.

A little more detail on each:

Packing B

At the time I'm writing this, there's a bug in the PR where multiple threads will try to read from and write to the same block of memory when packing B.
One solution is to just give each thread a separate, offset, pointer so that the regions they will be reading/writing do not overlap.
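A minimal sketch of that per-thread-offset idea (the buffer, block sizes, and the packing step below are illustrative stand-ins, not this PR's actual code):

using Base.Threads

function pack_blocks_threaded!(buffer, ksize, nsize)
    GC.@preserve buffer begin
        baseptr = pointer(buffer)
        @threads for tid in 1:nthreads()
            # each iteration gets a disjoint region, so reads and writes never overlap
            Bptr_tid = baseptr + (tid - 1) * ksize * nsize * sizeof(Float64)
            unsafe_store!(Bptr_tid, Float64(tid))   # stand-in for the real packing copy
        end
    end
    return buffer
end

ksize, nsize = 128, 64                                       # illustrative block sizes
buffer = Vector{Float64}(undef, ksize * nsize * nthreads())  # one region per thread
pack_blocks_threaded!(buffer, ksize, nsize)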

However, this is less than ideal.

The packed B is supposed to fit in the L3 cache. The L3 cache is normally shared among multiple cores. The most common situation is a single L3 cache shared by all the cores, although on Zen 2 and Zen 3 CPUs an L3 cache is shared by clusters of 4 and 8 cores, respectively. And obviously a dual-socket system will have a separate L3 for each socket.

But in each of these, the L3 is still shared by some number of cores.
Meaning, if we were to give each of these cores a separate region of memory, that wouldn't mean giving each of them its own Bblock; it would mean dividing the Bblock into pieces and giving each of them a much smaller Bblock.
If the L3 is shared by 4 cores, N_c would be divided by 4; if it's shared by 18, it'd be divided by 18.

The number of times we execute the M and K loops, which includes packing A, would then increase by the same factor.
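As a rough worked example of that cost (the numbers are illustrative, not Octavian's tuned block sizes):

N_c_full     = 4096                      # packed-B width if one block gets the whole L3
cores_per_L3 = 8
N_c_split    = N_c_full ÷ cores_per_L3   # each core's much smaller Bblock: 512 columns
N            = 32_768                    # illustrative problem size along N
cld(N, N_c_full)                         # 8  passes over the M and K loops
cld(N, N_c_split)                        # 64 passes -- packing A happens 8x as often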

A better solution would be to share the packed Bblock among the threads.
It would make sense for every thread on a different block of the M partition to use the same full-sized Bblock.
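Sketched minimally (this is just the shape of the idea, with a plain mul! standing in for the real packing and microkernel, not Octavian's actual loop structure):

using Base.Threads, LinearAlgebra

function matmul_loop3_threaded!(C, A, Bblock, Mc)
    M = size(A, 1)
    @sync for m in 1:Mc:M
        mrange = m:min(m + Mc - 1, M)
        @spawn begin
            # every task reads the same shared Bblock,
            # but writes its own disjoint row-block of C
            @views mul!(C[mrange, :], A[mrange, :], Bblock)
        end
    end
    return C
end

C = zeros(256, 96); A = rand(256, 100); Bblock = rand(100, 96);
matmul_loop3_threaded!(C, A, Bblock, 72)
C ≈ A * Bblock   # true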

Of course, that's what you already get when only threading loop 3, which brings us to

Good utilization

This means we should favor loop 3 over loop 5. But of course, if loop 3 is too short relative to how many cores the CPU has, it'd probably be faster to thread loop 5 too.
As an extreme example, say we have

using Octavian
A = rand(64,100);
B = rand(100,96);
C = Matrix{Float64}(undef, size(A,1), size(B,2));
Octavian.matmul!(C, A, B)

on a CPU with AVX2 and 8 cores.
These matrices are small enough that we probably do not want to pack at all.

We probably have M_c = 72 and N_c > 1000. So does that mean we should evaluate it with a single thread?
Or that, if A had 144 rows instead, we'd want to use just 2 threads?

No, in this example we may want to partition A into 16x100 blocks, and B into 100x24 blocks, and use all 8 threads.
(Unless threading overhead becomes problematic.)

Alternatively, what if A and B were each 10_000x10_000?
Would we want to spawn >1000 tasks (10_000 / 72 * 10_000 / 1000 ≈ 1_389 -- or many more, if we consider the reduced N_c we get by splitting up Bpacked)?
No, that'd incur a lot of scheduling overhead.
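Putting numbers on those two extremes (using the illustrative M_c = 72 and N_c = 1000 from above):

M_c, N_c = 72, 1000
ntasks(M, N) = cld(M, M_c) * cld(N, N_c)   # tasks if every (M_c, N_c) tile were its own task
ntasks(64, 96)                             # 1    -- too coarse to keep 8 cores busy
ntasks(10_000, 10_000)                     # 1390 -- far too many tasks to spawn one by one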

Overhead

Instead, we probably want just 1 task per thread, and to figure out what each of these tasks should be doing based on the dimensions of our inputs.
If they're small, we want to cut them into little pieces. If they're big, we probably only want to parallelize loop 3, and to have just 1 task per thread that itself iterates over the M_c-sized chunks.
Ideally, we also wouldn't have to spawn new tasks for every N_c iteration, and would instead synchronize using spinlocks, for example.
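A minimal sketch of that one-task-per-thread partitioning (no packing and no spinlock synchronization here; mul! stands in for the real blocked kernel):

using Base.Threads, LinearAlgebra

function matmul_one_task_per_thread!(C, A, B, Mc)
    M  = size(A, 1)
    nt = nthreads()
    @sync for tid in 1:nt
        # contiguous slice of the M dimension owned by this task
        lo = div((tid - 1) * M, nt) + 1
        hi = div(tid * M, nt)
        @spawn for m in lo:Mc:hi
            mrange = m:min(m + Mc - 1, hi)
            # each task iterates over its own M_c-sized chunks
            @views mul!(C[mrange, :], A[mrange, :], B)
        end
    end
    return C
end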

@DilumAluthge (Member, Author)

Instead, we probably want just 1 task per thread, and to figure out what each of these tasks should be doing based on the dimensions of our inputs.

That sounds like it probably would be the simplest approach. And in this case, it would never be the case that two different tasks are trying to write to the same memory, right?

@DilumAluthge (Member, Author)

Of course, that's what you already get when only threading loop 3, which brings us to

What bugs me about this PR is that tests are failing even when we only thread loop 3.

@DilumAluthge (Member, Author)

Also, working under the "one task per thread" idea, we should think about how we handle the oversubscription case.

On Julia master, if I set the JULIA_NUM_THREADS variable to 99, it will give me 99 threads even if the CPU only has 8 CPU threads.

Of course, the converse case is that if I set the JULIA_NUM_THREADS variable to 1, it will give me one thread even if the CPU has 8 threads.

So, maybe, we want to define N = max(Threads.nthreads(), NUM_OF_CPU_THREADS). And then we spawn N tasks. Would that make sense?
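For reference, the two quantities being compared are available as (NUM_OF_CPU_THREADS above is a placeholder name, not a real constant):

Threads.nthreads()   # number of Julia threads, controlled by JULIA_NUM_THREADS
Sys.CPU_THREADS      # number of logical CPU threads the machine reports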

@chriselrod (Collaborator) commented Dec 31, 2020

That sounds like it probably would be the simplest approach. And in this case, it would never be the case that two different tasks are trying to write to the same memory, right?

Unless we made a mistake implementing the algorithm. ;)

Also, working under the "one task per thread" idea, we should think about how we handle the oversubscription case.

On Julia master, if I set the JULIA_NUM_THREADS variable to 99, it will give me 99 threads even if the CPU only has 8 CPU threads.

So, maybe, we want to define N = max(Threads.nthreads(), NUM_OF_CPU_THREADS). And then we spawn N tasks. Would that make sense?

Yes, but it'd be better to use VectorizationBase.NUM_CORES than Sys.CPU_THREADS.

Note that we only have one L2 cache per physical core. So if we wanted to use more threads than we have L2 caches, we would have to reduce the size of our packed A matrix (meaning we'd have to reduce M_c * K_c).
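Rough arithmetic for why (the cache and block sizes below are illustrative, not values queried from any particular CPU):

L2_bytes      = 512 * 1024                   # a typical per-core L2
M_c, K_c      = 72, 512                      # illustrative packed-A block dimensions
packedA_bytes = M_c * K_c * sizeof(Float64)  # 294_912 bytes ≈ 288 KiB: fits in one L2
2 * packedA_bytes > L2_bytes                 # true -- two tasks sharing this L2 would not
                                             # fit, so M_c * K_c would have to shrink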

@chriselrod (Collaborator) commented Dec 31, 2020

What bugs me about this PR is that tests are failing even when we only thread loop 3.

Oops, same mistake as with the packed-B. We define Amem at the top, so we also have only 1 for all the threads, making them try to read/write to it at the same time, just like the bug with packed-B.

I'd just move that definition (and the GC.@preserve) inside the @spawn of the 3rd loop.
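A hedged sketch of that fix (the block sizes and the packing step are illustrative stand-ins for the PR's actual code):

using Base.Threads

function loop3_with_task_local_Amem(M, Mc, Kc)
    @sync for m in 1:Mc:M
        @spawn begin
            # the A-packing buffer now lives inside the task, so no two tasks share it
            Amem = Vector{Float64}(undef, Mc * Kc)
            GC.@preserve Amem begin
                Aptr = pointer(Amem)
                unsafe_store!(Aptr, Float64(m))   # stand-in for packing A through Aptr
            end
        end
    end
end

loop3_with_task_local_Amem(256, 64, 100)   # illustrative sizes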

@DilumAluthge DilumAluthge force-pushed the dpa/multithreading branch 2 times, most recently from e97e5cc to 4c905f6 on January 1, 2021 at 00:48
@DilumAluthge DilumAluthge changed the title from "First attempt at multithreading" to "First attempt at multithreading: thread loop 3 only" on Jan 1, 2021
@DilumAluthge DilumAluthge force-pushed the dpa/multithreading branch 2 times, most recently from d275395 to b16a2ea on January 1, 2021 at 00:59
@DilumAluthge (Member, Author)

What bugs me about this PR is that tests are failing even when we only thread loop 3.

Oops, same mistake as with the packed-B. We define Amem at the top, so we also have only 1 for all the threads, making them try to read/write to it at the same time, just like the bug with packed-B.

I'd just move that definition (and the GC.@preserve) inside the @spawn of the 3rd loop.

Hmmm, I've done that, but I am still getting test failures.

@chriselrod (Collaborator) commented Jan 1, 2021

The code is currently still threading loop 5, so you still have the Bptr problem.

@DilumAluthge (Member, Author)

Should hopefully be working now.

@DilumAluthge DilumAluthge marked this pull request as ready for review January 1, 2021 01:41
@DilumAluthge DilumAluthge enabled auto-merge (squash) January 1, 2021 01:41
@DilumAluthge (Member, Author)

[image attachment]

@DilumAluthge (Member, Author)

Maybe we should make some of the test matrices a little smaller again 😂

@DilumAluthge DilumAluthge disabled auto-merge January 1, 2021 03:27
@DilumAluthge DilumAluthge merged commit 510659f into master Jan 1, 2021
@DilumAluthge DilumAluthge deleted the dpa/multithreading branch January 1, 2021 03:27