
Matrix multiplication: Nested parallelism #9

Open
mratsim opened this issue Dec 26, 2018 · 1 comment
mratsim (Owner) commented Dec 26, 2018

cc @Laurae2

Benchmark on a dual Xeon Gold 6154 vs MKL:

Warmup: 0.9943 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 2304, N: 2304)
B matrix shape: (M: 2304, N: 2304)
Output shape: (M: 2304, N: 2304)
Required number of operations: 24461.181 millions
Required bytes:                   42.467 MB
Arithmetic intensity:            576.000 FLOP/byte
Theoretical peak single-core:    118.400 GFLOP/s
Theoretical peak multi:         4262.400 GFLOP/s
Make sure not to benchmark against Apple Accelerate or the default Linux BLAS.

Intel MKL benchmark
Collected 100 samples in 0.658 seconds
Average time: 6.211 ms
Stddev  time: 2.274 ms
Min     time: 5.648 ms
Max     time: 28.398 ms
Perf:         3938.203 GFLOP/s

Display output[0] to make sure it's not optimized away
566.68505859375

Laser production implementation
Collected 100 samples in 4.067 seconds
Average time: 40.303 ms
Stddev  time: 12.542 ms
Min     time: 35.367 ms
Max     time: 121.945 ms
Perf:         606.927 GFLOP/s

Display output[0] to make sure it's not optimized away
566.68505859375

PyTorch Glow: libjit matmul implementation
Collected 100 samples in 36.837 seconds
Average time: 368.372 ms
Stddev  time: 3.071 ms
Min     time: 362.655 ms
Max     time: 380.193 ms
Perf:         66.403 GFLOP/s

Display output[0] to make sure it's not optimized away
566.6849975585938

According to the paper

[2] Anatomy of High-Performance Many-Threaded Matrix Multiplication
Smith et al.

parallelism should be applied around the jc loop (dimension nc):

[screenshot: figure from the paper showing the blocked matmul loop structure]

Note that nc is often 4096 so we might need another distribution scheme.

mratsim changed the title from "Matrix multiplication: Dispatch depending on NUMA nodes" to "Matrix multiplication: Nested parallelism" on Jan 16, 2019

mratsim (Owner) commented Jan 16, 2019

Changing to nested parallelism.

Unfortunately, parallelizing a single loop doesn't scale well (unless we multiply bigger matrices).

The BLIS multithreading README mentions multithreading at multiple levels.

Regarding nested parallelism in OpenMP: at first glance it seems quite tricky, with a real risk of oversubscription, or of OpenMP not spawning new threads on the second loop if we use a dynamic schedule.
Intel suggests using the more recent OpenMP task constructs.
