cc @Laurae2
Benchmarking on a dual Xeon Gold 6154 vs MKL:
```
Warmup: 0.9943 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 2304, N: 2304)
B matrix shape: (M: 2304, N: 2304)
Output shape: (M: 2304, N: 2304)
Required number of operations: 24461.181 millions
Required bytes: 42.467 MB
Arithmetic intensity: 576.000 FLOP/byte
Theoretical peak single-core: 118.400 GFLOP/s
Theoretical peak multi: 4262.400 GFLOP/s
```
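For reference, a minimal sketch of how these numbers are derived (assuming FP32, counting only the A and B inputs as mandatory traffic, and a 1.85 GHz AVX-512 clock inferred from the quoted per-core peak; the dual 6154 gives 36 cores total):

```c
#include <stdio.h>

int main(void) {
    const long M = 2304, N = 2304, K = 2304;

    /* One multiply + one add per inner-product step: 2*M*N*K FLOPs. */
    double flops = 2.0 * M * N * K;                          /* ~24461.181e6 */

    /* Counting only the A and B inputs as traffic (FP32) matches
       the 42.467 MB figure above. */
    double bytes = (double)(M * K + K * N) * sizeof(float);  /* ~42.467 MB */

    double intensity = flops / bytes;                        /* 576 FLOP/byte */

    /* Assumed peak: 2 AVX-512 FMA units * 16 fp32 lanes * 2 FLOP
       = 64 FLOP/cycle, at an assumed 1.85 GHz AVX-512 clock. */
    double peak_core  = 64.0 * 1.85e9;                       /* 118.4 GFLOP/s */
    double peak_multi = peak_core * 36.0;                    /* 4262.4 GFLOP/s */

    printf("FLOPs: %.3f millions\n", flops / 1e6);
    printf("Bytes: %.3f MB\n", bytes / 1e6);
    printf("Intensity: %.3f FLOP/byte\n", intensity);
    printf("Peak single-core: %.1f GFLOP/s\n", peak_core / 1e9);
    printf("Peak multi: %.1f GFLOP/s\n", peak_multi / 1e9);
    return 0;
}
```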
Make sure to not bench Apple Accelerate or the default Linux BLAS.
```
Intel MKL benchmark
Collected 100 samples in 0.658 seconds
Average time: 6.211 ms
Stddev time: 2.274 ms
Min time: 5.648 ms
Max time: 28.398 ms
Perf: 3938.203 GFLOP/s

Display output[0] to make sure it's not optimized away
566.68505859375
```
```
Laser production implementation
Collected 100 samples in 4.067 seconds
Average time: 40.303 ms
Stddev time: 12.542 ms
Min time: 35.367 ms
Max time: 121.945 ms
Perf: 606.927 GFLOP/s

Display output[0] to make sure it's not optimized away
566.68505859375
```
```
PyTorch Glow: libjit matmul implementation
Collected 100 samples in 36.837 seconds
Average time: 368.372 ms
Stddev time: 3.071 ms
Min time: 362.655 ms
Max time: 380.193 ms
Perf: 66.403 GFLOP/s

Display output[0] to make sure it's not optimized away
566.6849975585938
```
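The "display output[0]" lines guard against dead-code elimination: if the result is never observed, the compiler may delete the warmup or even the timed computation entirely. A minimal sketch of the pattern (hypothetical stand-in `gemm`, not the actual harness):

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in kernel; the real benchmark calls MKL / Laser / Glow here. */
static void gemm(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
}

int main(void) {
    const int n = 256;
    float *a = malloc(n * n * sizeof *a);
    float *b = malloc(n * n * sizeof *b);
    float *c = malloc(n * n * sizeof *c);
    for (int i = 0; i < n * n; i++) { a[i] = 1.0f; b[i] = 0.5f; }

    /* Warmup: print a value derived from the result so the compiler
       cannot prove the call is dead and remove it. */
    gemm(a, b, c, n);
    printf("Warmup result: %f\n", (double)c[0]);

    /* ... timed samples would run here ... */

    /* Same trick after the timed runs: observing output[0] keeps the
       whole computation live without paying for a full checksum. */
    printf("output[0] = %f\n", (double)c[0]);

    free(a); free(b); free(c);
    return 0;
}
```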
According to paper [2], *Anatomy of High-Performance Many-Threaded Matrix Multiplication*, Smith et al., parallelism should be done around the `jc` loop (dimension `nc`). Note that `nc` is often 4096, so we might need another distribution scheme; see the sketch below.
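For context, a sketch of the BLIS-style five-loop nest from the paper, with the outermost `jc` loop as the parallelization point. Packing and the micro-kernel are elided, and the blocking constants are illustrative, not tuned values:

```c
#include <omp.h>

/* Illustrative blocking parameters. With NC = 4096 and N = 2304 the
   jc loop has a single iteration, which is exactly why parallelizing
   it alone cannot feed 36 cores on this problem size. */
enum { NC = 4096, KC = 512, MC = 128 };

static int min(int a, int b) { return a < b ? a : b; }

/* BLIS-style five-loop GEMM skeleton: C += A*B, row-major, no packing,
   scalar "micro-kernel" for brevity. Assumes C is zero-initialized. */
void gemm_blis_sketch(int M, int N, int K,
                      const float *A, const float *B, float *C) {
    /* Loop 5 (jc): partitions N in chunks of NC -- the level the
       paper suggests parallelizing. Disjoint C columns, so no races. */
    #pragma omp parallel for
    for (int jc = 0; jc < N; jc += NC) {
        int nc = min(NC, N - jc);
        /* Loop 4 (pc): partitions K in chunks of KC; the B panel
           would be packed here. */
        for (int pc = 0; pc < K; pc += KC) {
            int kc = min(KC, K - pc);
            /* Loop 3 (ic): partitions M in chunks of MC; the A block
               would be packed here. */
            for (int ic = 0; ic < M; ic += MC) {
                int mc = min(MC, M - ic);
                /* Loops 2/1 (jr/ir) + micro-kernel, collapsed to
                   scalar code for the sketch. */
                for (int jr = 0; jr < nc; jr++)
                    for (int ir = 0; ir < mc; ir++) {
                        float acc = 0.0f;
                        for (int k = 0; k < kc; k++)
                            acc += A[(ic + ir) * K + pc + k]
                                 * B[(pc + k) * N + jc + jr];
                        C[(ic + ir) * N + jc + jr] += acc;
                    }
            }
        }
    }
}
```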