Optimize AVX512 DGEMM (& DTRMM) #2384
Conversation
|
The new DTRMM kernel passed the 1-thread reliability test. |
|
Reduce DGEMM_R to avoid a segfault when executing serial dgemm with m=3000 and n=k=12000. |
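For context, here is a minimal sketch of the sizing constraint involved - all constants below are illustrative assumptions, not OpenBLAS's actual values (those live in param.h), but they show how an oversized DGEMM_R lets the packed-B panel overrun the per-thread scratch buffer:

```c
/* Illustrative sketch, NOT OpenBLAS source: the packed A and B panels
 * share one per-thread scratch buffer, so their combined size must fit.
 * All constants are assumptions chosen to reproduce the failure mode. */
#include <stdio.h>
#include <stddef.h>

#define DGEMM_P     192               /* assumed M-direction block       */
#define DGEMM_Q     384               /* assumed K-direction block       */
#define DGEMM_R     12288             /* assumed (oversized) N-dir block */
#define BUFFER_SIZE (32UL << 20)      /* assumed 32 MB scratch buffer    */

int main(void) {
    size_t pack_a = (size_t)DGEMM_P * DGEMM_Q * sizeof(double);
    size_t pack_b = (size_t)DGEMM_Q * DGEMM_R * sizeof(double);
    printf("packed A: %zu B, packed B: %zu B, buffer: %lu B\n",
           pack_a, pack_b, BUFFER_SIZE);
    if (pack_a + pack_b > BUFFER_SIZE)
        puts("packed panels exceed the buffer -> out-of-bounds writes (segfault)");
    return 0;
}
```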
|
DGEMM test on a KVM (aliyun, bound to 8 physical cores on 1 die) with an Intel Xeon Platinum 8269CY at 2.5 GHz:

[benchmark table: library | 1-thread perf. (dim ~ 7000) | 8-thread perf. (dim ~ 20000)]

@martin-frbg Have you tested the performance change on an SKX platform? (I cannot access my i7-9800X computer now because of the 2019 novel coronavirus outbreak in my country.) |
|
Will try to get to that in a minute (but all I have is an older W2123, only four cores / eight threads). |
|
Preliminary results on Xeon W2125, 4c/8t (before the very latest commits) - the OpenBLAS numbers for 6/8 threads look wrong. dgemmtest_new was obtained from your GEMM_AVX2_FMA3 repository. |
|
@martin-frbg Thank you very much. |
|
20000 may be too small to justify more than 4 threads - watching a rerun of the OMP_NUM_THREADS=6 case, I see it start out with 4 threads (MKL?) and switch to 6 later (where top shows 4x100 percent utilisation, but only around 45 percent on the other two). |
|
Probably MKL has a mechanism to detect the number of physical cores and limit the number of threads accordingly (to 1 thread per core). I noticed something strange when running dgemm tests on the "8c/16t" Huawei cloud VM mentioned above (not aliyun): OpenBLAS (after this PR) got ~1000 GFLOPS with OMP_NUM_THREADS=16, while MKL got only 500-600 on that VM. Latency tests between logical cores suggested there should be 16 physical cores, but lscpu showed 16 threads on 8 cores. |
|
Quite likely - and/or it puts an upper limit on the number of threads based on problem size. There is code by Agner Fog to do the physical/logical core counting on Intel and AMD in https://github.com/vectorclass/add-on/tree/master/physical_processors, but it seems even he cannot do it on Intel processors without resorting to assumptions. |
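For reference, a minimal sketch of the kind of probing such code does - reading the SMT level from CPUID leaf 0xB (x2APIC topology). This is a standalone illustration, not the linked Agner Fog code or anything in OpenBLAS, and on a VM the hypervisor may report whatever topology it likes, which is exactly the ambiguity observed above:

```c
/* Query logical processors per physical core via CPUID leaf 0xB.
 * GCC/Clang on x86-64 only; a hypervisor can report arbitrary values. */
#include <stdio.h>
#include <cpuid.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(0x0B, 0, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 0xB not supported");
        return 1;
    }
    /* Subleaf 0 is the SMT level: EBX[15:0] = logical CPUs per core. */
    printf("logical processors per physical core: %u\n", ebx & 0xFFFF);
    return 0;
}
```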
|
Performance degradation with >1 thread per core is reasonable with the new kernel. The packed matrix A occupies 576 kB (it cannot be made much smaller, because the limited bandwidth of main memory requires GEMM_Q and GEMM_P to be big enough), which can be cached in L2 when each core runs 1 thread. With 2 threads on a core, however, the L2 is not large enough to hold both threads' packed matrices. I saw a 14% performance degradation with 16 threads (compared to 8 threads) on an 8c/16t aliyun KVM. |
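A back-of-envelope check of the L2 pressure described above - the P/Q values are assumptions chosen to match the stated 576 kB footprint (the actual blocking parameters are in param.h), and Cascade Lake-SP has 1 MB of L2 per core:

```c
/* Sketch of the per-core L2 budget with 1 vs. 2 threads per core.
 * DGEMM_P/DGEMM_Q are illustrative assumptions, not OpenBLAS's values. */
#include <stdio.h>

#define DGEMM_P 192                 /* assumed M-direction block */
#define DGEMM_Q 384                 /* assumed K-direction block */
#define L2_SIZE (1024L * 1024L)     /* 1 MB L2 per core (Cascade Lake-SP) */

int main(void) {
    long packed_a = (long)DGEMM_P * DGEMM_Q * sizeof(double);
    printf("packed A per thread: %ld kB\n", packed_a >> 10);   /* 576 kB */
    printf("1 thread/core : %ld of %ld kB L2 -> fits\n",
           packed_a >> 10, L2_SIZE >> 10);
    printf("2 threads/core: %ld of %ld kB L2 -> thrashing\n",
           (2 * packed_a) >> 10, L2_SIZE >> 10);
    return 0;
}
```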
|
Oh, right. This should be acceptable given the serious improvement overall, though it might make sense to put a note about this cache-flushing effect in the README and/or wiki. |

Replace KERNEL_8x24 with KERNEL_16x12, which makes more room for raising GEMM_Q and thereby reduces the read/write traffic on matrix C (in main memory), improving parallel performance.
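For intuition, here is a simplified intrinsics sketch of a 16x12 register tile - the kernel in this PR is hand-written assembly, and the packing layout, edge handling, and alpha scaling are all omitted here, so treat this purely as an illustration of the register geometry. Sixteen rows of doubles fit in two zmm vectors, so the 12 columns need 2*12 = 24 accumulators out of 32 zmm registers, and the B micro-panel is only 12 columns wide (half of KERNEL_8x24's 24), which is presumably what frees cache room for a larger GEMM_Q:

```c
/* Illustrative C[16x12] += A[16xK] * B[Kx12] micro-kernel with AVX-512
 * intrinsics. A is packed in 16-row column panels, B in 12-column row
 * panels, C is column-major with leading dimension ldc. Not the PR's
 * actual assembly kernel; alpha/beta handling is omitted. */
#include <immintrin.h>
#include <stddef.h>

static void dgemm_kernel_16x12(size_t K, const double *A, const double *B,
                               double *C, size_t ldc) {
    __m512d acc[12][2];                       /* 24 accumulator registers */
    for (int j = 0; j < 12; j++) {
        acc[j][0] = _mm512_setzero_pd();
        acc[j][1] = _mm512_setzero_pd();
    }
    for (size_t k = 0; k < K; k++) {
        __m512d a0 = _mm512_loadu_pd(A + 16 * k);      /* rows 0..7  */
        __m512d a1 = _mm512_loadu_pd(A + 16 * k + 8);  /* rows 8..15 */
        for (int j = 0; j < 12; j++) {
            __m512d b = _mm512_set1_pd(B[12 * k + j]); /* broadcast B(k,j) */
            acc[j][0] = _mm512_fmadd_pd(a0, b, acc[j][0]);
            acc[j][1] = _mm512_fmadd_pd(a1, b, acc[j][1]);
        }
    }
    for (int j = 0; j < 12; j++) {             /* C += accumulated tile */
        double *c = C + j * ldc;
        _mm512_storeu_pd(c,     _mm512_add_pd(_mm512_loadu_pd(c),     acc[j][0]));
        _mm512_storeu_pd(c + 8, _mm512_add_pd(_mm512_loadu_pd(c + 8), acc[j][1]));
    }
}
```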