Test and tune for Zen 2 #2180
It is not that much L3 cache per core or NUMA domain, it is per socket; more like 1-2 MB per core, in place of Haswell's 2.5 MB per core.
I have no idea what you are talking about. The 3700X has 8 cores and a total of 32 MiB of L3. Internally each cluster of 4 cores shares its L3, so it's more like 2x16 MiB of L3. That still works out to 4 MiB of L3 per core. No idea where you are getting the 1-2 MiB from. L3 cache is not shared between the four-core core complexes (CCXs), not even within the same die.
@TiborGY I also found that kernel tuning is required for Zen 2. I tested single-thread dgemm performance of OpenBLAS (target=HASWELL) on a Ryzen 7 3700X at a fixed 3.0 GHz clock and got ~33 GFLOPS, which is far behind the theoretical maximum (48 GFLOPS at 3.0 GHz). By the way, I also tested my own dgemm subroutine and got ~44 GFLOPS.
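For reference, a minimal sketch of this kind of single-thread DGEMM benchmark, assuming an OpenBLAS build that provides cblas.h (the matrix size and timing method are my own choices, not wjc404's exact harness):

```c
/* Minimal single-thread DGEMM benchmark sketch.
 * Theoretical peak per core at 3.0 GHz with 2x 256-bit FMA units:
 * 3.0 GHz * 2 FMA/cycle * 4 doubles * 2 flops = 48 GFLOPS.
 * Build: gcc -O2 dgemm_bench.c -lopenblas; the n = 4096 size is arbitrary. */
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const int n = 4096;
    double *a = malloc(sizeof(double) * n * n);
    double *b = malloc(sizeof(double) * n * n);
    double *c = malloc(sizeof(double) * n * n);
    for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gflops = 2.0 * n * n * (double)n / secs * 1e-9;
    printf("%.1f GFLOPS (%.3f s)\n", gflops, secs);
    free(a); free(b); free(c);
    return 0;
}
```

Run it with OPENBLAS_NUM_THREADS=1 and a fixed core clock so the result can be compared against the per-core theoretical peak.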
I read the code of OpenBLAS's Haswell dgemm kernel and found that the two most common FP arithmetic instructions are vfmadd231pd and (chained) vpermpd. I roughly tested the latency of vfmadd231pd and vpermpd on the i9-9900K and the r7-3700x, and found that vfmadd231pd has a latency of 5 cycles on both CPUs; for vpermpd, however, the latency on the r7-3700x (6 cycles) is double that on the 9900K (3 cycles). I guess the performance problem on Zen 2 may result from the vpermpd instructions.
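A rough sketch of how such a latency measurement can be done with a serial dependency chain in GNU inline assembly (this is not wjc404's actual test code; the iteration count, the 0x1b permute control, and the TSC-based timing are my own choices):

```c
/* Latency microbenchmark sketch: a chain of dependent vpermpd instructions,
 * so each one must wait for the previous result. Build with gcc -O2 and run
 * with the core clock pinned. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

#define ITERS 100000000UL
#define CHAIN 4UL        /* dependent vpermpd instructions per loop iteration */

int main(void) {
    uint64_t t0 = __rdtsc();
    __asm__ volatile(
        "vxorpd %%ymm0, %%ymm0, %%ymm0\n\t"
        "mov    %[n], %%rcx\n"
        "1:\n\t"
        /* each vpermpd reads the previous result -> serial dependency chain */
        "vpermpd $0x1b, %%ymm0, %%ymm0\n\t"
        "vpermpd $0x1b, %%ymm0, %%ymm0\n\t"
        "vpermpd $0x1b, %%ymm0, %%ymm0\n\t"
        "vpermpd $0x1b, %%ymm0, %%ymm0\n\t"
        "dec    %%rcx\n\t"
        "jnz    1b\n\t"
        :
        : [n] "r"(ITERS)
        : "rcx", "xmm0", "cc", "memory");
    uint64_t t1 = __rdtsc();

    /* Assumes the TSC ticks at the (fixed) core clock; rescale otherwise. */
    printf("~%.2f cycles of latency per vpermpd\n",
           (double)(t1 - t0) / (double)(ITERS * CHAIN));
    return 0;
}
```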
Interesting observation. I now see this doubling of latency for vpermpd mentioned in Agner Fog's https://www.agner.org/optimize/instruction_tables.pdf for Zen, so this apparently still applies to Zen 2 as well (and it is obviously just as relevant for the old issue #1461).
The reason why your memory latency is sky high is your memory clock. 2133 MHz is a huge performance nerf for Ryzen CPUs, because the internal bus that connects the cores to the memory controller (and to each other) runs at 1/2 the memory clock. (This bus is conceptually similar to Intel's mesh/uncore clock.) 102 ns is crazy high, even for Ryzen. IMO 2400 MHz should be the bare minimum speed anyone uses, and even that is only because ECC UDIMMs are hard to find above that speed. If someone is not using ECC, 2666 or even 3000 MHz is very much recommended. You could easily shave 20 ns off the figure you measured.
I also tested the latencies of some other AVX instructions on the r7 3700x in a way similar to my previous test of vpermpd. The results are as follows:
Are you serious? You know that X GHz memory serves only that many words per second; there is no shortcut. (Well, there is one, called cache.)
I changed 8 vpermpd instructions to vshufpd in the first 4 "KERNEL4x12_*" macros in the file "dgemm_kernel_4x8_haswell.S" and got roughly a 1/4 speedup while maintaining correct results.
I then modified the macro "SAVE4x12" in a similar way and got a further 0.3% performance improvement. Now the performance is about 9/10 of the theoretical maximum.
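To illustrate why this substitution is legal (this is only a sketch of the equivalence, not the kernel code, and the exact immediates used in dgemm_kernel_4x8_haswell.S may differ): when a 4x64-bit permute only swaps elements within each 128-bit lane, vpermpd, vshufpd and vpermilpd all produce the same result, but only vpermpd goes through the slow cross-lane permute unit on Zen 2.

```c
/* Equivalence sketch: control 0xb1 maps [0,1,2,3] -> [1,0,3,2], i.e. it
 * only swaps within each 128-bit lane, so in-lane shuffles can replace
 * the cross-lane vpermpd. Build with gcc -O2 -mavx2. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256d x = _mm256_set_pd(3.0, 2.0, 1.0, 0.0);   /* [0,1,2,3] low->high */

    __m256d a = _mm256_permute4x64_pd(x, 0xb1);       /* vpermpd: cross-lane unit */
    __m256d b = _mm256_shuffle_pd(x, x, 0x5);         /* vshufpd: in-lane only   */
    __m256d c = _mm256_permute_pd(x, 0x5);            /* vpermilpd: in-lane only */

    double ra[4], rb[4], rc[4];
    _mm256_storeu_pd(ra, a);
    _mm256_storeu_pd(rb, b);
    _mm256_storeu_pd(rc, c);
    for (int i = 0; i < 4; i++)
        printf("%g %g %g\n", ra[i], rb[i], rc[i]);    /* all three columns match */
    return 0;
}
```

This also matches the later observation in this thread that vpermilpd behaves like vshufpd on the r7-3700x and can replace vpermpd in some cases.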
Test of more AVX(2) instructions on doubles on the r7-3700x (1 thread at 3.6 GHz), and a similar test on the i9-9900K (1 thread, 4.4 GHz). (The chained vfmadd231pd test ran endlessly, so it was removed from this round; luckily I had measured it previously with different code.)
I also found that alternating vaddpd and vmulpd in the test code can reach a total IPC of 4 on Zen 2, which was only 2 on the i9-9900K.
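A rough sketch of this kind of throughput (IPC) test, using independent vaddpd/vmulpd pairs so that only the number of FP pipes limits the rate (register choices, iteration count and TSC-based timing are my own assumptions, not the original test code):

```c
/* Throughput microbenchmark sketch: 8 independent FP ops per iteration on
 * separate destination registers, so there is no dependency chain.
 * FP instructions per cycle ~= ITERS * 8 / elapsed_cycles.
 * Build with gcc -O2; assumes the TSC ticks at the fixed core clock. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define ITERS 100000000UL

int main(void) {
    uint64_t t0 = __rdtsc();
    __asm__ volatile(
        "vxorpd %%ymm0, %%ymm0, %%ymm0\n\t"
        "vxorpd %%ymm1, %%ymm1, %%ymm1\n\t"
        "vxorpd %%ymm2, %%ymm2, %%ymm2\n\t"
        "vxorpd %%ymm3, %%ymm3, %%ymm3\n\t"
        "mov    %[n], %%rcx\n"
        "1:\n\t"
        /* 4x vaddpd + 4x vmulpd, all independent of each other */
        "vaddpd %%ymm0, %%ymm0, %%ymm4\n\t"
        "vmulpd %%ymm1, %%ymm1, %%ymm5\n\t"
        "vaddpd %%ymm2, %%ymm2, %%ymm6\n\t"
        "vmulpd %%ymm3, %%ymm3, %%ymm7\n\t"
        "vaddpd %%ymm0, %%ymm0, %%ymm8\n\t"
        "vmulpd %%ymm1, %%ymm1, %%ymm9\n\t"
        "vaddpd %%ymm2, %%ymm2, %%ymm10\n\t"
        "vmulpd %%ymm3, %%ymm3, %%ymm11\n\t"
        "dec    %%rcx\n\t"
        "jnz    1b\n\t"
        :
        : [n] "r"(ITERS)
        : "rcx", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5",
          "xmm6", "xmm7", "xmm8", "xmm9", "xmm10", "xmm11", "cc", "memory");
    uint64_t t1 = __rdtsc();
    printf("~%.2f FP instructions/cycle\n",
           (double)ITERS * 8.0 / (double)(t1 - t0));
    return 0;
}
```

An IPC of ~4 here is consistent with Zen 2 having two FP add pipes plus two FP multiply/FMA pipes, whereas Skylake-derived cores issue at most two FP arithmetic ops per cycle.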
A simple test of AVX load and store instructions on packed doubles on the r7-3700x (3.6 GHz, 1 thread); columns: Instruction (AT&T syntax), max IPC:
Unlike vpermpd, vpermilpd shares the same latency and IPC as vshufpd on the r7-3700x, so it can also replace vpermpd in some cases.
Data sharing between CCXs is still problematic. Here's the code:
Synchronization bandwidths of shared data between cores, tested on the r7-3700x (3.6 GHz). Code:
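The test program attached to the original comments is not reproduced above. As an independent illustration of the same idea, here is a minimal sketch in which a writer thread pinned to one core fills a buffer and a reader thread pinned to another core consumes it, so the measured bandwidth depends on whether the two cores share a CCX (the core numbers and buffer size are assumptions; check the actual topology with lstopo):

```c
/* Cross-core data-sharing bandwidth sketch. Not wjc404's benchmark.
 * CORE_W/CORE_R and the 8 MiB buffer are arbitrary; on a 3700X, picking the
 * reader in the other CCX is assumed to require core >= 4 (verify with lstopo).
 * Build with gcc -O2 -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define CORE_W 0
#define CORE_R 4                       /* assumed to be in the other CCX */
#define N (1024 * 1024)                /* 8 MiB of doubles */
#define REPS 100

static double buf[N];
static pthread_barrier_t bar;

static void pin_to(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *writer(void *arg) {
    pin_to(CORE_W);
    for (int r = 0; r < REPS; r++) {
        for (int i = 0; i < N; i++) buf[i] = r + i;  /* dirty the buffer locally */
        pthread_barrier_wait(&bar);                  /* buffer ready */
        pthread_barrier_wait(&bar);                  /* reader done  */
    }
    return arg;
}

static void *reader(void *arg) {
    pin_to(CORE_R);
    double sum = 0.0, secs = 0.0;
    struct timespec t0, t1;
    for (int r = 0; r < REPS; r++) {
        pthread_barrier_wait(&bar);                  /* wait for fresh data */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) sum += buf[i];   /* pull remote-written data */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        secs += (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        pthread_barrier_wait(&bar);
    }
    printf("~%.1f GB/s reading remotely written data (checksum %g)\n",
           REPS * (double)sizeof(buf) / secs * 1e-9, sum);
    return arg;
}

int main(void) {
    pthread_t tw, tr;
    pthread_barrier_init(&bar, NULL, 2);
    pthread_create(&tw, NULL, writer, NULL);
    pthread_create(&tr, NULL, reader, NULL);
    pthread_join(tw, NULL);
    pthread_join(tr, NULL);
    return 0;
}
```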
So AMD looks like 4-core clusters?
It is accurately shown by lstopo: the L3 cache is not shared between CCXs. But it is shown as a single NUMA node, since memory access is uniform for all cores, so technically it is not a NUMA setup.
@wjc404 What sort of fabric clock (FCLK) are you running? The inter-core bandwidth between the CCXs is probably largely determined by the FCLK.
Well, not exposed but 3x faster ... |
Sorry, I don't know where to find the FCLK frequency. It should be the default one for a 3.6 GHz CPU clock.
I believe AMD put in some effort to make the Linux and Windows 10 schedulers aware of the special topology. OpenBLAS itself probably has little chance to create a "useful" default affinity map on its own without knowing the "bigger picture" of what kind of code it was called from and what the overall system utilization is. (I think FCLK is proportional to the clock speed of the RAM installed in a particular system, so it could be that the DDR4-2133 memory shown in your AIDA screenshot led to less than optimal performance of the interconnect.)
The FCLK is the clock for the fabric between the core chiplet(s) and the IO die. (I think it is also the bus responsible for communication between the CCXs.) The FCLK is set by the motherboard firmware; under normal circumstances this means exactly 1/2 of the memory clock. So on most motherboards, memory speed will directly alter CCX-to-CCX latency and bandwidth. Memory write bandwidth is also very highly dependent on FCLK.

Going from 2133 to 3200 should increase the bandwidth between CCXs by about 50%, if the motherboard correctly keeps the FCLK in sync with the memory speed. It is possible to have a desynchronized FCLK, but running a system like that is very undesirable, as it increases memory latency by about 20 ns and generally worsens performance. Motherboards should default to keeping the FCLK in sync with the memory speed from 2133 up to 3600 MHz. However, I have heard that some motherboards have had firmware bugs and sometimes desynced the FCLK for no good reason.
@TiborGY Thanks for your guidance! The FCLK frequency setting in my BIOS is AUTO.
So by replacing your memory with DDR4-3600 you could increase the FCLK to 1800 MHz, which would make the cross-CCX transfers look less ugly (though at an added cost of something like $150 per 16 GB).
Officially, Zen 2 only supports up to 3200 MHz memory. In practice, 3600 seems fine; beyond that you start running into issues with the fabric getting unstable, depending of course on your luck in the silicon lottery. For this reason motherboards seem to default to a desynced FCLK if you apply an XMP profile faster than 3600. On a serious workstation I would probably not risk going beyond 3200. Memory stability is notoriously hard to stress test, and I would guess the same applies to fabric stability.

This does have a silver lining though: 3200 MHz RAM is not too expensive, unless you want very tight memory timings (CL14-CL15).
It is HyperTransport (Intel's rough equivalent is QPI), though I have no idea how the modern version handles clocking, power saving, etc.
Not anymore. It used to be HyperTransport before Zen. The official marketing name for the current fabric is "Infinity Fabric".
It is not userspace-programmable. If the scheduler knows about it, we might be able to just group threads into cluster-sized groups that access the same pieces of memory, avoiding L3-to-L3 copies (see the sketch below). It claims roughly 40 GB/s; is that full duplex, or half each way? You may already be at the optimum.
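As a rough illustration of that grouping idea (not something OpenBLAS does today), a process can be restricted to the cores of one CCX before any BLAS threads are spawned; the CPU numbers below are an assumption and depend on the machine's topology and SMT numbering, and OpenBLAS's own affinity handling (e.g. a build without NO_AFFINITY) may override the inherited mask:

```c
/* Sketch: confine the current process to one assumed CCX (logical CPUs 0-3).
 * Threads created afterwards inherit this mask unless they set their own.
 * Verify the CPU-to-CCX mapping with lstopo before relying on these numbers. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu <= 3; cpu++)   /* assumed: one CCX = CPUs 0..3 */
        CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... BLAS calls made from here on would run on the chosen CCX ... */
    printf("process restricted to CPUs 0-3\n");
    return 0;
}
```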
Hello. Thank you very much for the measurement script. I modified that a bit and pushed here; For the
I verified the numbers with
Hey. |
Hey. I've just prepared a comparison on one which runs
all numbers are collected here: Based on the numbers I was able to identify the following problems:
I'm going to bisect the other performance issues.
@marxin I did most of the SGEMM and DGEMM benchmarks with the 2 programs "sgemmtest_new" and "dgemmtest_new" in my repository GEMM_AVX2_FMA3. When using them on Zen processors, please set the environment variable MKL_DEBUG_CPU_TYPE to 5.
Ok, I see the program depends on an MKL header file (and needs to be linked against MKL).
Sure. A difference is that you probably use OpenMP with multiple threads, am I right?
@marxin couldn't you use the provided binaries from wjc404's repo (which also have MKL statically linked)? And I seem to recall the performance figures were obtained for both single and multiple threads.
@marxin If you have confirmed a significant performance drop of SGEMM (especially in serial execution with dimensions > 4000) on Zen/Zen+ chips after PR #2361, then you can try to specify different SGEMM kernels for Zen and Zen 2 (probably by editing "KERNEL.ZEN" and "param.h" and modifying the CPU detection code, to choose "sgemm_kernel_16x4_haswell.S" for Zen/Zen+ and "sgemm_kernel_8x4_haswell.c" for Zen 2) and make it a PR. Unfortunately I cannot access Google's website from China to download your results. Currently I don't have a machine with a Zen/Zen+ CPU to test on.
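A hedged sketch of how Zen/Zen+ could be told apart from Zen 2 in CPU detection code (this is not OpenBLAS's actual cpuid_x86.c logic; treating family 17h with model >= 0x30 as Zen 2 is a heuristic based on known Zen 2 models such as Rome 0x31 and Matisse 0x71):

```c
/* CPUID-based Zen vs Zen 2 discrimination sketch. Build with gcc -O2. */
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;

    /* displayed family = base family + extended family (for family >= 0xF);
     * displayed model  = base model | (extended model << 4) */
    unsigned int family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
    unsigned int model  = ((eax >> 4) & 0xf) | (((eax >> 16) & 0xf) << 4);

    if (family == 0x17)
        printf("family 17h, model 0x%02x -> %s\n", model,
               model >= 0x30 ? "Zen 2 (would pick sgemm_kernel_8x4_haswell.c)"
                             : "Zen/Zen+ (would pick sgemm_kernel_16x4_haswell.S)");
    else
        printf("not an AMD family 17h CPU (family 0x%02x)\n", family);
    return 0;
}
```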
I believe the speed drops in xDOT post 0.3.6 might be due to #1965, if they are not just an artefact. If I read your table correctly, your figures for DSDOT/SDSDOT are even worse than for ZDOT, and they definitely did not receive any changes except that fix for undeclared clobbers.
@wjc404 this is marxin's spreadsheet exported from the Google Docs site in .xlsx format.
@martin-frbg Thanks.
Note that my spreadsheet only contains results for single-threaded runs. I haven't had time to run parallel tests. I'm planning to do that.
Yes, I will test the suggested changes.
Ok, I've just made a minimal reversion of #2361 which restores speed on |
I've just re-run that locally and I can't get the slower numbers for current |
Perhaps with Ryzen vs EPYC we are introducing some other variable besides znver1/znver2, even when running on a single core? Unfortunately I cannot run benchmarks on my 2700K in the next few days (and I remember it was not easy to force it to run at a fixed core frequency with actually reproducible speeds).
I did the benchmark given above with my new Ryzen 7 3700X. I set the CPU frequency to 3.6 GHz (verified with zenmonitor), switching off any Turbo Core boost or Precision Boost Overdrive settings in the BIOS. I have 2x8 GB of RAM installed at 3200 MHz. The results for the last 3 releases of OpenBLAS are given in the spreadsheet.
Zen 2 is now released, bringing a number of improvements to the table.
Most notably, it now has 256-bit wide AVX units. This should in theory allow performance parity with Haswell through Coffee Lake CPUs, and initial results suggest this is true (at least for a single thread).
https://i.imgur.com/sFhxPrW.png
The chips also have double the L3 cache and a generally reworked cache hierarchy. One thing to note is that these chips do not have enough TLB capacity to cover all of L2 and L3, so huge pages might be a little more important.
I might be able to get my hands on a Zen 2 system in ~1-2 months.
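Regarding the TLB/huge-page point above, here is a minimal sketch of requesting transparent huge pages for a large buffer on Linux (a generic technique, not something OpenBLAS does automatically; the 64 MiB size and 2 MiB alignment are arbitrary choices):

```c
/* Ask the kernel to back a large working buffer with transparent huge
 * pages via madvise(MADV_HUGEPAGE), reducing dTLB pressure when the
 * working set exceeds TLB coverage. Requires CONFIG_TRANSPARENT_HUGEPAGE. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    const size_t bytes = 64UL << 20;        /* 64 MiB working set */
    void *buf = NULL;

    /* align to the 2 MiB huge-page size so whole pages can be promoted */
    if (posix_memalign(&buf, 2UL << 20, bytes) != 0) {
        perror("posix_memalign");
        return 1;
    }
    if (madvise(buf, bytes, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");   /* non-fatal: falls back to 4 KiB pages */

    memset(buf, 0, bytes);                  /* touch the pages */
    printf("buffer ready at %p\n", buf);
    free(buf);
    return 0;
}
```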