Test and tune for Zen 2 #2180
It is not that much L3 cache per core or NUMA domain, it is per socket; more like 1-2 MB per core, in place of Haswell's 2.5 MB per core.
I have no idea what you are talking about. The 3700X has 8 cores and a total of 32 MiB of L3. Internally each cluster of 4 cores shares its L3, so it's more like 2x16 MiB of L3. That still works out to 4 MiB of L3 per core. No idea where you are getting the 1-2 MiB from. L3 cache is not shared between the four-core core complexes (CCXs), not even within the same die.
@TiborGY I also found that kernel tuning is required for Zen 2. I tested single-thread dgemm performance of OpenBLAS (target=HASWELL) on a Ryzen 7 3700X at a fixed 3.0 GHz clock and got ~33 GFLOPS, which is far behind the theoretical maximum (48 GFLOPS at 3.0 GHz). By the way, I also tested my own dgemm subroutine and got ~44 GFLOPS.
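For reference, a minimal sketch of this kind of single-thread DGEMM benchmark, assuming an OpenBLAS build that provides cblas.h (the matrix size and timing method are my own choices, not wjc404's exact harness):

```c
/* Minimal single-thread DGEMM benchmark sketch.
 * Theoretical peak per core at 3.0 GHz with 2x 256-bit FMA units:
 * 3.0 GHz * 2 FMA/cycle * 4 doubles * 2 flops = 48 GFLOPS.
 * Build: gcc -O2 dgemm_bench.c -lopenblas; the n = 4096 size is arbitrary. */
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const int n = 4096;
    double *a = malloc(sizeof(double) * n * n);
    double *b = malloc(sizeof(double) * n * n);
    double *c = malloc(sizeof(double) * n * n);
    for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gflops = 2.0 * n * n * (double)n / secs * 1e-9;
    printf("%.1f GFLOPS (%.3f s)\n", gflops, secs);
    free(a); free(b); free(c);
    return 0;
}
```

Run it with OPENBLAS_NUM_THREADS=1 and a fixed core clock so the result can be compared against the per-core theoretical peak.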
I read the code of OpenBLAS's Haswell dgemm kernel and found that the two most common FP arithmetic instructions are vfmadd231pd and (chained) vpermpd. I roughly tested the latency of vfmadd231pd and vpermpd on the i9-9900K and the r7-3700x, and found that vfmadd231pd has a latency of 5 cycles on both CPUs; for vpermpd, however, the latency on the r7-3700x (6 cycles) is double that on the 9900K (3 cycles). I guess the performance problem on Zen 2 may result from the vpermpd instructions.
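A rough sketch of how such a latency measurement can be done with a serial dependency chain in GNU inline assembly (this is not wjc404's actual test code; the iteration count, the 0x1b permute control, and the TSC-based timing are my own choices):

```c
/* Latency microbenchmark sketch: a chain of dependent vpermpd instructions,
 * so each one must wait for the previous result. Build with gcc -O2 and run
 * with the core clock pinned. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

#define ITERS 100000000UL
#define CHAIN 4UL        /* dependent vpermpd instructions per loop iteration */

int main(void) {
    uint64_t t0 = __rdtsc();
    __asm__ volatile(
        "vxorpd %%ymm0, %%ymm0, %%ymm0\n\t"
        "mov    %[n], %%rcx\n"
        "1:\n\t"
        /* each vpermpd reads the previous result -> serial dependency chain */
        "vpermpd $0x1b, %%ymm0, %%ymm0\n\t"
        "vpermpd $0x1b, %%ymm0, %%ymm0\n\t"
        "vpermpd $0x1b, %%ymm0, %%ymm0\n\t"
        "vpermpd $0x1b, %%ymm0, %%ymm0\n\t"
        "dec    %%rcx\n\t"
        "jnz    1b\n\t"
        :
        : [n] "r"(ITERS)
        : "rcx", "xmm0", "cc", "memory");
    uint64_t t1 = __rdtsc();

    /* Assumes the TSC ticks at the (fixed) core clock; rescale otherwise. */
    printf("~%.2f cycles of latency per vpermpd\n",
           (double)(t1 - t0) / (double)(ITERS * CHAIN));
    return 0;
}
```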
Interesting observation. I now see this doubling of latency for vpermpd mentioned in Agner Fog's https://www.agner.org/optimize/instruction_tables.pdf for Zen, so this apparently still applies to Zen 2 as well (and it is obviously just as relevant for the old issue #1461).
The reason why your memory latency is sky high is your memory clock. 2133 MHz is a huge performance nerf for Ryzen CPUs, because the internal bus that connects the cores to the memory controller (and to each other) runs at 1/2 the memory clock. (This bus is conceptually similar to Intel's mesh/uncore clock.) 102 ns is crazy high, even for Ryzen. IMO 2400 MHz should be the bare minimum speed anyone uses, and even that is only because ECC UDIMMs are hard to find above that speed. If someone is not using ECC, 2666 or even 3000 MHz is very much recommended. You could easily shave 20 ns off the figure you measured.
I also tested the latencies of some other AVX instructions on the r7 3700x in a way similar to my previous test of vpermpd. The results are as follows:
Are you serious? You know that X GHz memory serves only that many words per second; there is no shortcut. (Well, there is one, called cache.)
I changed 8 vpermpd instructions to vshufpd in the first 4 "KERNEL4x12_*" macros in the file "dgemm_kernel_4x8_haswell.S" and got roughly a 1/4 speedup while maintaining correct results.
I then modified the macro "SAVE4x12" in a similar way and got a further 0.3% performance improvement. Now the performance is about 9/10 of the theoretical maximum.
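To illustrate why this substitution is legal (this is only a sketch of the equivalence, not the kernel code, and the exact immediates used in dgemm_kernel_4x8_haswell.S may differ): when a 4x64-bit permute only swaps elements within each 128-bit lane, vpermpd, vshufpd and vpermilpd all produce the same result, but only vpermpd goes through the slow cross-lane permute unit on Zen 2.

```c
/* Equivalence sketch: control 0xb1 maps [0,1,2,3] -> [1,0,3,2], i.e. it
 * only swaps within each 128-bit lane, so in-lane shuffles can replace
 * the cross-lane vpermpd. Build with gcc -O2 -mavx2. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256d x = _mm256_set_pd(3.0, 2.0, 1.0, 0.0);   /* [0,1,2,3] low->high */

    __m256d a = _mm256_permute4x64_pd(x, 0xb1);       /* vpermpd: cross-lane unit */
    __m256d b = _mm256_shuffle_pd(x, x, 0x5);         /* vshufpd: in-lane only   */
    __m256d c = _mm256_permute_pd(x, 0x5);            /* vpermilpd: in-lane only */

    double ra[4], rb[4], rc[4];
    _mm256_storeu_pd(ra, a);
    _mm256_storeu_pd(rb, b);
    _mm256_storeu_pd(rc, c);
    for (int i = 0; i < 4; i++)
        printf("%g %g %g\n", ra[i], rb[i], rc[i]);    /* all three columns match */
    return 0;
}
```

This also matches the later observation in this thread that vpermilpd behaves like vshufpd on the r7-3700x and can replace vpermpd in some cases.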
Test of more AVX(2) instructions on doubles on the r7-3700x (1 thread at 3.6 GHz), and a similar test on the i9-9900K (1 thread, 4.4 GHz). (The chained vfmadd231pd test ran endlessly, so it was removed from this round; luckily I had measured it previously with different code.)
I also found that alternating vaddpd and vmulpd in the test code can reach a total IPC of 4 on Zen 2, which was only 2 on the i9-9900K.
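A rough sketch of this kind of throughput (IPC) test, using independent vaddpd/vmulpd pairs so that only the number of FP pipes limits the rate (register choices, iteration count and TSC-based timing are my own assumptions, not the original test code):

```c
/* Throughput microbenchmark sketch: 8 independent FP ops per iteration on
 * separate destination registers, so there is no dependency chain.
 * FP instructions per cycle ~= ITERS * 8 / elapsed_cycles.
 * Build with gcc -O2; assumes the TSC ticks at the fixed core clock. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define ITERS 100000000UL

int main(void) {
    uint64_t t0 = __rdtsc();
    __asm__ volatile(
        "vxorpd %%ymm0, %%ymm0, %%ymm0\n\t"
        "vxorpd %%ymm1, %%ymm1, %%ymm1\n\t"
        "vxorpd %%ymm2, %%ymm2, %%ymm2\n\t"
        "vxorpd %%ymm3, %%ymm3, %%ymm3\n\t"
        "mov    %[n], %%rcx\n"
        "1:\n\t"
        /* 4x vaddpd + 4x vmulpd, all independent of each other */
        "vaddpd %%ymm0, %%ymm0, %%ymm4\n\t"
        "vmulpd %%ymm1, %%ymm1, %%ymm5\n\t"
        "vaddpd %%ymm2, %%ymm2, %%ymm6\n\t"
        "vmulpd %%ymm3, %%ymm3, %%ymm7\n\t"
        "vaddpd %%ymm0, %%ymm0, %%ymm8\n\t"
        "vmulpd %%ymm1, %%ymm1, %%ymm9\n\t"
        "vaddpd %%ymm2, %%ymm2, %%ymm10\n\t"
        "vmulpd %%ymm3, %%ymm3, %%ymm11\n\t"
        "dec    %%rcx\n\t"
        "jnz    1b\n\t"
        :
        : [n] "r"(ITERS)
        : "rcx", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5",
          "xmm6", "xmm7", "xmm8", "xmm9", "xmm10", "xmm11", "cc", "memory");
    uint64_t t1 = __rdtsc();
    printf("~%.2f FP instructions/cycle\n",
           (double)ITERS * 8.0 / (double)(t1 - t0));
    return 0;
}
```

An IPC of ~4 here is consistent with Zen 2 having two FP add pipes plus two FP multiply/FMA pipes, whereas Skylake-derived cores issue at most two FP arithmetic ops per cycle.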
A simple test of AVX load and store instructions on packed doubles on the r7-3700x (3.6 GHz, 1 thread); columns: Instruction (AT&T syntax), max IPC:
Unlike vpermpd, vpermilpd shares the same latency and IPC as vshufpd on the r7-3700x, so it can also replace vpermpd in some cases.
Data sharing between CCXs is still problematic. Here's the code:
Synchronization bandwidths of shared data between cores, tested on the r7-3700x (3.6 GHz). Code:
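The test program attached to the original comments is not reproduced above. As an independent illustration of the same idea, here is a minimal sketch in which a writer thread pinned to one core fills a buffer and a reader thread pinned to another core consumes it, so the measured bandwidth depends on whether the two cores share a CCX (the core numbers and buffer size are assumptions; check the actual topology with lstopo):

```c
/* Cross-core data-sharing bandwidth sketch. Not wjc404's benchmark.
 * CORE_W/CORE_R and the 8 MiB buffer are arbitrary; on a 3700X, picking the
 * reader in the other CCX is assumed to require core >= 4 (verify with lstopo).
 * Build with gcc -O2 -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define CORE_W 0
#define CORE_R 4                       /* assumed to be in the other CCX */
#define N (1024 * 1024)                /* 8 MiB of doubles */
#define REPS 100

static double buf[N];
static pthread_barrier_t bar;

static void pin_to(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *writer(void *arg) {
    pin_to(CORE_W);
    for (int r = 0; r < REPS; r++) {
        for (int i = 0; i < N; i++) buf[i] = r + i;  /* dirty the buffer locally */
        pthread_barrier_wait(&bar);                  /* buffer ready */
        pthread_barrier_wait(&bar);                  /* reader done  */
    }
    return arg;
}

static void *reader(void *arg) {
    pin_to(CORE_R);
    double sum = 0.0, secs = 0.0;
    struct timespec t0, t1;
    for (int r = 0; r < REPS; r++) {
        pthread_barrier_wait(&bar);                  /* wait for fresh data */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) sum += buf[i];   /* pull remote-written data */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        secs += (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        pthread_barrier_wait(&bar);
    }
    printf("~%.1f GB/s reading remotely written data (checksum %g)\n",
           REPS * (double)sizeof(buf) / secs * 1e-9, sum);
    return arg;
}

int main(void) {
    pthread_t tw, tr;
    pthread_barrier_init(&bar, NULL, 2);
    pthread_create(&tw, NULL, writer, NULL);
    pthread_create(&tr, NULL, reader, NULL);
    pthread_join(tw, NULL);
    pthread_join(tr, NULL);
    return 0;
}
```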
So AMD looks like 4-core clusters?
It is accurately shown by lstopo: the L3 cache is not shared between CCXs. But it is shown as a single NUMA node, since memory access is uniform for all cores, so technically it is not a NUMA setup.
@wjc404 What sort of fabric clock (FCLK) are you running? The inter-core bandwidth between the CCXs is probably largely determined by the FCLK.
Well, not exposed but 3x faster ... |
Sorry, I don't know where to find the FCLK frequency. It should be the default one for a 3.6 GHz CPU clock.
I believe AMD put in some effort to make the Linux and Windows 10 schedulers aware of the special topology. OpenBLAS itself probably has little chance to create a "useful" default affinity map on its own without knowing the "bigger picture" of what kind of code it was called from and what the overall system utilization is. (I think FCLK is proportional to the clock speed of the RAM installed in a particular system, so it could be that the DDR4-2133 memory shown in your AIDA screenshot led to less than optimal performance of the interconnect.)
The FCLK is the clock for the fabric between the core chiplet(s) and the IO die. (I think it is also the bus responsible for communication between the CCXs.) The FCLK is set by the motherboard firmware; under normal circumstances this means exactly 1/2 of the memory clock. So on most motherboards, memory speed will directly alter CCX-to-CCX latency and bandwidth. Memory write bandwidth is also very highly dependent on FCLK.

Going from 2133 to 3200 should increase the bandwidth between CCXs by about 50%, if the motherboard correctly keeps the FCLK in sync with the memory speed. It is possible to have a desynchronized FCLK, but running a system like that is very undesirable, as it increases memory latency by about 20 ns and generally worsens performance. Motherboards should default to keeping the FCLK in sync with the memory speed from 2133 up to 3600 MHz. However, I have heard that some motherboards have had firmware bugs and sometimes desynced the FCLK for no good reason.
@TiborGY Thanks for your guidance! The FCLK frequency setting in my BIOS is AUTO.
So by replacing your memory with DDR4-3600 you could increase the FCLK to 1800 MHz, which would make the cross-CCX transfers look less ugly (though at an added cost of something like $150 per 16 GB).
Officially, Zen 2 only supports up to 3200 MHz memory. In practice, 3600 seems fine; beyond that you start running into issues with the fabric getting unstable, depending of course on your luck in the silicon lottery. For this reason motherboards seem to default to a desynced FCLK if you apply an XMP profile faster than 3600. On a serious workstation I would probably not risk going beyond 3200. Memory stability is notoriously hard to stress test, and I would guess the same applies to fabric stability.

This does have a silver lining though: 3200 MHz RAM is not too expensive, unless you want very tight memory timings (CL14-CL15).
It is HyperTransport (Intel's rough equivalent is QPI), though I have no idea how the modern version handles clocking, power saving, etc.
Not anymore. It used to be HyperTransport before Zen. The official marketing name for the current fabric is "Infinity Fabric".
It is not userspace-programmable. If the scheduler knows about it, we might be able to just group threads into cluster-sized groups that access the same pieces of memory, avoiding L3-to-L3 copies (see the sketch below). It claims roughly 40 GB/s; is that full duplex, or half each way? You may already be at the optimum.
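As a rough illustration of that grouping idea (not something OpenBLAS does today), a process can be restricted to the cores of one CCX before any BLAS threads are spawned; the CPU numbers below are an assumption and depend on the machine's topology and SMT numbering, and OpenBLAS's own affinity handling (e.g. a build without NO_AFFINITY) may override the inherited mask:

```c
/* Sketch: confine the current process to one assumed CCX (logical CPUs 0-3).
 * Threads created afterwards inherit this mask unless they set their own.
 * Verify the CPU-to-CCX mapping with lstopo before relying on these numbers. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu <= 3; cpu++)   /* assumed: one CCX = CPUs 0..3 */
        CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... BLAS calls made from here on would run on the chosen CCX ... */
    printf("process restricted to CPUs 0-3\n");
    return 0;
}
```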
Hello. Thank you very much for the measurement script. I modified that a bit and pushed here; For the
I verified the numbers with
Hey. |
Hey. I've just prepared a comparison on one which runs
all numbers are collected here: Based on the numbers I was able to identify the following problems:
I'm going to bisect the other performance issues.
@marxin I did most of the SGEMM and DGEMM benchmarks with the 2 programs "sgemmtest_new" and "dgemmtest_new" in my repository GEMM_AVX2_FMA3. When using them on Zen processors, please set the environment variable MKL_DEBUG_CPU_TYPE to 5.
Ok, I see the program depends on an MKL header file (and needs to be linked against MKL).
Sure. A difference is that you probably use OpenMP with multiple threads, am I right?
@marxin couldn't you use the provided binaries from wjc404's repo (which also have MKL statically linked)? And I seem to recall the performance figures were obtained for both single and multiple threads.
@marxin If you have confirmed a significant performance drop of SGEMM (especially in serial execution with dimensions > 4000) on Zen/Zen+ chips after PR #2361, then you can try to specify different SGEMM kernels for Zen and Zen 2 (probably by editing "KERNEL.ZEN" and "param.h" and modifying the CPU detection code, to choose "sgemm_kernel_16x4_haswell.S" for Zen/Zen+ and "sgemm_kernel_8x4_haswell.c" for Zen 2) and make it a PR. Unfortunately I cannot access Google's website from China to download your results. Currently I don't have a machine with a Zen/Zen+ CPU to test on.
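A hedged sketch of how Zen/Zen+ could be told apart from Zen 2 in CPU detection code (this is not OpenBLAS's actual cpuid_x86.c logic; treating family 17h with model >= 0x30 as Zen 2 is a heuristic based on known Zen 2 models such as Rome 0x31 and Matisse 0x71):

```c
/* CPUID-based Zen vs Zen 2 discrimination sketch. Build with gcc -O2. */
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;

    /* displayed family = base family + extended family (for family >= 0xF);
     * displayed model  = base model | (extended model << 4) */
    unsigned int family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
    unsigned int model  = ((eax >> 4) & 0xf) | (((eax >> 16) & 0xf) << 4);

    if (family == 0x17)
        printf("family 17h, model 0x%02x -> %s\n", model,
               model >= 0x30 ? "Zen 2 (would pick sgemm_kernel_8x4_haswell.c)"
                             : "Zen/Zen+ (would pick sgemm_kernel_16x4_haswell.S)");
    else
        printf("not an AMD family 17h CPU (family 0x%02x)\n", family);
    return 0;
}
```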
I believe the speed drops in xDOT post 0.3.6 might be due to #1965, if they are not just an artefact. If I read your table correctly, your figures for DSDOT/SDSDOT are even worse than for ZDOT, and they definitely did not receive any changes except that fix for undeclared clobbers.
@wjc404 this is marxin's spreadsheet exported from the Google Docs site in .xlsx format.
@martin-frbg Thanks.
Note that my spreadsheet only contains results for single-threaded runs. I haven't had time to run parallel tests. I'm planning to do that.
Yes, I will test the suggested changes.
Ok, I've just made a minimal reversion of #2361 which restores speed on |
I've just re-run that locally and I can't get the slower numbers for current |
Perhaps with Ryzen vs EPYC we are introducing some other variable besides znver1/znver2, even when running on a single core? Unfortunately I cannot run benchmarks on my 2700K in the next few days (and I remember it was not easy to force it to run at a fixed core frequency with actually reproducible speeds).
I did the benchmark given above with my new Ryzen 7 3700X. I set the CPU frequency to 3.6 GHz (verified with zenmonitor), switching off any Turbo Core boost or Precision Boost Overdrive settings in the BIOS. I have 2x8 GB of RAM installed at 3200 MHz. The results for the last 3 releases of OpenBLAS are given in the spreadsheet.
Zen 2 is now released, bringing a number of improvements to the table.
Most notably, it now has 256-bit wide AVX units. This should in theory allow performance parity with Haswell through Coffee Lake CPUs, and initial results suggest this is true (at least for a single thread).
https://i.imgur.com/sFhxPrW.png
The chips also have double the L3 cache and a generally reworked cache hierarchy. One thing to note is that these chips do not have enough TLB capacity to cover all of L2 and L3, so huge pages might be a little more important.
I might be able to get my hands on a Zen 2 system in ~1-2 months.
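Regarding the TLB/huge-page point above, here is a minimal sketch of requesting transparent huge pages for a large buffer on Linux (a generic technique, not something OpenBLAS does automatically; the 64 MiB size and 2 MiB alignment are arbitrary choices):

```c
/* Ask the kernel to back a large working buffer with transparent huge
 * pages via madvise(MADV_HUGEPAGE), reducing dTLB pressure when the
 * working set exceeds TLB coverage. Requires CONFIG_TRANSPARENT_HUGEPAGE. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    const size_t bytes = 64UL << 20;        /* 64 MiB working set */
    void *buf = NULL;

    /* align to the 2 MiB huge-page size so whole pages can be promoted */
    if (posix_memalign(&buf, 2UL << 20, bytes) != 0) {
        perror("posix_memalign");
        return 1;
    }
    if (madvise(buf, bytes, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");   /* non-fatal: falls back to 4 KiB pages */

    memset(buf, 0, bytes);                  /* touch the pages */
    printf("buffer ready at %p\n", buf);
    free(buf);
    return 0;
}
```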