Proposal: add stockfish benchmark #55

Closed
ThomasKaiser opened this issue Nov 9, 2022 · 5 comments

Comments

@ThomasKaiser
Owner

ThomasKaiser commented Nov 9, 2022

From cnx-software.

The first invocation on the Rock 5B in lazy mode (phoronix-test-suite benchmark pts/stockfish-1.4.0) already ended with the board freezing during the 2nd stockfish run. After connecting the fan to power and repeating the test, the board froze again during the 2nd stockfish bench 128 8 24 default depth run.

The general problem was already known: on some boards the highest DRAM clock isn't usable, and users need to switch from 2112 MHz to 1560 MHz for stable operation.

My board hadn't seen any freezes at the highest DRAM clock so far, so this was a surprise. By updating my Armbian image to the latest version I was hoping to get the most recent boot BLOBs as part of the u-boot package. It now reads ii linux-u-boot-rock-5b-legacy 22.11.0-trunk.0106 arm64 Uboot loader 2017.09, but the problems got even worse: the board now freezes at 2112 MHz DRAM clock already during the 1st benchmark execution. Maybe @amazingfate can comment on whether my OS image is expected to run on the latest BLOBs or not?

With a lower DRAM clock everything works as expected, but at 2112 MHz DRAM clock the board freezes regardless of the A76 clockspeed (and as such DVFS/consumption), so it looks solely related to DRAM clock:

| A76 clock | DRAM clock | Watts | SoC temp | Nodes per second |
|---|---|---|---|---|
| 2360 MHz | 528 MHz | 8-9W | 40°C | 3238057 |
| 2360 MHz | 1068 MHz | 9-10W | 43.5°C | 4122771 |
| 2360 MHz | 1560 MHz | 10-11W | 46°C | 4653285 |
| 2360 MHz | 2112 MHz | 12W | 46°C | freeze |
| 1800 MHz | 2112 MHz | 8-9W | 39°C | freeze |

With other CPU benchmarks I haven't seen consumption exceeding 9W on the Rock 5B, so stockfish is really a potent load generator / stability tester. On top of making heavy use of SIMD extensions it is also heavy on memory access: walking through the different DRAM clockspeeds ended up with significantly different scores: https://openbenchmarking.org/result/2211099-NE-2211093NE82

A quick check on an AMD EPYC 7232P (8C/16T) also hints at stockfish being more demanding than both cpuminer and 7-zip:

The first chart is from a NetIO powermeter (measuring at the wall), the 2nd is the server's internal BMC showing PSU1 (PSU2 is always in standby on this machine, so the whole productive consumption is on PSU1), and the last two are the BMC measurements for CPU and DRAM separately (though I have no idea to which number the memory controller contributes):

[Image: Screenshot 2022-11-09 at 19.52.39 (copy)]

@ThomasKaiser
Owner Author

ThomasKaiser commented Nov 9, 2022

And while we're at it, let's benchmark some benchmarks, here with regard to the influence of DRAM clockspeed: how it affects memory bandwidth and latency in particular, and the scores currently used by sbc-bench plus stockfish.

The values are as follows:

  • DRAM: the DRAM clock in MHz, configured via the userspace DMC governor
  • 7-zip multi: 7-ZIP MIPS generated with all cores (A76 at ~2360 MHz, A55 at 1840 MHz)
  • 7-zip single: 7-ZIP MIPS measured on an A76 at ~2360 MHz
  • AES: from an A76 and always the same, since the ARMv8 Crypto Extensions do the job and the score scales linearly with CPU clockspeed
  • memcpy: score from an A76 reported by tinymembench
  • memset: score from an A76 reported by tinymembench
  • 4M ns: 'single random read' / 'dual random read' latency from an A76 with 4M block size reported by tinymembench
  • 64M ns: 'single random read' / 'dual random read' latency from an A76 with 64M block size reported by tinymembench
  • kH/s: cpuminer scores generated on all cores working in parallel
  • stockfish: the 'Nodes per second' score generated on all cores with stockfish bench 128 8 24 default depth

| DRAM | 7-zip single | 7-zip multi | AES | memcpy | memset | 4M ns | 64M ns | kH/s | stockfish |
|---|---|---|---|---|---|---|---|---|---|
| 528 | 2587 | 13050 | 1344830 | 3570 | 8450 | 63.2/99.3 | 235.8/271.3 | 22.06 | 3238057 |
| 1068 | 2940 | 15120 | 1344500 | 6270 | 16950 | 46.9/73.6 | 166.3/192.2 | 22.05 | 4122771 |
| 1560 | 3086 | 16040 | 1344060 | 8620 | 24390 | 38.6/58.8 | 139.9/158.0 | 22.03 | 4653285 |
| 2112 | 3167 | 16640 | 1343220 | 10850 | 29330 | 35.7/53.7 | 123.2/139.0 | 22.03 | freeze |
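
For anyone wanting to repeat the DRAM clock sweep, here is a minimal sketch of pinning the clock through the devfreq userspace governor. It assumes the dmc devfreq node sits at the path used elsewhere in this issue and that it takes frequencies in Hz (check `available_frequencies` in the same directory first); `set_dram_clock` is a hypothetical helper name, not part of sbc-bench.

```shell
# Sketch only: assumes the dmc devfreq path from this issue and that the node
# accepts frequencies in Hz. set_dram_clock is a hypothetical helper name.
set_dram_clock() {
    mhz="$1"
    dir="${2:-/sys/devices/platform/dmc/devfreq/dmc}"
    # The userspace/set_freq file is only usable with the userspace governor.
    echo userspace > "$dir/governor" || return 1
    # Append six zeros to convert MHz to Hz without shell arithmetic.
    echo "${mhz}000000" > "$dir/userspace/set_freq"
}

# Example (as root), walking through the clocks that ran stable above:
# for mhz in 528 1068 1560; do
#     set_dram_clock "$mhz" && stockfish bench 128 8 24 default depth
# done
```

The optional second argument only exists so the helper can be pointed at a different devfreq directory (or a scratch directory for a dry run).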

To interpret the results (not talking about memory bandwidth/latency since these numbers are self-explanatory):

  • 7-zip single: the single-threaded score depends highly on memory latency, so a lower DRAM clock, which results in massively higher latency, negatively affects the scores. Scores generated with 7-zip v16.02 are almost the same regardless of the distribution, thanks to the p7zip package on Linux being more or less unmaintained. At least Debian Stretch, Buster, Bullseye and Ubuntu Bionic, Focal, Jammy, Kinetic all ship with v16.02, and 7-zip MIPS on the same hardware with otherwise identical settings have produced the same score for over six consecutive years now (7-zip distro packages built with GCC 6.3 up to GCC 12.2).
  • 7-zip multi: the same applies as for 7-zip single, but there's a huge caveat: depending on kernel version the multi-threaded scores can differ significantly. That's not a benchmarking flaw, since it also affects real-world tasks supposed to run fully parallel; see the ODROID-XU4 example below.
  • AES: from an A76 and always the same, since the ARMv8 Crypto Extensions do the job and the score scales linearly with clockspeed.
  • kH/s: cpuminer scores are not affected by DRAM clock (the working set is too small, so everything fits into CPU caches) but by compiler version and flags. See the three Rock64 1400 MHz scores in my results list that only differ by GCC 6.3 vs. 7.3 vs. 8.2, or the fact that cpuminer generates a 25.31 score when built with GCC 9.3 vs. the roughly 13% lower score of about 22 when built with GCC 12.2 as above. A higher compiler version number does not always result in better scores.
  • stockfish: on the other hand, this depends significantly on DRAM clock. So far no idea whether that's related to bandwidth, latency or both.
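
To put rough numbers on the points above, this small sketch computes the relative change of each score when going from 528 MHz to 1560 MHz DRAM clock. The input values are copied from the table; the function name `dram_scaling` and the output layout are just illustrative choices.

```shell
# Relative score change from 528 MHz to 1560 MHz DRAM clock.
# Columns of the embedded data: name, score at 528 MHz, score at 1560 MHz.
dram_scaling() {
    awk '{ printf "%-13s %+.1f%%\n", $1, ($3 / $2 - 1) * 100 }' <<'EOF'
7-zip-single 2587 3086
7-zip-multi 13050 16040
cpuminer 22.06 22.03
stockfish 3238057 4653285
EOF
}

dram_scaling
```

With these inputs stockfish gains roughly 44% while 7-zip gains around 20% and cpuminer stays flat, which matches the interpretation above.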

Speaking of the 7-zip multi scores: those above were all generated with the same kernel version (a smelly 5.10 Rockchip BSP kernel). But with different kernel versions multi-threaded behaviour can change significantly, as already outlined in my reasoning for using 7-zip as a benchmark.

Let's have a look at kernel version and ODROID-XU4:

| Kernel / Compiler | 7-zip single | 7-zip multi | CPU utilisation compression | CPU utilisation decompression |
|---|---|---|---|---|
| Kernel 4.9 / GCC 6.3 | 1622 | 6370 | 64% | 78% |
| Kernel 4.14 / GCC 7.3 | 1633 | 7100 | 64% | 78% |
| Kernel 5.4 / GCC 9.3 | 1604 | 8980 | 94% | 84% |

The single-threaded score is practically the same with all kernel versions, but the multi-threaded scores differ a lot, as does the reported CPU utilization. It's a scheduler problem, not a benchmark problem.
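
One way to make the scheduler effect visible is the ratio of multi-threaded to single-threaded score per kernel. The sketch below uses the ODROID-XU4 numbers from the table; treating this ratio as a rough parallel-scaling indicator is my own simplification, and `xu4_scaling` is a hypothetical name.

```shell
# Ratio of 7-zip multi to 7-zip single per kernel version (ODROID-XU4).
# Columns of the embedded data: kernel|single|multi.
xu4_scaling() {
    awk -F'|' '{ printf "%s: %.2fx\n", $1, $3 / $2 }' <<'EOF'
4.9|1622|6370
4.14|1633|7100
5.4|1604|8980
EOF
}

xu4_scaling
```

The ratio climbs from just under 4x on kernel 4.9 to about 5.6x on kernel 5.4 while the single-threaded score barely moves.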

@ThomasKaiser
Owner Author

ThomasKaiser commented Nov 10, 2022

Another suggestion from cnx-software: rule out the A55 cores:

root@rock-5b:/home/tk# echo performance >/sys/devices/platform/dmc/devfreq/dmc/governor
root@rock-5b:/home/tk# echo performance >/sys/devices/system/cpu/cpufreq/policy4/scaling_governor
root@rock-5b:/home/tk# echo performance >/sys/devices/system/cpu/cpufreq/policy6/scaling_governor
root@rock-5b:/home/tk# for i in 3 2 1 0 ; do echo 0 >/sys/devices/system/cpu/cpu${i}/online; done
root@rock-5b:/home/tk# htop # confirm that A55 cores are offline
root@rock-5b:/home/tk# phoronix-test-suite benchmark pts/stockfish-1.4.0
...
Stockfish 15:
    pts/stockfish-1.4.0 [Total Time]
    Test 1 of 1
    Estimated Trial Run Count:    3                      
    Estimated Time To Completion: 14 Minutes [09:38 CET] 
        Started Run 1 @ 09:24:08
        Started Run 2 @ 09:28:58

The Rock 5B froze after 4:45 minutes. Reported consumption 'at the wall': 9-10W (all measurements with an active fan, which contributes 700mW).
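
As a scriptable alternative to checking htop, the sketch below takes cores offline and then reads back their state from sysfs. It assumes the layout used in the session above (cpu0-3 are the A55 cores, cpu4-7 the A76 cores); `offline_cpus` and `cpu_states` are hypothetical helper names.

```shell
# Sketch only: helper names are hypothetical. The first argument is the CPU
# sysfs root (normally /sys/devices/system/cpu), followed by CPU numbers.
offline_cpus() {
    root="$1"; shift
    for cpu in "$@"; do
        echo 0 > "$root/cpu$cpu/online" || return 1
    done
}

# Print the online flag (1 = online, 0 = offline) of each requested CPU.
cpu_states() {
    root="$1"; shift
    for cpu in "$@"; do
        printf 'cpu%s=%s\n' "$cpu" "$(cat "$root/cpu$cpu/online")"
    done
}

# Example (as root), taking the A55 cores offline and verifying the result:
# offline_cpus /sys/devices/system/cpu 3 2 1 0
# cpu_states /sys/devices/system/cpu 0 1 2 3 4 5 6 7
```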

@ThomasKaiser
Owner Author

First implementation done: bddc8d4

@amazingfate

Armbian has been updated to the latest bl31 firmware since this commit. You have to check the serial console output to see which firmware is currently in use.

@ThomasKaiser
Owner Author

@amazingfate sbc-bench -s reliably freezes my Rock 5B even with latest BLOBs on 2112 MHz DRAM clock.
