Benchmarking some benchmarks

"Casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C."

Let's look a bit closer at some popular passive benchmarks...

Dhrystone / DMIPS / DMIPS/MHz

DMIPS are 'Dhrystone MIPS', the single-threaded score of the Dhrystone benchmark, developed in 1984 (four decades ago!) as an improvement over the older MIPS metric ('million instructions per second', which had become kinda pointless with the 'RISC vs. CISC' battle).

Modelled on statistics gathered from programs written in the languages popular back then (FORTRAN, PL/1, SAL, ALGOL 68, and Pascal) and suited to the single-core CPUs of that time (which almost every modern MCU now outperforms), Dhrystone results do not represent anything that runs on today's computers. DMIPS have been misleading for decades, for the following reasons (quoted from Wikipedia and an ARM White Paper):

  • Dhrystone features unusual code that is not usually representative of modern real-life programs.
  • Dhrystone is susceptible to compiler optimizations. For example, it does a lot of string copying in an attempt to measure string copying performance. However, the strings in Dhrystone are of known constant length and their starts are aligned on natural boundaries, two characteristics usually absent from real programs. Therefore, an optimizer can replace a string copy with a sequence of word moves without any loops, which will be much faster. This optimization consequently overstates system performance, sometimes by more than 30%.
  • Dhrystone's small code size may fit in the instruction cache of a modern CPU, so that instruction fetch performance is not rigorously tested. Similarly, Dhrystone may also fit completely in the data cache, thus not exercising data cache miss performance. To counter fits-in-the-cache problem, the SPECint benchmark was created in 1988 to include a suite of (initially 8) much larger programs (including a compiler) which could not fit into L1 or L2 caches of that era.
  • Dhrystone numbers actually reflect the performance of the C compiler and libraries, probably more so than the performance of the processor itself.
  • Dhrystone’s execution is largely spent in standard C library functions, such as strcmp(), strcpy(), and memcpy(). Compiler vendors generally provide these libraries that are typically optimized and hand-written in assembly language. While you may think you are benchmarking a processor, what you are really benchmarking are the compiler writer’s optimizations of the C library functions for a particular platform.

Maybe even more concerning is the completely flawed way those scores are generated in the wild. And of course the results you find somewhere on the net usually lack all the important info (like which OS, which libs and which compiler with which flags have been used).
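A result only becomes comparable if that info is recorded right next to the score. A minimal sketch of how this could look, assuming a Unix-like system with `uname`; `gcc`, `ldd`, `/etc/os-release` and the `CFLAGS` variable are treated as optional, and the function name `record_env` as well as the key names are made up for illustration:

```shell
# Hypothetical helper: print the details that published Dhrystone scores
# usually omit, one key=value pair per line.
record_env() {
    echo "kernel=$(uname -srm)"
    # Distribution name, if the (Linux-specific) os-release file exists
    if [ -r /etc/os-release ]; then
        echo "os=$(. /etc/os-release && echo "${PRETTY_NAME}")"
    fi
    # Compiler and C library versions; both tools are optional here
    if command -v gcc >/dev/null 2>&1; then
        echo "compiler=$(gcc --version | head -n 1)"
    fi
    if command -v ldd >/dev/null 2>&1; then
        echo "libc=$(ldd --version 2>&1 | head -n 1)"
    fi
    # CFLAGS is only a convention; record whatever the build really used
    echo "cflags=${CFLAGS:-unknown}"
}

record_env
```

For a prebuilt binary the locally installed compiler says nothing, of course: what matters then is the toolchain and the flags the binary was built with, which only its author can document.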

Using the dhrystonePi64 binary ('Dhrystone Benchmark, Version 2.1, Language: C or C++') from http://www.roylongbottom.org.uk/dhrystone%20results.htm on an RK3588 device with four A76 CPU cores combined with four A55, while switching the memory clock speed between 2112 MHz (performance DMC governor) and 528 MHz (powersave DMC governor), we get these results:

| Dhrystone 2.1 result | A76 / 2112 MHz | A76 / 528 MHz | A55 / 2112 MHz | A55 / 528 MHz |
| --- | --: | --: | --: | --: |
| Nanoseconds one Dhrystone run | 30.43 | 30.85 | 90.96 | 90.84 |
| Dhrystones per Second | 32860262 | 32415115 | 10994401 | 11008762 |
| VAX MIPS rating | 18702.48 | 18449.13 | 6257.49 | 6265.66 |

As can be seen, the memory clock doesn't matter at all: Dhrystone's working set is so small that it fits completely into the CPU caches, something the benchmark was already criticized for decades ago.
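To put a number on that, the 'Dhrystones per Second' values from the table can be set in relation to each other. A small sketch (the helper name `pct` is made up here):

```shell
# pct NEW OLD: NEW as a percentage of OLD, one decimal place
pct() { awk -v n="$1" -v o="$2" 'BEGIN { printf "%.1f\n", 100 * n / o }'; }

# Score with DRAM at 528 MHz relative to the score with DRAM at 2112 MHz
echo "A76: $(pct 32415115 32860262)%"
echo "A55: $(pct 11008762 10994401)%"
```

The DRAM clock is cut to a quarter, yet the A76 keeps 98.6% of its score and the A55 even ends up at 100.1%, which looks far more like measurement noise than like a memory effect.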

When limiting the A76 cores to the same 1.8 GHz the A55 cores are clocked at, we get the following result, which allows a DMIPS/MHz comparison at equal clock speed:

```
Nanoseconds one Dhrystone run:        39.15
Dhrystones per Second:             25542413
VAX MIPS rating =                  14537.51
```

The VAX MIPS ratings generated with the same dhrystone binary suggest the A76 is 2.32 times faster than an A55 at the same clock speed (14540 / 6260 = 2.32). Interesting, since places like Wikipedia tell us the A76 would be 3.5 – 4.1 times faster than the A55 of this popular DynamIQ pairing (see table below). What went wrong at Wikipedia? Maybe ignoring that Dhrystone, in the 'fire and forget' mode it's always used in, is more a compiler than a hardware benchmark?
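For reference, the arithmetic behind these figures: the 'VAX MIPS rating' is Dhrystones per second divided by 1757 (the score of the VAX 11/780 reference machine), and DMIPS/MHz divides that rating by the CPU clock. A sketch using the numbers from above; the helper name `div` is made up, and the 1800 MHz is simply the '1.8 GHz' from the text, not a measured clock:

```shell
# div A B: A / B with two decimal places
div() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

# VAX MIPS rating (DMIPS) = Dhrystones per second / 1757
echo "A76 @ 1.8 GHz: $(div 25542413 1757) DMIPS"

# DMIPS/MHz = rating reported by the benchmark / CPU clock in MHz
echo "A76: $(div 14537.51 1800) DMIPS/MHz"
echo "A55: $(div 6257.49 1800) DMIPS/MHz"

# Relative speed at the same clock
echo "A76 vs. A55: $(div 14537.51 6257.49)x"
```

Under these assumptions this particular binary yields roughly 8.1 DMIPS/MHz for the A76 and 3.5 DMIPS/MHz for the A55, numbers that again match none of the published listings in the table further down.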

Here is one of the few examples of using Dhrystone in a non-flawed way (same Dhrystone binary, hence same compiler version and same compiler flags, run on the same OS image, hence same libraries), with a few different ARMv8 Cortex cores:

  • A35 – 1.7 DMIPS/MHz
  • A53 – 2.2 DMIPS/MHz
  • A57 – 4.1 DMIPS/MHz
  • A72 – 4.5 DMIPS/MHz
  • A73 – 4.8 DMIPS/MHz
  • A75 – 6.1 DMIPS/MHz
  • A77 – 7.3 DMIPS/MHz

But of course you also find totally different numbers all over the web, for example at Wikipedia, Baselabs and even two differing DMIPS/MHz listings at bluelucky.

ARM Core Measured Wikipedia Baselabs bluelucky 1 bluelucky 2
A5 1.57
A7 1.9 1.9 1.9
A8 2.0 2.0 2.0
A9 2.5 2.0 2.5
A15 3.5 4.0 3.4
A17 2.8 4.0 3.2
A32 2.3 2.3
A35 1.7 1.78 2.5 2.5
A53 2.2 2.3 2.3 2.3 2.3
A55 3 3 2.3 2.7
A57 4.1 4.1 – 4.8 4.6 4.1
A72 4.5 6.3 – 7.3 7.4 5.4 4.7
A73 4.8 7.4 – 8.5 7.0 4.8
A75 6.1 8.2 – 9.5 7.0 5.2
A76 10.7 – 12.4 12
A77 7.3 13 – 16

The correctly measured Dhrystone MIPS/MHz score suggests the Cortex-A72 (an out-of-order big core) is slightly more than twice as fast as the corresponding Cortex-A53 (an in-order little core meant to be combined with A72/A73 for big.LITTLE hybrid CPU designs). But when trusting Wikipedia, the A72 is almost 3 times faster. And with the Cortex-A77, for example, it gets even weirder, since the Wikipedia numbers and the correctly determined ones differ even more.
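The 'big vs. little' ratios mentioned here follow directly from the A72 and A53 rows of the table. A short sketch, with `ratio` being a made-up helper name:

```shell
# ratio BIG LITTLE: how many times faster BIG is, two decimal places
ratio() { awk -v b="$1" -v l="$2" 'BEGIN { printf "%.2f\n", b / l }'; }

# Cortex-A72 vs. Cortex-A53, DMIPS/MHz values taken from the table
echo "Measured:       $(ratio 4.5 2.2)"
echo "Wikipedia low:  $(ratio 6.3 2.3)"
echo "Wikipedia high: $(ratio 7.3 2.3)"
```

This prints 2.05 for the measured pair and 2.74 to 3.17 for the Wikipedia range, so depending on the source the very same core pairing looks up to about 50% better.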

Blender

Blender is a popular open source render engine/tool that got its own benchmark tool a few years ago. Since I was interested in Apple's raytracing functionality introduced with their M3 SoCs (this little patch does the magic from version 4.0.0 on), I compared 4.0.0 with 3.6.0 scores:

| GPU | 4.0.0 score | 3.6.0 score | difference |
| --- | --: | --: | --: |
| Apple M3 Max (GPU - 40 cores) | 3417.29 | 3014.83 | 113.3% |
| Apple M3 Pro (GPU - 18 cores) | 1510.37 | 1314.46 | 114.9% |

So 'hardware raytracing' accounts for a performance improvement of less than 15%? Let's have a closer look at whether benchmark scores obtained with different versions can be compared in the first place...

Grabbing data from https://opendata.blender.org/ on 22nd Nov 2023 and filtering out all devices with fewer than 4 scores (52 GPU models remaining), we see a 'drop in performance' with 46 of them compared to the older 3.6.0 version. Nvidia GPUs are especially affected (the RTX 4060 Ti being the 'worst'), but Apple's older SoCs lose 3% to 8% as well. As such we can assume that the benefit of HW accelerated raytracing on the M3 SoCs accounts for a performance improvement in Blender closer to 20%.
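One way to arrive at a figure like that (an assumption about the reasoning, not something the data states directly) is to treat the older Apple GPUs, which lack the raytracing hardware, as the baseline for the drift between both Blender versions and to express the M3 results relative to their mean. The seven percentages are the 4.0.0 vs. 3.6.0 values of the M1 and M2 GPUs from the detailed comparison below; `normalize` is a made-up helper name:

```shell
# 4.0.0 vs. 3.6.0 percentages of the pre-M3 Apple GPUs
older="92.2 92.7 93.6 94.0 94.0 96.4 97.3"

# normalize PCT: PCT relative to the mean of the older Apple GPUs
normalize() {
    echo "${older}" | awk -v p="$1" '{
        for (i = 1; i <= NF; i++) sum += $i
        printf "%.1f\n", 100 * p / (sum / NF)
    }'
}

echo "M3 Max: $(normalize 113.3)%"
echo "M3 Pro: $(normalize 114.9)%"
```

With a baseline of roughly 94.3% this gives 120.1% for the M3 Max and 121.8% for the M3 Pro, so around 20% instead of the 13% to 15% the raw scores suggest.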

Comparing 4.0.0 with 3.6.0 in detail
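```shell
# Pair each device's 4.0.0 score with its 3.6.0 score and print
# markdown table rows, sorted by the relative difference.
grep '^"' Blender-4.0.0.csv | while IFS= read -r Line ; do
    Device="$(awk -F'"' '{print $2}' <<<"${Line}")"
    Score4="$(awk -F'"' '{print $4}' <<<"${Line}")"
    # -F: treat the quoted device name as a literal string, not a regex
    Score3="$(grep -F "\"${Device}\"" Blender-3.6.0.csv | awk -F'"' '{print $4}')"
    # skip devices without a 3.6.0 score (avoids a division by zero)
    [ -n "${Score3}" ] || continue
    Diff="$(awk '{printf ("%0.1f",100*$1/$2); }' <<<"${Score4} ${Score3}")"
    echo -e "| ${Device} | ${Score4} | ${Score3} | ${Diff}% |"
done | sort -t '|' -k 5 -n
```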

| GPU | 4.0.0 score | 3.6.0 score | difference |
| --- | --: | --: | --: |
| NVIDIA GeForce RTX 4060 Ti | 3451.59 | 4306.28 | 80.2% |
| NVIDIA GeForce RTX 2060 | 1541.86 | 1851.51 | 83.3% |
| NVIDIA GeForce RTX 2070 | 2074.64 | 2441.76 | 85.0% |
| NVIDIA GeForce RTX 4090 | 11337.02 | 13093.11 | 86.6% |
| NVIDIA GeForce RTX 3080 Ti | 5253.9 | 6055.71 | 86.8% |
| NVIDIA GeForce RTX 3070 Ti | 3557.07 | 4092.95 | 86.9% |
| NVIDIA GeForce RTX 4060 | 3056.69 | 3482.13 | 87.8% |
| NVIDIA GeForce RTX 3080 | 4605.96 | 5227.13 | 88.1% |
| NVIDIA GeForce RTX 3070 | 3268.63 | 3704.15 | 88.2% |
| NVIDIA GeForce GTX 1660 Ti | 753.36 | 851.8 | 88.4% |
| NVIDIA GeForce RTX 2060 SUPER | 2167.41 | 2449.2 | 88.5% |
| NVIDIA GeForce RTX 3060 Ti | 2835.63 | 3195.27 | 88.7% |
| NVIDIA GeForce RTX 4080 Laptop GPU | 5650.66 | 6371.23 | 88.7% |
| NVIDIA GeForce RTX 3060 | 2246.81 | 2531.17 | 88.8% |
| NVIDIA GeForce RTX 4070 Ti | 6514.48 | 7290.21 | 89.4% |
| NVIDIA GeForce RTX 4080 | 8558.09 | 9575.48 | 89.4% |
| NVIDIA GeForce RTX 3090 | 5651.84 | 6289.07 | 89.9% |
| NVIDIA GeForce RTX 2080 SUPER | 2357.52 | 2617.67 | 90.1% |
| NVIDIA GeForce RTX 4090 Laptop GPU | 7388.08 | 8203.46 | 90.1% |
| NVIDIA GeForce GTX 1660 SUPER | 749.46 | 830.35 | 90.3% |
| NVIDIA GeForce RTX 4050 Laptop GPU | 2610.53 | 2889.12 | 90.4% |
| NVIDIA GeForce RTX 3050 Laptop GPU | 1212.3 | 1340.07 | 90.5% |
| NVIDIA GeForce RTX 3060 Laptop GPU | 2390.45 | 2617.27 | 91.3% |
| NVIDIA GeForce RTX 4060 Laptop GPU | 3351.88 | 3645.67 | 91.9% |
| NVIDIA GeForce RTX 4070 Laptop GPU | 3674.3 | 3999.65 | 91.9% |
| Apple M2 Max (GPU - 38 cores) | 1765.03 | 1914.88 | 92.2% |
| NVIDIA GeForce GTX 1070 | 528.56 | 573.36 | 92.2% |
| NVIDIA GeForce RTX 2070 SUPER | 2398.9 | 2602.16 | 92.2% |
| NVIDIA GeForce RTX 2080 Ti | 3075.79 | 3333.86 | 92.3% |
| NVIDIA GeForce RTX 4070 | 5581.39 | 6028.95 | 92.6% |
| Apple M1 Max (GPU - 32 cores) | 933.21 | 1006.63 | 92.7% |
| NVIDIA GeForce GTX 1080 Ti | 829.75 | 894.47 | 92.8% |
| AMD Radeon RX 6800 | 1793.94 | 1929.72 | 93.0% |
| NVIDIA GeForce RTX 3070 Ti Laptop GPU | 3071.26 | 3287.45 | 93.4% |
| AMD Radeon RX 7800 XT | 2270 | 2427.85 | 93.5% |
| Apple M2 Max (GPU - 30 cores) | 1451.12 | 1550.73 | 93.6% |
| Apple M1 (GPU - 8 cores) | 249.92 | 265.98 | 94.0% |
| Apple M2 Ultra (GPU - 76 cores) | 3214.87 | 3420.98 | 94.0% |
| Intel Arc A770 Graphics | 1980.98 | 2106.39 | 94.0% |
| AMD Radeon RX 6700 XT | 1490.09 | 1566.79 | 95.1% |
| Apple M1 Pro (GPU - 16 cores) | 469.32 | 487.02 | 96.4% |
| Apple M1 Max (GPU - 24 cores) | 774.97 | 796.77 | 97.3% |
| AMD Radeon RX 7900 XTX | 3958.38 | 3980.96 | 99.4% |
| AMD Radeon RX 6900 XT | 2597.21 | 2611.39 | 99.5% |
| NVIDIA RTX A4000 | 3397.06 | 3408.64 | 99.7% |
| AMD Radeon RX 6800 XT | 2432.05 | 2437.77 | 99.8% |
| Intel Arc A750 Graphics | 2058.68 | 2054.02 | 100.2% |
| NVIDIA GeForce RTX 3070 Laptop GPU | 3171.62 | 3161.06 | 100.3% |
| AMD Radeon RX 6950 XT | 2776.02 | 2751.71 | 100.9% |
| AMD Radeon RX 6700 | 1404.47 | 1347.65 | 104.2% |
| Apple M3 Max (GPU - 40 cores) | 3417.29 | 3014.83 | 113.3% |
| Apple M3 Pro (GPU - 18 cores) | 1510.37 | 1314.46 | 114.9% |

Does this only affect 3.6.0 vs. 4.0.0, so that we can at least rely on Blender 3.x scores being comparable? Nope, there it's even worse: 3.6.0 vs. 3.0.1 ends up with some GPUs becoming almost 'four times faster'.

Comparing 3.6.0 with 3.0.1 in detail
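```shell
# Same comparison as before, now 3.6.0 (new) against 3.0.1 (old).
grep '^"' Blender-3.6.0.csv | while IFS= read -r Line ; do
    Device="$(awk -F'"' '{print $2}' <<<"${Line}")"
    Score36="$(awk -F'"' '{print $4}' <<<"${Line}")"
    # -F: treat the quoted device name as a literal string, not a regex
    Score30="$(grep -F "\"${Device}\"" Blender-3.0.1.csv | awk -F'"' '{print $4}')"
    # skip devices without a 3.0.1 score (avoids a division by zero)
    [ -n "${Score30}" ] || continue
    Diff="$(awk '{printf ("%0.1f",100*$1/$2); }' <<<"${Score36} ${Score30}")"
    echo -e "| ${Device} | ${Score36} | ${Score30} | ${Diff}% |"
done | sort -t '|' -k 5 -n
```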

| GPU | 3.6.0 score | 3.0.1 score | difference |
| --- | --: | --: | --: |
| NVIDIA GeForce GTX 660 | 124.11 | 150.92 | 82.2% |
| NVIDIA GeForce GTX 1060 6GB | 390.68 | 443.65 | 88.1% |
| NVIDIA GeForce GTX 1050 | 185.12 | 208.68 | 88.7% |
| NVIDIA GeForce GTX 1070 | 573.36 | 624.6 | 91.8% |
| NVIDIA GeForce RTX 3070 Ti Laptop GPU | 3287.45 | 3495.73 | 94.0% |
| NVIDIA Quadro RTX 4000 | 2342.57 | 2485.81 | 94.2% |
| NVIDIA GeForce GTX 1650 | 480.56 | 505.67 | 95.0% |
| NVIDIA GeForce GTX 1050 Ti | 231.88 | 242.73 | 95.5% |
| NVIDIA GeForce GTX 1080 Ti | 894.47 | 935.9 | 95.6% |
| NVIDIA Quadro RTX 6000 | 3370.77 | 3521.78 | 95.7% |
| NVIDIA GeForce GTX 1080 | 621.83 | 643.01 | 96.7% |
| NVIDIA GeForce GTX 1660 Ti | 851.8 | 879.17 | 96.9% |
| NVIDIA GeForce GTX 1650 Ti | 518.39 | 533.94 | 97.1% |
| NVIDIA GeForce GTX 970 | 323.6 | 333.36 | 97.1% |
| NVIDIA GeForce GTX 1660 | 777.73 | 799.05 | 97.3% |
| NVIDIA GeForce RTX 3080 Laptop GPU | 3300.05 | 3378.88 | 97.7% |
| NVIDIA GeForce GTX 1660 SUPER | 830.35 | 849.14 | 97.8% |
| NVIDIA GeForce RTX 2060 SUPER | 2449.2 | 2487.79 | 98.4% |
| NVIDIA GeForce RTX 2070 with Max-Q Design | 2026.81 | 2055.26 | 98.6% |
| NVIDIA GeForce GTX 1060 | 380.67 | 385.74 | 98.7% |
| NVIDIA GeForce RTX 2080 Ti | 3333.86 | 3373.16 | 98.8% |
| NVIDIA GeForce RTX 3060 | 2531.17 | 2513.31 | 100.7% |
| NVIDIA GeForce RTX 3050 | 1659.88 | 1629.28 | 101.9% |
| NVIDIA GeForce RTX 2060 | 1851.51 | 1809.77 | 102.3% |
| NVIDIA GeForce RTX 2080 | 2549.7 | 2490.22 | 102.4% |
| NVIDIA GeForce RTX 3060 Ti | 3195.27 | 3120.36 | 102.4% |
| AMD Radeon RX 5700 XT | 955.06 | 932.15 | 102.5% |
| NVIDIA GeForce RTX 2080 SUPER | 2617.67 | 2535.25 | 103.3% |
| NVIDIA GeForce RTX 2070 SUPER | 2602.16 | 2505.17 | 103.9% |
| NVIDIA GeForce RTX 3080 | 5227.13 | 5029.25 | 103.9% |
| NVIDIA GeForce RTX 3070 Laptop GPU | 3161.06 | 3023.2 | 104.6% |
| NVIDIA GeForce RTX 3070 | 3704.15 | 3506.28 | 105.6% |
| NVIDIA RTX A6000 | 5785.7 | 5472.1 | 105.7% |
| NVIDIA GeForce RTX 3080 Ti | 6055.71 | 5711.42 | 106.0% |
| NVIDIA GeForce RTX 3070 Ti | 4092.95 | 3849.24 | 106.3% |
| NVIDIA GeForce RTX 2070 | 2441.76 | 2252.86 | 108.4% |
| NVIDIA GeForce RTX 3090 | 6289.07 | 5764.34 | 109.1% |
| NVIDIA GeForce RTX 3060 Laptop GPU | 2617.27 | 2372.01 | 110.3% |
| NVIDIA GeForce RTX 3050 Laptop GPU | 1340.07 | 1207.17 | 111.0% |
| AMD Radeon RX 6700 XT | 1566.79 | 1359.52 | 115.2% |
| AMD Radeon RX 6900 XT | 2611.39 | 2262.86 | 115.4% |
| AMD Radeon RX 6700S | 918.46 | 789.23 | 116.4% |
| NVIDIA GeForce RTX 3080 Ti Laptop GPU | 3978.32 | 3385.31 | 117.5% |
| AMD Radeon RX 5500 XT | 506.08 | 428.89 | 118.0% |
| AMD Radeon RX 6800 XT | 2437.77 | 2061.51 | 118.3% |
| AMD Radeon PRO W6800 | 1880.56 | 1584.62 | 118.7% |
| AMD Radeon RX 6600 XT | 1103.97 | 928.68 | 118.9% |
| AMD Radeon RX 6600 | 1011.36 | 850.45 | 118.9% |
| NVIDIA GeForce RTX 3050 Ti Laptop GPU | 1514.43 | 1253.98 | 120.8% |
| NVIDIA Tesla T4 | 1727.73 | 445.36 | 387.9% |
| NVIDIA RTX A2000 8GB Laptop GPU | 1473.89 | 375.63 | 392.4% |

As usual: scores generated with different software versions can't be compared!