Having trouble replicating results from README on 96-core CPU #10
Hi Jeff,
@ii-BOY - It's slightly more complex than that: Ns is not 1:1 correlated with memory size, and finding the right parameters to make HPL use as much RAM as you have (but not too much) is mostly a matter of trial and error. As this project's README states, with Ns at 105000 the RAM usage is around 91 GB, which is about ideal for a 96 GB RAM system, assuming it's only running the benchmark.
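For anyone following along, there is a common rule of thumb for getting a first-guess Ns before the trial and error: the HPL matrix needs Ns² × 8 bytes, so target roughly 80% of total RAM. This is the usual HPL sizing heuristic, not something prescribed by this repo; MEM_GB=96 below is just an assumption matching the system in this thread:

```sh
# Rough HPL problem-size estimate: the matrix needs Ns^2 * 8 bytes,
# so target ~80% of RAM: Ns ≈ sqrt(0.80 * mem_bytes / 8).
MEM_GB=96
awk -v m="$MEM_GB" 'BEGIN { printf "Ns ≈ %d\n", sqrt(0.80 * m * 2^30 / 8) }'
# Prints Ns ≈ 101527, the same ballpark as the README's Ns=105000.
```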
Seeing that one of the Ampere devs who is benchmarking the same system (and whose numbers are used in the README) has gotten different results, we compared everything about our systems and determined the only real difference is the memory. I am currently running:

And he is running Samsung ECC RAM, same spec though. You wouldn't think different vendors' RAM would cause a 20% performance difference (both are similar down to the CL22 CAS latency...), but stranger things have happened. So I've ordered six sticks of 16 GB Samsung M393A2K40DB3-CWE DDR4-3200 ECC RAM, which should arrive in a day or two; then I'll re-run my tests and see if they're any faster with the Samsung RAM.
For a point of reference, I even tried forcing 3200 (instead of 'Auto') for the memory speed in the BIOS and got the same result (±1%). Here are the current memory speed results from tinymembench:
Run with:
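For anyone wanting to reproduce these numbers, tinymembench is typically built and run like this; this is the stock upstream procedure, not necessarily the exact command used above:

```sh
# Build and run tinymembench from the upstream source.
git clone https://github.com/ssvb/tinymembench.git
cd tinymembench
make
./tinymembench
```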
@geerlingguy I ran tinymembench on my machine, and here are the comparative results.
Except for the last 9 tests, my machine seems to be outperforming yours by ~20%. I'm also attaching a graphical representation of the same.
@geerlingguy The next part of the test was the memory latency test. The results are below:
Run 2 with MADV_HUGEPAGE
I mapped one of the runs into a graph, as seen below. As with other latency benchmarks, the two systems are comparable at L1 and L2 cache; the differences start appearing as we move from L2 cache out to system memory. Once again my results are ~20% faster.
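As context for the "Run 2 with MADV_HUGEPAGE" label above: tinymembench's second latency pass asks the kernel (via madvise) to back the test buffer with transparent huge pages, which cuts TLB misses and usually lowers measured latency. Whether the hint actually takes effect depends on the system's transparent-hugepage policy; a quick way to check it:

```sh
# The bracketed value is the active transparent-hugepage policy;
# 'always' or 'madvise' allows the MADV_HUGEPAGE hint to take effect.
cat /sys/kernel/mm/transparent_hugepage/enabled
# Example output: always [madvise] never
```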
Wow, what a difference the memory seems to make! I got 2 of the 6 new RAM sticks just now. Running HPL with N=50000, I see:
An encouraging early result! The rest of the RAM is coming Monday... And here are the new tinymembench results (note: this is just for the 2x16GB sticks; performance will differ once all the memory channels are filled):
Can't wait for the other sticks to arrive. I will finally pass the 'teraflop on a CPU' barrier :)
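A quick sanity check on that N=50000 run, given that only the two 16 GB sticks were installed at this point: the HPL matrix needs N² × 8 bytes, so it fits comfortably in 32 GB:

```sh
# N=50000 doubles: 50000^2 * 8 bytes = 20 GB, well under 2 x 16 GB.
awk 'BEGIN { printf "%.1f GB\n", 50000^2 * 8 / 1e9 }'
```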
I have a Twitter (X?) thread going about the memory differences. I'm also going to try to see if I can look up timing data in Linux via
Hmm...
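For reference, two standard Linux tools for pulling DIMM details like vendor, part number, and configured speed (mentioning these as common options, not necessarily what was used here):

```sh
# SMBIOS view: vendor, part number, configured speed per DIMM.
sudo dmidecode --type memory
# SPD view (from the i2c-tools package): raw timing data, if the
# SPD EEPROMs are exposed over i2c on this platform.
sudo decode-dimms
```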
Hi Jeff,
@ii-BOY Hi, this test was run just once. Thanks.
Tinymembench results:
New result:
1188.3 Gflops at 296 W = 4.01 Gflops/W
It seems like my Samsung RAM still performs just under whatever RAM @rbapat-ampere is using in his system, which would explain the delta! I think this issue can be closed, as we've found the culprit.
I have just re-created the test bench scenario using a 96-core Ampere Altra Dev Workstation with 96 GB of RAM, running Ubuntu 20.04 server aarch64, with the following kernel:
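The kernel string itself isn't reproduced here, but for anyone replicating the setup, these standard commands capture the details described above:

```sh
uname -a    # kernel version and architecture (aarch64)
lscpu       # should show 96 Ampere Altra cores
free -g     # should show ~96 GB of RAM
```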
I have now replicated this setup with two clean installs (even going so far as removing all my NVMe drives, reformatting them, and re-installing Ubuntu 20.04 aarch64 twice for a completely fresh system).
And both times, I got around 980-1,000 Gflops following the explicit instructions in this repo (see also geerlingguy/sbc-reviews#19).
My most recent run today, on a new fresh install:
And the contents of the HPL.dat file:
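The full file isn't reproduced here; for orientation, the lines that matter most in a standard HPL.dat look like the fragment below. The values shown are illustrative only: Ns=105000 is the README value discussed above, while the NB and the 8x12 process grid are hypothetical placeholders for a 96-core run, not this repo's actual settings.

```
1            # of problems sizes (N)
105000       Ns
1            # of NBs
256          NBs
1            # of process grids (P x Q)
8            Ps
12           Qs
```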
According to the README, I should be getting over 1.2 Tflops using this same configuration.
Can you help me figure out what might be different between my test workstation setup and the one used to generate these results?