M128-28 not getting better scores than M96-28 #11

Closed
geerlingguy opened this issue Sep 12, 2023 · 10 comments

@geerlingguy
Contributor

I've bumped P/Q and tweaked N a bit, but for some reason, on my system the M128-28 CPU I swapped in won't score any higher than my M96-28 CPU...

For the 96-core CPU, see: #10

I have kept everything else in the system identical (same Samsung 96 GB RAM, no additional PCIe cards, same USB network adapter plugged in, using VGA monitor output), but with the M128-28 CPU, I changed HPL.dat to use:

N: 100000
NB: 256
P: 8
Q: 16
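
For anyone following along, those values land in the stock HPL.dat template roughly like this (only the relevant lines are shown; the rest of the file stays at the defaults that produce the PFACT/RFACT/BCAST settings in the log below):

1            # of problems sizes (N)
100000       Ns
1            # of NBs
256          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
8            Ps
16           Qs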

That resulted in 1118.5 Gflops, nearly identical to the 96-core result. I installed lm-sensors and ran watch sensors, and the SoC temp never rose above 65-67°C. CPU power hovered around 125-130W, and never went any higher. Checking cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq, almost all cores were locked at 2800000, but maybe 10-15 would go up and down.
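
If anyone wants to repeat that frequency check, a one-liner that groups the per-core values (in kHz) while HPL runs is something like:

# identical frequencies get collapsed, so the handful of cores bouncing around stands out
watch -n 2 'cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq | sort -n | uniq -c'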

Here's the full HPL result. I also tried 105000 for N, with almost identical results.

root@ampere-ubuntu:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun --allow-run-as-root -np 128 --bind-to core --map-by core ./xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  100000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      16 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      100000   256     8    16             596.08             1.1185e+03
HPL_pdgesv() start time Tue Sep 12 15:31:28 2023

HPL_pdgesv() end time   Tue Sep 12 15:41:24 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.62873305e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Is there something I'm missing? Do I need to change anything else in the BIOS or on the COM-HPC carrier from ADLINK to unlock the additional performance/power? I believe the chip should go up to 170W, or maybe even a little more, at full blast...
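
One thing worth double-checking on the OS side is what the kernel thinks the frequency limits are; it won't show a firmware power cap, but it would rule out a governor problem. On Ubuntu that's roughly:

# cpupower is provided by the linux-tools packages
sudo apt install -y linux-tools-common linux-tools-$(uname -r)
cpupower frequency-info   # reports the driver, governor, and hardware min/max frequency limits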

@geerlingguy
Contributor Author

geerlingguy commented Sep 15, 2023

Using RAM provided by Ampere (thanks!) I've upgraded the system to 384 GB of RAM:

jgeerling@ampere-ubuntu:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           376Gi        85Gi       289Gi        35Mi       1.7Gi       288Gi
Swap:          8.0Gi          0B       8.0Gi

For tinymembench results, see: geerlingguy/sbc-reviews#19 (comment)

And running again:

root@ampere-ubuntu:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun --allow-run-as-root -np 128 --bind-to core --map-by core ./xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  200000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      16 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      200000   256     8    16            4214.60             1.2655e+03
HPL_pdgesv() start time Fri Sep 15 18:56:44 2023

HPL_pdgesv() end time   Fri Sep 15 20:06:59 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.13196075e-02 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

I still think something is bottlenecking the CPU. The maximum SoC temperature was around 70°C, and max CPU power was around 144W. It should have more headroom, so I'm wondering if something on the system or in the BIOS is limiting power.

@geerlingguy
Contributor Author

Running the STREAM benchmark:

$ wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
$ gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=41943040 -DNTIMES=100 stream.c -o stream
$ OMP_NUM_THREADS=32 GOMP_CPU_AFFINITY=0-31 ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 41943040 (elements), Offset = 0 (elements)
Memory per array = 320.0 MiB (= 0.3 GiB).
Total memory required = 960.0 MiB (= 0.9 GiB).
Each kernel will be executed 100 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 32
Number of Threads counted = 32
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 5540 microseconds.
   (= 5540 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          127508.5     0.005291     0.005263     0.005597
Scale:         127508.5     0.005307     0.005263     0.007091
Add:           126908.7     0.007971     0.007932     0.008237
Triad:         128707.6     0.007846     0.007821     0.007927
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
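
For comparison, it would be easy to re-run the same binary across all 128 cores (just the obvious variation of the command above; I haven't re-tuned the array size for it):

# same build, spread across every core, to see whether bandwidth scales past 32 threads
OMP_NUM_THREADS=128 GOMP_CPU_AFFINITY=0-127 ./stream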

And NUMA configuration is set to Monolithic:

$ lscpu | grep NUMA
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-127

@naren-ampere

naren-ampere commented Sep 22, 2023

Given your STREAM results and the fact that this is a six-memory-channel platform, you might be hitting the limits of memory bandwidth, Jeff. If that is the case, the extra cores will not help, since both processors use DDR4-3200.

A quick way to find out what your memory bandwidth usage is while running HPL is to use the PMUs. The command below counts the number of memory requests every second; multiply that by 64 (the cache line size) to get bytes per second.

perf stat -e hnf_mc_reqs -I 1000
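
Something like the following (an untested sketch; note that perf's interval output goes to stderr) will do that multiplication for you and print an approximate GB/s figure every second:

# strip the thousands separators, then requests/sec x 64 bytes -> GB/s
sudo perf stat -e hnf_mc_reqs -I 1000 2>&1 | \
  awk '/hnf_mc_reqs/ {gsub(",", "", $2); printf "%.1f GB/s\n", $2 * 64 / 1e9}'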

@geerlingguy
Contributor Author

geerlingguy commented Sep 22, 2023

@naren-ampere - When I started the HPL run, the hnf_mc_reqs count quickly went from around 300,000 to over 2.6 billion per second, and seemed to top out around there:

jgeerling@ampere-ubuntu:~$ sudo perf stat -e hnf_mc_reqs -I 1000
#           time             counts unit events
...
    40.042555006            327,556      hnf_mc_reqs                                                 
    41.043624682            312,877      hnf_mc_reqs                                                 
    42.043772997         19,648,303      hnf_mc_reqs                                                 
    43.044881473        893,596,168      hnf_mc_reqs                                                 
    44.046009390      1,182,953,794      hnf_mc_reqs                                                 
    45.047135987        935,660,943      hnf_mc_reqs                                                 
    46.048284105      2,186,454,576      hnf_mc_reqs                                                 
    47.049566783      2,718,404,089      hnf_mc_reqs                                                 
    48.050036380      2,715,929,484      hnf_mc_reqs                                                 
    49.051314219      2,697,165,478      hnf_mc_reqs                                                 
    50.052606018      2,345,867,999      hnf_mc_reqs 
...
   303.331025958      2,536,833,920      hnf_mc_reqs                                                 
   304.332337697      2,555,866,594      hnf_mc_reqs                                                 
   305.333625356      2,606,566,658      hnf_mc_reqs                                                 
   306.334039495      2,187,907,619      hnf_mc_reqs                                                 
   307.335285155      2,609,550,294      hnf_mc_reqs                                                 
   308.336543614      2,560,355,144      hnf_mc_reqs  

2,718,404,089 * 64 = 173,977,861,696 bytes/sec? (Is that correct?)
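
Double-checking that arithmetic from the shell (assuming bc is installed):

$ echo '2718404089 * 64' | bc                    # peak requests/sec times the 64-byte cache line
173977861696
$ echo 'scale=3; 2718404089 * 64 / 10^9' | bc    # the same figure in decimal GB/s
173.977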

And as the test continued, the numbers hovered between 2.0-2.4 billion requests per second (down from the max of around 2.7 billion).

While monitoring with perf, it seems to lock up after 5-10 minutes :)

Looking at Anandtech's article, it seems the 128-core CPU can max out around 175 GB/s at lower thread counts, but dips down to the 140s-150s as you hit 120+ threads. 150-160 GB/sec seems to be in that range at least? (Or am I interpreting it incorrectly?)

@naren-ampere

@geerlingguy, yes, the math is correct. I'm impressed you're getting 174 GB/sec with your config.
If these numbers are on the M96-28, then you are indeed memory bandwidth-bound and adding more cores won't help.

@geerlingguy
Contributor Author

Those numbers came from the M128-28 part. I haven't tested with perf on the M96-28 yet; I'm currently a bit space-constrained, but I'm going to try to set up my Dev Kit with the M96-28 CPU so I can run both for comparison without doing a full CPU swap each time :)

@geerlingguy
Contributor Author

I'm also going to test with a Q64-22 to see how things scale (I'm presuming it will not be memory-bound). See: geerlingguy/top500-benchmark#19

@geerlingguy
Contributor Author

That CPU has similar efficiency and scores about half of what the 128-core part does, so I'm going to count this as a win and say memory is the bottleneck for reaching the maximum possible scores. Not that the six-channel board is a slouch; it's just that we would need a server-grade motherboard to go any higher.

@ls-sloan

> Using RAM provided by Ampere (thanks!) I've upgraded the system to 384 GB of RAM:

Hi Jeff! Great post and video on YT. Would you mind sharing the model number of the RAM that you used?
Thanks!

@geerlingguy
Contributor Author

Samsung DDR4-3200 ECC RAM - specifically the M393A2K40DB3-CWE 16GB 1Rx4 PC4-25600 DDR4-3200AA Memory Module w60.

I have about 12 of these sticks now, as I've ordered a few sets for testing.
