Having trouble replicating results from README on 96-core CPU #10

Closed
geerlingguy opened this issue Aug 6, 2023 · 14 comments
@geerlingguy
Contributor

I have just re-created the test bench scenario using a 96-core Ampere Altra Dev Workstation with 96 GB of RAM, running Ubuntu 20.04 server aarch64, with the following kernel:

root@ampere:/opt/hpl-2.3/bin/Altramax_oracleblis# uname -a
Linux ampere 5.4.0-155-generic #172-Ubuntu SMP Fri Jul 7 16:13:58 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

I have now replicated this setup with two clean installs (even going so far as removing all my NVMe drives, reformatting them, and re-installing Ubuntu 20.04 aarch64 twice for a completely fresh system).

Both times, following the explicit instructions in this repo, I am getting around 980-1,000 Gflops (see also geerlingguy/sbc-reviews#19).

My most recent run today, on a new fresh install:

root@ampere:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun -np 96 --allow-run-as-root --bind-to core --map-by core ./xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  105000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      12 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      105000   256     8    12             774.54             9.9642e+02
HPL_pdgesv() start time Sun Aug  6 22:00:59 2023

HPL_pdgesv() end time   Sun Aug  6 22:13:54 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.00850780e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

And the contents of the HPL.dat file:

root@ampere:/opt/hpl-2.3/bin/Altramax_oracleblis# cat HPL.dat 
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
105000       Ns
1            # of NBs
256          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
8            Ps
12           Qs
16.0         threshold
1            # of panel fact
2        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
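
As a quick sanity check (just an illustrative snippet, not something from this repo's instructions): the P x Q process grid in HPL.dat has to match the rank count passed to mpirun, and 8 x 12 = 96 does line up with the -np 96 invocation above:

P=8; Q=12; NP=96   # values taken from HPL.dat and the mpirun command above
if [ $((P * Q)) -eq "$NP" ]; then
    echo "Process grid OK: ${P} x ${Q} = ${NP} MPI ranks"
else
    echo "Mismatch: ${P} x ${Q} != ${NP} ranks; adjust Ps/Qs in HPL.dat or -np"
fi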

According to the README, I should be getting over 1.2 Tflops using this same configuration.

Can you help me figure out what might be different between my test workstation setup and the one used to generate these results?

@ii-BOY

ii-BOY commented Aug 16, 2023

Hi Jeff,
I have a question about the HPL.dat settings: you set Ns to 105000, but you also mentioned the system has 96 GB of RAM, which confuses me a bit. I found this website, which takes your system info and generates HPL.dat parameters you can copy and paste:
https://www.advancedclustering.com/act_kb/tune-hpl-dat-file/
Going by your HPL.dat configuration, the actual system memory would need to be at least 100-110 GB, or even more...
I am not sure about that; can you give me some advice?
Thanks
BR
ii-BOY

@geerlingguy
Contributor Author

@ii-BOY - It's slightly more complex than that: Ns is not 1:1 correlated with memory size, and finding the right parameters to make HPL use as much of your RAM as possible (but not too much) is mostly a matter of trial and error.

As this project's README states, with Ns at 105000, the RAM usage is around 91 GB, which is about ideal for a 96 GB RAM system, assuming it's only running the benchmark.
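
For a rough back-of-the-envelope version of that (illustrative only; actual usage runs higher than the bare matrix because of workspace and MPI buffers, and the 85% target below is just a common rule of thumb, not this repo's recommendation):

# Matrix alone is ~8 * N^2 bytes of doubles
awk -v n=105000 'BEGIN { printf "N=%d -> ~%.1f GB for the matrix alone\n", n, n*n*8/1e9 }'
# N=105000 -> ~88.2 GB (the ~91 GB observed includes overhead)

# Invert it: pick N targeting ~85% of 96 GB, rounded down to a multiple of NB=256
awk -v ram_gb=96 -v frac=0.85 -v nb=256 'BEGIN {
    n = int(sqrt(ram_gb * 1e9 * frac / 8) / nb) * nb
    printf "Suggested starting N: %d\n", n
}'
# Suggested starting N: 100864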

@geerlingguy
Contributor Author

Seeing that one of the Ampere devs who is benchmarking the same system (and whose numbers are used in the README) has gotten different results, we compared everything about our systems and determined the only real difference is the memory.

I am currently running Transcend ECC RAM, and he is running Samsung ECC RAM of the same spec. You wouldn't think different vendors' RAM would cause a 20% performance difference (they are both similar down to the CL22 CAS latency...), but stranger things have happened.

So I've ordered six sticks of 16 GB Samsung M393A2K40DB3-CWE DDR4-3200 ECC RAM, and they should come in a day or two... then I'll re-run my tests and see if they're any faster with Samsung RAM.

@geerlingguy
Contributor Author

geerlingguy commented Sep 7, 2023

For a point of reference, I even tried forcing 3200 (instead of 'Auto') for the memory speed in the BIOS and got the same result (+/- 1%). Here are the current memory speed results from tinymembench:

tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :   9424.0 MB/s
 C copy backwards (32 byte blocks)                    :   9387.8 MB/s
 C copy backwards (64 byte blocks)                    :   9390.8 MB/s
 C copy                                               :   9366.1 MB/s
 C copy prefetched (32 bytes step)                    :   9984.4 MB/s
 C copy prefetched (64 bytes step)                    :   9984.1 MB/s
 C 2-pass copy                                        :   6391.4 MB/s
 C 2-pass copy prefetched (32 bytes step)             :   7237.8 MB/s
 C 2-pass copy prefetched (64 bytes step)             :   7489.6 MB/s
 C fill                                               :  43884.4 MB/s
 C fill (shuffle within 16 byte blocks)               :  43885.4 MB/s
 C fill (shuffle within 32 byte blocks)               :  43884.2 MB/s
 C fill (shuffle within 64 byte blocks)               :  43877.5 MB/s
 NEON 64x2 COPY                                       :   9961.9 MB/s
 NEON 64x2x4 COPY                                     :  10091.6 MB/s
 NEON 64x1x4_x2 COPY                                  :   8171.5 MB/s
 NEON 64x2 COPY prefetch x2                           :  11822.9 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  12123.8 MB/s
 NEON 64x2 COPY prefetch x1                           :  11836.5 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  12122.3 MB/s
 ---
 standard memcpy                                      :   9894.0 MB/s
 standard memset                                      :  44745.2 MB/s
 ---
 NEON LDP/STP copy                                    :   9958.0 MB/s
 NEON LDP/STP copy pldl2strm (32 bytes step)          :  11415.6 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)          :  11420.5 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :  11475.2 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :  11452.9 MB/s
 NEON LD1/ST1 copy                                    :  10094.8 MB/s
 NEON STP fill                                        :  44744.7 MB/s
 NEON STNP fill                                       :  44745.2 MB/s
 ARM LDP/STP copy                                     :  10136.4 MB/s
 ARM STP fill                                         :  44731.7 MB/s
 ARM STNP fill                                        :  44730.0 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.3 ns          /     1.8 ns 
    262144 :    2.3 ns          /     2.9 ns 
    524288 :    3.2 ns          /     3.9 ns 
   1048576 :    3.6 ns          /     4.2 ns 
   2097152 :   22.9 ns          /    33.0 ns 
   4194304 :   32.6 ns          /    40.9 ns 
   8388608 :   38.1 ns          /    43.5 ns 
  16777216 :   43.2 ns          /    48.6 ns 
  33554432 :   86.2 ns          /   112.2 ns 
  67108864 :  109.3 ns          /   135.2 ns 

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.3 ns          /     1.8 ns 
    262144 :    1.9 ns          /     2.3 ns 
    524288 :    2.2 ns          /     2.5 ns 
   1048576 :    2.6 ns          /     2.8 ns 
   2097152 :   21.6 ns          /    31.6 ns 
   4194304 :   31.1 ns          /    39.4 ns 
   8388608 :   35.8 ns          /    41.7 ns 
  16777216 :   38.5 ns          /    43.0 ns 
  33554432 :   79.9 ns          /   104.9 ns 
  67108864 :  101.1 ns          /   125.4 ns 

Run with:

git clone https://github.com/rojaster/tinymembench.git && cd tinymembench && make
./tinymembench

@rbapat-ampere
Collaborator

@geerlingguy I ran tinymembench on my machine, and here are the comparative results.
I am attaching the memory bandwidth test results in this comment and the latency results in the next comment.

Test   JG run (MB/s)   RB run (MB/s)
C copy backwards 9424 12800
C copy backwards (32 byte blocks) 9387.8 12822.2
C copy backwards (64 byte blocks) 9390.8 12831.5
C copy 9366.1 12852.6
C copy prefetched (32 bytes step) 9984.4 13667.5
C copy prefetched (64 bytes step) 9984.1 13659.3
C 2-pass copy 6391.4 8234.1
C 2-pass copy prefetched (32 bytes step) 7237.8 10070.3
C 2-pass copy prefetched (64 bytes step) 7489.6 10563.4
NEON 64x2 COPY 9961.9 13638.6
NEON 64x2x4 COPY 10091.6 13725.5
NEON 64x1x4_x2 COPY 8171.5 10066.8
NEON 64x2 COPY prefetch x2 11822.9 15860.6
NEON 64x2x4 COPY prefetch x1 12123.8 16100.7
NEON 64x2x4 COPY prefetch x1 12122.3 16105
NEON 64x2 COPY prefetch x1 11836.5 15872.7
standard memcpy 9894 13527.2
NEON LDP/STP copy 9958 13628.1
NEON LDP/STP copy pldl2strm (32 bytes step) 11415.6 15147.7
NEON LDP/STP copy pldl2strm (64 bytes step) 11420.5 15257.2
NEON LDP/STP copy pldl1keep (32 bytes step) 11475.2 15448.9
NEON LDP/STP copy pldl1keep (64 bytes step) 11452.9 15423.7
NEON LD1/ST1 copy 10094.8 13753.5
ARM LDP/STP copy 10136.4 13765.1
C fill 43884.4 43888.9
C fill (shuffle within 16 byte blocks) 43885.4 43891.8
C fill (shuffle within 32 byte blocks) 43884.2 43888.6
C fill (shuffle within 64 byte blocks) 43877.5 43875.3
standard memset 44745.2 44758
NEON STP fill 44744.7 44755.1
NEON STNP fill 44745.2 44749.3
ARM STP fill 44731.7 44723.3
ARM STNP fill 44730 44705.9

Except for the last 9 tests, my machine seems to be outperforming yours by at least 20%. I'm also attaching a graphical representation of the same.

[Graph: memory bandwidth comparison, JG vs. RB]
Note: The last 9 tests have not been mapped in the graph since they are within acceptable ranges of each other.

@rbapat-ampere
Collaborator

rbapat-ampere commented Sep 8, 2023

@geerlingguy The next part of the test was the memory latency test. The results are below.
Run 1 with MADV_NOHUGEPAGE

block size (bytes)   JG single (ns)   JG dual (ns)   RB single (ns)   RB dual (ns)
1024 0 0 0 0
2048 0 0 0 0
4096 0 0 0 0
8192 0 0 0 0
16384 0 0 0 0
32768 0 0 0 0
65536 0 0 0 0
131072 1.3 1.8 1.3 1.8
262144 2.3 2.9 2.4 3
524288 3.2 3.9 3.4 3.9
1048576 3.6 4.2 4.1 4.5
2097152 22.9 33 17.8 24.8
4194304 32.6 40.9 25 30.5
8388608 38.1 43.5 30.1 35
16777216 43.2 48.6 37 45.7
33554432 86.2 112.2 71.2 93.7
67108864 109.3 135.2 91.4 112.1

Run 2 with MADV_HUGEPAGE

block size (bytes)   JG single (ns)   JG dual (ns)   RB single (ns)   RB dual (ns)
1024 0 0 0 0
2048 0 0 0 0
4096 0 0 0 0
8192 0 0 0 0
16384 0 0 0 0
32768 0 0 0 0
65536 0 0 0 0
131072 1.3 1.8 1.3 1.8
262144 1.9 2.3 1.9 2.4
524288 2.2 2.5 2.3 2.5
1048576 2.6 2.8 2.6 2.8
2097152 21.6 31.6 16.2 23.1
4194304 31.1 39.4 23.3 28.7
8388608 35.8 41.7 26.7 30.4
16777216 38.5 43 28.3 31.5
33554432 79.9 104.9 64.9 85.9
67108864 101.1 125.4 83.7 102.9

I mapped one of the runs into a graph, as seen below:
[Graph: memory latency comparison, JG vs. RB]

As with the other latency benchmarks, we're good when comparing L1 and L2 cache; the differences start popping up as we move from L2 cache out to system memory. Once again, my results are ~20% faster.
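
For reference, here is roughly how that ~20% figure shakes out at the 64 MiB block size (quick illustrative arithmetic using the MADV_NOHUGEPAGE single-read numbers above, nothing more):

awk 'BEGIN {
    jg = 109.3; rb = 91.4    # ns, single random read at 64 MiB, MADV_NOHUGEPAGE
    printf "JG latency is %.1f%% higher (RB latency is %.1f%% lower)\n", (jg - rb) / rb * 100, (jg - rb) / jg * 100
}'
# JG latency is 19.6% higher (RB latency is 16.4% lower)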

@geerlingguy
Contributor Author

geerlingguy commented Sep 8, 2023

[Photo: ram-samsung-transcend-detail (Samsung vs. Transcend RAM sticks)]

Wow, what a difference the memory seems to make!

I got 2 of the 6 new RAM sticks just now. Running HPL with N=50000, I see:

  • Old Transcend RAM (2x16 GB): 279.22 Gflops
  • New Samsung RAM (2x16 GB): 369.05 Gflops

Encouraging early result! The rest of the RAM is coming Monday...

And here are the new tinymembench results (NOTE: these are just for the 2x16 GB sticks; performance will differ once all the memory channels are filled...):

tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :  10261.8 MB/s
 C copy backwards (32 byte blocks)                    :  10233.9 MB/s
 C copy backwards (64 byte blocks)                    :  10238.1 MB/s
 C copy                                               :  10277.0 MB/s
 C copy prefetched (32 bytes step)                    :  10403.7 MB/s
 C copy prefetched (64 bytes step)                    :  10407.1 MB/s
 C 2-pass copy                                        :   7065.6 MB/s
 C 2-pass copy prefetched (32 bytes step)             :   8825.9 MB/s
 C 2-pass copy prefetched (64 bytes step)             :   9179.0 MB/s
 C fill                                               :  42770.6 MB/s (1.1%)
 C fill (shuffle within 16 byte blocks)               :  42675.3 MB/s
 C fill (shuffle within 32 byte blocks)               :  42755.8 MB/s (0.2%)
 C fill (shuffle within 64 byte blocks)               :  42587.5 MB/s
 NEON 64x2 COPY                                       :  10633.4 MB/s
 NEON 64x2x4 COPY                                     :  10679.9 MB/s
 NEON 64x1x4_x2 COPY                                  :   6380.2 MB/s (0.1%)
 NEON 64x2 COPY prefetch x2                           :  12576.1 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  12767.1 MB/s
 NEON 64x2 COPY prefetch x1                           :  12462.2 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  12763.3 MB/s
 ---
 standard memcpy                                      :  10582.3 MB/s
 standard memset                                      :  42988.5 MB/s (1.3%)
 ---
 NEON LDP/STP copy                                    :  10645.9 MB/s
 NEON LDP/STP copy pldl2strm (32 bytes step)          :  11909.5 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)          :  11902.6 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :  11816.3 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :  11818.2 MB/s
 NEON LD1/ST1 copy                                    :  10690.8 MB/s
 NEON STP fill                                        :  43059.6 MB/s (1.2%)
 NEON STNP fill                                       :  43150.2 MB/s (0.3%)
 ARM LDP/STP copy                                     :  10711.8 MB/s
 ARM STP fill                                         :  43011.2 MB/s (1.1%)
 ARM STNP fill                                        :  43117.3 MB/s (0.2%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.3 ns          /     1.8 ns 
    262144 :    2.4 ns          /     2.9 ns 
    524288 :    3.4 ns          /     3.9 ns 
   1048576 :    7.7 ns          /    11.3 ns 
   2097152 :   20.5 ns          /    29.5 ns 
   4194304 :   28.9 ns          /    36.7 ns 
   8388608 :   35.7 ns          /    41.7 ns 
  16777216 :   45.2 ns          /    55.4 ns 
  33554432 :   74.5 ns          /    95.5 ns 
  67108864 :   89.0 ns          /   107.1 ns 

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.3 ns          /     1.8 ns 
    262144 :    1.9 ns          /     2.3 ns 
    524288 :    2.3 ns          /     2.5 ns 
   1048576 :    2.6 ns          /     2.8 ns 
   2097152 :   19.1 ns          /    27.8 ns 
   4194304 :   27.6 ns          /    35.0 ns 
   8388608 :   31.4 ns          /    37.3 ns 
  16777216 :   33.6 ns          /    38.6 ns 
  33554432 :   67.7 ns          /    87.8 ns 
  67108864 :   80.6 ns          /    97.5 ns 

memcpy goes from 9894.0 to 10582.3 MB/s, a 7% difference (again, with 2 sticks vs. 6), while HPL goes from 279 to 369 Gflops, a roughly 32% improvement! Latency is vastly improved over the Transcend RAM as well.
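Quick math on those two deltas, just to put numbers on the comparison:

awk 'BEGIN { printf "memcpy: +%.1f%%   HPL: +%.1f%%\n", (10582.3 - 9894.0) / 9894.0 * 100, (369.05 - 279.22) / 279.22 * 100 }'
# memcpy: +7.0%   HPL: +32.2%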

Can't wait for the other sticks to arrive. I will finally pass the 'teraflop on a CPU' barrier :)

@geerlingguy
Contributor Author

geerlingguy commented Sep 8, 2023

I have a Twitter (X?) thread going about the memory differences. I'm also going to see if I can look up timing data in Linux via decode-dimms (CPU-Z under Windows on Arm isn't showing timing data).

@geerlingguy
Contributor Author

Hmm...

$ sudo apt install -y i2c-tools
$ sudo modprobe eeprom
$ decode-dimms
# decode-dimms version 4.3

Memory Serial Presence Detect Decoder
By Philip Edelbrock, Christian Zuckschwerdt, Burkart Lingner,
Jean Delvare, Trent Piepho and others


Number of SDRAM DIMMs detected and decoded: 0

@ii-BOY

ii-BOY commented Sep 11, 2023

(quoting @rbapat-ampere's memory latency comment above)

Hi Jeff,
I'm not sure what Run 1 and Run 2 mean. When I ran tinymembench, I saw [MADV_NOHUGEPAGE] for the first pass and then [MADV_HUGEPAGE], so did you run tinymembench twice with the same configuration, or did you run it once and get two results (NOHUGEPAGE and HUGEPAGE)?
Thanks
BR
ii-BOY

@rbapat-ampere
Collaborator

@ii-BOY - Hi, this test was run just once. Internally, tinymembench runs the latency test twice: once with THP disabled (MADV_NOHUGEPAGE) and once with THP enabled (MADV_HUGEPAGE).
You can find more information here: https://man7.org/linux/man-pages/man2/madvise.2.html

Thanks.
Note: Thanks for pointing out the omission in the descriptions for the Run 1 and Run 2 tables that were posted. I've edited them to reflect MADV_NOHUGEPAGE and MADV_HUGEPAGE respectively.

@geerlingguy
Contributor Author

tinymembench run with all six sticks (96 GB total) of Samsung RAM:

tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :  11416.8 MB/s
 C copy backwards (32 byte blocks)                    :  11374.5 MB/s
 C copy backwards (64 byte blocks)                    :  11380.7 MB/s
 C copy                                               :  11486.6 MB/s
 C copy prefetched (32 bytes step)                    :  12074.4 MB/s
 C copy prefetched (64 bytes step)                    :  12072.5 MB/s
 C 2-pass copy                                        :   7456.1 MB/s
 C 2-pass copy prefetched (32 bytes step)             :   8489.5 MB/s
 C 2-pass copy prefetched (64 bytes step)             :   8901.7 MB/s
 C fill                                               :  43888.0 MB/s
 C fill (shuffle within 16 byte blocks)               :  43888.0 MB/s
 C fill (shuffle within 32 byte blocks)               :  43888.3 MB/s
 C fill (shuffle within 64 byte blocks)               :  43882.9 MB/s
 NEON 64x2 COPY                                       :  12176.6 MB/s
 NEON 64x2x4 COPY                                     :  12229.0 MB/s
 NEON 64x1x4_x2 COPY                                  :  10022.1 MB/s
 NEON 64x2 COPY prefetch x2                           :  13542.4 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  13902.0 MB/s
 NEON 64x2 COPY prefetch x1                           :  13579.6 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  13903.2 MB/s
 ---
 standard memcpy                                      :  12107.0 MB/s
 standard memset                                      :  44746.4 MB/s
 ---
 NEON LDP/STP copy                                    :  12186.3 MB/s
 NEON LDP/STP copy pldl2strm (32 bytes step)          :  13778.2 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)          :  13785.9 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)          :  13847.4 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :  13825.8 MB/s
 NEON LD1/ST1 copy                                    :  12242.3 MB/s
 NEON STP fill                                        :  44745.9 MB/s
 NEON STNP fill                                       :  44747.5 MB/s
 ARM LDP/STP copy                                     :  12298.1 MB/s
 ARM STP fill                                         :  44730.0 MB/s
 ARM STNP fill                                        :  44730.8 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.3 ns          /     1.8 ns 
    262144 :    2.4 ns          /     2.9 ns 
    524288 :    3.4 ns          /     3.9 ns 
   1048576 :    4.1 ns          /     4.4 ns 
   2097152 :   23.2 ns          /    33.2 ns 
   4194304 :   32.7 ns          /    41.1 ns 
   8388608 :   39.7 ns          /    46.2 ns 
  16777216 :   47.7 ns          /    51.0 ns 
  33554432 :   81.6 ns          /   103.5 ns 
  67108864 :  102.1 ns          /   122.2 ns 

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.3 ns          /     1.8 ns 
    262144 :    1.9 ns          /     2.3 ns 
    524288 :    2.3 ns          /     2.5 ns 
   1048576 :    2.6 ns          /     2.8 ns 
   2097152 :   21.6 ns          /    31.6 ns 
   4194304 :   31.4 ns          /    39.4 ns 
   8388608 :   36.2 ns          /    41.7 ns 
  16777216 :   38.5 ns          /    43.0 ns 
  33554432 :   74.8 ns          /    95.7 ns 
  67108864 :   93.6 ns          /   112.0 ns 

@geerlingguy
Contributor Author

New result:

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  105000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      12 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      105000   256     8    12             649.46             1.1883e+03
HPL_pdgesv() start time Mon Sep 11 20:21:22 2023

HPL_pdgesv() end time   Mon Sep 11 20:32:11 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.00850780e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

1188.3 Gflops at 296W = 4.01 Gflops/W

@geerlingguy
Contributor Author

It seems like my Samsung RAM still performs just under whatever RAM @rbapat-ampere is using in his system, so that explains the delta!

I think this issue can be closed, as we've found the culprit.
