Skip to content

Matrix Multiplication benchmark analysis #356

@certik

Description

@certik

See #355 for the background. I am now applying the same methodology to the matrix multiplication benchmark:

I modified the kernel to not zero C, to simplify things a bit:

julia> N = 64
64

julia> A = rand(N,N); B = rand(N,N); C = rand(N,N);

julia> function A_mul_B2!(C, A, B)
           @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
               for k ∈ indices((A,B), (2,1))
                   C[m,n] += A[m,k] * B[k,n]
               end
           end
       end
A_mul_B2! (generic function with 1 method)

julia> @benchmark A_mul_B2!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.708 μs …  47.333 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.250 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.294 μs ± 717.909 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▂  ▁                        ▂  ▆   █  ▆  ▄  ▃  ▁       ▂
  ▃▁▁█▁▁█▁▁█▁▁▇▁▁▅▁▁▁▄▁▁▄▁▁▃▁▁▁▁▁▁▁▁█▁▁█▁▁▁█▁▁█▁▁█▁▁█▁▁█▁▁▆▁▁▄ █
  23.7 μs       Histogram: log(frequency) by time      24.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> cpu_freq = 3.2e9 # Ghz
3.2e9

julia> 23.708e-6 * cpu_freq / N^3
0.289404296875

In C++ I tested two kernels:

void gemm_mnk(double* C, double* A, double* B, long M, long K, long N){
  for (long m = 0; m < M; m++){
    for (long n = 0; n < N; n++){
      for (long k = 0; k < K; k++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
  return;
}

void gemm_nkm(double* C, double* A, double* B, long M, long K, long N){
   for (long n = 0; n < N; n++){
    for (long k = 0; k < K; k++){
      for (long m = 0; m < M; m++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
  return;
}

For N=64, I am getting:

gemm_mnk: 1.78319
gemm_nkm: 0.21441

Aren't these two kernels equivalent? They seem to operate on the same matrix layout and return exactly the same answer, they just differ in loop ordering.

The inner loop from gemm_nkm as generated by Clang seems to be:

LBB7_5:                                 ;   Parent Loop BB7_3 Depth=1
                                        ;     Parent Loop BB7_4 Depth=2
                                        ; =>    This Inner Loop Header: Depth=3
	ldr	d0, [x13, x14]
	ldp	q1, q2, [x4, #-256]
	fmla.2d	v10, v1, v0[0]
	fmla.2d	v9, v2, v0[0]
	ldp	q1, q2, [x4, #-224]
	fmla.2d	v8, v1, v0[0]
	fmla.2d	v31, v2, v0[0]
	ldp	q1, q2, [x4, #-192]
	fmla.2d	v30, v1, v0[0]
	fmla.2d	v29, v2, v0[0]
	ldp	q1, q2, [x4, #-160]
	fmla.2d	v28, v1, v0[0]
	fmla.2d	v27, v2, v0[0]
	ldp	q1, q2, [x4, #-128]
	fmla.2d	v26, v1, v0[0]
	fmla.2d	v25, v2, v0[0]
	ldp	q1, q2, [x4, #-96]
	fmla.2d	v24, v1, v0[0]
	fmla.2d	v23, v2, v0[0]
	ldp	q1, q2, [x4, #-64]
	fmla.2d	v22, v1, v0[0]
	fmla.2d	v21, v2, v0[0]
	ldp	q1, q2, [x4, #-32]
	fmla.2d	v20, v1, v0[0]
	fmla.2d	v19, v2, v0[0]
	ldp	q1, q2, [x4]
	fmla.2d	v18, v1, v0[0]
	fmla.2d	v17, v2, v0[0]
	ldp	q1, q2, [x4, #32]
	fmla.2d	v16, v1, v0[0]
	fmla.2d	v7, v2, v0[0]
	ldp	q1, q2, [x4, #64]
	fmla.2d	v6, v1, v0[0]
	fmla.2d	v5, v2, v0[0]
	ldp	q1, q2, [x4, #96]
	fmla.2d	v4, v1, v0[0]
	ldp	q1, q3, [sp, #96]               ; 32-byte Folded Reload
	fmla.2d	v1, v2, v0[0]
	str	q1, [sp, #96]                   ; 16-byte Folded Spill
	ldp	q1, q2, [x4, #128]
	fmla.2d	v3, v1, v0[0]
	ldr	q1, [sp, #128]                  ; 16-byte Folded Reload
	fmla.2d	v1, v2, v0[0]
	stp	q3, q1, [sp, #112]              ; 32-byte Folded Spill
	ldp	q1, q2, [x4, #160]
	fmla.2d	v12, v1, v0[0]
	fmla.2d	v11, v2, v0[0]
	ldp	q1, q2, [x4, #192]
	fmla.2d	v13, v1, v0[0]
	fmla.2d	v15, v2, v0[0]
	ldp	q1, q2, [x4, #224]
	fmla.2d	v14, v1, v0[0]
	ldr	q1, [sp, #144]                  ; 16-byte Folded Reload
	fmla.2d	v1, v2, v0[0]
	str	q1, [sp, #144]                  ; 16-byte Folded Spill
	add	x14, x14, #8                    ; =8
	add	x4, x4, #512                    ; =512
	cmp	x14, #512                       ; =512
	b.ne	LBB7_5

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions