See #355 for the background. I am now applying the same methodology to the matrix multiplication benchmark:
I modified the kernel so that it does not zero C, to simplify things a bit:
julia> N = 64
64
julia> A = rand(N,N); B = rand(N,N); C = rand(N,N);
julia> function A_mul_B2!(C, A, B)
           @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
               for k ∈ indices((A,B), (2,1))
                   C[m,n] += A[m,k] * B[k,n]
               end
           end
       end
A_mul_B2! (generic function with 1 method)
julia> @benchmark A_mul_B2!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 23.708 μs … 47.333 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 24.250 μs ┊ GC (median): 0.00%
Time (mean ± σ): 24.294 μs ± 717.909 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▁ ▂ ▆ █ ▆ ▄ ▃ ▁ ▂
▃▁▁█▁▁█▁▁█▁▁▇▁▁▅▁▁▁▄▁▁▄▁▁▃▁▁▁▁▁▁▁▁█▁▁█▁▁▁█▁▁█▁▁█▁▁█▁▁█▁▁▆▁▁▄ █
23.7 μs Histogram: log(frequency) by time 24.5 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> cpu_freq = 3.2e9 # Hz (3.2 GHz)
3.2e9
julia> 23.708e-6 * cpu_freq / N^3
0.289404296875
In C++ I tested two kernels:
void gemm_mnk(double* C, double* A, double* B, long M, long K, long N){
    for (long m = 0; m < M; m++){
        for (long n = 0; n < N; n++){
            for (long k = 0; k < K; k++){
                C[m + n*M] += A[m + k*M] * B[k + n*K];
            }
        }
    }
    return;
}
void gemm_nkm(double* C, double* A, double* B, long M, long K, long N){
    for (long n = 0; n < N; n++){
        for (long k = 0; k < K; k++){
            for (long m = 0; m < M; m++){
                C[m + n*M] += A[m + k*M] * B[k + n*K];
            }
        }
    }
    return;
}
For N=64, I am getting:
gemm_mnk: 1.78319
gemm_nkm: 0.21441
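For reference, a minimal harness along the following lines can produce this kind of measurement. This is only a sketch, not necessarily the exact harness used: it assumes the two kernels above are in the same translation unit, the same 3.2 GHz clock as in the Julia calculation, and that the reported numbers are time * cpu_freq / N^3 (cycles per multiply-add), matching the 0.289 figure above.
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Sketch of a timing harness (assumed, not from the issue): reports
// time * cpu_freq / N^3, the same metric as the Julia calculation above.
template <typename Kernel>
double cycles_per_fma(Kernel kernel, long N, double cpu_freq = 3.2e9) {
    std::vector<double> A(N*N), B(N*N), C(N*N);
    std::mt19937_64 rng(0);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (auto& x : A) x = dist(rng);
    for (auto& x : B) x = dist(rng);
    for (auto& x : C) x = dist(rng);

    const int reps = 10000;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++)
        kernel(C.data(), A.data(), B.data(), N, N, N);  // square case: M = K = N
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count() / reps;
    return seconds * cpu_freq / double(N*N*N);
}

int main() {
    const long N = 64;
    std::printf("gemm_mnk: %.5f\n", cycles_per_fma(gemm_mnk, N));
    std::printf("gemm_nkm: %.5f\n", cycles_per_fma(gemm_nkm, N));
}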
Aren't these two kernels equivalent? They seem to operate on the same matrix layout and return exactly the same answer; they differ only in loop ordering.
The inner loop from gemm_nkm as generated by Clang seems to be:
LBB7_5: ; Parent Loop BB7_3 Depth=1
; Parent Loop BB7_4 Depth=2
; => This Inner Loop Header: Depth=3
ldr d0, [x13, x14]
ldp q1, q2, [x4, #-256]
fmla.2d v10, v1, v0[0]
fmla.2d v9, v2, v0[0]
ldp q1, q2, [x4, #-224]
fmla.2d v8, v1, v0[0]
fmla.2d v31, v2, v0[0]
ldp q1, q2, [x4, #-192]
fmla.2d v30, v1, v0[0]
fmla.2d v29, v2, v0[0]
ldp q1, q2, [x4, #-160]
fmla.2d v28, v1, v0[0]
fmla.2d v27, v2, v0[0]
ldp q1, q2, [x4, #-128]
fmla.2d v26, v1, v0[0]
fmla.2d v25, v2, v0[0]
ldp q1, q2, [x4, #-96]
fmla.2d v24, v1, v0[0]
fmla.2d v23, v2, v0[0]
ldp q1, q2, [x4, #-64]
fmla.2d v22, v1, v0[0]
fmla.2d v21, v2, v0[0]
ldp q1, q2, [x4, #-32]
fmla.2d v20, v1, v0[0]
fmla.2d v19, v2, v0[0]
ldp q1, q2, [x4]
fmla.2d v18, v1, v0[0]
fmla.2d v17, v2, v0[0]
ldp q1, q2, [x4, #32]
fmla.2d v16, v1, v0[0]
fmla.2d v7, v2, v0[0]
ldp q1, q2, [x4, #64]
fmla.2d v6, v1, v0[0]
fmla.2d v5, v2, v0[0]
ldp q1, q2, [x4, #96]
fmla.2d v4, v1, v0[0]
ldp q1, q3, [sp, #96] ; 32-byte Folded Reload
fmla.2d v1, v2, v0[0]
str q1, [sp, #96] ; 16-byte Folded Spill
ldp q1, q2, [x4, #128]
fmla.2d v3, v1, v0[0]
ldr q1, [sp, #128] ; 16-byte Folded Reload
fmla.2d v1, v2, v0[0]
stp q3, q1, [sp, #112] ; 32-byte Folded Spill
ldp q1, q2, [x4, #160]
fmla.2d v12, v1, v0[0]
fmla.2d v11, v2, v0[0]
ldp q1, q2, [x4, #192]
fmla.2d v13, v1, v0[0]
fmla.2d v15, v2, v0[0]
ldp q1, q2, [x4, #224]
fmla.2d v14, v1, v0[0]
ldr q1, [sp, #144] ; 16-byte Folded Reload
fmla.2d v1, v2, v0[0]
str q1, [sp, #144] ; 16-byte Folded Spill
add x14, x14, #8 ; =8
add x4, x4, #512 ; =512
cmp x14, #512 ; =512
b.ne LBB7_5
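My reading of that loop (a hedged C++-level reconstruction, not something stated by the compiler): each iteration loads one double of B (ldr d0), loads 64 doubles of a column of A (the ldp pairs spanning 512 bytes), and FMAs them into roughly 32 NEON accumulators, a few of which are spilled to the stack; the pointers then advance by 8 and 512 bytes. In other words, Clang appears to keep the whole 64-element column tile of C in registers across the k loop, with m fully unrolled. The function name and the hard-coded tile size of 64 below are my assumptions from the 512-byte strides, not from the source.
// Hedged reconstruction of what the quoted inner loop appears to compute:
// C[0:64, n] += A[0:64, k] * B[k, n] for each k, with the m tile of C held
// in vector registers for the whole k loop.
void gemm_nkm_inner_tile(double* C, const double* A, const double* B,
                         long M, long K, long n) {
    double c_acc[64];                     // ~32 q registers plus a few stack spills
    for (long m = 0; m < 64; m++)
        c_acc[m] = C[m + n*M];            // load the C column tile once
    for (long k = 0; k < K; k++) {        // cmp x14, #512 -> 64 iterations of 8 bytes
        double b = B[k + n*K];            // ldr d0, [x13, x14]
        for (long m = 0; m < 64; m++)     // fully unrolled into the fmla.2d chain
            c_acc[m] += A[m + k*M] * b;   // ldp q1, q2 / fmla.2d vX, v1, v0[0]
    }
    for (long m = 0; m < 64; m++)
        C[m + n*M] = c_acc[m];            // write the tile back once
}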