# Biostat M280: Matrix Multiplication By Looping

##### Dr. Hua Zhou, Jan 18, 2016

This example shows the effect of the looping order on computational efficiency.

The following `Julia` function overwrites the matrix $C$ with $C + AB$, using the loop order named by the `order` argument.

In [1]:
function matmul_by_loop!(A::Matrix{Float64}, B::Matrix{Float64}, 
    C::Matrix{Float64}, order::ASCIIString)
    
    # A is m × p, B is p × n, and C is m × n
    m = size(A, 1)
    p = size(A, 2)
    n = size(B, 2)
    
    if order == "jki"
        for j = 1:n
            for k = 1:p
                for i = 1:m
                    C[i, j] += A[i, k] * B[k, j]
                end
            end
        end
    end

    if order == "kji"
        for k = 1:p
            for j = 1:n
                for i = 1:m
                    C[i, j] += A[i, k] * B[k, j]
                end
            end
        end
    end
    
    if order == "ikj"
        for i = 1:m
            for k = 1:p
                for j = 1:n
                    C[i, j] += A[i, k] * B[k, j]
                end
            end
        end
    end

    if order == "kij"
        for k = 1:p
            for i = 1:m
                for j = 1:n
                    C[i, j] += A[i, k] * B[k, j]
                end
            end
        end
    end
    
    if order == "ijk"
        for i = 1:m
            for j = 1:n
                for k = 1:p
                    C[i, j] += A[i, k] * B[k, j]
                end
            end
        end
    end
    
    if order == "jik"
        for j = 1:n
            for i = 1:m
                for k = 1:p
                    C[i, j] += A[i, k] * B[k, j]
                end
            end
        end
    end
    
end

matmul_by_loop! (generic function with 1 method)
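Before timing anything, it is worth confirming that a hand-written loop agrees with the built-in product. The sketch below is a minimal check, assuming Julia 1.x rather than the Julia 0.4.3 used in this notebook (in 1.x, `ASCIIString` no longer exists, so the function above would need `String` instead). `matmul_jki!` and the `_small` variables are hypothetical names introduced only for this illustration.

```julia
# Assumes Julia 1.x. `matmul_jki!` is a hypothetical stand-alone version of
# the "jki" branch above, not the notebook's own function.
function matmul_jki!(A::Matrix{Float64}, B::Matrix{Float64}, C::Matrix{Float64})
    m, p = size(A)
    n = size(B, 2)
    # Check dimensions before indexing.
    size(B, 1) == p || throw(DimensionMismatch("inner dimensions of A and B differ"))
    size(C) == (m, n) || throw(DimensionMismatch("C must be $m × $n"))
    # In a multi-range `for`, the last range (i) is the innermost loop.
    for j in 1:n, k in 1:p, i in 1:m
        C[i, j] += A[i, k] * B[k, j]
    end
    return C
end

A_small = rand(5, 3)
B_small = rand(3, 4)
C_small = ones(5, 4)                      # nonzero start, to test the "+=" part
expected = C_small + A_small * B_small    # reference result from built-in `*`

matmul_jki!(A_small, B_small, C_small)
println(isapprox(C_small, expected))      # prints true
```

The comparison uses `isapprox` rather than `==` because the loop and BLAS may sum the terms in different orders, which can change the last few bits of a floating-point result.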

Generate data.

In [2]:
srand(280)
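# A is 2000 × 100, B is 100 × 2000, and C is 2000 × 2000.
# Here the inner dimension is called n; inside matmul_by_loop! it is called p.
m = 2000; n = 100; p = 2000
A = rand(m, n)
B = rand(n, p)
C = zeros(m, p);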

Now let's time the matrix multiplication under each loop order.

* $jki$ and $kji$ loops (innermost index $i$)

In [3]:
fill!(C, 0.0)
@elapsed matmul_by_loop!(A, B, C, "jki")

0.869936677

In [4]:
fill!(C, 0.0)
@elapsed matmul_by_loop!(A, B, C, "kji")

0.843544412

* $ikj$ and $kij$ loops (innermost index $j$)

In [5]:
fill!(C, 0.0)
@elapsed matmul_by_loop!(A, B, C, "ikj")

4.157876637

In [6]:
fill!(C, 0.0)
@elapsed matmul_by_loop!(A, B, C, "kij")

4.50656364

* $ijk$ and $jik$ loops (innermost index $k$)

In [7]:
fill!(C, 0.0)
@elapsed matmul_by_loop!(A, B, C, "ijk")

1.165143593

In [8]:
fill!(C, 0.0)
@elapsed matmul_by_loop!(A, B, C, "jik")

1.081677383
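The ranking of these timings is consistent with Julia storing matrices in column-major order. With $i$ innermost, `C[i, j]` and `A[i, k]` are both traversed down a column, which is contiguous in memory. With $j$ innermost, `C[i, j]` and `B[k, j]` are both traversed along a row, jumping a whole column length at each step. With $k$ innermost, only `A[i, k]` is accessed with that large stride. The short sketch below illustrates the layout; it assumes Julia 1.x, and `M` is an illustrative matrix, not data from this notebook.

```julia
# Assumes Julia 1.x; `M` is a small illustrative matrix.
M = [1 2 3;
     4 5 6]

# vec(M) lists the entries in memory order: each column is stored contiguously.
println(vec(M))        # [1, 4, 2, 5, 3, 6]

# strides(M) gives the memory step for a unit step along each dimension:
# 1 element when moving down a column (first index),
# size(M, 1) elements when moving along a row (second index).
println(strides(M))    # (1, 2)
```

For the 2000 × 2000 matrix `C` used here, a step along a row therefore skips 2000 elements, which makes poor use of the cache compared with a step down a column.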

How much time does Julia's built-in matrix multiplication, which calls BLAS, take?

In [9]:
fill!(C, 0.0)
@elapsed C += A * B

0.377898466

Let's call BLAS directly.

In [10]:
fill!(C, 0.0)
@elapsed BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)

0.013247082

Surprisingly, calling BLAS directly is much faster. What's going on? Let's put both in a loop and time them with the `@time` macro.

In [13]:
gc()
fill!(C, 0.0)
@time(for i = 1:100; C += A * B; end)
gc()
fill!(C, 0.0)
@time(for i = 1:100; BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C); end)

  2.876988 seconds (1000 allocations: 5.960 GB, 20.28% gc time)
  0.856120 seconds


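Now it's clear that calling BLAS directly avoids creating intermediate, temporary matrices and is therefore more efficient. `C += A * B` is shorthand for `C = C + A * B`, which allocates one new matrix for `A * B` and another for the sum. Each is 2000 × 2000 × 8 bytes, about 30.5 MiB, so 100 iterations account for the 5.960 GB reported above, whereas `BLAS.gemm!` writes into the existing `C` and reports no allocations.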
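The same comparison can be sketched in current Julia. The block below assumes Julia 1.3 or later, where `BLAS` lives in the `LinearAlgebra` standard library and the five-argument `mul!(C, A, B, α, β)` computes `C = α*A*B + β*C` in place. The helper names `accumulate_alloc` and `accumulate_inplace!`, the variable names, and the small sizes are all hypothetical choices for illustration.

```julia
using LinearAlgebra   # provides mul! and the BLAS submodule in Julia 1.x

# Hypothetical helpers; names are not from the notebook.
accumulate_alloc(C, A, B) = C + A * B                    # returns a new matrix
accumulate_inplace!(C, A, B) = mul!(C, A, B, 1.0, 1.0)   # C = A*B + C, in place

A2 = rand(200, 50)
B2 = rand(50, 200)
C_ref = accumulate_alloc(zeros(200, 200), A2, B2)        # reference result

C_mul = zeros(200, 200)
accumulate_inplace!(C_mul, A2, B2)

C_gemm = zeros(200, 200)
BLAS.gemm!('N', 'N', 1.0, A2, B2, 1.0, C_gemm)           # same call as in the notebook

println(C_mul ≈ C_ref && C_gemm ≈ C_ref)

# Both helpers have already run once (warm-up), so compilation is excluded.
# Use a scratch matrix so the results above are left untouched.
C_tmp = zeros(200, 200)
bytes_alloc   = @allocated accumulate_alloc(C_tmp, A2, B2)
bytes_inplace = @allocated accumulate_inplace!(C_tmp, A2, B2)
println((bytes_alloc, bytes_inplace))
```

The exact byte counts depend on the Julia version, but the allocating version should report at least the size of the two temporary matrices, while the in-place version should report little or nothing.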

Show system information.

In [14]:
versioninfo()

Julia Version 0.4.3
Commit a2f713d (2016-01-12 21:37 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
