## Numerical algebra: introduction

* The first big chunk of this course is numerical linear algebra.

* Topics in numerical algebra: 
    - BLAS  
    - solve linear equations $\mathbf{A} \mathbf{x} = \mathbf{b}$
    - regression computations $\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}$  
    - eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{x}$  
    - generalized eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{B} \mathbf{x}$  
    - singular value decompositions $\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T$  
    - iterative methods  

* Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. Our major **goals** are to  
    1. know the computational cost (flop count) of each task
    2. be familiar with the BLAS and LAPACK functions (what they do)  
    3. _not_ re-invent the wheel by implementing these subroutines yourself  
    4. apply appropriate numerical algebra tools to various statistical problems 

* High-level languages such as R, Matlab, and Julia call BLAS and LAPACK for numerical linear algebra. Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See [documentation](https://docs.julialang.org/en/stable/stdlib/linalg/?highlight=blas#module-Base.LinAlg.BLAS).
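
* As a minimal sketch of what calling these subroutines directly can look like, the block below solves the regression normal equations with two BLAS calls and one LAPACK call. It is written for Julia 1.x, where the wrappers live in the `LinearAlgebra` standard library; the data `X`, `y` and the name `betahat` are made up for illustration.

```julia
using LinearAlgebra  # provides the BLAS and LAPACK submodules in Julia 1.x

n, p = 100, 3
X = randn(n, p)      # hypothetical design matrix
y = randn(n)         # hypothetical response vector

# Level-3 BLAS (syrk): upper triangle of X'X; the lower triangle is not filled in
XtX = BLAS.syrk('U', 'T', 1.0, X)

# Level-2 BLAS (gemv): X'y without forming the transpose of X
Xty = BLAS.gemv('T', X, y)

# LAPACK (posv!): Cholesky-factorize XtX in place and overwrite Xty with the solution
LAPACK.posv!('U', XtX, Xty)
betahat = Xty

# Compare with the high-level least squares solve
agrees = isapprox(betahat, X \ y; rtol = 1e-8)
```

In everyday code `X \ y` is the safer choice; the low-level calls are shown only to make the BLAS/LAPACK layer visible.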

## BLAS

* BLAS stands for _basic linear algebra subprograms_. 

* See [netlib](http://www.netlib.org/blas/) for a complete listing of standardized BLAS functions.

* There are three levels of BLAS functions.
    - Level 1: vector operations
    - Level 2: matrix-vector operations
    - Level 3: matrix-matrix operations

| Level | Example Operation                      | Name        | Dimension                                 | Flops |
|-------|----------------------------------------|-------------|-------------------------------------------|-------|
| 1     | $\alpha \gets \mathbf{x}^T \mathbf{y}$ | dot product | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$  |
|       | $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ |  saxpy           |  $\alpha \in \mathbb{R}$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ |  $2n$    |
| 2     | $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ |  gaxpy           |  $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$                                     |  $2mn$     |
|       | $\mathbf{A} \gets \mathbf{A} + \mathbf{y} \mathbf{x}^T$ | rank one update            |    $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$                                       | $2mn$      |
| 3     | $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$                                       |  matrix multiplication           |  $\mathbf{A} \in \mathbb{R}^{m \times p}$, $\mathbf{B} \in \mathbb{R}^{p \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times n}$                                         | $2mnp$      |

* Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z). 
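
* The operations in the table can be tried out through Julia's BLAS wrappers. The sketch below assumes Julia 1.x (`using LinearAlgebra`) and small made-up inputs. With `Float64` arrays the wrappers dispatch to the double precision (D) routines.

```julia
using LinearAlgebra

n = 4
x = collect(1.0:n)       # [1, 2, 3, 4]
y = ones(n)
A = ones(n, n)
B = Matrix(1.0I, n, n)   # identity matrix
C = zeros(n, n)

# Level 1, dot product: arguments are length, x, stride of x, y, stride of y
alpha = BLAS.dot(n, x, 1, y, 1)

# Level 1, axpy: y ← 2x + y (overwrites y)
BLAS.axpy!(2.0, x, y)

# Level 2, gemv: z ← 1.0 * A * x + 1.0 * z (overwrites z)
z = zeros(n)
BLAS.gemv!('N', 1.0, A, x, 1.0, z)

# Level 2, ger: A ← 1.0 * x * y' + A (rank-one update, overwrites A)
BLAS.ger!(1.0, x, y, A)

# Level 3, gemm: C ← 1.0 * A * B + 1.0 * C (overwrites C)
BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)
```

The functions ending in `!` overwrite their last argument, which is how BLAS avoids allocating memory for results.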

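* Some operations look like level-3 but are in fact level-2.  
    - A common operation in statistics is column scaling or row scaling by a diagonal matrix $\mathbf{D}$
    $$ \mathbf{A} \gets \mathbf{A} \mathbf{D} $$
    $$ \mathbf{A} \gets \mathbf{D} \mathbf{A} $$
    - These are essentially level-2 operations: each costs $O(n^2)$ flops for an $n \times n$ matrix, not the $O(n^3)$ of a dense matrix product.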

In [17]:
using BenchmarkTools

n = 2000

A = rand(n, n)
d = rand(n)  # d vector
D = diagm(d) # diagonal matrix with d as diagonal

# this is calling BLAS routine for matrix multiplication
@benchmark A * D

BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     101.338 ms (3.51% GC)
  median time:      102.322 ms (3.50% GC)
  mean time:        104.217 ms (4.26% GC)
  maximum time:     145.673 ms (30.60% GC)
  --------------
  samples:          48
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [19]:
# columnwise scaling
@benchmark scale(A, d)

BenchmarkTools.Trial: 
  memory estimate:  30.53 MiB
  allocs estimate:  68
  --------------
  minimum time:     20.959 ms (16.29% GC)
  median time:      24.166 ms (15.51% GC)
  mean time:        24.399 ms (16.75% GC)
  maximum time:     77.463 ms (71.77% GC)
  --------------
  samples:          205
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [20]:
# another way for columnwise scaling
@benchmark A * Diagonal(d)

BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  3
  --------------
  minimum time:     19.813 ms (1.37% GC)
  median time:      23.585 ms (16.13% GC)
  mean time:        23.873 ms (17.02% GC)
  maximum time:     71.266 ms (71.53% GC)
  --------------
  samples:          210
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
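
* The reason the `Diagonal` versions are faster is that scaling needs only one multiplication per entry. A minimal sketch, assuming Julia 1.x and made-up small inputs, checks that column and row scaling are plain elementwise products and shows an in-place variant via broadcasting:

```julia
using LinearAlgebra

A = rand(5, 3)
d = rand(3)    # one scale factor per column
e = rand(5)    # one scale factor per row

# Column scaling A * D: column j is multiplied by d[j]
col_ok = A * Diagonal(d) ≈ A .* d'

# Row scaling D * A: row i is multiplied by e[i]
row_ok = Diagonal(e) * A ≈ e .* A

# In-place column scaling: overwrites A, no new matrix is allocated
A0 = copy(A)
A .*= d'
inplace_ok = A ≈ A0 * Diagonal(d)
```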

## Memory hierarchy and BLAS level-3 fraction

> Key to high performance is effective use of memory hierarchy. True on all architectures.

* The flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.

<img src="http://tssdr1.thessdreview1.netdna-cdn.com/wp-content/uploads/2013/10/Base-Open.png" width="500" align="center">

<img src="http://hothardware.com/ContentImages/NewsItem/34743/content/Intel-Skylake-Graphics-Gen-9-Block-Diag.jpg" width="500" align="center">

<img src="http://images.bit-tech.net/content_images/2007/11/the_secrets_of_pc_memory_part_1/hei.png" width="500" align="center">

* Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog?  
Answer: use **high-level BLAS** as much as possible.

| BLAS                                                           | Dimension                                                                           | Mem. Refs. | Flops  | Ratio |
|----------------------------------------------------------------|-------------------------------------------------------------------------------------|------------|--------|-------|
| Level 1: $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$     | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$                                           | $3n$       | $2n$   | 3:2   |
| Level 2: $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$ | $n^2$      | $2n^2$ | 1:2   |
| Level 3: $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | $\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{n \times n}$                    | $4n^2$     | $2n^3$ | $2:n$ |  

* Level-1 BLAS routines tend to be **memory bandwidth-limited**. E.g., the Xeon X5650 CPU has a theoretical throughput of 128 DP GFLOPS but a maximum memory bandwidth of 32 GB/s.

* Higher level BLAS (3 or 2) make more effective use of arithmetic logic units (ALU) by keeping them busy.

* A distinction between LAPACK and LINPACK (older versions of R use LINPACK) is that LAPACK makes use of higher-level BLAS as much as possible (usually by smart partitioning) to increase the so-called **level-3 fraction**.
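
* To see why level-1 BLAS is bandwidth-limited, the arithmetic for the Xeon X5650 figures quoted above can be worked through. This is a back-of-the-envelope sketch that assumes 8-byte doubles and that every memory reference goes to main memory (no cache reuse).

```julia
peak_flops = 128e9        # flops per second, figure quoted in the text
bandwidth  = 32e9         # bytes per second, figure quoted in the text
bytes_per_double = 8      # assumption: double precision data

# axpy performs 2n flops using 3n memory references (read x, read y, write y),
# i.e. 3 * 8 / 2 = 12 bytes of memory traffic per flop
bytes_per_flop = 3 * bytes_per_double / 2

# Highest flop rate that the memory system can sustain for axpy
axpy_bound = bandwidth / bytes_per_flop        # about 2.67e9 flops per second

# Share of the theoretical peak that axpy can reach under these assumptions
fraction_of_peak = axpy_bound / peak_flops     # about 0.02
```

Under these assumptions axpy cannot exceed roughly 2% of peak, whereas matrix multiplication, with its $2:n$ ratio, can reuse each loaded entry many times.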

## Effect of data layout

* Data layout in memory affects execution speed too. It is much faster to move contiguous chunks of data in memory than to retrieve or write scattered data.

* Storage mode: **column-major** (Fortran, Matlab, R, Julia) vs **row-major** (C/C++).

* A **cache line** is the smallest unit of data transferred between cache and main memory.
    - x86 CPUs: 64 bytes  
    - ARM CPUs: 32 bytes

<img src="https://patterns.eecs.berkeley.edu/wordpress/wp-content/uploads/2013/04/dense02.png" width="500" align="center"/>

* Accessing a matrix stored in column-major order by rows causes lots of **cache misses**.
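
* A small sketch makes the column-major layout concrete. The helper names `sum_by_columns` and `sum_by_rows` are made up for illustration. Both return the same sum, but only the first walks through memory contiguously.

```julia
A = [1 2 3; 4 5 6]        # 2×3 matrix

# Column-major storage: memory holds column 1, then column 2, then column 3
layout = vec(A)           # [1, 4, 2, 5, 3, 6]

# Stride 1 down a column, stride 2 (the number of rows) along a row
st = strides(A)           # (1, 2)

# Innermost loop over i runs down a column: unit stride, cache friendly
function sum_by_columns(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end

# Innermost loop over j runs along a row: stride equals the number of rows
function sum_by_rows(A::AbstractMatrix)
    s = zero(eltype(A))
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end

same = sum_by_columns(A) == sum_by_rows(A)
```

On a matrix this small the two loops are indistinguishable in speed; the difference shows up once the matrix no longer fits in cache.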

* When you call BLAS from C/C++, use the CBLAS library instead of the traditional BLAS library implemented in Fortran.

* Take matrix multiplication as an example. 
$$ 
\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}, \quad \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{p \times n}, \mathbf{C} \in \mathbb{R}^{m \times n}.
$$
Assume the storage is column-major, as in Julia. There are 6 variants of the algorithm, one for each ordering of the triple loop. We pay attention to the innermost loop, where the vector calculation occurs. 
    - `jki` or `kji` looping:
    ```julia
    for i = 1:m
        C[i, j] = C[i, j] + A[i, k] * B[k, j]
    end
    ```  
    - `ikj` or `kij` looping:
    ```julia
    for j = 1:n
        C[i, j] = C[i, j] + A[i, k] * B[k, j]
    end
    ```  
    - `ijk` or `jik` looping:
    ```julia
    for k = 1:p
        C[i, j] = C[i, j] + A[i, k] * B[k, j]
    end
    ```
The associated **stride** when accessing the three matrices in memory (assuming column-major storage) is  

| Variant        | A Stride | B Stride | C Stride |
|----------------|----------|----------|----------|
| $jki$ or $kji$ | Unit     | 0        | Unit     |
| $ikj$ or $kij$ | 0        | Non-Unit | Non-Unit |
| $ijk$ or $jik$ | Non-Unit | Unit     | 0        |

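The variants $jki$ and $kji$ are preferred: their innermost loop accesses $\mathbf{A}$ and $\mathbf{C}$ with unit stride and reuses a single entry of $\mathbf{B}$.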
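
* Another way to read the $jki$ variant: its innermost loop is an axpy on two columns, $\mathbf{C}_{:,j} \gets \mathbf{C}_{:,j} + b_{kj} \mathbf{A}_{:,k}$. The sketch below (Julia 1.x, with a made-up function name `matmul_jki_axpy!`) writes the loop that way using column views, which are contiguous in column-major storage.

```julia
using LinearAlgebra

# jki matrix multiplication with the innermost i-loop expressed as an axpy
function matmul_jki_axpy!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    p = size(A, 2)
    n = size(B, 2)
    for j in 1:n, k in 1:p
        # C[:, j] ← C[:, j] + B[k, j] * A[:, k], both columns have unit stride
        axpy!(B[k, j], view(A, :, k), view(C, :, j))
    end
    return C
end

A = rand(6, 4)
B = rand(4, 5)
C = zeros(6, 5)
matmul_jki_axpy!(C, A, B)

matches = C ≈ A * B
```

This is only meant to illustrate the access pattern. It is a sequence of level-1 operations and so will not approach the speed of a level-3 `gemm` call.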

In [32]:
function matmul_by_loop!(A::Matrix{Float64}, B::Matrix{Float64}, 
    C::Matrix{Float64}, order::String)
    
    m = size(A, 1)
    p = size(A, 2)
    n = size(B, 2)
    
    if order == "jki"
        for j = 1:n, k = 1:p, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kji"
        for k = 1:p, j = 1:n, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ikj"
        for i = 1:m, k = 1:p, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kij"
        for k = 1:p, i = 1:m, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ijk"
        for i = 1:m, j = 1:n, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "jik"
        for j = 1:n, i = 1:m, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
end

srand(123)
m, p, n = 2000, 100, 2000  # same names as in the formula above
A = rand(m, p)   # m × p
B = rand(p, n)   # p × n
C = zeros(m, n); # m × n



* $jki$ and $kji$ looping:

In [33]:
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "jki")

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     429.365 ms (0.00% GC)
  median time:      464.860 ms (0.00% GC)
  mean time:        479.010 ms (0.00% GC)
  maximum time:     576.146 ms (0.00% GC)
  --------------
  samples:          11
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [34]:
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "kji")

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     492.760 ms (0.00% GC)
  median time:      573.256 ms (0.00% GC)
  mean time:        573.022 ms (0.00% GC)
  maximum time:     707.823 ms (0.00% GC)
  --------------
  samples:          9
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

* $ikj$ and $kij$ looping:

In [35]:
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "ikj")

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.444 s (0.00% GC)
  median time:      2.508 s (0.00% GC)
  mean time:        2.508 s (0.00% GC)
  maximum time:     2.573 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [36]:
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "kij")

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.615 s (0.00% GC)
  median time:      2.657 s (0.00% GC)
  mean time:        2.657 s (0.00% GC)
  maximum time:     2.699 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

* $ijk$ and $jik$ looping:

In [37]:
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "ijk")

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.090 s (0.00% GC)
  median time:      1.333 s (0.00% GC)
  mean time:        1.257 s (0.00% GC)
  maximum time:     1.397 s (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [38]:
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "jik")

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.095 s (0.00% GC)
  median time:      1.138 s (0.00% GC)
  mean time:        1.158 s (0.00% GC)
  maximum time:     1.277 s (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

* Julia wrapper of the BLAS matrix multiplication routine `gemm`.

In [41]:
fill!(C, 0.0)
@benchmark Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.844 ms (0.00% GC)
  median time:      6.862 ms (0.00% GC)
  mean time:        6.759 ms (0.00% GC)
  maximum time:     11.662 ms (0.00% GC)
  --------------
  samples:          735
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
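
* The general form computed by `gemm!` is $\mathbf{C} \gets \alpha \, \text{op}(\mathbf{A}) \, \text{op}(\mathbf{B}) + \beta \mathbf{C}$, where the two character flags select `'N'` (no transpose) or `'T'` (transpose). A minimal sketch with made-up small matrices, assuming Julia 1.x where the wrapper is `LinearAlgebra.BLAS.gemm!`:

```julia
using LinearAlgebra

A = rand(4, 3)
B = rand(4, 2)
C = ones(3, 2)
C0 = copy(C)      # keep the original C for checking

# C ← 2.0 * A' * B + 0.5 * C; 'T' makes BLAS read A as its transpose
BLAS.gemm!('T', 'N', 2.0, A, B, 0.5, C)
gemm_ok = C ≈ 2.0 * A' * B + 0.5 * C0

# mul! is the generic in-place product: D ← A' * B, written into preallocated D
D = similar(C)
mul!(D, A', B)
mul_ok = D ≈ A' * B
```

For everyday code `mul!` is usually the more convenient entry point, since it picks a suitable routine for the argument types.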

## Avoid memory transactions as much as possible

* Transpose or not.  
    - Julia is smart enough to avoid explicitly transposing a matrix when possible.
    - In R, the command `t(A) %*% x` will first transpose `A` and then perform the matrix multiplication, causing unnecessary memory allocation.

In [45]:
srand(123)

n = 1000
A = rand(n, n)
x = rand(n)

@benchmark A' * x

BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     88.529 μs (0.00% GC)
  median time:      130.786 μs (0.00% GC)
  mean time:        141.644 μs (0.00% GC)
  maximum time:     895.696 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [44]:
@benchmark At_mul_B(A, x)

BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     98.216 μs (0.00% GC)
  median time:      126.025 μs (0.00% GC)
  mean time:        133.685 μs (0.00% GC)
  maximum time:     685.291 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
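
* In Julia 1.x the mechanism is visible in the type system: `A'` returns a lazy `Adjoint` wrapper around the same memory rather than a new matrix. The sketch below assumes Julia 1.x and made-up small inputs.

```julia
using LinearAlgebra

A = rand(3, 3)
x = rand(3)

At = A'                         # lazy wrapper, no data is copied
is_lazy = At isa Adjoint        # true
shares  = parent(At) === A      # true: the wrapper points at A itself

M = Matrix(A')                  # explicit copy into a new, transposed matrix
is_copy = M isa Matrix          # true

same_product = At * x ≈ M * x   # both give A'x
```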

* [Broadcasting](https://docs.julialang.org/en/stable/manual/functions/#dot-syntax-for-vectorizing-functions) in Julia achieves vectorized code without creating intermediate arrays.

In [67]:
srand(123)
A = rand(1000, 1000)
B = similar(A) # allocate a matrix same size as A

@benchmark B .= log.(exp.(asin.(sin.(A))))

BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  3
  --------------
  minimum time:     68.583 ms (0.00% GC)
  median time:      69.665 ms (0.00% GC)
  mean time:        70.769 ms (0.00% GC)
  maximum time:     78.866 ms (0.00% GC)
  --------------
  samples:          71
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [68]:
@benchmark B = log(exp(asin(sin(A))))

BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  12
  --------------
  minimum time:     45.258 ms (2.65% GC)
  median time:      49.257 ms (4.68% GC)
  mean time:        49.358 ms (4.90% GC)
  maximum time:     59.911 ms (0.62% GC)
  --------------
  samples:          102
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
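
* A sketch of what the fused expression does compared with computing it step by step, assuming Julia 1.x. Note that in Julia 1.x the undotted `sin(A)` on a square matrix means the matrix function, so the dots (or the `@.` macro) are needed for elementwise evaluation.

```julia
A = rand(100, 100)

# Fused broadcast: a single loop over the entries that writes directly into B
B = similar(A)
B .= log.(exp.(asin.(sin.(A))))

# The @. macro adds the dots to every call and to the assignment
B2 = similar(A)
@. B2 = log(exp(asin(sin(A))))

# Unfused equivalent: every step allocates a full temporary matrix
t1 = map(sin, A)
t2 = map(asin, t1)
t3 = map(exp, t2)
B3 = map(log, t3)

# For entries in [0, 1) the composition is the identity up to rounding
roundtrip = B ≈ A
```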

* A [view](https://docs.julialang.org/en/stable/stdlib/arrays/#Base.view) avoids creating an extra copy of the matrix data.

In [72]:
# sum entries in a sub-matrix
@benchmark sum(A[1:500, 500:end])

BenchmarkTools.Trial: 
  memory estimate:  1.91 MiB
  allocs estimate:  6
  --------------
  minimum time:     301.451 μs (0.00% GC)
  median time:      420.798 μs (0.00% GC)
  mean time:        604.061 μs (21.44% GC)
  maximum time:     3.389 ms (60.29% GC)
  --------------
  samples:          8112
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [71]:
@benchmark sum(@view A[1:500, 500:end])

BenchmarkTools.Trial: 
  memory estimate:  256 bytes
  allocs estimate:  8
  --------------
  minimum time:     266.879 μs (0.00% GC)
  median time:      267.322 μs (0.00% GC)
  mean time:        281.743 μs (0.00% GC)
  maximum time:     886.228 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
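
* The difference between a slice and a view can also be seen from what happens on assignment. A minimal sketch with a made-up 4×4 matrix:

```julia
A = reshape(collect(1.0:16.0), 4, 4)   # A[i, j] = i + 4(j - 1)

S = A[1:2, 3:4]          # slicing copies the 2×2 block into a new matrix
V = view(A, 1:2, 3:4)    # a view refers to the same memory as A

same_sum = sum(S) == sum(V)            # both sum the entries 9, 10, 13, 14

V[1, 1] = 0.0            # writes through to A[1, 3]
a13 = A[1, 3]            # now 0.0
s11 = S[1, 1]            # still 9.0, the copy is unaffected

# @views turns every slice in the expression into a view
total = @views sum(A[1:2, 3:4])
```

Because a view aliases its parent, writing to it changes the original matrix. This is the price of avoiding the copy.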