# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Julia:-Numerical-Linear-Algebra" data-toc-modified-id="Julia:-Numerical-Linear-Algebra-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Julia: Numerical Linear Algebra</a></div><div class="lev2 toc-item"><a href="#Numerical-linear-algebra:-introduction" data-toc-modified-id="Numerical-linear-algebra:-introduction-11"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Numerical linear algebra: introduction</a></div><div class="lev2 toc-item"><a href="#BLAS" data-toc-modified-id="BLAS-12"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>BLAS</a></div><div class="lev2 toc-item"><a href="#Memory-hierarchy-and-level-3-fraction" data-toc-modified-id="Memory-hierarchy-and-level-3-fraction-13"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Memory hierarchy and level-3 fraction</a></div><div class="lev2 toc-item"><a href="#Effect-of-data-layout" data-toc-modified-id="Effect-of-data-layout-14"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Effect of data layout</a></div><div class="lev2 toc-item"><a href="#Avoid-memory-allocation:-some-examples" data-toc-modified-id="Avoid-memory-allocation:-some-examples-15"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Avoid memory allocation: some examples</a></div><div class="lev3 toc-item"><a href="#Transposing-matrix-is-expensive" data-toc-modified-id="Transposing-matrix-is-expensive-151"><span class="toc-item-num">1.5.1&nbsp;&nbsp;</span>Transposing matrix is expensive</a></div><div class="lev2 toc-item"><a href="#Sparse-linear-algebra" data-toc-modified-id="Sparse-linear-algebra-16"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Sparse linear algebra</a></div><div class="lev2 toc-item"><a href="#Iterative-methods-for-linear-algebra" data-toc-modified-id="Iterative-methods-for-linear-algebra-17"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Iterative methods for linear algebra</a></div>

# Julia: Numerical Linear Algebra

Numerical linear algebra occupies much of statistical computing. This notebook gives a quick overview of linear algebra in Julia.

Machine information:

In [12]:
versioninfo()

Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)


## Numerical linear algebra: introduction

* Topics in numerical linear algebra: 
    - BLAS: vector operations, matrix-vector multiplications, matrix-matrix multiplications  
    - solve linear equations $\mathbf{A} \mathbf{x} = \mathbf{b}$
    - regression computations $\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}$  
    - eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{x}$  
    - generalized eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{B} \mathbf{x}$  
    - singular value decompositions $\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T$  
    - iterative methods for numerical linear algebra    

* Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the **building blocks** of most statistical computing tasks (optimization, MCMC).

* All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra. 
    - Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See documentation: [BLAS](https://docs.julialang.org/en/stable/stdlib/linalg/#BLAS-Functions-1), [LAPACK](https://docs.julialang.org/en/stable/stdlib/linalg/#LAPACK-Functions-1).

In [13]:
using BenchmarkTools

srand(123) # seed
n = 1000
A = randn(n, n)
B = randn(n, n)
@benchmark $A * $B

BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     14.346 ms (0.00% GC)
  median time:      15.501 ms (0.00% GC)
  mean time:        16.296 ms (6.26% GC)
  maximum time:     21.766 ms (14.27% GC)
  --------------
  samples:          307
  evals/sample:     1

In [14]:
using RCall

R"""
library(microbenchmark)
microbenchmark($A %*% $B)
"""

RCall.RObject{RCall.VecSxp}
Unit: milliseconds
                expr      min       lq     mean   median       uq      max
 `#JL`$A %*% `#JL`$B 602.5493 627.3499 646.7977 642.7177 658.0193 728.1404
 neval
   100


Base R is linked against a slow, unoptimized BLAS library (by default, R ships with the reference implementation). For this matrix multiplication example, R is roughly 40 times slower than Julia with OpenBLAS (median 643 ms versus 15.5 ms).

## BLAS

* BLAS stands for _basic linear algebra subprograms_. 

* See [netlib](http://www.netlib.org/blas/) for a complete list of standardized BLAS functions.

* There are many implementations of BLAS. 
    - [Netlib](http://www.netlib.org/blas/) provides a reference implementation  
    - Matlab uses Intel's [MKL](https://software.intel.com/en-us/node/520724) (Math Kernel Library)  
    - Julia uses [OpenBLAS](https://github.com/xianyi/OpenBLAS)  
    - JuliaPro offers the option of using MKL

* There are 3 levels of BLAS functions.
    - [Level 1](http://www.netlib.org/blas/#_level_1): vector-vector operation
    - [Level 2](http://www.netlib.org/blas/#_level_2): matrix-vector operation
    - [Level 3](http://www.netlib.org/blas/#_level_3): matrix-matrix operation

| Level | Example Operation                      | Name        | Dimension                                 | Flops |
|-------|----------------------------------------|-------------|-------------------------------------------|-------|
| 1     | $\alpha \gets \mathbf{x}^T \mathbf{y}$ | dot product | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$  |
|       | $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ |  axpy           |  $\alpha \in \mathbb{R}$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ |  $2n$    |
| 2     | $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ |  gaxpy           |  $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$                                     |  $2mn$     |
|       | $\mathbf{A} \gets \mathbf{A} + \mathbf{y} \mathbf{x}^T$ | rank one update            |    $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$                                       | $2mn$      |
| 3     | $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$                                       |  matrix multiplication           |  $\mathbf{A} \in \mathbb{R}^{m \times p}$, $\mathbf{B} \in \mathbb{R}^{p \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times n}$                                         | $2mnp$      |

* Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z). 

* Some operations _appear_ to be level-3 but are in fact level-2.  
    - A common operation in statistics is column scaling or row scaling
    $$
    \begin{eqnarray*}
        \mathbf{A} &=& \mathbf{A} \mathbf{D} \quad \text{(column scaling)} \\
        \mathbf{A} &=& \mathbf{D} \mathbf{A} \quad \text{(row scaling)},
    \end{eqnarray*}
    $$
    where $\mathbf{D}$ is diagonal.  
    - These are essentially level-2 operations!
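
To see why these are level-2 operations, note that column scaling touches each entry of $\mathbf{A}$ exactly once. A minimal sketch of the loop follows; the name `scale_columns!` is made up for illustration and is not a library routine, and the code uses only base language features, so it should run on Julia 0.6 as well as later versions.

```julia
# Scale column j of A by d[j] in place, i.e. A ← A * Diagonal(d).
# Hypothetical helper for illustration: it costs m * n multiplications,
# not the O(n^3) flops of a general matrix-matrix product.
function scale_columns!(A, d)
    m, n = size(A)
    for j in 1:n          # outer loop over columns
        dj = d[j]
        for i in 1:m      # inner loop walks down one column (unit stride)
            A[i, j] *= dj
        end
    end
    return A
end

A_small = [1.0 2.0; 3.0 4.0]
d_small = [10.0, 100.0]
scale_columns!(A_small, d_small)   # A_small is now [10.0 200.0; 30.0 400.0]
```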

In [15]:
using BenchmarkTools

srand(123) # seed
n = 2000
A = rand(n, n)
d = rand(n)  # d vector

2000-element Array{Float64,1}:
 0.763192
 0.668759
 0.709337
 0.088416
 0.289151
 0.375468
 0.437356
 0.179474
 0.122238
 0.895783
 0.332146
 0.206425
 0.747789
 ⋮       
 0.484487
 0.567484
 0.75455 
 0.703483
 0.166205
 0.754612
 0.231834
 0.769243
 0.805681
 0.553389
 0.450904
 0.814614

In [16]:
D = diagm(d) # diagonal matrix with d as diagonal

2000×2000 Array{Float64,2}:
 0.763192  0.0       0.0       0.0       …  0.0       0.0       0.0     
 0.0       0.668759  0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.709337  0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.088416     0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0       …  0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0       …  0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 ⋮                     

In [17]:
# this is calling BLAS routine for matrix multiplication: O(n^3) flops
@benchmark A * D

BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     107.945 ms (0.60% GC)
  median time:      118.194 ms (3.17% GC)
  mean time:        120.971 ms (4.45% GC)
  maximum time:     173.536 ms (39.71% GC)
  --------------
  samples:          42
  evals/sample:     1

In [18]:
Diagonal(d)

2000×2000 Diagonal{Float64}:
 0.763192   ⋅         ⋅         ⋅        …   ⋅         ⋅         ⋅      
  ⋅        0.668759   ⋅         ⋅            ⋅         ⋅         ⋅      
  ⋅         ⋅        0.709337   ⋅            ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅        0.088416      ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅        …   ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅        …   ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅            ⋅         ⋅         ⋅      
 ⋮                    

In [19]:
# current way for columnwise scaling: O(n^2) flops
@benchmark A * Diagonal(d)

BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  3
  --------------
  minimum time:     8.891 ms (5.99% GC)
  median time:      12.059 ms (32.07% GC)
  mean time:        12.341 ms (33.58% GC)
  maximum time:     94.327 ms (90.80% GC)
  --------------
  samples:          405
  evals/sample:     1

In [20]:
@which A * Diagonal(d)

In [21]:
# in-place: avoids allocating space for the result
@benchmark scale!(A, d)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.866 ms (0.00% GC)
  median time:      10.123 ms (0.00% GC)
  mean time:        10.787 ms (0.00% GC)
  maximum time:     18.797 ms (0.00% GC)
  --------------
  samples:          463
  evals/sample:     1

## Memory hierarchy and level-3 fraction

> **Key to high performance is effective use of memory hierarchy. True on all architectures.**

* Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.

<img src="./macpro_inside.png" width="400" align="center">

<img src="./cpu_die.png" width="400" align="center">  

<img src="http://images.bit-tech.net/content_images/2007/11/the_secrets_of_pc_memory_part_1/hei.png" width="400" align="center">

* Numbers everyone should know

| Operation                           | Time           |
|-------------------------------------|----------------|
| L1 cache reference                  | 0.5 ns         |
| L2 cache reference                  | 7 ns           |
| Main memory reference               | 100 ns         |
| Read 1 MB sequentially from memory  | 250,000 ns     |
| Read 1 MB sequentially from SSD     | 1,000,000 ns   |  
| Read 1 MB sequentially from disk    | 20,000,000 ns  |


   Source: <https://gist.github.com/jboner/2841832>  

* For example, Xeon X5650 CPU has a theoretical throughput of 128 DP GFLOPS but a max memory bandwidth of 32GB/s.  

* Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog?  
Answer: use **high-level BLAS** as much as possible.

| BLAS                                                           | Dimension                                                                           | Mem. Refs. | Flops  | Ratio |
|----------------------------------------------------------------|-------------------------------------------------------------------------------------|------------|--------|-------|
| Level 1: $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$     | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$                                           | $3n$       | $2n$   | 3:2   |
| Level 2: $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$ | $n^2$      | $2n^2$ | 1:2   |
| Level 3: $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | $\mathbf{A}, \mathbf{B}, \mathbf{C} \in\mathbb{R}^{n \times n}$                    | $4n^2$     | $2n^3$ | 2:n |  

* Higher level BLAS (3 or 2) make more effective use of arithmetic logic units (ALU) by keeping them busy. **Surface-to-volume** effect.  
See [Dongarra slides](https://www.samsi.info/wp-content/uploads/2017/02/SAMSI-0217_Dongarra.pdf).
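
As a rough back-of-the-envelope check, the Xeon X5650 figures quoted above can be combined with the ratios in the table. The sketch below assumes 8-byte double-precision numbers and treats every memory reference as a trip to main memory (ignoring caches), so it is only an order-of-magnitude argument.

```julia
peak_flops = 128e9     # theoretical DP flops per second (figure quoted above)
bandwidth  = 32e9      # bytes per second from main memory (figure quoted above)
bytes_per_double = 8   # assumption: Float64 data

# flops the CPU must perform per double loaded in order to stay busy
balance = peak_flops / (bandwidth / bytes_per_double)

# flops per memory reference, taken from the table above
axpy_intensity(n) = (2 * n) / (3 * n)       # level 1
gemv_intensity(n) = (2 * n^2) / (n^2)       # level 2
gemm_intensity(n) = (2 * n^3) / (4 * n^2)   # level 3

println("required: ", balance)
println("axpy:     ", axpy_intensity(1000))
println("gemv:     ", gemv_intensity(1000))
println("gemm:     ", gemm_intensity(1000))
```

Under these assumptions the CPU needs 32 flops per double moved. Level-1 and level-2 operations offer at most 2 flops per memory reference regardless of $n$, while the level-3 operation offers $n/2$, which reaches 32 at $n = 64$.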

<img src="./blas_throughput.png" width="500" align="center"/>

* A distinction between LAPACK and LINPACK (older versions of R use LINPACK) is that LAPACK makes use of higher-level BLAS as much as possible (usually by smart partitioning) to increase the so-called **level-3 fraction**.
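
The partitioning idea can be illustrated on the simplest case, matrix multiplication itself, rather than on a LAPACK factorization. The sketch below is a plain blocked (tiled) triple loop; the function name `blocked_matmul!` and the block size argument `bs` are assumptions made for this example, and it is far from what an optimized BLAS actually does. The point is only that each small block of `A`, `B`, and `C` is reused many times once it has been loaded.

```julia
# Compute C = A * B by looping over bs × bs blocks.
# Hypothetical illustration of partitioning, not a library routine.
function blocked_matmul!(C, A, B, bs)
    m, p = size(A)
    n = size(B, 2)
    fill!(C, 0)
    # outer loops pick one block of C, A and B at a time
    for jj in 1:bs:n, kk in 1:bs:p, ii in 1:bs:m
        # inner loops multiply the blocks, using the cache-friendly jki order
        for j in jj:min(jj + bs - 1, n), k in kk:min(kk + bs - 1, p)
            b = B[k, j]
            for i in ii:min(ii + bs - 1, m)
                C[i, j] += A[i, k] * b
            end
        end
    end
    return C
end

A_small = [1.0 2.0 3.0; 4.0 5.0 6.0]
B_small = [1.0 0.0; 0.0 1.0; 2.0 2.0]
C_small = zeros(2, 2)
blocked_matmul!(C_small, A_small, B_small, 2)   # C_small is now [7.0 8.0; 16.0 17.0]
```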

## Effect of data layout

* Data layout in memory affects algorithmic efficiency too. It is much faster to move contiguous chunks of data in memory than to retrieve or write scattered data.

* Storage mode: **column-major** (Fortran, Matlab, R, Julia) vs **row-major** (C/C++).

* A **cache line** is the smallest unit of data transferred between main memory and cache.
    - x86 CPUs: 64 bytes  
    - ARM CPUs: 32 or 64 bytes, depending on the core

<img src="https://patterns.eecs.berkeley.edu/wordpress/wp-content/uploads/2013/04/dense02.png" width="500" align="center"/>

* Accessing a column-major matrix by rows causes many **cache misses**.
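
A small sketch makes the two access patterns concrete. Both functions below compute the same sum and differ only in loop order; the names `sum_by_columns` and `sum_by_rows` are made up for this illustration.

```julia
# Sum all entries, walking down columns (unit stride in column-major storage).
function sum_by_columns(A)
    s = zero(eltype(A))
    for j in 1:size(A, 2), i in 1:size(A, 1)   # i is the innermost loop
        s += A[i, j]
    end
    return s
end

# Same sum, walking along rows (consecutive accesses are size(A, 1) entries apart).
function sum_by_rows(A)
    s = zero(eltype(A))
    for i in 1:size(A, 1), j in 1:size(A, 2)   # j is the innermost loop
        s += A[i, j]
    end
    return s
end

A_test = [1.0 2.0 3.0; 4.0 5.0 6.0]
sum_by_columns(A_test) == sum_by_rows(A_test)   # same result, different memory access pattern
```

To compare the two on a large matrix, one could run `@benchmark sum_by_columns($A)` and `@benchmark sum_by_rows($A)` with BenchmarkTools; the size of the gap depends on the matrix size and the machine.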

* Take matrix multiplication as an example 
$$ 
\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}, \quad \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{p \times n}, \mathbf{C} \in \mathbb{R}^{m \times n}.
$$
Assume the storage is column-major, as in Julia. There are 6 variants of the algorithm, one for each ordering of the triple loop.
    - `jki` or `kji` looping:
        ```julia
        # inner most loop
        for i = 1:m
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
        ```  
    - `ikj` or `kij` looping:
        ```julia
        # inner most loop        
        for j = 1:n
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
        ```  
    - `ijk` or `jik` looping:
        ```julia
        # inner most loop        
        for k = 1:p
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
        ```
* We pay attention to the innermost loop, where the vector calculation occurs. The associated **stride** when accessing the three matrices in memory (assuming column-major storage) is  

| Variant        | A Stride | B Stride | C Stride |
|----------------|----------|----------|----------|
| $jki$ or $kji$ | Unit     | 0        | Unit     |
| $ikj$ or $kij$ | 0        | Non-Unit | Non-Unit |
| $ijk$ or $jik$ | Non-Unit | Unit     | 0        |

The variants $jki$ and $kji$ are therefore preferred: their innermost loop accesses both $\mathbf{A}$ and $\mathbf{C}$ with unit stride.

In [22]:
"""
    matmul_by_loop!(A, B, C, order)

Overwrite `C` by `A * B`. `order` indicates the looping order for triple loop.
"""
function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)
    
    m = size(A, 1)
    p = size(A, 2)
    n = size(B, 2)
    fill!(C, 0)
    
    if order == "jki"
        for j = 1:n, k = 1:p, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kji"
        for k = 1:p, j = 1:n, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ikj"
        for i = 1:m, k = 1:p, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kij"
        for k = 1:p, i = 1:m, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ijk"
        for i = 1:m, j = 1:n, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "jik"
        for j = 1:n, i = 1:m, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
end

srand(123) # seed
m, p, n = 2000, 100, 2000 # A is m × p, B is p × n, C is m × n
A = rand(m, p)
B = rand(p, n)
C = zeros(m, n);

* $jki$ and $kji$ looping:

In [23]:
using BenchmarkTools

@benchmark matmul_by_loop!($A, $B, $C, "jki")

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     447.619 ms (0.00% GC)
  median time:      462.089 ms (0.00% GC)
  mean time:        462.884 ms (0.00% GC)
  maximum time:     474.856 ms (0.00% GC)
  --------------
  samples:          11
  evals/sample:     1

In [24]:
@benchmark matmul_by_loop!($A, $B, $C, "kji")

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     507.482 ms (0.00% GC)
  median time:      520.323 ms (0.00% GC)
  mean time:        522.122 ms (0.00% GC)
  maximum time:     542.848 ms (0.00% GC)
  --------------
  samples:          10
  evals/sample:     1

* $ikj$ and $kij$ looping:

In [25]:
@benchmark matmul_by_loop!($A, $B, $C, "ikj")

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.549 s (0.00% GC)
  median time:      2.560 s (0.00% GC)
  mean time:        2.560 s (0.00% GC)
  maximum time:     2.571 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1

In [26]:
@benchmark matmul_by_loop!($A, $B, $C, "kij")

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.860 s (0.00% GC)
  median time:      2.903 s (0.00% GC)
  mean time:        2.903 s (0.00% GC)
  maximum time:     2.945 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1

* $ijk$ and $jik$ looping:

In [27]:
@benchmark matmul_by_loop!($A, $B, $C, "ijk")

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     970.307 ms (0.00% GC)
  median time:      982.267 ms (0.00% GC)
  mean time:        994.529 ms (0.00% GC)
  maximum time:     1.047 s (0.00% GC)
  --------------
  samples:          6
  evals/sample:     1

In [28]:
@benchmark matmul_by_loop!(A, B, C, "ijk")

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     954.502 ms (0.00% GC)
  median time:      966.731 ms (0.00% GC)
  mean time:        966.520 ms (0.00% GC)
  maximum time:     976.980 ms (0.00% GC)
  --------------
  samples:          6
  evals/sample:     1

* Julia wraps the BLAS library for matrix multiplication. The BLAS routine wins hands down, thanks to multi-threading, SIMD vectorization, and cache blocking (a higher level-3 fraction by block outer products).

In [30]:
@which A_mul_B!(C, A, B)

In [31]:
@benchmark A_mul_B!($C, $A, $B)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.216 ms (0.00% GC)
  median time:      7.340 ms (0.00% GC)
  mean time:        7.486 ms (0.00% GC)
  maximum time:     10.680 ms (0.00% GC)
  --------------
  samples:          667
  evals/sample:     1

In [32]:
# gemm! computes C ← α * A * B + β * C; with α = β = 1.0 each call accumulates A * B into C
@benchmark Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, $A, $B, 1.0, $C)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.849 ms (0.00% GC)
  median time:      5.351 ms (0.00% GC)
  mean time:        5.417 ms (0.00% GC)
  maximum time:     9.311 ms (0.00% GC)
  --------------
  samples:          921
  evals/sample:     1

To appreciate the effort that goes into an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see the [Quora question](https://www.quora.com/What-algorithm-does-BLAS-use-for-matrix-multiplication-Of-all-the-considerations-e-g-cache-popular-instruction-sets-Big-O-etc-which-one-turned-out-to-be-the-primary-bottleneck), especially the [video](https://youtu.be/JzNpKDW07rw). The bottom line is

> **Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.**

## Avoid memory allocation: some examples

### Transposing matrix is expensive

* In R, the command 
    ```R
    t(A) %*% x
    ```
will first transpose `A` and then perform the matrix multiplication, causing unnecessary memory allocation.

* Julia is smart enough to avoid transposing the matrix when possible.

In [33]:
srand(123)

n = 1000
A = rand(n, n)
x = rand(n)

# dispatch to At_mul_B (and then to BLAS)
# does *not* actually transpose the matrix
@benchmark $A' * $x

BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     87.899 μs (0.00% GC)
  median time:      119.331 μs (0.00% GC)
  mean time:        124.014 μs (0.00% GC)
  maximum time:     378.717 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [35]:
@which A' * x

In [36]:
# dispatch to BLAS
@benchmark At_mul_B($A, $x)

BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     88.266 μs (0.00% GC)
  median time:      119.130 μs (0.00% GC)
  mean time:        123.509 μs (0.00% GC)
  maximum time:     375.729 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [37]:
# let's force transpose
@benchmark transpose($A) * $x

BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  3
  --------------
  minimum time:     3.163 ms (0.00% GC)
  median time:      3.388 ms (0.00% GC)
  mean time:        3.884 ms (21.92% GC)
  maximum time:     7.483 ms (41.43% GC)
  --------------
  samples:          1284
  evals/sample:     1

In [38]:
# pre-allocate result
out = zeros(size(A, 2))
@benchmark At_mul_B!($out, $A, $x)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     84.990 μs (0.00% GC)
  median time:      118.686 μs (0.00% GC)
  mean time:        122.021 μs (0.00% GC)
  maximum time:     431.885 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [39]:
using RCall

R"""
library(microbenchmark)
microbenchmark(t($A) %*% $x)
"""

RCall.RObject{RCall.VecSxp}
Unit: milliseconds
                   expr      min       lq     mean  median      uq      max
 t(`#JL`$A) %*% `#JL`$x 5.871699 5.986926 6.970327 6.12627 6.43358 11.90592
 neval
   100


* [Broadcasting](https://docs.julialang.org/en/stable/manual/functions/#man-vectorized-1) in Julia achieves vectorized code without creating intermediate arrays.

In [40]:
srand(123)
X, Y = rand(1000,1000), rand(1000,1000)

# two temporary arrays are created; applying abs directly to an array is deprecated in Julia 0.6
@benchmark max(abs(X), abs(Y))

BenchmarkTools.Trial: 
  memory estimate:  22.92 MiB
  allocs estimate:  240
  --------------
  minimum time:     8.133 ms (21.82% GC)
  median time:      8.801 ms (23.11% GC)
  mean time:        9.076 ms (23.46% GC)
  maximum time:     15.686 ms (12.32% GC)
  --------------
  samples:          551
  evals/sample:     1

In [41]:
# no temporary arrays created
@benchmark max.(abs.(X), abs.(Y))

BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  27
  --------------
  minimum time:     2.721 ms (0.00% GC)
  median time:      2.961 ms (0.00% GC)
  mean time:        3.770 ms (20.88% GC)
  maximum time:     7.652 ms (33.75% GC)
  --------------
  samples:          1325
  evals/sample:     1

In [42]:
# no memory allocation at all!
Z = zeros(X)
@benchmark $Z .= max.(abs.($X), abs.($Y))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.551 ms (0.00% GC)
  median time:      1.634 ms (0.00% GC)
  mean time:        1.719 ms (0.00% GC)
  maximum time:     5.972 ms (0.00% GC)
  --------------
  samples:          2896
  evals/sample:     1
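
Conceptually, the fused in-place broadcast `Z .= max.(abs.(X), abs.(Y))` behaves like a single loop over the elements, which is why no temporary array is needed. A hand-written equivalent is sketched below for illustration only; `fused_max_abs!` is a made-up name, and this is not how Julia's broadcast machinery is implemented internally.

```julia
# One pass over the data, writing each result straight into Z.
function fused_max_abs!(Z, X, Y)
    for i in eachindex(Z, X, Y)
        Z[i] = max(abs(X[i]), abs(Y[i]))
    end
    return Z
end

X_small = [-1.0 2.0; 3.0 -4.0]
Y_small = [0.5 -3.0; -2.0 1.0]
Z_small = similar(X_small)
fused_max_abs!(Z_small, X_small, Y_small)   # Z_small is now [1.0 3.0; 3.0 4.0]
```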

* A [view](https://docs.julialang.org/en/stable/stdlib/arrays/#Base.view) avoids creating an extra copy of matrix data.

In [43]:
srand(123) # seed
A = randn(1000, 1000)

# sum entries in a sub-matrix
@benchmark sum($A[1:2:500, 1:2:500])

BenchmarkTools.Trial: 
  memory estimate:  488.45 KiB
  allocs estimate:  4
  --------------
  minimum time:     72.105 μs (0.00% GC)
  median time:      328.491 μs (0.00% GC)
  mean time:        307.315 μs (15.52% GC)
  maximum time:     3.638 ms (87.48% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [44]:
# view avoids creating a separate sub-matrix
@benchmark sum(@view $A[1:2:500, 1:2:500])

BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  4
  --------------
  minimum time:     90.750 μs (0.00% GC)
  median time:      91.284 μs (0.00% GC)
  mean time:        96.048 μs (0.00% GC)
  maximum time:     246.577 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
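
The difference between a copy and a view is easiest to see by writing to each. In the small sketch below (variable names are chosen only for this example), the indexed slice `S` owns its own memory, while the view `V` refers to the memory of the parent array.

```julia
A_small = [1.0 2.0; 3.0 4.0]

S = A_small[:, 1]         # indexing with a range copies the data
V = view(A_small, :, 1)   # a view refers to the parent's memory

S[1] = 10.0               # changes only the copy
V[2] = 30.0               # writes through to A_small[2, 1]
```

After these two assignments `A_small[1, 1]` is still `1.0` but `A_small[2, 1]` is `30.0`. This is consistent with the benchmarks above, where the view version reports a memory estimate of 176 bytes versus 488.45 KiB for the copying version.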

## Sparse linear algebra

Julia has native support for sparse linear algebra.

Generate a random n-by-p sparse matrix:

In [45]:
srand(123) # seed
n, p = 1000, 500
A = sprandn(n, p, 0.001) # about 0.1% non-zeros

1000×500 SparseMatrixCSC{Float64,Int64} with 489 stored entries:
  [778 ,    4]  =  -0.673863
  [921 ,    5]  =  1.03671
  [92  ,    7]  =  1.15882
  [397 ,    8]  =  -0.00376628
  [798 ,    9]  =  -0.611241
  [408 ,   10]  =  1.00541
  [532 ,   10]  =  0.848104
  [290 ,   11]  =  -0.0717345
  [331 ,   11]  =  -0.911516
  [473 ,   11]  =  1.39056
  ⋮
  [784 ,  492]  =  -0.438378
  [804 ,  495]  =  0.542732
  [567 ,  496]  =  -1.52276
  [638 ,  496]  =  0.0264262
  [58  ,  497]  =  0.962116
  [381 ,  497]  =  -0.15295
  [884 ,  497]  =  0.680199
  [743 ,  498]  =  0.728707
  [382 ,  499]  =  -2.16785
  [413 ,  499]  =  -0.248223
  [816 ,  500]  =  0.508117

For comparison we create the corresponding dense matrix

In [46]:
Afull = full(A)

1000×500 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮

How much memory does sparse A take?

In [47]:
Base.summarysize(A) 

11872

How much memory does dense A take?

In [48]:
Base.summarysize(Afull)

4000000
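
The two numbers can be roughly reproduced by hand. The sketch below assumes the compressed sparse column (CSC) layout with 8-byte `Float64` values and 8-byte `Int64` indices, as indicated by the type `SparseMatrixCSC{Float64,Int64}` printed above. The small remainder relative to the reported 11872 bytes is presumably the fixed overhead of the container object itself.

```julia
n_rows, n_cols, n_stored = 1000, 500, 489   # sizes reported in the output above

# CSC keeps (number of columns + 1) column pointers,
# plus one row index and one value for every stored entry
csc_bytes = 8 * (n_cols + 1) + 8 * n_stored + 8 * n_stored

# dense storage keeps every entry, zero or not
dense_bytes = 8 * n_rows * n_cols

println(csc_bytes)     # 11832
println(dense_bytes)   # 4000000
```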

Matrix-vector multiplication:

In [49]:
b = randn(p)
@benchmark $A * $b # sparse linear algebra

BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     1.806 μs (0.00% GC)
  median time:      2.315 μs (0.00% GC)
  mean time:        3.344 μs (18.81% GC)
  maximum time:     293.115 μs (97.77% GC)
  --------------
  samples:          10000
  evals/sample:     10

In [50]:
@benchmark $Afull * $b # dense linear algebra, i.e., BLAS

BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     33.890 μs (0.00% GC)
  median time:      47.684 μs (0.00% GC)
  mean time:        49.562 μs (0.00% GC)
  maximum time:     1.408 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
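
The sparse product is faster because it only touches the stored entries. A minimal sketch of a matrix-vector product in CSC format is given below. It works on plain arrays named after the `colptr`, `rowval`, and `nzval` fields of `SparseMatrixCSC`; the function `csc_matvec` itself is a hypothetical illustration, not the routine Julia dispatches to.

```julia
# y = A * x for an m-row matrix stored in CSC form.
# Entries of column j sit at positions colptr[j] : colptr[j + 1] - 1.
function csc_matvec(m, colptr, rowval, nzval, x)
    y = zeros(m)
    for j in 1:length(x)
        xj = x[j]
        for k in colptr[j]:(colptr[j + 1] - 1)
            y[rowval[k]] += nzval[k] * xj   # one multiply-add per stored entry
        end
    end
    return y
end

# CSC representation of the 3 × 3 matrix [1 0 2; 0 0 3; 4 0 0]
colptr = [1, 3, 3, 5]
rowval = [1, 3, 1, 2]
nzval  = [1.0, 4.0, 2.0, 3.0]

y_small = csc_matvec(3, colptr, rowval, nzval, [1.0, 2.0, 3.0])   # [7.0, 9.0, 4.0]
```

The work is proportional to the number of stored entries (plus the number of columns), whereas the dense product performs a multiply-add for every one of the $n \times p$ entries.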

Least squares problem:

In [51]:
y = randn(n)
@benchmark $A \ $y # sparse linear algebra

BenchmarkTools.Trial: 
  memory estimate:  202.41 KiB
  allocs estimate:  105
  --------------
  minimum time:     184.378 μs (0.00% GC)
  median time:      191.375 μs (0.00% GC)
  mean time:        217.796 μs (3.65% GC)
  maximum time:     3.864 ms (29.20% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [52]:
@benchmark $Afull \ $y # dense linear algebra, i.e., LAPACK

BenchmarkTools.Trial: 
  memory estimate:  6.13 MiB
  allocs estimate:  2455
  --------------
  minimum time:     19.976 ms (0.00% GC)
  median time:      21.649 ms (0.00% GC)
  mean time:        22.141 ms (2.30% GC)
  maximum time:     27.655 ms (8.12% GC)
  --------------
  samples:          226
  evals/sample:     1

## Iterative methods for linear algebra

Singular value decomposition (SVD)

In [53]:
@benchmark svds(A) # Lanczos iterative method for top singular values/vectors

BenchmarkTools.Trial: 
  memory estimate:  773.61 KiB
  allocs estimate:  2063
  --------------
  minimum time:     11.744 ms (0.00% GC)
  median time:      13.462 ms (0.00% GC)
  mean time:        13.573 ms (0.65% GC)
  maximum time:     17.275 ms (15.70% GC)
  --------------
  samples:          369
  evals/sample:     1

In [54]:
@benchmark svd(Afull) # dense linear algebra, i.e., LAPACK

BenchmarkTools.Trial: 
  memory estimate:  19.14 MiB
  allocs estimate:  20
  --------------
  minimum time:     71.320 ms (0.00% GC)
  median time:      76.612 ms (2.56% GC)
  mean time:        76.644 ms (2.58% GC)
  maximum time:     81.567 ms (4.26% GC)
  --------------
  samples:          66
  evals/sample:     1

The [IterativeSolvers.jl](https://github.com/JuliaMath/IterativeSolvers.jl) package implements many common iterative methods for sparse or more general structured matrices: conjugate gradient (CG), LSQR/LSMR for least squares, and others. Combined with the [LinearMaps.jl](https://github.com/Jutho/LinearMaps.jl) package, it provides a powerful numerical linear algebra engine for large structured arrays.
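
To give a flavor of what such a method looks like, here is a minimal textbook sketch of conjugate gradient for a symmetric positive definite system $\mathbf{A} \mathbf{x} = \mathbf{b}$. It is a hand-written illustration, not the IterativeSolvers.jl implementation or API; the name `cg_sketch` and its keyword arguments are assumptions made for this example. Note that the matrix enters only through products `A * p`, which is why these methods suit sparse and structured matrices.

```julia
# Textbook conjugate gradient, starting from x = 0.
# Hypothetical illustration; assumes A is symmetric positive definite.
function cg_sketch(A, b; tol = 1e-10, maxiter = length(b))
    x = zeros(length(b))
    r = copy(b)               # residual b - A * x for x = 0
    p = copy(r)               # first search direction
    rs = sum(abs2, r)         # squared residual norm
    for iter in 1:maxiter
        Ap = A * p            # the only place A is used
        α = rs / sum(p .* Ap) # step length along p
        x .+= α .* p
        r .-= α .* Ap
        rs_new = sum(abs2, r)
        sqrt(rs_new) < tol && break
        p .= r .+ (rs_new / rs) .* p   # next A-conjugate direction
        rs = rs_new
    end
    return x
end

A_spd = [4.0 1.0; 1.0 3.0]
b_spd = [1.0, 2.0]
x_cg = cg_sketch(A_spd, b_spd)   # approximately [1/11, 7/11]
```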