# A list of "easy" linear systems

Consider $\mathbf{A} \mathbf{x} = \mathbf{b}$, $\mathbf{A} \in \mathbb{R}^{n \times n}$, or, if you prefer, the problem of computing the matrix inverse. $\mathbf{A}$ can be huge. Keep massive data in mind: the 1000 Genomes Project, Netflix, Google PageRank, finance, spatial statistics, ... We should stay alert to the many linear systems that are easy to solve.

Don't blindly use `A \ b` and `inv` in Julia or the `solve` function in R. **Don't waste computing resources on a bad choice of algorithm!**

* Diagonal $\mathbf{A}$: $n$ flops. Use `Diagonal` type of Julia.

In [1]:
using BenchmarkTools

srand(280)
n = 1000
A = diagm(randn(n)) # a full matrix
b = randn(n)

@which A \ b

In [2]:
# check `istril(A)` and `istriu(A)`, then call `Diagonal(A) \ b`
@benchmark A \ b

BenchmarkTools.Trial: 
  memory estimate:  15.97 KiB
  allocs estimate:  5
  --------------
  minimum time:     859.328 μs (0.00% GC)
  median time:      1.069 ms (0.00% GC)
  mean time:        1.139 ms (0.11% GC)
  maximum time:     4.591 ms (60.96% GC)
  --------------
  samples:          4334
  evals/sample:     1

In [3]:
# O(n) computation
@benchmark Diagonal(A) \ b

BenchmarkTools.Trial: 
  memory estimate:  15.98 KiB
  allocs estimate:  6
  --------------
  minimum time:     11.563 μs (0.00% GC)
  median time:      13.614 μs (0.00% GC)
  mean time:        16.139 μs (7.89% GC)
  maximum time:     2.752 ms (98.39% GC)
  --------------
  samples:          10000
  evals/sample:     1

* Bidiagonal, tridiagonal, or banded $\mathbf{A}$: Band LU, band Cholesky, ... roughly $O(n)$ flops for a fixed bandwidth.   
    - Use [`Bidiagonal`](https://docs.julialang.org/en/stable/stdlib/linalg/#Base.Bidiagonal), [`Tridiagonal`](https://docs.julialang.org/en/stable/stdlib/linalg/#Base.Tridiagonal), [`SymTridiagonal`](https://docs.julialang.org/en/stable/stdlib/linalg/#Base.SymTridiagonal) types of Julia.

In [4]:
srand(280) 

n  = 1000
dv = randn(n)
ev = randn(n - 1)
b  = randn(n)
# symmetric tridiagonal matrix
A  = SymTridiagonal(dv, ev)

1000×1000 SymTridiagonal{Float64}:
 0.126238   0.618244    ⋅        …    ⋅          ⋅          ⋅      
 0.618244  -2.34688    1.10206        ⋅          ⋅          ⋅      
  ⋅         1.10206    1.91661        ⋅          ⋅          ⋅      
  ⋅          ⋅        -0.447244       ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅        …    ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅        …    ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
  ⋅          ⋅          ⋅             ⋅          ⋅          ⋅      
 ⋮                               ⋱                                 
  ⋅          

In [5]:
# convert to a full matrix
Afull = full(A)

# LU decomposition (2/3) n^3 flops!
@benchmark Afull \ b

BenchmarkTools.Trial: 
  memory estimate:  7.65 MiB
  allocs estimate:  8
  --------------
  minimum time:     12.675 ms (0.00% GC)
  median time:      14.803 ms (0.00% GC)
  mean time:        16.700 ms (5.18% GC)
  maximum time:     30.224 ms (0.00% GC)
  --------------
  samples:          299
  evals/sample:     1

In [6]:
# specialized algorithm for tridiagonal matrix
@benchmark A \ b

BenchmarkTools.Trial: 
  memory estimate:  23.97 KiB
  allocs estimate:  9
  --------------
  minimum time:     12.805 μs (0.00% GC)
  median time:      15.517 μs (0.00% GC)
  mean time:        18.816 μs (10.29% GC)
  maximum time:     2.552 ms (98.83% GC)
  --------------
  samples:          10000
  evals/sample:     1

* Triangular $\mathbf{A}$: $n^2$ flops.

In [7]:
srand(280)

n = 1000
A = tril(randn(n, n))
b = randn(n)

# check istril() then triangular solve
@benchmark A \ b

BenchmarkTools.Trial: 
  memory estimate:  7.95 KiB
  allocs estimate:  2
  --------------
  minimum time:     792.164 μs (0.00% GC)
  median time:      817.588 μs (0.00% GC)
  mean time:        864.531 μs (0.03% GC)
  maximum time:     3.915 ms (0.00% GC)
  --------------
  samples:          5718
  evals/sample:     1

In [8]:
# triangular solve directly
@benchmark LowerTriangular(A) \ b

BenchmarkTools.Trial: 
  memory estimate:  7.97 KiB
  allocs estimate:  3
  --------------
  minimum time:     335.609 μs (0.00% GC)
  median time:      415.831 μs (0.00% GC)
  mean time:        509.724 μs (0.08% GC)
  maximum time:     3.516 ms (0.00% GC)
  --------------
  samples:          9544
  evals/sample:     1

* Block diagonal: Suppose $n = \sum_i n_i$. Solving block by block costs on the order of $\sum_i n_i^3$ flops, versus $(\sum_i n_i)^3$ when the structure is ignored.  

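Julia has a [`blkdiag`](https://docs.julialang.org/en/stable/stdlib/linalg/?highlight=blkdiag#Base.blkdiag) function, but it assembles the blocks into one ordinary matrix (a sparse matrix in the example below) rather than a dedicated block-diagonal type. Anyone interested in writing a `BlockDiagonal.jl` package?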

In [9]:
srand(280)

B  = 10 # number of blocks
ni = 100
A  = blkdiag([sprandn(ni, ni, 0.01) for b in 1:B]...)

1000×1000 sparse matrix with 969 Float64 nonzero entries:
	[31  ,    1]  =  2.07834
	[53  ,    1]  =  -1.11883
	[58  ,    1]  =  -0.66448
	[14  ,    4]  =  1.11793
	[96  ,    5]  =  1.22813
	[81  ,    8]  =  -0.919643
	[48  ,    9]  =  1.0185
	[49  ,    9]  =  -0.996332
	[15  ,   10]  =  1.30841
	[28  ,   10]  =  -0.818757
	⋮
	[956 ,  987]  =  -0.900804
	[967 ,  987]  =  -0.438788
	[971 ,  991]  =  0.176756
	[929 ,  992]  =  -1.17384
	[974 ,  993]  =  1.59235
	[967 ,  994]  =  0.542169
	[994 ,  995]  =  0.627832
	[998 ,  997]  =  0.60382
	[935 ,  998]  =  0.342675
	[947 ,  998]  =  0.482228
	[975 , 1000]  =  0.991598
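
The block-by-block saving can be sketched directly. The following is a minimal illustration, assuming Julia 1.x (where `LinearAlgebra` is a standard library, unlike the pre-1.0 cells in these notes); `blockdiag_solve` is a hypothetical helper name, not a function from any package, and the dense, diagonally shifted blocks are made up for the demonstration.

```julia
using LinearAlgebra

# Hypothetical helper: solve blockdiag(A_1, ..., A_B) x = b one block at a time.
function blockdiag_solve(blocks::Vector{<:AbstractMatrix}, b::AbstractVector)
    x = similar(b, Float64)
    offset = 0
    for Ai in blocks
        ni  = size(Ai, 1)
        idx = offset+1:offset+ni
        x[idx] = Ai \ b[idx]          # O(ni^3) work for a dense block
        offset += ni
    end
    return x
end

# 10 dense blocks of size 100; the shift keeps each block well conditioned
blocks = [randn(100, 100) + 100I for _ in 1:10]
b = randn(1000)
x = blockdiag_solve(blocks, b)

# assemble the full 1000×1000 matrix only to check the answer
Afull = zeros(1000, 1000)
for (k, Ai) in enumerate(blocks)
    idx = (k-1)*100+1:k*100
    Afull[idx, idx] = Ai
end
norm(Afull * x - b)                   # should be at roundoff level
```

Each block is factorized independently, so the blocks could also be processed in parallel.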

* Kronecker product. 
$$
\begin{eqnarray*}
    (\mathbf{A} \otimes \mathbf{B})^{-1} &=& \mathbf{A}^{-1} \otimes \mathbf{B}^{-1} \\
    (\mathbf{C}^T \otimes \mathbf{A}) \text{vec}(\mathbf{B}) &=& \text{vec}(\mathbf{A} \mathbf{B} \mathbf{C}).
\end{eqnarray*}    
$$
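
The second identity turns a Kronecker-structured system into two small solves: if $(\mathbf{C}^T \otimes \mathbf{A}) \text{vec}(\mathbf{X}) = \text{vec}(\mathbf{B})$, then $\mathbf{A} \mathbf{X} \mathbf{C} = \mathbf{B}$ and $\mathbf{X} = \mathbf{A}^{-1} \mathbf{B} \mathbf{C}^{-1}$. A minimal sketch, assuming Julia 1.x and made-up, well-conditioned test matrices (the sizes `m`, `n` and the unknown `X` are illustrative names):

```julia
using LinearAlgebra

m, n = 30, 20
A = randn(m, m) + m * I      # shifted to keep it well conditioned
C = randn(n, n) + n * I
B = randn(m, n)

# dense Kronecker product, formed here only to check the results
K = kron(permutedims(C), A)              # C^T ⊗ A, size mn × mn

# identity: (C^T ⊗ A) vec(B) = vec(A B C)
err_identity = norm(K * vec(B) - vec(A * B * C))

# solve (C^T ⊗ A) vec(X) = vec(B) without ever using K
X = (A \ B) / C                          # X = A^{-1} B C^{-1}
err_solve = norm(K * vec(X) - vec(B))
```

The structured solve costs on the order of $m^3 + n^3$ flops for the two factorizations, instead of $(mn)^3$ for a dense solve with the Kronecker product.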

* Sparsity: sparse matrix decomposition, or iterative method.  
    - Sparsity is the structure that is easiest to recognize. Familiarize yourself with the sparse matrix computation tools in Julia, Matlab, R (`Matrix` package), MKL (sparse BLAS), ... as much as possible.

In [10]:
srand(280)

n = 1000
# a sparse pd matrix, about 0.5% non-zero entries
A = sprand(n, n, 0.002)
A = A + A' + n * I
b = randn(n)
Afull = full(A)
countnz(A) / length(A)

0.005096

In [11]:
# dense matrix-vector multiplication
@benchmark Afull * b

BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     99.436 μs (0.00% GC)
  median time:      125.809 μs (0.00% GC)
  mean time:        178.717 μs (0.00% GC)
  maximum time:     1.545 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [12]:
# sparse matrix-vector multiplication
@benchmark A * b

BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     13.882 μs (0.00% GC)
  median time:      14.539 μs (0.00% GC)
  mean time:        16.012 μs (0.00% GC)
  maximum time:     696.466 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [13]:
# dense Cholesky decomposition
@benchmark cholfact(Afull)

BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  8
  --------------
  minimum time:     9.152 ms (0.00% GC)
  median time:      9.763 ms (0.00% GC)
  mean time:        10.992 ms (9.69% GC)
  maximum time:     31.831 ms (22.33% GC)
  --------------
  samples:          453
  evals/sample:     1

In [14]:
# sparse Cholesky decomposition
@benchmark cholfact(A)

BenchmarkTools.Trial: 
  memory estimate:  1.33 MiB
  allocs estimate:  53
  --------------
  minimum time:     3.333 ms (0.00% GC)
  median time:      3.369 ms (0.00% GC)
  mean time:        3.463 ms (0.71% GC)
  maximum time:     5.354 ms (9.13% GC)
  --------------
  samples:          1439
  evals/sample:     1

In [15]:
# solve via dense Cholesky
xchol = cholfact(Afull) \ b
@benchmark cholfact(Afull) \ b

BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  10
  --------------
  minimum time:     10.030 ms (0.00% GC)
  median time:      10.230 ms (0.00% GC)
  mean time:        10.654 ms (7.19% GC)
  maximum time:     14.153 ms (0.00% GC)
  --------------
  samples:          468
  evals/sample:     1

In [16]:
# solve via sparse Cholesky
xcholsp = cholfact(A) \ b
vecnorm(xchol - xcholsp)

8.777424222872659e-18

In [17]:
@benchmark cholfact(A) \ b

BenchmarkTools.Trial: 
  memory estimate:  1.36 MiB
  allocs estimate:  65
  --------------
  minimum time:     3.832 ms (0.00% GC)
  median time:      4.244 ms (0.00% GC)
  mean time:        4.248 ms (0.59% GC)
  maximum time:     5.630 ms (7.94% GC)
  --------------
  samples:          1171
  evals/sample:     1

In [18]:
# sparse solve via conjugate gradient
using IterativeSolvers

xcg, = cg(A, b)
vecnorm(xcg - xchol)

1.743452494241876e-16

In [19]:
@benchmark cg(A, b)

BenchmarkTools.Trial: 
  memory estimate:  262.66 KiB
  allocs estimate:  44
  --------------
  minimum time:     121.925 μs (0.00% GC)
  median time:      136.441 μs (0.00% GC)
  mean time:        175.594 μs (14.13% GC)
  maximum time:     2.878 ms (91.52% GC)
  --------------
  samples:          10000
  evals/sample:     1

* Easy plus low rank: $\mathbf{U} \in \mathbb{R}^{n \times r}$, $\mathbf{V} \in \mathbb{R}^{n \times r}$, $r \ll n$. Woodbury formula
$$
	(\mathbf{A} + \mathbf{U} \mathbf{V}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{I}_r + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1}.
$$

    * Keep HW2 Q2(2) in mind.  
    * [`WoodburyMatrices.jl`](https://github.com/timholy/WoodburyMatrices.jl) package can be useful.

In [20]:
using BenchmarkTools, WoodburyMatrices

srand(280)
n = 1000
r = 5

A = Diagonal(rand(n))
B = randn(n, r)
D = Diagonal(rand(r))
b = randn(n)
# W = A + B*D*B'
W = SymWoodbury(A, B, D)
Wfull = Symmetric(full(W))

1000×1000 Symmetric{Float64,Array{Float64,2}}:
  1.8571     0.513107    0.872146   …   0.764278    -0.241331    0.54921 
  0.513107   4.57505    -0.636972      -1.86465     -1.92237    -1.72569 
  0.872146  -0.636972    4.81387        1.99357      1.99337     3.66327 
 -0.516414  -0.996711   -0.0919924      0.262832     0.612402    0.621834
  0.193686   1.68244    -0.770028      -0.723437    -1.4868     -1.32247 
  1.6567     0.0634435  -0.901968   …  -0.241872    -0.0356772  -0.39826 
  0.553996  -0.274515    2.21265        0.219437     2.20382     2.60902 
  0.402356   1.89288    -1.13032       -0.771441    -1.96862    -1.93483 
 -1.07744   -1.63881     1.78016        0.96551      1.7292      1.91326 
 -2.21617   -2.90695    -2.55971       -0.47867      0.855389   -0.933916
  1.29975    0.779828    4.12459    …   1.87358      0.737112    2.84136 
 -0.80833    1.44882     1.67581       -0.139063    -0.107873    0.818132
 -2.32469   -4.83109    -2.31796       -0.0346402    2.65564     

In [21]:
# solve via Cholesky
@benchmark cholfact(Wfull) \ b

BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  9
  --------------
  minimum time:     8.654 ms (0.00% GC)
  median time:      8.952 ms (0.00% GC)
  mean time:        9.344 ms (8.24% GC)
  maximum time:     12.436 ms (19.91% GC)
  --------------
  samples:          533
  evals/sample:     1

In [22]:
# use Woodbury formula
@benchmark W \ b

BenchmarkTools.Trial: 
  memory estimate:  76.19 KiB
  allocs estimate:  27
  --------------
  minimum time:     19.506 μs (0.00% GC)
  median time:      23.607 μs (0.00% GC)
  mean time:        31.046 μs (15.39% GC)
  maximum time:     2.800 ms (98.21% GC)
  --------------
  samples:          10000
  evals/sample:     1
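
The Woodbury formula can also be applied by hand, which makes the cost visible: a few solves with the easy matrix $\mathbf{A}$ plus one $r \times r$ solve. This is a minimal sketch assuming Julia 1.x; `woodbury_solve` is a hypothetical helper name, and the example takes $\mathbf{V} = \mathbf{U}$ with a positive diagonal $\mathbf{A}$ only so that the test matrix is guaranteed to be pd.

```julia
using LinearAlgebra

# Hypothetical helper: solve (A + U*V') x = b, given that A \ (...) is cheap.
function woodbury_solve(A, U::AbstractMatrix, V::AbstractMatrix, b::AbstractVector)
    Ainv_b = A \ b                  # one easy solve
    Ainv_U = A \ U                  # r easy solves
    S = I + V' * Ainv_U             # r×r matrix  I_r + V' A^{-1} U
    return Ainv_b - Ainv_U * (S \ (V' * Ainv_b))
end

n, r = 1000, 5
A = Diagonal(1 .+ rand(n))          # easy part: positive diagonal
U = randn(n, r)
V = U                               # symmetric case, so A + U*V' is pd
b = randn(n)

x = woodbury_solve(A, U, V, b)
norm((Matrix(A) + U * V') * x - b)  # residual against the dense matrix
```

With a diagonal $\mathbf{A}$ the whole solve takes $O(n r^2)$ flops, and the dense $n \times n$ matrix is formed only for the final check.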

* Easy plus border: For $\mathbf{A}$ pd and $\mathbf{V}$ full row rank,
$$
	\begin{pmatrix}
	\mathbf{A} & \mathbf{V}^T \\
	\mathbf{V} & \mathbf{0}
	\end{pmatrix}^{-1} = \begin{pmatrix}
	\mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{V}^T (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \mathbf{V} \mathbf{A}^{-1} & \mathbf{A}^{-1} \mathbf{V}^T (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \\
	(\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1} \mathbf{V} \mathbf{A}^{-1} & - (\mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T)^{-1}
	\end{pmatrix}.
$$
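
In practice the bordered system is solved by block elimination rather than by forming the inverse. Writing the system as $\mathbf{A} \mathbf{x} + \mathbf{V}^T \mathbf{y} = \mathbf{b}$, $\mathbf{V} \mathbf{x} = \mathbf{c}$ and setting $\mathbf{S} = \mathbf{V} \mathbf{A}^{-1} \mathbf{V}^T$ gives $\mathbf{y} = \mathbf{S}^{-1} (\mathbf{V} \mathbf{A}^{-1} \mathbf{b} - \mathbf{c})$ and $\mathbf{x} = \mathbf{A}^{-1} (\mathbf{b} - \mathbf{V}^T \mathbf{y})$. A minimal sketch, assuming Julia 1.x; `bordered_solve`, the right-hand sides `b`, `c`, and the unknowns `x`, `y` are illustrative names.

```julia
using LinearAlgebra

# Hypothetical helper: solve [A V'; V 0] [x; y] = [b; c] by block elimination.
function bordered_solve(A, V::AbstractMatrix, b::AbstractVector, c::AbstractVector)
    Ainv_b  = A \ b
    Ainv_Vt = A \ Matrix(V')        # n×m: one easy solve per row of V
    S = V * Ainv_Vt                 # m×m, pd when A is pd and V has full row rank
    y = cholesky(Symmetric(S)) \ (V * Ainv_b - c)
    x = Ainv_b - Ainv_Vt * y
    return x, y
end

n, m = 1000, 3
A = Diagonal(1 .+ rand(n))          # easy pd part
V = randn(m, n)                     # full row rank with probability one
b, c = randn(n), randn(m)

x, y = bordered_solve(A, V, b, c)

# dense bordered matrix, formed only to check the residual
K = [Matrix(A) V'; V zeros(m, m)]
norm(K * [x; y] - [b; c])
```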

* Orthogonal $\mathbf{A}$: $n^2$ flops **at most**, since $\mathbf{A}^{-1} = \mathbf{A}^T$. Permutation matrices, Householder matrices, Jacobi (Givens) rotations, ... take less.
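
A short sketch of the three cases, assuming Julia 1.x (with the `LinearAlgebra` and `Random` standard libraries); the variable names are illustrative. A dense orthogonal matrix needs one matrix-vector product, a Householder reflector needs $O(n)$ work because it is its own inverse, and a permutation needs only indexing.

```julia
using LinearAlgebra, Random

n = 1000
b = randn(n)

# dense orthogonal Q: the solve is a single product with the transpose
Q = Matrix(qr(randn(n, n)).Q)
x = Q' * b                                   # O(n^2) work
err_orth = norm(Q * x - b)

# Householder reflector H = I - 2uu'/(u'u): H is symmetric and H*H = I
u  = randn(n)
xh = b - (2 * dot(u, b) / dot(u, u)) * u     # H \ b = H*b in O(n) work
H  = I - (2 / dot(u, u)) * (u * u')          # dense H, formed only to check
err_house = norm(H * xh - b)

# permutation P with (P*x)[i] = x[p[i]]: solving P*x = b is pure indexing
p  = randperm(n)
xp = b[invperm(p)]
xp[p] == b                                   # true: the permuted solution reproduces b
```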

* Toeplitz systems:
\begin{eqnarray*}
	\mathbf{T} = \begin{pmatrix}
	r_0 & r_1 & r_2 & r_3 \\
	r_{-1} & r_0 & r_1 & r_2 \\
	r_{-2} & r_{-1} & r_0 & r_1 \\
	r_{-3} & r_{-2} & r_{-1} & r_0
	\end{pmatrix}.
\end{eqnarray*}
$\mathbf{T} \mathbf{x} = \mathbf{b}$, where $\mathbf{T}$ is pd and Toeplitz, can be solved in $O(n^2)$ flops. The relevant algorithms are the Durbin algorithm (Yule-Walker equations), the Levinson algorithm (general $\mathbf{b}$), and the Trench algorithm (inverse). These matrices occur in auto-regressive models and econometrics.

    * [`ToeplitzMatrices.jl`](https://github.com/JuliaMatrices/ToeplitzMatrices.jl) package can be useful.
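
To show where the $O(n^2)$ count comes from, here is a minimal sketch of a Levinson-type recursion for a symmetric pd Toeplitz system, assuming Julia 1.x. The function name `levinson` and its interface are hypothetical (this is not the API of `ToeplitzMatrices.jl`); `r` holds the first column $(r_0, \ldots, r_{n-1})$, and the test matrix uses $r_k = 0.5^k$, the autocorrelation pattern of an AR(1) process, as a made-up example.

```julia
using LinearAlgebra

# Hypothetical helper: solve T x = b, where T[i, j] = r[|i - j| + 1] is symmetric pd.
# Each pass extends the solution of the leading k×k system to order k + 1.
function levinson(r::AbstractVector, b::AbstractVector)
    n = length(b)
    x = zeros(n)                    # solution of the leading system
    y = zeros(n)                    # Yule-Walker solution: T_k y = -(r_1, ..., r_k)
    x[1] = b[1] / r[1]
    n == 1 && return x
    y[1] = -r[2] / r[1]
    for k in 1:n-1
        rk = view(r, 2:k+1)                          # (r_1, ..., r_k)
        β  = r[1] + dot(rk, view(y, 1:k))
        # extend x from order k to order k + 1
        μ  = (b[k+1] - dot(rk, view(x, k:-1:1))) / β
        x[1:k] .+= μ .* y[k:-1:1]
        x[k+1]  = μ
        # extend y the same way (not needed after the last pass)
        if k < n - 1
            α = -(r[k+2] + dot(rk, view(y, k:-1:1))) / β
            y[1:k] .+= α .* y[k:-1:1]
            y[k+1]  = α
        end
    end
    return x
end

n = 200
r = 0.5 .^ (0:n-1)                                   # r_k = 0.5^k
T = [r[abs(i - j) + 1] for i in 1:n, j in 1:n]       # dense T, only for checking
b = randn(n)
x = levinson(r, b)
norm(T * x - b)
```

Every pass does $O(k)$ work, so the total is $O(n^2)$ flops and only the first column of $\mathbf{T}$ is ever stored.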

* Circulant systems: Toeplitz matrix with wraparound
$$
	C(\mathbf{z}) = \begin{pmatrix}
	z_0 & z_4 & z_3 & z_2 & z_1 \\
	z_1 & z_0 & z_4 & z_3 & z_2 \\
	z_2 & z_1 & z_0 & z_4 & z_3 \\
	z_3 & z_2 & z_1 & z_0 & z_4 \\
	z_4 & z_3 & z_2 & z_1 & z_0
	\end{pmatrix},
$$
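Circulant matrices are diagonalized by the discrete Fourier transform, so these systems can be solved in $O(n \log n)$ flops with FFT-type algorithms. Related transforms include the DCT (discrete cosine transform) and DST (discrete sine transform).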
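
Since $C(\mathbf{z}) \mathbf{x}$ is the circular convolution of $\mathbf{z}$ and $\mathbf{x}$, the discrete Fourier transform turns the system into an elementwise division. A minimal sketch, assuming Julia 1.x and the external `FFTW.jl` package (an assumption; any FFT implementation would do), with a made-up first column whose large leading entry keeps $C(\mathbf{z})$ nonsingular:

```julia
using FFTW, LinearAlgebra

n = 16
z = [2.0n; randn(n - 1)]                 # first column of C(z), dominant z_0
b = randn(n)

# O(n log n) solve: fft(C*x) = fft(z) .* fft(x)
x = real(ifft(fft(b) ./ fft(z)))

# dense circulant matrix C[i, j] = z_{(i - j) mod n}, formed only to check
C = [z[mod(i - j, n) + 1] for i in 1:n, j in 1:n]
norm(C * x - b)
```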

* Vandermonde matrix: such as in interpolation and approximation problems
$$
	\mathbf{V}(x_0,\ldots,x_n) = \begin{pmatrix}
	1 & 1 & \cdots & 1 \\
	x_0 & x_1 & \cdots & x_n \\
	\vdots & \vdots & & \vdots \\
	x_0^n & x_1^n & \cdots & x_n^n
	\end{pmatrix}.
$$
$\mathbf{V} \mathbf{x} = \mathbf{b}$ or $\mathbf{V}^T \mathbf{x} = \mathbf{b}$ can be solved in $O(n^2)$ flops.
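
For the transposed system $\mathbf{V}^T \mathbf{a} = \mathbf{f}$, which is polynomial interpolation ($\sum_i a_i x_j^i = f_j$), one $O(n^2)$ approach is to compute Newton divided differences and then expand them into monomial coefficients (the Björck-Pereyra scheme). A minimal sketch, assuming Julia 1.x and distinct nodes; the function name `vandermonde_transpose_solve` and the small test polynomial are hypothetical.

```julia
# Hypothetical helper: solve V' a = f, i.e. find coefficients a_0, ..., a_n
# of the polynomial p with p(x[j]) = f[j], assuming the nodes x are distinct.
function vandermonde_transpose_solve(x::AbstractVector, f::AbstractVector)
    m = length(x)
    a = Vector{Float64}(f)           # work on a copy of f
    # stage 1: divided differences (Newton form coefficients)
    for k in 1:m-1
        for i in m:-1:k+1
            a[i] = (a[i] - a[i-1]) / (x[i] - x[i-k])
        end
    end
    # stage 2: expand the Newton form into monomial coefficients
    for k in m-1:-1:1
        for i in k:m-1
            a[i] -= a[i+1] * x[k]
        end
    end
    return a
end

x = [0.0, 1.0, 2.0, 3.0]
a_true = [1.0, -2.0, 0.5, 3.0]                        # p(t) = 1 - 2t + 0.5t^2 + 3t^3
f = [sum(a_true[i+1] * xj^i for i in 0:3) for xj in x]
a = vandermonde_transpose_solve(x, f)
maximum(abs.(a - a_true))                             # should be at roundoff level
```

Both stages are double loops over the nodes, which gives the $O(n^2)$ count, and the Vandermonde matrix itself is never formed.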

* Cauchy-like matrices:
$$
	\Omega \mathbf{A} - \mathbf{A} \Lambda = \mathbf{R} \mathbf{S}^T,
$$
where $\Omega = \text{diag}(\omega_1,\ldots,\omega_n)$ and $\Lambda = \text{diag}(\lambda_1,\ldots, \lambda_n)$. $O(n^2)$ flops for LU and QR.
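
The simplest instance is the Cauchy matrix $a_{ij} = 1 / (\omega_i - \lambda_j)$, for which the right-hand side of the displacement equation is the rank-one matrix of all ones ($\mathbf{R} = \mathbf{S} = \mathbf{1}$). A small numerical check, assuming Julia 1.x and made-up nodes:

```julia
using LinearAlgebra

n = 6
ω = collect(1.0:n)                  # ω_i = 1, 2, ..., n
λ = ω .- 0.5                        # distinct from every ω_i
A = [1 / (ω[i] - λ[j]) for i in 1:n, j in 1:n]   # Cauchy matrix

# displacement Ω*A - A*Λ: entry (i, j) is (ω_i - λ_j) * a_ij = 1
D = Diagonal(ω) * A - A * Diagonal(λ)
norm(D - ones(n, n))                # roundoff level
```

Fast algorithms work with the low-rank factors $\mathbf{R}$ and $\mathbf{S}$ (plus the diagonals) instead of the $n^2$ entries of $\mathbf{A}$.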

* Structured-rank problems: semiseparable matrices (LU and QR take $O(n)$ flops), quasiseparable matrices, ...