
# Nonnegative Matrix Factorization - Put It Together

In this session, we use the techniques learned so far (numerical linear algebra, profiling, GPU computing) to implement an algorithm for nonnegative matrix factorization (NNMF).

In [None]:
versioninfo()

## Nonnegative Matrix Factorization

Nonnegative matrix factorization (NNMF) was introduced by [Lee and Seung (1999)](https://www.nature.com/articles/44565) as an analog of principal components and vector quantization with applications in data compression and clustering.

<img src="./nnmf.png" width="500" align="center"/>

In genomics, we can also use NNMF to decompose a genotype matrix $\mathbf{X}$ (people-by-SNPs) as a product of $\mathbf{V}$ (each row gives the population admixture proportions of one person) and $\mathbf{W}$ (each column gives the population allele frequencies at a specific SNP).

In mathematical terms, one approximates a data matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ with nonnegative entries $x_{ij}$ by a product of two low-rank matrices $\mathbf{V} \in \mathbb{R}^{m \times r}$ and $\mathbf{W} \in \mathbb{R}^{r \times n}$ with nonnegative entries $v_{ik}$ and $w_{kj}$. Consider minimization of the squared Frobenius norm
$$
	L(\mathbf{V}, \mathbf{W}) = \|\mathbf{X} - \mathbf{V} \mathbf{W}\|_{\text{F}}^2 = \sum_i \sum_j \left(x_{ij} - \sum_k v_{ik} w_{kj} \right)^2, \quad v_{ik} \ge 0, w_{kj} \ge 0,
$$
which should lead to a good factorization. [Lee and Seung (1999)](https://www.nature.com/articles/44565) present an iterative algorithm with updates
$$
	v_{ik}^{(t+1)} = v_{ik}^{(t)} \frac{\sum_j x_{ij} w_{kj}^{(t)}}{\sum_j b_{ij}^{(t)} w_{kj}^{(t)}}, \quad \text{where } b_{ij}^{(t)} = \sum_k v_{ik}^{(t)} w_{kj}^{(t)},
$$
$$
	w_{kj}^{(t+1)} = w_{kj}^{(t)} \frac{\sum_i x_{ij} v_{ik}^{(t+1)}}{\sum_i b_{ij}^{(t+1/2)} v_{ik}^{(t+1)}}, \quad \text{where } b_{ij}^{(t+1/2)} = \sum_k v_{ik}^{(t+1)} w_{kj}^{(t)}, 
$$
that drive the objective $L^{(t)} = L(\mathbf{V}^{(t)}, \mathbf{W}^{(t)})$ downhill. The superscript $t$ indicates the iteration number. In matrix notation, the updates are
$$
    V \gets V \,\, .* \,\, (X * W^T) \,\, ./ \,\, (V * W * W^T)
$$
$$
    W \gets W \,\, .* \,\, (V^T * X) \,\, ./ \,\, (V^T * V * W),
$$
where $.*$ and $./$ are elementwise multiplication and elementwise division respectively.
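
As a sanity check on the matrix form, the following sketch (not part of the original session) applies the elementwise update for $v_{ik}$ to a small deterministic example and compares the result with `V .* (X * W') ./ (V * W * W')`. The toy sizes, the variable names ending in `toy`, and the helper `update_V_elementwise` are assumptions made for illustration. Only Base functions are used, so the sketch should not depend on the Julia version.

```julia
# Elementwise update for V, written directly from the summation formula.
# `update_V_elementwise` is a hypothetical helper used only for this check.
function update_V_elementwise(X, V, W)
    m, n = size(X)
    r = size(V, 2)
    B = V * W                          # b_ij = sum_k v_ik * w_kj
    Vnew = similar(V)
    for i in 1:m, k in 1:r
        num = 0.0
        den = 0.0
        for j in 1:n
            num += X[i, j] * W[k, j]   # sum_j x_ij * w_kj
            den += B[i, j] * W[k, j]   # sum_j b_ij * w_kj
        end
        Vnew[i, k] = V[i, k] * num / den
    end
    return Vnew
end

# small positive matrices with arbitrary deterministic entries
Xtoy = [1.0 + mod(i * j, 7) for i in 1:6, j in 1:5]      # 6-by-5
Vtoy = [1.0 + 0.1 * (i + k) for i in 1:6, k in 1:2]      # 6-by-2
Wtoy = [1.0 + 0.2 * (k + j) for k in 1:2, j in 1:5]      # 2-by-5

V_elem = update_V_elementwise(Xtoy, Vtoy, Wtoy)
V_mat  = Vtoy .* (Xtoy * Wtoy') ./ (Vtoy * Wtoy * Wtoy')

# largest entrywise discrepancy between the two forms
maxdiff = maximum(abs.(V_elem .- V_mat))
```

The two results should agree up to floating-point rounding. The same kind of check applies to the update for `W`.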

## Generate an artificial data set

We generate a random `X` and aim to fit a rank `r` NNMF.

In [None]:
srand(123) # seed

m, n, r = 2048, 1024, 64
X = rand(m, n)

## Step 1: Prototype code

First let's get something that works.

In [None]:
function nnmf1(
    X::Matrix{T},
    r::Integer,
    maxiter::Integer = 1000,
    tolfun::T = 1e-4,
    V::Matrix{T} = rand(T, size(X, 1), r),
    W::Matrix{T} = rand(T, r, size(X, 2))
    ) where T <: AbstractFloat
    
    # dimensions
    m, n = size(X)
    obj = vecnorm(X - V * W)^2
    # MM loop
    for iter in 1:maxiter
        V = V .* (X * W') ./ (V * W * W')
        W = W .* (V' * X) ./ (V' * V * W)
        # convergence check
        objold = obj
        obj = vecnorm(X - V * W)^2
        abs(obj - objold) < tolfun * (abs(objold) + 1) && break
    end
    # output
    return V, W
    
end

For consistent timing, we set `tolfun` to 0, use a starting point of all ones for `V` and `W`, and let the algorithm run for 100 iterations.

In [None]:
V0, W0 = ones(m, r), ones(r, n) # start point
nnmf1(X, r, 10, 0.0, V0, W0) # warm-up
fill!(V0, 1)
fill!(W0, 1)
@time nnmf1(X, r, 100, 0.0, V0, W0)

Make sure our code is type stable.

In [None]:
@code_warntype nnmf1(X, r, 10, 0.0, V0, W0)

Profile to identify the computational bottleneck:

In [None]:
fill!(V0, 1)
fill!(W0, 1)
Profile.clear()
@profile nnmf1(X, r, 100, 0.0, V0, W0)
Profile.print(format=:flat)

## Step 2: Flop count

Considering the shapes of `V` and `W`, we realize that `V * W * W'` and `V * (W * W')` have different flop counts. `V * W * W'` has $4mnr$ flops, while `V * (W * W')` needs only $2(m + n)r^2$ flops. This makes a big difference if $r \ll m, n$.
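
To make the difference concrete, the sketch below evaluates both flop counts for the dimensions used in this session and checks on a small example that the two parenthesizations give the same matrix up to rounding. The helper names `flops_left` and `flops_right` and the toy matrices are assumptions for illustration.

```julia
# flop counts taken from the formulas above (hypothetical helper names)
flops_left(m, n, r)  = 4 * m * n * r        # (V * W) * W'
flops_right(m, n, r) = 2 * (m + n) * r^2    # V * (W * W')

fl = flops_left(2048, 1024, 64)     # 536_870_912
fr = flops_right(2048, 1024, 64)    # 25_165_824
ratio = fl / fr                     # roughly 21

# both orders of evaluation compute the same product
Vtoy = [1.0 + 0.1 * (i + k) for i in 1:8, k in 1:2]
Wtoy = [1.0 + 0.2 * (k + j) for k in 1:2, j in 1:6]
left  = Vtoy * Wtoy * Wtoy'
right = Vtoy * (Wtoy * Wtoy')
maxdiff = maximum(abs.(left .- right))
```

For $m = 2048$, $n = 1024$, $r = 64$, the reordered product needs about 21 times fewer flops for this term.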

In [None]:
function nnmf2(
    X::Matrix{T},
    r::Integer,
    maxiter::Integer = 1000,
    tolfun::T = 1e-4,
    V::Matrix{T} = rand(T, size(X, 1), r),
    W::Matrix{T} = rand(T, r, size(X, 2))
    ) where T <: AbstractFloat
    
    # dimensions
    m, n = size(X)
    obj = vecnorm(X - V * W)^2
    # MM loop
    for iter in 1:maxiter
        V = V .* (X * W') ./ (V * (W * W'))
        W = W .* (V' * X) ./ ((V' * V) * W)
        # convergence check
        objold = obj
        obj = vecnorm(X - V * W)^2
        abs(obj - objold) < tolfun * (abs(objold) + 1) && break
    end
    # output
    return V, W
    
end

We see an immediate improvement in run time, memory allocation, and GC time.

In [None]:
nnmf2(X, r, 10, 0.0, V0, W0) # warm-up
fill!(V0, 1)
fill!(W0, 1)
@time nnmf2(X, r, 100, 0.0, V0, W0)

Profile again. We see significant improvement in line 14.

In [None]:
fill!(V0, 1)
fill!(W0, 1)
Profile.clear()
@profile nnmf2(X, r, 100, 0.0, V0, W0)
Profile.print(format=:flat)

## Step 3: Memory management

The excessive memory allocation and garbage collection (GC) overhead are worrisome. We observe

1. `X * W'` and `W * W'` will actually transpose `W`, causing unnecessary memory allocation.

2. Intermediate arrays (`W * W'`, `V' * V`, `V * W`, etc.) should be pre-allocated and re-used in the loop.
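
The effect of the second point can be seen on a single elementwise step. The sketch below compares an update that builds a new array with one that writes into the existing array via `.=`, using `@allocated` to report the bytes allocated by each call. The function names `scale_alloc` and `scale_inplace!` and the toy arrays are assumptions for illustration. Exact byte counts vary by machine and Julia version.

```julia
# allocating version: builds a new m-by-r array for the result
scale_alloc(V, S) = V ./ S

# in-place version: the fused broadcast writes directly into V
function scale_inplace!(V, S)
    V .= V ./ S
    return V
end

Vtoy = ones(512, 64)
Stoy = fill(2.0, 512, 64)

# warm-up calls so that compilation is not counted
scale_alloc(Vtoy, Stoy)
scale_inplace!(copy(Vtoy), Stoy)

bytes_alloc   = @allocated scale_alloc(Vtoy, Stoy)      # at least 512 * 64 * 8 bytes
bytes_inplace = @allocated scale_inplace!(Vtoy, Stoy)   # typically far smaller
```

Inside a loop that runs for 100 iterations, the allocating form pays this cost on every pass, which is what shows up as GC time in the timings above.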

In [None]:
function nnmf3(
    X::Matrix{T},
    r::Integer,
    maxiter::Integer=1000,
    tolfun::T=1e-4,
    V::Matrix{T}=rand(T, size(X, 1), r),
    W::Matrix{T}=rand(T, r, size(X, 2))
    ) where T <: AbstractFloat
    
    # dimensions
    m, n = size(X)
    # pre-allocate arrays
    storageV = similar(V) # m-by-r
    storageW = similar(W) # r-by-n
    storageX = similar(X) # m-by-n
    storageR = zeros(eltype(X), r, r) # r-by-r
    # start point
    A_mul_B!(storageX, V, W)
    storageX .= X .- storageX
    obj = vecnorm(storageX)^2
    # MM loop
    for iter in 1:maxiter
        # V = V .* (X * W') ./ (V * (W * W'))
        A_mul_Bt!(storageR, W, W)
        A_mul_B!(storageV, V, storageR)
        V .= V ./ storageV
        A_mul_Bt!(storageV, X, W)
        V .= V .* storageV
        # W = W .* (V' * X) ./ ((V' * V) * W)
        At_mul_B!(storageR, V, V)
        A_mul_B!(storageW, storageR, W)
        W .= W ./ storageW
        At_mul_B!(storageW, V, X)
        W .= W .* storageW
        # convergence check
        A_mul_B!(storageX, V, W)
        objold = obj
        storageX .= X .- storageX
        obj = vecnorm(storageX)^2
        abs(obj - objold) < tolfun * (abs(objold) + 1) && break
    end
    # output
    return V, W
    
end

Now we see a huge reduction in memory allocation, and GC time is essentially 0.

In [None]:
nnmf3(X, r, 10, 0.0, V0, W0) # warm-up
fill!(V0, 1)
fill!(W0, 1)
@time nnmf3(X, r, 100, 0.0, V0, W0)

Profile again:

In [None]:
fill!(V0, 1)
fill!(W0, 1)
Profile.clear()
@profile nnmf3(X, r, 100, 0.0, V0, W0)
Profile.print(format=:flat)

## Step 4: GPU

**Warning:** this section will not run on the server, which isn't equipped with a GPU.

Let's inspect available GPU resources on my laptop.

In [None]:
using GPUArrays, CLArrays

# check available devices on this machine
mydevices = CLArrays.devices()

During coding, we found two issues:

1. `vecnorm()` doesn't work on GPU.

2. `A_mul_Bt!(storageRd, Wd, Wd)` and `At_mul_B!(storageRd, Vd, Vd)` don't work on GPU.
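
The first issue has a simple workaround: the squared Frobenius norm is the sum of squared entries, so `sum(abs2, A)` can stand in for `vecnorm(A)^2`. A quick CPU check on a toy matrix (the name `Atoy` is illustrative only):

```julia
Atoy = [1.0 -2.0; 3.0 -4.0]

# reduction form, which avoids vecnorm
obj_sum = sum(abs2, Atoy)       # 1 + 4 + 9 + 16

# same quantity written as an explicit sum of squared entries
obj_sq = sum(Atoy .* Atoy)
```

Both expressions evaluate the objective without taking a square root and squaring it again.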

In [None]:
function nnmf4(
    X::Matrix{T},
    r::Integer,
    dev,
    maxiter::Integer=1000,
    tolfun::T = 1e-4,
    V::Matrix{T} = rand(T, size(X, 1), r),
    W::Matrix{T} = rand(T, r, size(X, 2))
    ) where T <: AbstractFloat
    
    # dimensions
    m, n = size(X)
    # initialize device
    ctx = CLArrays.init(dev)
    # transfer X, V, W to device
    Xd, Vd, Wd = CLArray(X), CLArray(V), CLArray(W)
    # pre-allocate arrays on device
    storageVd = zeros(CLArray{T}, m, r)
    storageWd = zeros(CLArray{T}, r, n)
    storageXd = zeros(CLArray{T}, m, n)
    storageRd = zeros(CLArray{T}, r, r)
    # start point
    A_mul_B!(storageXd, Vd, Wd)
    storageXd .= Xd .- storageXd
    # obj = vecnorm(storageXd)^2 # not working on GPU
    obj = sum(abs2, storageXd)
    # MM loop
    for iter in 1:maxiter
        # V = V .* (X * W') ./ (V * (W * W'))
        copy!(storageWd, Wd)
        A_mul_Bt!(storageRd, Wd, storageWd)
        A_mul_B!(storageVd, Vd, storageRd)
        Vd .= Vd ./ storageVd
        A_mul_Bt!(storageVd, Xd, Wd)
        Vd .= Vd .* storageVd
        # W = W .* (V' * X) ./ ((V' * V) * W)
        copy!(storageVd, Vd)
        At_mul_B!(storageRd, Vd, storageVd)
        A_mul_B!(storageWd, storageRd, Wd)
        Wd .= Wd ./ storageWd
        At_mul_B!(storageWd, Vd, Xd)
        Wd .= Wd .* storageWd
        # convergence check
        A_mul_B!(storageXd, Vd, Wd)
        objold = obj
        storageXd .= Xd .- storageXd
        # obj = vecnorm(storageXd)^2
        obj = sum(abs2, storageXd)
        abs(obj - objold) < tolfun * (abs(objold) + 1) && break
    end
    # collect result from GPU
    V = collect(Vd)
    W = collect(Wd)
    # output    
    return V, W
    
end

### Double precision on AMD Radeon Pro 460

In [None]:
V0, W0 = ones(m, r), ones(r, n) # start point
nnmf4(X, r, mydevices[2], 10, 0.0, V0, W0) # warm-up
fill!(V0, 1)
fill!(W0, 1)
@time nnmf4(X, r, mydevices[2], 100, 0.0, V0, W0)

This is slower than the CPU. But we know that GPUs are not good at double precision computation.

### Single precision on AMD Radeon Pro 460

Let's try single precision.

In [None]:
Xsp = Float32.(X)
V0sp = ones(Float32, m, r)
W0sp = ones(Float32, r, n);

In [None]:
nnmf4(Xsp, r, mydevices[2], 10, Float32(0), V0sp, W0sp) # warm-up
fill!(V0sp, 1)
fill!(W0sp, 1)
@time nnmf4(Xsp, r, mydevices[2], 100, Float32(0), V0sp, W0sp)

Single precision gives a slightly different answer from double precision, but the two are close.
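
As a rough CPU-only illustration of the size of this precision effect (not a reproduction of the GPU run), the sketch below evaluates the objective on the same toy data in `Float64` and in `Float32` and computes the relative difference. The toy matrices and their names are assumptions chosen for reproducibility.

```julia
# small deterministic data in double precision
X64 = [1.0 + mod(i * j, 7) for i in 1:64, j in 1:32]
V64 = [1.0 + 0.01 * (i + k) for i in 1:64, k in 1:4]
W64 = [1.0 + 0.02 * (k + j) for k in 1:4, j in 1:32]

# objective in double precision
obj64 = sum(abs2, X64 - V64 * W64)

# same computation after converting every input to single precision
obj32 = sum(abs2, Float32.(X64) - Float32.(V64) * Float32.(W64))

# relative difference between the two objective values
reldiff = abs(obj64 - obj32) / obj64
```

The relative difference should be small compared with the objective itself, although rounding differences can accumulate over many iterations.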

### Single precision on Intel(R) HD Graphics 530

In [None]:
fill!(V0sp, 1)
fill!(W0sp, 1)
@time nnmf4(Xsp, r, mydevices[1], 100, Float32(0), V0sp, W0sp)

The weaker Intel(R) HD Graphics 530 GPU does not yield much speedup for this example.