# Table of Contents
- [1 Nonnegative Matrix Factorization - Put It Together](#Nonnegative-Matrix-Factorization---Put-It-Together)
  - [1.1 Nonnegative Matrix Factorization](#Nonnegative-Matrix-Factorization)
  - [1.2 Generate an artificial data set](#Generate-an-artificial-data-set)
  - [1.3 Step 1: Prototype code](#Step-1:-Prototype-code)
  - [1.4 Step 2: Flop count](#Step-2:-Flop-count)
  - [1.5 Step 3: Memory management](#Step-3:-Memory-management)
  - [1.6 Step 4: GPU](#Step-4:-GPU)
    - [1.6.1 Double precision on AMD Radeon Pro 460](#Double-precision-on-AMD-Radeon-Pro-460)
    - [1.6.2 Single precision on AMD Radeon Pro 460](#Single-precision-on-AMD-Radeon-Pro-460)
    - [1.6.3 Single precision on Intel(R) HD Graphics 530](#Single-precision-on-Intel(R)-HD-Graphics-530)

# Nonnegative Matrix Factorization - Put It Together

In this session, we put together the techniques we have learnt so far (numerical linear algebra, profiling, GPU computing) to implement an algorithm for nonnegative matrix factorization (NNMF).

Machine information:

In [1]:
versioninfo()

Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)


## Nonnegative Matrix Factorization

Nonnegative matrix factorization (NNMF) was introduced by [Lee and Seung (1999)](https://www.nature.com/articles/44565) as an analog of principal components and vector quantization with applications in data compression and clustering.

<img src="./nnmf.png" width="500" align="center"/>

In genomics, we can also use NNMF to decompose a genotype matrix $\mathbf{X}$ (people-by-SNPs) as a product of $\mathbf{V}$ (each row gives a person's population admixture proportions) and $\mathbf{W}$ (each column gives the population allele frequencies at a specific SNP).

In mathematical terms, one approximates a data matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ with nonnegative entries $x_{ij}$ by a product of two low-rank matrices $\mathbf{V} \in \mathbb{R}^{m \times r}$ and $\mathbf{W} \in \mathbb{R}^{r \times n}$ with nonnegative entries $v_{ik}$ and $w_{kj}$. Consider minimization of the squared Frobenius norm
$$
	L(\mathbf{V}, \mathbf{W}) = \|\mathbf{X} - \mathbf{V} \mathbf{W}\|_{\text{F}}^2 = \sum_i \sum_j \left(x_{ij} - \sum_k v_{ik} w_{kj} \right)^2, \quad v_{ik} \ge 0, w_{kj} \ge 0,
$$
which should lead to a good factorization. [Lee and Seung (1999)](https://www.nature.com/articles/44565) present an iterative algorithm with updates
$$
	v_{ik}^{(t+1)} = v_{ik}^{(t)} \frac{\sum_j x_{ij} w_{kj}^{(t)}}{\sum_j b_{ij}^{(t)} w_{kj}^{(t)}}, \quad \text{where } b_{ij}^{(t)} = \sum_k v_{ik}^{(t)} w_{kj}^{(t)},
$$
$$
	w_{kj}^{(t+1)} = w_{kj}^{(t)} \frac{\sum_i x_{ij} v_{ik}^{(t+1)}}{\sum_i b_{ij}^{(t+1/2)} v_{ik}^{(t+1)}}, \quad \text{where } b_{ij}^{(t+1/2)} = \sum_k v_{ik}^{(t+1)} w_{kj}^{(t)}, 
$$
that drive the objective $L^{(t)} = L(\mathbf{V}^{(t)}, \mathbf{W}^{(t)})$ downhill. Superscript $t$ indicates the iteration number. In matrix notation, the updates are
$$
    V \gets V \,\, .* \,\, (X * W^T) \,\, ./ \,\, (V * W * W^T)
$$
$$
    W \gets W \,\, .* \,\, (V^T * X) \,\, ./ \,\, (V^T * V * W),
$$
where $.*$ and $./$ are elementwise multiplication and elementwise division respectively.
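
As a quick sanity check of the matrix-form updates, the sketch below applies one `V` update and one `W` update to a small random problem and compares the objective before and after. It assumes Julia 1.x, where `srand` and `vecnorm` from the Julia 0.6 session shown here have become `Random.seed!` and `norm`; the sizes and the seed are arbitrary illustrative choices.

```julia
using LinearAlgebra, Random

Random.seed!(1)                        # Julia 1.x replacement for srand
X = rand(6, 5)                         # small nonnegative data matrix
V, W = rand(6, 2), rand(2, 5)          # random nonnegative rank-2 start point

obj_before = norm(X - V * W)^2         # squared Frobenius norm
V = V .* (X * W') ./ (V * (W * W'))    # update V with W held fixed
W = W .* (V' * X) ./ ((V' * V) * W)    # update W using the new V
obj_after = norm(X - V * W)^2

println("objective before: ", obj_before, ", after: ", obj_after)
```

Because the updates only multiply and divide nonnegative quantities, `V` and `W` stay nonnegative without any explicit projection step.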

## Generate an artificial data set

We generate a random `X` and aim to fit a rank `r` NNMF.

In [1]:
srand(123) # seed

m, n, r = 2048, 1024, 64
X = rand(m, n)

2048×1024 Array{Float64,2}:
 0.768448   0.033336     0.587894   …  0.961168  0.404655   0.44379  
 0.940515   0.344277     0.428558      0.682964  0.61097    0.894155 
 0.673959   0.976737     0.325033      0.574526  0.40331    0.0594985
 0.395453   0.359901     0.94405       0.348814  0.0813795  0.906431 
 0.313244   0.29952      0.327619      0.247602  0.559101   0.662719 
 0.662555   0.367672     0.233517   …  0.928816  0.817082   0.733205 
 0.586022   0.892923     0.572855      0.893675  0.98278    0.312359 
 0.0521332  0.309707     0.932029      0.392561  0.462704   0.821076 
 0.26864    0.578843     0.576308      0.878799  0.556768   0.592392 
 0.108871   0.181177     0.205463      0.719009  0.7508     0.739252 
 0.163666   0.981504     0.0846958  …  0.906244  0.315583   0.0309041
 0.473017   0.438424     0.905506      0.867153  0.746357   0.271093 
 0.865412   0.543612     0.373122      0.965846  0.496548   0.0407442
 ⋮                                  ⋱                         

## Step 1: Prototype code

First let's get something that works.

In [3]:
function nnmf1(
    X::Matrix{T},
    r::Integer,
    maxiter::Integer = 1000,
    tolfun::T = 1e-4,
    V::Matrix{T} = rand(T, size(X, 1), r),
    W::Matrix{T} = rand(T, r, size(X, 2))
    ) where T <: AbstractFloat
    
    # dimensions
    m, n = size(X)
    obj = vecnorm(X - V * W)^2
    # MM loop
    for iter in 1:maxiter
        V = V .* (X * W') ./ (V * W * W')
        W = W .* (V' * X) ./ (V' * V * W)
        # convergence check
        objold = obj
        obj = vecnorm(X - V * W)^2
        abs(obj - objold) < tolfun * (abs(objold) + 1) && break
    end
    # output
    return V, W
    
end

nnmf1 (generic function with 5 methods)

For consistent timing, we set `tolfun` to 0, use a start point of all 1s for `V` and `W`, and let the algorithm run for exactly 100 iterations.

In [5]:
V0, W0 = ones(m, r), ones(r, n) # start point
nnmf1(X, r, 10, 0.0, V0, W0) # warm-up
fill!(V0, 1)
fill!(W0, 1)
@time nnmf1(X, r, 100, 0.0, V0, W0)

([0.00776078 0.00776078 … 0.00776078 0.00776078; 0.00784003 0.00784003 … 0.00784003 0.00784003; … ; 0.00765134 0.00765134 … 0.00765134 0.00765134; 0.00795023 0.00795023 … 0.00795023 0.00795023], [0.999757 0.981911 … 0.973329 0.996982; 0.999757 0.981911 … 0.973329 0.996982; … ; 0.999757 0.981911 … 0.973329 0.996982; 0.999757 0.981911 … 0.973329 0.996982])

  3.207919 seconds (2.61 k allocations: 5.308 GiB, 21.03% gc time)

Make sure our code is type stable.

In [6]:
@code_warntype nnmf1(X, r, 10, 0.0, V0, W0)

Variables:
  #self# <optimized out>
  X::Array{Float64,2}
  r <optimized out>
  maxiter::Int64
  tolfun::Float64
  V@_6::Array{Float64,2}
  W@_7::Array{Float64,2}
  #1 <optimized out>
  #2 <optimized out>
  iter@_10 <optimized out>
  objold::Float64
  #temp#@_12::Int64
  m <optimized out>
  n <optimized out>
  #temp#@_15 <optimized out>
  obj::Float64
  V@_17::Array{Float64,2}
  W@_18::Array{Float64,2}
  TS@_19 <optimized out>
  TS@_20 <optimized out>
  TS@_21 <optimized out>
  T@_22 <optimized out>
  shape@_23 <optimized out>
  iter@_24 <optimized out>
  newout@_25::Base.OneTo{Int64}
  newout@_26::Base.OneTo{Int64}
  newout@_27::Base.OneTo{Int64}
  newout@_28::Base.OneTo{Int64}
  C@_29::Array{Float64,2}
  keeps@_30::Tuple{Tuple{Bool,Bool},Tuple{Bool,Bool},Tuple{Bool,Bool}}
  Idefaults@_31::Tuple{Tuple{Int64,Int64},Tuple{Int64,Int64},Tuple{Int64,Int64}}
  #temp#@_32 <optimized out>
  keeps@_33::Tuple{Tuple{Bool,Bool},Tu

      SSAValue(174) = (Base.select_value)((Base.slt_int)(SSAValue(151), 0)::Bool, 0, SSAValue(151))::Int64
      SSAValue(175) = (Base.select_value)((Base.slt_int)(SSAValue(152), 0)::Bool, 0, SSAValue(152))::Int64
      # meta: location broadcast.jl shapeindexer 111 # line 112:
      SSAValue(164) = (Core.tuple)((Base.and_int)((Base.and_int)((1 === 1)::Bool, (1 === 1)::Bool)::Bool, ((Core.getfield)(newout@_28::Base.OneTo{Int64}, :stop)::Int64 === SSAValue(175))::Bool)::Bool)::Tuple{Bool}
      SSAValue(165) = (Core.tuple)(1)::Tuple{Int64}
      keep@_70::Tuple{Bool} = SSAValue(164)
      Idefault@_71::Tuple{Int64} = SSAValue(165)
      # meta: pop location
      # meta: pop location
      SSAValue(176) = (Core.tuple)((Base.and_int)((Base.and_int)((1 === 1)::Bool, (1 === 1)::Bool)::Bool, ((Core.getfield)(newout@_27::Base.OneTo{Int64}, :stop)::Int64 === SSAValue(174))::Bool)::Bool, (Core.getfield)(keep@_70::Tuple{Bool}, 1)::Bool)::Tuple{Bool,Bool}
      SSAValue(177) = (Core.tuple)(1, (C

      SSAValue(395) = (Core.tuple)((Base.and_int)((Base.and_int)((1 === 1)::Bool, (1 === 1)::Bool)::Bool, ((Core.getfield)(newout@_114::Base.OneTo{Int64}, :stop)::Int64 === SSAValue(406))::Bool)::Bool)::Tuple{Bool}
      SSAValue(396) = (Core.tuple)(1)::Tuple{Int64}
      keep@_156::Tuple{Bool} = SSAValue(395)
      Idefault@_157::Tuple{Int64} = SSAValue(396)
      # meta: pop location
      # meta: pop location
      SSAValue(407) = (Core.tuple)((Base.and_int)((Base.and_int)((1 === 1)::Bool, (1 === 1)::Bool)::Bool, ((Core.getfield)(newout@_113::Base.OneTo{Int64}, :stop)::Int64 === SSAValue(405))::Bool)::Bool, (Core.getfield)(keep@_156::Tuple{Bool}, 1)::Bool)::Tuple{Bool,Bool}
      SSAValue(408) = (Core.tuple)(1, (Core.getfield)(Idefault@_157::Tuple{Int64}, 1)::Int64)::Tuple{Int64,Int64}
      keep@_122::Tuple{Bool,Bool} = SSAValue(407)
      Idefault@_123::Tuple{Int64,Int64} = SSAValue(408)
      # meta: pop location
      SSAValue(464) = (Core.tuple)(keep@_122::Tuple{Bool,Bool}, (Co
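
Reading a full `@code_warntype` listing is tedious. As a lighter spot check, one option (not used in the original session) is `@inferred` from the `Test` standard library, which throws an error when the actual return type differs from the inferred one. The sketch assumes Julia 1.x, and the toy functions `stable` and `unstable` are invented for illustration.

```julia
using Test

# Type-unstable: returns a Float64 for x > 0 and the Int literal 0 otherwise.
unstable(x::Float64) = x > 0 ? x : 0
# Type-stable: both branches return a Float64.
stable(x::Float64) = x > 0 ? x : zero(x)

# @inferred returns the value when the inferred and actual types match ...
stable_ok = (@inferred stable(1.0)) == 1.0

# ... and throws when inference only finds a Union of possible return types.
unstable_flagged = try
    @inferred unstable(1.0)
    false
catch
    true
end

println("stable passes: ", stable_ok, ", unstable flagged: ", unstable_flagged)
```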

Profiling to identify computational bottleneck:

In [7]:
fill!(V0, 1)
fill!(W0, 1)
Profile.clear()
@profile nnmf1(X, r, 100, 0.0, V0, W0)
Profile.print(format=:flat)

 Count File                        Line Function                               
  1055 ./<missing>                   -1 anonymous                              
     7 ./In[3]                       12 nnmf1(::Array{Float64,2}, ::Int64, ... 
   159 ./In[3]                       15 nnmf1(::Array{Float64,2}, ::Int64, ... 
   257 ./In[3]                       16 nnmf1(::Array{Float64,2}, ::Int64, ... 
   632 ./In[3]                       19 nnmf1(::Array{Float64,2}, ::Int64, ... 
     1 ./abstractarray.jl           573 copy!(::Array{Any,1}, ::Core.Infere... 
     1 ./array.jl                   391 _collect(::Type{Any}, ::Core.Infere... 
     1 ./array.jl                   388 collect(::Type{Any}, ::Core.Inferen... 
   584 ./arraymath.jl                39 -(::Array{Float64,2}, ::Array{Float... 
   317 ./broadcast.jl               141 _broadcast!                            
   707 ./broadcast.jl               455 broadcast                              
   707 ./broadcast.jl               316 

## Step 2: Flop count

Considering the shapes of `V` and `W`, we realize that `V * W * W'` and `V * (W * W')` have different flop counts. `V * W * W'`, evaluated left to right as `(V * W) * W'`, takes $4mnr$ flops, while `V * (W * W')` needs only $2(m + n)r^2$ flops. This makes a big difference if $r \ll \min(m, n)$.
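
To put numbers on this, the following sketch evaluates both flop counts for the problem size used in this session (`m, n, r = 2048, 1024, 64`). It counts only matrix multiplications, at $2pqs$ flops for a $p \times q$ by $q \times s$ product, and ignores the elementwise operations; the helper name `gemm_flops` is made up for this illustration.

```julia
# Flops for multiplying a p-by-q matrix by a q-by-s matrix (hypothetical helper).
gemm_flops(p, q, s) = 2 * p * q * s

m, n, r = 2048, 1024, 64

# (V * W) * W': an m-by-n intermediate, then an m-by-n times n-by-r product
flops_left = gemm_flops(m, r, n) + gemm_flops(m, n, r)     # 4mnr
# V * (W * W'): an r-by-r intermediate, then an m-by-r times r-by-r product
flops_right = gemm_flops(r, n, r) + gemm_flops(m, r, r)    # 2(m + n)r^2

println("(V * W) * W' : ", flops_left, " flops")
println("V * (W * W') : ", flops_right, " flops")
println("ratio        : ", flops_left / flops_right)
```

For these sizes the reordering cuts this one product from about $5.4 \times 10^8$ to about $2.5 \times 10^7$ flops, roughly a factor of 21. The remaining products `X * W'`, `V' * X`, and `V * W` still cost $2mnr$ flops each, so a whole iteration speeds up by far less than that.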

In [8]:
function nnmf2(
    X::Matrix{T},
    r::Integer,
    maxiter::Integer = 1000,
    tolfun::T = 1e-4,
    V::Matrix{T} = rand(T, size(X, 1), r),
    W::Matrix{T} = rand(T, r, size(X, 2))
    ) where T <: AbstractFloat
    
    # dimensions
    m, n = size(X)
    obj = vecnorm(X - V * W)^2
    # MM loop
    for iter in 1:maxiter
        V = V .* (X * W') ./ (V * (W * W'))
        W = W .* (V' * X) ./ ((V' * V) * W)
        # convergence check
        objold = obj
        obj = vecnorm(X - V * W)^2
        abs(obj - objold) < tolfun * (abs(objold) + 1) && break
    end
    # output
    return V, W
    
end

nnmf2 (generic function with 5 methods)

We see an immediate improvement over `nnmf1` in run time (3.21 s to 2.03 s), memory allocation (5.308 GiB to 3.602 GiB), and GC time (21.03% to 14.73%).

In [9]:
nnmf2(X, r, 10, 0.0, V0, W0) # warm-up
fill!(V0, 1)
fill!(W0, 1)
@time nnmf2(X, r, 100, 0.0, V0, W0)

([0.00776078 0.00776078 … 0.00776078 0.00776078; 0.00784003 0.00784003 … 0.00784003 0.00784003; … ; 0.00765134 0.00765134 … 0.00765134 0.00765134; 0.00795023 0.00795023 … 0.00795023 0.00795023], [0.999757 0.981911 … 0.973329 0.996982; 0.999757 0.981911 … 0.973329 0.996982; … ; 0.999757 0.981911 … 0.973329 0.996982; 0.999757 0.981911 … 0.973329 0.996982])

  2.025511 seconds (2.21 k allocations: 3.602 GiB, 14.73% gc time)


Profile again. The sample count at line 16 drops markedly, from 257 to 77.

In [10]:
fill!(V0, 1)
fill!(W0, 1)
Profile.clear()
@profile nnmf2(X, r, 100, 0.0, V0, W0)
Profile.print(format=:flat)

 Count File                        Line Function                               
   601 ./<missing>                   -1 anonymous                              
     5 ./In[8]                       12 nnmf2(::Array{Float64,2}, ::Int64, ... 
   136 ./In[8]                       15 nnmf2(::Array{Float64,2}, ::Int64, ... 
    77 ./In[8]                       16 nnmf2(::Array{Float64,2}, ::Int64, ... 
   383 ./In[8]                       19 nnmf2(::Array{Float64,2}, ::Int64, ... 
   380 ./arraymath.jl                39 -(::Array{Float64,2}, ::Array{Float... 
   384 ./broadcast.jl               141 _broadcast!                            
   452 ./broadcast.jl               455 broadcast                              
   452 ./broadcast.jl               316 broadcast_c                            
    68 ./broadcast.jl               268 broadcast_t                            
   384 ./broadcast.jl               270 broadcast_t                            
   384 ./broadcast.jl               149 

## Step 3: Memory management

The excessive memory allocation and garbage collection (GC) overhead are worrisome. We observe

1. `X * W'` and `W * W'` will actually transpose `W`, causing unnecessary memory allocation.

2. Intermediate arrays (`W * W'`, `V' * V`, `V * W`, etc.) should be pre-allocated and reused in the loop.
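
The effect of pre-allocation can be checked in isolation. The sketch below compares an allocating product `W * W'` with an in-place product into a reused buffer. It assumes Julia 1.x, where `mul!(R, W, W')` from `LinearAlgebra` takes the place of the Julia 0.6 call `A_mul_Bt!(R, W, W)` for real matrices; the function names `gram_alloc` and `gram_inplace!` are invented for this example.

```julia
using LinearAlgebra

# Allocating version: each call creates a fresh r-by-r result.
gram_alloc(W) = W * W'

# In-place version: overwrites the preallocated r-by-r buffer R with W * W'.
gram_inplace!(R, W) = mul!(R, W, W')

W = rand(64, 1024)
R = zeros(64, 64)

# warm-up so that compilation is not counted
gram_alloc(W)
gram_inplace!(R, W)

bytes_alloc   = @allocated gram_alloc(W)
bytes_inplace = @allocated gram_inplace!(R, W)
println("allocating: ", bytes_alloc, " bytes, in-place: ", bytes_inplace, " bytes")
```

Inside a loop that runs for many iterations, the per-call allocation of the first version is what accumulates into the multi-gigabyte totals reported by `@time`.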

In [11]:
function nnmf3(
    X::Matrix{T},
    r::Integer,
    maxiter::Integer=1000,
    tolfun::T=1e-4,
    V::Matrix{T}=rand(T, size(X, 1), r),
    W::Matrix{T}=rand(T, r, size(X, 2))
    ) where T <: AbstractFloat
    
    # dimensions
    m, n = size(X)
    # pre-allocate arrays
    storageV = similar(V) # m-by-r
    storageW = similar(W) # r-by-n
    storageX = similar(X) # m-by-n
    storageR = zeros(eltype(X), r, r) # r-by-r
    # start point
    A_mul_B!(storageX, V, W)
    storageX .= X .- storageX
    obj = vecnorm(storageX)^2
    # MM loop
    for iter in 1:maxiter
        # V = V .* (X * W') ./ (V * (W * W'))
        A_mul_Bt!(storageR, W, W)
        A_mul_B!(storageV, V, storageR)
        V .= V ./ storageV
        A_mul_Bt!(storageV, X, W)
        V .= V .* storageV
        # W = W .* (V' * X) ./ ((V' * V) * W)
        At_mul_B!(storageR, V, V)
        A_mul_B!(storageW, storageR, W)
        W .= W ./ storageW
        At_mul_B!(storageW, V, X)
        W .= W .* storageW
        # convergence check
        A_mul_B!(storageX, V, W)
        objold = obj
        storageX .= X .- storageX
        obj = vecnorm(storageX)^2
        abs(obj - objold) < tolfun * (abs(objold) + 1) && break
    end
    # output
    return V, W
    
end

nnmf3 (generic function with 5 methods)

Now we see a huge reduction in memory allocation (3.602 GiB to 17.532 MiB), and GC time is essentially 0. Run time drops from 2.03 s to 1.41 s.

In [12]:
nnmf3(X, r, 10, 0.0, V0, W0) # warm-up
fill!(V0, 1)
fill!(W0, 1)
@time nnmf3(X, r, 100, 0.0, V0, W0)

([0.00776078 0.00776078 … 0.00776078 0.00776078; 0.00784003 0.00784003 … 0.00784003 0.00784003; … ; 0.00765134 0.00765134 … 0.00765134 0.00765134; 0.00795023 0.00795023 … 0.00795023 0.00795023], [0.999757 0.981911 … 0.973329 0.996982; 0.999757 0.981911 … 0.973329 0.996982; … ; 0.999757 0.981911 … 0.973329 0.996982; 0.999757 0.981911 … 0.973329 0.996982])

  1.410878 seconds (13 allocations: 17.532 MiB)


Profile again:

In [13]:
fill!(V0, 1)
fill!(W0, 1)
Profile.clear()
@profile nnmf3(X, r, 100, 0.0, V0, W0)
Profile.print(format=:flat)

 Count File                        Line Function                               
   348 ./<missing>                   -1 anonymous                              
     2 ./In[11]                      19 nnmf3(::Array{Float64,2}, ::Int64, ... 
    29 ./In[11]                      24 nnmf3(::Array{Float64,2}, ::Int64, ... 
     4 ./In[11]                      25 nnmf3(::Array{Float64,2}, ::Int64, ... 
    43 ./In[11]                      27 nnmf3(::Array{Float64,2}, ::Int64, ... 
     5 ./In[11]                      30 nnmf3(::Array{Float64,2}, ::Int64, ... 
     7 ./In[11]                      31 nnmf3(::Array{Float64,2}, ::Int64, ... 
     7 ./In[11]                      32 nnmf3(::Array{Float64,2}, ::Int64, ... 
    85 ./In[11]                      33 nnmf3(::Array{Float64,2}, ::Int64, ... 
   166 ./In[11]                      38 nnmf3(::Array{Float64,2}, ::Int64, ... 
   175 ./broadcast.jl               141 _broadcast!                            
   175 ./broadcast.jl               206 

## Step 4: GPU

**Warning:** this section will not run on the server, which isn't equipped with a GPU.

Let's inspect available GPU resources on my laptop.

In [2]:
using GPUArrays, CLArrays

# check available devices on this machine
mydevices = CLArrays.devices()

3-element Array{OpenCL.cl.Device,1}:
 OpenCL.Device(Intel(R) HD Graphics 530 on Apple @0x0000000001024500)                 
 OpenCL.Device(AMD Radeon Pro 460 Compute Engine on Apple @0x0000000001021c00)        
 OpenCL.Device(Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz on Apple @0x00000000ffffffff)

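During coding, we found two issues. `nnmf4` below works around the first by summing squared entries with `sum(x -> x * x, ...)` and the second by multiplying against a copy of the operand held in a storage array: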

1. `vecnorm()` doesn't work on GPU.

2. `A_mul_Bt!(storageRd, Wd, Wd)` and `At_mul_B!(storageRd, Vd, Vd)` don't work on GPU.
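
For the first issue, the squared Frobenius norm is just the sum of squared entries, so `sum(x -> x * x, A)` can stand in for `vecnorm(A)^2`. The small CPU check below illustrates the equivalence; it assumes Julia 1.x, where `vecnorm` is spelled `norm`, and does not touch the GPU.

```julia
using LinearAlgebra

A = [1.0 -2.0; 3.0 4.0]

frob2_norm = norm(A)^2             # Frobenius norm, squared
frob2_sum  = sum(x -> x * x, A)    # sum of squared entries, no square root involved

println("norm(A)^2 = ", frob2_norm, ", sum of squares = ", frob2_sum)
```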

In [3]:
function nnmf4(
    X::Matrix{T},
    r::Integer,
    dev,
    maxiter::Integer=1000,
    tolfun::T = 1e-4,
    V::Matrix{T} = rand(T, size(X, 1), r),
    W::Matrix{T} = rand(T, r, size(X, 2))
    ) where T <: AbstractFloat
    
    # dimensions
    m, n = size(X)
    # initialize device
    ctx = CLArrays.init(dev)
    # transfer X, V, W to device
    Xd, Vd, Wd = CLArray(X), CLArray(V), CLArray(W)
    # pre-allocate arrays on device
    storageVd = zeros(CLArray{T}, m, r)
    storageWd = zeros(CLArray{T}, r, n)
    storageXd = zeros(CLArray{T}, m, n)
    storageRd = zeros(CLArray{T}, r, r)
    # start point
    A_mul_B!(storageXd, Vd, Wd)
    storageXd .= Xd .- storageXd
    # obj = vecnorm(storageXd)^2 # not working on GPU
    obj = sum(x -> x * x, storageXd)
    # MM loop
    for iter in 1:maxiter
        # V = V .* (X * W') ./ (V * (W * W'))
        copy!(storageWd, Wd)
        A_mul_Bt!(storageRd, Wd, storageWd)
        A_mul_B!(storageVd, Vd, storageRd)
        Vd .= Vd ./ storageVd
        A_mul_Bt!(storageVd, Xd, Wd)
        Vd .= Vd .* storageVd
        # W = W .* (V' * X) ./ ((V' * V) * W)
        copy!(storageVd, Vd)
        At_mul_B!(storageRd, Vd, storageVd)
        A_mul_B!(storageWd, storageRd, Wd)
        Wd .= Wd ./ storageWd
        At_mul_B!(storageWd, Vd, Xd)
        Wd .= Wd .* storageWd
        # convergence check
        A_mul_B!(storageXd, Vd, Wd)
        objold = obj
        storageXd .= Xd .- storageXd
        # obj = vecnorm(storageXd)^2
        obj = sum(x -> x * x, storageXd)
        abs(obj - objold) < tolfun * (abs(objold) + 1) && break
    end
    # collect result from GPU
    V = collect(Vd)
    W = collect(Wd)
    # output    
    return V, W
    
end

nnmf4 (generic function with 5 methods)

### Double precision on AMD Radeon Pro 460

In [17]:
V0, W0 = ones(m, r), ones(r, n) # start point
nnmf4(X, r, mydevices[2], 10, 0.0, V0, W0) # warm-up
fill!(V0, 1)
fill!(W0, 1)
@time nnmf4(X, r, mydevices[2], 100, 0.0, V0, W0)

([0.00776078 0.00776078 … 0.00776078 0.00776078; 0.00784003 0.00784003 … 0.00784003 0.00784003; … ; 0.00765134 0.00765134 … 0.00765134 0.00765134; 0.00795023 0.00795023 … 0.00795023 0.00795023], [0.999757 0.981911 … 0.973329 0.996982; 0.999757 0.981911 … 0.973329 0.996982; … ; 0.999757 0.981911 … 0.973329 0.996982; 0.999757 0.981911 … 0.973329 0.996982])

  1.954353 seconds (263.32 k allocations: 10.854 MiB)


At 1.95 s, this is slower than the 1.41 s of `nnmf3` on the CPU. But we know that consumer GPUs are typically much slower at double precision than at single precision computation.

### Single precision on AMD Radeon Pro 460

Let's try single precision.

In [4]:
Xsp = Float32.(X)
V0sp = ones(Float32, m, r)
W0sp = ones(Float32, r, n);

In [9]:
nnmf4(Xsp, r, mydevices[2], 10, Float32(0), V0sp, W0sp) # warm-up
fill!(V0sp, 1)
fill!(W0sp, 1)
@time nnmf4(Xsp, r, mydevices[2], 100, Float32(0), V0sp, W0sp)

(Float32[0.00776083 0.00776083 … 0.00776083 0.00776083; 0.00784008 0.00784008 … 0.00784008 0.00784008; … ; 0.0076514 0.0076514 … 0.0076514 0.0076514; 0.00795028 0.00795028 … 0.00795028 0.00795028], Float32[0.999749 0.981904 … 0.973321 0.996975; 0.999749 0.981904 … 0.973321 0.996975; … ; 0.999749 0.981904 … 0.973321 0.996975; 0.999749 0.981904 … 0.973321 0.996975])

  0.571990 seconds (285.50 k allocations: 10.411 MiB, 1.79% gc time)


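Single precision gives a slightly different answer from double precision, but the entries differ only by a relative amount of order $10^{-5}$ (for example, 0.00776083 versus 0.00776078). At 0.57 s, it is also about 3.4 times faster than the double precision run on the same GPU.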
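
The size of this discrepancy can be explored on the CPU without a GPU. The sketch below runs the same multiplicative updates in `Float64` and `Float32` from an all-ones start point on a small random matrix and reports the largest relative difference in `V`. The helper `nnmf_updates`, the problem size, the seed, and the iteration count are all illustrative assumptions, and `Random.seed!` assumes Julia 1.x.

```julia
using Random

# Hypothetical helper: run `iters` multiplicative updates and return (V, W).
function nnmf_updates(X::Matrix{T}, r::Integer, iters::Integer) where T <: AbstractFloat
    m, n = size(X)
    V, W = ones(T, m, r), ones(T, r, n)    # all-ones start point, as in the timings
    for _ in 1:iters
        V = V .* (X * W') ./ (V * (W * W'))
        W = W .* (V' * X) ./ ((V' * V) * W)
    end
    return V, W
end

Random.seed!(123)
X64 = rand(40, 20)
X32 = Float32.(X64)                        # same data, rounded to single precision

V64, _ = nnmf_updates(X64, 4, 20)
V32, _ = nnmf_updates(X32, 4, 20)

# largest entrywise relative difference between the two precisions
reldiff = maximum(abs.(V32 .- V64) ./ abs.(V64))
println("max relative difference in V: ", reldiff)
```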

### Single precision on Intel(R) HD Graphics 530

In [8]:
fill!(V0sp, 1)
fill!(W0sp, 1)
@time nnmf4(Xsp, r, mydevices[1], 100, Float32(0), V0sp, W0sp)

(Float32[0.00776084 0.00776084 … 0.00776084 0.00776084; 0.00784009 0.00784009 … 0.00784009 0.00784009; … ; 0.0076514 0.0076514 … 0.0076514 0.0076514; 0.00795029 0.00795029 … 0.00795029 0.00795029], Float32[0.999749 0.981903 … 0.97332 0.996975; 0.999749 0.981903 … 0.97332 0.996975; … ; 0.999749 0.981903 … 0.97332 0.996975; 0.999749 0.981903 … 0.97332 0.996975])

  1.346441 seconds (264.53 k allocations: 10.114 MiB)


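The weaker Intel(R) HD Graphics 530 GPU does not yield much speedup for this example: at 1.35 s in single precision, it is only slightly faster than the 1.41 s double precision run of `nnmf3` on the CPU.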