# Using GPU

**Note: This notebook fails to run on a MacBook due to [a bug](https://github.com/JuliaGPU/OpenCL.jl/issues/176) in OpenCL.jl. If you are on a MacBook, try running the file `gpu_example.jl` directly from your terminal.**

The following are implemented on GPU via OpenCL.

- `SnpCLVariables(z, x, y; model=ADDITIVE_MODEL, center=true, scale=true)`. Other models, and disabling centering or scaling, are not supported yet.
- `mul!(Xty::AbstractVector{T}, Xt::Transpose{UInt8, SnpArray}, y::AbstractVector{T}; v::SnpCLVariables) where T <: Union{Float64, Float32}`. Based on Kevin L. Keys's kernels in [PLINK.jl](https://github.com/klkeys/PLINK.jl).
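
As a rough CPU reference for what the GPU `mul!` computes, the sketch below forms the product of a centered and scaled genotype matrix, transposed, with a vector, using a small dense matrix. It assumes the additive model centers each column by twice its allele frequency `p` and scales by `sqrt(2p(1-p))`; the exact convention in SnpArrays.jl (including how missing genotypes are handled) should be checked against its documentation. The names `G`, `y_toy`, `Gstd`, and `Xty_ref` are illustrative only.

```julia
using LinearAlgebra, Statistics

# Toy genotype matrix: rows are individuals, columns are SNPs,
# entries are allele counts 0, 1, or 2 (no missing values).
G = Float64[0 1 2; 1 1 0; 2 0 1; 1 2 1]
y_toy = [1.0, -1.0, 0.5, 2.0]

# Assumed standardization: p is the per-column allele frequency.
p = vec(mean(G; dims=1)) ./ 2
Gstd = (G .- 2 .* p') ./ sqrt.(2 .* p' .* (1 .- p'))

# In-place product transpose(Gstd) * y_toy, one entry per SNP.
Xty_ref = Vector{Float64}(undef, size(G, 2))
mul!(Xty_ref, transpose(Gstd), y_toy)
println(Xty_ref)
```

The GPU version avoids materializing a dense matrix like `Gstd`; this sketch only illustrates the arithmetic.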

In [1]:
] activate ..

In [2]:
versioninfo()

Julia Version 1.0.1
Commit 0d713926f8 (2018-09-29 19:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, ivybridge)


In [3]:
using SnpArrays, OpenCL, LinearAlgebra, BenchmarkTools

First, let's look at the platform and device information.

In [4]:
using Printf
# The code below is taken from JuliaGPU/OpenCL.jl/examples/performance.jl

# platform information
platform = first(cl.platforms())
@printf("Platform name:    %s\n",  platform[:name])
@printf("Platform profile: %s\n",  platform[:profile])
@printf("Platform vendor:  %s\n",  platform[:vendor])
@printf("Platform version: %s\n",  platform[:version])

println()
# device information
device = last(cl.devices(:gpu))
@printf("Device name: %s\n", device[:name])
@printf("Device type: %s\n", device[:device_type])
@printf("Device mem: %i MB\n",           device[:global_mem_size] / 1024^2)
@printf("Device max mem alloc: %i MB\n", device[:max_mem_alloc_size] / 1024^2)
@printf("Device max clock freq: %i MHZ\n",  device[:max_clock_frequency])
@printf("Device max compute units: %i\n",   device[:max_compute_units])
@printf("Device max work group size: %i\n", device[:max_work_group_size])
@printf("Device max work item size: %s\n",  device[:max_work_item_size])

device = last(cl.devices(:gpu)) # This selection is the default for SnpCLVariables so that the AMD GPU is used on a Mac.

Platform name:    NVIDIA CUDA
Platform profile: FULL_PROFILE
Platform vendor:  NVIDIA Corporation
Platform version: OpenCL 1.2 CUDA 10.1.113

Device name: GeForce GTX 1080
Device type: gpu
Device mem: 8120 MB
Device max mem alloc: 2030 MB
Device max clock freq: 1733 MHZ
Device max compute units: 20
Device max work group size: 1024
Device max work item size: (1024, 1024, 64)


OpenCL.Device(GeForce GTX 1080 on NVIDIA CUDA @0x0000000002a81c60)

Loading EUR data:

In [5]:
# setting up
const EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"));
X = EUR
m,n  = size(EUR)

(379, 54051)
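
To get a feel for how much data is involved, the arithmetic below compares the packed genotype storage with a dense `Float64` matrix of the same dimensions. It assumes the PLINK BED layout of 2 bits per genotype, with each SNP column padded to a whole number of bytes; the variable names are illustrative.

```julia
m, n = 379, 54051                    # dimensions reported above

bytes_per_column = cld(m, 4)         # 4 genotypes per byte, rounded up per column
packed_bytes = bytes_per_column * n
dense_bytes  = m * n * sizeof(Float64)

println("packed 2-bit storage: ", round(packed_bytes / 1024^2; digits=1), " MiB")
println("dense Float64 matrix: ", round(dense_bytes / 1024^2; digits=1), " MiB")
```

Under these assumptions the packed data takes about 5 MiB, against about 156 MiB for the dense matrix, so it fits comfortably within the device memory limits printed earlier.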

A random vector to be multiplied by `transpose(X)`:

In [6]:
using Random
Random.seed!(95376)
y = randn(m);

A `SnpBitMatrix`, which supports our CPU implementation of the same operation:

In [7]:
X_bm = SnpBitMatrix{Float64}(X; model=ADDITIVE_MODEL, center=true, scale=true);

Vectors to hold the results:

In [8]:
z_bm = Vector{Float64}(undef, n)
z = Vector{Float64}(undef, n) ;

We also create 32-bit counterparts of `y` and `z`.

In [9]:
y32 = convert(Array{Float32}, y)
z32 = convert(Array{Float32}, z);

A `SnpCLVariables` object is required for GPU operations.

In [10]:
v = SnpCLVariables(z, X, y)
v32 = SnpCLVariables(z32, X, y32)

SnpCLVariables{Float32,OpenCL.cl.Buffer{Float32}}(Buffer{Float32}(@0x0000000006a8eec0), Buffer{UInt8}(@0x0000000006575310), Buffer{Float32}(@0x0000000006a92e50), Buffer{Int64}(@0x0000000006a9af50), OpenCL.Device(GeForce GTX 1080 on NVIDIA CUDA @0x0000000002a81c60), OpenCL.Context(@0x00000000048ba5f0 on GeForce GTX 1080), OpenCL.CmdQueue(@0x00000000064aa2d0), Buffer{Float32}(@0x00000000052944b0), Buffer{Float32}(@0x000000000246be00), Buffer{Float32}(@0x0000000006b260d0), OpenCL.Kernel("compute_xt_times_vector" nargs=12), OpenCL.Kernel("reduce_xt_vec_chunks" nargs=7), OpenCL.Kernel("reset_x" nargs=3), 256, 2, 423, 379, 54051, 379, 54051, 2, 95, 256, 1, 108102, OpenCL.cl.LocalMem{Float32}(0x0000000000000400))

The following block compares the running times of the CPU version, the 64-bit GPU version, and the 32-bit GPU version (lower precision). Each operation is called once before `@btime` as a warm-up.

In [None]:
# benchmarks
print("Benchmark time for CPU version (via SnpBitMatrix): ")
mul!(z_bm, transpose(X_bm), y)
@btime mul!(z_bm, transpose(X_bm), y)

print("Benchmark time for GPU version (64-bit): ")
mul!(z, transpose(X), y; v=v)
@btime mul!(z, transpose(X), y; v=v)

print("Benchmark time for GPU version (32-bit): ")
mul!(z32, transpose(X), y32; v=v32)
@btime mul!(z32, transpose(X), y32; v=v32);

The GPU version is 20-35x faster than the CPU `SnpBitMatrix` implementation in this case. On a MacBook Pro with an AMD Radeon Pro 560,
 
```
julia> versioninfo()
Julia Version 1.1.1
Commit 55e36cc (2019-05-16 04:10 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
```

the benchmark timings (from running `examples/gpu_example.jl`) were:
```
Benchmark time for CPU version (via SnpBitMatrix):   74.001 ms (1 allocation: 16 bytes)
Benchmark time for GPU version (64-bit):   7.962 ms (304 allocations: 17.03 KiB)
Benchmark time for GPU version (32-bit):   6.642 ms (299 allocations: 16.88 KiB)
```
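
The speedups on the MacBook Pro can be read off these timings with a little arithmetic. The snippet below only divides the numbers reported above; the variable names are illustrative.

```julia
using Printf

# Timings in milliseconds, copied from the MacBook Pro output above.
cpu_ms, gpu64_ms, gpu32_ms = 74.001, 7.962, 6.642

@printf("64-bit GPU speedup over CPU: %.1fx\n", cpu_ms / gpu64_ms)
@printf("32-bit GPU speedup over CPU: %.1fx\n", cpu_ms / gpu32_ms)
```

On this machine the GPU is roughly 9x (64-bit) and 11x (32-bit) faster than the CPU, which is below the 20-35x range quoted for the GTX 1080.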

Finally, we check the correctness of GPU operations.

In [None]:
# check correctness
@assert isapprox(z, z_bm; rtol=1e-12)
@assert isapprox(z32, z_bm; rtol=1e-5)
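
The looser tolerance for the 32-bit result reflects the precision of `Float32`, not a defect in the kernel. The CPU-only sketch below illustrates this on toy data, assuming a random 0/1/2 matrix as a stand-in for genotype counts; it does not use the GPU or SnpArrays, and all names in it are illustrative.

```julia
using LinearAlgebra, Random

Random.seed!(1)
A = Float64.(rand(0:2, 200, 50))     # toy stand-in for genotype counts
y_toy = randn(200)

ref64    = transpose(A) * y_toy                      # double-precision reference
approx32 = transpose(Float32.(A)) * Float32.(y_toy)  # same product in single precision

# For vectors, isapprox compares norm(a - b) against rtol * max(norm(a), norm(b)).
relerr = norm(approx32 - ref64) / norm(ref64)
println("relative error of Float32 result: ", relerr)
println("passes rtol=1e-5:  ", isapprox(approx32, ref64; rtol=1e-5))
println("passes rtol=1e-12: ", isapprox(approx32, ref64; rtol=1e-12))
```

A single-precision result is expected to meet the `1e-5` tolerance but not the `1e-12` one, which is why the two assertions above use different `rtol` values.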