# GPU Computing in Julia

This session introduces GPU computing in Julia.

In [1]:
versioninfo()

Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)


## GPGPU

GPUs are ubiquitous in modern computers. The following GPUs are found in today's typical computer systems.

| NVIDIA GPUs         | Tesla K80                    | GTX 1080                         | GT 650M                       |
|---------------------|------------------------------|----------------------------------|-------------------------------|
|                     | ![Tesla K80](nvidia_k80.jpg) | ![GTX 1080](nvidia_gtx1080.jpg)  | ![GT 650M](nvidia_gt650m.jpg) |
| Computers           | servers, clusters            | desktop                          | laptop                        |
|                     | ![Server](gpu_server.jpg)    | ![Desktop](alienware-area51.png) | ![Laptop](macpro_inside.png)  |
| Main usage          | scientific computing         | daily work, gaming               | daily work                    |
| Memory              | 24 GB                        | 8 GB                             | 1 GB                          |
| Memory bandwidth    | 480 GB/sec                   | 320 GB/sec                       | 80 GB/sec                     |
| Number of cores     | 4992                         | 2560                             | 384                           |
| Processor clock     | 562 MHz                      | 1.6 GHz                          | 0.9 GHz                       |
| Peak DP performance | 2.91 TFLOPS                  | 257 GFLOPS                       |                               |
| Peak SP performance | 8.73 TFLOPS                  | 8228 GFLOPS                      | 691 GFLOPS                    |

GPU architecture vs CPU architecture.  
* GPUs contain hundreds to thousands of processing cores on a single card; several cards can fit in a desktop PC  
* Each core carries out the same operations in parallel on different input data -- single program, multiple data (SPMD) paradigm  
* Extremely high arithmetic intensity *if* one can transfer the data onto and results off of the processors quickly
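
To make the last point concrete, here is a rough sketch of how much arithmetic a computation must perform per byte of memory traffic before the cores, rather than the memory system, become the limit. The peak rates and bandwidths are copied from the table above; the helper names `breakeven` and `matmul_intensity` are illustrative, and the matrix multiplication estimate assumes the idealized case where each Float32 matrix is moved exactly once.

```julia
# Peak single-precision rate (FLOP/s) and memory bandwidth (bytes/s),
# copied from the table above.
gpus = [
    ("Tesla K80", 8.73e12, 480e9),
    ("GTX 1080",  8228e9,  320e9),
    ("GT 650M",   691e9,   80e9),
]

# Break-even arithmetic intensity: FLOPs per byte of memory traffic needed
# before compute, rather than memory bandwidth, becomes the bottleneck.
breakeven(peak_flops, bandwidth) = peak_flops / bandwidth

for (name, peak, bw) in gpus
    println(name, ": ", breakeven(peak, bw), " FLOP/byte")
end

# Idealized intensity of an n-by-n Float32 matrix product:
# about 2n^3 FLOPs over three n-by-n matrices, each moved once (assumption).
matmul_intensity(n) = 2 * n^3 / (3 * n^2 * sizeof(Float32))

println("matmul, n = 1024: ", matmul_intensity(1024), " FLOP/byte")
```

Under these assumptions a large matrix product sits well above the break-even point of all three cards, while a simple elementwise operation (a few bytes moved per FLOP) sits far below it.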

| ![i7 die](cpu_i7_die.png) | ![Fermi die](Fermi_Die.png) |
|----------------------------------|------------------------------------|
| ![Einstein](einstein.png) | ![Rain man](rainman.png)    |

## GPGPU in Julia

GPU support in Julia is under active development. Check [JuliaGPU](https://github.com/JuliaGPU) for currently available packages.

There are at least two paradigms for programming GPUs.

- **CUDA** is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: cuBLAS, cuRAND, cuSPARSE, cuDNN, ...

- **OpenCL** is supported by multiple manufacturers (Nvidia, AMD, Intel, Apple, ...), but lacks some libraries essential for statistical computing.

Because my laptop does not have an Nvidia GPU, I'll illustrate using OpenCL.

## Query GPU devices in the system

In [2]:
using CLArrays

# check available devices on this machine
CLArrays.devices()

3-element Array{OpenCL.cl.Device,1}:
 OpenCL.Device(Intel(R) HD Graphics 530 on Apple @0x0000000001024500)                 
 OpenCL.Device(AMD Radeon Pro 460 Compute Engine on Apple @0x0000000001021c00)        
 OpenCL.Device(Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz on Apple @0x00000000ffffffff)

In [3]:
# use the AMD Radeon Pro 460 GPU
dev = CLArrays.devices()[2]
CLArrays.init(dev)

OpenCL context with:
CL version: OpenCL 1.2 
Device: CL AMD Radeon Pro 460 Compute Engine
            threads: 256
             blocks: (256, 256, 256)
      global_memory: 4294.967296 mb
 free_global_memory: NaN mb
       local_memory: 0.032768 mb


## Generate arrays on GPU devices

In [4]:
# generate GPU arrays
xd = rand(CLArray{Float32}, 5, 3)

GPU: 5×3 Array{Float32,2}:
 0.817386  0.553966  0.9694   
 0.11034   0.450713  0.850215 
 0.143936  0.126783  0.544901 
 0.189676  0.223372  0.145011 
 0.606684  0.931789  0.0843384

In [5]:
yd = ones(CLArray{Float32}, 5, 3)

GPU: 5×3 Array{Float32,2}:
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0
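
The arrays above are generated as `Float32`. A small CPU-only sketch of the trade-off, assuming nothing beyond base Julia: single precision halves the memory footprint at the cost of fewer significant digits, and the table at the top suggests a large throughput gap between single and double precision on consumer cards. The variable names are illustrative.

```julia
x32 = rand(Float32, 5, 3)
x64 = rand(Float64, 5, 3)

# Storage for the same 5×3 shape: 4 bytes vs 8 bytes per entry.
println(sizeof(x32), " bytes vs ", sizeof(x64), " bytes")

# Machine epsilon: spacing between 1.0 and the next representable number.
println(eps(Float32), " vs ", eps(Float64))

# Ratio of peak SP to peak DP throughput for the GTX 1080,
# using the GFLOPS figures from the table at the top of this session.
sp_dp_ratio = 8228 / 257
println("GTX 1080 SP/DP ratio: ", sp_dp_ratio)
```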

## Transfer data between main memory and GPU

In [6]:
# transfer data from main memory to GPU
x = randn(5, 3)
xd = CLArray(x)

GPU: 5×3 Array{Float64,2}:
  0.807471  -0.424316  -0.436997
  1.75592   -0.630868   1.62172 
 -0.2748    -1.5282     2.97822 
 -1.25279   -0.427537  -0.362384
 -1.14854    0.955066   0.602261

In [7]:
# transfer data from GPU back to main memory
x = collect(xd)

5×3 Array{Float64,2}:
  0.807471  -0.424316  -0.436997
  1.75592   -0.630868   1.62172 
 -0.2748    -1.5282     2.97822 
 -1.25279   -0.427537  -0.362384
 -1.14854    0.955066   0.602261

## Elementwise operations

In [8]:
zd = log.(yd .+ sin.(xd))

GPU: 5×3 Array{Float64,2}:
  0.543801  -0.530514  -0.550295
  0.684568  -0.891223   0.692499
 -0.316568  -7.0052     0.150702
 -2.99296   -0.535512  -0.437737
 -2.43231    0.59683    0.448848
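
The result above is `Float64` even though `yd` holds `Float32` values, because `xd` was created from a `Float64` array. The following CPU-only sketch reproduces that promotion with ordinary arrays; it assumes the GPU broadcast follows the same element type rules as base Julia, which matches the output shown. The names `x`, `y`, `z`, and `z32` are illustrative stand-ins for the device arrays.

```julia
x = randn(5, 3)            # Float64, like the array passed to CLArray(x)
y = ones(Float32, 5, 3)    # Float32, like yd

# The dotted calls fuse into a single elementwise pass over both arrays.
z = log.(y .+ sin.(x))
println(eltype(z))         # mixed Float32/Float64 inputs promote to Float64

# Converting x first keeps the whole computation in single precision.
z32 = log.(y .+ sin.(Float32.(x)))
println(eltype(z32))
```

Converting to `Float32` before transferring data is one way to keep later computations in single precision.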

In [9]:
# getting back x (exact only where |x| ≤ π/2, since asin returns the principal value)
asin.(exp.(zd) .- yd)

GPU: 5×3 Array{Float64,2}:
  0.807471  -0.424316  -0.436997
  1.38568   -0.630868   1.51987 
 -0.2748    -1.5282     0.163375
 -1.25279   -0.427537  -0.362384
 -1.14854    0.955066   0.602261
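
Three entries in the output above (1.38568, 1.51987, 0.163375) differ from the original `x`. A likely explanation is that `asin` returns values in [-π/2, π/2], so an input larger than π/2 comes back as π minus that input. The CPU-only sketch below checks this on three entries copied from the outputs above; the variable names are illustrative.

```julia
# Three entries of x copied from the outputs above:
# one inside [-π/2, π/2] and two above π/2.
x = [0.807471, 1.75592, 2.97822]

z = log.(1 .+ sin.(x))        # forward transformation
back = asin.(exp.(z) .- 1)    # attempted inverse

# asin(sin(t)) equals t on [-π/2, π/2] and π - t on (π/2, 3π/2).
expected = [x[1], pi - x[2], pi - x[3]]

println(back)
println(expected)
```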

## Linear algebra

In [10]:
zd = zeros(CLArray{Float32}, 3, 3)
At_mul_B!(zd, xd, yd)

GPU: 3×3 Array{Float32,2}:
 -0.112745  -0.112745  -0.112745
 -2.05585   -2.05585   -2.05585 
  4.40282    4.40282    4.40282 
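
`At_mul_B!(zd, xd, yd)` overwrites `zd` with the product of the transpose of `xd` and `yd`. Because `yd` is all ones here, every entry in row `i` of the result is the sum of column `i` of `x`. The CPU-only sketch below checks this with the values of `x` printed earlier (rounded as displayed, so agreement is only to about five digits); `colsums` is an illustrative name.

```julia
# x as printed in the transfer example above (rounded to the displayed digits).
x = [ 0.807471  -0.424316  -0.436997
      1.75592   -0.630868   1.62172
     -0.2748    -1.5282     2.97822
     -1.25279   -0.427537  -0.362384
     -1.14854    0.955066   0.602261]
y = ones(5, 3)

# Same product as At_mul_B!, written with the adjoint operator.
z = x' * y

# Column sums of x, computed directly for comparison.
colsums = [sum(x[:, j]) for j in 1:3]

println(z)
println(colsums)
```

In Julia 1.0 and later, `At_mul_B!` and `A_mul_B!` were replaced by `mul!` from the `LinearAlgebra` standard library.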

In [11]:
using BenchmarkTools

n = 1024
xd = rand(CLArray{Float32}, n, n)
yd = rand(CLArray{Float32}, n, n)
zd = zeros(CLArray{Float32}, n, n)

# SP matrix multiplication on GPU
@benchmark A_mul_B!($zd, $xd, $yd)

ERROR (unhandled task failure): OpenCL Error: OpenCL.Context error: 
Stacktrace:
 [1] raise_context_error(::String, ::String) at /Users/huazhou/.julia/v0.6/OpenCL/src/context.jl:109
 [2] macro expansion at /Users/huazhou/.julia/v0.6/OpenCL/src/context.jl:148 [inlined]
 [3] (::OpenCL.cl.##43#44)() at ./task.jl:335

BenchmarkTools.Trial: 
  memory estimate:  3.14 KiB
  allocs estimate:  114
  --------------
  minimum time:     17.157 μs (0.00% GC)
  median time:      21.811 μs (0.00% GC)
  mean time:        24.471 μs (3.38% GC)
  maximum time:     17.171 ms (48.17% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [12]:
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)

# SP matrix multiplication on CPU
@benchmark A_mul_B!($z, $x, $y)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.458 ms (0.00% GC)
  median time:      8.928 ms (0.00% GC)
  mean time:        9.285 ms (0.00% GC)
  maximum time:     15.329 ms (0.00% GC)
  --------------
  samples:          538
  evals/sample:     1

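Taken at face value, the median times above (21.8 μs on the GPU vs 8.9 ms on the CPU) show a roughly 400-fold speedup in this matrix multiplication example. The GPU run, however, printed an OpenCL context error, so its timing may not reflect a completed multiplication.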
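
As a rough sanity check, the two timings can be converted into floating-point rates. The sketch below assumes the usual count of about 2n³ floating-point operations for an n×n matrix product and uses the minimum times copied from the benchmark outputs above; the variable names are illustrative.

```julia
n = 1024
flops = 2 * n^3              # approximate FLOP count of one n×n matrix product

cpu_time = 8.458e-3          # seconds, minimum time of the CPU benchmark above
gpu_time = 17.157e-6         # seconds, minimum time of the GPU benchmark above

cpu_gflops = flops / cpu_time / 1e9
gpu_gflops = flops / gpu_time / 1e9
ratio = cpu_time / gpu_time

println("CPU: ", cpu_gflops, " GFLOPS")
println("GPU: ", gpu_gflops, " GFLOPS")
println("time ratio: ", ratio)
```

The CPU figure comes out near 254 GFLOPS. The GPU figure comes out above 100 TFLOPS, which exceeds the peak SP rate of every GPU in the table at the top of this session, so the measured GPU time probably does not cover a completed multiplication.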