# GPU Computing in Julia

This session demonstrates GPU computing in Julia using the GPUArrays and CLArrays packages.

In [1]:
versioninfo()

Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

Query the GPU devices available on the system.

In [2]:
using GPUArrays, CLArrays

# check available devices on this machine
CLArrays.devices()

3-element Array{OpenCL.cl.Device,1}:
 OpenCL.Device(Intel(R) HD Graphics 530 on Apple @0x0000000001024500)
 OpenCL.Device(AMD Radeon Pro 460 Compute Engine on Apple @0x0000000001021c00)
 OpenCL.Device(Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz on Apple @0x00000000ffffffff)


In [3]:
# use the AMD Radeon Pro 460 GPU
dev = CLArrays.devices()[2]
CLArrays.init(dev)

OpenCL context with:
CL version: OpenCL 1.2 
Device: CL AMD Radeon Pro 460 Compute Engine
            threads: 256
             blocks: (256, 256, 256)
      global_memory: 4294.967296 mb
 free_global_memory: NaN mb
       local_memory: 0.032768 mb


Generate arrays on the GPU device.

In [4]:
# generate GPU arrays
xd = rand(CLArray{Float32}, 5, 5)

GPU: 5×5 Array{Float32,2}:
 0.282438   0.288461    0.425512  0.632621  0.141995 
 0.3416     0.730267    0.405668  0.925261  0.606434 
 0.0207534  0.00322844  0.937711  0.503301  0.148619 
 0.13636    0.553946    0.894025  0.446051  0.0555661
 0.54389    0.484792    0.972734  0.196462  0.308813 

In [5]:
yd = ones(CLArray{Float32}, 5, 5)

GPU: 5×5 Array{Float32,2}:
 1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0

In [6]:
# elementwise operations
zd = log.(yd .+ sin.(xd))

GPU: 5×5 Array{Float32,2}:
 0.245842   0.250352    0.345564  0.464527  0.132359 
 0.288927   0.511067    0.332631  0.587107  0.451038 
 0.0205394  0.00322323  0.591228  0.393609  0.138084 
 0.127458   0.422681    0.576388  0.358658  0.0540501
 0.417043   0.382554    0.602362  0.178315  0.265381 

In [7]:
# recover xd by inverting the transformation
asin.(exp.(zd) .- yd) 

GPU: 5×5 Array{Float32,2}:
 0.282438   0.288461    0.425512  0.632621  0.141995 
 0.3416     0.730267    0.405668  0.925261  0.606434 
 0.0207533  0.00322843  0.937711  0.503301  0.148619 
 0.13636    0.553946    0.894025  0.446051  0.0555661
 0.54389    0.484792    0.972734  0.196463  0.308813 
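The round trip above can be checked on the CPU without any GPU package. The following is a minimal sketch that uses plain `Array`s in place of `CLArray`s; the names `x`, `y`, `z`, and `x_back` are illustrative and not part of the session. Since `x` lies in [0, 1), `exp.(z) .- y` equals `sin.(x)` up to rounding, so `asin` recovers `x` only approximately, which is consistent with the last-digit differences in the GPU output.

```julia
# CPU analogue of the GPU cells above, using plain Arrays (no GPU needed)
x = rand(Float32, 5, 5)          # uniform samples in [0, 1)
y = ones(Float32, 5, 5)

z = log.(y .+ sin.(x))           # elementwise operations via broadcasting
x_back = asin.(exp.(z) .- y)     # invert the transformation

# Float32 rounding makes the round trip approximate, so compare with isapprox
println(isapprox(x_back, x))
println(eltype(z))
```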

Linear algebra.

In [8]:
# in-place matrix multiplication: overwrite zd with xd * yd
A_mul_B!(zd, xd, yd)

GPU: 5×5 Array{Float32,2}:
 1.77103  1.77103  1.77103  1.77103  1.77103
 3.00923  3.00923  3.00923  3.00923  3.00923
 1.61361  1.61361  1.61361  1.61361  1.61361
 2.08595  2.08595  2.08595  2.08595  2.08595
 2.50669  2.50669  2.50669  2.50669  2.50669
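Because `yd` is a matrix of ones, every column of the product equals the vector of row sums of `xd`, which is why each row of the output repeats a single value (the first row of `xd` printed earlier sums to about 1.77103). A minimal CPU sketch of this check with plain `Array`s follows. It uses the `*` operator rather than `A_mul_B!` so that it does not depend on the Julia version, and the variable names are illustrative.

```julia
x = rand(Float32, 5, 5)
y = ones(Float32, 5, 5)

z = x * y                                   # matrix product on the CPU

# sum of each row of x, computed independently of the matrix product
rowsums = [sum(x[i, :]) for i in 1:size(x, 1)]

# every column of z should match the row sums up to Float32 rounding
println(all(isapprox(z[:, j], rowsums) for j in 1:size(z, 2)))
```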

In [9]:
using BenchmarkTools

n = 500
xd = rand(CLArray{Float32}, n, n)
yd = rand(CLArray{Float32}, n, n)
zd = zeros(CLArray{Float32}, n, n)
@benchmark A_mul_B!($zd, $xd, $yd)

BenchmarkTools.Trial: 
  memory estimate:  2.86 KiB
  allocs estimate:  96
  --------------
  minimum time:     17.486 μs (0.00% GC)
  median time:      23.223 μs (0.00% GC)
  mean time:        26.641 μs (3.17% GC)
  maximum time:     18.880 ms (44.70% GC)
  --------------
  samples:          10000
  evals/sample:     1

The same multiplication on the CPU, for comparison.

In [11]:
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)

@benchmark A_mul_B!($z, $x, $y)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.019 ms (0.00% GC)
  median time:      1.438 ms (0.00% GC)
  mean time:        1.450 ms (0.00% GC)
  maximum time:     3.142 ms (0.00% GC)
  --------------
  samples:          3437
  evals/sample:     1
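
The two benchmarks can be compared directly. The sketch below only redoes the arithmetic on the timings reported above; the variable names are illustrative. It also converts each median time into an implied throughput, using the standard count of about 2n³ floating-point operations for an n×n matrix product. The implied GPU figure is on the order of 10 TFLOPS, which seems high for a laptop GPU, so the GPU timing may measure only the time to enqueue the operation rather than to complete it. This is an assumption, not something the output above confirms.

```julia
# timings copied from the BenchmarkTools output above, converted to seconds
gpu_min, gpu_median = 17.486e-6, 23.223e-6
cpu_min, cpu_median = 1.019e-3, 1.438e-3

# ratio of CPU time to GPU time
speedup_min = cpu_min / gpu_min
speedup_median = cpu_median / gpu_median

# an n-by-n matrix product takes about 2n^3 floating-point operations
n = 500
flops = 2 * n^3
gpu_tflops = flops / gpu_median / 1e12   # implied GPU throughput
cpu_gflops = flops / cpu_median / 1e9    # implied CPU throughput

println("speedup (minimum times): ", speedup_min)
println("speedup (median times):  ", speedup_median)
println("implied GPU TFLOPS:      ", gpu_tflops)
println("implied CPU GFLOPS:      ", cpu_gflops)
```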