# CUDA with Julia
(based on material from Tim Besard)

https://docs.google.com/presentation/d/1l-BuAtyKgoVYakJSijaSqaTL3friESDyTOnU2OLqGoA/edit#slide=id.p
https://docs.google.com/presentation/d/1VtZ-gNe0Bz2GjLJYfJ5Jp70jFMXDmnfAlnGsaZUbJoI/edit#slide=id.g5522e03163_0_130


### Tombo
```
module load julia
module load cuda
module load cuDNN/6.0_CUDA_8.0.27
srun -N 1 -p gpu --gres=gpu:1 --pty bash
```

Then start Julia and install the packages from the Pkg REPL (press `]` to enter it):

```
] add CUDAnative CuArrays
```

Goal: GPU support should be just another package, with no changes to Julia itself.

In [None]:
using CUDAnative

In [1]:
function vadd(a, b, c)
    # Global 1-based index of this thread across all blocks
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return nothing  # kernels must not return a value
end

vadd (generic function with 1 method)
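
To make the index formula in `vadd` concrete, here is a minimal CPU-only sketch that replays it with ordinary loops. The function `emulate_vadd!` and its `threads`/`blocks` keywords are hypothetical names used for illustration only; they are not part of CUDAnative. The bounds guard is an addition that the kernel above does not have: it matters once `blocks * threads` exceeds the array length.

```julia
# CPU-only sketch: emulate the kernel's index computation with plain loops.
# `emulate_vadd!` is a hypothetical helper, not a CUDAnative function.
function emulate_vadd!(out, xs, ys; threads::Int, blocks::Int)
    for block in 1:blocks, thread in 1:threads
        # Same formula as in the kernel, with
        # blockIdx().x -> block, blockDim().x -> threads, threadIdx().x -> thread.
        i = (block - 1) * threads + thread
        # Guard: skip indices past the end of the array.
        if i <= length(out)
            out[i] = xs[i] + ys[i]
        end
    end
    return out
end

xs = [1, 2, 3, 4, 5]
ys = [10, 20, 30, 40, 50]
out = zero(xs)

# Round the block count up so that every element is covered.
emulate_vadd!(out, xs, ys; threads=2, blocks=cld(length(xs), 2))
```

With 2 threads per block, `cld(5, 2)` gives 3 blocks, so 6 indices are generated and the guard discards the last one.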

In [None]:
using CuArrays

In [None]:
a = CuArray([1,2,3])
b = CuArray([4,5,6])
c = zero(a)

In [None]:
@cuda threads=length(a) vadd(a, b, c)
c

It's fast! We outperform `nvcc` on the Rodinia benchmark suite.

![CUDAnative performance](img/cudanative_perf.png)

# CuArrays.jl

Array-based abstractions of GPU computations:

In [None]:
a = CuArray(rand(2,2))
b = CuArray(rand(2,2))

In [None]:
a * b

But we also have a Julia-to-GPU compiler, which makes our abstractions **much more powerful**: higher-order functions accept arbitrary Julia functions.

In [None]:
reduce(+, a)

In [None]:
map((x,y) -> x*y, a, b)

This generalizes to `broadcast`, which extends the arguments' shapes so that they match:

In [None]:
c = CuArray(rand(2))
broadcast((x,y) -> x*y, a, c)

Convenient shorthand syntax:

In [None]:
a .* c
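
As a rough illustration of how shapes are extended, the sketch below uses plain CPU `Array`s with fixed values, on the assumption that `CuArray` broadcasting follows the same rules as Base Julia. The names `m` and `v` are chosen here for illustration.

```julia
# CPU-only sketch with plain Arrays; values are fixed so the result is easy to check.
m = [1.0 2.0; 3.0 4.0]   # 2×2 matrix
v = [10.0, 100.0]        # length-2 vector, treated as a 2×1 column

# The vector is extended along the second dimension,
# so row 1 is scaled by v[1] and row 2 by v[2].
r = broadcast((x, y) -> x * y, m, v)

# The dot syntax is shorthand for the same broadcast call.
r_dot = m .* v
```

Here `r` is `[10.0 20.0; 300.0 400.0]`, and `r_dot` is identical.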