# CUDAnative.jl
(based on material from Tim Besard)

Just another package: no changes to Julia itself are required.

In [2]:
using CUDAnative

In [2]:
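function vadd(a, b, c)
    # 1-based global index of this thread across all blocks
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return nothing  # a kernel should not return a value
end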

vadd (generic function with 1 method)

In [3]:
using CuArrays

In [4]:
a = CuArray([1,2,3])
b = CuArray([4,5,6])
c = zero(a)

3-element CuArray{Int64,1}:
 0
 0
 0

In [5]:
@cuda threads=length(a) vadd(a, b, c)
c

3-element CuArray{Int64,1}:
 5
 7
 9
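The index computation in `vadd` can be checked on the CPU, without a GPU. The sketch below emulates a launch of several blocks with `nthreads` threads each and adds a bounds guard for the case where the launch covers more threads than there are elements. The names `global_index` and `vadd_cpu!` are hypothetical helpers for illustration and are not part of CUDAnative.

```julia
# CPU emulation of the 1-based global thread index used in `vadd`.
# `global_index` and `vadd_cpu!` are illustrative helpers, not CUDAnative API.
global_index(block, blockdim, thread) = (block - 1) * blockdim + thread

function vadd_cpu!(c, a, b; nthreads = 2)
    nblocks = cld(length(c), nthreads)      # enough blocks to cover every element
    for block in 1:nblocks, thread in 1:nthreads
        i = global_index(block, nthreads, thread)
        if i <= length(c)                   # guard: the last block may overshoot
            c[i] = a[i] + b[i]
        end
    end
    return c
end

a = [1, 2, 3]
b = [4, 5, 6]
c = vadd_cpu!(zero(a), a, b)
println(c)   # [5, 7, 9]
```

With three elements and two threads per block, the second block has one surplus thread (index 4), which the guard skips. A real kernel launched with more threads than elements would need the same kind of check.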

It's fast! We outperform `nvcc` on the Rodinia benchmark suite:

![CUDAnative performance](img/cudanative_perf.png)

# CuArrays.jl

Array-based abstractions of GPU computations:

In [7]:
a = CuArray(rand(2,2))
b = CuArray(rand(2,2))

2×2 CuArray{Float64,2}:
 0.560497   0.248382 
 0.0249621  0.0141561

In [8]:
a * b

2×2 CuArray{Float64,2}:
 0.478548  0.213071
 0.221448  0.098413

But we have a Julia-to-GPU compiler, which makes our abstractions **much more powerful**. Higher-order functions such as `reduce` and `map` accept arbitrary Julia functions:

In [9]:
reduce(+, a)

1.6450283402849906

In [10]:
map((x,y) -> x*y, a, b)

2×2 CuArray{Float64,2}:
 0.47045     0.0805833
 0.00976213  0.0012765
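Since `map` and `reduce` take ordinary Julia functions, the same code can be written and tested against plain `Array`s first. The following is a minimal CPU-only sketch with made-up input values; `scaled_product` is a hypothetical user function, and the expectation that wrapping the inputs in `CuArray` gives the same results on the GPU is an assumption based on the cells above, not something this snippet verifies.

```julia
# Generic higher-order code, shown here on CPU `Array`s with made-up values.
# `scaled_product` is an illustrative user function, not part of any package.
scaled_product(x, y) = 2x * y

a = [1.0 2.0; 3.0 4.0]
b = [0.5 0.5; 0.5 0.5]

m = map(scaled_product, a, b)   # element-wise, like `map((x,y) -> x*y, a, b)`
s = reduce(+, a)                # sum of all elements

println(m)   # [1.0 2.0; 3.0 4.0]
println(s)   # 10.0
```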

`map` generalizes to `broadcast`, which extends singleton dimensions so that arrays of different shapes can be combined:

In [11]:
c = CuArray(rand(2))
broadcast((x,y) -> x*y, a, c)

2×2 CuArray{Float64,2}:
 0.51599    0.199446  
 0.0200414  0.00462107

Dot syntax is a convenient shorthand for `broadcast`:

In [12]:
a .* c

2×2 CuArray{Float64,2}:
 0.51599    0.199446  
 0.0200414  0.00462107
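How the shapes are extended is easier to see with round numbers. In this CPU-only sketch (the values are made up for illustration), the length-2 vector `c` is treated as a 2×1 column and repeated across the columns of `a`, so each row of `a` is scaled by the matching entry of `c`.

```julia
# Broadcasting a 2×2 matrix against a length-2 vector on the CPU.
a = [1.0 2.0; 3.0 4.0]
c = [10.0, 100.0]

r = broadcast((x, y) -> x * y, a, c)   # r[i, j] == a[i, j] * c[i]

println(r)             # [10.0 20.0; 300.0 400.0]
println(r == a .* c)   # true: dot syntax lowers to the same broadcast
```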