# Some special features of JACC

## Finer granularity

By default, `JACC.parallel_for` will:
- Synchronize after the launch
- Use the default stream
- Compute `threads`, `blocks`, and `shmem_size` for you

```julia
@kwdef mutable struct LaunchSpec{Backend}
    stream = default_stream(Backend)
    threads = 0
    blocks = 0
    shmem_size::Int = 0
    sync::Bool = false
end

launch_spec(; kw...) = LaunchSpec{typeof(default_backend())}(; kw...)

```

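You can override these defaults with `JACC.launch_spec`: specify one or more of the keywords, then call `JACC.parallel_for(spec, N, f, x...)`. A spec built this way has `sync = false`, so the launch does not synchronize unless you pass `sync = true` or call `JACC.synchronize()` afterwards.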

```julia
JACC.parallel_for(JACC.launch_spec(; threads = 1000), N,
    (i, a) -> begin
        @inbounds a[i] += 5.0
    end, a_device)
JACC.synchronize() # non-synchronized by default

```

There are more details on launch specs and on `parallel_reduce`, but we skip them here.
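To make the keyword-override pattern concrete, here is a minimal plain-Julia sketch that does not use JACC. `DemoSpec`, `demo_spec`, and `resolve` are hypothetical names invented for illustration. Treating `0` as "compute this automatically", and the fallback cap of 256 threads, are assumptions about how such defaults could be resolved, not a description of JACC's internals.

```julia
# Hypothetical stand-in for a launch spec; this is NOT JACC's LaunchSpec.
Base.@kwdef mutable struct DemoSpec
    threads::Int = 0      # 0 is treated here as "choose automatically"
    blocks::Int = 0       # 0 is treated here as "choose automatically"
    sync::Bool = false
end

# Any keyword left out keeps its default value.
demo_spec(; kw...) = DemoSpec(; kw...)

# Fill in fields left at 0 for a 1D launch over N indices.
# The cap of 256 threads is an arbitrary choice for this sketch.
function resolve(spec::DemoSpec, N::Integer; max_threads::Integer = 256)
    threads = spec.threads == 0 ? min(N, max_threads) : spec.threads
    blocks = spec.blocks == 0 ? cld(N, threads) : spec.blocks
    return (; threads, blocks, sync = spec.sync)
end

# Only `threads` is overridden; `blocks` is derived and `sync` stays false.
cfg = resolve(demo_spec(; threads = 1000), 10_000)
println(cfg)
```

The point of the pattern is that callers state only what they care about, and everything else is either kept at its default or derived from the problem size.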

## Shared Memory

See the [paper](https://ieeexplore.ieee.org/document/10938453).

`JACC.shared` gives easy access to on-chip GPU shared memory. If each thread accesses an array many times, adding this one line can give a significant speedup.

<img src="../images/GPU-shared-memory.png" alt="GPU shared memory" style="width:45%;height:auto;">

```julia
function spectral(i, j, image, filter, num_bands)
    for b in 1:num_bands
        @inbounds image[b, i, j] *= filter[j]
    end
end

function spectral_shared(i, j, image, filter, num_bands)
    filter_shared = JACC.shared(filter) # init shared memory
    for b in 1:num_bands
        @inbounds image[b, i, j] *= filter_shared[j]
    end
end

num_bands = 60
num_voxel = 10_240
size_voxel = 64*64
image = init_image(Float32, num_bands, num_voxel, size_voxel)
filter = init_filter(Float32, size_voxel)
jimage = JACC.array(image)
jfilter = JACC.array(filter)

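# Launch one of the two kernels; they take the same arguments.
JACC.parallel_for((num_voxel, size_voxel), spectral, jimage, jfilter, num_bands)
# JACC.parallel_for((num_voxel, size_voxel), spectral_shared, jimage, jfilter, num_bands)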

```

Things to keep in mind about `Y = JACC.shared(X)`:
- `X` is a `JACC.array`, and `Y` is a copy of `X` stored in on-chip shared memory
- `X` can have any number of dimensions
- `Y` must be a unidimensional array (currently)
- Call it inside the function you pass to `JACC.parallel_for`
- The array passed as the argument must fit within the capacity of shared memory
<img src="../images/JACC-perfresults-shared.png" alt="JACC.shared Performance Results" style="width:90%;height:auto;">

## Exploiting multi-GPU nodes

Submodule `JACC.Multi`
- Extends JACC’s programming model to multi-GPU nodes
- Same API (just add `Multi`)

`JACC.Multi.array`: `JX = JACC.Multi.array(X)`
- `X` is distributed evenly across the GPUs
- `JX` is a backend-specific structure that manages a set of sub-arrays
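One way to picture an even distribution is as a split of the index range into contiguous pieces, one per device. The sketch below is plain Julia and does not call JACC. `split_ranges` is a hypothetical helper, and giving the leftover elements to the first devices is an assumption made for illustration; JACC's actual partitioning may differ.

```julia
# Hypothetical helper, not part of JACC: split 1:n into ndev contiguous
# ranges whose lengths differ by at most one element.
function split_ranges(n::Integer, ndev::Integer)
    base, extra = divrem(n, ndev)
    ranges = UnitRange{Int}[]
    start = 1
    for d in 1:ndev
        # The first `extra` devices each take one additional element.
        len = base + (d <= extra ? 1 : 0)
        push!(ranges, start:(start + len - 1))
        start += len
    end
    return ranges
end

ranges = split_ranges(10, 4)
println(ranges)
```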

`JACC.Multi.parallel_for` and `JACC.Multi.parallel_reduce`
- Same API, but...
- Each call launches a portion of the workload on every device

```julia
# Unidimensional arrays
function axpy(i, alpha, x, y)
    x[i] += alpha * y[i]
end
function dot(i, x, y)
    return x[i] * y[i]
end
SIZE = 1_000_000
x = round.(rand(Float64, SIZE) * 100)
y = round.(rand(Float64, SIZE) * 100)
alpha = 2.5
dx = JACC.Multi.array(x)
dy = JACC.Multi.array(y)
JACC.Multi.parallel_for(SIZE, axpy, alpha, dx, dy)
res = JACC.Multi.parallel_reduce(SIZE, dot, dx, dy)

# Multidimensional arrays
function axpy_2d(i, j, alpha, x, y)
    x[i, j] += alpha * y[i, j]
end
function dot_2d(i, j, x, y)
    return x[i, j] * y[i, j]
end
SIZE = 1_000
x = round.(rand(Float64, SIZE, SIZE) * 100)
y = round.(rand(Float64, SIZE, SIZE) * 100)
alpha = 2.5
dx = JACC.Multi.array(x)
dy = JACC.Multi.array(y)
JACC.Multi.parallel_for((SIZE, SIZE), axpy_2d, alpha, dx, dy)
res = JACC.Multi.parallel_reduce((SIZE, SIZE), dot_2d, dx, dy)

```
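As a mental model for the multi-device reduction, each device can be thought of as reducing its own portion of the data, with the partial results combined at the end. The sketch below mimics that on the CPU in plain Julia. The device count `ndev` and the chunking via `Iterators.partition` are assumptions for illustration and say nothing about how `JACC.Multi.parallel_reduce` is implemented.

```julia
x = Float64[1, 2, 3, 4, 5, 6]
y = Float64[6, 5, 4, 3, 2, 1]
ndev = 3  # pretend device count, chosen for this sketch

# Contiguous index chunks, one per pretend device.
chunk = cld(length(x), ndev)
chunks = collect(Iterators.partition(1:length(x), chunk))

# Each "device" computes the dot product over its own chunk.
partials = [sum(x[i] * y[i] for i in r) for r in chunks]

# Combining the partial results gives the full dot product.
total = sum(partials)
println(partials, " -> ", total)
```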

<img src="../images/JACC-perfresults-multi.png" alt="JACC.Multi Performance Results" style="width:90%;height:auto;">

- There are additional functions for managing ghost elements between devices.
- A more thorough example is available in the tests.


# Applications and Ongoing Efforts

Examples and applications using JACC
- https://github.com/tdehoff/JACC-7-point-stencil
- https://github.com/JuliaORNL/JACC-applications
- https://github.com/JuliaORNL/MiniVATES.jl
- https://github.com/JuliaORNL/GrayScott.jl

Ongoing Efforts
- Performance benchmarking (closing gaps)
- Expanded API (covering more use-cases)
- More intuitive kernel launch
- `JACC.Async`: concurrency on multi-device nodes
- `JACC.Auto`: autotuning
- Have ideas? Send us a message!

Other HPC Julia tutorial resources
- SC24 tutorial