# Extra JACC Topics

## Finer granularity

By default, `JACC.parallel_for`:
- Synchronizes after the kernel launch
- Uses the default stream
- Computes `threads`, `blocks`, and `shmem_size` for you

```julia
@kwdef mutable struct LaunchSpec{Backend}
    stream = default_stream(Backend)
    threads = 0
    blocks = 0
    shmem_size::Int = -1
    sync::Bool = true
end

launch_spec(; kw...) = LaunchSpec{typeof(default_backend())}(; kw...)

```

You can change these defaults using `JACC.launch_spec`.

Specify one or more of the keywords and call `JACC.parallel_for(spec, N, f, x...)`.
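To make the keyword mechanics concrete, here is a minimal standalone sketch that mirrors the struct above without loading JACC. `ToyBackend`, `ToyLaunchSpec`, `toy_launch_spec`, and the `default_stream` stub are hypothetical stand-ins, not JACC names. The only point is that any keyword you leave out keeps its default.

```julia
# Hypothetical stand-ins; none of these names come from JACC.
struct ToyBackend end
default_stream(::Type{ToyBackend}) = :toy_default_stream

Base.@kwdef mutable struct ToyLaunchSpec{Backend}
    stream = default_stream(Backend)
    threads = 0
    blocks = 0
    shmem_size::Int = -1
    sync::Bool = true
end

# Same shape as `launch_spec`, but pinned to the toy backend.
toy_launch_spec(; kw...) = ToyLaunchSpec{ToyBackend}(; kw...)

# Override two fields; the rest keep their defaults.
spec = toy_launch_spec(; sync = false, threads = 256)
println(spec)
```

In this sketch the `0` and `-1` values act as "not set" placeholders, which is presumably how the real struct signals that a value should be computed at launch time.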

```julia
N = 500
a_d = JACC.ones(N)
JACC.parallel_for(JACC.launch_spec(; sync = false, threads = 1000), N, a_d) do i, a
    @inbounds a[i] += 5.0
end
JACC.synchronize() # needed because the launch above was non-synchronizing
```

You can use the macro syntax to make this simpler in most cases.

```julia
f(i, a) = @inbounds a[i] += 5.0

JACC.@parallel_for range=N sync=false f(a_d)
JACC.synchronize()
```

There are more details about this, and about `parallel_reduce`, but we'll skip them here.

## Shared Memory

See the [paper](https://ieeexplore.ieee.org/document/10938453).

`JACC.shared` gives easy access to on-chip GPU shared memory. If each thread accesses an array many times, you may get a significant speedup by adding just one line.

<img src="https://github.com/JuliaORNL/JACC-applications/blob/main/tutorial/images/GPU-shared-memory.png?raw=1" alt="GPU shared memory" style="width:45%;height:auto;">

```julia
function spectral(i, j, image, filter, num_bands)
    for b in 1:num_bands
        @inbounds image[b, i, j] *= filter[j]
    end
end

function spectral_shared(i, j, image, filter, num_bands)
    filter_shared = JACC.shared(filter) # init shared memory
    for b in 1:num_bands
        @inbounds image[b, i, j] *= filter_shared[j]
    end
end

num_bands = 60
num_voxel = 10_240
size_voxel = 64*64
image = init_image(Float32, num_bands, num_voxel, size_voxel)
filter = init_filter(Float32, size_voxel)
jimage = JACC.array(image)
jfilter = JACC.array(filter)

JACC.parallel_for((num_voxel,size_voxel), spectral, jimage, jfilter, num_bands)
# (or)
JACC.parallel_for((num_voxel,size_voxel), spectral_shared, jimage, jfilter, num_bands)

```
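The block above calls `init_image` and `init_filter` without defining them. The sketch below supplies hypothetical versions (random data, which is an assumption, not the tutorial's real helpers) and runs the `spectral` kernel body in a plain serial loop on the host. It is meant as a small reference you could compare device results against, and it uses tiny sizes so it runs instantly.

```julia
# Hypothetical helpers; the tutorial's real init_image/init_filter may differ.
init_image(T, num_bands, num_voxel, size_voxel) = rand(T, num_bands, num_voxel, size_voxel)
init_filter(T, size_voxel) = rand(T, size_voxel)

# Same kernel body as `spectral`, repeated so the sketch is standalone.
function spectral(i, j, image, filter, num_bands)
    for b in 1:num_bands
        @inbounds image[b, i, j] *= filter[j]
    end
end

# Deliberately tiny sizes for a host-only check.
num_bands, num_voxel, size_voxel = 3, 4, 5
image_h = init_image(Float32, num_bands, num_voxel, size_voxel)
filter_h = init_filter(Float32, size_voxel)

# Expected result, computed before the kernel mutates image_h.
expected = image_h .* reshape(filter_h, 1, 1, :)

# Serial stand-in for parallel_for over (num_voxel, size_voxel).
for j in 1:size_voxel, i in 1:num_voxel
    spectral(i, j, image_h, filter_h, num_bands)
end

println(image_h == expected)
```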

Things to keep in mind about `Y = JACC.shared(X)`:
- It is used inside kernel functions
- `X` is a `JACC.array`
- `Y` is a copy of `X` stored in on-chip shared memory
- `X` must fit within the shared memory capacity
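As a quick check on the last point, the arithmetic below estimates how much shared memory the `Float32` filter from the example would need. The 48 KiB limit is an assumption (a common per-block default on NVIDIA GPUs); the real capacity depends on your device and backend, so treat this as a sketch.

```julia
# Bytes needed to stage a Float32 filter of 64*64 elements in shared memory.
size_voxel = 64 * 64
filter_bytes = size_voxel * sizeof(Float32)   # 4096 * 4 = 16384

# ASSUMPTION: 48 KiB per block; check the actual limit for your device.
assumed_shmem_bytes = 48 * 1024

fits = filter_bytes <= assumed_shmem_bytes
println("filter needs ", filter_bytes, " bytes; fits = ", fits)
```

Under that assumption the filter uses 16 KiB, so it fits with room to spare. A `Float64` filter of the same length would need 32 KiB.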

<img src="https://github.com/JuliaORNL/JACC-applications/blob/main/tutorial/images/JACC-perfresults-shared.png?raw=1" alt="JACC.shared Performance Results" style="width:90%;height:auto;">

## Exploiting multi-device nodes (EXPERIMENTAL)

### `JACC.Multi` : Parallelism on multi-device nodes

See the [paper](https://doi.ieeecomputersociety.org/10.1109/eScience65000.2025.00036).

Submodule `JACC.Multi`
- Deploys JACC's specification on multiple GPUs
- Same API (just add `Multi`)

`JACC.Multi.array`: `JX = JACC.Multi.array(X)`
- `X` is evenly distributed across the GPUs
- `JX` is a backend-specific structure for managing a set of sub-arrays
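To picture what "evenly distributed" can mean, here is a plain-Julia sketch of one plausible way to split `n` elements into contiguous ranges, one per device. `partition_ranges` is a hypothetical helper written for illustration. JACC's actual layout may differ, for example in how it handles a remainder.

```julia
# Hypothetical helper: split 1:n into `ndev` contiguous, near-equal ranges.
function partition_ranges(n::Integer, ndev::Integer)
    base, extra = divrem(n, ndev)
    ranges = Vector{UnitRange{Int}}(undef, ndev)
    start = 1
    for d in 1:ndev
        # The first `extra` devices take one additional element.
        len = base + (d <= extra ? 1 : 0)
        ranges[d] = start:(start + len - 1)
        start += len
    end
    return ranges
end

println(partition_ranges(10, 4))  # [1:3, 4:6, 7:8, 9:10]
```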

`JACC.Multi.parallel_for` and `JACC.Multi.parallel_reduce`
- Same API, but...
- Launches a portion of the workload on each device

`JACC.Multi` support API
- `ndev()` : number of available devices
- `device_id(a::ArrayPart)` (device) : device where array partition is allocated
- `copy!(dest::MultiArray, src::MultiArray)`
- `part_length(a::MultiArray)` : get length of each array partition
- `sync_ghost_elems!(a::MultiArray)` : copy ghost elements between array partitions
- `ghost_shift(i::Integer, a::ArrayPart)` (device) : get index shifted past starting ghost elements
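The ghost-element functions are easier to follow with a picture of the data layout. The sketch below is a host-only toy model, not JACC code: each partition is a plain vector padded with one ghost cell at each end, and `toy_sync_ghosts!` and `toy_ghost_shift` are hypothetical names that imitate the idea behind `sync_ghost_elems!` and `ghost_shift`. The real functions operate on `MultiArray` and `ArrayPart` and may behave differently.

```julia
const NG = 1  # ghost cells per side (assumption for this toy model)

# Map an interior index to its storage index, skipping the leading ghosts.
toy_ghost_shift(i, ng = NG) = i + ng

# Copy edge values between neighboring partitions into their ghost cells.
function toy_sync_ghosts!(parts::Vector{Vector{Float64}}, ng::Int = NG)
    for p in 1:length(parts)-1
        left, right = parts[p], parts[p+1]
        # Trailing ghosts of `left` get the first interior values of `right`.
        left[end-ng+1:end] .= right[ng+1:2ng]
        # Leading ghosts of `right` get the last interior values of `left`.
        right[1:ng] .= left[end-2ng+1:end-ng]
    end
    return parts
end

# Two partitions: interior values 1..3 and 4..6, ghosts initialized to 0.
parts = [[0.0, 1.0, 2.0, 3.0, 0.0], [0.0, 4.0, 5.0, 6.0, 0.0]]
toy_sync_ghosts!(parts)
println(parts)  # [[0.0, 1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0, 0.0]]
```

After the sync, a stencil that reads `i - 1` and `i + 1` can run over each partition's interior without touching another device's memory.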

```julia
# Unidimensional arrays
function axpy(i, alpha, x, y)
    x[i] += alpha * y[i]
end
function dot(i, x, y)
    return x[i] * y[i]
end
SIZE = 1_000_000
x = round.(rand(Float64, SIZE) * 100)
y = round.(rand(Float64, SIZE) * 100)
alpha = 2.5
dx = JACC.Multi.array(x)
dy = JACC.Multi.array(y)
JACC.Multi.parallel_for(SIZE, axpy, alpha, dx, dy)
res = JACC.Multi.parallel_reduce(SIZE, dot, dx, dy)

# Multidimensional arrays
function axpy_2d(i, j, alpha, x, y)
    x[i, j] += alpha * y[i, j]
end
function dot_2d(i, j, x, y)
    return x[i, j] * y[i, j]
end
SIZE = 1_000
x = round.(rand(Float64, SIZE, SIZE) * 100)
y = round.(rand(Float64, SIZE, SIZE) * 100)
alpha = 2.5
dx = JACC.Multi.array(x)
dy = JACC.Multi.array(y)
JACC.Multi.parallel_for((SIZE, SIZE), axpy_2d, alpha, dx, dy)
res = JACC.Multi.parallel_reduce((SIZE, SIZE), dot_2d, dx, dy)

```
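For checking results, a serial host reference is handy. The sketch below repeats the one-dimensional `axpy` and `dot` kernel bodies and applies them in ordinary loops on a small input. It does not use JACC. The variable names `x_ref` and `res_ref` are illustrative. Note that, as in the example above, the reduction runs after `axpy` has already updated `x`.

```julia
# Same kernel bodies as the 1D example, repeated so the sketch is standalone.
axpy(i, alpha, x, y) = (x[i] += alpha * y[i])
dot(i, x, y) = x[i] * y[i]

n = 8
x_ref = collect(1.0:n)   # 1.0, 2.0, ..., 8.0
y = fill(2.0, n)
alpha = 2.5

# Serial stand-in for parallel_for: every x[i] grows by 2.5 * 2.0 = 5.0.
for i in 1:n
    axpy(i, alpha, x_ref, y)
end

# Serial stand-in for parallel_reduce (sum of the per-index products).
res_ref = sum(i -> dot(i, x_ref, y), 1:n)
println(res_ref)  # 152.0
```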

<img src="https://github.com/JuliaORNL/JACC-applications/blob/main/tutorial/images/JACC-perfresults-multi.png?raw=1" alt="JACC.Multi Performance Results" style="width:90%;height:auto;">

- There are additional functions for managing ghost elements between devices.
- _A more thorough example is in the tests._

### `JACC.Async` : Concurrency on multi-device nodes

The API is similar: add `Async` and pass a (user-defined) "queue" id as the first argument.

```julia
JACC.Async.parallel_for(1, ...)
JACC.Async.parallel_for(2, ...)

# parallel_reduce return value is on device
res_d = JACC.Async.parallel_reduce(1, ...)
res = JACC.to_host(res_d)[]
```

`JACC.Async` support API
- `ones(id, ...)`, `zeros(id, ...)`, `fill(id, ...)`
- `ndev()` : number of available devices
- `synchronize(id)` : finish kernels on specified "queue"
- `synchronize()` : synchronize all devices

_See tests for example._

# Applications and Ongoing Efforts

Examples and applications using JACC
- https://github.com/tdehoff/JACC-7-point-stencil
- https://github.com/JuliaORNL/JACC-applications
- https://github.com/JuliaORNL/MiniVATES.jl
- https://github.com/JuliaORNL/GrayScott.jl

Ongoing Efforts
- Performance benchmarking (closing gaps)
- Expanded API (covering more use-cases)
- More intuitive kernel launch
- JACC.Auto : Autotuning
- Have ideas? Send us a message!

Other HPC Julia tutorial resources
- SC24 tutorial