# SYCL

## Level 0: Hello World

We start with serial CPU code that prints a range of numbers:

```cpp
for (size_t i0 = 0; i0 < nx; ++i0) {
    printf("%zu\n", i0);
}
```

The full example is available in [print-numbers-base.cpp](../src/print-numbers/print-numbers-base.cpp), and can be compiled and executed using the following cells.

In [None]:
!g++ -O3 -march=native -std=c++17 -o ../build/print-numbers/print-numbers-base ../src/print-numbers/print-numbers-base.cpp

In [None]:
!../build/print-numbers/print-numbers-base

SYCL also provides an abstraction for parallel loops.
Using it requires a *handler*, which in turn requires a work *queue*.

The queue can be initialized with the `in_order` property, which ensures that kernels and other operations execute in the order in which they are submitted.

```cpp
sycl::queue q(sycl::property::queue::in_order{});
```

Work can then be submitted to the queue, which passes a handler to the submitted function:

```cpp
q.submit([&](sycl::handler &h) {
    h.parallel_for(nx, [=](auto i0) {
        // this won't work yet
        printf("%ld\n", i0);
    });
});
```

In contrast to most other approaches, SYCL does _not_ support `printf` in kernels.
One working alternative is a `sycl::stream`.
Below is the complete snippet, excluding the setup of the queue `q`.

```cpp
q.submit([&](sycl::handler &h) {
    sycl::stream str(8192, 1024, h); // use sycl stream instead of printf
    h.parallel_for(nx, [=](auto i0) {
        // printf("%ld\n", i0);
        str << (size_t)i0 << sycl::endl; // use sycl endl
    });
});
```

You can tune the work-group size by specifying global and local sizes (the total number of threads and the number of threads per work-group).
Note that the global size must be an integer multiple of the local size, so it is rounded up and any extra threads beyond `nx` must be masked.

```cpp
auto local_size = 256;
auto global_size = ceilingDivide(nx, local_size) * local_size;
q.submit([&](sycl::handler &h) {
    sycl::stream str(8192, 1024, h); // use sycl stream instead of printf
    h.parallel_for( sycl::nd_range<1>{ global_size, local_size }, [=](auto item) {
        auto i0 = item.get_global_id(0);
        if (i0 < nx) {
            str << (size_t)i0 << sycl::endl; // use sycl endl
        }
    });
});
```
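
The snippet above relies on a `ceilingDivide` helper that is not shown in this section. A minimal sketch of what such a helper might look like, together with a host-side emulation of the masking, is given below. The implementation of `ceilingDivide` as well as the names `paddedGlobalSize` and `countActive` are assumptions for illustration and are not taken from the linked source file.

```cpp
#include <cstddef>

// Hypothetical helper: number of chunks of size `b` needed to cover `a` items.
// Assumes b > 0 and that a + b - 1 does not overflow.
constexpr std::size_t ceilingDivide(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// Global size rounded up to the next multiple of the local size.
constexpr std::size_t paddedGlobalSize(std::size_t nx, std::size_t local_size) {
    return ceilingDivide(nx, local_size) * local_size;
}

// Host-side emulation of the masked launch: iterates over every global index
// of the padded range and counts those that pass the `i0 < nx` check.
std::size_t countActive(std::size_t nx, std::size_t local_size) {
    const std::size_t global_size = paddedGlobalSize(nx, local_size);
    std::size_t active = 0;
    for (std::size_t i0 = 0; i0 < global_size; ++i0) {
        if (i0 < nx) {  // same mask as in the kernel
            ++active;
        }
    }
    return active;
}
```

With `nx = 1000` and a local size of 256, this yields four work-groups and a global size of 1024, of which 24 threads are masked out.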

In all cases, explicit synchronization with the GPU is performed by calling:

```cpp
q.wait();
```

The full example is available in [print-numbers-sycl.cpp](../src/print-numbers/print-numbers-sycl.cpp), and can be compiled and executed using the following cells.

In [None]:
!icpx -O3 -march=native -std=c++17 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_86 -o ../build/print-numbers/print-numbers-sycl ../src/print-numbers/print-numbers-sycl.cpp

In [None]:
!../build/print-numbers/print-numbers-sycl

## Level 1: Adding Managed Memory

Our next application is increasing all elements of an array by one.

[increase-base.cpp](../src/increase/increase-base.cpp) shows a serial CPU-only implementation.
Its key part, and our entry point, is the `increase` function.

```cpp
void increase(double* data, size_t nx) {
    for (size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}
```
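
To see this entry point in context, the following is a rough sketch of how `increase` might be driven and checked on the host. The driver `runIncrease` and its initialization pattern are hypothetical and are not taken from `increase-base.cpp`.

```cpp
#include <cstddef>
#include <vector>

// Serial baseline from above: adds one to every element.
void increase(double* data, std::size_t nx) {
    for (std::size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}

// Hypothetical driver: initializes data[i] = i, applies increase once,
// and reports whether every element now holds i + 1.
bool runIncrease(std::size_t nx) {
    std::vector<double> data(nx);
    for (std::size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] = static_cast<double>(i0);
    }

    increase(data.data(), nx);

    for (std::size_t i0 = 0; i0 < nx; ++i0) {
        if (data[i0] != static_cast<double>(i0) + 1.0) {
            return false;
        }
    }
    return true;
}
```

A check of this kind can be reused unchanged for the SYCL variants, since they are expected to produce the same result as the serial version.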

Data allocation and deallocation are done explicitly in SYCL.
For managed memory, this works as follows (again assuming an initialized queue `q`).

```cpp
double *data;                  // unified allocation
data = sycl::malloc_shared<double>(nx, q);

/* ... */

sycl::free(data, q);           // unified de-allocation
```

Pre-fetching *to the device* can be performed as an additional optimization.

```cpp
q.prefetch(data, nx * sizeof(double));
```

The allocated (managed) arrays can be used directly in kernels.

```cpp
q.submit([&](sycl::handler &h) {
    h.parallel_for(nx, [=](auto i0) {
        data[i0] += 1;
    });
});
```

The complete example code is available in [increase-sycl-mm.cpp](../src/increase/increase-sycl-mm.cpp), and can be built and executed using the following cells.

In [None]:
!icpx -O3 -march=native -std=c++17 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_86 -o ../build/increase/increase-sycl-mm ../src/increase/increase-sycl-mm.cpp

In [None]:
!../build/increase/increase-sycl-mm

## Level 2: Switching to Explicit Memory Management

Switching from managed memory to explicit memory management requires the following changes and additions:

**1.** Separate device and host allocations

```cpp
double *data;                  // host allocation
data = sycl::malloc_host<double>(nx, q);

double *d_data;                // device allocation
d_data = sycl::malloc_device<double>(nx, q);

/* ... */

sycl::free(d_data, q);         // device de-allocation

sycl::free(data, q);           // host de-allocation
```

**2.** Explicit copies between host and device

```cpp
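q.memcpy(d_data, data, sizeof(double) * nx); // host to device, before the kernel
q.memcpy(data, d_data, sizeof(double) * nx); // device to host, after the kernel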
```

The complete example code is available in [increase-sycl-expl.cpp](../src/increase/increase-sycl-expl.cpp).
Build and execute it using the following cells.

In [None]:
!icpx -O3 -march=native -std=c++17 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_86 -o ../build/increase/increase-sycl-expl ../src/increase/increase-sycl-expl.cpp

In [None]:
!../build/increase/increase-sycl-expl

## Bonus Level: Using SYCL Buffers

In addition to the two variants of managing memory discussed so far, SYCL also offers a buffer and accessor system.
This, among other things, allows for setting up dependency graphs and automatic host-device data transfers.

```cpp
double *data;                  // host allocation
data = new double[nx];

{
    // device buffer allocation
    sycl::buffer b_data(data, sycl::range(nx));
} // implicit device-to-host copy when the buffer is destroyed
```

When using buffers, kernels cannot use the buffer directly; you must create an accessor inside the submitted function to access the data.

```cpp
q.submit([&](sycl::handler &h) {
    auto data = b_data.get_access(h, sycl::read_write);
    h.parallel_for(nx, [=](auto i0) {
        data[i0] += 1;
    });
});
```

The complete example code is available in [increase-sycl-buffer.cpp](../src/increase/increase-sycl-buffer.cpp), and can be built and executed using the following cells.

In [None]:
!icpx -O3 -march=native -std=c++17 -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_86 -o ../build/increase/increase-sycl-buffer ../src/increase/increase-sycl-buffer.cpp

In [None]:
!../build/increase/increase-sycl-buffer