# Data Handling

As described in the introduction, data organization on heterogeneous systems requires strategies for:
* Allocating and deallocating memory on the host and device (explicit memory, EM), _or_
* Allocating and deallocating memory in a unified virtual address space (managed memory, MM),
as well as for
* Copying data (EM) _or_ migrating data (MM) between host and device.

The following sections summarize how each approach addresses these requirements.

## CUDA / HIP

CUDA and HIP mainly use C-style pointers.
Where not noted otherwise, HIP API functions are identical to their CUDA counterparts with the exception of being prefixed with `hip` instead of `cuda`.

### EM

A common pattern uses two separate pointers for the host and device allocations, which share the same *data layout*.
To make this relationship clear to readers, the same variable name is often used for both, with a `d_` prefix for the device version.
Additionally prefixing host versions with `h_` is even more explicit, but less common in practice.

```cpp
double *data;                  // host allocation
cudaMallocHost((void **)&data, sizeof(double) * nx);        // HIP uses hipHostMalloc

double *d_data;                // device allocation
cudaMalloc((void **)&d_data, sizeof(double) * nx);

/* ... */

cudaFree(d_data);              // device de-allocation

cudaFreeHost(data);            // host de-allocation        // HIP uses hipHostFree
```

Data is transferred with `cudaMemcpy(target, source, bytesToTransfer, direction)`, where `direction` is, for instance, `cudaMemcpyHostToDevice` or `cudaMemcpyDeviceToHost`. For example:

```cpp
cudaMemcpy(d_data, data, sizeof(double) * nx, cudaMemcpyHostToDevice);
```

### MM

```cpp
double *data;                  // unified allocation
cudaMallocManaged((void **)&data, sizeof(double) * nx);

/* ... */

cudaFree(data);                // unified de-allocation
```

Data migration between host and device is implicit, but it can also be triggered explicitly via prefetching to improve performance.

```cpp
cudaMemPrefetchAsync(data, sizeof(double) * nx, 0 /* deviceId */);   // host to device
cudaMemPrefetchAsync(data, sizeof(double) * nx, cudaCpuDeviceId);    // device to host
```

## Thrust

Thrust provides GPU-aware C++ containers as alternatives to raw pointers.
Like Thrust's algorithms, they closely mirror the STL API.

### EM

Host and device data can be allocated similarly to `std::vector`.

```cpp
thrust::host_vector<double> data(nx);
thrust::device_vector<double> d_data(nx);
```

De-allocation happens implicitly in the container's destructor.

Data is copied between host and device with `thrust::copy`; the transfer direction follows from the iterator types.

```cpp
thrust::copy(data.begin(), data.end(),
             d_data.begin());
```

### MM

Thrust also offers containers for managed memory allocations.

```cpp
thrust::universal_vector<double> data(nx);
```

## SYCL

SYCL supports two primary memory management models:
* A CUDA/HIP-like, pointer-based approach (unified shared memory, USM) supporting both EM and MM
* SYCL buffers, which enable automatic dependency analysis

All examples assume an initialized SYCL queue `q`.

### EM

```cpp
double *data;                  // host allocation
data = sycl::malloc_host<double>(nx, q);

double *d_data;                // device allocation
d_data = sycl::malloc_device<double>(nx, q);

/* ... */

sycl::free(d_data, q);         // device de-allocation

sycl::free(data, q);           // host de-allocation
```

```cpp
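// host-to-device copy: q.memcpy(target, source, bytesToTransfer)
// the call is asynchronous and returns a sycl::event
q.memcpy(d_data, data, sizeof(double) * nx);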
```

### MM

```cpp
double *data;                  // unified allocation
data = sycl::malloc_shared<double>(nx, q);

/* ... */

sycl::free(data, q);           // unified de-allocation
```

Pre-fetching *to the device* can be performed as an additional optimization.

```cpp
q.prefetch(data, nx * sizeof(double));
```

### Buffer

```cpp
double *data;                  // host allocation
data = new double[nx];

{
    // device buffer allocation
    sycl::buffer b_data(data, sycl::range(nx));
} // implicit device-to-host copy when the buffer is destroyed

delete[] data;                 // host de-allocation
```

## Kokkos

Kokkos manages data through views, which are allocated in memory spaces.
The default memory space depends on the compile-time configuration: if a GPU backend is enabled, views default to device allocations.
Matching host allocations can then be created as mirror views.

```cpp
{
    // device allocation
    Kokkos::View<double *> data("data", nx);

    // host allocation
    auto h_data = Kokkos::create_mirror_view(data);
} // implicit de-allocation
```

```cpp
// deep_copy(destination, source): here host to device
Kokkos::deep_copy(data, h_data);
```

## OpenMP

OpenMP handles data staging, including (de-)allocation and transfers, via structured *target data* regions.
Note that the data pointer has been renamed to `field` to avoid confusion with the `data` keyword of the directives.

```cpp
auto field = new double[nx];   // host allocation

#pragma omp target data map(tofrom : field[0:nx])
{ // device allocation and H2D transfer

    /* ... */

} // device de-allocation and D2H transfer

delete[] field;                // host de-allocation
```
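
As a minimal sketch of how such a region might enclose actual work, the hypothetical helper `scale_on_device` below (its name and signature are illustrative assumptions, not part of the original material) wraps a single offloaded loop in a structured target data region. It assumes a compiler with OpenMP offloading enabled; without OpenMP support the pragmas are ignored and the loop simply runs on the host with the same result.

```cpp
#include <cstddef>

// Hypothetical helper (illustration only): multiplies every element of
// `field` by `factor` inside a structured target data region.
void scale_on_device(double *field, std::size_t nx, double factor) {
    #pragma omp target data map(tofrom : field[0:nx])
    {   // device allocation and H2D transfer

        // the loop reuses the mapping established by the enclosing region
        #pragma omp target teams distribute parallel for
        for (std::size_t i = 0; i < nx; ++i)
            field[i] *= factor;

    }   // D2H transfer and device de-allocation
}
```

A caller only sees host memory: after `scale_on_device(field, nx, 2.0)` returns, the host array holds the scaled values.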

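For more flexibility, for example when the lifetime of the device data does not match a lexical scope, unstructured `enter data` and `exit data` directives are also available.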

```cpp
#pragma omp target enter data map(to   : field[0:nx])

/* ... */

#pragma omp target exit  data map(from : field[0:nx])
```
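
The following sketch illustrates one possible use of the unstructured directives: data stays resident on the device across several offloaded loops, and a single value is refreshed on the host in between. The helper `run_steps` is a hypothetical name, and the `target update` directive is an addition that the text above does not cover. The sketch assumes `nx >= 1` and falls back to host execution if offloading is unavailable.

```cpp
#include <cstddef>

// Hypothetical helper (illustration only): increments every element of
// `field` once per step and returns field[0] as seen by the host after
// an intermediate device-to-host update. Assumes nx >= 1.
double run_steps(double *field, std::size_t nx, int steps) {
    // device allocation and H2D transfer
    #pragma omp target enter data map(to : field[0:nx])

    for (int s = 0; s < steps; ++s) {
        // data is already present on the device, no transfer per step
        #pragma omp target teams distribute parallel for
        for (std::size_t i = 0; i < nx; ++i)
            field[i] += 1.0;
    }

    // refresh only the first element on the host
    #pragma omp target update from(field[0:1])
    const double probe = field[0];

    // D2H transfer of the full array and device de-allocation
    #pragma omp target exit data map(from : field[0:nx])
    return probe;
}
```

Because the enter and exit directives are independent statements, they could equally be placed in different functions, such as a setup and a teardown routine.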

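With managed memory, explicit target data regions and `map` clauses are typically unnecessary, since host allocations can be accessed from the device directly.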

## OpenACC

OpenACC follows the same data management principles as OpenMP, with slightly different syntax.

```cpp
#pragma acc enter data copyin (field[0:nx])   // device allocation and H2D transfer
/* ... */
#pragma acc exit  data copyout(field[0:nx])   // D2H transfer and device de-allocation
```
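
Analogous to OpenMP's structured target data regions, OpenACC also offers a structured `data` construct. The sketch below is an illustration under assumptions: the helper name `scale_on_device_acc` is hypothetical, and an OpenACC-capable compiler is required for actual offloading (other compilers ignore the pragmas and run the loop on the host).

```cpp
#include <cstddef>

// Hypothetical helper (illustration only): multiplies every element of
// `field` by `factor` inside a structured OpenACC data region.
void scale_on_device_acc(double *field, std::size_t nx, double factor) {
    #pragma acc data copy(field[0:nx])
    {   // device allocation and H2D transfer

        // `present` states that the data is already on the device
        #pragma acc parallel loop present(field[0:nx])
        for (std::size_t i = 0; i < nx; ++i)
            field[i] *= factor;

    }   // D2H transfer and device de-allocation
}
```

Here the `copy` clause plays the role of OpenMP's `map(tofrom : ...)`.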

## Next Step

Proceed to the [parallel computation](./parallel-computation.ipynb) notebook.