# Introduction

As described in the introductory slides, GPU programming requires strategies for data handling and parallel computation on the device.
Below is a summary of the key takeaways and connection points to the next notebooks.

## Data Handling 

Data organization on heterogeneous systems requires:
* Allocating and deallocating memory, either separately on the host and the device (explicit memory, EM) _or_ in a unified virtual address space (managed memory, MM)
* Moving data between host and device, either by copying it (EM) _or_ by migrating it (MM)

## Parallel Computation

Parallel computation on GPUs requires:

**1. Trigger execution on GPUs**

**2. Spawn threads**

GPUs, like other hardware components, are designed with a hierarchical structure.
To utilize the hardware efficiently, threads are typically organized hierarchically as well. The following list orders the levels of each programming model from finest to coarsest:

* CUDA/HIP: *thread* > *block* > *grid*
* SYCL: *work-item* > *work-group* > *nd-range*
* OpenMP: *thread* > *team* > *league*
* OpenACC: *thread* > *vector* > *worker* > *gang*
* Kokkos: *thread* > *team* > *league*

On the hardware level, threads are further grouped as follows:
* *Warps* of 32 threads on NVIDIA GPUs
* *Wavefronts* of 64 threads on AMD GPUs (RDNA-based GPUs default to 32)
* *Sub-groups* or *sub-workgroups* on Intel GPUs

**3. Map threads**

Each thread executes the same set of operations.
To differentiate them, each thread is assigned one or more IDs or indices, which are used to calculate *globally unique thread indices*.
These indices are then used to map threads to specific portions of the work.

CUDA/HIP make this explicit by providing *built-in thread variables* that yield different values depending on the evaluating thread.
A global thread index is commonly computed from the block index, the block-local thread index, and the block size (number of threads per block) as follows:
```cpp
blockIdx.x * blockDim.x + threadIdx.x
```

SYCL and Kokkos provide a global index as a single lambda parameter.

OpenMP and OpenACC internally map existing loop indices onto threads.

Many standard algorithms do not expose indices directly, but instead operate on references to elements of the input/output data structures.

**4. Synchronize**

Waiting for the GPU to finish outstanding work can be done either:
* implicitly at the end of GPU code sections (OpenMP, OpenACC), or
* via specific API function calls.

### Implementation

Generally, there are three main approaches to implementing parallel computations on GPUs:
* Writing a dedicated GPU kernel (function) as a separate code section, launched from the host code
* Defining an inline kernel for better language integration, while still exposing a GPU-specific implementation
* Relying on automatic conversion of code originally written for CPU execution

## Example Codes

### Hello World

Our first example is (almost) a hello world application.
Instead of printing a single predefined message, it prints a range of numbers:

```cpp
for (size_t i0 = 0; i0 < nx; ++i0) {
    printf("%zu\n", i0);  // %zu is the matching format specifier for size_t
}
```

The full example is available in [print-numbers-base.cpp](../src/print-numbers/print-numbers-base.cpp), and can be compiled and executed using the following cells.

In [None]:
!g++ -O3 -march=native -std=c++17 -o ../build/print-numbers/print-numbers-base ../src/print-numbers/print-numbers-base.cpp

In [None]:
!../build/print-numbers/print-numbers-base

### Vector Increase

The hello world example does not require handling memory.
Our next test case, increasing all elements of an array by one, is more complex in this respect.

[increase-base.cpp](../src/increase/increase-base.cpp) shows a serial CPU-only implementation.
Its key part is the `increase` function:

```cpp
void increase(double* data, size_t nx) {
    for (size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}
```

Other tasks performed by our application include:
* Parsing command line arguments:
    * `nx`: the number of elements in the vector to be processed
    * `nItWarmUp`: the number of warm-up iterations
    * `nIt`: the number of timed iterations
* Allocating an array with `nx` elements
* Initializing the array so that each element holds a value equal to its index
* Calling `increase` for `nItWarmUp` iterations
* Calling `increase` for `nIt` iterations and measuring the time taken
* Printing statistics and estimated performance metrics
* Verifying that all array elements have the expected value
* Deallocating the array

You can compile and execute the code using the following cells:

In [None]:
!g++ -O3 -march=native -std=c++17 -o ../build/increase/increase-base ../src/increase/increase-base.cpp

In [None]:
!../build/increase/increase-base