# Thrust

## Level 0: Hello World

We start with serial CPU code that prints a range of numbers:

```cpp
for (size_t i0 = 0; i0 < nx; ++i0) {
    printf("%zu\n", i0);  // %zu is the portable host-side format for size_t
}
```

The full example is available in [print-numbers-base.cpp](../src/print-numbers/print-numbers-base.cpp), and can be compiled and executed using the following cells.

In [None]:
!g++ -O3 -march=native -std=c++17 -o ../build/print-numbers/print-numbers-base ../src/print-numbers/print-numbers-base.cpp

In [None]:
!../build/print-numbers/print-numbers-base

For more control and support for additional computational patterns, Thrust is a strong alternative to 'standard' modern C++.

It provides GPU-accelerated versions of many STL algorithms, as well as additional ones.
Thrust algorithms also accept an *execution policy* argument, which specifies *where* the computation is performed (unlike the `std::execution` policies, which describe *how* an algorithm may be parallelized).

```cpp
thrust::for_each(thrust::device, thrust::make_counting_iterator<size_t>(0), thrust::make_counting_iterator<size_t>(nx),
                 [=] __device__ (const auto &i0) {
                    printf("%ld\n", i0);
                 });
```

Like STL algorithms, their Thrust counterparts are synchronous with respect to the CPU, i.e. no explicit GPU synchronization is necessary.
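For comparison, the Thrust call above can be mirrored on the CPU with the STL alone. The following is a minimal sketch, not part of the example sources: `for_each_index` is a hypothetical helper name, and because C++17 has no counterpart to `thrust::make_counting_iterator`, the indices are materialized in a vector first.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (an assumption for illustration): CPU-only STL analogue
// of thrust::for_each over a counting range. It calls f(i) for i = 0 .. nx-1.
template <typename F>
void for_each_index(std::size_t nx, F f) {
    // No counting iterator in C++17, so store the indices explicitly.
    std::vector<std::size_t> indices(nx);
    std::iota(indices.begin(), indices.end(), std::size_t{0});

    // Runs serially on the host and returns only after all calls have finished.
    std::for_each(indices.begin(), indices.end(), f);
}
```

A call such as `for_each_index(nx, [](std::size_t i0) { std::printf("%zu\n", i0); });` (with `<cstdio>` included) prints the same range as the serial loop. The Thrust version avoids the temporary index vector because counting iterators generate the indices on the fly.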

The complete example code is available in [print-numbers-thrust.cu](../src/print-numbers/print-numbers-thrust.cu).
Build and execute it using the following cells.

In [None]:
!nvcc -O3 -std=c++17 --extended-lambda -arch=sm_86 -o ../build/print-numbers/print-numbers-thrust ../src/print-numbers/print-numbers-thrust.cu

In [None]:
!../build/print-numbers/print-numbers-thrust

## Level 1: Adding Managed Memory

Our next application increases every element of an array by one.

[increase-base.cpp](../src/increase/increase-base.cpp) shows a serial CPU-only implementation.
Its key part, and our starting point, is the `increase` function.

```cpp
void increase(double* data, size_t nx) {
    for (size_t i0 = 0; i0 < nx; ++i0) {
        data[i0] += 1;
    }
}
```

Thrust provides GPU-aware C++ containers as alternatives to raw pointers.
Like Thrust's algorithms, they closely mirror the STL API.

`thrust::universal_vector` allocates its storage in managed memory, which is accessible from both host and device.

```cpp
thrust::universal_vector<double> data(nx);
```

De-allocation is done implicitly when the destructor of the container is called.

The parallelized computation can be implemented in different ways.

**1.** With a transform operation

```cpp
thrust::transform(thrust::device, data.begin(), data.end(), data.begin(),
                  [=] __host__ __device__ (double data_elem) {
                      return data_elem + 1;
                  }); // implicit synchronization
```

**2.** With a counting iterator

```cpp
double *data_ptr = thrust::raw_pointer_cast(data.data());
thrust::for_each(thrust::device, thrust::make_counting_iterator<size_t>(0), thrust::make_counting_iterator<size_t>(nx),
                 [=] __host__ __device__ (size_t i0) {
                     data_ptr[i0] += 1;
                 });
```

**3.** With tabulate

Going beyond the capabilities of the STL, Thrust provides the `tabulate` pattern.
It applies a function to the *index* of each element and stores the result in that element.

```cpp
double *data_ptr = thrust::raw_pointer_cast(data.data());
thrust::tabulate(thrust::device, data.begin(), data.end(),
                 [=] __host__ __device__ (size_t i0) {
                     return data_ptr[i0] + 1;
                 });
```
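Since the STL has no direct counterpart, the semantics of the tabulate pattern can be sketched in plain C++. This is a CPU-only illustration under stated assumptions: `tabulate_sketch` is a hypothetical name, the loop is serial, and `std::vector` stands in for the Thrust container.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU-only sketch of the tabulate pattern:
// element i of [first, last) is overwritten with f(i).
template <typename Iterator, typename F>
void tabulate_sketch(Iterator first, Iterator last, F f) {
    std::size_t i = 0;
    for (Iterator it = first; it != last; ++it, ++i) {
        *it = f(i);
    }
}

// The increase operation expressed with the sketch. As in the Thrust variant,
// the function object reads the old value through a raw pointer, because it
// only receives the index.
void increase(std::vector<double> &data) {
    double *data_ptr = data.data();
    tabulate_sketch(data.begin(), data.end(),
                    [=](std::size_t i0) { return data_ptr[i0] + 1; });
}
```

Each element is read and written only at its own index, so the result does not depend on the order in which the indices are processed. That independence is what makes the pattern safe to run in parallel.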

The complete example code is available in [increase-thrust-mm.cu](../src/increase/increase-thrust-mm.cu), and can be built and executed using the following cells.

In [None]:
!nvcc -O3 -std=c++17 --extended-lambda -arch=sm_86 -o ../build/increase/increase-thrust-mm ../src/increase/increase-thrust-mm.cu

In [None]:
!../build/increase/increase-thrust-mm 

## Level 2: Switching to Explicit Memory Management

With explicit memory management, host and device data are held in separate containers, each allocated similarly to `std::vector`.

```cpp
thrust::host_vector<double> data(nx);
thrust::device_vector<double> d_data(nx);
```

De-allocation is done implicitly when the destructor of the container is called.

Data is copied between host and device with the (synchronous) `thrust::copy` function; the direction follows from the source and destination iterators.

```cpp
thrust::copy(data.begin(), data.end(),
             d_data.begin());
```
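The overall shape of the explicit workflow (copy in, compute, copy out) can be sketched on the CPU with the STL. This is only a stand-in under loud assumptions: both buffers are `std::vector`, `increase_roundtrip` is a hypothetical name, and no data actually crosses a host/device boundary. In the Thrust version the buffers would be `thrust::host_vector` and `thrust::device_vector`, and the calls would be `thrust::copy` and `thrust::transform`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical CPU-only stand-in for the explicit-memory workflow.
// The input is left untouched; the increased values are returned.
std::vector<double> increase_roundtrip(const std::vector<double> &data) {
    std::vector<double> d_data(data.size());  // stand-in for the device buffer
    std::vector<double> result(data.size());  // host buffer for the result

    // Step 1: "host to device" copy.
    std::copy(data.begin(), data.end(), d_data.begin());

    // Step 2: compute on the "device" buffer, in place.
    std::transform(d_data.begin(), d_data.end(), d_data.begin(),
                   [](double data_elem) { return data_elem + 1; });

    // Step 3: "device to host" copy of the result.
    std::copy(d_data.begin(), d_data.end(), result.begin());
    return result;
}
```

The point of the sketch is the ordering: the host only sees updated values after the final copy back, which is the step that managed memory in Level 1 made implicit.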

The complete example code is available in [increase-thrust-expl.cu](../src/increase/increase-thrust-expl.cu).
Build and execute it using the following cells.

In [None]:
!nvcc -O3 -std=c++17 --extended-lambda -arch=sm_86 -o ../build/increase/increase-thrust-expl ../src/increase/increase-thrust-expl.cu

In [None]:
!../build/increase/increase-thrust-expl 