### GPU Offloading with OpenMP: The Simplest Example

OpenMP allows code blocks in C/C++ and Fortran applications to be offloaded to accelerators. Recent OpenMP implementations support offloading to AMD, NVIDIA, and Intel GPUs. In this tutorial, we will focus on NVIDIA GPUs only.

#### Environment

First, make sure that your environment is set up for GPU offloading. You may want to look at the following tutorials:

 * [Setting up NVIDIA GPU offloading for OpenMP-GCC on Ubuntu](https://github.com/TeachingUndergradsCHC/modules/blob/master/Algorithms/hybrid_algorithms/resources/gpu_offloading.md) 
 * [OpenMP Device Check](../device_check.ipynb)
 
 
#### The `target` Directive 

The `omp target` directive can be used to offload tasks to the GPU. The general format of the `target` directive is shown below:

```C++

// ... code here executes on the CPU (host) 

// code in the succeeding block executes 
// on the GPU (device)
#pragma omp target      
{
   for (...)
     ...;
   ...
}

// ... code here executes on the CPU (host)
```

Using the template above we can execute any code block on the GPU. Consider the code below that scales the values in a floating-point array. 

In [12]:
%%writefile gpu_hello_world.cpp
#include<iostream>
#include<omp.h>

using namespace std;

int main() {
  unsigned N = 10000;
  float data[N];

  for (unsigned i = 0; i < N; i++) 
    data[i] = i;

  #pragma omp target
  for (unsigned i = 0; i < N; i++) 
    data[i] *= 3.14;
  
  cout << "Computation Done!" << endl; 
  
  // verify results
  for (unsigned i = 1; i < 2; i++) 
    cout << "data[1] = " << data[i] << endl;

  return 0;
}

Overwriting gpu_hello_world.cpp


The `omp target` directive has been placed immediately before the `for` loop that performs the scaling. As a result, that `for` loop is executed on the GPU, while the rest of the code executes on the CPU. Let's try it out.

To compile an OpenMP program with GPU offloading, we need to pass two flags in addition to `-fopenmp`: `-fno-stack-protector` and `-foffload=nvptx-none`. See [Setting up NVIDIA GPU offloading for OpenMP-GCC on Ubuntu](https://github.com/TeachingUndergradsCHC/modules/blob/master/Algorithms/hybrid_algorithms/resources/gpu_offloading.md) for why they are necessary.

In [13]:
!g++ -o gpu_hello_world gpu_hello_world.cpp -fno-stack-protector -foffload=nvptx-none -fopenmp

If you are doing this tutorial on your own machine and you get a compilation error, go through the set-up tutorials and make sure you have a CUDA-capable GPU that's being picked up by the NVIDIA driver and the device is "connected" to OpenMP. 

No extra steps are necessary to run an OpenMP application with offloading, so we can run the application simply as follows:

In [14]:
!./gpu_hello_world

Computation Done!
data[1] = 3.14


The code seems to be working. But do we know if the task was actually offloaded to the GPU? No! In certain cases, OpenMP may ignore the directive and just run the code on the host. To check that a GPU kernel is running we can profile the code with `nvprof`. This will tell us how much time is being spent on the GPU, if any. 

In [17]:
!/usr/local/cuda/bin/nvprof ./gpu_hello_world

==24187== NVPROF is profiling process 24187, command: ./gpu_hello_world
Computation Done!
data[1] = 3.14
==24187== Profiling application: ./gpu_hello_world
==24187== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.88%  12.091ms         1  12.091ms  12.091ms  12.091ms  main$_omp_fn$0
                    0.07%  8.6720us         2  4.3360us     832ns  7.8400us  [CUDA memcpy HtoD]
                    0.05%  5.7280us         1  5.7280us  5.7280us  5.7280us  [CUDA memcpy DtoH]
      API calls:   72.92%  193.87ms         1  193.87ms  193.87ms  193.87ms  cuCtxCreate
                   17.93%  47.668ms         1  47.668ms  47.668ms  47.668ms  cuCtxDestroy
                    4.55%  12.098ms         1  12.098ms  12.098ms  12.098ms  cuCtxSynchronize
                    2.91%  7.7299ms         1  7.7299ms  7.7299ms  7.7299ms  cuModuleLoadData
                    0.95%  2.5219ms        20  126.10us  34.412us  1.1220ms  cuLink

Indeed, the `for` loop has been offloaded and ran on the GPU for 12 milliseconds. Of course, this doesn't buy us any performance, since we haven't actually parallelized the code and so are not taking advantage of the GPU's parallel resources. We can run the CPU-only version of the code for comparison.

In [18]:
%%writefile cpu_hello_world.cpp
#include<iostream>
#include<omp.h>

using namespace std;

int main() {
  unsigned N = 10000;
  float data[N];

  for (unsigned i = 0; i < N; i++) 
    data[i] = i;

  for (unsigned i = 0; i < N; i++) 
    data[i] *= 3.14;
  
  cout << "Computation Done!" << endl; 
  
  // verify results
  for (unsigned i = 1; i < 2; i++) 
    cout << "data[1] = " << data[i] << endl;

  return 0;
}

Writing cpu_hello_world.cpp


In [19]:
!g++ -o cpu_hello_world cpu_hello_world.cpp -fopenmp

In [23]:
!time ./cpu_hello_world

Computation Done!
data[1] = 3.14

real	0m0.003s
user	0m0.003s
sys	0m0.000s


Not surprisingly, the CPU version is much faster: the entire program finishes in about 3 ms, whereas the GPU kernel alone took 12 ms.

### Offloading Parallel Code 

We almost never want to offload sequential tasks to the GPU. Any code that we want to offload to the GPU should be parallelized first. OpenMP makes this part easy as well. We can insert any `omp` pragma inside the offloaded region to parallelize the code (there are a few exceptions, which we will discuss later).

In our example, the `for` loop that scales the values in the `data` array can be parallelized with the `parallel for` pragma. 

In [1]:
%%writefile gpu_hello_world_par.cpp
#include<iostream>
#include<omp.h>

using namespace std;

int main() {
  unsigned N = 10000;
  float data[N];

  for (unsigned i = 0; i < N; i++) 
    data[i] = i;

  #pragma omp target
  #pragma omp parallel for 
  for (unsigned i = 0; i < N; i++) 
    data[i] *= 3.14;
  
  cout << "Computation Done!" << endl; 
  
  // verify results
  for (unsigned i = 1; i < 2; i++) 
    cout << "data[1] = " << data[i] << endl;

  return 0;
}

Overwriting gpu_hello_world_par.cpp


In this version, the parallelized `for` loop is offloaded to the GPU. Let's see if this makes a difference by compiling, running, and profiling it.

In [2]:
!g++ -o gpu_hello_world_par gpu_hello_world_par.cpp -fno-stack-protector -foffload=nvptx-none -fopenmp

In [3]:
!./gpu_hello_world_par

Computation Done!
data[1] = 3.14


In [5]:
!/usr/local/cuda/bin/nvprof ./gpu_hello_world_par 2>&1 | grep main

 GPU activities:   99.15%  1.6688ms         1  1.6688ms  1.6688ms  1.6688ms  main$_omp_fn$0


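The GPU kernel is now about 7 times faster than before (1.67 ms versus 12.09 ms). Its kernel time is also lower than the 3 ms total run time of the sequential CPU version, although the kernel time alone does not include GPU start-up costs such as context creation. Let's write a parallel version for the CPU (i.e., just take out the offload pragma).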

In [13]:
%%writefile cpu_hello_world_par.cpp
#include<iostream>
#include<omp.h>

using namespace std;

int main() {
  unsigned N = 10000;
  float data[N];

  for (unsigned i = 0; i < N; i++) 
    data[i] = i;

  #pragma omp parallel for 
  for (unsigned i = 0; i < N; i++) 
    data[i] *= 3.14;
  
  cout << "Computation Done!" << endl; 
  
  // verify results
  for (unsigned i = 1; i < 2; i++) 
    cout << "data[1] = " << data[i] << endl;

  return 0;
}

Overwriting cpu_hello_world_par.cpp


In [14]:
!g++ -o cpu_hello_world_par cpu_hello_world_par.cpp -fopenmp

In [15]:
!time ./cpu_hello_world_par

Computation Done!
data[1] = 3.14

real	0m0.008s
user	0m0.052s
sys	0m0.000s


On the CPU, the parallel version is actually slower than the sequential one (8 ms versus 3 ms of wall-clock time). _Can we explain this behavior?_

Notice that in this example, each thread does very little work per iteration (just one multiplication and one store). On the CPU, which consists of a few complex, powerful cores, the overhead of creating and scheduling threads can outweigh the benefit of this kind of fine-grained parallelism. The GPU, on the other hand, consists of _many_ simple cores and can execute this type of parallel code more efficiently.

### Summary

We have learned how to offload parallel tasks to the GPU using the `target` directive. This example is very simple, however: we let OpenMP make all the decisions about data mapping and thread creation and scheduling. To get better performance out of GPUs, we will want control over these. We will look at the various clauses associated with the `target` directive that provide mechanisms for explicit data mapping and creation of teams of threads.