OpenACC: Analysis

Steps in porting code to GPUs

The three key steps in porting to high performance accelerated code:
1. Analyze/Identify parallelism
2. Express data movement and parallelism
3. Optimize data movement and loop performance
4. Go back to 1!

Analyze your code to determine most likely places needing parallelization or optimization.
Parallelize your code by starting with the most time consuming parts and check for correctness.
Optimize your code to improve observed speed-up from parallelization.
One should generally start the process at the top with the analyze step. For complex applications, it's useful to have a profiling tool available to learn where your application is spending its execution time and to focus your efforts there. Since our example code is quite a bit simpler than a full application, we'll skip profiling the code and simply analyze the code by reading it

Compiler diagnostics

Compiler diagnostics is usually the first thing to check when starting the OpenACC work
- It can tell you what operations were actually performed
- Data copies that were made
- If and how the loops were parallelized
The diagnostics is very compiler dependent
- Compiler flags
- Level and formatting of information
Diagnostics is controlled by compiler flag -Minfo=option
Useful options:
- accel – operations related to the accelerator
- all – print all compiler output
- intensity – print loop computational intensity info

Work sharing construct: `loop`

Define a loop to be parallelized
- C/C++: #pragma acc loop [clauses]
- Fortran: !$acc loop [clauses]
- Must be followed by a C/C++ or Fortran loop construct.
- Combined constructs with parallel and kernels
  
  #pragma acc kernels loop / !$acc kernels loop
  
  #pragma acc parallel loop / !$acc parallel loop
Similar in functionality to OpenMP for/do construct
Loop index variables are private variables by default

Example: Adding two vectors

Adding two vectors

CPU

./examples/sum.c

OpenACC parallel

./examples/sum_parallel.c

OpenACC kernels

./examples/sum_kernels.c

`-Minfo`

$ pgcc -g -O3 -acc -Minfo=acc sum_parallel.c -o sum

main:
   21, Generating copy(vecA[:],vecB[:],vecC[:]) [if not already present]
   23, Generating Tesla code
       25, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

$ pgcc -g -O3 -acc -Minfo=accel sum_kernels.c -o sum

main:
   21, Generating copy(vecA[:],vecB[:],vecC[:]) [if not already present]
   23, Loop is parallelizable
       Generating Tesla code
       23, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Example: `PGI_ACC_TIME=1`

$ cat slurm-13186502.out  #output of sum_parallel.c

Accelerator Kernel Timing data
main  NVIDIA  devicenum=0
  time(us): 451
  21: data region reached 2 times
      21: data copyin transfers: 3
           device time(us): total=245 max=100 min=71 avg=81
      29: data copyout transfers: 3
           device time(us): total=206 max=72 min=67 avg=68
  23: compute region reached 1 time
      23: kernel launched 1 time
          grid: [800]  block: [128]
          elapsed time(us): total=41 max=41 min=41 avg=41
 Reduction sum: 1.2020569031119108

$ cat slurm-13186514.out

Accelerator Kernel Timing data

main NVIDIA devicenum=0

time(us): 453 21: data region reached 2 times 21: data copyin transfers: 3 device time(us): total=247 max=100 min=72 avg=82 26: data copyout transfers: 3 device time(us): total=206 max=73 min=66 avg=68 23: compute region reached 1 time 23: kernel launched 1 time grid: [800] block: [128] elapsed time(us): total=40 max=40 min=40 avg=40

Reduction sum: 1.2020569031119108

NVIDIA nvprof

NVIDIA nvprof provides a simple interface to collect on a target without using the GUI. (new NSight system)
- GPU profiling capabilities: High-level usage statistics, Timeline collection, Analysis metrics
- basic CPU sampling

The Himeno stencil benchmark

Solving Poisson’s equation for incompressible fluid by the Jacobi iteration method.

(https://blogs.fau.de/hager/archives/7850)

himeno code

c

./examples/himeno_C.c

fortran

../examples/himeno_f77.f

$ srun -n 1 nvprof --cpu-profiling on --cpu-profiling-mode top-down ./himeno.x
======== CPU profiling result (top down):
Time(%)      Time  Name
85.74%    18.16s  jacobi
 7.46%     1.58s  initmt
 0.94%     200ms  | ???
 6.75%     1.43s  __c_mcopy4_sky
 0.05%      10ms  __c_mcopy4

Summary

Profiling is essential for optimization
NVPROF and NVVP for NVIDIA platform

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2.01_openacc_profiling.rst

2.01_openacc_profiling.rst

OpenACC: Analysis

Steps in porting code to GPUs

Compiler diagnostics

Example: Adding two vectors

`-Minfo`

Example: `PGI_ACC_TIME=1`

NVIDIA nvprof

The Himeno stencil benchmark

Summary

Files

2.01_openacc_profiling.rst

Latest commit

History

2.01_openacc_profiling.rst

File metadata and controls

OpenACC: Analysis

Steps in porting code to GPUs

Compiler diagnostics

Work sharing construct: loop

Example: Adding two vectors

-Minfo

Example: PGI_ACC_TIME=1

NVIDIA nvprof

The Himeno stencil benchmark

Summary

Work sharing construct: `loop`

`-Minfo`

Example: `PGI_ACC_TIME=1`