OpenACC: Analysis

Steps in porting code to GPUs

  • The three key steps in porting code to high-performance accelerators:
    1. Analyze/Identify parallelism
    2. Express data movement and parallelism
    3. Optimize data movement and loop performance
    4. Go back to 1!


  • Analyze your code to determine the most likely places needing parallelization or optimization.
  • Parallelize your code, starting with the most time-consuming parts, and check for correctness.
  • Optimize your code to improve the speed-up observed from parallelization.
  • One should generally start the process at the top with the analyze step. For complex applications, it is useful to have a profiling tool available to learn where your application spends its execution time and to focus your efforts there. Since our example code is much simpler than a full application, we will skip profiling it and simply analyze the code by reading it.

Compiler diagnostics

  • Compiler diagnostics are usually the first thing to check when starting the OpenACC work
    • They can tell you which operations were actually performed
    • Which data copies were made
    • If and how the loops were parallelized
  • The diagnostics are very compiler-dependent
    • Compiler flags
    • Level and formatting of the information
  • With the PGI/NVIDIA compilers, diagnostics are controlled by the flag -Minfo=option (an example invocation follows this list)
  • Useful options:
    • accel – operations related to the accelerator
    • all – print all compiler output
    • intensity – print loop computational intensity info
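
For example, with the PGI/NVIDIA compilers used below, the options can typically be combined in a comma-separated list (the exact spelling may differ between compiler versions, so check your compiler's documentation):

$ pgcc -acc -Minfo=accel,intensity sum.c -o sum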

Work sharing construct: loop

  • Define a loop to be parallelized
    • C/C++: #pragma acc loop [clauses]
    • Fortran: !$acc loop [clauses]
    • Must be followed by a C/C++ or Fortran loop construct.
    • Combined constructs with parallel and kernels
      • #pragma acc kernels loop / !$acc kernels loop
      • #pragma acc parallel loop / !$acc parallel loop
  • Similar in functionality to the OpenMP for/do construct
  • Loop index variables are private variables by default
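
A minimal sketch of the combined parallel loop construct on a simple loop (an illustration only; the array names and size here are made up and this is not the course's sum_parallel.c, which is linked in the next section):

#include <stdio.h>

#define N 100000

int main(void)
{
    double a[N], b[N], c[N];

    /* Initialize on the host */
    for (int i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* Combined construct: parallel region + work-sharing loop.
       The loop index i is private by default; the compiler generates
       implicit data copies for a, b and c around the region. */
    #pragma acc parallel loop
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* With kernels the compiler decides itself how to parallelize:
       #pragma acc kernels loop */

    printf("c[0] = %f\n", c[0]);
    return 0;
}

Because the directives are pragmas, the same source still compiles and runs serially with a compiler that ignores them.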

Example: Adding two vectors


CPU

./examples/sum.c

OpenACC parallel

./examples/sum_parallel.c

OpenACC kernels

./examples/sum_kernels.c

-Minfo

$ pgcc -g -O3 -acc -Minfo=acc sum_parallel.c -o sum

main:
   21, Generating copy(vecA[:],vecB[:],vecC[:]) [if not already present]
   23, Generating Tesla code
       25, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

$ pgcc -g -O3 -acc -Minfo=accel sum_kernels.c -o sum

main:
   21, Generating copy(vecA[:],vecB[:],vecC[:]) [if not already present]
   23, Loop is parallelizable
       Generating Tesla code
       23, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
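
In both cases the compiler reports implicit copies of vecA, vecB and vecC around the compute region ("[if not already present]") and maps the loop to gangs with a vector length of 128, i.e. CUDA thread blocks (blockIdx.x) of 128 threads (threadIdx.x). With kernels the compiler must additionally prove that the loop is safe to parallelize, hence the extra "Loop is parallelizable" message.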

Example: PGI_ACC_TIME=1
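
The Accelerator Kernel Timing data below is printed by the PGI OpenACC runtime at program exit when the environment variable PGI_ACC_TIME is set at run time, for example (using the sum executable built above; the exact launch command depends on your system):

$ export PGI_ACC_TIME=1
$ srun -n 1 ./sum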

$ cat slurm-13186502.out  #output of sum_parallel.c

Accelerator Kernel Timing data
main  NVIDIA  devicenum=0
  time(us): 451
  21: data region reached 2 times
      21: data copyin transfers: 3
           device time(us): total=245 max=100 min=71 avg=81
      29: data copyout transfers: 3
           device time(us): total=206 max=72 min=67 avg=68
  23: compute region reached 1 time
      23: kernel launched 1 time
          grid: [800]  block: [128]
          elapsed time(us): total=41 max=41 min=41 avg=41
 Reduction sum: 1.2020569031119108

$ cat slurm-13186514.out

Accelerator Kernel Timing data
main  NVIDIA  devicenum=0
  time(us): 453
  21: data region reached 2 times
      21: data copyin transfers: 3
           device time(us): total=247 max=100 min=72 avg=82
      26: data copyout transfers: 3
           device time(us): total=206 max=73 min=66 avg=68
  23: compute region reached 1 time
      23: kernel launched 1 time
          grid: [800]  block: [128]
          elapsed time(us): total=40 max=40 min=40 avg=40

Reduction sum: 1.2020569031119108
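
Both runs tell the same story: of the roughly 450 µs of device time, about 245 µs are spent copying the three vectors to the device and about 206 µs copying results back, while the kernel itself takes only ~40 µs. For this small example the data transfers dominate, which is exactly the kind of observation the optimize step (data movement) should act on.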

NVIDIA nvprof

  • NVIDIA nvprof provides a simple command-line interface to collect profiles on a target without using the GUI (its role is being taken over by the newer Nsight tools)
    • GPU profiling capabilities: high-level usage statistics, timeline collection, analysis metrics
    • Basic CPU sampling

The Himeno stencil benchmark

Solving Poisson’s equation for an incompressible fluid by the Jacobi iteration method.

(figure: Himeno benchmark illustration, source: https://blogs.fau.de/hager/archives/7850)
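
As a rough illustration of the method (not the actual Himeno kernel, which uses a more elaborate stencil with coefficient arrays; the grid size, names and 7-point discretization here are simplifications), one Jacobi sweep for a 3D Poisson problem could look like:

#define NX 64
#define NY 64
#define NZ 64

/* One Jacobi sweep for -laplace(p) = rhs on a regular grid with spacing h.
   Each interior point is replaced by the average of its six neighbours plus
   the source term; sweeps are repeated, swapping p and pnew, until the
   solution stops changing. */
void jacobi_sweep(double pnew[NX][NY][NZ], double p[NX][NY][NZ],
                  double rhs[NX][NY][NZ], double h)
{
    double h2 = h * h;
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            for (int k = 1; k < NZ - 1; k++) {
                double s = p[i-1][j][k] + p[i+1][j][k]
                         + p[i][j-1][k] + p[i][j+1][k]
                         + p[i][j][k-1] + p[i][j][k+1];
                pnew[i][j][k] = (s + h2 * rhs[i][j][k]) / 6.0;
            }
}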

Himeno code

C

./examples/himeno_C.c

Fortran

../examples/himeno_f77.f

$ srun -n 1 nvprof --cpu-profiling on --cpu-profiling-mode top-down ./himeno.x
======== CPU profiling result (top down):
Time(%)      Time  Name
85.74%    18.16s  jacobi
 7.46%     1.58s  initmt
 0.94%     200ms  | ???
 6.75%     1.43s  __c_mcopy4_sky
 0.05%      10ms  __c_mcopy4
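
The jacobi routine accounts for about 86 % of the runtime, so that is the loop nest to analyze and port first, consistent with the advice above to focus your efforts where the execution time is spent.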

Summary

  • Profiling is essential for optimization
  • nvprof and NVVP for the NVIDIA platform (being superseded by the newer Nsight tools)