## NPB-openACC-C-FT Implementation
In this self-paced, hands-on lab, we will briefly explore some methods for accelerating the NPB FT benchmark (C version) with OpenACC.

Qichao Hong

---
Before we begin, let's verify [WebSockets](http://en.wikipedia.org/wiki/WebSocket) are working on your system.  To do this, execute the cell block below by giving it focus (clicking on it with your mouse) and hitting Ctrl-Enter, or by pressing the play button in the toolbar above.  If all goes well, you should see some output returned below the grey cell.  If not, please consult the [Self-paced Lab Troubleshooting FAQ](https://developer.nvidia.com/self-paced-labs-faq#Troubleshooting) to debug the issue.

In [1]:
print ("The answer should be three: " + str(1+2))

The answer should be three: 3


First, run the cell below to get some info about the GPUs on the server.

In [2]:
!nvidia-smi

Tue May 23 08:35:23 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 780 Ti  Off  | 0000:01:00.0     N/A |                  N/A |
| 26%   39C    P8    N/A /  N/A |    346MiB /  3017MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 780 Ti  Off  | 0000:02:00.0     N/A |                  N/A |
| 26%   39C    P8    N/A /  N/A |      1MiB /  3020MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                            

CPU: Intel i7-4960x

---
<p class="hint_trigger">If you have never before taken an IPython Notebook based self-paced lab from NVIDIA, click this green box.
      <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">The following video will explain the infrastructure we are using for this self-paced lab, as well as give some tips on its usage.  If you've never taken a lab on this system before, we highly encourage you to watch this short video first.<br><br>
<div align="center"><iframe width="640" height="390" src="http://www.youtube.com/embed/ZMrDaLSFqpY" frameborder="0" allowfullscreen></iframe></div>
<br>
<h2 style="text-align:center;color:red;">Attention Firefox Users</h2><div style="text-align:center; margin: 0px 25px 0px 25px;">There is a bug with Firefox related to setting focus in any text editors embedded in this lab. Even though the cursor may be blinking in the text editor, focus for the keyboard may not be there, and any keys you press may be applying to the previously selected cell.  To work around this issue, you'll need to first click in the margin of the browser window (where there are no cells) and then in the text editor.  Sorry for this inconvenience, we're working on getting this fixed.</div></div></div></div></p>

## Introduction to OpenACC

Open-specification OpenACC directives are a straightforward way to accelerate existing Fortran and C applications. With OpenACC directives, you provide hints via compiler directives (or 'pragmas') to tell the compiler where -- and how -- it should parallelize compute-intensive code for execution on an accelerator. 

If you've done parallel programming using OpenMP, OpenACC is very similar: using directives, applications can be parallelized *incrementally*, with little or no change to the Fortran, C or C++ source. Debugging and code maintenance are easier. OpenACC directives are designed for *portability* across operating systems, host CPUs, and accelerators. You can use OpenACC directives with GPU accelerated libraries, explicit parallel programming languages (e.g., CUDA), MPI, and OpenMP, *all in the same program.*

Watch the following short video introduction to OpenACC:

<div align="center"><iframe width="640" height="390" style="margin: 0 auto;" src="http://www.youtube.com/embed/c9WYCFEt_Uo" frameborder="0" allowfullscreen></iframe></div>

This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple C program, then compile and execute the newly enhanced code in each step. Along the way, hints and solutions are provided, so you can check your work, or take a peek if you get lost.

If you are confused now, or at any point in this lab, you can consult the <a href="#FAQ">FAQ</a> located at the bottom of this page.

# Step 1 - Characterize Your Application



The most difficult part of accelerator programming begins before the first line of code is written. If your program is not highly parallel, an accelerator or coprocessor won't be much use. Understanding the code structure is crucial if you are going to *identify opportunities* and *successfully* parallelize a piece of code. The first step in OpenACC programming then is to *characterize the application*. This includes:

+ benchmarking the single-thread, CPU-only version of the application
+ understanding the program structure and how data is passed through the call tree
+ profiling the application and identifying computationally-intense "hot spots"
    + which loop nests dominate the runtime?
    + what are the minimum/average/maximum tripcounts through these loop nests?
    + are the loop nests suitable for an accelerator?
+ ensuring that the algorithms you are considering for acceleration are *safely* parallel

Note: what we've just said may sound a little scary, so please note that as parallel programming methods go OpenACC is really pretty friendly: think of it as a sandbox you can play in. Because OpenACC directives are incremental, you can add one or two directives at a time and see how things work: the compiler provides a *lot* of feedback. The right software plus good tools plus educational experiences like this one should put you on the path to successfully accelerating your programs.

## Step 1 Profiling and Benchmarking

Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the `FT` file without making any changes -- and see how fast the serial program executes. This will establish a baseline for future comparisons.  Execute the following two cells to compile and run the program.

In [7]:
!cd ./NPB-acc/FT-seq/NPB3.3-SER-C/ && make veryclean && make FT CLASS=A

rm -f core 
rm -f *~ */core */*~ */*.o */npbparams.h */*.obj */*.exe
rm -f sys/setparams sys/makesuite sys/setparams.h
rm -f {DC/,}ADC.{logf,view,dat,viewsz,groupby,chunks}.* 
rm -f bin/sp.* bin/lu.* bin/mg.* bin/ft.* bin/bt.* bin/is.*
rm -f bin/ep.* bin/cg.* bin/ua.* bin/dc.*
   =      NAS PARALLEL BENCHMARKS 3.3        =
   =      Serial Versions                    =
   =      C                                  =

cd FT; make CLASS=A
make[1]: Entering directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/FT-seq/NPB3.3-SER-C/FT'
make[2]: Entering directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/FT-seq/NPB3.3-SER-C/sys'
gcc  -o setparams setparams.c
make[2]: Leaving directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/FT-seq/NPB3.3-SER-C/sys'
../sys/setparams ft A
gcc  -c -I../common -g -Wall -O3 -mcmodel=medium appft.c
gcc  -c -I../common -g -Wall -O3 -mcmodel=medium auxfnct.c
auxfnct.c: In function ‘CompExp’:
   int m, nu, ku, i, j, ln;

In [8]:
!pgprof --cpu-profiling on --cpu-profiling-mode top-down ./NPB-acc/FT-seq/NPB3.3-SER-C/bin/ft.A.x



 NAS Parallel Benchmarks (NPB3.3-SER-C) - FT Benchmark

 Size                :  256x 256x 128
 Iterations          :              6

 T =    1     Checksum =    5.046735008193E+02    5.114047905510E+02
 T =    2     Checksum =    5.059412319734E+02    5.098809666433E+02
 T =    3     Checksum =    5.069376896287E+02    5.098144042213E+02
 T =    4     Checksum =    5.077892868474E+02    5.101336130759E+02
 T =    5     Checksum =    5.085233095391E+02    5.104914655194E+02
 T =    6     Checksum =    5.091487099959E+02    5.107917842803E+02
 Verification test for FT successful


 FT Benchmark Completed.
 Class           =                        A
 Size            =            256x 256x 128
 Iterations      =                        6
 Time in seconds =                     3.56
 Mop/s total     =                  2005.08
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1
 Compile date    =              23 

### Quality Checking/Keeping a Record

*After each step*, we will record the results from our benchmarking and correctness tests in a table like this one: 

|Step| Execution       | Execution Time (s)    | Speedup vs. 1 CPU Thread       | Correct? | Programming Time |
|:--:| --------------- | ---------------------:| ------------------------------:|:--------:| -----------------|
|1   | CPU 1 thread    | 3.56                  |                                | Yes      |                  |



The profile shows that the `Swarztrauber` function takes most of the compute time.

# Step 2 - Add Compute Directives

Before you add any `#pragma acc` compute directives, three things need to be done:

1. Initialize the GPU:

        acc_init(acc_device_default);

2. Create the variables that the computation needs on the GPU:

        #pragma acc data create(u0_real,u0_imag,u1_real,u1_imag,u_real,u_imag,twiddle,gty1_real,gty1_imag,gty2_real,gty2_imag)
        {
            ...
        }

3. `compute_initial_conditions()` calculates the initial conditions on the CPU. To use them on the GPU, the device copies of the variables must be updated as well. Add `#pragma acc update device(var)` at the end of the function to update the relevant variables on the GPU.
        

#### Then locate the `for` loops

+ Add `#pragma acc parallel loop` in front of each loop.
+ The compiler will report whether the loop is parallelizable.

In [9]:
!cd ./NPB-acc/FT-step1/ && make clean && make CC=pgcc CLASS=A

rm -f *.x *.i *.B *.spin *.s *.t *.w2c.c *.w2c.h *.w2c.cu *.ptx *.o *~ mputil* ../common/*.o
rm -f ft npbparams.h core
if [ -d rii_files ]; then rm -r rii_files; fi
make[1]: Entering directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
rm -f setparams setparams.h npbparams.h
rm -f *~ *.o
cc  -o setparams setparams.c
make[1]: Leaving directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
../sys/setparams ft A
pgcc  -c -I../common -O3 -acc -ta=nvidia,cc35,cuda8.0  -Minfo=accel -mcmodel=medium -DCRPL_COMP=0 ft.c
main:
    217, Generating create(gty1_imag[:][:][:],gty1_real[:][:][:],gty2_imag[:][:][:],gty2_real[:][:][:],twiddle[:],u0_imag[:],u0_real[:],u1_imag[:],u1_real[:],u_imag[:],u_real[:])
init_ui:
    280, Generating present(twiddle[:],u0_imag[:],u0_real[:],u1_imag[:],u1_real[:])
         Accelerator kernel generated
         Generating Tesla code
        281, #pragma acc loop gang /* blockIdx.x */
        282, #pragma acc loop seq
        283, #pragma acc loop vector(128) 

If you get any errors, please check your work and try recompiling.

In [10]:
!ulimit -s unlimited && ./NPB-acc/FT-step1/ft.A.x



 NAS Parallel Benchmarks (NPB3.3-ACC-C) - FT Benchmark

 Size                :  256x 256x 128
 Iterations                  :      6

 T =    1     Checksum =    5.046735008193E+02    5.114047905510E+02
 T =    2     Checksum =    5.059412319734E+02    5.098809666433E+02
 T =    3     Checksum =    5.069376896287E+02    5.098144042213E+02
 T =    4     Checksum =    5.077892868474E+02    5.101336130759E+02
 T =    5     Checksum =    5.085233095391E+02    5.104914655194E+02
 T =    6     Checksum =    5.091487099959E+02    5.107917842803E+02
 Result verification successful
 class = A


 FT Benchmark Completed.
 Class           =                        A
 Size            =            256x 256x 128
 Iterations      =                        6
 Time in seconds =                     8.94
 Mop/s total     =                   798.12
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1
 Compile date    =           

Let's record our results in the table:

|Step| Execution    | Time (s)    | Speedup vs. 1 CPU Thread  | Correct? | Programming Time |
| -- | ------------ | ----------- | ------------------------- | -------- | ---------------- |
|1| CPU 1 thread |3.56      |                           | Yes      | |
|2| Add parallel directive  |8.94      | 0.40X           | Yes      | |


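### Some loops run sequentially

The compiler feedback above shows that some loops were scheduled as `#pragma acc loop seq`, so they run sequentially on the GPU.

Solution: add `independent` clauses to those loops.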


In [14]:
!cd ./NPB-acc/FT-step1/ && make clean && make CC=pgcc CLASS=A

rm -f *.x *.i *.B *.spin *.s *.t *.w2c.c *.w2c.h *.w2c.cu *.ptx *.o *~ mputil* ../common/*.o
rm -f ft npbparams.h core
if [ -d rii_files ]; then rm -r rii_files; fi
make[1]: Entering directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
rm -f setparams setparams.h npbparams.h
rm -f *~ *.o
cc  -o setparams setparams.c
make[1]: Leaving directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
../sys/setparams ft A
pgcc  -c -I../common -O3 -acc -ta=nvidia,cc35,cuda8.0  -Minfo=accel -mcmodel=medium -DCRPL_COMP=0 ft.c
main:
    217, Generating create(gty1_imag[:][:][:],gty1_real[:][:][:],gty2_imag[:][:][:],gty2_real[:][:][:],twiddle[:],u0_imag[:],u0_real[:],u1_imag[:],u1_real[:],u_imag[:],u_real[:])
init_ui:
    280, Generating present(twiddle[:],u0_imag[:],u0_real[:],u1_imag[:],u1_real[:])
         Accelerator kernel generated
         Generating Tesla code
        283, #pragma acc loop gang /* blockIdx.x */
        285, #pragma acc loop worker(4) /* threadIdx.y */
        287, #prag

In [15]:
!ulimit -s unlimited && ./NPB-acc/FT-step1/ft.A.x



 NAS Parallel Benchmarks (NPB3.3-ACC-C) - FT Benchmark

 Size                :  256x 256x 128
 Iterations                  :      6

 T =    1     Checksum =    5.046735008193E+02    5.114047905510E+02
 T =    2     Checksum =    5.059412319734E+02    5.098809666433E+02
 T =    3     Checksum =    5.069376896287E+02    5.098144042213E+02
 T =    4     Checksum =    5.077892868474E+02    5.101336130759E+02
 T =    5     Checksum =    5.085233095391E+02    5.104914655194E+02
 T =    6     Checksum =    5.091487099959E+02    5.107917842803E+02
 Result verification successful
 class = A


 FT Benchmark Completed.
 Class           =                        A
 Size            =            256x 256x 128
 Iterations      =                        6
 Time in seconds =                     0.78
 Mop/s total     =                  9198.01
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1
 Compile date    =           

Let's record our results in the table:

|Step| Execution    | Time (s)    | Speedup vs. 1 CPU Thread  | Correct? | Programming Time |
| -- | ------------ | ----------- | ------------------------- | -------- | ---------------- |
|1| CPU 1 thread |3.56      |                           | Yes      | |
|2| Add parallel directive  |8.94      | 0.40X           | Yes      | |
|3| Add independent clause  |0.78      | 4.56X           | Yes      | |


### Optimization
Manually control the block dimensions, for example:

    #pragma acc parallel num_gangs(d3) num_workers(8) vector_length(128)

In [16]:
!cd ./NPB-acc/FT-final/ && make clean && make CC=pgcc CLASS=A

rm -f *.x *.i *.B *.spin *.s *.t *.w2c.c *.w2c.h *.w2c.cu *.ptx *.o *~ mputil* ../common/*.o
rm -f ft npbparams.h core
if [ -d rii_files ]; then rm -r rii_files; fi
make[1]: Entering directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
rm -f setparams setparams.h npbparams.h
rm -f *~ *.o
cc  -o setparams setparams.c
make[1]: Leaving directory '/home/qichao/Desktop/notebooks-acc/NPB-acc/sys'
../sys/setparams ft A
pgcc  -c -I../common -O3 -acc -ta=nvidia,cc35,cuda8.0  -Minfo=accel -mcmodel=medium -DCRPL_COMP=0 ft.c
main:
    217, Generating create(gty1_imag[:][:][:],gty1_real[:][:][:],gty2_imag[:][:][:],gty2_real[:][:][:],twiddle[:],u0_imag[:],u0_real[:],u1_imag[:],u1_real[:],u_imag[:],u_real[:])
init_ui:
    284, Generating present(twiddle[:],u0_imag[:],u0_real[:],u1_imag[:],u1_real[:])
    288, Loop is parallelizable
    290, Loop is parallelizable
    292, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        288, #pragma acc loop ga

In [17]:
!./NPB-acc/FT-final/ft.A.x



 NAS Parallel Benchmarks (NPB3.3-ACC-C) - FT Benchmark

 Size                :  256x 256x 128
 Iterations                  :      6

 T =    1     Checksum =    5.046735008193E+02    5.114047905510E+02
 T =    2     Checksum =    5.059412319734E+02    5.098809666433E+02
 T =    3     Checksum =    5.069376896287E+02    5.098144042213E+02
 T =    4     Checksum =    5.077892868474E+02    5.101336130759E+02
 T =    5     Checksum =    5.085233095391E+02    5.104914655194E+02
 T =    6     Checksum =    5.091487099959E+02    5.107917842803E+02
 Result verification successful
 class = A


 FT Benchmark Completed.
 Class           =                        A
 Size            =            256x 256x 128
 Iterations      =                        6
 Time in seconds =                     0.67
 Mop/s total     =                 10696.01
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1
 Compile date    =           

Let's record our results in the table:

|Step| Execution    | Time (s)    | Speedup vs. 1 CPU Thread  | Correct? | Programming Time |
| -- | ------------ | ----------- | ------------------------- | -------- | ---------------- |
|1| CPU 1 thread |3.56      |                           | Yes      | |
|2| Add parallel directive  |8.94      | 0.40X           | Yes      | |
|3| Add independent clause  |0.78      | 4.56X           | Yes      | |
|4| Optimization  |0.67      | 5.31X           | Yes      | |
