# Getting Started
Let's execute the cell below to display information about the GPUs on the server by running the `nvidia-smi` command, which ships with the NVIDIA GPU drivers that we will be using. To do this, give the cell focus (click on it with your mouse) and press Ctrl-Enter, or press the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell.

In [None]:
!nvidia-smi

Since the code will also be run on a multicore CPU, run the cell below to get details of the number of cores and the CPU architecture of the system.

In [None]:
!cat /proc/cpuinfo

## Steps to follow
We will follow the optimization cycle for porting the code and improving its performance.

<img src="images/Optimization_Cycle.jpg" width="80%" height="80%">


Before starting to work on OpenACC, let us copy the serial code to the OpenACC folder.

## Compile the serial code

In [None]:
!cd ../source_code/openacc && make clean && make

## Run the CPU code

In [None]:
!cd ../source_code/openacc && ./cfd

## Profiling

For this section, we will be using the Nsight Systems profiler. As the code is a CPU code, we will trace the NVTX APIs (already integrated into the application). NVTX is useful for tracing CPU events and time ranges. For more info on the Nsight Systems profiler, please see the __[profiler documentation](https://docs.nvidia.com/nsight-systems/)__.
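As an illustration of what NVTX instrumentation of a CPU function can look like, here is a minimal sketch. It is not the application's actual instrumentation: the function `scale_field`, the macros `RANGE_PUSH`/`RANGE_POP`, and the build flag `USE_NVTX` are hypothetical names, and the header path `nvtx3/nvToolsExt.h` assumes an NVTX v3 installation as shipped with recent CUDA toolkits. Without `USE_NVTX` the macros compile to no-ops, so the code also builds on a machine without NVTX.

```cpp
#include <cassert>
#include <cstddef>

// USE_NVTX is a hypothetical build flag for this sketch. Define it (with the
// NVTX headers on the include path) to emit real ranges; otherwise the
// macros below do nothing.
#ifdef USE_NVTX
#include <nvtx3/nvToolsExt.h>
#define RANGE_PUSH(name) nvtxRangePushA(name)
#define RANGE_POP() nvtxRangePop()
#else
#define RANGE_PUSH(name) ((void)0)
#define RANGE_POP() ((void)0)
#endif

// Hypothetical hotspot candidate: scales every element of a field in place.
void scale_field(double* field, std::size_t n, double factor) {
    assert(field != nullptr || n == 0);
    RANGE_PUSH("scale_field");  // start of the named range on the NVTX timeline row
    for (std::size_t i = 0; i < n; ++i) {
        field[i] *= factor;
    }
    RANGE_POP();  // end of the range; every push needs a matching pop
}
```

When the profiler is run with `-t nvtx`, each call to an annotated function should then appear as a named time range, which makes it easier to see which function dominates the run time.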

### Viewing the profiler output
There are two ways to look at profiled code: 

1) Command line based: Use `nsys` to collect and view profiling data from the command-line. Profiling results are displayed in the console after the profiling data is collected.

2) NVIDIA Nsight Systems: Open the Nsight Systems profiler, click on File > Open, and choose the profiler output called `minicfd_profile.qdrep`. To view this on your local machine, the local system must have the same version of the CUDA toolkit installed. More details on where to download the CUDA toolkit can be found in the Links and Resources section below.

## Profile the CPU code to find hotspots

In [None]:
!cd ../source_code && nsys profile -t nvtx --stats=true --force-overwrite true -o minicfd_profile ./cfd

[Download the profiler output](../source_code/minicfd_profile.qdrep)

---

# Start adding OpenACC Pragmas

Before you start modifying the serial code, make a copy of it and rename it. Then, copy the output of the serial code, `reference.nc`, to the `checker` folder for later use (more info in later sections).

In [None]:
!cp ../source_code/serial/* ../source_code/openacc

Now, you can start modifying the C++ code and the `Makefile`:

From the top menu, click on *File*, then *Open*, and browse to the `C/source_code/openacc` directory. Remember to **SAVE** your code after making changes, before running the cells below.

#### Some Hints

1) Check which variables are copied implicitly and which explicitly: add the `-Minfo=accel` flag to the `Makefile`.

2) Check whether there is any data race in your code. (More details on data races are given in the general tip further below.)
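To make the first hint concrete, the sketch below shows a loop with explicit data clauses. The function `axpy` and its parameters are hypothetical and not taken from the CFD code; the point is only to show where `copyin` and `copy` go. If the clauses were left out, an OpenACC compiler would choose the data movement implicitly, and the `-Minfo=accel` feedback is where you would see what it decided.

```cpp
#include <cassert>

// Hypothetical kernel (not from the CFD source): y[i] = a * x[i] + y[i].
void axpy(int n, double a, const double* x, double* y) {
    assert((x != nullptr && y != nullptr) || n == 0);
    // copyin: x is only read on the device, so it is transferred host -> device.
    // copy:   y is read and written, so it is transferred in and back out.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Each iteration writes only its own `y[i]`, so this loop has no data race. A compiler without OpenACC support ignores the pragma and runs the loop serially with the same result.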

## Compile and run OpenACC enabled code


In [None]:
!cd ../source_code/openacc && make clean && make

Hint: Add `-Minfo=accel` to the `Makefile` to check that kernel code has indeed been generated.

## Profile the OpenACC Code

In [None]:
!cd ../source_code/openacc && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o minicfdopenacc_profile ./cfd

You can examine the output in the terminal, or you can download the file and view the timeline by opening it with NVIDIA Nsight Systems.

[Download the profiler output](../source_code/openacc/minicfdopenacc_profile.qdrep)

## Validating the Output
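One generic way to validate an accelerated run is an element-wise comparison against the serial result with a tolerance, since parallel execution can change the order of floating-point operations. The sketch below assumes both versions can expose their final field as an array of doubles; that is an assumption, as is the helper name `fields_match` and its default tolerance. It is not the application's own checker.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: returns true when every element of `result` agrees with
// `reference` within `tol`, using a relative tolerance for values above 1.0
// and an absolute one for smaller values.
bool fields_match(int n, const double* reference, const double* result,
                  double tol = 1e-9) {
    assert((reference != nullptr && result != nullptr) || n == 0);
    for (int i = 0; i < n; ++i) {
        const double scale = std::fmax(1.0, std::fabs(reference[i]));
        const double diff = std::fabs(result[i] - reference[i]);
        // Written as a negated "<=" so that a NaN difference counts as a mismatch.
        if (!(diff <= tol * scale)) {
            return false;
        }
    }
    return true;
}
```

A bitwise comparison would be stricter but tends to report false failures once reductions or fused operations reorder the arithmetic.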



# Recommendations for adding OpenACC Pragmas

After finding the hotspot function, take an incremental approach to adding pragmas.

1) Ignore the initialization, finalization and I/O functions

2) Take an incremental approach by adding pragmas one at a time

3) Unified Memory provides a good starting point where you need not worry about data transfers (`-ta=tesla:managed`)

4) Cross check the output after incremental changes to check algorithmic scalability

5) Move on to using data clauses for better performance 

6) Start with a small problem size that reduces the execution time. 

Example: `!./miniWeather 800 400 100`
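The step from recommendation 3 to recommendation 5 can be sketched as follows. The function `smooth` is a hypothetical 1D stencil, not part of the CFD code. A first version would carry only the two `parallel loop` pragmas and rely on managed memory. The version shown adds a `data` region so that, under the assumption that both arrays are used only inside this function, they stay on the device across all iterations instead of being moved for every kernel.

```cpp
#include <cassert>

// Hypothetical stencil: repeatedly replaces each interior point of `a` by the
// average of itself and its two neighbours. `tmp` is scratch space of size n.
void smooth(int n, int iters, double* a, double* tmp) {
    assert((a != nullptr && tmp != nullptr) || n == 0);
    // copy:   `a` is needed on the host before and after the region.
    // create: `tmp` only lives on the device, so it is never transferred.
    #pragma acc data copy(a[0:n]) create(tmp[0:n])
    {
        for (int it = 0; it < iters; ++it) {
            #pragma acc parallel loop present(a[0:n], tmp[0:n])
            for (int i = 1; i < n - 1; ++i) {
                tmp[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
            }
            #pragma acc parallel loop present(a[0:n], tmp[0:n])
            for (int i = 1; i < n - 1; ++i) {
                a[i] = tmp[i];
            }
        }
    }
}
```

Writing into `tmp` first and copying back in a second loop keeps each iteration independent, which is what allows both loops to be parallelized safely.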



**General tip:** Be aware of *data races*: situations in which at least two threads access a shared variable at the same time and at least one of them tries to modify it. If a data race occurs, the result may be incorrect. So, make sure to validate your output against the serial version.
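A common source of data races is a loop that accumulates into a single scalar. The sketch below uses a hypothetical function `sum_of_squares` (not from the CFD code) to show how the OpenACC `reduction` clause expresses such an accumulation safely.

```cpp
#include <cassert>

// Hypothetical example: sums the squares of the n values in v.
double sum_of_squares(int n, const double* v) {
    assert(v != nullptr || n == 0);
    double sum = 0.0;
    // Every iteration updates the shared variable `sum`. The reduction clause
    // gives each thread a private partial sum and combines them at the end;
    // without it, concurrent updates to `sum` would be a data race.
    #pragma acc parallel loop reduction(+:sum) copyin(v[0:n])
    for (int i = 0; i < n; ++i) {
        sum += v[i] * v[i];
    }
    return sum;
}
```

Because the partial sums may be combined in a different order than in the serial loop, the result can differ from the serial one in the last few bits, so a comparison with a small tolerance is usually more appropriate than an exact one.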

# Links and Resources

[OpenACC API Guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To view the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 


## Licensing 

This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0). 