# Getting Started
Start by displaying information about the GPUs on the server with the `nvidia-smi` command, which ships with the NVIDIA GPU drivers used in this lab. To run it, give the cell below focus (click on it with your mouse) and press Ctrl-Enter, or press the play button in the toolbar above. If all goes well, you should see some output returned below the grey cell.

In [None]:
!nvidia-smi

Since the code will also be run on multicore CPUs, inspect the CPU information as well:

In [None]:
!cat /proc/cpuinfo

# A MINI-WEATHER APPLICATION

In this lab we will accelerate a fluid simulation in the context of atmosphere and weather simulation.
The mini weather code mimics the basic dynamics seen in atmospheric weather and climate.

The figure below demonstrates how a narrow jet of fast and slightly cold wind is injected into a balanced, neutral atmosphere at rest from the left domain near the model top.

<img src="images/Time.jpg" width="80%" height="80%">

The simulation is a repetitive process that advances from time 0 to the desired simulated time, increasing by Δt on every iteration.
Each Δt step performs practically the same operation: it solves a differential equation that represents how the flow of the atmosphere (fluid) changes in response to small perturbations. To simplify the solution, the code uses dimensional splitting: the X and Z dimensions are treated independently.

<img src="images/X_Y.jpg" width="80%" height="80%">

The differential equation has a time derivative that needs integrating, and a simple low-storage Runge-Kutta ODE solver is used to integrate it. On each time step, the order in which the dimensions are solved is reversed, giving second-order accuracy.
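
To make the three-stage update concrete, here is a minimal sketch of the same stage pattern (`dt / 3`, `dt / 2`, `dt`) applied to a scalar ODE. The names `rk3_step` and `integrate` are hypothetical and do not appear in miniWeather; the real code applies the same pattern to whole arrays.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// One time step of a three-stage, low-storage Runge-Kutta scheme:
//   q*      = q_n + (dt/3) * f(q_n)
//   q**     = q_n + (dt/2) * f(q*)
//   q_{n+1} = q_n + dt     * f(q**)
// Only q_n and one temporary value are kept, hence "low-storage".
// Hypothetical helper for illustration, not part of miniWeather.
double rk3_step(const std::function<double(double)> &f, double q, double dt) {
  assert(dt > 0.0);
  double q_tmp = q + (dt / 3.0) * f(q);      // stage 1
  q_tmp        = q + (dt / 2.0) * f(q_tmp);  // stage 2
  return         q + dt * f(q_tmp);          // stage 3
}

// Integrate dq/dt = f(q) from t = 0 to t = t_end using nsteps equal steps.
double integrate(const std::function<double(double)> &f, double q0,
                 double t_end, int nsteps) {
  assert(nsteps > 0);
  const double dt = t_end / nsteps;
  double q = q0;
  for (int n = 0; n < nsteps; n++) { q = rk3_step(f, q, dt); }
  return q;
}
```

For example, calling `integrate` with `f(q) = -q`, `q0 = 1`, and `t_end = 1` approximates `exp(-1)`, and the approximation improves as `nsteps` grows.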

<img src="images/Range-Kutta.jpg" width="70%" height="70%">

### The objective of this exercise is not to delve into the mathematics but to use OpenACC to parallelize the code and improve its performance.

The general flow of the code is shown in the diagram below. For each time step the differential equations are solved.

<img src="images/Outer_Loop.jpg" width="70%" height="70%">


```cpp
while (etime < sim_time) {
    //If the time step leads to exceeding the simulation time, shorten it for the last step
    if (etime + dt > sim_time) { dt = sim_time - etime; }
    //Perform a single time step
    perform_timestep(state,state_tmp,flux,tend,dt);
    //Inform the user
    if (masterproc) { printf( "Elapsed Time: %lf / %lf\n", etime , sim_time ); }
    //Update the elapsed time and output counter
    etime = etime + dt;
    output_counter = output_counter + dt;
    //If it's time for output, reset the counter, and do output
    if (output_counter >= output_freq) {
      output_counter = output_counter - output_freq;
      output(state,etime);
    }
}
```
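
The control flow of this loop can be tried in isolation. The sketch below is a hypothetical stand-in (`LoopStats` and `run_time_loop` are not part of miniWeather) that replaces the physics and the file output with counters, so the effect of shortening the last step and of the output counter is easy to see.

```cpp
#include <cassert>

// Summary of one run of the loop: steps taken, outputs written, final time.
// Hypothetical type for illustration only.
struct LoopStats {
  int steps;
  int outputs;
  double etime;
};

// Same control flow as the main time loop, with the work replaced by counters.
LoopStats run_time_loop(double sim_time, double dt, double output_freq) {
  assert(dt > 0.0 && output_freq > 0.0);
  LoopStats stats = {0, 0, 0.0};
  double output_counter = 0.0;
  while (stats.etime < sim_time) {
    // Shorten the last step so the loop ends exactly at sim_time
    if (stats.etime + dt > sim_time) { dt = sim_time - stats.etime; }
    stats.steps++;                 // stands in for perform_timestep(...)
    stats.etime += dt;
    output_counter += dt;
    // If it's time for output, reset the counter and record an output
    if (output_counter >= output_freq) {
      output_counter -= output_freq;
      stats.outputs++;             // stands in for output(state, etime)
    }
  }
  return stats;
}
```

With `sim_time = 1.125`, `dt = 0.25`, and `output_freq = 0.5`, the loop takes four full steps plus one shortened step of `0.125`, and records an output after the second and fourth steps.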

At every time step, the order of the directions is reversed to obtain second-order accuracy.

<img src="images/Time_Step.jpg" width="70%" height="70%">

```cpp
void perform_timestep( double *state , double *state_tmp , double *flux , double *tend , double dt ) {
  if (direction_switch) {
    //x-direction first
    semi_discrete_step( state , state     , state_tmp , dt / 3 , DIR_X , flux , tend );
    semi_discrete_step( state , state_tmp , state_tmp , dt / 2 , DIR_X , flux , tend );
    semi_discrete_step( state , state_tmp , state     , dt / 1 , DIR_X , flux , tend );
    //z-direction second
    semi_discrete_step( state , state     , state_tmp , dt / 3 , DIR_Z , flux , tend );
    semi_discrete_step( state , state_tmp , state_tmp , dt / 2 , DIR_Z , flux , tend );
    semi_discrete_step( state , state_tmp , state     , dt / 1 , DIR_Z , flux , tend );
  } else {
    //z-direction first
    semi_discrete_step( state , state     , state_tmp , dt / 3 , DIR_Z , flux , tend );
    semi_discrete_step( state , state_tmp , state_tmp , dt / 2 , DIR_Z , flux , tend );
    semi_discrete_step( state , state_tmp , state     , dt / 1 , DIR_Z , flux , tend );
    //x-direction second
    semi_discrete_step( state , state     , state_tmp , dt / 3 , DIR_X , flux , tend );
    semi_discrete_step( state , state_tmp , state_tmp , dt / 2 , DIR_X , flux , tend );
    semi_discrete_step( state , state_tmp , state     , dt / 1 , DIR_X , flux , tend );
  }
  if (direction_switch) { direction_switch = 0; } else { direction_switch = 1; }
}
```

<img src="images/Semi_Discrete.jpg" width="70%" height="70%">

## Steps to follow
We will follow the Optimization cycle for porting and improving the code performance.

<img src="images/Optimization_Cycle.jpg" width="80%" height="80%">


### Understand and Analyze the code
Analyze the code present at:

[Serial Code](../source_code/ORIGINAL/miniWeather_serial.cpp) 

[Makefile](../source_code/ORIGINAL/Makefile)

Open the downloaded file and inspect it.

## Compile the code

In [None]:
!cd ../source_code && make clean && make

## Run the CPU code

In [None]:
!cd ../source_code && ./miniWeather

## Profiling

For this section, we will use the Nsight Systems profiler. Because the code is a CPU code, we will trace NVTX APIs, which are already integrated into the application. NVTX is useful for tracing CPU events and time ranges. For more information on the Nsight Systems profiler, please see the __[profiler documentation](https://docs.nvidia.com/nsight-systems/)__.

### Viewing the profiler output
There are two ways to look at profiled code: 

1) Command line based: Use `nsys` to collect and view profiling data from the command line. Profiling results are displayed in the console after the profiling data is collected.

2) NVIDIA Nsight Systems: Open the Nsight Systems profiler, click on *File* > *Open*, and choose the profiler output called `miniWeather_profile.qdrep`. Viewing it on your local machine requires that the local system has the same version of the CUDA toolkit installed. Details on where to download the CUDA toolkit can be found in the Links and Resources section below.

## Profile the CPU code to find hotspots

In [None]:
!cd ../source_code && nsys profile -t nvtx --stats=true --force-overwrite true -o miniWeather_profile ./miniWeather

[Download the profiler output](../source_code/miniWeather_profile.qdrep)

## Start adding OpenACC Pragmas

Before modifying the serial code, let's make a copy of it under a new name. Then, copy the output of the serial code, `reference.nc`, to the `checker` folder for later use (more info in later sections).

In [None]:
!cp ../source_code/miniWeather_serial.cpp ../source_code/miniWeather_openacc.cpp
!cp ../source_code/reference.nc ../source_code/checker/reference.nc

Now, you can start modifying the C++ code and the `Makefile`:

From the top menu, click on *File* and *Open* `miniWeather_openacc.cpp` and `Makefile` from the `C/source_code` directory. Remember to **SAVE** your code after making changes and before running the cells below.

#### Some Hints

1) Which variables are by default declared private?

2) Notice the implicit and explicit copies of variables --> Add the `-Minfo=accel` flag to the `Makefile`.

3) Check whether there is any data race in your code. (More details on data races can be found in the Links and Resources section below.)
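
As a starting point, a first pragma on a nested loop might look like the sketch below. It is an illustrative example under stated assumptions: `scale_field` is a hypothetical helper, not a miniWeather function, and the `copy` clause is written out explicitly so that the data movement reported by `-Minfo=accel` is easy to match against the source. A compiler without OpenACC support ignores the pragma and runs the loop serially.

```cpp
#include <cassert>

// Scale every cell of an nz-by-nx field stored as a flat array.
// Hypothetical helper for illustration, not part of miniWeather.
void scale_field(double *field, int nx, int nz, double factor) {
  assert(nx > 0 && nz > 0);
  // collapse(2) merges the two loops into one iteration space of nz*nx cells.
  // copy(...) moves the array to the device on entry and back on exit.
  #pragma acc parallel loop collapse(2) copy(field[0:nx*nz])
  for (int k = 0; k < nz; k++) {
    for (int i = 0; i < nx; i++) {
      int idx = k * nx + i;   // declared inside the loop body: private per iteration
      field[idx] = factor * field[idx];
    }
  }
}
```

Each iteration writes only its own `field[idx]`, so no two iterations touch the same element and the loop is free of data races.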

## Compile and run OpenACC enabled code


In [None]:
!cd ../source_code && make clean && make

Hint: Add `-Minfo=accel` to the `Makefile` to check that kernel code has indeed been generated.

## Profile the OpenACC Code

In [None]:
!cd ../source_code && nsys profile -t nvtx,openacc --stats=true --force-overwrite true -o miniWeather_profile ./miniWeather

You can examine the output in the terminal, or you can download the file and view the timeline by opening it with NVIDIA Nsight Systems.

[Download the profiler output](../source_code/miniWeather_profile.qdrep)

## Validating the Output

This is a simple code written to “check” the output as you add directives. Rename the current output to `new.nc` and compare it to the “correct” output from the serial code, `reference.nc`. The `checker.py` code looks for the largest error and the largest “difference”, and computes the % difference.

**Note:** We have already made a copy of the serial `reference.nc`. Now, we copy the output of the parallel code to the `checker` folder to validate the output. Make sure to run the executable in default mode without any input arguments: `!./miniWeather`.

In [None]:
!cp ../source_code/reference.nc ../source_code/checker/new.nc
!ipython ../source_code/checker/checker.py
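
The idea behind such a comparison can be sketched in a few lines. The function below is not the actual `checker.py` logic, which is written in Python and reads the `.nc` files; it is a hypothetical C++ illustration of one reasonable metric, the largest absolute difference normalized by the largest magnitude in the reference field.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute difference between two fields, divided by the largest
// magnitude found in the reference field. Falls back to the plain absolute
// difference when the reference is all zeros.
// Hypothetical helper for illustration, not the real checker.
double max_relative_difference(const std::vector<double> &reference,
                               const std::vector<double> &candidate) {
  assert(reference.size() == candidate.size());
  double max_diff = 0.0;
  double max_ref = 0.0;
  for (std::size_t i = 0; i < reference.size(); i++) {
    max_diff = std::max(max_diff, std::fabs(candidate[i] - reference[i]));
    max_ref = std::max(max_ref, std::fabs(reference[i]));
  }
  return (max_ref > 0.0) ? max_diff / max_ref : max_diff;
}
```

A result close to zero suggests that the parallel run reproduces the serial one; small nonzero values can still arise when floating-point operations are reordered.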

# Recommendations for adding OpenACC Pragmas

After finding the hotspot function, take an incremental approach to adding pragmas.

1) Ignore the initialization, finalization and I/O functions

```cpp
//Declaring the functions defined after "main"
void init(int *argc, char ***argv);
void finalize();
void injection(double x, double z, double &r, double &u, double &w, double &t, double &hr, double &ht);
void hydro_const_theta(double z, double &r, double &t);
void output(double *state, double etime);
void ncwrap(int ierr, int line);
void perform_timestep(double *state, double *state_tmp, double *flux, double *tend, double dt);
void semi_discrete_step(double *state_init, double *state_forcing, double *state_out, double dt, int dir, double *flux, double *tend);
void compute_tendencies_x(double *state, double *flux, double *tend);
void compute_tendencies_z(double *state, double *flux, double *tend);
void set_halo_values_x(double *state);
void set_halo_values_z(double *state);
```

2) Take an incremental approach by adding pragmas one at a time

3) Unified Memory provides a good starting point where you need not worry about data transfers (`-ta=tesla:managed`)

4) Cross-check the output after each incremental change to verify algorithmic correctness

5) Move on to using data clauses for better performance (Hint: For manual data movement, find out which variables are used, their READ-WRITE patterns, and where (host or device) each value is needed)

6) Start with a small problem size that reduces the execution time. Try modifying the `nx_glob`, `nz_glob`, and `sim_time` values when running the miniWeather executable.

Example: `!./miniWeather 800 400 100`

**Note:** You can provide input values for `nx_glob`, `nz_glob`, and `sim_time`, where

* `nx_glob` and `nz_glob` are the total numbers of cells in the x and z directions
* `sim_time` is the simulation time in seconds

Try running the executable with arguments: `!./miniWeather 800 400 100`. The total number of cells in the x direction must be twice the total number of cells in the z direction. The default values are 400, 200, and 1500 seconds.
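
Recommendation 5 above (explicit data clauses) might look like the following sketch. It is a hypothetical example: the function `damp` is not part of miniWeather, and the array names only echo those used in the application. The pattern is to open one `data` region around the time loop so that arrays stay on the device between kernels.

```cpp
#include <cassert>

// Apply nsteps damping updates to state, using tend as scratch space.
// Hypothetical example for illustration, not part of miniWeather.
void damp(double *state, double *tend, int n, int nsteps, double rate) {
  assert(n > 0 && nsteps >= 0);
  // state: needed on the host before and after the region -> copy
  // tend : scratch that is only used on the device        -> create
  #pragma acc data copy(state[0:n]) create(tend[0:n])
  {
    for (int step = 0; step < nsteps; step++) {
      // present(...) states that the arrays are already on the device,
      // so no transfer is issued inside the step loop.
      #pragma acc parallel loop present(state[0:n], tend[0:n])
      for (int i = 0; i < n; i++) { tend[i] = -rate * state[i]; }

      #pragma acc parallel loop present(state[0:n], tend[0:n])
      for (int i = 0; i < n; i++) { state[i] += tend[i]; }
    }
  }
}
```

Classifying each array by its read-write pattern (`copy`, `copyin`, `copyout`, or `create`) is the manual analysis that the hint asks for; Unified Memory hides this step, at the possible cost of extra transfers.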


**General tip:** Be aware of the *data race* situation, in which at least two threads access a shared variable at the same time and at least one of them tries to modify it. If a data race occurs, the result may be incorrect. So, make sure to validate your output against the serial version using the `checker.py` script.
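
A common source of data races is a loop that accumulates into a single shared variable. The sketch below shows one way to handle that case in OpenACC, with a `reduction` clause. The function `sum_field` is a hypothetical example and is not taken from miniWeather.

```cpp
#include <cassert>

// Sum of all elements. Without the reduction clause, parallel iterations
// would update the shared variable total concurrently, which is a data race.
// Hypothetical helper for illustration, not part of miniWeather.
double sum_field(const double *field, int n) {
  assert(n >= 0);
  double total = 0.0;
  // reduction(+:total) gives each thread a private partial sum and
  // combines the partial sums when the loop finishes.
  #pragma acc parallel loop reduction(+:total) copyin(field[0:n])
  for (int i = 0; i < n; i++) {
    total += field[i];
  }
  return total;
}
```

Because a reduction may add the elements in a different order than the serial loop, the result can differ from the serial sum in the last few digits, which is one reason the checker reports a difference rather than requiring exact equality.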

# Links and Resources

[OpenACC API Guide](https://www.openacc.org/sites/default/files/inline-files/OpenACC%20API%202.6%20Reference%20Guide.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

[CUDA Toolkit Download](https://developer.nvidia.com/cuda-downloads)

**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0). 