# Introduction to Parallel Programming (NCI-ResTech 2023)

In [None]:
import os
# The Jupyter notebook is launched from $HOME. Change to the working directory, assuming a directory named after your username already exists under /scratch/vp91.
os.chdir(os.path.expandvars("/scratch/vp91/$USER/Parallel-Programming"))

## 1. OpenMP
Our example ([monte-carlo-pi-serial](./monte-carlo-pi-serial.c)), meant to give you the hang of parallel programming, is slightly more complicated than a hello-world program. Nevertheless, it is a simple snippet on which a basic OpenMP program can be built.

The program approximates $\pi$ using the Monte Carlo method. Run the next cell to compile and execute the serial code.


In [None]:
!make clean && make mc-serial && echo "Compilation Successful!" && ./monte-carlo-pi-serial

The multithreaded version is implemented with OpenMP in [monte-carlo-pi-openmp.c](./monte-carlo-pi-openmp.c). In essence, the $N$ random numbers are distributed across multiple threads.


Run the next cell to compile the OpenMP code.

In [None]:
!make clean && make mc-omp && echo "Compilation Successful!" 

Run the program with a fixed number of threads, set through the `OMP_NUM_THREADS` environment variable (12 in the next cell).

In [None]:
!OMP_NUM_THREADS=12 ./monte-carlo-pi-openmp

## 2. OpenACC
Now we offload the computation to a GPU to accelerate the for-loop. To do so, we first need to load the NVIDIA HPC SDK module on Gadi.

**`TODO`**: Refactor [monte-carlo-pi-openacc.c](./monte-carlo-pi-openacc.c) so that it uses OpenACC directives and clauses.

Since we will compile with managed memory, there is no need to include data transfer clauses. However, this may become an issue when tuning for more performance.

The following flags are used in compiling the OpenACC code:

`-Minfo=accel`: show information about the code that OpenACC accelerates

`-ta=tesla:managed`: target NVIDIA GPUs with managed memory

We also use the NVTX library, which provides annotations for profiling the code.

If you get stuck, take a peek at the [solution](./solution/monte-carlo-pi-openacc.c).

Once you have added the correct OpenACC directives to the code, run the next cell to compile and execute the program.

In [None]:
!make clean && make mc-acc && echo "Compilation Successful!" && ./monte-carlo-pi-openacc

Now we will demonstrate how to submit a batch job.

## 3. MPI
Our last parallel programming model is MPI. The $N$ random numbers are split across multiple processes. Each process independently computes a local count from the random numbers stored within that process. The results of the individual MPI ranks are then collected and summed at a root process (MPI rank 0).

Look out for the following parts in the program ([monte-carlo-pi-mpi.c](./monte-carlo-pi-mpi.c)).
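```c
#include <mpi.h>

MPI_Init(&argc, &argv);                 // start the MPI environment

MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // rank (ID) of this process

MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes

MPI_Wtime();                            // wall-clock time in seconds

// sum every rank's count into count_tot on the root (rank 0)
MPI_Reduce(&count, &count_tot, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

MPI_Barrier(MPI_COMM_WORLD);            // wait until all processes arrive here

MPI_Finalize();                         // shut down the MPI environment
```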

Run the next cell to compile the MPI program and execute it on 4 processes.

In [None]:
!make clean && make mc-mpi && echo "Compilation Successful!" && mpiexec -np 4 ./monte-carlo-pi-mpi

### Profile with mpiP

Run the next cell to inspect the profiling results. The `*.mpiP` report files are deleted afterwards.

In [None]:
!cat *.mpiP
!rm -r *.mpiP