# Introduction to Parallel Programming (NCI-ResTech 2023)

In [3]:
import os
# The Jupyter notebook is launched from your $HOME; change to the working directory, assuming a directory named after your username exists under /scratch/vp91
os.chdir(os.path.expandvars("/scratch/vp91/$USER/intro-to-parallel-programming"))

## 1. OpenMP
Our example ([monte-carlo-pi-serial.c](./monte-carlo-pi-serial.c)), meant to help you get the hang of parallel programming, is slightly more complicated than a hello-world program. Nevertheless, it is a simple snippet for showcasing a basic OpenMP program.

The program approximates Pi using the Monte Carlo method. Run the next cell to compile and execute the serial code.
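As a rough illustration of the method (not the contents of the workshop file), the sketch below draws random points in the unit square and counts how many fall inside the quarter circle of radius 1. The hit fraction approximates Pi/4. The function names `lcg_uniform`, `count_hits` and `estimate_pi`, and the small random number generator, are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Tiny linear congruential generator (hypothetical helper, not taken from
   the workshop code); returns a double in [0, 1). */
static double lcg_uniform(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / 9007199254740992.0; /* divide by 2^53 */
}

/* Count how many of n random points (x, y) lie inside the quarter circle. */
long count_hits(long n, uint64_t seed)
{
    uint64_t state = seed;
    long hits = 0;
    for (long i = 0; i < n; i++) {
        double x = lcg_uniform(&state);
        double y = lcg_uniform(&state);
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return hits;
}

/* The hit fraction approximates Pi/4, so Pi is roughly 4 * hits / n. */
double estimate_pi(long n, uint64_t seed)
{
    assert(n > 0);
    return 4.0 * (double)count_hits(n, seed) / (double)n;
}
```

Calling `estimate_pi(4000000, 42)` should give a value close to 3.14, with the error shrinking slowly as the number of sampling points grows. The cell that follows compiles and runs the workshop's own serial file.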


In [3]:
!make clean && make mc-serial && echo "Compilation Successful!" && ./monte-carlo-pi-serial

rm -f *.o monte-carlo-pi-openacc monte-carlo-pi-openmp monte-carlo-pi-serial monte-carlo-pi-mpi
gcc -g -Wall -fopenmp -o monte-carlo-pi-serial monte-carlo-pi-serial.c -lm
Compilation Successful!
MATH Pi 3.141593
/////////////////////////////////////////////////////
Sampling points 4000000; Hit numbers 3140491; Approx Pi 3.140491, Total time in 0.062679 seconds 
Sampling points 8000000; Hit numbers 6280912; Approx Pi 3.140456, Total time in 0.097134 seconds 
Sampling points 16000000; Hit numbers 12564521; Approx Pi 3.141130, Total time in 0.194175 seconds 
Sampling points 32000000; Hit numbers 25129332; Approx Pi 3.141167, Total time in 0.387351 seconds 
Sampling points 64000000; Hit numbers 50264754; Approx Pi 3.141547, Total time in 0.771435 seconds 
Sampling points 128000000; Hit numbers 100528445; Approx Pi 3.141514, Total time in 1.545127 seconds 
Sampling points 256000000; Hit numbers 201063451; Approx Pi 3.141616, Total time in 3.090718 seconds 
Sampling points 512000000; Hit num

The multithreaded version is implemented with OpenMP in [monte-carlo-pi-openmp.c](./monte-carlo-pi-openmp.c). In essence, the $N$ random numbers are distributed across multiple threads.
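A minimal sketch of how such a loop might look is given here. It assumes the coordinates are already stored in an array and uses a reduction so that each thread keeps its own partial count. The function name `count_hits_omp` and the interleaved array layout are hypothetical, and the workshop file may be organised differently.

```c
#include <assert.h>
#include <stddef.h>

/* Count points inside the quarter circle. xy holds 2*n coordinates in
   [0, 1), laid out as x0, y0, x1, y1, ... (a hypothetical layout). */
long count_hits_omp(const double *xy, long n)
{
    assert(xy != NULL && n >= 0);
    long count = 0;
    /* The loop iterations are divided among the threads. Each thread keeps
       a private partial count, and the reduction clause sums them at the
       end of the loop. */
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < n; i++) {
        double x = xy[2 * i];
        double y = xy[2 * i + 1];
        if (x * x + y * y <= 1.0)
            count++;
    }
    return count;
}
```

Without the `reduction` clause, all threads would update `count` at the same time and the result would be unreliable. When compiled without `-fopenmp`, the pragma is ignored and the loop simply runs serially.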


Run the next cell to compile the OpenMP code.

In [4]:
!make clean && make mc-omp && echo "Compilation Successful!" 

rm -f *.o monte-carlo-pi-openacc monte-carlo-pi-openmp monte-carlo-pi-serial monte-carlo-pi-mpi
gcc  -g -fopenmp -Wall -o monte-carlo-pi-openmp monte-carlo-pi-openmp.c -lm
Compilation Successful!


Run the program with a fixed number of threads.

In [7]:
!OMP_NUM_THREADS=12 ./monte-carlo-pi-openmp

MATH Pi 3.141593
/////////////////////////////////////////////////////
Sampling points 4000000; Hit numbers 3141252; Approx Pi 3.141252, Total time in 0.022828 seconds 
Sampling points 8000000; Hit numbers 6282531; Approx Pi 3.141265, Total time in 0.020827 seconds 
Sampling points 16000000; Hit numbers 12565102; Approx Pi 3.141275, Total time in 0.030763 seconds 
Sampling points 32000000; Hit numbers 25130302; Approx Pi 3.141288, Total time in 0.036118 seconds 
Sampling points 64000000; Hit numbers 50267098; Approx Pi 3.141694, Total time in 0.071238 seconds 
Sampling points 128000000; Hit numbers 100533295; Approx Pi 3.141665, Total time in 0.141738 seconds 
Sampling points 256000000; Hit numbers 201062526; Approx Pi 3.141602, Total time in 0.283192 seconds 
Sampling points 512000000; Hit numbers 402117088; Approx Pi 3.141540, Total time in 0.566267 seconds 


## 2. OpenACC
Now we offload the computation to a GPU to accelerate the for-loop. To do this, we first need to load the NVIDIA HPC SDK module on Gadi.

**`TODO`**: Refactor [monte-carlo-pi-openacc.c](./monte-carlo-pi-openacc.c) by changing its directives to OpenACC.

Since we will compile with managed memory, there is no need to include data transfer clauses. However, relying on managed memory can become a limitation when tuning for more performance.
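For comparison, this is roughly what an explicit data clause could look like if managed memory were not used. It is a sketch under assumptions: the function name `count_hits_acc` and the interleaved array layout are hypothetical, and it is not the workshop solution.

```c
#include <assert.h>
#include <stddef.h>

/* Count points inside the quarter circle on the accelerator. xy holds 2*n
   coordinates in [0, 1) as x0, y0, x1, y1, ... (a hypothetical layout). */
long count_hits_acc(const double *xy, long n)
{
    assert(xy != NULL && n >= 0);
    long count = 0;
    /* copyin: copy xy to the device before the loop and do not copy it back.
       reduction: combine the partial counts of all GPU threads into count. */
    #pragma acc parallel loop copyin(xy[0:2*n]) reduction(+:count)
    for (long i = 0; i < n; i++) {
        double x = xy[2 * i];
        double y = xy[2 * i + 1];
        if (x * x + y * y <= 1.0)
            count++;
    }
    return count;
}
```

With managed memory the `copyin` clause can be dropped, because the runtime migrates the data on demand. Stating the transfer explicitly gives control over when the data moves, which matters once the same array is reused across several kernels.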

The following flags are used in compiling the OpenACC code:

`-Minfo=accel`: Show information about the code accelerated by OpenACC

`-ta=tesla:managed`: Target NVIDIA GPUs with managed memory

We also use the NVTX library, which provides annotations for profiling the code.

If you get stuck, take a peek at the [solution](./solution/monte-carlo-pi-openacc.c).

Once you have annotated the code with the correct OpenACC directives, run the next cell to compile and execute the program.

In [9]:
!cd solution && make clean && make mc-acc && echo "Compilation Successful!" && ./monte-carlo-pi-openacc

rm -f *.o monte-carlo-pi-openacc monte-carlo-pi-openmp monte-carlo-pi-serial monte-carlo-pi-mpi
nvc -g -Wall -acc -Minfo=accel -ta=tesla:managed  -o monte-carlo-pi-openacc monte-carlo-pi-openacc.c -lm -lnvToolsExt
calc_pi:
     39, Generating NVIDIA GPU code
         39, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(+:count)
     39, Generating implicit copyin(random_array[:]) [if not already present]
         Generating implicit copy(count) [if not already present]
Compilation Successful!
MATH Pi 3.141593
Sampling points 4000000; Hit numbers 3140674; Approx Pi 3.140674, Total time in 0.234260 seconds 
Sampling points 8000000; Hit numbers 6283325; Approx Pi 3.141662, Total time in 0.160224 seconds 
Sampling points 16000000; Hit numbers 12566906; Approx Pi 3.141726, Total time in 0.306184 seconds 
Sampling points 32000000; Hit numbers 25132897; Approx Pi 3.141612, Total time in 0.575987 seconds 
Sampling points 64000000; Hit numbers 50

In [11]:
!cd solution && make clean && make mc-cuda && echo "Compilation Successful!" && ./monte-carlo-pi-cuda

rm -f *.o monte-carlo-pi-openacc monte-carlo-pi-openmp monte-carlo-pi-serial monte-carlo-pi-mpi
nvcc -g  -Xcompiler -Wall   -o monte-carlo-pi-cuda monte-carlo-pi-cuda.cu -lm -lnvToolsExt
Compilation Successful!
MATH Pi 3.141593
/////////////////////////////////////////////////////
Sampling points 4000000; Hit numbers 0; Approx Pi 0.000000
Sampling points 8000000; Hit numbers 0; Approx Pi 0.000000
Sampling points 16000000; Hit numbers 0; Approx Pi 0.000000
Sampling points 32000000; Hit numbers 0; Approx Pi 0.000000
Sampling points 64000000; Hit numbers 0; Approx Pi 0.000000
Sampling points 128000000; Hit numbers 0; Approx Pi 0.000000
Sampling points 256000000; Hit numbers 0; Approx Pi 0.000000
Sampling points 512000000; Hit numbers 0; Approx Pi 0.000000


Now we will demonstrate how to submit a batch job.

## 3. MPI
Our last parallel programming model uses MPI. The $N$ random numbers are split across multiple processes. Each process independently counts the hits among the random numbers stored locally within that process. The results of the individual MPI ranks are then collected and summed at a root process (MPI rank 0).
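The split-and-sum arithmetic can be sketched in plain C without MPI. The helper names `local_sample_count` and `sum_counts` are hypothetical, and the rule for handing out leftover samples is an assumption, since the workshop program may divide the work differently.

```c
#include <assert.h>

/* Number of samples assigned to `rank` when n samples are divided among
   `size` ranks. The first n % size ranks take one extra sample each. */
long local_sample_count(long n, int rank, int size)
{
    assert(n >= 0 && size > 0 && rank >= 0 && rank < size);
    long base = n / size;
    long extra = n % size;
    return base + (rank < extra ? 1 : 0);
}

/* Serial stand-in for a sum reduction: add up the per-rank counts the way
   the root rank would receive them. */
long sum_counts(const long *counts, int size)
{
    long total = 0;
    for (int r = 0; r < size; r++)
        total += counts[r];
    return total;
}
```

For instance, 10 samples over 4 ranks gives local counts of 3, 3, 2 and 2, which sum back to 10. In the MPI program the summation is carried out by the reduction call rather than by a loop on one process.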

Look out for the following parts in the program ([monte-carlo-pi-mpi.c](./monte-carlo-pi-mpi.c)).
```cpp
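#include <mpi.h>

// Initialise the MPI environment; must precede the other MPI calls below.
MPI_Init(&argc, &argv);

// Rank (ID) of the calling process within the communicator.
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

// Total number of processes in the communicator.
MPI_Comm_size(MPI_COMM_WORLD, &size);

// Wall-clock time in seconds; call it twice and subtract to time a region.
MPI_Wtime();

// Sum `count` from every rank into `count_tot` on rank 0.
MPI_Reduce(&count, &count_tot, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

// Block until every process in the communicator has reached this point.
MPI_Barrier(MPI_COMM_WORLD);

// Shut down the MPI environment.
MPI_Finalize();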
```

Run the next cell to execute the MC_pi program.

In [None]:
!make clean && make mc-mpi && echo "Compilation Successful!" && mpiexec -np 4 ./monte-carlo-pi-mpi

### Profile with mpiP

Run the next cell to inspect the profiling results.

In [None]:
!cat *.mpiP
!rm -r *.mpiP