Accelerating portable HPC Applications with ISO Fortran
===

# Lab 1: Accelerated Math Intrinsics

This lab demonstrates how to use the accelerated math intrinsics and how to compile code that uses them for CPU and GPU execution.

# Exercise 1: matrix-matrix multiply (`matmul`)

The exercise template [exercise1.f90](./exercise1.f90) implements a matrix-matrix multiply in Fortran as follows:

```fortran
do j = 1, nj
  do k = 1, nk
    do i = 1, ni
      d(i,j) = d(i,j) + a(i,k) * b(k,j)
    end do
  end do
end do
```
The command-line arguments for the binary are `./matmul ni nj nk niterations`. As the loop indices show, `a` is `ni`×`nk`, `b` is `nk`×`nj`, and `d` is `ni`×`nj`.
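
As an illustration of how four integer arguments like these can be read in standard Fortran, here is a minimal sketch. It is not taken from `exercise1.f90`, whose argument handling may differ; the program name `parse_args` and the helper `int_arg` are hypothetical.

```fortran
program parse_args
  ! Hypothetical sketch: read ni, nj, nk, niterations from the command line.
  ! The real exercise1.f90 may parse its arguments differently.
  implicit none
  integer :: ni, nj, nk, niterations

  if (command_argument_count() /= 4) then
    print '(a)', 'usage: ./matmul ni nj nk niterations'
    error stop 1
  end if

  ni = int_arg(1)
  nj = int_arg(2)
  nk = int_arg(3)
  niterations = int_arg(4)
  print '(a,4(1x,i0))', 'ni nj nk niterations:', ni, nj, nk, niterations

contains

  ! Convert the n-th command-line argument to an integer (hypothetical helper).
  integer function int_arg(n) result(val)
    integer, intent(in) :: n
    character(len=32) :: buf
    integer :: ios
    call get_command_argument(n, buf)
    read(buf, *, iostat=ios) val
    if (ios /= 0) error stop 'argument is not an integer'
  end function int_arg

end program parse_args
```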

The goal of this first exercise is to implement the same computation with the Fortran intrinsic [matmul](https://gcc.gnu.org/onlinedocs/gfortran/MATMUL.html) instead.

Replace the loops under the `! TODO`s in [exercise1.f90](./exercise1.f90) with calls to the `matmul` intrinsic.

The following cells compile the code and measure its performance.

In [None]:
!rm -f matmul
!nvfortran -Ofast exercise1.f90 -o matmul
!./matmul 1024 1024 1024 10

When you are done, let's switch to the accelerated math intrinsics provided by `nvfortran`.

They are selected with the following flags:

- `-stdpar=multicore` for multicore CPU execution
- `-stdpar=gpu` for GPU execution, which also requires the following flags:
  - `-cuda` for enabling CUDA Fortran
  - `-cudalib=cutensor,curand` for linking the cuTENSOR library, which the accelerated math intrinsics use, along with cuRAND
 
Let's try these out. You can pick a larger matrix size, since performance is now much higher than with the hand-written triple-nested loop.

In [None]:
!rm -f matmul
!nvfortran -Ofast -stdpar=gpu -cuda -cudalib=cutensor,curand exercise1.f90 -o matmul
!./matmul 1024 1024 1024 2

## Solution: Exercise 1

In [None]:
!rm -f matmul
!nvfortran -Ofast -stdpar=gpu -cuda -cudalib=cutensor,curand solutions/exercise1.f90 -o matmul
!./matmul 4096 4096 4096 10
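
For reference, the following is a minimal sketch of the kind of replacement the exercise asks for, checked against the original triple loop on small matrices. It is an illustration under assumptions, not a copy of `solutions/exercise1.f90`: the program name, the array sizes, the names `d_loop` and `d_intr`, and the tolerance are all made up for this example. Because the loop accumulates into `d`, the intrinsic version adds the product to the existing contents of `d`.

```fortran
program matmul_sketch
  ! Illustrative sketch only; the actual solution file may differ.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ni = 4, nj = 3, nk = 5   ! small sizes chosen for the example
  real(dp) :: a(ni,nk), b(nk,nj)
  real(dp) :: d_loop(ni,nj), d_intr(ni,nj)
  integer :: i, j, k

  call random_number(a)
  call random_number(b)
  d_loop = 0.0_dp
  d_intr = 0.0_dp

  ! Reference: the hand-written triple loop from the exercise template.
  do j = 1, nj
    do k = 1, nk
      do i = 1, ni
        d_loop(i,j) = d_loop(i,j) + a(i,k) * b(k,j)
      end do
    end do
  end do

  ! Intrinsic version: accumulate the matrix product into d.
  d_intr = d_intr + matmul(a, b)

  ! The two results should agree up to floating-point rounding.
  if (maxval(abs(d_loop - d_intr)) > 1.0e-12_dp) error stop 'results differ'
  print '(a)', 'matmul result matches the loop version'
end program matmul_sketch
```

The sketch uses only standard Fortran, so it can be built with the same `nvfortran` command lines used in this lab.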

We do not have a multi-threaded implementation of `matmul` in our runtime yet, so the following runs sequentially:

In [None]:
!rm -f matmul
!nvfortran -Ofast -stdpar=multicore solutions/exercise1.f90 -o matmul
!./matmul 4096 4096 4096 2