Lab 1: DAXPY - Accelerating portable HPC Applications with ISO Fortran
===

In this tutorial we will familiarize ourselves with the Fortran `do concurrent` feature by implementing Double-precision AX Plus Y ([DAXPY](https://netlib.org/lapack/explore-html/de/da4/group__double__blas__level1_ga8f99d6a644d3396aa32db472e0cfc91c.html)): $A \cdot X + Y$, one of the core routines in the standard Basic Linear Algebra Subprograms (BLAS) library.

The operation combines scalar multiplication and vector addition. It takes two vectors of 64-bit floats, `x` and `y`, and a scalar value `a`.
It multiplies each element `x(i)` by `a` and adds the result to `y(i)`.

These exercises initialize `x(i) = i` and `y(i) = 2` and call the `check` subroutine once to verify each `daxpy` implementation.
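
Written element by element, and assuming the initialization described above, a single DAXPY application looks as follows. The closed form on the second line is just the update applied once to those initial values:

$$
y_i \leftarrow y_i + a \, x_i, \qquad i = 1, \dots, n
$$

$$
x_i = i, \; y_i = 2 \;\Longrightarrow\; y_i = 2 + a \, i \quad \text{after one application}
$$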

The resulting `daxpy` binaries take the following arguments: `./daxpy nx niterations`
* `nx`: number of elements in the `x` and `y` vectors.
* `niterations`: number of benchmark iterations.
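
As a rough illustration of how a Fortran program can read two such positional arguments, the sketch below uses the standard intrinsics `command_argument_count` and `get_command_argument`. The program name `parse_args` and the error messages are hypothetical; the exercise files may handle their arguments differently.

```fortran
program parse_args
  implicit none
  character(len=32) :: arg
  integer :: nx, niterations, ios

  ! Expect exactly two positional arguments: nx and niterations.
  if (command_argument_count() /= 2) then
    print '(a)', 'usage: ./daxpy nx niterations'
    stop 1
  end if

  ! Read each argument as text, then convert it to an integer.
  call get_command_argument(1, arg)
  read (arg, *, iostat=ios) nx
  if (ios /= 0) error stop 'nx must be an integer'

  call get_command_argument(2, arg)
  read (arg, *, iostat=ios) niterations
  if (ios /= 0) error stop 'niterations must be an integer'

  print '(a,i0,a,i0)', 'nx = ', nx, ', niterations = ', niterations
end program parse_args
```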

Only the code around the `! TODO`s needs changing.

## Exercise 1 - `do concurrent`

Take a look at [exercise1.f90](./exercise1.f90). There are two `! TODO`s for parallelizing the initialization and the `daxpy` subroutine with `do concurrent`. The sequential versions using raw loops look like this:

```fortran
! Initialize vectors `x` and `y`
! TODO: parallelize with do-concurrent
do i = 1, n
  x(i)  = i
  y(i)  = 2.
end do
! DAXPY: Y = Y + A * X
subroutine daxpy(x, y, n, a)
  use, intrinsic :: iso_fortran_env
  implicit none
  real(kind=8), dimension(:) :: x, y
  real(kind=8) :: a
  integer :: n, i  
  ! TODO: parallelize with do-concurrent
  do i = 1, n
    y(i) = y(i) + a * x(i)
  end do  
end subroutine
```

Fortran 2008 added support for `do concurrent` to express that loop iterations are independent of each other and therefore safe to parallelize. Multi-dimensional loops of this form:

```fortran
do i = 1,ni
  do j = 1,nj
     if (A(i, j) >= 0) then
       cycle
     end if
     ...
  end do
end do
```

can be written with `do concurrent` as follows. The header in `(...)` lists the index ranges, and the optional trailing mask expression (here `A(i, j) < 0`) selects which iterations execute:

```fortran
do concurrent (i = 1:ni, j = 1:nj, A(i, j) < 0)
  ...
end do
```
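
To see that the two forms select the same iterations, here is a small self-contained sketch. The array contents, the sizes `ni` and `nj`, and the result arrays `b_seq` and `b_par` are made up for illustration; the loop body simply stores the absolute value of each negative entry.

```fortran
program mask_demo
  implicit none
  integer, parameter :: ni = 3, nj = 4
  integer :: i, j
  real :: a(ni, nj), b_seq(ni, nj), b_par(ni, nj)

  ! Fill a with alternating signs: negative where i + j is odd.
  do j = 1, nj
    do i = 1, ni
      a(i, j) = real(i + j) * (-1.0)**(i + j)
    end do
  end do
  b_seq = 0.0
  b_par = 0.0

  ! Sequential form: skip non-negative entries with cycle.
  do i = 1, ni
    do j = 1, nj
      if (a(i, j) >= 0) cycle
      b_seq(i, j) = -a(i, j)
    end do
  end do

  ! do concurrent form: the mask keeps only the negative entries.
  do concurrent (i = 1:ni, j = 1:nj, a(i, j) < 0)
    b_par(i, j) = -a(i, j)
  end do

  if (all(b_seq == b_par)) then
    print '(a)', 'masked loops agree'
  else
    error stop 'masked loops differ'
  end if
end program mask_demo
```

Note that the mask is the logical complement of the `cycle` condition: the sequential loop skips entries with `a(i, j) >= 0`, so the concurrent loop keeps those with `a(i, j) < 0`.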

The goal of this first exercise is to parallelize the loops indicated with `! TODO`s in [exercise1.f90](./exercise1.f90) using `do concurrent`.

The following cell compiles and runs the [exercise1.f90](./exercise1.f90) template:

In [None]:
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra exercise1.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!gfortran -Ofast -Wall -Wextra exercise1.f90 -o daxpy
!./daxpy 10000000 100

For `nvfortran`, the `-stdpar=multicore` and `-stdpar=gpu` options parallelize `do concurrent` loops on multicore CPUs and on GPUs, respectively.

In [None]:
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra -stdpar=multicore exercise1.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra -stdpar=gpu exercise1.f90 -o daxpy
!./daxpy 10000000 100

### Solution Exercise 1

The solution is available in [solutions/exercise1.f90](./solutions/exercise1.f90):

In [None]:
!rm -f daxpy
!gfortran -Ofast -Wall -Wextra solutions/exercise1.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra solutions/exercise1.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra -stdpar=multicore solutions/exercise1.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra -stdpar=gpu solutions/exercise1.f90 -o daxpy
!./daxpy 10000000 100

## Exercise 2 - `do concurrent` locality specifiers

The following _locality specifiers_ are part of:

* Fortran 2018:
  * `default(none)`: requires every variable used in the loop, except the loop indices, to have an explicit locality specifier.
  * `shared`: all iterations of the loop refer to the same variable in memory.
  * `local`: every _iteration_ of the loop gets its own uninitialized private copy of the variable.
  * `local_init`: like `local`, but each private copy is initialized with the variable's value outside the loop.
* Fortran 2023:
  * `reduce(op:variable)` (e.g. `reduce(+:sum)`): each iteration contributes to the variable through the given operation, and the contributions are combined into a single result.
 
> **Note**: `gfortran` does not support Fortran 2018 locality specifiers yet.

They are specified as part of the `do concurrent` loop:

```fortran
integer :: a, b, c

do concurrent(i = 1:ni, j = 1:nj) default(none) shared(a) local_init(b, c)
  ...
end do
```
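
A minimal runnable sketch of these specifiers is shown below. It assumes a compiler that accepts Fortran 2018 locality specifiers, such as the `nvfortran` used in this lab. The names `locality_demo`, `v`, `w`, `tmp`, and `scale` are illustrative and do not come from the exercise files.

```fortran
program locality_demo
  implicit none
  integer, parameter :: n = 8
  integer :: i
  real :: v(n), w(n), tmp, scale

  scale = 3.0
  v = [(real(i), i = 1, n)]

  ! v and w are shared arrays; tmp is a per-iteration scratch variable;
  ! scale is private but starts each iteration with its outer value.
  do concurrent (i = 1:n) default(none) shared(v, w) local(tmp) local_init(scale)
    tmp = v(i) * v(i)      ! tmp is undefined on entry, so assign before use
    w(i) = scale * tmp
  end do

  ! Each element should equal 3 * i**2.
  if (any(w /= [(3.0 * real(i)**2, i = 1, n)])) error stop 'unexpected result'
  print '(a)', 'locality demo ok'
end program locality_demo
```

With `default(none)` in place, removing any one of the three specifiers should make the compiler reject the loop, which is the point of using it: a forgotten variable becomes a compile-time error instead of a silent data race.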

The goal of [exercise2.f90] is to modify the loops indicated with `! TODO` comments to:
* Add the `default(none)` specifier to each `do concurrent` loop.
* Add the remaining locality specifiers for all other variables.

The following cell compiles and runs the [exercise2.f90] template:

[exercise2.f90]: ./exercise2.f90

In [None]:
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra exercise2.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra -stdpar=multicore exercise2.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra -stdpar=gpu exercise2.f90 -o daxpy
!./daxpy 10000000 100

### Solution Exercise 2

The solution is available in [solutions/exercise2.f90](./solutions/exercise2.f90):

In [None]:
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra solutions/exercise2.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra -stdpar=multicore solutions/exercise2.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra -stdpar=gpu solutions/exercise2.f90 -o daxpy
!./daxpy 10000000 100

## Exercise 3 - reductions

In this exercise, we'll parallelize a variant of `daxpy` we'll call `daxpy_sum`: 

```fortran
sum = 0.
do i = 1, n
  y(i) = y(i) + a * x(i)
  sum = sum + y(i)
end do 
```

The goal of [exercise3.f90] is to modify the loops indicated with `! TODO` comments to:
* Add the `default(none)` specifier to each `do concurrent` loop.
* Add the remaining locality specifiers for all other variables, including a `reduce` specifier for `sum`.
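
To show the shape of a `reduce` clause without giving away the exercise, the sketch below applies it to a different loop: a sum of squares. It assumes a compiler with Fortran 2023 `reduce` support, such as the `nvfortran` used in this lab. The names `reduce_demo`, `v`, and `total` are illustrative.

```fortran
program reduce_demo
  implicit none
  integer, parameter :: n = 100
  integer :: i
  real(kind=8) :: v(n), total

  v = [(real(i, kind=8), i = 1, n)]

  ! The reduction variable must be defined before the loop starts.
  total = 0.0d0
  do concurrent (i = 1:n) default(none) shared(v) reduce(+:total)
    total = total + v(i) * v(i)
  end do

  ! Sum of squares 1..n is n(n+1)(2n+1)/6, which is 338350 for n = 100.
  if (abs(total - 338350.0d0) > 1.0d-9) error stop 'unexpected reduction result'
  print '(a)', 'reduce demo ok'
end program reduce_demo
```

Inside the loop the reduction variable should only be updated with the declared operation (here `+`); reading it for other purposes would defeat the independence that `do concurrent` promises.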

The following cell compiles and runs the [exercise3.f90] template:

[exercise3.f90]: ./exercise3.f90


In [None]:
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra exercise3.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra -stdpar=multicore exercise3.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra -stdpar=gpu exercise3.f90 -o daxpy
!./daxpy 10000000 100

### Solution Exercise 3

The solution is available in [solutions/exercise3.f90](./solutions/exercise3.f90):

In [None]:
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra solutions/exercise3.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra -stdpar=multicore solutions/exercise3.f90 -o daxpy
!./daxpy 10000000 100
!rm -f daxpy
!nvfortran -Ofast -Wall -Wextra -stdpar=gpu solutions/exercise3.f90 -o daxpy
!./daxpy 10000000 100