Accelerating portable HPC Applications with ISO Fortran
===

# Lab 1: DAXPY

In this tutorial we will familiarize ourselves with the Fortran DO CONCURRENT feature by implementing Double-precision AX Plus Y (DAXPY): $A \cdot X + Y$, one of the main functions in the standard Basic Linear Algebra Subprograms (BLAS) library.

The operation is a combination of scalar multiplication and vector addition. It takes two vectors of 64-bit floats, `x` and `y`, and a scalar value `a`.
It multiplies each element `x(i)` by `a` and adds the result to `y(i)`.
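
Written element-wise, with $n$ denoting the vector length, the update described above is:

$$
y_i \leftarrow a \, x_i + y_i, \qquad i = 1, \dots, n
$$

As a small worked example (values chosen here purely for illustration), with $a = 2$, $x = (1, 1, 1)$ and $y = (1, 2, 3)$, the result is $y = (3, 4, 5)$.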

A working implementation is provided in [daxpy.f90].
Please take 2-3 minutes to skim through it.

## Validating solutions

For all the exercises, we assume that initially the values are `x(i) = 1` and `y(i) = i`.
The `check` function then verifies the effect of applying `daxpy` to these two vectors.
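
To make the expected result concrete: with `x(i) = 1` and `y(i) = i`, applying DAXPY should leave `y(i) = a + i`. The following is a minimal sketch of such a check, not the `check` function from [daxpy.f90]. The program name, the function signature, the tolerance, and the choice `a = 2` are all assumptions made for illustration.

```fortran
program check_example
  implicit none
  integer, parameter :: n = 1000
  real(kind=8) :: x(n), y(n), a
  integer :: i

  a = 2.0d0   ! illustrative value, not taken from daxpy.f90

  ! Initial values assumed by all exercises
  do i = 1, n
    x(i) = 1
    y(i) = i
  end do

  ! Sequential DAXPY
  do i = 1, n
    y(i) = a * x(i) + y(i)
  end do

  if (check(y, n, a)) then
    print *, 'OK'
  else
    print *, 'FAILED'
  end if

contains

  ! Hypothetical checker: the real one in daxpy.f90 may differ.
  ! Returns .true. if every y(i) equals a*1 + i within a small tolerance.
  logical function check(y, n, a)
    real(kind=8), intent(in) :: y(:), a
    integer, intent(in) :: n
    integer :: i
    check = .true.
    do i = 1, n
      if (abs(y(i) - (a + real(i, kind=8))) > 1.0d-12) check = .false.
    end do
  end function check

end program check_example
```

Because the expected value has a closed form, the check does not need a second reference copy of the vectors.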

We will always run this check once.

## Sequential implementation

The "core" of the sequential implementation provided in [daxpy.f90] is split into two parts:


```fortran
! Initialize vectors `x` and `y`: raw loop sequential version
do i = 1, n
  x2(i) = 1
  y2(i) = i
enddo

! daxpy
subroutine daxpy_do(x,y,n,a)
  real(kind=8),dimension(:) :: x, y
  real(kind=8) :: a
  integer :: n, i
  do i = 1, n
    y(i) = a*x(i)+y(i)
  enddo
end subroutine
```

For testing purposes, we initialize the vectors with the values described above, `x(i) = 1` and `y(i) = i`.

The `daxpy` subroutine implements a loop over all vector elements, reading from both `x` and `y` and writing the result back to `y`.

[daxpy.f90]: ./daxpy.f90

## Getting started

Let's start by checking the version of some of the compilers installed in the image:


In [None]:
!nvfortran --version

---

Now let's compile and run the example, [daxpy.f90](./daxpy.f90):

In [None]:
!nvfortran -stdpar -Minfo -o daxpy daxpy.f90
!./daxpy 1000000

You will notice that a CPU-sequential version runs alongside the 'to-be-parallelized' version.

## Exercise 1: Parallelize the DAXPY operation

The goal of this first exercise is to parallelize the raw DAXPY loop using a DO CONCURRENT construct.
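
For readers who have not used the construct before, here is a minimal sketch of the DO CONCURRENT syntax on a different, simpler loop (scaling a vector in place). The program and variable names are made up for this illustration and do not come from the template.

```fortran
program do_concurrent_example
  implicit none
  integer, parameter :: n = 8
  real(kind=8) :: v(n)
  integer :: i

  v = 1.0d0

  ! Ordinary DO loop: iterations run in order, one after another
  do i = 1, n
    v(i) = 2.0d0 * v(i)
  end do

  ! DO CONCURRENT: the programmer asserts that iterations are independent,
  ! so the compiler is free to run them in any order or in parallel
  do concurrent (i = 1:n)
    v(i) = 2.0d0 * v(i)
  end do

  ! Every element has been doubled twice, so each one now holds 4.0
  print *, v(1), v(n)
end program do_concurrent_example
```

Note the different loop header: the index range goes inside parentheses and uses a colon instead of a comma. The construct is only valid when no iteration depends on the result of another, which holds for element-wise updates such as this one.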

The `TODO`s indicate the parts of the template that must be completed.

A solution is provided in [daxpy1.f90](./solutions/daxpy1.f90). 

Compare the performance of the GPU parallelized version to the CPU version.



## Exercise 2: Optimizing performance with memory placement

In Exercise 1 we focused on parallelizing the DAXPY loop; however, the overall code runs slower than the serial CPU version.

Profile your GPU version with Nsight Systems.

(Hint: Look at CPU<->GPU data movements, including page faults) 

Implement a strategy to minimize data movements.

A solution is provided in [daxpy2.f90](./solutions/daxpy2.f90). 
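
One possible strategy, sketched below, is to initialize the vectors inside a DO CONCURRENT loop as well, so that the data is first touched where the DAXPY loop later runs. This is an illustration only and is not necessarily what `daxpy2.f90` does. It assumes that the arrays are allocatable and that `-stdpar` places them in memory that migrates between CPU and GPU on access. The program name and the value of `a` are made up for this example.

```fortran
program init_in_parallel_example
  implicit none
  integer, parameter :: n = 1000000
  real(kind=8), allocatable :: x(:), y(:)
  real(kind=8) :: a
  integer :: i

  a = 2.0d0   ! illustrative value
  allocate(x(n), y(n))

  ! Initialize in a DO CONCURRENT loop instead of a sequential CPU loop.
  ! Assumption: when this loop is offloaded, the first access to x and y
  ! happens on the GPU, so the data does not have to migrate from the CPU.
  do concurrent (i = 1:n)
    x(i) = 1
    y(i) = i
  end do

  ! DAXPY, also expressed with DO CONCURRENT
  do concurrent (i = 1:n)
    y(i) = a * x(i) + y(i)
  end do

  ! Reading y on the CPU: expected values are a + 1 and a + n
  print *, y(1), y(n)

  deallocate(x, y)
end program init_in_parallel_example
```

Profiling before and after such a change shows whether the CPU<->GPU transfers and page faults around the DAXPY loop have actually gone down.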