Bringing Tensor Cores to Standard Fortran
===

# Lab 1: MATMUL

In this tutorial we will familiarize ourselves with how standard Fortran array intrinsic functions can be mapped to GPU-accelerated math libraries.

At the simplest level, only two Fortran statements are required to take advantage of the outstanding performance provided by the cuTENSOR library.


 <span style='background :lightgrey' > **use cutensorex** </span> 

 <span style='background :lightgrey' > **.......**        </span> 

<span style='background :lightgrey' > **c = matmul(a,b)** </span> 

A working implementation is provided in [main.f90](./main.f90) 

Please take 2-3 minutes to skim through it.
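
For orientation, here is a minimal sketch of how the two statements might sit in a complete program. It is not the contents of main.f90: the program name, the array size `n`, and the random initialization are assumptions made for illustration. Because `cutensorex` is an NVIDIA-specific module, the sketch is expected to build only with nvfortran, using options like those in the Makefile.

```fortran
program matmul_sketch
  use cutensorex                  ! statement 1: NVIDIA-specific module (nvfortran only)
  implicit none
  integer, parameter :: n = 1024  ! assumed problem size, chosen for illustration
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  c = matmul(a, b)                ! statement 2: standard Fortran array intrinsic

  print *, 'c(1,1) =', c(1,1)
end program matmul_sketch
```

Removing the `use cutensorex` line should leave a standard Fortran program that any compiler can build, which is a convenient way to compare results.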

## Getting started

Let's start by checking the version of the NVIDIA Fortran compiler installed in the image:


In [2]:
!nvfortran --version


nvfortran 23.3-0 64-bit target on x86-64 Linux -tp haswell 
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.


---

Now let's compile and run the code (see the Makefile for the compiler options):

In [3]:
!make
!./a.out

nvfortran -acc -gpu=managed -cuda -cudalib main.f90
/usr/bin/ld: cannot find -lcusolverMp: No such file or directory
/usr/bin/ld: cannot find -lcal: No such file or directory
/usr/bin/ld: cannot find -lcutensor: No such file or directory
/usr/bin/ld: cannot find -lcutensorMg: No such file or directory
/usr/bin/ld: cannot find -lnccl: No such file or directory
/usr/bin/ld: cannot find -lnvshmem_device: No such file or directory
/usr/bin/ld: cannot find -lnvshmem_host: No such file or directory
pgacclnk: child process exit status 1: /usr/bin/ld
make: *** [Makefile:3: all] Error 2
/bin/bash: line 1: ./a.out: No such file or directory


The table below gives the performance of this code snippet on different architectures.

Table 1. real(8) performance on one NUMA node of a dual-socket AMD EPYC 7742 Rome CPU-based server, a single V100, and a single A100 GPU.

| Implementation / Processor | Performance (TFLOPS) |
| :----: | :----: |
| nvfortran matmul on a single CPU core | 0.010 |
| MKL DGEMM on 64 CPU cores | 1.674 |
| Naive OpenACC on V100 | 0.235 |
| Naive OpenACC on A100 | 0.447 |
| nvfortran matmul on V100 | 6.866 |
| nvfortran matmul on A100 | 17.660 |


If your performance is not in the same range for your architecture, compile with the Makefile in the solution folder. It forces a static link of the cuTENSOR library, which avoids loading the library repeatedly, as can happen with dynamic linking and can be costly.
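
To place your own run next to the figures in Table 1, throughput can be estimated from the conventional operation count of a dense matrix product, 2·ni·nj·nk floating-point operations, divided by the elapsed wall-clock time. The sketch below is one way to do this in standard Fortran. The sizes, the repetition count, and the use of `system_clock` are assumptions, not necessarily what main.f90 does. Note that the measured time may include data movement and library start-up, so the first iterations can lower the reported number.

```fortran
program matmul_tflops
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none
  integer, parameter :: n = 512, ntimes = 5   ! assumed size and repetition count
  real(real64), allocatable :: a(:,:), b(:,:), c(:,:)
  integer(int64) :: t0, t1, rate
  real(real64) :: seconds, flops
  integer :: it

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0_real64

  call system_clock(t0, rate)                 ! wall-clock ticks and ticks per second
  do it = 1, ntimes
    c = c + matmul(a, b)                      ! accumulate so every product is used
  end do
  call system_clock(t1)

  ! Guard against a zero interval on clocks with coarse resolution.
  seconds = max(real(t1 - t0, real64) / real(rate, real64), tiny(1.0_real64))

  ! 2*n**3 operations per n-by-n product, repeated ntimes.
  flops = 2.0_real64 * real(n, real64)**3 * real(ntimes, real64)

  print '(a,f12.6)', 'elapsed seconds: ', seconds
  print '(a,es12.4)', 'TFLOPS:          ', flops / seconds / 1.0e12_real64
end program matmul_tflops
```

For numbers comparable to the table, a much larger `n` is needed; small sizes mostly measure overhead.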

Supported standard Fortran operations
===

The cuTENSOR library contains general permutation and contraction operations. The result of the permutation can optionally be operated on by an elemental function, and optionally scaled.

The nvfortran compiler can recognize and map a variety of Fortran transformational intrinsics and elemental intrinsic functions used in combination with general array syntax to cuTENSOR functionality. A few of the more straightforward translations include the following:



d = transpose(a)

d = func(transpose(a))

d = alpha * func(transpose(a))

d = reshape(a,shape=[...])

d = reshape(a,shape=[...],order=[...])

d = func(reshape(a,...))

d = alpha * func(reshape(a,...))

d = spread(a,dim=k,ncopies=n)

d = func(spread(a,dim=k,ncopies=n))

d = alpha * func(spread(a,dim=k,ncopies=n))
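
The forms listed above are ordinary Fortran, so their meaning can be checked on any compiler before relying on a GPU mapping. The following sketch uses `sqrt` as a stand-in for `func`; that choice, along with the array sizes and names, is an assumption for illustration, and the text here does not state which elemental functions are mapped. As written it runs on the CPU. Under nvfortran, adding `use cutensorex` is presumably what makes these statements eligible for the cuTENSOR mapping.

```fortran
program permute_forms
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ni = 3, nj = 4, n = 2     ! assumed small sizes
  real(dp), parameter :: alpha = 2.5_dp, tol = 1.0e-12_dp
  real(dp) :: a(ni,nj), d(nj,ni), r(nj,ni), s(ni,nj,n)
  integer :: i, j, k

  call random_number(a)

  ! Scaled elemental function of a transpose: d(j,i) = alpha*sqrt(a(i,j)).
  d = alpha * sqrt(transpose(a))
  do j = 1, nj
    do i = 1, ni
      if (abs(d(j,i) - alpha*sqrt(a(i,j))) > tol) error stop 'transpose form mismatch'
    end do
  end do

  ! reshape with order=[2,1] fills the second dimension fastest,
  ! which for a rank-2 array is the same permutation as transpose.
  r = reshape(a, shape=[nj,ni], order=[2,1])
  if (any(r /= transpose(a))) error stop 'reshape form mismatch'

  ! spread adds a new third dimension holding n copies of a.
  s = spread(a, dim=3, ncopies=n)
  do k = 1, n
    if (any(s(:,:,k) /= a)) error stop 'spread form mismatch'
  end do

  print *, 'all permutation forms match their definitions'
end program permute_forms
```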

The inputs to matmul() can also be permuted in cuTENSOR, and the result can be scaled and accumulated. That leads to several possible combinations, such as the following statements:

c = matmul(a,b)

c = c + matmul(a,b)

c = c - matmul(a,b)

c = c + alpha * matmul(a,b)

d = alpha * matmul(a,b) + beta * c

 
c = matmul(transpose(a),b)

c = matmul(reshape(a,shape=[...],order=[...]),b)

c = matmul(a,transpose(b))

c = matmul(a,reshape(b,shape=[...],order=[...]))
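
As a quick check of what two of these combinations compute, the sketch below compares the scaled-and-accumulated form and a transposed-input form against explicit loops. It is plain standard Fortran with assumed sizes, names (`at`, `e`, `ref`), and tolerance, and it says nothing about performance. It only pins down the arithmetic that a GPU-mapped version would be expected to reproduce.

```fortran
program matmul_forms
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ni = 4, nj = 3, nk = 5    ! assumed small sizes
  real(dp), parameter :: alpha = 2.0_dp, beta = 0.5_dp, tol = 1.0e-12_dp
  real(dp) :: a(ni,nk), at(nk,ni), b(nk,nj), c(ni,nj)
  real(dp) :: d(ni,nj), e(ni,nj), ref(ni,nj)
  integer :: i, j, k

  call random_number(a)
  call random_number(at)
  call random_number(b)
  call random_number(c)

  ! Scaled and accumulated product.
  d = alpha * matmul(a, b) + beta * c

  ! Reference: d(i,j) = alpha * sum_k a(i,k)*b(k,j) + beta*c(i,j).
  do j = 1, nj
    do i = 1, ni
      ref(i,j) = beta * c(i,j)
      do k = 1, nk
        ref(i,j) = ref(i,j) + alpha * a(i,k) * b(k,j)
      end do
    end do
  end do
  if (maxval(abs(d - ref)) > tol) error stop 'scaled/accumulated form mismatch'

  ! Permuted first input: at is stored as (nk,ni) and transposed on the fly.
  e = matmul(transpose(at), b)

  ! Reference: e(i,j) = sum_k at(k,i)*b(k,j).
  ref = 0.0_dp
  do j = 1, nj
    do i = 1, ni
      do k = 1, nk
        ref(i,j) = ref(i,j) + at(k,i) * b(k,j)
      end do
    end do
  end do
  if (maxval(abs(e - ref)) > tol) error stop 'transposed-input form mismatch'

  print *, 'matmul forms match the loop references'
end program matmul_forms
```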