In [None]:
# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings = lambda *a, **kw: None
from IPython.core.display import HTML

HTML(open("../documents/custom.html", "r").read())

<br/>
<span style="background:#f0f0e0;padding:1em">Copyright (c) 2020-2021 ETH Zurich, Scientific IT Services. This work is licensed under <a href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></span><br/>
<br/>

<p style="font-size: 2.5em; font-weight: bold;">Section 7: Graphics Processing Units (GPUs)</p>

<table style="width: 100%;">
    <tr>
        <td style="width: 30%; text-align:center">
            <figure>
                <div style="overflow:hidden; display:inline-block;">
               <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ea/AMD%4014nm%40GCN_5th_gen%40Vega10%40Radeon_RX_Vega_64%40ES-Sample%40_Stack-DSC09254-DSC09287_-_ZS-retouched.jpg/1280px-AMD%4014nm%40GCN_5th_gen%40Vega10%40Radeon_RX_Vega_64%40ES-Sample%40_Stack-DSC09254-DSC09287_-_ZS-retouched.jpg">
                </div>
               <figcaption style="text-align: center">
                   <a href="https://commons.wikimedia.org/wiki/File:AMD@14nm@GCN_5th_gen@Vega10@Radeon_RX_Vega_64@ES-Sample@_Stack-DSC09254-DSC09287_-_ZS-retouched.jpg">
                       Wikimedia.org
                   </a>
                   :
                   <a href="https://creativecommons.org/publicdomain/zero/1.0/deed.en">
                       CC0 1.0
                   </a>
               </figcaption>
            </figure>
        </td>
        <td style="width: 70%; text-align: left; font-size: 1.2em; line-height:160%;">
            
Graphics Processing Units (GPUs) were initially designed to do just that: <b>process graphics</b>. 
            
As demands grew, an increasing number of tasks were moved onto the GPU, and developers needed more and more control over the graphics pipeline. Of course, one of the main drivers behind the development of both the hardware and the ecosystem was <a href="https://en.wikipedia.org/wiki/DirectX">gaming</a>, but engineering and cinematic applications also <a href="https://www.pearson.com/us/higher-education/program/Sanders-CUDA-by-Example-An-Introduction-to-General-Purpose-GPU-Programming/PGM200291.html">played an important role</a>.
        </td>
    </tr>
</table>

# History

The field standardized around graphics APIs like [OpenGL](https://en.wikipedia.org/wiki/OpenGL) or [DirectX](https://en.wikipedia.org/wiki/DirectX) that permit developers to control the graphics that are rendered onto the monitor, and those were continuously expanded. An important milestone was reached when DirectX 8.0, released in the year 2000, [gave developers more direct control over the shaders that are executed on the GPU](https://www.pearson.com/us/higher-education/program/Sanders-CUDA-by-Example-An-Introduction-to-General-Purpose-GPU-Programming/PGM200291.html).

## Pioneers

This sparked the interest of researchers who started to "<b>misuse</b>" the capability for computations. In the beginning, this was an inconvenient task: GPU programming still required the use of graphics APIs like the ones mentioned above, and data input and output had to be expressed in terms of graphics primitives like pixel colors.

# The Faster The Better

Traditional strategies for increasing application performance, such as boosting clock speeds, have [hit limits](http://www.gotw.ca/publications/concurrency-ddj.htm) ("the free lunch is over", see Section 1).
As was discussed in the previous sections, writing software that takes advantage of current state-of-the-art CPUs like AMD's 64-core processor [available at the ETH HPC Euler cluster](https://scicomp.ethz.ch/wiki/Euler) means exploiting parallelism. This is particularly true for GPUs.

<br/>
<figure>
   <img src="https://scicomp.ethz.ch/w/images/7/76/ETH_Zurich_Euler_II_and_I_in_LCA.jpg" style="max-width: 600px"/>
       <figcaption style="text-align: center">
           <a href="https://scicomp.ethz.ch/wiki/File:ETH_Zurich_Euler_II_and_I_in_LCA.jpg">
               scicomp.ethz.ch
           </a>
           :
           © 2015 Olivier Byrde, ETH Zurich
       </figcaption>
</figure>

## Cores

Drawing from their heritage in computer graphics, GPUs have been designed to tackle [highly parallelizable workloads](https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units). This also shows in <b>core count</b>. For instance, Nvidia's recently released A100 GPU provides around [7000 CUDA cores](https://en.wikipedia.org/wiki/Ampere_(microarchitecture)). If exploited by an application, this permits a speed-up of up to [around an order of magnitude](https://www.karlrupp.net/2016/08/flops-per-cycle-for-cpus-gpus-and-xeon-phis/#more-676).

## Performance

Of course, the actual performance increase depends on the concrete hardware and the application. GPUs will not replace CPUs anytime soon, but offloading work to this coprocessor has the additional advantage of freeing up CPU resources for other tasks.

# Deep Learning

Deep learning has been able to improve the performance of computers in many interesting fields of application. However, working with neural networks often requires significant computational resources. Very often, some of the necessary compute power is provided by GPUs. For instance, versions of [AlphaGo used up to 280 GPUs and thousands of CPUs](https://en.wikipedia.org/wiki/AlphaGo).

<figure>
   <img src="https://upload.wikimedia.org/wikipedia/commons/c/c2/MultiLayerNeuralNetworkBigger_english.png" style="max-width: 600px"/>
       <figcaption style="text-align: center">
           <a href="https://commons.wikimedia.org/wiki/File:MultiLayerNeuralNetworkBigger_english.png">
               Wikimedia.org
           </a>
           :
           <a href="https://creativecommons.org/licenses/by-sa/3.0/deed.en">
               CC BY-SA 3.0
           </a>
       </figcaption>
</figure>

# CPU versus GPU

CPUs and GPUs are designed for different purposes, and their architects have therefore made different tradeoffs. We'll try to explain those use cases with an analogy from real life.

## CPU: The Racing Car

Let's say you live in Zurich and you want to go skiing in Laax, which is about 80 kilometers from Zurich. Environmental concerns aside, which mode of transportation would you choose?

Most likely, you would choose the car.

<figure>
   <img src="https://upload.wikimedia.org/wikipedia/commons/9/9a/Bugatti_Chiron_%2823628630038%29.jpg" style="max-width: 600px"/>
       <figcaption style="text-align: center">
           <a href="https://commons.wikimedia.org/wiki/File:Bugatti_Chiron_(23628630038).jpg">
               Wikimedia.org
           </a>
           :
           <a href="https://creativecommons.org/licenses/by/2.0/deed.en">
               CC BY 2.0
           </a>
       </figcaption>
</figure>

The reason is that it is designed for this type of use:

* Fast
* Multipurpose
* Large storage space per person

In this sense the car is like the CPU:

* High clock speed
* Most applications work
* Large main memory per core

CPUs are great.

## GPU: The Bus
However, what if you want to go together with your local skiing club? In this case, using the car is very inconvenient. It is still the fastest way to go to Laax once, but you'll have to drive back and forth to transport everybody. If you have access to one, you would likely choose to go by bus instead.

<figure>
   <img src="https://upload.wikimedia.org/wikipedia/commons/e/e0/Van-Hool-Bus_in_München.jpg" style="max-width: 600px"/>
       <figcaption style="text-align: center">
           <a href="https://commons.wikimedia.org/wiki/File:Van-Hool-Bus_in_München.jpg">
               Wikimedia.org
           </a>
           :
           <a href="https://creativecommons.org/licenses/by/3.0/de/deed.en">
               CC BY 3.0 DE
           </a>
       </figcaption>
</figure>

Of course, a bus is designed for this type of use:

* Transports a group at once
* Car can be used for something else

Note that this is true even if the bus is slower than the car, the group has to be at the stop at the same time, the group has to walk to the stop, and there's less storage space per person.

Again, this example translates to properties of a GPU:

* Many cores
* Work can be offloaded from the CPU to the GPU
* Lower clock speed than the CPU
* Many similar tasks should be available at the same time
* Memory needs to be transferred to the GPU
* Smaller GPU memory

GPUs are great, too.

# GPU: Use Cases

Based on these properties, there are some workloads that are particularly suitable for the GPU. In fact, GPUs have been used in a [wide field of applications](https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units).

* Neural Networks, machine learning
* Video processing, computer vision
* Scientific computing (climate research, molecular dynamics, ...)
* ...

However, applications need to respect the different programming model on GPUs.

# SIMT Programming Model

Many CPUs provide SIMD ([single instruction multiple data](https://en.wikipedia.org/wiki/SIMD)) instructions that permit the developer (or compiler) to perform the same operation on a batch of data to improve performance. GPUs usually take a different approach. This model is called SIMT: [single instruction multiple threads](https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads). 

In this model developers control [individual threads](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture). To get a first impression of that concept, let's consider a function that adds some numbers in a list `source` to a list `target`. In Python, this function could look like this:

In [None]:
def add_to(source, target):
    """Adds the elements of `source` to the elements of `target` at the same index."""
    for index in range(len(source)):
        target[index] += source[index]

In the SIMD world, we would make sure that [chunks of the addition operations are performed in parallel](https://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html). In the SIMT programming model, on the other hand, we specify the work of a single thread. The pseudo-code function `add_to_simt` below does exactly that (we will see a complete example later): it produces the expected result if we have `len(source)` threads and each of them calls the function with its own `thread_index`. However, since many of us are more used to loops, low-level GPU programming can take some getting used to.

$$ \textrm{target} = \begin{pmatrix} [\textrm{computed by thread$_0$ using source$_0$}] \\ \vdots \\ [\textrm{computed by thread$_n$ using source$_n$}] \end{pmatrix} $$

In [None]:
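def add_to_simt(source, target):
    """Adds the element of `source` at `thread_index` to the element of `target` at the same index."""
    # Pseudo-code: `thread_index` is not defined here. On a GPU, the runtime
    # provides every thread with its own index.
    target[thread_index] += source[thread_index]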

Note that the SIMT programming model gives a lot of control to the developer. However, applications will only perform well if [the threads move in lockstep](https://hdms.bsz-bw.de/frontdoor/deliver/index/docId/4500/file/gpgpu-origins-and-gpu-hardware-architecture.pdf). Performance is [suboptimal if the threads diverge](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture) (i.e. execute different branches), which is why domain decomposition (in the sense of performing the same operation on a chunk of data in every thread) is usually a promising approach. In particular, if-statements (branches) on GPUs are [problematic](https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads). 
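To make the pseudo-code above concrete without a GPU, the following minimal sketch emulates a launch on the CPU. The helper `launch` and the explicit `thread_index` parameter are illustrative assumptions, not part of any GPU library: on real hardware the index is supplied by the runtime and the threads run in parallel rather than one after another.

```python
def add_to_kernel(thread_index, source, target):
    """Work of one simulated thread: adds a single element of `source` to `target`."""
    # Launches often start more threads than there are elements,
    # so surplus threads must do nothing.
    if thread_index < len(source):
        target[thread_index] += source[thread_index]


def launch(kernel, number_of_threads, *args):
    """Hypothetical stand-in for a GPU launch: runs the threads sequentially."""
    for thread_index in range(number_of_threads):
        kernel(thread_index, *args)


source = [1, 2, 3, 4]
target = [10, 20, 30, 40]
launch(add_to_kernel, len(source), source, target)
print(target)  # [11, 22, 33, 44]
```

Note that the only branch in the kernel is the bounds check, so all threads that do real work follow the same path.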

# Different Vendors, Different Platforms

Moreover, the current GPU market differs from the CPU market considerably. When programming for CPUs, and especially when using Python, a program written for a desktop Intel CPU will (usually) also work on a desktop AMD CPU. Unfortunately, the GPU market has not converged to this extent.

There are [three main vendors of GPUs](https://www.statista.com/statistics/754557/worldwide-gpu-shipments-market-share-by-vendor/). Each of these vendors pushes their own platform for GPU programming (including compilers):

* Intel: [oneAPI](https://www.oneapi.com)
* Nvidia: [CUDA](https://en.wikipedia.org/wiki/CUDA)
* AMD: [ROCm](https://www.amd.com/de/graphics/servers-solutions-rocm)

This makes porting code between vendors challenging, even if there are (limited) ways to accomplish this (e.g. [OpenCL](https://de.wikipedia.org/wiki/OpenCL)).

Fortunately, higher level packages exist in the Python ecosystem. These packages (e.g. [TensorFlow](https://www.tensorflow.org), [PyTorch](https://pytorch.org), [Numba](https://numba.pydata.org), ...) support different computational platforms and therefore allow the user to work on system-independent code.

Since only Nvidia/CUDA cards are available in the [Euler cluster](https://scicomp.ethz.ch/wiki/Getting_started_with_GPUs), we will focus on this platform.

# GPU-accelerated Python Libraries

Python programmers can directly access the Nvidia CUDA and CUDA toolkit APIs with:
* [PyCUDA](https://documen.tician.de/pycuda/) to access Nvidia CUDA's parallel computing API.
* [Scikit-CUDA](https://scikit-cuda.readthedocs.io/en/latest/) to access the Nvidia CUDA programming toolkit libraries including CUBLAS, CUFFT and CUSOLVER.
* [CUDA Python](https://developer.nvidia.com/cuda-python)

Fortunately, there are also many Python libraries that offer convenient ways to perform mathematical operations on the GPU. For instance, we've already mentioned [TensorFlow](https://www.tensorflow.org/) and [PyTorch](https://pytorch.org/), which allow creating deep learning models in Python and training them on the GPU. There are also general purpose GPU-accelerated Python libraries which often mimic the API of well-known Python libraries.

## RAPIDS

[RAPIDS](https://www.rapids.ai) is a data science framework ([incubated by Nvidia](https://rapids.ai/about.html)) which offers GPU-accelerated libraries for executing end-to-end data science pipelines, from data preparation to machine learning. The libraries available in the RAPIDS framework include:

* cuDF: a dataframe manipulation library that mimics [pandas](https://pandas.pydata.org)
* cuML: a collection of machine learning libraries similar to [scikit-learn](https://scikit-learn.org/stable/)
* cuSignal: a direct port of [SciPy](https://scipy.org) Signal to the GPU
* cuGraph: a collection of graph algorithms that matches the API of [NetworkX](https://networkx.org)

RAPIDS is available as a Conda package or as a Docker image, but it can also be built from source. It integrates well with other open source projects such as [Apache Arrow](https://arrow.apache.org), [Dask](https://dask.org), [XGBoost](https://xgboost.ai), and [scikit-learn](https://scikit-learn.org/stable/) to provide a GPU-accelerated data science ecosystem.

## CuPy

[CuPy](https://cupy.dev) is [developed by Preferred Networks](https://cupy.dev/) and provides a feature set similar to NumPy on Nvidia GPUs. We'll have a closer look at how CuPy can be used to port an algorithm to the GPU in the demo section below.

CuPy has the following hardware and software requirements (see the [documentation](https://docs.cupy.dev/en/stable/install.html#install-cupy) for details):

* Nvidia CUDA GPU
* Nvidia CUDA Toolkit
* Python 3

You can install CuPy using `pip install cupy-cuda<version>`; see the [CuPy documentation](https://docs.cupy.dev/en/stable/install.html#installing-cupy) for more details.

Note that, in particular, Euler satisfies the requirements above, and CuPy is already installed.
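If you are unsure whether CuPy is installed in your current environment, a quick check like the following sketch can help. The function name `pick_array_backend` is a hypothetical helper introduced for illustration. It only tests whether the package can be found; it does not verify that a usable GPU is present.

```python
import importlib.util


def pick_array_backend():
    """Returns "cupy" if the package can be found, otherwise "numpy"."""
    # find_spec looks the package up without importing it.
    if importlib.util.find_spec("cupy") is not None:
        return "cupy"
    return "numpy"


print(pick_array_backend())
```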

# Demos

In this section we will first look at two examples that show how to create custom GPU-accelerated Python scripts. The first example uses CuPy as a drop-in replacement as this is the easiest way to use a GPU with Python. We will already see a performance gain; however, we are restricted by the functions provided by the library. The second example shows how you can gain more flexibility and control by using Numba for CUDA. Finally we will look at training a TensorFlow model.

Before showing the demo, we would like to explain how to request a GPU node on the Euler cluster, load modules and run Jupyter on the cluster. Please note that only the members of shareholder groups have access to GPUs on the cluster.

**Request a GPU node on the cluster**

When you have logged in to the cluster, you can request 1 GPU node by using the `bsub` option `-R "rusage[ngpus_excl_p=1]"`. Here is a command example to request an interactive session on a compute node with 1 GPU:

```bash
$ bsub -n 4 -W 01:00 -R "rusage[mem=2048, ngpus_excl_p=1]" -Is bash
```

**Load Python module for GPU**

The `python_gpu` module on the cluster includes TensorFlow and CuPy. By loading this Python module, the CUDA toolkit, the CuDNN and the NCCL library are also made available. The loading command reads:

```bash
$ module load gcc/6.3.0 python_gpu/3.8.5
```

**Run Jupyter on the cluster**

From your local computer, you can start a Jupyter Notebook in a batch job on the cluster with GPUs by using the script [provided by cluster support](https://gitlab.ethz.ch/sfux/Jupyter-on-Euler-or-Leonhard-Open). To run this script, follow the instructions in the README of the linked repository for the initial setup. Then use a command similar to the following example (replacing "\<USERNAME>" by your ETH username):

```bash
$ ./start_jupyter_nb.sh --username <USERNAME> --numgpu 1 --numcores 4 --memory 2048 --extra-modules gcc/6.3.0 --extra-modules python_gpu/3.8.5
```

## Demo: CuPy

For this example we use the NumPy version of the Euclidean distance matrix example as a starting point. We will replace NumPy with CuPy to offload calculations to the GPU. 

Let's recall the NumPy version before we move on to the CuPy version.

**NumPy version**

We use NumPy for the setup and focus on using CuPy in the calculation of the distance matrix.

In [None]:
import numpy as np


def create_random_points(number_of_points: int) -> np.ndarray:
    """
    Returns number_of_points random points in 3D space.
    """
    return np.random.rand(number_of_points, 3)


random_points = create_random_points(4096)

As before, we use `%timeit` to measure the runtime of the NumPy version to compare with that of the CuPy version.

In [None]:
%pycat demos/demo_numpy.py

In [None]:
from demos.demo_numpy import distance_matrix as distance_matrix_numpy

%timeit distance_matrix_numpy(random_points)
M = distance_matrix_numpy(random_points)
print(M[:5, :5])

**CuPy version**

First, import cupy. Note that `cp` is a widely used abbreviation for `cupy`, just as `np` is for `numpy`.

In [None]:
try:
    import cupy as cp
except Exception as error:
    print(error)

To start manipulating arrays and calculating on the GPU, we create an array on the current GPU device by using the command `cp.array`. CuPy makes using GPUs simple by mimicking NumPy functions, so we can directly replace `np.einsum` with `cp.einsum` and `np.matmul` with `cp.matmul`. After the calculation on the GPU, we return an array to the host memory by using the command `cp.asnumpy`.
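As a rough illustration of this drop-in idea, here is a minimal sketch of a squared-distance computation written against an interchangeable array module. It is not the content of `demos/demo_cupy.py`; the function name `distance_matrix_sketch` and the `xp` parameter are assumptions made for this example. It runs with NumPy as written and should work with CuPy when `xp=cp` is passed on a machine with a GPU.

```python
import numpy as np


def distance_matrix_sketch(points, xp=np):
    """Squared Euclidean distances between all pairs of rows in `points`.

    `xp` is the array module to use (NumPy here, CuPy on a GPU).
    """
    points = xp.asarray(points)
    squared_norms = xp.einsum("ij,ij->i", points, points)
    x_2 = squared_norms[:, None]  # |x|^2 as a column
    y_2 = squared_norms[None, :]  # |y|^2 as a row
    x_y = xp.matmul(points, points.T)  # all pairwise dot products
    return x_2 - 2 * x_y + y_2


example_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
print(distance_matrix_sketch(example_points))
```

With CuPy, the same function could be called as `cp.asnumpy(distance_matrix_sketch(cp.array(example_points), xp=cp))`, which copies the input to the GPU and the result back to host memory.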

In [None]:
%pycat demos/demo_cupy.py

In [None]:
try:
    from demos.demo_cupy import distance_matrix as distance_matrix_cupy

    %timeit distance_matrix_cupy(random_points)
    M = distance_matrix_cupy(random_points)
    print(M[:5, :5])
except Exception as error:
    print(error)

As you can see from the time measurement, with very little effort, we have sped up the computation by around 4 times for `n=4096`! Also, we still get the same result.

### Comparing the NumPy and CuPy versions
Does the GPU always perform better in this example? Let's take a closer look at the special case where the problem size is small (number of points less than 800). To do so, we have created a benchmark you can find in the subdirectory `demos/`.

In [None]:
%matplotlib inline
import json

import matplotlib.pyplot as plt

with open("demos/measurements.json") as measurements_file:
    measurements = json.load(measurements_file)

for label in ["numpy", "cupy"]:
    transform = {"numpy": "numpy (1 core)", "cupy": "cupy"}
    plt.plot(
        [int(key) for key in measurements.keys()],
        [value[label] for value in measurements.values()],
        label=transform[label],
    )

plt.xlabel("Number of points")
plt.ylabel("Average runtime [s]")
plt.title("Comparison of NUMPY versus CUPY implementations")
plt.legend()

plt.ylim(0.0, 0.002)
plt.xlim(10, 800);

The plot shows that, for small numbers of points, the CPU is faster than the GPU!

We used `line_profiler` to check which function causes this behavior. As `line_profiler` is not provided in the centrally-installed Python packages on the cluster, it can be installed locally in the home directory on the cluster with the command `pip install --user line-profiler` (or alternatively in a [virtual environment](https://docs.python.org/3/library/venv.html)). Please see Section 2 for how to use `line_profiler` and how to interpret the output.

The command line to profile the NumPy and CuPy versions of the function reads:
```bash
$ kernprof -vl benchmark_cupy.py 10
```

where 10 is the number of points in the array. In this test, we varied the number of points from 10 to 800.

In the following, "ops" refers to the line that computes `x_2 - 2 * x_y + y_2`.

In [None]:
%matplotlib inline
import json

import matplotlib.pyplot as plt

with open("demos/measurements_ops.json") as measurements_file:
    measurements = json.load(measurements_file)

number_of_points = [int(key) for key in measurements.keys()]

for library, style in zip(["np", "cp"], ["--", "-"]):
    for tag in ["einsum", "matmul", "ops"]:
        label = f"{library}.{tag}"
        plt.plot(
            number_of_points,
            [value[label] for value in measurements.values()],
            style,
            label=label,
        )

plt.xlabel("Number of points")
plt.ylabel("Average runtime [microseconds]")
plt.title("Comparison of NUMPY versus CUPY operations")
plt.legend()

plt.xlim(10, 800);

As you can see from the comparison of the runtimes of the NumPy and CuPy operations, the CPU computed the smaller tasks faster than the GPU. However, the runtime of the NumPy operations (such as `np.matmul`, multiplication, and addition) increased steeply with the problem size. The GPU (lower clock speed, highly parallel), on the other hand, was slower for small problem sizes, but its runtime stayed almost the same as the problem size increased.

**GPU memory limit**

The GPU is fast, but its memory size is a limitation. When running the CuPy version with `n = 30**3` (27,000 points) in this notebook, the peak memory consumption of the function `distance_matrix` was around 16 GB. This exceeded the available memory of the GPU we used (NVIDIA GeForce RTX 2080 Ti, 11 GB). Therefore, for larger problems, we would need to use the CPU, or break up the dataset into chunks that each fit into the GPU memory and adapt the algorithm to process the data chunk-wise on the GPU (see Outlook below).
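A back-of-the-envelope estimate makes the reported memory consumption plausible. The sketch below assumes 64-bit floats and, as a guess about the implementation, that about three full matrices are alive at the same time while the final expression is evaluated.

```python
def matrix_bytes(number_of_points, bytes_per_entry=8):
    """Size in bytes of one square matrix of 64-bit floats."""
    return number_of_points * number_of_points * bytes_per_entry


number_of_points = 30**3  # 27,000 points
one_matrix = matrix_bytes(number_of_points)
live_matrices = 3  # assumption: result plus two temporaries

print(f"One matrix: {one_matrix / 1e9:.2f} GB")  # One matrix: 5.83 GB
print(f"Estimated peak: {live_matrices * one_matrix / 2**30:.1f} GiB")  # Estimated peak: 16.3 GiB
```

Under these assumptions the estimate lands close to the measured 16 GB, and even two such matrices would already exceed the 11 GB of the card.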

## Demo: Numba for CUDA

While using drop-in replacement libraries is the way to go wherever this is feasible, this approach is limited to the algorithms and building blocks provided by these libraries. We need to dive a bit deeper if more flexibility or control is required.

As was already discussed in previous sections of this course, Numba can be used to compile a subset of Python to machine code. Sometimes, adding a Numba annotation is enough to improve the performance considerably. Fortunately, Numba can also compile code for GPUs. In fact, Numba supports both [CUDA and ROCm](https://numba.pydata.org/numba-doc/dev/cuda/index.html). In the following we will focus on CUDA.

Let's try to reproduce the previous example, i.e. compute the Euclidean distance matrix, using Numba for CUDA. Unfortunately, Numba for CUDA does not support many useful features. For instance, the following standard Python features are [missing](https://numba.pydata.org/numba-doc/dev/cuda/cudapysupported.html) as of early 2021:

* Exception handling
* Comprehensions
* Generators

For our use case it's important to note that NumPy arrays are in fact supported, but only with a subset of their features. We can use NumPy arrays to transfer data.

Here's a possible first implementation of the Euclidean distance matrix algorithm for GPUs. Following the programming model required by CUDA (see the section on SIMT above), the idea is to write code that computes a single entry of the final matrix. Recall that this entry contains the squared distance between two points.

$$ \textrm{distance_matrix} = \begin{pmatrix} [\textrm{computed by thread$_{0,0}$}] & \dots & [\textrm{computed by thread$_{0, n}$}] \\ \vdots & \vdots & \vdots \\ [\textrm{computed by thread$_{n, 0}$}] & \dots & [\textrm{computed by thread$_{n, n}$}]\end{pmatrix} $$

The formula for each entry depends on its (row, column)-coordinates, which we obtain from the location of the executing thread in the "CUDA grid".

In [None]:
%pycat demos/demo_numba_gpu.py

Note that we've added a `cuda.jit` decorator to request CUDA compilation. 

Say we have `number_of_points` points that we want to compute the Euclidean distance matrix for. Note that the `distance_matrix_gpu` function obtains the row and the column that it is working on from the location of its executing thread in the "CUDA grid". Therefore we'll launch a two-dimensional square grid with `number_of_points` entries along each dimension, i.e. one thread per matrix entry. The grid is specified when invoking the function (commonly called a "kernel").

For instance, we could invoke it like this:

In [None]:
from demos.demo_numba_gpu import distance_matrix as distance_matrix_numba_gpu

try:
    number_of_points = random_points.shape[0]
    result = np.zeros((number_of_points, number_of_points))
    distance_matrix_numba_gpu[(number_of_points, number_of_points), (1, 1)](
        random_points, result
    )
    print(result[:5, :5])
except Exception as error:
    print(error)

Of course, the code snippet requires CUDA to run. However, debugging on a CPU is possible by setting the [environment variable `NUMBA_ENABLE_CUDASIM=1`](https://numba.pydata.org/numba-doc/dev/cuda/simulator.html).
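The launch above uses `number_of_points` × `number_of_points` blocks with a single thread each (`(1, 1)`). A more common choice is to group several threads into each block and to round the number of blocks up so that the whole matrix is covered. The following plain-Python sketch only does this arithmetic; the helper names and the block size of 16 × 16 are assumptions chosen for illustration.

```python
import math


def launch_configuration(number_of_points, threads_per_block=(16, 16)):
    """Returns (blocks_per_grid, threads_per_block) covering a square matrix."""
    blocks_per_grid = tuple(
        math.ceil(number_of_points / threads) for threads in threads_per_block
    )
    return blocks_per_grid, threads_per_block


def surplus_threads(number_of_points, blocks_per_grid, threads_per_block):
    """Counts launched threads that fall outside the matrix and must be skipped."""
    launched = 1
    for blocks, threads in zip(blocks_per_grid, threads_per_block):
        launched *= blocks * threads
    return launched - number_of_points * number_of_points


for number_of_points in (4096, 1000):
    blocks, threads = launch_configuration(number_of_points)
    print(number_of_points, blocks, surplus_threads(number_of_points, blocks, threads))
```

For 4096 points the grid fits exactly, but for 1000 points some threads fall outside the matrix. A kernel launched this way would therefore have to compute its global position (in Numba, for example via `cuda.grid(2)`) and return early for out-of-range indices. Whether the demo kernel does this is not shown here, so treat this configuration as a suggestion rather than a drop-in change.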


The most important takeaway here is that Numba for CUDA can be used to implement arbitrary algorithms that run on GPUs. However, keep in mind that GPU programming is challenging, and that drop-in replacements often provide better performance. In fact, there are many aspects of our algorithm that can be improved. For instance, we are recomputing `result[col][row]`, which we already know because it's the same as `result[row][col]`. Also, the hardware architecture of GPUs comes with a complex memory hierarchy (e.g. shared memory) that we are ignoring at the moment. This is beyond the scope of this introduction, but the [CUDA C Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/) provides more information if you are interested.

To conclude this section we look at a benchmark of the above algorithm on the cluster. For scale, we are comparing against a sequential (i.e. single-threaded) Numba implementation running on the CPU.

In [None]:
%matplotlib inline
import json

import matplotlib.pyplot as plt

with open("demos/measurements.json") as measurements_file:
    measurements = json.load(measurements_file)

for label in ["cpu", "gpu", "numpy", "cupy"]:
    transform = {
        "cpu": "numba-cpu (1 core)",
        "gpu": "numba-gpu",
        "numpy": "numpy (1 core)",
        "cupy": "cupy",
    }
    plt.plot(
        [int(key) for key in measurements.keys()],
        [value[label] for value in measurements.values()],
        label=transform[label],
    )

plt.xlabel("Number of points")
plt.ylabel("Average runtime [s]")
plt.title("Comparison of Numba Targets")
plt.legend()

plt.ylim(0.0, 0.005)
plt.xlim(10, 800);

For a sufficiently large number of points, the `cupy` based solution outperforms the other solutions. However, `numba` can be used to implement algorithms not directly supported by `cupy`.

## Demo: Training a TensorFlow model on the cluster

To round off our overview of using GPUs to power Python applications, we'll have a look at the optimal case where there exists an established GPU-enabled Python package to perform the desired computations.
More precisely, we'll look at the common use case of training a TensorFlow model.

As mentioned before, the `python_gpu` module comes with TensorFlow and the necessary libraries included. Here is the module load command again:

```bash
$ module load gcc/6.3.0 python_gpu/3.8.5
```

Now we can try to import the TensorFlow package in Python.

In [None]:
try:
    import tensorflow as tf
except Exception as error:
    print(error)

To try it out, we ran an [official TensorFlow example](https://www.tensorflow.org/tutorials/images/cnn) that trains a convolutional neural network on the [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset, once on an AMD EPYC 7742 CPU and once on an RTX 2080 Ti GPU. The CIFAR10 dataset consists of 60,000 images, which were split into 50,000 training images and 10,000 test images. The batch size was fixed to 32 and the number of epochs was 10, which gave a test accuracy of 70%. The performance results were averaged over 10 experiments on each system and configuration.

In [None]:
%matplotlib inline
import json

import matplotlib.pyplot as plt

with open("demos/measurements_tf_cnn_cifar10.json") as measurements_file:
    measurements = json.load(measurements_file)

for value in measurements.values():
    devices = [device for device in value.keys()]
    speed = [value[device] for device in devices]

plt.barh(devices, speed)
plt.title("Training CNN on CIFAR10 dataset")
plt.xlabel("Training speed [images/sec]");

In a convolutional neural network (CNN) model, the inputs of each convolutional layer are multiplied by its weights. These are simple tasks, but a CNN model can have a large number of parameters. In this example, the CNN model has a total of 122,570 parameters. Performing the multiplication operations one after another on a single CPU core is very time consuming; performing them in parallel accelerates the computation. For the training dataset of 50,000 images, the EPYC 7742 CPU took 9.4 minutes on a single thread to train on the images and 1.7 minutes on 32 threads. An Nvidia RTX 2080 Ti GPU, which has 4352 CUDA cores, took only 1 minute to solve the same problem.
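The timings quoted above translate into the following speed-ups relative to a single CPU thread. This is plain arithmetic on the numbers from the text; the dictionary keys are labels chosen for this sketch.

```python
timings_minutes = {"CPU, 1 thread": 9.4, "CPU, 32 threads": 1.7, "GPU": 1.0}

baseline = timings_minutes["CPU, 1 thread"]
for device, minutes in timings_minutes.items():
    print(f"{device}: {baseline / minutes:.1f}x")

# Fraction of the ideal 32-fold speed-up that the 32 CPU threads achieve.
parallel_efficiency = (baseline / timings_minutes["CPU, 32 threads"]) / 32
print(f"Parallel efficiency on 32 threads: {parallel_efficiency:.0%}")
```

The 32 CPU threads reach a speed-up of about 5.5, i.e. roughly 17% of the ideal value, while the single GPU is about 9.4 times faster than one CPU thread and 1.7 times faster than the 32-thread run.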


# Outlook

The compute resources available to a single GPU, i.e. cores and memory, are not infinite. Multi-GPU setups make it possible to run even larger scale computations. There are basically two ways to use more than one GPU at once:

* A single process accesses more than one GPU.
* Multiple processes use GPUs.

The former can be achieved using technology like [NVLink](https://www.nvidia.com/de-de/design-visualization/nvlink-bridges/) and [NCCL](https://developer.nvidia.com/nccl). The latter solution can be implemented using [MPI](https://www.mpi-forum.org) and e.g. domain decomposition (see Section 5).
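For the distance matrix example, a simple form of domain decomposition would assign a contiguous block of rows to each process or GPU. The same splitting could also be used to process a matrix chunk by chunk when it does not fit into the memory of a single GPU. The helper `split_rows` below is a hypothetical sketch of the index bookkeeping only; it does not perform any GPU or MPI calls.

```python
def split_rows(number_of_rows, number_of_chunks):
    """Returns (start, stop) pairs that cover range(number_of_rows) in nearly equal chunks."""
    base, remainder = divmod(number_of_rows, number_of_chunks)
    bounds = []
    start = 0
    for chunk in range(number_of_chunks):
        # The first `remainder` chunks receive one extra row.
        stop = start + base + (1 if chunk < remainder else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


print(split_rows(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Each worker would then compute the rows `start` to `stop - 1` of the distance matrix against all points.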

If you don't need control over the details, there are higher-level packages that can help simplify the process. Some are designed for a specific use case (e.g. [TensorFlow](https://www.tensorflow.org) has [multi-GPU support](https://www.tensorflow.org/guide/gpu)), but there are also more general libraries and frameworks. For instance, [Dask](https://dask.org) has [multi-GPU support](https://docs.dask.org/en/latest/gpu.html). There's also a promising library by Nvidia ([Legate](https://developer.nvidia.com/legate-early-access)) in the works.

... So you might actually rent an entire fleet of double-decker buses to go skiing.

<figure>
   <img src="https://upload.wikimedia.org/wikipedia/commons/4/49/Line-up_of_RT_buses_inside_Barking_bus_garage_%28geograph_6106538%29.jpg" style="max-width: 600px"/>
   <figcaption style="text-align: center">
       <a href="https://commons.wikimedia.org/wiki/File:Line-up_of_RT_buses_inside_Barking_bus_garage_(geograph_6106538).jpg">
           Wikimedia.org
       </a>
       :
       <a href="https://creativecommons.org/licenses/by-sa/2.0/deed.en">
           CC BY-SA 2.0
       </a>
   </figcaption>
</figure>
