## KSETA Topical Courses
### GPU Computing with PyTorch - High Performance Computing at KIT
----

#### Part 1 - Getting to Know PyTorch

PyTorch is a tensor computation library. We begin by importing it and printing the installed version.

In [None]:
import torch
torch.__version__

Let us first create a vector from some user-defined data.

In [None]:
vector = torch.tensor([0.0, 1.0, -1.0, 3.0])
vector

It is also possible to initialize matrices, volumes and higher-order tensors. Below you will find an example of a two-dimensional tensor, i.e. a matrix.

In [None]:
matrix = torch.tensor([
    [1, 2, 3],
    [6, 5, 4]
])
matrix

PyTorch provides several functions to streamline tensor initialization. For example, tensors can be created with uninitialized memory, filled with constant values, or filled with random data.

In [None]:
random_volume = torch.randn(size=(3, 4, 5))
random_volume
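
The other initialization styles mentioned above can be sketched as follows. This is a minimal illustration; the shapes and the fill value are arbitrary choices, not taken from the course material.

```python
import torch

# Uninitialized memory: fast, but the contents are arbitrary until written.
uninitialized = torch.empty(size=(2, 3))

# Constant fills.
zeros = torch.zeros(size=(2, 3))
sevens = torch.full(size=(2, 3), fill_value=7.0)

# The "*_like" variants copy shape, dtype and device from an existing tensor.
ones_like_sevens = torch.ones_like(sevens)

print(uninitialized.shape, zeros.sum().item(), sevens[0, 0].item(), ones_like_sevens.shape)
```

Values read from `torch.empty()` are not meaningful, so such a tensor should be fully overwritten before use.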

Each PyTorch tensor has metadata associated with it that can not only be queried but also modified by various calls. Some of the most commonly used metadata are a tensor's `shape`, i.e. its dimensions, its `dtype`, i.e. the datatype of its elements, and its `device`, i.e. the processing device it is allocated on.

In [None]:
random_volume.shape, random_volume.dtype, random_volume.device

In [None]:
matrix.shape, matrix.dtype, matrix.device

The metadata values can also be manipulated, e.g. by changing the datatype or adjusting the shape. In the following code snippet, we reshape a one-dimensional vector into a two-dimensional matrix.

In [None]:
vector = torch.arange(10)
vector.reshape(5, 2)
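
Changing the datatype works in a similar way. The sketch below assumes the same ten-element vector; the names `as_float` and `matrix` are only illustrative.

```python
import torch

vector = torch.arange(10)               # integer input, so dtype is torch.int64
as_float = vector.to(torch.float32)     # converted tensor; `vector` is unchanged
matrix = vector.reshape(2, -1)          # -1 lets PyTorch infer the size (here 5)

print(vector.dtype, as_float.dtype, matrix.shape)
```

Neither call modifies `vector` itself; both return a new tensor object with the adjusted metadata.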

**Task 1:** create a tensor with the dimensions `(100, 2, 2, 3)` and fill it with uniformly distributed `float64` values. Use PyTorch's `rand()` function for this.

In [None]:
### TODO

----
#### Part 2 - Operations and Equations

PyTorch supports dozens of tensor operations, including transposing, indexing, slicing, mathematical operations, linear algebra, and more. In the following examples we will have a brief look at some of them.

In [None]:
masses = torch.arange(10, dtype=torch.float32)
masses

We can add two vectors of the same length, which results in an element-wise addition of the individual vector elements.

In [None]:
ones = torch.ones(size=(10,))
masses + ones

**Slicing** allows us to select only parts of the data and continue working with them. The indices are zero-based, left-inclusive and right-exclusive.

In [None]:
masses[3:5]
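
A few more slicing forms, sketched on the same `masses` vector. The variable names are illustrative only.

```python
import torch

masses = torch.arange(10, dtype=torch.float32)

first_three = masses[:3]     # elements 0, 1, 2
last_two = masses[-2:]       # negative indices count from the end
every_other = masses[::2]    # step of 2: elements 0, 2, 4, 6, 8

grid = masses.reshape(2, 5)
second_column = grid[:, 1]   # all rows, column 1

print(first_three, last_two, every_other, second_column)
```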

We can also formulate conditions, resulting in a boolean mask, which we can use to index data as demonstrated below.

In [None]:
masses > 7, masses[masses > 7]
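
Conditions can also be combined. The following is a small sketch with arbitrarily chosen thresholds; `torch.where` is shown as one possible way to use a mask without dropping elements.

```python
import torch

masses = torch.arange(10, dtype=torch.float32)

# Combine conditions with & (and), | (or), ~ (not). The parentheses are
# required because these operators bind more tightly than comparisons.
mid_range = (masses > 2) & (masses < 6)
selected = masses[mid_range]

# torch.where picks element-wise between two alternatives based on a mask.
clipped = torch.where(masses > 7, torch.tensor(7.0), masses)

print(selected, clipped)
```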

Vectors, matrices, volumes and so forth can also be combined with **scalars**. In this case the scalar is applied element-wise to each tensor element. Let us calculate the gravitational force at Earth's surface for the previously defined masses.

In [None]:
gravitational_force = masses * 9.81
gravitational_force

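PyTorch automatically repeats operands when their shapes are compatible: comparing the shapes from the last dimension backwards, each pair of dimensions must either be equal or one of them must be one; missing leading dimensions are treated as one. This approach is called **broadcasting**.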

In [None]:
broadcast = torch.ones(size=(3, 10)) + masses
broadcast
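
To make the shape rule concrete, here is a small sketch with hand-picked shapes. The tensors `row`, `column` and `table` are illustrative and not part of the course data.

```python
import torch

row = torch.arange(3, dtype=torch.float32)                    # shape (3,)
column = torch.arange(2, dtype=torch.float32).reshape(2, 1)   # shape (2, 1)

# Shapes are aligned from the right: (2, 1) and (3,) -> (2, 3).
table = column * 10 + row

# Incompatible trailing dimensions (3 vs. 4) cannot be broadcast.
try:
    torch.ones(2, 3) + torch.ones(4)
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True

print(table, broadcast_failed)
```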

PyTorch also provides reduction operations that reduce entire tensors or subsets of them, e.g. columns or rows, to single values. Commonly used reduction operations are `min()`, `max()` and `sum()`. Let us have a look at an example:

In [None]:
broadcast.sum(dim=0)
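
The effect of the `dim` argument is easiest to see on a small tensor. The sketch below uses made-up values; `keepdim=True` is an optional argument that retains the reduced dimension with size one.

```python
import torch

data = torch.tensor([[1.0, 2.0, 3.0],
                     [6.0, 5.0, 4.0]])

total = data.sum()                          # reduce over all elements
column_sums = data.sum(dim=0)               # collapse the rows -> shape (3,)
row_sums = data.sum(dim=1, keepdim=True)    # keep reduced dim -> shape (2, 1)
row_max = data.max(dim=1)                   # named tuple (values, indices)

print(total, column_sums, row_sums, row_max.values, row_max.indices)
```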

Higher-level operations are likewise available, such as computing norms, matrix decompositions, or matrix multiplication.

In [None]:
torch.arange(10) @ torch.arange(10)
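
A brief sketch of the other operations named above, using a hand-written 2x2 matrix. The choice of a QR decomposition is only one example of the decompositions available in `torch.linalg`.

```python
import torch

a = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
v = torch.tensor([1.0, 1.0])

product = a @ v                    # matrix-vector product, shape (2,)
gram = a.T @ a                     # matrix-matrix product, shape (2, 2)
length = torch.linalg.norm(v)      # Euclidean norm of v
q, r = torch.linalg.qr(a)          # QR decomposition, a = q @ r

print(product, gram, length)
```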

**Task 2:** calculate the mean and standard deviation along the first dimension for normally distributed data of dimensions `(100, 3)`.

In [None]:
### TODO

----
#### Part 3 - Using the GPU

PyTorch enables you to leverage GPUs to accelerate computations. Particularly well suited are numerical problems, e.g. linear algebra, that apply identical operations to many data elements. Let us get to know PyTorch's `.cuda` submodule a little. First, we should make sure that PyTorch has properly loaded and initialized the software it uses to interact with GPUs, here CUDA.

In [None]:
torch.cuda.is_available()

Everything seems to be in order: CUDA is available to PyTorch. Let us now check how many and what kind of GPUs we can use:

In [None]:
torch.cuda.device_count(), torch.cuda.get_device_name()

Let us now create a vector of data, which initially resides on the CPU.

In [None]:
m = torch.arange(10)
m

Let us manually move the data to the GPU now.

In [None]:
m_gpu = m.cuda()
m_gpu

As in our previous usage of PyTorch, we can work with the exact same interface; the computation now runs on the GPU.

In [None]:
m_gpu.sum()

Library calls that require the data to reside in the CPU's main memory, e.g. printing values, copy the data back automatically. We can also do so manually.

In [None]:
m_gpu.cpu().device

PyTorch offers several other ways of initializing data directly on the GPU. Below you will find the most common approaches.

In [None]:
torch.arange(2, device='cuda')

In [None]:
# legacy typed constructor; newer code usually passes device='cuda' instead
torch.cuda.FloatTensor([1.0, 2.0])

In [None]:
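# torch.set_default_tensor_type() is deprecated in recent PyTorch releases;
# torch.set_default_device() (PyTorch >= 2.0) is the current equivalent
torch.set_default_device('cuda')
torch.randn(5).device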

All operands of an operation must reside on the same device. Mixing CPU and GPU tensors is not possible and raises a `RuntimeError`.

In [None]:
torch.arange(10, device='cuda') + torch.arange(10, device='cpu')
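
One common way to avoid such errors is to pick a device once and place every tensor on it. The following is a minimal sketch of that pattern; the CPU fallback is an addition for machines without a GPU and is not part of the original notebook.

```python
import torch

# Fall back to the CPU when no GPU is visible.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.arange(10, device=device)    # created directly on the chosen device
b = torch.arange(10).to(device)        # moved there; no copy if already on it

result = a + b                         # both operands share a device
print(result.device, result)
```

Code written this way runs unchanged on both CPU-only and GPU machines.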

**Task 3:** we can now put all introduced elements together to perform a meaningful computation. In the following example, we load particle decays, simulated using the `phasespace` Python package, from disk and then use PyTorch to compute the *thrust* of each event. The thrust is defined as:

$$T=\max\limits_{\vec{n}}\frac{\sum_j |\vec{p}_j\cdot\vec{n}|}{\sum_j |\vec{p}_j|}$$

where $\vec{p}_j$ are the particles' momenta and $\vec{n}$ is a vector with norm 1. The vector $\vec{n}_T$ that maximizes the thrust is called the *thrust axis*. A thrust of $T\approx\frac{1}{2}$ implies a spherical momentum distribution, whereas $T\approx 1$ indicates strong jets.

Let us download the data first.

In [None]:
import util
decays = util.download_data()
decays.shape, decays.device

In this example we have fifty million events, each consisting of six final-state particles with their x-, y- and z-momenta. Following the formula above, we will first compute their thrust on the CPU.

In [None]:
def compute_thrust(events):
    # simplified candidate estimation for n as average of particles
    ### TODO
    # normalize n to be a unit vector
    n_norms = torch.linalg.norm(n, dim=1, keepdim=True)
    ### TODO
    
    # calculate both fraction components; the formula sums |p_j . n|,
    # so take the absolute value of each projection before summing
    numerator = torch.bmm(events, n.unsqueeze(dim=2)).abs().sum(dim=(1, 2))
    denominator = ### TODO
    
    # calculate thrust
    thrust = numerator / denominator
    
    return thrust
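
The batched matrix product in `compute_thrust` can be easier to follow on toy data. The sketch below uses two made-up events with three particles each and hand-picked unit directions, only to show how the shapes line up; it does not fill in the exercise.

```python
import torch

# Toy data: 2 events, 3 particles each, 3 momentum components (made-up values).
events = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, -3.0]],
                       [[1.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]])
# One direction per event (already unit length here).
n = torch.tensor([[0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])

# (2, 3, 3) @ (2, 3, 1) -> (2, 3, 1): one dot product p_j . n per particle.
projections = torch.bmm(events, n.unsqueeze(dim=2))

# The same values via broadcasting, for comparison.
reference = (events * n.unsqueeze(dim=1)).sum(dim=2, keepdim=True)

print(projections.shape, torch.allclose(projections, reference))
```

`torch.bmm` multiplies the matrices of each batch entry independently, so every event is projected onto its own direction vector.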

In [None]:
%%time
thrust = compute_thrust(decays)
thrust.min(), thrust.max()

In [None]:
%%time
thrust = compute_thrust(decays.cuda())
thrust.min(), thrust.max()