# 1 Preliminaries

This notebook introduces the basics needed to understand Universal Physics Transformers (UPT). It covers:
- Sparse tensors
- Architecture Overview
- Perceiver Pooling
- Perceiver Decoder

## 1.1 Sparse tensors

When handling data from irregular grids, sparse tensors are commonly used because they allow for more efficient
computations than their dense counterparts.

What is a dense tensor? Consider the example of a pointcloud. Let's say we have a set of 5 pointclouds with
10, 100, 1000, 10000 and 100000 points respectively (each point has 3 coordinates).
In order to train a neural network on them, we need to convert them into a PyTorch tensor so that multiple pointclouds
can be processed at once.

A dense representation would produce a 3D tensor `(batch_size, max_num_points, 3)`, where every pointcloud is padded to
`max_num_points`, the largest number of points of any pointcloud in the batch. Here this gives a `(5, 100000, 3)` tensor
which consists mostly of padded values: our 5 pointclouds contain 111,110 points in total, but the dense representation
has 500,000 slots, so 388,890 of them are padding.
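
The padding overhead can be checked with a few lines of arithmetic. The sketch below only uses the five pointcloud sizes from the example; the variable names are chosen for illustration.

```python
# Sizes of the five example pointclouds from the text.
num_points = [10, 100, 1000, 10000, 100000]

batch_size = len(num_points)
max_num_points = max(num_points)

real_points = sum(num_points)               # points that carry actual data
dense_slots = batch_size * max_num_points   # point slots in the padded tensor
padded_slots = dense_slots - real_points    # slots that hold only padding

print(f"real points:   {real_points}")
print(f"dense slots:   {dense_slots}")
print(f"padded slots:  {padded_slots}")
print(f"padding share: {padded_slots / dense_slots:.1%}")
```

In this example, more than three quarters of the dense tensor would be padding.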

A sparse representation does not require this overhead. It simply produces a 2D tensor whose first dimension
"squashes" the `batch_size` and `num_points` dimensions into one, i.e. a tensor of shape `(111110, 3)`.
This is much smaller than the dense representation and needs no padding.
But which point belongs to which pointcloud? To preserve this information, an additional tensor is created
that stores, for each row of the sparse tensor, the index of the pointcloud it belongs to. This tensor is commonly called `batch_idx`.

In [None]:
# create pointclouds
import torch
point_clouds = [
    torch.randn(10, 3),
    torch.randn(100, 3),
    torch.randn(1000, 3),
    torch.randn(10000, 3),
    torch.randn(100000, 3),
]

In [None]:
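# make a dense tensor representation of all point clouds
from torch.nn.utils.rnn import pad_sequence
# batch_first=True gives (batch_size, max_num_points, 3); the default would give (max_num_points, batch_size, 3)
dense = pad_sequence(point_clouds, batch_first=True)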
print(f"dense.shape: {dense.shape}")

In [None]:
# make a sparse tensor representation of all point clouds
sparse = torch.concat(point_clouds)
print(f"sparse.shape: {sparse.shape}")

In [None]:
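# create batch_idx tensor to assign indices of the sparse tensor to indices of the pointcloud
# (a nested Python list would be ragged here, so build one index tensor per pointcloud and concatenate them)
batch_idx = torch.cat([
    torch.full((len(point_cloud),), i, dtype=torch.long)
    for i, point_cloud in enumerate(point_clouds)
])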
print(f"batch_idx.shape: {batch_idx.shape}")
print(f"the first 10 samples belong to the first pointcloud (i.e. point_clouds[0]): {batch_idx[:10]}")
print(f"the 11th point in the sparse tensor belongs to the second pointcloud (i.e. point_clouds[1]): {batch_idx[10]}")
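
Once a `batch_idx` tensor exists, it can be used to look things up per pointcloud. The following is a minimal, self-contained sketch with three small pointclouds (the sizes 2, 4 and 3 are chosen only for illustration): it counts the points per pointcloud and recovers one pointcloud from the sparse tensor.

```python
import torch

# Three small pointclouds (sizes chosen for illustration only).
point_clouds = [torch.randn(n, 3) for n in (2, 4, 3)]
sparse = torch.cat(point_clouds)

# batch_idx repeats the pointcloud index once per point: [0, 0, 1, 1, 1, 1, 2, 2, 2]
batch_idx = torch.repeat_interleave(
    torch.arange(len(point_clouds)),
    torch.tensor([len(pc) for pc in point_clouds]),
)

# Number of points per pointcloud, recovered from batch_idx alone.
counts = torch.bincount(batch_idx, minlength=len(point_clouds))
print(f"counts: {counts}")

# Select all points of the second pointcloud with a boolean mask.
second = sparse[batch_idx == 1]
print(f"second.shape: {second.shape}")
print(f"matches point_clouds[1]: {torch.equal(second, point_clouds[1])}")
```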

Dense tensors are very nice for things like regular grid data (images, videos, ...) but for data from irregular grids
sparse tensors are needed for efficient processing.

In UPTs, the input and output are both sparse tensors but within the model, the tensors are converted from sparse
to dense and then back to sparse. That is because message passing layers use a sparse representation and 
transformers use a dense representation.
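
To make the sparse to dense to sparse round trip concrete, here is a deliberately simplified sketch. It assumes mean pooling per pointcloud as the sparse to dense step and a plain gather as the dense to sparse step. This is a stand-in chosen for brevity and not the pooling used in UPT itself; all names and sizes are illustrative.

```python
import torch

# Sparse input: 3 pointclouds with 2, 4 and 3 points, 8 features per point.
num_points = torch.tensor([2, 4, 3])
batch_size, dim = len(num_points), 8
sparse = torch.randn(int(num_points.sum()), dim)
batch_idx = torch.repeat_interleave(torch.arange(batch_size), num_points)

# Sparse -> dense: sum the features of each pointcloud into its own row,
# then divide by the number of points to obtain a per-pointcloud mean.
summed = torch.zeros(batch_size, dim).index_add_(0, batch_idx, sparse)
dense = summed / num_points.unsqueeze(1)
print(f"dense.shape: {dense.shape}")  # one row per pointcloud

# Dense -> sparse: every point receives the row of the pointcloud it belongs to.
back_to_sparse = dense[batch_idx]
print(f"back_to_sparse.shape: {back_to_sparse.shape}")  # one row per point
```

The dense tensor has a fixed size per sample, which is what a transformer expects, while the gather with `batch_idx` restores one row per point.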

## 1.2 Architecture Overview

First, let's have a more detailed look at the architecture. In this tutorial, we will not consider inverse
encoding/decoding (which enables the latent rollout) as it adds a lot of complexity to the model.

We have a simple transformer-based model which we conceptually split into 3 components: encoder, approximator and decoder.

The **encoder** processes input features (e.g. velocities, pressure, ...) and input positions at timestep $t$ and
encodes them into a latent representation $\text{latent}_t$. Input features and input positions

![title](schematics/architecture.svg)


In [None]:
# 1.3 Perceiver Pooling

