# einops.pack and einops.unpack

https://github.com/arogozhnikov/einops/blob/master/docs/4-pack-and-unpack.ipynb

einops 0.6 introduces two more functions to the family: `pack` and `unpack`.

Here is what they do:

- `unpack` reverses `pack`
- `pack` reverses `unpack`

Enlightened with this exhaustive description, let's move to examples.



In [1]:
# we'll use numpy for demo purposes
# operations work the same way with other frameworks
import numpy as np
from einops import pack, unpack

## Stacking data layers

Assume we have an RGB image along with a corresponding depth image that we want to stack:

In [2]:
h, w = 100, 200
# image_rgb is 3-dimensional (h, w, 3) and depth is 2-dimensional (h, w)
image_rgb = np.random.random([h, w, 3])
image_depth = np.random.random([h, w])
image_rgb.shape, image_depth.shape

((100, 200, 3), (100, 200))

In [3]:
# but we can stack them
image_rgbd, ps = pack([image_rgb, image_depth], "h w *")

## How to read packing patterns

The pattern `h w *` means that
- the output is 3-dimensional
- the first two axes (`h` and `w`) are shared across all inputs and also with the output
- the inputs, however, do not have to be 3-dimensional. They can be 2-dim, 3-dim, 4-dim, etc. <br/>
  Regardless of the inputs' dimensionality, they will all be packed into a 3-dim output, and information about how they were packed is stored in `ps`

In [4]:
# as you see, pack properly appended depth as one more layer
# and correctly aligned axes!
# this won't work off the shelf with np.concatenate or torch.cat or alike
image_rgb.shape, image_depth.shape, image_rgbd.shape

((100, 200, 3), (100, 200), (100, 200, 4))

In [5]:
# now let's see what PS keeps.
# PS means Packed Shapes, not PlayStation or Post Script
ps

[(3,), ()]

This reads: the first tensor had shape `h, w, 3`, while the second tensor had shape `h, w`.
That's just enough to reverse packing:

In [6]:
# remove 1-axis in depth image during unpacking. Results are (h, w, 3) and (h, w)
unpacked_rgb, unpacked_depth = unpack(image_rgbd, ps, "h w *")
unpacked_rgb.shape, unpacked_depth.shape

((100, 200, 3), (100, 200))

We can also unpack the tensor manually, in different ways:

In [7]:
# simple unpack by splitting the axis. Results are (h, w, 3) and (h, w, 1)
rgb, depth = unpack(image_rgbd, [[3], [1]], "h w *")
rgb.shape, depth.shape

((100, 200, 3), (100, 200, 1))

In [8]:
# different split, both outputs have shape (h, w, 2)
rg, bd = unpack(image_rgbd, [[2], [2]], "h w *")
rg.shape, bd.shape

((100, 200, 2), (100, 200, 2))

In [9]:
# unpack to 4 tensors of shape (h, w). More like 'unstack over last axis'
[r, g, b, d] = unpack(image_rgbd, [[], [], [], []], "h w *")
r.shape, g.shape, b.shape, d.shape

((100, 200), (100, 200), (100, 200), (100, 200))

### Handling a single array

In [10]:
h, w = 100, 200
# image_rgb is 3-dimensional (h, w, 3)
image_rgb = np.random.random([h, w, 3])

In [14]:
# pack a single array (passed as a one-element list) to add a leading batch axis
image_bhwc, ps = pack([image_rgb], "* h w c")
print(image_bhwc.shape)
print(ps)

print(unpack(image_bhwc, ps, "* h w c")[0].shape)

(1, 100, 200, 3)
[()]
(100, 200, 3)


### Short summary so far

- `einops.pack` is a 'more generic concatenation' (that can stack too)
- `einops.unpack` is a 'more generic split'

And, of course, `einops` functions are more verbose, and *reversing* concatenation is now *dead simple*.

Compared to other `einops` functions, `pack` and `unpack` have a compact pattern without an arrow, and the same pattern can be used in both `pack` and `unpack`. These patterns are very simple: just a sequence of space-separated axis names.
Exactly one axis is `*`; all other axes are valid identifiers.

Now let's discuss some practical cases

## Auto-batching

ML models by default accept batches: a batch of images, a batch of sentences, a batch of audio clips, etc.

During debugging or inference, however, it is common to pass a single image instead (and thus output should be a single prediction) <br />
In this example we'll write `universal_predict` that can handle both cases.

In [23]:
from einops import reduce


def image_classifier(images_bhwc):
    # mock for image classifier
    predictions = reduce(images_bhwc, "b h w c -> b c", "mean", h=h, w=w, c=3)
    return predictions


def universal_predict(x):
    x_packed, ps = pack([x], "* h w c")  # pack any leading axes into one: [b, h, w, c]
    print(x_packed.shape, ps)
    predictions_packed = image_classifier(x_packed)
    [predictions] = unpack(predictions_packed, ps, "* cls")  # revert shape
    return predictions

In [24]:
# works with a single image
print(universal_predict(np.zeros([h, w, 3])).shape)
# works with a batch of images
batch = 5
print(universal_predict(np.zeros([batch, h, w, 3])).shape)
# or even a batch of videos
n_frames = 7
print(universal_predict(np.zeros([batch, n_frames, h, w, 3])).shape)

(1, 100, 200, 3) [()]
(3,)
(5, 100, 200, 3) [(5,)]
(5, 3)
(35, 100, 200, 3) [(5, 7)]
(5, 7, 3)


**What we can learn from this example**:

- `pack` and `unpack` play nicely together. That's not a coincidence :)
- patterns in `pack` and `unpack` may differ, and that's quite common for applications
- unlike other operations in `einops`, `(un)pack` does not provide arbitrary reordering of axes

## Class token in ViT

Let's assume we have a simple transformer model that works with `BTC`-shaped tensors.

In [25]:
def transformer_mock(x_btc):
    # imagine this is a transformer model, a very efficient one
    assert len(x_btc.shape) == 3
    return x_btc

Let's implement a vision transformer (ViT) with a class token (i.e. a static token whose corresponding output is used to classify an image)

In [10]:
# below it is assumed that you already
# 1) split batch of images into patches 2) applied linear projection and 3) used positional embedding.

# We'll skip that here. But hey, here is an einops-style way of doing all of that in a single shot!
# from einops.layers.torch import EinMix
# patcher_and_posembedder = EinMix('b (h h2) (w w2) c -> b h w c_out', weight_shape='h2 w2 c c_out',
#                                  bias_shape='h w c_out', h2=..., w2=...)
# patch_tokens_bhwc = patcher_and_posembedder(images_bhwc)

In [26]:
# preparations
batch, height, width, c = 6, 16, 16, 256
patch_tokens = np.random.random([batch, height, width, c])
class_tokens = np.zeros([batch, c])

In [29]:
def vit_einops(class_tokens, patch_tokens):
    input_packed, ps = pack([class_tokens, patch_tokens], "b * c")
    print(input_packed.shape, ps)
    output_packed = transformer_mock(input_packed)
    return unpack(output_packed, ps, "b * c_out")


class_token_emb, patch_tokens_emb = vit_einops(class_tokens, patch_tokens)

class_token_emb.shape, patch_tokens_emb.shape

(6, 257, 256) [(), (16, 16)]


((6, 256), (6, 16, 16, 256))

At this point, let's pause briefly and appreciate the conveniences of this pipeline by contrasting it with more 'standard' code

In [31]:
def vit_vanilla(class_tokens, patch_tokens):
    b, h, w, c = patch_tokens.shape
    class_tokens_b1c = class_tokens[:, None, :]  # [b, c] -> [b, 1, c]
    patch_tokens_btc = np.reshape(
        patch_tokens, [b, -1, c]
    )  # [b, h, w, c] -> [b, h*w, c]
    input_packed = np.concatenate(
        [class_tokens_b1c, patch_tokens_btc], axis=1
    )  # [b, 1, c] cat [b, h*w, c] = [b, 1+h*w, c]
    output_packed = transformer_mock(input_packed)
    class_token_emb = np.squeeze(
        output_packed[:, :1, :], 1
    )  # [b, 1+h*w, c] get [b, 1, c] -> [b, c]
    patch_tokens_emb = np.reshape(
        output_packed[:, 1:, :], [b, h, w, -1]
    )  # [b, 1+h*w, c] get [b, h*w, c] -> [b, h, w, c]
    return class_token_emb, patch_tokens_emb


class_token_emb2, patch_tokens_emb2 = vit_vanilla(class_tokens, patch_tokens)
assert np.allclose(class_token_emb, class_token_emb2)
assert np.allclose(patch_tokens_emb, patch_tokens_emb2)

Notably, we have put all packing and unpacking, reshapes, adding and removing of dummy axes into a couple of lines.

## Packing different modalities together

We can extend the previous example: it is quite common to mix elements of different types of inputs in transformers.

The simplest one is to mix tokens from all inputs:

```python
all_inputs = [text_tokens_btc, image_bhwc, task_token_bc, static_tokens_bnc]
inputs_packed, ps = pack(all_inputs, 'b * c')
```

and you can `unpack` the resulting tokens back into the same structure.

## Packing data coming from different sources together

The most notable example is, of course, GANs:

```python
input_ims, ps = pack([true_images, fake_images], '* h w c')
true_pred, fake_pred = unpack(model(input_ims), ps, '* c')
```
`true_pred` and `fake_pred` are handled differently, which is why we separate them.

## Predicting multiple outputs at the same time

It is quite common to pack predictions of multiple target values into a single layer.

This is more efficient, but the code is less readable. For example, this is what detection code may look like:

In [14]:
def loss_detection(model_output_bhwc, mask_h: int, mask_w: int, n_classes: int):
    output = model_output_bhwc

    confidence = output[..., 0].sigmoid()
    bbox_x_shift = output[..., 1].sigmoid()
    bbox_y_shift = output[..., 2].sigmoid()
    bbox_w = output[..., 3]
    bbox_h = output[..., 4]
    mask_logits = output[..., 5 : 5 + mask_h * mask_w]
    mask_logits = mask_logits.reshape([*mask_logits.shape[:-1], mask_h, mask_w])
    class_logits = output[..., 5 + mask_h * mask_w :]
    assert class_logits.shape[-1] == n_classes, class_logits.shape[-1]

    # downstream computations
    return (
        confidence,
        bbox_x_shift,
        bbox_y_shift,
        bbox_h,
        bbox_w,
        mask_logits,
        class_logits,
    )

When the same logic is implemented in einops, there is no need to memorize offsets. <br />
Additionally, reshapes and shape checks are automatic:

In [15]:
def loss_detection_einops(model_output, mask_h: int, mask_w: int, n_classes: int):
    (
        confidence,
        bbox_x_shift,
        bbox_y_shift,
        bbox_w,
        bbox_h,
        mask_logits,
        class_logits,
    ) = unpack(model_output, [[]] * 5 + [[mask_h, mask_w], [n_classes]], "b h w *")

    confidence = confidence.sigmoid()
    bbox_x_shift = bbox_x_shift.sigmoid()
    bbox_y_shift = bbox_y_shift.sigmoid()

    # downstream computations
    return (
        confidence,
        bbox_x_shift,
        bbox_y_shift,
        bbox_h,
        bbox_w,
        mask_logits,
        class_logits,
    )

In [16]:
# check that results are identical
import torch

dims = dict(mask_h=6, mask_w=8, n_classes=19)
model_output = torch.randn(
    [3, 5, 7, 5 + dims["mask_h"] * dims["mask_w"] + dims["n_classes"]]
)
for a, b in zip(
    loss_detection(model_output, **dims), loss_detection_einops(model_output, **dims)
):
    assert torch.allclose(a, b)

Or maybe **reinforcement learning** is closer to your interests?

If so, predicting multiple outputs is valuable there too:

```python
action_logits, reward_expectation, q_values, expected_entropy_after_action = \
    unpack(predictions_btc, [[n_actions], [], [n_actions], [n_actions]], 'b step *')
```


## That's all for today!

Happy packing and unpacking!