# Named Tensor Notation

## Introduction

**This is a translation of the [Named Tensor Notation (Chiang, Rush, Barak 2021)](https://namedtensor.github.io/) example from the [funsor](https://funsor.pyro.ai/) library to `effectful`. Much of the expository text is taken directly from the [original](https://github.com/pyro-ppl/funsor/blob/master/tutorials/named_tensor_notation_i.ipynb).**

The mathematical notation with *named axes* introduced in [Named Tensor Notation (Chiang, Rush, Barak 2021)](https://namedtensor.github.io/) improves the readability of mathematical formulas involving multidimensional arrays. This includes tensor operations such as elementwise operations, reductions, contractions, renaming, indexing, and broadcasting. Part 1 covers examples from [2 Informal Overview](https://namedtensor.github.io/#sec:overview), [3.4.2 Advanced Indexing](https://namedtensor.github.io/#sec:examples), and [5 Formal Definitions](https://namedtensor.github.io/#sec:definitions).

In [1]:
import functools

import torch
from torch import tensor

from effectful.ops.syntax import Bound, defop
from effectful.ops.semantics import evaluate, fvsof, handler
from effectful.handlers.torch import Indexable, sizesof, to_tensor


def subst(term, substs):
    with handler(
        {k: functools.partial(lambda vv: vv, v) for (k, v) in substs.items()},
    ):
        return evaluate(term)


def reduce(indexes, indexed_tensor, reducer):
    """Reduce an indexed tensor along one or more named dimensions.

    Args:
    - indexes: Names of dimensions to reduce.
    - indexed_tensor: The tensor to reduce.
    - reducer: A reduction function like `torch.sum`. Must take `tensor`, `dim`, and `keepdim` arguments.

    Returns: A new indexed tensor with the specified dimensions reduced.

    Example:
    >>> width, height = defop(int, name='width'), defop(int, name='height')
    >>> t = indexed(torch.ones(2, 3))[width(), height()]
    >>> reduce([width], t, "sum")
    indexed(tensor([2., 2., 2.]))[height()]
    """
    fvars = fvsof(indexed_tensor)
    indexes = [i for i in indexes if i in fvars]

    # convert indexed dimensions to positional and flatten all new positional dims
    t = to_tensor(indexed_tensor, indexes)
    t_flat = torch.flatten(t, 0, len(indexes) - 1)

    # reduce dim 0 into the first index of dim 0, then return reduction
    return reducer(t_flat, 0, keepdim=True)[0]


def gensyms(*names, type_=int):
    return tuple(defop(int, name=n) for n in names)

## Named Tensors

Each tensor axis is given a name:

$$
\begin{aligned}
  A &\in \mathbb{R}^{\mathsf{\vphantom{fg}height}[3] \times \mathsf{\vphantom{fg}width}[3]} = \mathbb{R}^{\mathsf{\vphantom{fg}width}[3] \times \mathsf{\vphantom{fg}height}[3]} \\
  A &= \mathsf{\vphantom{fg}height}
  \begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}\\\begin{bmatrix}
    3 & 1 & 4 \\
    1 & 5 & 9 \\
    2 & 6 & 5
  \end{bmatrix}\end{array} =
  \mathsf{\vphantom{fg}width}
  \begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}height}\\\begin{bmatrix}
    3 & 1 & 2 \\
    1 & 5 & 6 \\
    4 & 9 & 5
  \end{bmatrix}\end{array}.
\end{aligned}
$$

In [2]:
height, width = defop(int, name="height"), defop(int, name="width")
t = tensor([[3, 1, 4], [1, 5, 9], [2, 6, 5]])
A = Indexable(tensor([[3, 1, 4], [1, 5, 9], [2, 6, 5]]))[height(), width()]
A

Indexable(tensor([[3, 1, 4],
                  [1, 5, 9],
                  [2, 6, 5]]))[height(, ), width(, )]

Access elements of $A$ using named indices:

$$
A_{\mathsf{\vphantom{fg}height}(1), \mathsf{\vphantom{fg}width}(3)} = A_{\mathsf{\vphantom{fg}width}(3), \mathsf{\vphantom{fg}height}(1)} = 4
$$

In [3]:
subst(A, {height: 0, width: 2})

tensor(4)

Partial indexing:

$$
\begin{aligned}
A_{\mathsf{\vphantom{fg}height}(1)} &= \begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}\\
\begin{bmatrix}
  3 & 1 & 4
\end{bmatrix}\end{array}
&
A_{\mathsf{\vphantom{fg}width}(3)} &= \begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}height}\\
\begin{bmatrix}
  4 & 9 & 5
\end{bmatrix}\end{array}.
\end{aligned}
$$

In [4]:
subst(A, {height: 0})

Indexable(tensor([3, 1, 4]))[width(, )]

In [5]:
subst(A, {width: 2})

Indexable(tensor([4, 9, 5]))[height(, )]

## Named Tensor Operations

### Elementwise Operations and Broadcasting

Elementwise operations:

$$
\frac1{1+\exp(-A)} = \mathsf{\vphantom{fg}height}
\begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}\\
\begin{bmatrix}
  \frac 1{1+\exp(-3)} & \frac 1{1+\exp(-1)} & \frac 1{1+\exp(-4)} \\[1ex]
  \frac 1{1+\exp(-1)} & \frac 1{1+\exp(-5)} & \frac 1{1+\exp(-9)} \\[1ex]
  \frac 1{1+\exp(-2)} & \frac 1{1+\exp(-6)} & \frac 1{1+\exp(-5)}
\end{bmatrix}\end{array}.
$$

In [6]:
1 / (1 + (-A).exp())

Indexable(tensor([[0.9526, 0.7311, 0.9820],
                  [0.7311, 0.9933, 0.9999],
                  [0.8808, 0.9975, 0.9933]]))[height(, ), width(, )]

Tensors with different shapes are automatically broadcasted against each other before an operation is applied. Let

$$
\begin{aligned}
  x &\in \mathbb{R}^{\mathsf{\vphantom{fg}height}[3]} & y &\in \mathbb{R}^{\mathsf{\vphantom{fg}width}[3]} \\
  x &= \mathsf{\vphantom{fg}height}
  \begin{array}[b]{@{}c@{}}\\
  \begin{bmatrix}
    2 \\ 7 \\ 1
  \end{bmatrix}\end{array} & 
  y &= 
  \begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}\\\begin{bmatrix}
    1 & 4 & 1
  \end{bmatrix}\end{array}.
\end{aligned}
$$

In [7]:
x = Indexable(tensor([2, 7, 1]))[height()]
y = Indexable(tensor([1, 4, 1]))[width()]

Binary addition operation:

$$
\begin{aligned}
A + x &= \mathsf{\vphantom{fg}height}
\begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}\\\begin{bmatrix}
  3+2 & 1+2 & 4+2 \\
  1+7 & 5+7 & 9+7 \\
  2+1 & 6+1 & 5+1
\end{bmatrix}\end{array} &
A + y &= \mathsf{\vphantom{fg}height}
\begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}\\\begin{bmatrix}
  3+1 & 1+4 & 4+1 \\
  1+1 & 5+4 & 9+1 \\
  2+1 & 6+4 & 5+1
\end{bmatrix}\end{array}.
\end{aligned}
$$

In [8]:
A + x

Indexable(tensor([[ 5,  3,  6],
                  [ 8, 12, 16],
                  [ 3,  7,  6]]))[height(, ), width(, )]

In [9]:
A + y

Indexable(tensor([[ 4,  5,  5],
                  [ 2,  9, 10],
                  [ 3, 10,  6]]))[height(, ), width(, )]

Binary multiplication operation:

$$
A \odot x = \mathsf{\vphantom{fg}height}
\begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}\\\begin{bmatrix}
  3\cdot2 & 1\cdot2 & 4\cdot2 \\
  1\cdot7 & 5\cdot7 & 9\cdot7 \\
  2\cdot1 & 6\cdot1 & 5\cdot1
\end{bmatrix}\end{array}
$$

In [10]:
A * x

Indexable(tensor([[ 6,  2,  8],
                  [ 7, 35, 63],
                  [ 2,  6,  5]]))[height(, ), width(, )]

Binary maximum operation:

$$
\max(A, y) = \mathsf{\vphantom{fg}height}
\begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}\\\begin{bmatrix}
  \max(3, 1) & \max(1, 4) & \max(4, 1) \\
  \max(1, 1) & \max(5, 4) & \max(9, 1) \\
  \max(2, 1) & \max(6, 4) & \max(5, 1)
\end{bmatrix}\end{array}.
$$

In [11]:
torch.max(A, y)

Indexable(tensor([[3, 4, 4],
                  [1, 5, 9],
                  [2, 6, 5]]))[height(, ), width(, )]

### Reductions

Named axes can be reduced over by calling the `.reduce` method and specifying the [reduction operator](https://en.wikipedia.org/wiki/Reduction_Operator) and names of reduced axes. Note that reduction is defined only for operators that are associative and commutative.

$$
\sum\limits_{\substack{\mathsf{\vphantom{fg}height}}} A = \sum_i A_{\mathsf{\vphantom{fg}height}(i)} = \begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}\\
\begin{bmatrix}
  3+1+2 & 1+5+6 & 4+9+5
\end{bmatrix}\end{array}.
$$

In [12]:
reduce([height], A, torch.sum)

Indexable(tensor([ 6, 12, 18]))[width(, )]

$$
\sum\limits_{\substack{\mathsf{\vphantom{fg}width}}} A = \sum_j A_{\mathsf{\vphantom{fg}width}(j)} = \begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}height}\\
\begin{bmatrix}
  3+1+4 & 1+5+9 & 2+6+5
\end{bmatrix}\end{array}.
$$

In [13]:
reduce([width], A, torch.sum)

Indexable(tensor([ 8, 15, 13]))[height(, )]

Reduction over multiple axes:

$$
\sum\limits_{\substack{\mathsf{\vphantom{fg}height}\\
 \mathsf{\vphantom{fg}width}}} A = \sum_i \sum_j A_{\mathsf{\vphantom{fg}height}(i),\mathsf{\vphantom{fg}width}(j)} = 3+1+4+1+5+9+2+6+5.
 $$

In [14]:
reduce([height, width], A, torch.sum)

tensor(36)

Multiplication reduction:

$$
\prod\limits_{\substack{\mathsf{\vphantom{fg}height}}} A = \prod_i A_{\mathsf{\vphantom{fg}height}(i)} = \begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}\\
\begin{bmatrix}
  3\cdot1\cdot2 & 1\cdot5\cdot6 & 4\cdot9\cdot5
\end{bmatrix}\end{array}.
$$

In [15]:
reduce([height], A, torch.prod)

Indexable(tensor([  6,  30, 180]))[width(, )]

Max reduction:

$$
\max\limits_{\substack{\mathsf{\vphantom{fg}height}}} A = \max \{A_{\mathsf{\vphantom{fg}height}(i)} \mid 1 \leq i \leq n\} = \begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}\\
\begin{bmatrix}
  \max(3, 1, 2) & \max(1, 5, 6) & \max(4, 9, 5)
\end{bmatrix}\end{array}.
$$

In [16]:
reduce([height], A, torch.amax)

Indexable(tensor([3, 6, 9]))[width(, )]

### Contraction

Contraction operation can be written as elementwise multiplication followed by summation over an axis:

$$
A \mathbin{\underset{\substack{\mathsf{\vphantom{fg}width}}}{\vphantom{fg}\odot}} y = \sum_j A_{\mathsf{\vphantom{fg}width}(j)} \, y_{\mathsf{\vphantom{fg}width}(j)} = \mathsf{\vphantom{fg}height}
\begin{array}[b]{@{}c@{}}\\\begin{bmatrix}
  3\cdot 1 + 1\cdot 4 + 4\cdot 1 \\
  1\cdot 1 + 5\cdot 4 + 9\cdot 1 \\
  2\cdot 1 + 6\cdot 4 + 5\cdot 1
\end{bmatrix}\end{array}.
$$

In [17]:
reduce([width], A * y, torch.sum)

Indexable(tensor([11, 30, 31]))[height(, )]

Some other operations from linear algebra:

$$
x \mathbin{\underset{\substack{\mathsf{\vphantom{fg}height}}}{\vphantom{fg}\odot}} x = \sum_i x_{\mathsf{\vphantom{fg}height}(i)} \, x_{\mathsf{\vphantom{fg}height}(i)} \qquad \text{inner product}
$$

In [18]:
reduce([height], x * x, torch.sum)

tensor(54)

$$
[x \odot y]_{\mathsf{\vphantom{fg}height}(i), \mathsf{\vphantom{fg}width}(j)} = x_{\mathsf{\vphantom{fg}height}(i)} \, y_{\mathsf{\vphantom{fg}width}(j)} \qquad \text{outer product}
$$

In [19]:
x * y

Indexable(tensor([[ 2,  8,  2],
                  [ 7, 28,  7],
                  [ 1,  4,  1]]))[height(, ), width(, )]

$$
A \mathbin{\underset{\substack{\mathsf{\vphantom{fg}width}}}{\vphantom{fg}\odot}} y = \sum_i A_{\mathsf{\vphantom{fg}width}(i)} \, y_{\mathsf{\vphantom{fg}width}(i)} \qquad \text{matrix-vector product}
$$

In [20]:
reduce([width], A * y, torch.sum)

Indexable(tensor([11, 30, 31]))[height(, )]

$$
x \mathbin{\underset{\substack{\mathsf{\vphantom{fg}height}}}{\vphantom{fg}\odot}} A = \sum_i x_{\mathsf{\vphantom{fg}height}(i)} \, A_{\mathsf{\vphantom{fg}height}(i)} \qquad \text{vector-matrix product} \\
$$

In [21]:
reduce([height], x * A, torch.sum)

Indexable(tensor([15, 43, 76]))[width(, )]

$$
A \mathbin{\underset{\substack{\mathsf{\vphantom{fg}width}}}{\vphantom{fg}\odot}} B = \sum_i A_{\mathsf{\vphantom{fg}width}(i)} \odot B_{\mathsf{\vphantom{fg}width}(i)} \qquad \text{matrix-matrix product}~(B \in \mathbb{R}^{\mathsf{\vphantom{fg}width}\times \mathsf{\vphantom{fg}width2}})
$$

In [22]:
width2 = defop(int, name="width2")
B = Indexable(
    tensor([[3, 2, 5], [5, 4, 0], [8, 3, 6]]),
)[width(), width2()]

reduce([width], A * B, torch.sum)

Indexable(tensor([[ 46,  22,  39],
                  [100,  49,  59],
                  [ 76,  43,  40]]))[height(, ), width2(, )]

Contraction can be generalized to other binary and reduction operations:

$$
\max_{\mathsf{\vphantom{fg}width}} (A + y) = \mathsf{\vphantom{fg}height}
\begin{array}[b]{@{}c@{}}\\\begin{bmatrix}
  \max(3+1, 1+4, 4+1) \\
  \max(1+1, 5+4, 9+1) \\
  \max(2+1, 6+4, 5+1)
\end{bmatrix}\end{array}.
$$

In [23]:
reduce([width], A + y, torch.amax)

Indexable(tensor([ 5, 10, 10]))[height(, )]

### Renaming and Reshaping

Renaming named dimensions is simple:

$$
A_{\mathsf{\vphantom{fg}height}\rightarrow\mathsf{\vphantom{fg}height2}} = \mathsf{\vphantom{fg}height2}
\begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}
\\\begin{bmatrix}
  3 & 1 & 4 \\
  1 & 5 & 9 \\
  2 & 6 & 5 \\
\end{bmatrix}\end{array}.
$$

In [24]:
height2 = defop(int, name="height2")
subst(A, {height: height2()})

Indexable(tensor([[3, 1, 4],
                  [1, 5, 9],
                  [2, 6, 5]]))[height2(, ), width(, )]

$$
A_{(\mathsf{\vphantom{fg}height},\mathsf{\vphantom{fg}width})\rightarrow\mathsf{\vphantom{fg}layer}} = \begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}layer}\\
\begin{bmatrix}
    3 & 1 & 4 & 1 & 5 & 9 & 2 & 6 & 5
\end{bmatrix}\end{array}
$$

In [25]:
layer = defop(int, name="layer")
A_layer = subst(A, {height: layer() // 3, width: layer() % 3})
print(subst(A_layer, {layer: 2}))

tensor(4)


$$
A_{\mathsf{\vphantom{fg}layer}\rightarrow(\mathsf{\vphantom{fg}height},\mathsf{\vphantom{fg}width})} = \mathsf{\vphantom{fg}height}
\begin{array}[b]{@{}c@{}}\mathsf{\vphantom{fg}width}
\\\begin{bmatrix}
  3 & 1 & 4 \\
  1 & 5 & 9 \\
  2 & 6 & 5 \\
\end{bmatrix}\end{array}.
$$

In [26]:
print(subst(A_layer, {layer: height() * 3 + width() % 3}))

_torch_op(tensor([[3, 1, 4],
        [1, 5, 9],
        [2, 6, 5]]), ['floordiv(add(mul(height(, ), 3, ), mod(width(, ), 3, ), ), 3, )', 'mod(add(mul(height(, ), 3, ), mod(width(, ), 3, ), ), 3, )'], )


## Advanced Indexing

All of advanced indexing can be achieved through name substitutions.

$$
\mathop{\underset{\substack{\mathsf{\vphantom{fg}ax}}}{\vphantom{fg}\mathrm{index}}} \colon \mathbb{R}^{\mathsf{\vphantom{fg}ax}[n]} \times [n] \rightarrow \mathbb{R}\\
\mathop{\underset{\substack{\mathsf{\vphantom{fg}ax}}}{\vphantom{fg}\mathrm{index}}}(A, i) = A_{\mathsf{\vphantom{fg}ax}(i)}.
$$

$$
\begin{aligned}
  E &\in \mathbb{R}^{\mathsf{\vphantom{fg}vocab}[n] \times \mathsf{\vphantom{fg}emb}} \\
  i &\in [n] \\
  I &\in [n]^{\mathsf{\vphantom{fg}seq}} \\
  P &\in \mathbb{R}^{\mathsf{\vphantom{fg}seq}\times \mathsf{\vphantom{fg}vocab}[n]}
\end{aligned}
$$

Partial indexing $\mathop{\underset{\substack{\mathsf{\vphantom{fg}vocab}}}{\vphantom{fg}\mathrm{index}}}(E,i)$:

In [27]:
vocab, emb = defop(int, name="vocab"), defop(int, name="emb")
E = Indexable(
    tensor([[2, 1, 5], [3, 4, 2], [1, 3, 7], [1, 4, 3], [5, 9, 2]]),
)[vocab(), emb()]

subst(E, {vocab: 2})

Indexable(tensor([1, 3, 7]))[emb(, )]

Integer array indexing $\mathop{\underset{\substack{\mathsf{\vphantom{fg}vocab}}}{\vphantom{fg}\mathrm{index}}}(E,I)$:

In [28]:
seq = defop(int, name="seq")
I = Indexable(tensor([3, 2, 4, 0]))[seq()]

subst(E, {vocab: I})

Indexable(tensor([[1, 4, 3],
                  [1, 3, 7],
                  [5, 9, 2],
                  [2, 1, 5]]))[seq(, ), emb(, )]

Gather operation $\mathop{\underset{\substack{\mathsf{\vphantom{fg}vocab}}}{\vphantom{fg}\mathrm{index}}}(P,I)$:

In [29]:
P = Indexable(
    tensor([[6, 2, 4, 2], [8, 2, 1, 3], [5, 5, 7, 0], [1, 3, 8, 2], [5, 9, 2, 3]]),
)[vocab(), seq()]

subst(P, {vocab: I})

Indexable(tensor([1, 5, 2, 2]))[seq(, )]

Indexing with two integer arrays:

$$
\begin{aligned}
  |\mathsf{\vphantom{fg}seq}| &= m \\
  I_1 &= [m]^\mathsf{\vphantom{fg}subseq}\\
  I_2 &= [n]^\mathsf{\vphantom{fg}subseq}\\
  S &= \mathop{\underset{\substack{\mathsf{\vphantom{fg}vocab}}}{\vphantom{fg}\mathrm{index}}}(\mathop{\underset{\substack{\mathsf{\vphantom{fg}seq}}}{\vphantom{fg}\mathrm{index}}}(P, I_1), I_2) \in \mathbb{R}^{\mathsf{\vphantom{fg}subseq}} \\
  S_{\mathsf{\vphantom{fg}subseq}(i)} &= P_{\mathsf{\vphantom{fg}seq}(I_{\mathsf{\vphantom{fg}subseq}(i)}), \mathsf{\vphantom{fg}vocab}(I_{\mathsf{\vphantom{fg}subseq}(i)})}.
\end{aligned}
$$

In [30]:
subseq = defop(int, name="subseq")
I1 = Indexable(tensor([1, 2, 0]))[subseq()]
I2 = Indexable(tensor([3, 0, 4]))[subseq()]

subst(P, {seq: I1, vocab: I2})

Indexable(tensor([3, 4, 5]))[subseq(, )]

## Constructing Neural Networks

### Feedforward

\begin{aligned}
  X^0 &\in \mathbb{R}^{\mathsf{\vphantom{fg}input}} \\
  X^1 &= \sigma(W^1 \mathbin{\underset{\substack{\mathsf{\vphantom{fg}input}}}{\vphantom{fg}\odot}} X^0 + b^1) & W^1 &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}_1 \times \mathsf{\vphantom{fg}input}} & b^1 &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}_1} \\
  X^2 &= \sigma(W^2 \mathbin{\underset{\substack{\mathsf{\vphantom{fg}hidden}_1}}{\vphantom{fg}\odot}} X^1 + b^2) & W^2 &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}_2 \times \mathsf{\vphantom{fg}hidden}_1} & b^2 &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}_2} \\
  X^3 &= \sigma(W^3 \mathbin{\underset{\substack{\mathsf{\vphantom{fg}hidden}_2}}{\vphantom{fg}\odot}} X^2 + b^3) & W^3 &\in \mathbb{R}^{\mathsf{\vphantom{fg}out}\times \mathsf{\vphantom{fg}hidden}_2} & b^3 &\in \mathbb{R}^{\mathsf{\vphantom{fg}out}}
\end{aligned}

$$
\begin{aligned}
x &\in \mathbb{R}^{\mathsf{\vphantom{fg}layer}[n_0]} \\
W^l &\in \mathbb{R}^{\mathsf{\vphantom{fg}layer^2}[n_l] \times \mathsf{\vphantom{fg}layer}[n_{l-1}]} \\
  b^l &\in \mathbb{R}^{\mathsf{\vphantom{fg}layer^2}[n_l]} \\
  \text{FullConn}^l(x) &= \sigma\left(W^l \mathbin{\underset{\substack{\mathsf{\vphantom{fg}layer}}}{\vphantom{fg}\odot}} x + b^l\right)_{\mathsf{\vphantom{fg}layer^2}\rightarrow\mathsf{\vphantom{fg}layer}}
\end{aligned}
$$

In [31]:
@defop
def FullConn(x: torch.Tensor, W: torch.Tensor, b: torch.Tensor, layer: Bound):
    return reduce([layer], torch.sigmoid(torch.mul(W, x)), torch.sum) + b

In [None]:
input_size = 100
output_size = 32
input_, output = defop(int, name="input"), defop(int, name="output")

W = Indexable(torch.randn(input_size, output_size))[input_(), output()]
b = Indexable(torch.randn(output_size))[output()]
X = Indexable(torch.randn(input_size))[input_()]

FullConn(X, W, b, input_)

Indexable(tensor([52.0414, 48.0876, 51.2920, 49.6409, 52.7343, 48.4719, 50.6315, 49.3779,
                  48.6501, 50.5334, 47.9143, 50.4687, 49.8043, 48.4915, 52.6357, 52.7586,
                  50.4781, 51.8051, 49.5011, 50.8009, 49.9178, 48.5212, 51.2081, 48.3488,
                  51.1481, 49.7558, 49.7767, 51.4157, 53.4918, 49.9883, 51.4995, 49.2293]))[output(, )]

### Recurrent

$$
\begin{aligned}
x^{t} &\in \mathbb{R}^{\mathsf{\vphantom{fg}input}} & t &= 1, \ldots, n \\
W^{\text{h}} &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}\times \mathsf{\vphantom{fg}hidden}^\prime} & |\mathsf{\vphantom{fg}hidden}| &= |\mathsf{\vphantom{fg}hidden}^\prime| \\
W^{\text{i}} &\in \mathbb{R}^{\mathsf{\vphantom{fg}input}\times \mathsf{\vphantom{fg}hidden}^\prime} \\
b &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}^\prime} \\
h^{0} &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}} \\
h^{t} &= \sigma\left( W^{\text{h}} \mathbin{\underset{\substack{\mathsf{\vphantom{fg}hidden}}}{\vphantom{fg}\odot}} h^{t-1} + W^{\text{i}} \mathbin{\underset{\substack{\mathsf{\vphantom{fg}input}}}{\vphantom{fg}\odot}} x^{t} + b \right)_{\mathsf{\vphantom{fg}hidden}^\prime\rightarrow\mathsf{\vphantom{fg}hidden}} & t &= 1, \ldots, n
\end{aligned}
$$

In [33]:
@defop
def RNN(
    x: torch.Tensor,
    Wh: torch.Tensor,
    Wi: torch.Tensor,
    b: torch.Tensor,
    h: torch.Tensor,
    hidden: Bound,
    layer: Bound,
) -> torch.Tensor:
    return torch.sigmoid(
        reduce([hidden], Wh * h, torch.sum) + reduce([layer], Wi * x, torch.sum) + b
    )

In [34]:
input_size = 100
hidden_size = 32
input_, hidden, hidden2 = gensyms("input", "hidden", "hidden2")

Wh = Indexable(torch.randn(hidden_size, hidden_size))[hidden(), hidden2()]
Wi = Indexable(torch.randn(input_size, hidden_size))[input_(), hidden2()]
b = Indexable(torch.randn(hidden_size))[hidden2()]
h = Indexable(torch.randn(hidden_size))[hidden()]
x = Indexable(torch.randn(input_size))[input_()]

RNN(x, Wh, Wi, b, h, hidden, input_)

Indexable(tensor([6.3224e-06, 3.7210e-02, 4.9650e-02, 5.7229e-04, 1.0000e+00, 8.5436e-01,
                  2.3875e-01, 4.8705e-06, 9.9986e-01, 9.9778e-01, 8.2405e-01, 9.5162e-01,
                  9.9980e-01, 3.0399e-06, 8.0775e-01, 1.9093e-03, 9.9877e-01, 9.9939e-01,
                  4.7164e-05, 1.0523e-01, 7.6891e-01, 6.0549e-03, 1.3915e-05, 9.9999e-01,
                  2.6568e-06, 3.9610e-06, 3.6016e-10, 9.7404e-01, 9.8865e-01, 1.0000e+00,
                  1.0000e+00, 6.8541e-01]))[hidden2(, )]

### Attention

In [35]:
@defop
def Softmax(x: torch.Tensor, ax: Bound, ax2: Bound) -> torch.Tensor:
    x = subst(x, {ax: ax2()})
    y = x - reduce([ax2], x, torch.logsumexp)
    return y.exp()

\begin{aligned}
  \text{Attention} \colon \mathbb{R}^{\mathsf{\vphantom{fg}key}} \times \mathbb{R}^{\mathsf{\vphantom{fg}seq}\times\mathsf{\vphantom{fg}key}} \times \mathbb{R}^{\mathsf{\vphantom{fg}seq}\times\mathsf{\vphantom{fg}val}} \times \mathbb{R}^{\mathsf{\vphantom{fg}seq}} &\rightarrow \mathbb{R}^{\mathsf{\vphantom{fg}val}} \\
\text{Attention}(Q, K, V, M) &= \mathop{\underset{\substack{\mathsf{\vphantom{fg}seq}}}{\vphantom{fg}\mathrm{softmax}}} \left( \frac{Q \mathbin{\underset{\substack{\mathsf{\vphantom{fg}key}}}{\vphantom{fg}\odot}} K}{\sqrt{|\mathsf{\vphantom{fg}key}|}} + M \right) \mathbin{\underset{\substack{\mathsf{\vphantom{fg}seq}}}{\vphantom{fg}\odot}} V.
\end{aligned}

In [36]:
@defop
def Attention(
    Q: torch.Tensor,
    K: torch.Tensor,
    V: torch.Tensor,
    M: torch.Tensor,
    key: Bound,
    seq: Bound,
    seq2: Bound,
) -> torch.Tensor:
    x = reduce([key], Q * K, torch.sum) / sizesof(Q)[key] + M
    return reduce([seq], Softmax(x, seq, seq2) * V, torch.sum)

In [37]:
key_size = 10
val_size = 5
seq_size = 3

key, val, seq, seq2 = gensyms("key", "val", "seq", "seq2")
Q = Indexable(torch.randn(key_size))[key()]
K = Indexable(torch.randn(key_size, seq_size))[key(), seq()]
V = Indexable(torch.randn(seq_size, val_size))[seq(), val()]
M = Indexable(torch.randn(seq_size))[seq()]

Attention(Q, K, V, M, key, seq, seq2)

Indexable(tensor([[ 0.4209, -2.0661,  0.6506,  0.8293, -0.2160],
                  [ 0.2951, -1.4488,  0.4562,  0.5815, -0.1515],
                  [ 0.1431, -0.7024,  0.2212,  0.2819, -0.0734]]))[seq2(, ), val(, )]

### Convolution

\begin{aligned}
  \mathop{\underset{\substack{\mathsf{\vphantom{fg}seq}\\ \mathsf{\vphantom{fg}kernel}}}{\vphantom{fg}\mathrm{unroll}}} \colon \mathbb{R}^{\mathsf{\vphantom{fg}seq}[n]} &\rightarrow \mathbb{R}^{\mathsf{\vphantom{fg}seq}[n-|\mathsf{\vphantom{fg}kernel}|+1], \mathsf{\vphantom{fg}kernel}} \\
  \mathop{\underset{\substack{\mathsf{\vphantom{fg}seq}\\ \mathsf{\vphantom{fg}kernel}}}{\vphantom{fg}\mathrm{unroll}}} X &= Y,\ \text{where} \\
  Y_{\mathsf{\vphantom{fg}seq}(i), \mathsf{\vphantom{fg}kernel}(j)} &= X_{\mathsf{\vphantom{fg}seq}(i+j - 1)}.
\end{aligned}

In [38]:
@defop
def Unroll(
    x: torch.Tensor, seq: Bound, k: int, kernel: Bound, seq2: Bound
) -> torch.Tensor:
    return Indexable(to_tensor(x, [seq]).unfold(0, k, 1))[seq2(), kernel()]

\begin{aligned}
\text{Conv1d} \colon \mathbb{R}^{\mathsf{\vphantom{fg}chans}\times \mathsf{\vphantom{fg}seq}[n]} &\rightarrow \mathbb{R}^{\mathsf{\vphantom{fg}seq}[n^\prime]} \\
\text{Conv1d}(X; W, b) &= W \mathbin{\underset{\substack{\mathsf{\vphantom{fg}chans}\\ \mathsf{\vphantom{fg}kernel}}}{\vphantom{fg}\odot}} \mathop{\underset{\substack{\mathsf{\vphantom{fg}seq}\\ \mathsf{\vphantom{fg}kernel}}}{\vphantom{fg}\mathrm{unroll}}} X + b
\end{aligned}

\begin{aligned}
W &\in \mathbb{R}^{\mathsf{\vphantom{fg}chans}\times \mathsf{\vphantom{fg}kernel}} \\
b &\in \mathbb{R}\\
\end{aligned}

In [39]:
@defop
def Conv1d(
    X: torch.Tensor,
    W: torch.Tensor,
    b: torch.Tensor,
    chans: Bound,
    k: int,
    kernel: Bound,
    seq: Bound,
    seq2: Bound,
) -> torch.Tensor:
    y = W * Unroll(X, seq, k, kernel, seq2)
    return reduce([chans, kernel], y, torch.sum) + b

In [40]:
chans_size = 3
seq_size = 10
kernel_size = 3

chans, kernel, seq, seq2 = gensyms("chans", "kernel", "seq", "seq2")

X = Indexable(torch.randn(chans_size, seq_size))[chans(), seq()]
W = Indexable(torch.randn(chans_size, kernel_size))[chans(), kernel()]
b = torch.randn(tuple())

Conv1d(X, W, b, chans, 3, kernel, seq, seq2)

Indexable(tensor([-0.8187,  3.8129, -8.1824,  4.4517,  0.1597,  1.9873, -1.4903, -0.6730]))[seq2(, )]

$$
\begin{aligned}
  \text{Conv2d} \colon \mathbb{R}^{\mathsf{\vphantom{fg}chans}\times \mathsf{\vphantom{fg}height}[h] \times \mathsf{\vphantom{fg}width}[w]}
  &\rightarrow \mathbb{R}^{\mathsf{\vphantom{fg}height}[h2] \times \mathsf{\vphantom{fg}width}[w2]} \\
  \text{Conv2d}(X; W, b) &= W \mathbin{\underset{\substack{\mathsf{\vphantom{fg}chans}\\ \mathsf{\vphantom{fg}kh}, \mathsf{\vphantom{fg}kw}}}{\vphantom{fg}\odot}} \mathop{\underset{\substack{\mathsf{\vphantom{fg}height}\\ \mathsf{\vphantom{fg}kh}}}{\vphantom{fg}\mathrm{unroll}}} \mathop{\underset{\substack{\mathsf{\vphantom{fg}width}\\\mathsf{\vphantom{fg}kw}}}{\vphantom{fg}\mathrm{unroll}}} X + b\end{aligned}
$$

$$
\begin{aligned}
W &\in \mathbb{R}^{\mathsf{\vphantom{fg}chans}\times \mathsf{\vphantom{fg}kh}\times \mathsf{\vphantom{fg}kw}} \\
b &\in \mathbb{R}.
\end{aligned}
$$

In [41]:
@defop
def Conv2d(
    X: torch.Tensor,
    W: torch.Tensor,
    b: torch.Tensor,
    chans: Bound,
    kh_size: int,
    kh: Bound,
    height: Bound,
    height2: Bound,
    kw: Bound,
    width: Bound,
    width2: Bound,
) -> torch.Tensor:
    y = W * Unroll(Unroll(X, width, kw_size, kw, width2), height, kh_size, kh, height2)
    return reduce([chans, kh, kw], y, torch.sum) + b

In [42]:
chans_size = 3
kh_size = 3
kw_size = 4
height_size = 10
width_size = 8

chans, kh, kw, height, width, height2, width2 = gensyms(
    "chans", "kh", "kw", "height", "width", "height2", "width2"
)

X = Indexable(torch.randn(chans_size, height_size, width_size))[
    chans(), height(), width()
]
W = Indexable(torch.randn(chans_size, kh_size, kw_size))[chans(), kh(), kw()]
b = torch.randn(tuple())

Conv2d(X, W, b, chans, kh_size, kh, height, height2, kw, width, width2)

Indexable(tensor([[-4.2300, -1.1482,  6.9712, -4.3390, -9.7533,  1.6445, -6.6997,  5.4496],
                  [ 7.8161, -6.9765,  2.7075, -2.0119, -1.1272, -1.2894,  2.1975, -2.1915],
                  [ 3.2826, -7.6144, -3.5375, -4.7079,  8.5355, -2.6946, -4.9916, -2.5339],
                  [-5.3109, 11.8790, -6.4113,  0.0694, -6.1430,  7.6022, -4.8850,  7.1101],
                  [-4.0429, -1.5105,  1.5986,  8.6606,  1.0992, -8.5570, -2.4783,  2.5876]]))[width2(, ), height2(, )]

### Max Pooling

$$
\begin{aligned}
  \mathop{\underset{\substack{\mathsf{\vphantom{fg}seq},\mathsf{\vphantom{fg}kernel}}}{\vphantom{fg}\mathrm{pool}}} \colon \mathbb{R}^{\mathsf{\vphantom{fg}seq}[n]} &\rightarrow \mathbb{R}^{\mathsf{\vphantom{fg}seq}[n/|\mathsf{\vphantom{fg}kernel}|],\mathsf{\vphantom{fg}kernel}} \\
  \mathop{\underset{\substack{\mathsf{\vphantom{fg}seq},\mathsf{\vphantom{fg}kernel}}}{\vphantom{fg}\mathrm{pool}}} X &= Y,\ \text{where} \\
  Y_{\mathsf{\vphantom{fg}seq}(i), \mathsf{\vphantom{fg}kernel}(j)} &= X_{\mathsf{\vphantom{fg}seq}((i-1) \cdot |\mathsf{\vphantom{fg}kernel}| + j)}.
\end{aligned}
$$

In [43]:
@defop
def Pool(
    x: torch.Tensor, seq: Bound, k: int, kernel: Bound, seq2: Bound
) -> torch.Tensor:
    xp = to_tensor(x, [seq])
    return Indexable(xp.reshape((xp.shape[0] // k, k) + xp.shape[1:]))[seq2(), kernel()]

In [44]:
seq_size = 10
seq, seq2, kernel = gensyms("seq", "seq2", "kernel")

X = Indexable(torch.randn(seq_size))[seq()]
Y = Pool(X, seq, 2, kernel, seq2)
Y

Indexable(tensor([[-0.6976,  0.5874],
                  [ 0.8979, -0.1797],
                  [-1.6208,  0.9514],
                  [ 0.3502,  0.3947],
                  [ 0.3442, -1.6253]]))[seq2(, ), kernel(, )]

$$
\begin{aligned}
\text{MaxPool1d}_{k} \colon \mathbb{R}^{\mathsf{\vphantom{fg}seq}[n]} &\rightarrow \mathbb{R}^{\mathsf{\vphantom{fg}seq}[n/k]} \\
\text{MaxPool1d}_{k}(X) &= \mathop{\underset{\substack{\mathsf{\vphantom{fg}kernel}}}{\vphantom{fg}\mathrm{max}}} \mathop{\underset{\substack{\mathsf{\vphantom{fg}seq},\mathsf{\vphantom{fg}kernel}}}{\vphantom{fg}\mathrm{pool}}} X \\
|\mathsf{\vphantom{fg}kernel}| &= k \\
\text{MaxPool2d}_{kh,kw} \colon \mathbb{R}^{\mathsf{\vphantom{fg}height}[h] \times \mathsf{\vphantom{fg}width}[w]} &\rightarrow \mathbb{R}^{\mathsf{\vphantom{fg}height}[h/kh] \times \mathsf{\vphantom{fg}width}[w/kw]} \\
\text{MaxPool2d}_{kh,kw}(X) &= \mathop{\underset{\substack{\mathsf{\vphantom{fg}kh},\mathsf{\vphantom{fg}kw}}}{\vphantom{fg}\mathrm{max}}} \mathop{\underset{\substack{\mathsf{\vphantom{fg}height},\mathsf{\vphantom{fg}kh}}}{\vphantom{fg}\mathrm{pool}}} \mathop{\underset{\substack{\mathsf{\vphantom{fg}width},\mathsf{\vphantom{fg}kw}}}{\vphantom{fg}\mathrm{pool}}} X \\
|\mathsf{\vphantom{fg}kh}| &= kh \\
|\mathsf{\vphantom{fg}kw}| &= kw.
\end{aligned}
$$

In [45]:
@defop
def MaxPool1d(
    X: torch.Tensor, seq: Bound, k: int, kernel: Bound, seq2: Bound
) -> torch.Tensor:
    return reduce([kernel], Pool(X, seq, k, kernel, seq2), torch.max)

In [46]:
seq_size = 10

seq, seq2, kernel = gensyms("seq", "seq2", "kernel")

X = Indexable(torch.randn(seq_size))[seq()]
MaxPool1d(X, seq, 2, kernel, seq2)

Indexable(tensor([[ 0.5311],
                  [-0.0196],
                  [ 1.1695],
                  [ 1.2667],
                  [ 0.5657]]))[seq2(, ), slice(None, None, None)]

In [47]:
@defop
def MaxPool2d(
    X: torch.Tensor,
    height: Bound,
    kh_size: int,
    kh: Bound,
    height2: Bound,
    width: Bound,
    kw_size: int,
    kw: Bound,
    width2: Bound,
) -> torch.Tensor:
    y = Pool(Pool(X, height, kh_size, kh, height2), width, kw_size, kw, width2)
    return reduce([kh, kw], y, torch.max)

In [48]:
width_size = 9
height_size = 4

width, width2, height, height2, kw, kh = gensyms(
    "width", "width2", "height", "height2", "kw", "kh"
)

X = Indexable(torch.randn(width_size, height_size))[width(), height()]
MaxPool2d(X, height, 2, kh, height2, width, 3, kw, width2)

Indexable(tensor([[[1.7193],
                   [0.1245],
                   [0.6246]],
          
                  [[1.0551],
                   [0.6995],
                   [1.4418]]]))[height2(, ), width2(, ), slice(None, None, None)]

### Normalization Layers

$$
\begin{aligned}
  \mathop{\underset{\substack{\mathsf{\vphantom{fg}ax}}}{\vphantom{fg}\mathrm{standardize}}} \colon \mathbb{R}^{\mathsf{\vphantom{fg}ax}} &\rightarrow \mathbb{R}^{\mathsf{\vphantom{fg}ax}} \\
  \mathop{\underset{\substack{\mathsf{\vphantom{fg}ax}}}{\vphantom{fg}\mathrm{standardize}}}(X) &= \frac{X - \mathop{\underset{\substack{\mathsf{\vphantom{fg}ax}}}{\vphantom{fg}\mathrm{mean}}}(X)}{\sqrt{\mathop{\underset{\substack{\mathsf{\vphantom{fg}ax}}}{\vphantom{fg}\mathrm{var}}}(X) + \epsilon}}
\end{aligned}
$$

In [49]:
@defop
def Mean(X: torch.Tensor, ax: Bound) -> torch.Tensor:
    return reduce([ax], X, torch.sum) / sizesof(X)[ax]


@defop
def Mean2(X: torch.Tensor, ax: Bound, ax2: Bound) -> torch.Tensor:
    sizes = sizesof(X)
    return reduce([ax, ax2], X, torch.sum) / (sizes[ax] * sizes[ax2])


@defop
def Variance(X: torch.Tensor, ax: Bound) -> torch.Tensor:
    return Mean((X - Mean(X, ax)) ** 2, ax)


@defop
def Variance2(X: torch.Tensor, ax: Bound, ax2: Bound) -> torch.Tensor:
    return Mean2((X - Mean2(X, ax, ax2)) ** 2, ax, ax2)


@defop
def Standardize(X: torch.Tensor, ax: Bound, new_ax: Bound) -> torch.Tensor:
    y = subst(X, {ax: new_ax()})
    return (y - Mean(X, ax)) / (Variance(X, ax) + torch.finfo(X.dtype).eps).sqrt()


@defop
def Standardize2(
    X: torch.Tensor,
    ax: Bound,
    ax2: Bound,
    new_ax: Bound,
    new_ax2: Bound,
) -> torch.Tensor:
    y = subst(X, {ax: new_ax(), ax2: new_ax2()})
    return (y - Mean2(X, ax, ax2)) / (
        Variance2(X, ax, ax2) + torch.finfo(X.dtype).eps
    ).sqrt()

$$
\begin{aligned}
\text{BatchNorm}(X; \gamma, \beta) &= \mathop{\underset{\substack{\mathsf{\vphantom{fg}batch},\mathsf{\vphantom{fg}layer}}}{\vphantom{fg}\mathrm{standardize}}}(X) \mathbin{\underset{\substack{}}{\vphantom{fg}\odot}} \gamma + \beta & \gamma, \beta &\in \mathbb{R}^{\mathsf{\vphantom{fg}chans}} \\
\text{InstanceNorm}(X; \gamma, \beta) &= \mathop{\underset{\substack{\mathsf{\vphantom{fg}layer}}}{\vphantom{fg}\mathrm{standardize}}}(X) \mathbin{\underset{\substack{}}{\vphantom{fg}\odot}} \gamma + \beta & \gamma, \beta &\in \mathbb{R}^{\mathsf{\vphantom{fg}chans}} \\
\text{LayerNorm}(X; \gamma, \beta) &= \mathop{\underset{\substack{\mathsf{\vphantom{fg}layer},\mathsf{\vphantom{fg}chans}}}{\vphantom{fg}\mathrm{standardize}}}(X) \mathbin{\underset{\substack{}}{\vphantom{fg}\odot}} \gamma + \beta & \gamma, \beta &\in \mathbb{R}^{\mathsf{\vphantom{fg}chans},\mathsf{\vphantom{fg}layer}}
\end{aligned}
$$

In [50]:
@defop
def BatchNorm(
    X: torch.Tensor,
    gamma: torch.Tensor,
    beta: torch.Tensor,
    batch: Bound,
    layer: Bound,
    batch2: Bound,
    layer2: Bound,
) -> torch.Tensor:
    return Standardize2(X, batch, layer, batch2, layer2) * gamma + beta


@defop
def InstanceNorm(
    X: torch.Tensor,
    gamma: torch.Tensor,
    beta: torch.Tensor,
    layer: Bound,
    layer2: Bound,
) -> torch.Tensor:
    return Standardize(X, layer, layer2) * gamma + beta


# same as BatchNorm
@defop
def LayerNorm(
    X: torch.Tensor,
    gamma: torch.Tensor,
    beta: torch.Tensor,
    chans: Bound,
    layer: Bound,
    chans2: Bound,
    layer2: Bound,
) -> torch.Tensor:
    return Standardize2(X, chans, layer, chans2, layer2) * gamma + beta

In [51]:
batch_size, chans_size, layer_size = 4, 3, 5
batch, batch2, chans, layer, layer2 = gensyms(
    "batch", "batch2", "chans", "layer", "layer2"
)

x = Indexable(torch.randn(batch_size, chans_size, layer_size))[
    batch(), chans(), layer()
]
g = Indexable(torch.randn(chans_size))[chans()]
b = Indexable(torch.randn(chans_size))[chans()]

BatchNorm(x, g, b, batch, layer, batch2, layer2)

Indexable(tensor([[[ 4.3936,  4.9286,  0.4746,  4.7885, -1.4872],
                   [ 0.0328,  0.1472,  0.2054, -0.5323,  0.4721],
                   [-2.0453, -1.7645, -1.5576, -1.6044, -1.8301]],
          
                  [[-1.5173,  5.5533, -1.9767,  3.4703,  3.4684],
                   [-0.1590,  0.4532, -0.0448,  0.2933,  0.5648],
                   [-1.6796, -1.9828, -1.6863, -2.1494, -1.9257]],
          
                  [[ 1.1506,  1.7988,  2.3360, -0.4363,  2.7942],
                   [-0.1444,  0.3988,  0.3489, -0.2729,  0.6021],
                   [-1.6539, -1.5938, -1.7503, -1.7140, -1.5574]],
          
                  [[ 3.8867,  0.6117,  4.4763,  0.7027,  1.0946],
                   [ 0.8929,  0.1545,  0.6493, -0.9831,  0.3118],
                   [-1.7654, -1.7197, -1.7034, -1.6142, -1.7811]]]))[batch2(, ), chans(, ), layer2(, )]

$$
\begin{aligned}
\text{GroupNorm}_k(X; \gamma, \beta) &= \left[ \mathop{\underset{\substack{\mathsf{\vphantom{fg}kernel},\mathsf{\vphantom{fg}layer}}}{\vphantom{fg}\mathrm{standardize}}} \mathop{\underset{\substack{\mathsf{\vphantom{fg}chans}, \mathsf{\vphantom{fg}kernel}}}{\vphantom{fg}\mathrm{pool}}} X \right]_{(\mathsf{\vphantom{fg}chans},\mathsf{\vphantom{fg}kernel})\rightarrow \mathsf{\vphantom{fg}chans}} \mathbin{\underset{\substack{}}{\vphantom{fg}\odot}} \gamma + \beta \\
\end{aligned}
$$

$$
\begin{aligned}
|\mathsf{\vphantom{fg}kernel}| &= k\\
\gamma, \beta &\in \mathbb{R}^{\mathsf{\vphantom{fg}chans}}.
\end{aligned}
$$

### Transformer

$$
\begin{aligned}
  I &\in \{0, 1\}^{\mathsf{\vphantom{fg}seq}\times \mathsf{\vphantom{fg}vocab}} & \sum\limits_{\substack{\mathsf{\vphantom{fg}vocab}}} I &= 1 \\
  W &= (E \mathbin{\underset{\substack{\mathsf{\vphantom{fg}vocab}}}{\vphantom{fg}\odot}} I)\sqrt{|\mathsf{\vphantom{fg}layer}|} & E &\in \mathbb{R}^{\mathsf{\vphantom{fg}vocab}\times \mathsf{\vphantom{fg}layer}} \\
  P &\in \mathbb{R}^{\mathsf{\vphantom{fg}seq}\times \mathsf{\vphantom{fg}layer}} \\
  P_{\mathsf{\vphantom{fg}seq}(p), \mathsf{\vphantom{fg}layer}(i)} &= \begin{cases}
    \sin((p-1) / 10000^{(i-1) / |\mathsf{\vphantom{fg}layer}|}) & \text{$i$ odd} \\ 
    \cos((p-1) / 10000^{(i-2) / |\mathsf{\vphantom{fg}layer}|}) & \text{$i$ even.}
  \end{cases}
\end{aligned}
$$

$$
\begin{aligned}
X^0 &= W+P \\
T^1 &= \text{LayerNorm}^1(\text{SelfAtt}^1(X^0)) + X^0\\
X^1 &= \text{LayerNorm}^{1^\prime}(\text{FFN}^1(T^1)) + T^1\\
&\vdotswithin{=} \\
T^{L} &= \text{LayerNorm}^L(\text{SelfAtt}^L(X^{L-1})) + X^{L-1}\\
X^{L} &= \text{LayerNorm}^{L^\prime}(\text{FFN}^L(T^L)) + T^L\\
O &= \mathop{\underset{\substack{\mathsf{\vphantom{fg}vocab}}}{\vphantom{fg}\mathrm{softmax}}}(E \mathbin{\underset{\substack{\mathsf{\vphantom{fg}layer}}}{\vphantom{fg}\odot}} X^L)
\end{aligned}
$$

$$
\begin{aligned}
  \text{LayerNorm}^l \colon \mathbb{R}^{\mathsf{\vphantom{fg}layer}} &\rightarrow \mathbb{R}^{\mathsf{\vphantom{fg}layer}} \\
  \text{LayerNorm}^l(X) &= \mathop{\underset{\substack{\mathsf{\vphantom{fg}layer}}}{\vphantom{fg}\mathrm{XNorm}}}(X; \beta^l, \gamma^l).
\end{aligned}
$$

$$
\begin{aligned}
  \text{SelfAtt}^l \colon \mathbb{R}^{\mathsf{\vphantom{fg}seq}\times \mathsf{\vphantom{fg}layer}} &\rightarrow \mathbb{R}^{\mathsf{\vphantom{fg}seq}\times \mathsf{\vphantom{fg}layer}} \\
  \text{SelfAtt}^l(X) &= Y
\end{aligned}
$$

$$
\begin{aligned}
  |\mathsf{\vphantom{fg}seq}| &= |\mathsf{\vphantom{fg}seq2}| \\
  |\mathsf{\vphantom{fg}key}| = |\mathsf{\vphantom{fg}val}| &= |\mathsf{\vphantom{fg}layer}|/|\mathsf{\vphantom{fg}heads}| \\
  Q &= W^{l,Q} \mathbin{\underset{\substack{\mathsf{\vphantom{fg}layer}}}{\vphantom{fg}\odot}} X_{\mathsf{\vphantom{fg}seq}\rightarrow\mathsf{\vphantom{fg}seq2}} & W^{l,Q} &\in \mathbb{R}^{\mathsf{\vphantom{fg}heads}\times \mathsf{\vphantom{fg}layer}\times \mathsf{\vphantom{fg}key}} \\
  K &= W^{l,K} \mathbin{\underset{\substack{\mathsf{\vphantom{fg}layer}}}{\vphantom{fg}\odot}} X & W^{l,K} &\in \mathbb{R}^{\mathsf{\vphantom{fg}heads}\times \mathsf{\vphantom{fg}layer}\times \mathsf{\vphantom{fg}key}} \\
  V &= W^{l,V} \mathbin{\underset{\substack{\mathsf{\vphantom{fg}layer}}}{\vphantom{fg}\odot}} X & W^{l,V} &\in \mathbb{R}^{\mathsf{\vphantom{fg}heads}\times \mathsf{\vphantom{fg}layer}\times \mathsf{\vphantom{fg}val}} \\
  M & \in \mathbb{R}^{\mathsf{\vphantom{fg}seq}\times \mathsf{\vphantom{fg}seq2}} \\
  M_{\mathsf{\vphantom{fg}seq}(i), \mathsf{\vphantom{fg}seq2}(j)} &= \begin{cases}
    0 & i \leq j\\
    -\infty & \text{otherwise}
  \end{cases} \\
  Y &= W^{l,O} \mathbin{\underset{\substack{\mathsf{\vphantom{fg}heads}\\ \mathsf{\vphantom{fg}val}}}{\vphantom{fg}\odot}} \text{Attention}(Q, K, V, M)_{\mathsf{\vphantom{fg}seq2}\rightarrow\mathsf{\vphantom{fg}seq}} & W^{l,O} &\in \mathbb{R}^{\mathsf{\vphantom{fg}heads}\times \mathsf{\vphantom{fg}val}\times \mathsf{\vphantom{fg}layer}}
\end{aligned}
$$

$$
\begin{aligned}
  \text{FFN}^l \colon \mathbb{R}^{\mathsf{\vphantom{fg}layer}} &\rightarrow \mathbb{R}^{\mathsf{\vphantom{fg}layer}} \\
  \text{FFN}^l(X) &= X^2
\end{aligned}
$$

$$
\begin{aligned}
  X^1 &= \text{relu}(W^{l,1} \mathbin{\underset{\substack{\mathsf{\vphantom{fg}layer}}}{\vphantom{fg}\odot}} X + b^{l,1}) & W^{l,1} &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}\times \mathsf{\vphantom{fg}layer}} & b^{l,1} &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}} \\
  X^2 &= \text{relu}(W^{l,2} \mathbin{\underset{\substack{\mathsf{\vphantom{fg}hidden}}}{\vphantom{fg}\odot}} X^1 + b^{l,2}) & W^{l,2} &\in \mathbb{R}^{\mathsf{\vphantom{fg}layer}\times \mathsf{\vphantom{fg}hidden}} & b^{l,2} &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}}.
\end{aligned}
$$

### LeNet

$$
\begin{aligned}
X^0 &\in \mathbb{R}^{\mathsf{\vphantom{fg}batch}\times \mathsf{\vphantom{fg}chans}[c_0] \times \mathsf{\vphantom{fg}height}\times \mathsf{\vphantom{fg}width}} \\
T^1 &= \text{relu}(\text{Conv}^1(X^0)) \\
X^1 &= \text{MaxPool}^1(T^1) \\
T^2 &= \text{relu}(\text{Conv}^2(X^1)) \\
X^2 &= \text{MaxPool}^2(T^2)_{(\mathsf{\vphantom{fg}height},\mathsf{\vphantom{fg}width},\mathsf{\vphantom{fg}chans})\rightarrow\mathsf{\vphantom{fg}layer}} \\
X^3 &= \text{relu}(W^3 \mathbin{\underset{\substack{\mathsf{\vphantom{fg}layer}}}{\vphantom{fg}\odot}} X^2 + b^3) & W^3 &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}\times \mathsf{\vphantom{fg}layer}} & b^3 &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}} \\
O &= \mathop{\underset{\substack{\mathsf{\vphantom{fg}classes}}}{\vphantom{fg}\mathrm{softmax}}} (W^4 \mathbin{\underset{\substack{\mathsf{\vphantom{fg}hidden}}}{\vphantom{fg}\odot}} X^3 + b^4) & W^4 &\in \mathbb{R}^{\mathsf{\vphantom{fg}classes}\times \mathsf{\vphantom{fg}hidden}} & b^4 &\in \mathbb{R}^{\mathsf{\vphantom{fg}classes}}\end{aligned}
$$

$$
\begin{aligned}
X^2 &= \text{MaxPool}^2(T^2) \\
X^3 &= \text{relu}(W^3 \mathbin{\underset{\substack{\mathsf{\vphantom{fg}height}\\ \mathsf{\vphantom{fg}width}\\ \mathsf{\vphantom{fg}chans}}}{\vphantom{fg}\odot}} X^2 + b^3) & W^3 &\in \mathbb{R}^{\mathsf{\vphantom{fg}hidden}\times \mathsf{\vphantom{fg}height}\times \mathsf{\vphantom{fg}width}\times \mathsf{\vphantom{fg}chans}}.
\end{aligned}
$$

$$
\begin{aligned}
\text{Conv}^l(X) &= \text{Conv2d}(X; W^l, b^l)_{\mathsf{\vphantom{fg}chans2}\rightarrow\mathsf{\vphantom{fg}chans}}
\end{aligned}
$$

$$
\begin{aligned}
W^l & \in \mathbb{R}^{\mathsf{\vphantom{fg}chans2}[c_l] \times \mathsf{\vphantom{fg}chans}[c_{l-1}] \times \mathsf{\vphantom{fg}kh}[kh_l] \times \mathsf{\vphantom{fg}kw}[kw_l]} \\
b^l &\in \mathbb{R}^{\mathsf{\vphantom{fg}chans2}[c_l]}
\end{aligned}
$$

$$
\begin{aligned}
\text{MaxPool}^l(X) &amp;= \text{MaxPool2d}_{ph^l,ph^l}(X).
\end{aligned}
$$

In [52]:
@defop
def Relu(X: torch.Tensor) -> torch.Tensor:
    return torch.maximum(X, torch.tensor(0))

In [53]:
Relu(x)

Indexable(tensor([[[0.0000e+00, 0.0000e+00, 1.0695e+00, 0.0000e+00, 2.0910e+00],
                   [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.2155e-01],
                   [0.0000e+00, 2.3336e-03, 1.3300e+00, 1.0297e+00, 0.0000e+00]],
          
                  [[2.1067e+00, 0.0000e+00, 2.3458e+00, 0.0000e+00, 0.0000e+00],
                   [0.0000e+00, 8.8820e-02, 0.0000e+00, 0.0000e+00, 2.8219e-01],
                   [5.4738e-01, 0.0000e+00, 5.0418e-01, 0.0000e+00, 0.0000e+00]],
          
                  [[7.1755e-01, 3.8004e-01, 1.0034e-01, 1.5438e+00, 0.0000e+00],
                   [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 3.4677e-01],
                   [7.1239e-01, 1.0980e+00, 9.3312e-02, 3.2645e-01, 1.3318e+00]],
          
                  [[0.0000e+00, 9.9810e-01, 0.0000e+00, 9.5073e-01, 7.4670e-01],
                   [8.5045e-01, 0.0000e+00, 4.2842e-01, 0.0000e+00, 0.0000e+00],
                   [0.0000e+00, 2.8951e-01, 3.9431e-01, 9.6674e-01, 0.000

In [54]:
(
    chans_size,
    kh_size,
    kw_size,
    hidden_size,
    height_size,
    width_size,
    classes_size,
    batch_size,
) = (3, 3, 4, 3, 14, 15, 5, 4)
(
    chans,
    chans2,
    kh,
    kw,
    height,
    height2,
    height3,
    width,
    width2,
    width3,
    hidden,
    classes,
    classes2,
    batch,
) = gensyms(
    "chans",
    "chans2",
    "kh",
    "kw",
    "height",
    "height2",
    "height3",
    "width",
    "width2",
    "width3",
    "hidden",
    "classes",
    "classes2",
    "batch",
)

W1 = Indexable(torch.randn(chans_size, kh_size, kw_size, chans_size))[
    chans(), kh(), kw(), chans2()
]
b1 = Indexable(torch.randn(chans_size))[chans2()]
W3 = Indexable(torch.randn(hidden_size, 4, 4, chans_size))[
    hidden(), height3(), width3(), chans2()
]
b3 = Indexable(torch.randn(hidden_size))[hidden()]
W4 = Indexable(torch.randn(hidden_size, classes_size))[hidden(), classes()]
b4 = Indexable(torch.randn(classes_size))[classes()]
X0 = Indexable(torch.randn(batch_size, chans_size, height_size, width_size))[
    batch(), chans(), height(), width()
]

T1 = Relu(Conv2d(X0, W1, b1, chans, kh_size, kh, height, height2, kw, width, width2))
X1 = MaxPool2d(T1, height2, 3, kh, height3, width2, 3, kw, width3)
X3 = reduce([height3, width3, chans2], W3 * X1, torch.sum) + b3
O = Softmax(reduce([hidden], W4 * X3, torch.sum) + b4, classes, classes2)
O

Indexable(tensor([[[1.0000e+00],
                   [1.6670e-01],
                   [1.0000e+00],
                   [1.0000e+00]],
          
                  [[2.1944e-28],
                   [0.0000e+00],
                   [0.0000e+00],
                   [0.0000e+00]],
          
                  [[0.0000e+00],
                   [8.3330e-01],
                   [7.9017e-22],
                   [7.8374e-14]],
          
                  [[8.4078e-45],
                   [0.0000e+00],
                   [0.0000e+00],
                   [0.0000e+00]],
          
                  [[1.1518e-20],
                   [0.0000e+00],
                   [0.0000e+00],
                   [3.8484e-27]]]))[classes2(, ), batch(, ), slice(None, None, None)]