# TPUv1

This notebook reproduces the salient characteristics of the [TPUv1](https://arxiv.org/abs/1704.04760) accelerator.

## Imports

Import the necessary modules.

In [None]:
# HiFiber boilerplate

from fibertree_bootstrap import *

fibertree_bootstrap(style="tree", animation='movie')

# Compilation boilerplate

import os
import sys
sys.path.insert(0, "..")

from src import utils

## Initialization

Initialize the input tensors. Tensor shapes and densities can be modified below.

**Warning:** Large tensors will overwhelm the video generation. Either:
1. Use small tensors; as a rule of thumb, the kernel should perform fewer than 60 compute operations (e.g., multiplications).
2. Do not generate a video; remove the `spacetime` specification from the `mapping` before compiling.
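To apply the rule of thumb before compiling, you can count the multiplications the kernel will perform. In the HiFiber loop nests, a product is computed only where both `A[k, m]` and `B[k, n]` are nonzero, so the count is the sum over `k` of the nonzeros in row `k` of `A` times the nonzeros in row `k` of `B`. The sketch below assumes the tensors are given as dense Python lists of lists (zeros for empty entries) rather than fibertree `Tensor`s; `count_multiplies` is a hypothetical helper, not part of this notebook's code.

```python
def count_multiplies(A, B):
    """Count the products computed by Z[m, n] = A[k, m] * B[k, n].

    A is a K x M list of lists and B is a K x N list of lists (hypothetical
    dense stand-ins for A_KM and B_KN). A product is only performed when
    both A[k][m] and B[k][n] are nonzero, so for each k the number of
    products is nnz(A[k]) * nnz(B[k]).
    """
    if len(A) != len(B):
        raise ValueError("A and B must share the K rank")
    total = 0
    for a_row, b_row in zip(A, B):
        nnz_a = sum(1 for a in a_row if a != 0)
        nnz_b = sum(1 for b in b_row if b != 0)
        total += nnz_a * nnz_b
    return total


# Fully dense 4 x 4 inputs perform K * M * N multiplications.
dense = [[1] * 4 for _ in range(4)]
print(count_multiplies(dense, dense))  # 64
```

For fully dense inputs the count reduces to `K * M * N`.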

In [None]:
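# Matrix dimensions: A is K x M, B is K x N, and Z is M x N
K = 4
M = 4
N = 4

# Partition (tile) sizes along K and N; the spacetime mapping places K0 x N0 on the PE array
K0 = 2
N0 = 2

# Per-rank densities passed to Tensor.fromRandom (1 = fully dense) and the random seed
density = [1, 1]
seed = 0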

A_KM = Tensor.fromRandom(rank_ids=["K", "M"], shape=[K, M], seed=seed, density=density, name="A")
B_KN = Tensor.fromRandom(rank_ids=["K", "N"], shape=[K, N], seed=seed + 1, density=density, name="B")

## Compile from TeAAL Specification and Run

Below is the TeAAL specification for the TPU. To simulate the accelerator:
1. Run the cell to compile the specification to HiFiber; this inserts a new cell containing the generated code.
2. Run the new cell, which will:
    - Execute the kernel, multiplying the matrices defined above
    - Generate visualizations of the kernel's actions


#### Notes

- Small tensors are required for video generation. If you are using large tensors, remove the spacetime specification to generate a kernel that does not produce videos. Outputs can still be checked below.
- The partition shapes above are reduced for visualization purposes. The real TPUv1 uses `K0 = 256` and `N0 = 256`.
- The visualizations generated by TeAAL do not account for the TPU's skewed dataflow, which results from its spatial array architecture. See below for a HiFiber visualization of this skewed dataflow.
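The `uniform_shape(K0)` partitioning splits the `K` rank into an upper rank `K1` of tiles and a lower rank `K0` of at most `K0` coordinates each (and likewise for `N` with `N0`). The sketch below illustrates that tiling over a plain list of coordinates. `split_uniform` is a hypothetical helper: it returns tiles by position and does not reproduce fibertree's coordinate conventions for the upper rank.

```python
def split_uniform(coords, size):
    """Split a list of coordinates into consecutive tiles of at most `size`."""
    if size <= 0:
        raise ValueError("tile size must be positive")
    return [coords[i:i + size] for i in range(0, len(coords), size)]


# Notebook-sized partitioning: K = 4 split with K0 = 2 gives two K1 tiles.
print(split_uniform(list(range(4)), 2))  # [[0, 1], [2, 3]]

# TPUv1-sized partitioning: a K = 1000 rank with K0 = 256 gives 4 tiles,
# the last one partially filled (1000 = 3 * 256 + 232).
tiles = split_uniform(list(range(1000)), 256)
print(len(tiles), len(tiles[-1]))  # 4 232
```

Because the spacetime mapping places each `K0 x N0` tile pair on the full PE array, the number of `K1` and `N1` tiles determines how many passes over `M` the array makes.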

In [None]:
yaml = """
einsum:
    declaration:
        A: [K, M]
        B: [K, N]
        Z: [M, N]
    expressions:
        - Z[m, n] = A[k, m] * B[k, n]
mapping:
    rank-order:
        A: [K, M]
        B: [K, N]
        Z: [M, N]
    partitioning:
        Z:
            K: [uniform_shape(K0)]
            N: [uniform_shape(N0)]
    loop-order:
        Z: [N1, K1, M, K0, N0]
    spacetime:
        Z: 
            space: [K0, N0]
            time: [N1, K1, M]
"""

utils.compile(yaml)
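The `loop-order` and `spacetime` entries can be read as a plain loop nest: the outer `N1`, `K1`, and `M` loops are time steps, and each (`k0`, `n0`) pair within a tile runs on a separate PE of the `K0 x N0` array. The sketch below is a dense, plain-Python reading of that mapping, assuming list-of-lists inputs with `K` and `N` divisible by the tile sizes. `tpu_matmul` is a hypothetical name, and its time stamps omit the TPU's skewed dataflow.

```python
def tpu_matmul(A, B, K0, N0):
    """Compute Z[m][n] = sum_k A[k][m] * B[k][n] with loop order
    [N1, K1, M, K0, N0], recording (space, time) for each multiply.

    A is K x M and B is K x N (lists of lists); K must be divisible by K0
    and N by N0 in this simplified sketch.
    """
    K, M, N = len(A), len(A[0]), len(B[0])
    assert K % K0 == 0 and N % N0 == 0
    K1, N1 = K // K0, N // N0
    Z = [[0] * N for _ in range(M)]
    activity = []  # (space, time) pairs, one per multiply
    for n1 in range(N1):
        for k1 in range(K1):
            for m in range(M):
                # Time: linearized position in the temporal ranks [N1, K1, M]
                t = (n1 * K1 + k1) * M + m
                for k0 in range(K0):
                    for n0 in range(N0):
                        k = k1 * K0 + k0
                        n = n1 * N0 + n0
                        Z[m][n] += A[k][m] * B[k][n]
                        # Space: the PE at (k0, n0) in the K0 x N0 array
                        activity.append(((k0, n0), t))
    return Z, activity


A = [[1, 2], [3, 4], [5, 6], [7, 8]]                          # K = 4, M = 2
B = [[1, 0, 2, 1], [0, 1, 1, 0], [1, 1, 0, 2], [2, 0, 1, 1]]  # K = 4, N = 4
Z, activity = tpu_matmul(A, B, K0=2, N0=2)
print(Z)  # [[20, 8, 12, 18], [24, 10, 16, 22]]
```

Every time step keeps all `K0 * N0` PEs busy, which is what the unskewed TeAAL visualization shows.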

## Visualize Skewed Dataflow

The TPU uses a skewed dataflow that communicates operands efficiently by passing them only between neighboring PEs. Below is a HiFiber loop nest that visualizes this dataflow. Note that the *only* difference between the code generated by the TeAAL specification above and the HiFiber code below is the `spacetime` argument passed to `canvas.addActivity()` (the function that adds frames to the video).
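The skew adds `k0_pos + n0_pos` to each time step, so the PE at (`k0`, `n0`) processes a given row `m` exactly `k0 + n0` steps after PE (0, 0) does, consistent with operands moving one neighbor per step. The sketch below builds that schedule in plain Python for a single (`n1`, `k1`) tile; `skewed_schedule` is a hypothetical helper. Within one tile, the last operation lands at step `(M - 1) + (K0 - 1) + (N0 - 1)`.

```python
def skewed_schedule(M, K0, N0):
    """Map each time step to the (k0, n0, m) operations performed in it
    for one (n1, k1) tile, using the skewed time m + k0 + n0.
    """
    schedule = {}
    for m in range(M):
        for k0 in range(K0):
            for n0 in range(N0):
                t = m + k0 + n0
                schedule.setdefault(t, []).append((k0, n0, m))
    return schedule


sched = skewed_schedule(M=4, K0=2, N0=2)
print(len(sched))  # 6 steps: M + (K0 - 1) + (N0 - 1)
for t in sorted(sched):
    print(t, sched[t])
```

Because the HiFiber cell advances the base time by only `M` per (`n1`, `k1`) tile, the skewed tail of one tile overlaps the start of the next.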

In [None]:
Z_N1MN0 = Tensor(rank_ids=["N1", "M", "N0"], name="Z")
tmp0 = A_KM
tmp1 = tmp0.splitUniform(K0, depth=0)
A_K1K0M = tmp1
A_K1K0M.setRankIds(rank_ids=["K1", "K0", "M"])
tmp2 = B_KN
tmp3 = tmp2.splitUniform(K0, depth=0)
B_K1K0N = tmp3
B_K1K0N.setRankIds(rank_ids=["K1", "K0", "N"])
tmp4 = B_K1K0N
tmp5 = tmp4.splitUniform(N0, depth=2)
B_K1K0N1N0 = tmp5
B_K1K0N1N0.setRankIds(rank_ids=["K1", "K0", "N1", "N0"])
z_n1 = Z_N1MN0.getRoot()
A_K1MK0 = A_K1K0M.swizzleRanks(rank_ids=["K1", "M", "K0"])
B_N1K1K0N0 = B_K1K0N1N0.swizzleRanks(rank_ids=["N1", "K1", "K0", "N0"])
a_k1 = A_K1MK0.getRoot()
b_n1 = B_N1K1K0N0.getRoot()
canvas = createCanvas(A_K1MK0, B_N1K1K0N0, Z_N1MN0)
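# Number of K1 tiles produced by splitting K into chunks of K0 (integer, rounded up)
num_k1 = (K + K0 - 1) // K0
for n1_pos, (n1, (z_m, b_k1)) in enumerate(z_n1 << b_n1):
    for k1_pos, (k1, (a_m, b_k0)) in enumerate(a_k1 & b_k1):
        for m_pos, (m, (z_n0, a_k0)) in enumerate(z_m << a_m):
            for k0_pos, (k0, (a_val, b_n0)) in enumerate(a_k0 & b_k0):
                for n0_pos, (n0, (z_ref, b_val)) in enumerate(z_n0 << b_n0):
                    z_ref += a_val * b_val
                    # Space: the PE at (k0_pos, n0_pos) in the K0 x N0 array.
                    # Time: the linearized (N1, K1, M) step, skewed by k0_pos + n0_pos
                    # so that operands reach each PE one neighbor hop at a time.
                    step = n1_pos * num_k1 * M + k1_pos * M + m_pos + k0_pos + n0_pos
                    canvas.addActivity((k1, m, k0), (n1, k1, k0, n0), (n1, m, n0), spacetime=((k0_pos, n0_pos), step))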
tmp6 = Z_N1MN0
tmp7 = tmp6.swizzleRanks(rank_ids=["M", "N1", "N0"])
tmp8 = tmp7.mergeRanks(depth=1, levels=1, coord_style="absolute")
tmp8.setRankIds(rank_ids=["M", "N"])
Z_MN = tmp8
displayCanvas(canvas)

## Check Results

Check that the code above (generated or provided) computes the correct result.

**Note**: Run this cell only after executing one of the HiFiber loop nests above, since it checks the output tensor `Z_MN` that they produce.

In [None]:
utils.check_matmul(A_KM, B_KN, Z_MN)
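`utils.check_matmul` comes from this repository's `src` package, and its implementation is not shown here. The sketch below is a comparable check in plain Python, assuming dense list-of-lists inputs; `reference_matmul` and `check_matmul_dense` are hypothetical names. Because `A` is stored with ranks `[K, M]`, the reference computes `Z = A^T B`, i.e., `Z[m][n] = sum_k A[k][m] * B[k][n]`.

```python
def reference_matmul(A, B):
    """Return Z[m][n] = sum_k A[k][m] * B[k][n] for A (K x M) and B (K x N)."""
    K, M, N = len(A), len(A[0]), len(B[0])
    return [[sum(A[k][m] * B[k][n] for k in range(K)) for n in range(N)]
            for m in range(M)]


def check_matmul_dense(A, B, Z):
    """Hypothetical stand-in for a matmul check on dense lists of lists."""
    expected = reference_matmul(A, B)
    if Z != expected:
        raise AssertionError(f"mismatch: expected {expected}, got {Z}")
    return True


A = [[1, 2], [3, 4]]  # K = 2, M = 2
B = [[5, 6], [7, 8]]  # K = 2, N = 2
print(reference_matmul(A, B))  # [[26, 30], [38, 44]]
```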