# Bit-serial computation - Dot product

This notebook illustrates the bit-serial computation of the dot product of an operand `A` with an operand `B`, where the values of `A` are viewed at the bit level. The computation can be expressed as a tensor computation in which the bit-level representation of `A` is a rank-2 tensor: each fiber in the lower rank is sparse and holds a 1 at exactly those coordinates that correspond to the bit positions containing a 1 in the binary representation of the value. The operand `B` is simply a rank-1 tensor of values. As a result, the computation can be represented with the following Einsum:

$$
Z = A_{i,j} \times B_i \times 2^j
$$

This representation of the calculation allows us to consider different dataflows and parallelism options, which are illustrated below.
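
As a rough plain-Python sketch of what the Einsum computes (independent of the tensor library used in the rest of the notebook), the bit-level view of `A` can be modeled as a list of bit-position lists. The values in `a_values` and `b_values` are illustrative assumptions, not the tensors generated below.

```python
# Illustrative inputs (assumed for this sketch, not the notebook's tensors).
a_values = [71, 41, 49, 162]   # operand A, one 8-bit value per coordinate i
b_values = [3, 1, 2, 5]        # operand B, one value per coordinate i
J = 8                          # number of bits in each A value

# Bit-level view of A: for each i, the coordinates j whose bit is 1.
a_bits = [[j for j in range(J) if (a >> j) & 1] for a in a_values]

# Z = A_{i,j} * B_i * 2^j, reduced over both i and j.
z = 0
for i, fiber in enumerate(a_bits):
    for j in fiber:
        z += 1 * b_values[i] * 2**j   # A_{i,j} is always 1 where present

# Reference: the ordinary dot product on the original values.
expected = sum(a * b for a, b in zip(a_values, b_values))
print(z, expected)
```

The two printed numbers should match, since summing `2**j` over the set bits of a value reconstructs that value.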

## Setup

The first step is to set up the environment and create some tensors.

In [None]:
# Run boilerplate code to set up environment

%run ../prelude.py --style=tree --animation=movie

## Configure some tensors

In [None]:
# Default value for the number of elements in the dot product
I = 4

# Default value for the number of bits in each element of the `A` tensor
J = 8

tm = TensorMaker("dot product inputs")

tm.addTensor("A_IJ", rank_ids=["I", "J"], shape=[I, J], density=0.6, interval=1, seed=0, color="blue")
tm.addTensor("B_I", rank_ids=["I"], shape=[I], density=1, seed=1, color="green")

tm.displayControls()

## Create and display the tensors

In [None]:
A_IJ = tm.makeTensor("A_IJ")
A_JI = A_IJ.swapRanks().setName("A_JI")
B_I = tm.makeTensor("B_I")

#
# Calculate binary value of A from bit-wise representation
#
a_values = []
for i, a_j in A_IJ:
    a_value = 0
    for j, _ in a_j:
        a_value += 2**j
    a_values.append(a_value)

print(f"A_IJ (with values {a_values})")
displayTensor(A_IJ)

print("A_JI")
displayTensor(A_JI)

print("B")
displayTensor(B_I)

## Create power array

Although the original Einsum notation includes a multiplication by a value that is a function only of an index value (`2^j`), this code expresses that as a multiplication by a value from a constant rank-1 tensor (`pow2`). In real hardware, this would probably be implemented directly, in this case as a **shift**.
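
The equivalence between multiplying by `2^j` and shifting left by `j` can be checked with a short plain-Python sketch. The value `b_val = 5` is an arbitrary example, and the equivalence as written assumes integer operands.

```python
J = 8
b_val = 5  # arbitrary example value (an assumption for this sketch)

# Multiplying by a power of two and shifting left give the same integer.
for j in range(J):
    assert b_val * 2**j == b_val << j

# The partial products a bit-serial unit would add for set bits j.
shifted = [b_val << j for j in range(J)]
print(shifted)
```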

In [None]:
pow2 = Tensor(rank_ids=["J"], shape=[J], name="Pow2", color="lightblue")

pow2_j = pow2.getRoot()

for j, pow2_ref in pow2_j.iterShapeRef():
    pow2_ref <<= 2 ** j
    
displayTensor(pow2)

## Serial execution

Observations:

- Since both `a_i` and `b_i` are dense, they can be uncompressed and their intersection is trivial.
- Since `a_j` is compressed and `pow2` can be uncompressed, their intersection can be leader-follower.
- Elapsed time is proportional to the total occupancy of all the fibers in the `J` rank of `A_IJ`.
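
The observations above can be sketched in plain Python, without the tensor library: the compressed fiber acts as the leader, the uncompressed `pow2` array is looked up by coordinate as the follower, and one step is counted per multiply-accumulate. The input values and the `cycles` counter are assumptions made for illustration.

```python
J = 8
a_values = [71, 41, 49, 162]        # illustrative A values (assumed)
b_values = [3, 1, 2, 5]             # illustrative B values (assumed)
pow2 = [2**j for j in range(J)]     # uncompressed follower, indexed by j

# Compressed fibers: only the coordinates of the 1 bits are stored.
a_fibers = [[j for j in range(J) if (a >> j) & 1] for a in a_values]

z = 0
cycles = 0
for i, a_j in enumerate(a_fibers):  # dense ranks: walk i in lockstep
    b_val = b_values[i]
    for j in a_j:                   # leader: the compressed fiber
        z += b_val * pow2[j]        # follower: direct lookup by coordinate
        cycles += 1                 # one step per nonzero bit

total_occupancy = sum(len(f) for f in a_fibers)
print(z, cycles, total_occupancy)
```

In this model the step count equals the total occupancy of the `J`-rank fibers, i.e., the total number of 1 bits across all `A` values.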

In [None]:
z = Tensor(rank_ids=[], name="Dot Prod")

a_i = A_IJ.getRoot()
b_i = B_I.getRoot()
pow2_j = pow2.getRoot()

z_ref = z.getRoot()

canvas = createCanvas(A_IJ, B_I, pow2, z)

for i, (a_j, b_val) in a_i & b_i:
    for j, (a_val, pow2_val) in a_j & pow2_j:
        # Note: a_val can only be 1, so the multiplication by it is not strictly needed.
        z_ref += (a_val * b_val) * pow2_val
        canvas.addFrame((i,j),(i,),(j,), (0,))
        
displayTensor(z)
displayCanvas(canvas)

## Parallel across B's

Observations: 

- Time is equal to the occupancy of the longest fiber in the `J` rank of `A`
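
A minimal plain-Python sketch of this observation, assuming one processing element (PE) per coordinate `i` and one step per 1 bit; the `a_values` are illustrative and not the notebook's tensor.

```python
a_values = [71, 41, 49, 162]   # illustrative A values (assumed)

# PE i is busy for one step per 1 bit in a_values[i].
busy = [bin(a).count("1") for a in a_values]

latency = max(busy)       # all PEs run concurrently; the longest fiber dominates
serial_time = sum(busy)   # for comparison: the fully serial schedule
print(busy, latency, serial_time)
```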

In [None]:
z = Tensor(rank_ids=[], name="Dot Prod")

a_i = A_IJ.getRoot()
b_i = B_I.getRoot()
pow2_j = pow2.getRoot()

z_ref = z.getRoot()

canvas = createCanvas(A_IJ, B_I, pow2, z)

for i, (a_j, b_val) in a_i & b_i:
    for n_j, (j, (a_val, pow2_val)) in enumerate(a_j & pow2_j):
        z_ref += (a_val * b_val) * pow2_val
        canvas.addActivity((i,j),(i,),(j,), (0,),
                          spacetime=(i, n_j))
        
displayTensor(z)
displayCanvas(canvas)

## Parallel across bits (with infinite parallelism)

Observations:

- Latency is the occupancy of the longest fiber in the `I` rank of the `A_JI` tensor.


Note: If there are too many "j"s, one approximation would be to ignore the low-order bits.
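
The following plain-Python sketch illustrates both points under stated assumptions: one PE per bit position `j`, illustrative input values, and a hypothetical choice of `K = 2` dropped low-order bits. The error bound shown holds only for non-negative `B` values.

```python
J = 8
a_values = [71, 41, 49, 162]   # illustrative A values (assumed)
b_values = [3, 1, 2, 5]        # illustrative, non-negative B values (assumed)

# Occupancy of each I-rank fiber of the swapped (J, I) view of A.
occupancy = [sum((a >> j) & 1 for a in a_values) for j in range(J)]
latency = max(occupancy)       # one PE per bit position j

# Approximation: ignore the K low-order bits of every A value.
K = 2                          # hypothetical number of dropped bits
mask = ~((1 << K) - 1)
exact = sum(a * b for a, b in zip(a_values, b_values))
approx = sum((a & mask) * b for a, b in zip(a_values, b_values))
error = exact - approx

# Worst case: every dropped bit was 1 for every element.
bound = ((1 << K) - 1) * sum(b_values)
print(occupancy, latency, error, bound)
```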

In [None]:
z = Tensor(rank_ids=[], name="Dot Prod")

a_j = A_JI.getRoot()
b_i = B_I.getRoot()
pow2_j = pow2.getRoot()

z_ref = z.getRoot()

canvas = createCanvas(A_IJ, A_JI, B_I, pow2, z)

for j, (a_i, pow2_val) in a_j & pow2_j:
    for n_i, (i, (a_val, b_val)) in enumerate(a_i & b_i):
        z_ref += (a_val * b_val) * pow2_val
        canvas.addActivity((i,j),(j,i),(i,),(j,), (0,),
                          spacetime=(j, n_i))
        
displayTensor(z)
displayCanvas(canvas)

## Parallel across bits (limited parallelism)

This version limits the parallelism to `I` processing elements, to make a fair comparison with the `B`-parallel design. In this design, there is a barrier between the processing of each group of `I` bits, i.e., between the processing of each fiber of the `j0` rank of the split `A_JI` tensor.

Observations:

- Latency is the sum, over the fibers in the `j0` rank of the split `A_JI` tensor, of the largest `I`-rank occupancy within each of those fibers.
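
As a rough plain-Python model of the barrier design, bit positions can be grouped into chunks of `I`, with each group taking as long as its busiest bit position. The input values are illustrative assumptions.

```python
I, J = 4, 8
a_values = [71, 41, 49, 162]   # illustrative A values (assumed)

# Number of 1s at each bit position j across all A values.
occupancy = [sum((a >> j) & 1 for a in a_values) for j in range(J)]

# Split the J bit positions into groups of I consecutive positions.
groups = [occupancy[j1:j1 + I] for j1 in range(0, J, I)]

# With a barrier, each group waits for its slowest bit position.
latency = sum(max(group) for group in groups)
print(groups, latency)
```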

In [None]:
A_JI_split = A_JI.splitUniform(I)

displayTensor(A_JI_split)

In [None]:
z = Tensor(rank_ids=[], name="Dot Prod")


a_j1 = A_JI_split.getRoot()
b_i = B_I.getRoot()
pow2_j = pow2.getRoot()

z_ref = z.getRoot()

canvas = createCanvas(A_IJ, A_JI_split, B_I, pow2, z)

for n_j1, (j1, a_j0) in enumerate(a_j1):
    for j, (a_i, pow2_val) in a_j0 & pow2_j:
        for n_i, (i, (a_val, b_val)) in enumerate(a_i & b_i):
            z_ref += (a_val * b_val) * pow2_val
            canvas.addActivity((i,j),(j1,j,i),(i,),(j,), (0,),
                                spacetime=(j-j1, (n_j1,n_i)))
        
displayTensor(z)
displayCanvas(canvas)

## Parallel across bits (limited parallelism, with slip)

This version allows slip between groups of bits, i.e., it relaxes the bit-level barrier. However, each PE still works on a fixed position in each fiber in `a_j0`.

Observations:

- Each PE is busy for the sum of the occupancies of the `a_i` fibers at that PE's position in the `a_j0` fibers.
- Latency is equal to the busy time of the long-pole PE, i.e., the PE with the most work.
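
A plain-Python sketch of the per-PE accounting, assuming the groups start at bit 0 so that the PE for bit position `j` is `j % I`. The input values are illustrative, and the barrier latency is computed alongside only for comparison.

```python
I, J = 4, 8
a_values = [71, 41, 49, 162]   # illustrative A values (assumed)

# Number of 1s at each bit position j across all A values.
occupancy = [sum((a >> j) & 1 for a in a_values) for j in range(J)]

# PE p handles bit positions p, p + I, p + 2*I, ... with no barrier.
pe_busy = [sum(occupancy[j] for j in range(pe, J, I)) for pe in range(I)]
latency_slip = max(pe_busy)    # the long-pole PE sets the latency

# For comparison: the same work with a barrier after each group of I bits.
latency_barrier = sum(max(occupancy[j1:j1 + I]) for j1 in range(0, J, I))
print(pe_busy, latency_slip, latency_barrier)
```

In this example the slip version finishes earlier than the barrier version because the busiest bit positions of the two groups land on different PEs.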

In [None]:
z = Tensor(rank_ids=[], name="Dot Prod")


a_j1 = A_JI_split.getRoot()
b_i = B_I.getRoot()
pow2_j = pow2.getRoot()

z_ref = z.getRoot()

cycles = I*[0]

canvas = createCanvas(A_IJ, A_JI_split, B_I, pow2, z)

for n_j1, (j1, a_j0) in enumerate(a_j1):
    for j, (a_i, pow2_val) in a_j0 & pow2_j:
        for n_i, (i, (a_val, b_val)) in enumerate(a_i & b_i):
            z_ref += (a_val * b_val) * pow2_val
            pe = j-j1
            canvas.addActivity((i,j),(j1,j,i),(i,),(j,), (0,),
                                spacetime=(pe, cycles[pe]))
            cycles[pe] += 1
        
displayTensor(z)
displayCanvas(canvas)