# Homework 07 – Implementation

Arthur J. Redfern arthur.redfern@utdallas.edu

## 0 Outline

- 1 Reading
- 2 Theory
- 3 Practice

# 1 Reading

1. Implementation

Motivation: understand xNN implementation <a href="https://github.com/arthurredfern/UT-Dallas-CS-6301-CNNs/blob/master/Lectures/xNNs">https://github.com/arthurredfern/UT-Dallas-CS-6301-CNNs/blob/master/Lectures/xNNs</a> 070 Implementation.pdf

#### Complete

Efficient processing of deep neural networks: a tutorial and survey
 Motivation: an alternative presentation of xNN hardware circa 2017
 https://arxiv.org/abs/1703.09039

#### Complete

3. A new golden age for computer architecture

Motivation: a very nice talk from the Turing award winners on the past and future of hardware and software, available in text and video form <a href="https://cacm.acm.org/magazines/2019/2/234352-a-new-golden-age-for-computer-architecture/fulltext">https://cacm.acm.org/magazines/2019/2/234352-a-new-golden-age-for-computer-architecture/fulltext</a>

https://www.youtube.com/watch?v=3LVeEjsn8Ts

#### Complete

## 2 Theory

4. Write out all of the terms of Strassen based matrix multiplication for  $\mathbf{C} = \mathbf{A} \mathbf{B}$  with BLAS dimensions  $\mathbf{M} = \mathbf{N} = \mathbf{K} = 4$  by applying the Strassen decomposition twice (an initial decomposition then a recursive decomposition). Use the following notation for the scalars in the 3 matrices

Apologies for the un beautiful matrix formatting, but the above is meant to represent C = A B.

### 3 Practice

- 5. We've seen the importance of quantization to reducing memory, reducing data movement and increasing compute (with the same resources). To gain more experience with quantization, read the following introduction of post training quantization and its implementation in TensorFlow Lite and work through the following examples (though remember that quantization during training is better):
  - https://www.tensorflow.org/lite/performance/post training quantization
  - <a href="https://www.tensorflow.org/lite/performance/post-training-quant-">https://www.tensorflow.org/lite/performance/post-training-quant-</a>
  - https://www.tensorflow.org/lite/performance/post training integer quant
  - https://www.tensorflow.org/lite/performance/post training float16 quant

#### Complete

6. [**Optional**] Creating tools to automate work is a common activity when working with xNNs. For this problem, you will build a tool to predict performance using a simplified network specification for a simplified hardware architecture specification.

The network specification can come from scraping the graph of the network created by TensorFlow (preferred), or it can be hand specified as follows. Specify a network as a text file using the following to describe each layer (ID is a unique identifier like a unique number)

```
Operation IDx
Parameters
Operation type, parameter 0, parameter 1, ...
Input tensors
Previous layer IDa, precision, pointer, dimension 0, dimension 1, ...
```

```
Previous layer IDb, precision, pointer, dimension 0, dimension 1, ...
...
Output tensors
Precision, pointer, dimension 0, dimension 1, ...
Precision, pointer, dimension 0, dimension 1, ...
...
```

Note that this format encodes all tensor dimensions directly into the network description. While suboptimal from an experimentation perspective (e.g., changing the input size to the network requires changing parameters in all layers), this is convenient for a deployment perspective as it allows for better optimization with respect to memory placement and data movement. Pointer is a memory address, but we won't use it here so you can ignore it (I'm not going to make you figure out optimal memory placement from a buffer reuse perspective).

Previous layer IDa, IDb, ... are included with the input tensors to indicate the previous layer that needs to complete before this input is available, a convenient value to track for graph optimization. Input tensors can be input feature maps, filter coefficients, ... basically any value that's not a parameter encoded with the operation. If the input tensor is an external edge to the graph (e.g., the network input, a filter weight, ...) the previous layer ID can be set to -1 indicating that there is no previous layer.

For a CNN style 2D convolution operation, example parameters include:

Input feature map grouping
Input feature map row pad
Input feature map col pad
Filter row stride
Filter col stride
Filter row dilation
Filter col dilation

Using a strategy similar to that of CNN style 2D convolution, additional operations can be specified for bias addition, ReLU, max pooling, average pooling, global average pooling, matrix multiplication, ... and other layer components as necessary.

For the network to specify in the above format, choose 1 or more of: ResNet-50, ResNeXt-50, MobileNet V2, MobileNet V3, Inception V4, NASNet, MnasNet or AmoebaNet. Set the input tensor for the network to  $1 \times 3 \times 512 \times 1024$  at 16 bit precision. Also assume all filter coefficients are quantized to 16 bit precision.

Why am I having you go through all this trouble? I want you to think of networks as graphs of operations controlled by parameters that map input tensors to output tensors. This is a key for understanding performance and optimal implementation strategies.

Next, specify the hardware as a text file using the following parameters (an example value is provided for each parameter in <>)

| Data movement |                                    |              |
|---------------|------------------------------------|--------------|
|               | DDR bits                           | <64 b / 8 B> |
|               | DDR frequency                      | <3200 MHz>   |
|               | DDR availability                   | <0.50>       |
|               | DDR efficiency                     | <0.80>       |
| Memory        |                                    |              |
|               | Off device                         | <∞>          |
|               | On device                          | <4 MB>       |
| Compute       |                                    |              |
|               | Frequency (cycles per second)      | <1 GHz>      |
|               | Matrix M, N, K at 8 bit precision  | <32, 32, 32> |
|               | Matrix M, N, K at 16 bit precision | <32, 16, 16> |
|               | Matrix M, N, K at 32 bit precision | <32, 8, 8>   |
|               | Cycles per matrix tile             | <32>         |
|               | Vector N at 8 bit precision        | <32>         |
|               | Vector N at 16 bit precision       | <16>         |
|               | Vector N at 32 bit precision       | <8>          |
|               | Cycles per vector op               | <1>          |
|               |                                    |              |

The matrix op will tile CNN style 2D convolution lowered to matrix multiplication. If the  $M_{conv}$ ,  $N_{conv}$  and  $K_{conv}$  values of CNN style 2D convolution are not an integer multiple of the M, N and K values of the matrix operation, then use  $M_{tiles} = ceil(M_{conv}/M)$ ,  $N_{tiles} = ceil(N_{conv}/N)$  and  $K_{tiles} = ceil(K_{conv}/K)$ , note that this represents a potential computational inefficiency. The matrix operation will also be used for matrix matrix multiplication with the same tiling strategy. The matrix compute time for the CNN style 2D convolution and matrix matrix multiplication operations is

```
Matrix time = (M_{tiles} * N_{tiles} * K_{tiles}) * (cycles per matrix tile) / (frequency)
```

All other operations will assumed to be at vector speed and their vector time is approximated as

```
Vector time = ceil(number of ops / vector N) * (cycles per vector op) / (frequency)
```

The total compute time for an operation is

```
Compute time = matrix time + vector time
```

In order to determine data movement time, locations for each of the tensors needs to be determined. As a first order approximation, mark all filter coefficient tensors as off device. Also mark the network input image tensor and output prediction tensor as off device. For all other

feature map tensors, mark them as on device if their size is smaller than that of the on device memory and mark them off device otherwise.

For all tensors marked off device, in order for them to be used in an operation, they need to be moved on device. The amount of time that it takes to move a tensor from off device to on device is

```
Tensor move time = ((precision / 8) * dimension 0 * dimension 1 * ...) / (DDR bandwidth)
```

where DDR bandwidth = (DDR bits / 8) \* (DDR frequency) \* (DDR availability) \* (DDR efficiency). The data movement time for an operation is the data movement time for all tensors that need to be moved from off device to on device for that operation

```
Data movement time = tensor move time 1 + tensor move time 2 + ...
```

Now, the total time for an operation can be bound as

```
Serial operation time = (compute time) + (data movement time)
Parallel operation time = max(compute time, data movement time)
```

Repeat the serial operation time and parallel operation time calculation for all operations in the graph. Then sum these results for an overall bound on the network time

```
Serial network time = serial operation time 1 + serial operation time 2 + ...

Parallel network time = parallel operation time 1 + parallel operation time 2 + ...
```

So to summarize, you've generated

- 1. A text file describing all operations in a chosen network
- 2. A performance prediction program for
  - a. Estimating the compute time for each operation
  - b. Determining a memory placement for each tensor
  - c. Estimating the data movement time for each operation
  - d. Estimating a serial and parallel time bound for each operation
  - e. Estimating a serial and parallel time bound for the full network

The final step is to have the performance prediction program write the results to a CSV file for each layer and the overall network.