# Evaluation of a Convolutional Neural Network Architecture
Course project for the class _IoT Based Smart Systems_, carried out by _Riccardo Maria Pesce_ during Academic Year _2021-2022_, under the kind supervision of Professor _Maurizio Palesi_.

## Introduction

### Motivation
With the scientific and technological advancements that have taken place in the past few years, AI techniques have been employed in different fields with great success.

While Deep Learning models are still trained in the cloud (using state-of-the-art machines with specialized hardware such as _GPUs_ or _TPUs_), it is becoming increasingly common to perform inference on the edge, i.e. on the devices themselves, so as to reduce latency and optimize the usage of bandwidth, a relevant issue for constrained devices.

### Objective
The objective of this project is to analyze the performance of a _CNN_ (Convolutional Neural Network) architecture on constrained hardware, and to see how the mapping affects performance in terms of throughput and energy consumption.
In particular, performance can be improved in the following ways:

* Reducing data movement, since moving data currently costs more energy than computing on it. Data movement can be reduced either by reducing the number of times memory is accessed, for instance by employing DRAM located near the _PEs_ (Processing Elements), or by compressing data into fewer bits, which makes each transfer cheaper. These observations show that __memory is the main bottleneck__.

* Maximizing PE parallelism.
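
As a rough illustration of the first point, the sketch below compares the number of bits moved when the same values are stored as 32-bit floats and as 8-bit integers. The weight values, the bit widths and the symmetric quantization scheme are assumptions chosen for illustration; they are not taken from the simulations in this project.

```python
def bits_moved(num_values: int, word_bits: int) -> int:
    """Bits transferred when moving `num_values` values of `word_bits` bits each."""
    return num_values * word_bits


def quantize(values, word_bits):
    """Symmetric linear quantization of floats to signed `word_bits`-bit integers."""
    qmax = 2 ** (word_bits - 1) - 1                     # e.g. 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax or 1.0   # fall back to 1.0 for all-zero input
    return [round(v / scale) for v in values], scale


weights = [0.5, -1.0, 0.25, 0.75]       # illustrative values only
q_weights, scale = quantize(weights, word_bits=8)

full = bits_moved(len(weights), 32)     # 32-bit floating point
small = bits_moved(len(weights), 8)     # 8-bit integers
print(f"traffic: {full} bits -> {small} bits ({full // small}x less)")
```

The saving in traffic comes at the price of a quantization error, so the bit width is a trade-off between energy and accuracy.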

### Mapping
Mapping defines the order of execution of the MAC (multiply-accumulate) operations. The ordering can be either _temporal_, when operations are mapped serially onto the same PE (i.e. the temporal order of execution), or _spatial_, when operations are mapped onto multiple PEs to execute in parallel.
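
The difference between the two orderings can be sketched with a toy scheduler. The functions below are hypothetical helpers, not part of Timeloop: each MAC is assigned a `(cycle, pe, mac)` triple, assuming every MAC takes one cycle and there are no data dependencies between them.

```python
def temporal_schedule(num_macs):
    """Temporal mapping: all MACs run one after another on PE 0."""
    return [(cycle, 0, cycle) for cycle in range(num_macs)]


def spatial_schedule(num_macs, num_pes):
    """Spatial mapping: up to `num_pes` MACs run in parallel in each cycle."""
    return [(mac // num_pes, mac % num_pes, mac) for mac in range(num_macs)]


def cycles(schedule):
    """Number of cycles needed to complete a schedule."""
    return max(cycle for cycle, _, _ in schedule) + 1


print(cycles(temporal_schedule(8)))     # 8 MACs on a single PE
print(cycles(spatial_schedule(8, 4)))   # 8 MACs spread over 4 PEs
```

A real mapping usually mixes the two: some loop dimensions are unrolled spatially across the PEs, while the remaining ones are iterated temporally on each PE.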

### Introduction to Timeloop and Accelergy
To design a DNN accelerator correctly, we need to cater for different DNN architectures, and for each of them we have to find an optimal mapping of its workload onto a specific hardware architecture.
This is what Timeloop and Accelergy do. In particular:
* Timeloop generates a characterization of the energy efficiency of each workload, through a mapper that finds the optimal way to schedule operations on a specified architecture. To do so, Timeloop uses a concise and unified representation of the core elements generally found in DNN accelerators.
* Accelergy, on the basis of this characterization, provides a reasonably accurate estimate of energy consumption.

For a more complete introduction to Timeloop/Accelergy, please refer to [the official website](http://accelergy.mit.edu/tutorial.html).
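
To give an idea of the kind of workload being mapped, the sketch below treats a convolutional layer as a seven-dimensional loop nest over batch (`N`), output channels (`K`), input channels (`C`), output height and width (`P`, `Q`) and filter height and width (`R`, `S`), which are the dimension names commonly used in Timeloop problem descriptions. The helper functions are hypothetical, and stride 1 is assumed in the indexing comment.

```python
from itertools import product


def conv_macs(N, K, C, P, Q, R, S):
    """Closed-form MAC count of a convolutional layer."""
    return N * K * C * P * Q * R * S


def count_by_loop_nest(N, K, C, P, Q, R, S):
    """Same count, obtained by walking the loop nest that a mapping reorders and tiles."""
    macs = 0
    for n, k, c, p, q, r, s in product(*(range(d) for d in (N, K, C, P, Q, R, S))):
        # One MAC: output[n][k][p][q] += input[n][c][p + r][q + s] * weight[k][c][r][s]
        macs += 1
    return macs


# Tiny layer: both ways of counting agree.
tiny = dict(N=1, K=2, C=3, P=4, Q=4, R=3, S=3)
assert conv_macs(**tiny) == count_by_loop_nest(**tiny)

# First convolution of ResNet-18 on a 224x224 RGB image (7x7 filters, 64 output channels).
print(conv_macs(N=1, K=64, C=3, P=112, Q=112, R=7, S=7))  # 118013952
```

A mapper explores different orderings and tilings of these seven loops. The number of MACs stays the same, but the number of memory accesses at each level of the hierarchy does not.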

## Simulation

### Objective

In this project, we want to find an efficient mapping of the _ResNet-18_ model onto simple DNN accelerator architectures, and see how varying their parameters influences energy consumption, performance and occupied area.

### ResNet-18

We want to give a brief overview of the _ResNet-18_ architecture. 

![ResNet-18 Architecture](./assets/resnet18.png)

The key points, as highlighted in the above picture, are:

* ResNet-18 architecture has __4 stages__.
* Input height and width must be multiples of 32, and the number of input channels must be equal to 3.
* The main innovation of the ResNet architecture is the _Identity Connection_, which allows the input feature map of a layer to skip some blocks and be summed with the output feature map of the skipped layers before passing through the activation function (commonly the _ReLU_). This is a very successful strategy for improving accuracy, because it limits the _vanishing gradient_ problem, which often occurs in very deep neural networks.
* ResNet uses Batch Normalization to mitigate the _Internal Covariate Shift_ problem.
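
The identity connection described above can be written as `y = ReLU(F(x) + x)`. The following is a minimal sketch on plain Python lists, where `f` is a hypothetical stand-in for the skipped layers (in ResNet-18, two 3x3 convolutions with batch normalization):

```python
def relu(xs):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in xs]


def residual_block(x, f):
    """Add the block input back to the block output, then apply the activation."""
    fx = f(x)                                   # output of the skipped layers
    return relu([a + b for a, b in zip(fx, x)])


def toy_layers(x):
    """Hypothetical stand-in for the skipped layers."""
    return [-0.5 * v for v in x]


print(residual_block([2.0, -4.0], toy_layers))  # [1.0, 0.0]
```

Because the input is added back unchanged, the gradient always has a direct path through the sum, regardless of what the skipped layers compute.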

For a better understanding, please check out the original paper on [ArXiv](https://arxiv.org/abs/1512.03385).
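
The multiple-of-32 requirement on the input size listed above can be read as a consequence of the downsampling along the network. The sketch below uses the strides of the standard ResNet-18 (stride-2 first convolution and max pooling, then one stride-2 convolution at the start of stages 2 to 4). As a simplification, it assumes that padding makes every stride-2 layer halve the feature map exactly.

```python
# (layer, stride) pairs along the ResNet-18 backbone
STRIDES = [("conv1", 2), ("maxpool", 2), ("stage1", 1),
           ("stage2", 2), ("stage3", 2), ("stage4", 2)]


def total_stride():
    """Overall downsampling factor between the input and the last feature map."""
    factor = 1
    for _, stride in STRIDES:
        factor *= stride
    return factor


def feature_map_sizes(size):
    """Spatial size (height or width) after each layer, for a given input size."""
    sizes = []
    for name, stride in STRIDES:
        size //= stride
        sizes.append((name, size))
    return sizes


print(total_stride())            # 32
print(feature_map_sizes(224))    # ends with ('stage4', 7)
```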

The model used for this simulation was obtained through [pytorch2timeloop-converter](https://github.com/Accelergy-Project/pytorch2timeloop-converter), a useful tool for obtaining Timeloop-compliant descriptions of PyTorch models.


### Accelerator architecture

We are going to try the ResNet-18 workload on some simple architectures, which will mainly consist of a main memory, a buffer and different processing elements (PEs).

The mapper configuration can be found [here](./ResNet18/mapper/mapper.yaml). It runs eight threads in parallel to search for the mapping that optimizes _delay_ and _energy_, using the _random-pruned_ algorithm, which stops once it has found 100 mappings that perform worse than the best one found so far.
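
The stopping rule can be illustrated with a toy random search. This is only a sketch of the idea, not Timeloop's implementation: the `cost` and `sample` functions are hypothetical, and the sketch assumes that the counter refers to consecutive candidates that fail to improve on the best one.

```python
import random


def random_search(cost, sample, victory_condition=100, seed=0):
    """Sample candidates until `victory_condition` consecutive ones fail to improve."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    since_improvement = 0
    while since_improvement < victory_condition:
        candidate = sample(rng)
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
            since_improvement = 0           # a better candidate resets the counter
        else:
            since_improvement += 1
    return best, best_cost


def toy_cost(tile):
    """Hypothetical cost: distance of a tile size from an arbitrary optimum of 16."""
    return abs(tile - 16)


best, best_cost = random_search(toy_cost, lambda rng: rng.randint(1, 64))
print(best, best_cost)
```

A larger threshold makes the search slower but less likely to stop at a poor mapping.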

Let's now run the simulation, remembering to __spin up Docker__, since Timeloop/Accelergy runs inside it. We also use _Python_ to run the simulations for every layer automatically. First, open a terminal window and start the container with the command `docker-compose run --rm exercises`. Then place the `ResNet18` folder inside the newly created `workspace` folder. The snippet below generates the bash file that is run inside the container shell to simulate all the layers for this first configuration. Before running it, you might need to run `chmod u+x bash_script.sh`.

```python
# Helper functions shipped with this project
from utils import *

generate_bash_script()
```
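
The body of `generate_bash_script` is not shown here. As a rough idea of what such a helper might do, the hypothetical `build_script` below emits one `timeloop-mapper` invocation per layer. The file paths and layer names are assumptions made for illustration, and the `-o` output-directory option should be checked against the installed Timeloop version.

```python
from pathlib import Path


def build_script(layer_files, arch="arch/base.yaml",
                 mapper="mapper/mapper.yaml", out_root="output"):
    """Return the text of a bash script that runs the mapper once per layer."""
    lines = ["#!/bin/bash", "set -e"]
    for layer in layer_files:
        out_dir = f"{out_root}/{Path(layer).stem}"   # one output folder per layer
        lines.append(f"mkdir -p {out_dir}")
        lines.append(f"timeloop-mapper {arch} {layer} {mapper} -o {out_dir}")
    return "\n".join(lines) + "\n"


# Hypothetical layer files, named for illustration only
layers = [f"layers/layer{i}.yaml" for i in range(1, 4)]
script = build_script(layers)
print(script)
```

Writing the returned text to `bash_script.sh` and making it executable gives a script that can be launched from the container shell.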

Now we want to run the different configurations and check the performance indicators of each one, using Python and the Pandas package.
Let us first review the configurations considered so far:

* `base`
    * __Main Memory__ (_DRAM_), with the following attributes `width = 256`, `block-size = 32` and `word-bits = 1`
    * __Global Buffer__ (_smartbuffer SRAM_), with the following attributes `depth = 12`, `width = 16`, `block-size = 16` and `word-bits = 1`
    * __PEs__ (16)
        * __Register File__ with the following attributes `depth = 16`, `width = 8`, `block-size = 8` and `word-bits = 1`
        * __MACC__ (_intmac_) with the following attribute `data-width = 16`
* `base_bigger_buffer`, buffer size
* `base_smaller_buffer`
* `base_bigger_rf`
* `base_smaller_rf`
* `base_more_pes`
* `base_less_pes`
* `base_bigger_dw`
* `base_smaller_dw`
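
To get a feeling for how small the on-chip storage of the `base` configuration is, the sketch below computes the capacity of the global buffer and of the register files. It assumes that `depth` is the number of rows and `width` the number of bits per row, so that capacity is their product. The helper function is hypothetical.

```python
def capacity_bits(depth: int, width: int) -> int:
    """Storage capacity in bits, assuming `depth` rows of `width` bits each."""
    return depth * width


NUM_PES = 16                                             # PEs in the `base` configuration

global_buffer = capacity_bits(depth=12, width=16)        # shared by all PEs
register_file = capacity_bits(depth=16, width=8)         # one per PE
all_register_files = NUM_PES * register_file

print(f"global buffer:      {global_buffer} bits")
print(f"register file:      {register_file} bits per PE")
print(f"all register files: {all_register_files} bits")
```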

```python
import pandas as pd

# Suppress scientific notation: show floats with two decimal places
pd.set_option("display.float_format", lambda x: "%.2f" % x)
```


