# Evaluation of a Convolutional Neural Network Architecture
Course project for the class _IoT Based Smart Systems_, carried out by _Riccardo Maria Pesce_ during Academic Year _2021-2022_, under the kind supervision of Professor _Maurizio Palesi_.

## Introduction

### Motivation
With the scientific and technological advancements that have taken place in the past few years, AI techniques have been employed in many different fields with great success.

While Deep Learning models are still trained in the cloud (using state-of-the-art machines with specialized hardware such as _GPUs_ or _TPUs_), it is becoming increasingly common to perform inference on the edge, i.e. on the devices themselves, so as to reduce latency and optimize the usage of bandwidth, a relevant issue for constrained devices.

### Objective
The objective of this thesis is to analyze the performance of a _CNN_ (Convolutional Neural Network) architecture on constrained hardware, and to see how the mapping affects performance in terms of throughput and energy consumption.
In particular, performance improvements are obtained in the following ways:

* By reducing data movement, since communication is nowadays more expensive than computation in terms of energy. We can either reduce the number of times memory is accessed, for instance by employing DRAM located near the _PEs_ (Processing Elements), or compress the data so that fewer bits are needed to represent it, which makes each transfer cheaper. These observations show that __memory is the main bottleneck__.

* By maximizing the parallelism of the PEs.
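
The two ways of reducing data movement listed above can be illustrated with some simple arithmetic. The sketch below is only illustrative: the tensor size, the word widths and the number of accesses are made-up values, and `traffic_bits` is a hypothetical helper, not part of Timeloop or Accelergy.

```python
def traffic_bits(num_words: int, word_bits: int, accesses_per_word: int = 1) -> int:
    """Total number of bits moved between memory and the PEs."""
    return num_words * word_bits * accesses_per_word


# Hypothetical tensor: 1,000 weights stored as 32-bit words, each fetched 4 times.
baseline = traffic_bits(1_000, word_bits=32, accesses_per_word=4)

# Option 1: reuse data locally so that each word is fetched only once.
fewer_accesses = traffic_bits(1_000, word_bits=32, accesses_per_word=1)

# Option 2: keep the access pattern but compress each word to 8 bits.
narrower_words = traffic_bits(1_000, word_bits=8, accesses_per_word=4)

print(baseline, fewer_accesses, narrower_words)
```

With these numbers, both options cut the traffic to a quarter of the baseline, which is why they are usually combined.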

### Mapping
Mapping defines the order of execution of the MAC (multiply-accumulate) operations. The ordering can be either _temporal_, when operations are mapped serially onto the same PE (i.e. the temporal order of execution), or _spatial_, when operations are mapped onto multiple PEs so that they execute in parallel.
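
As a minimal sketch of the difference, consider a dot product made of four MAC operations. The function names below are hypothetical, and the "PEs" are simulated sequentially in plain Python; on real hardware the partial sums of the spatial version would be computed concurrently.

```python
def mac(acc, w, x):
    """One multiply-accumulate operation."""
    return acc + w * x


def temporal_dot(weights, inputs):
    """Temporal mapping: every MAC runs, one after another, on a single PE."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc = mac(acc, w, x)
    return acc


def spatial_dot(weights, inputs, num_pes):
    """Spatial mapping: MACs are distributed round-robin over num_pes PEs.

    Each PE accumulates its own partial sum; the partial sums are then reduced.
    """
    partial = [0] * num_pes
    for i, (w, x) in enumerate(zip(weights, inputs)):
        pe = i % num_pes
        partial[pe] = mac(partial[pe], w, x)
    return sum(partial)


weights = [1, 2, 3, 4]
inputs = [5, 6, 7, 8]
print(temporal_dot(weights, inputs), spatial_dot(weights, inputs, num_pes=2))
```

Both versions produce the same result; what changes is how many MACs each PE has to execute serially (four on one PE versus two on each of two PEs).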

### Introduction to Timeloop and Accelergy
In order to design a DNN accelerator correctly, we need to cater for the different DNN architectures, and for each of them we have to find an optimal mapping of its workloads onto a specific hardware architecture.
This is what Timeloop and Accelergy do. In particular:
* Timeloop generates a characterization of the energy efficiency of each workload, through a mapper which finds the optimal way to schedule operations on a specified architecture. To do so, Timeloop uses a concise and unified representation of the core elements which are generally found in DNN accelerators.
* Accelergy, on the basis of the characterization created above, provides a good estimate of the energy consumption.

For a more complete introduction to Timeloop/Accelergy, please refer to [the official website](http://accelergy.mit.edu/tutorial.html).

## Simulation

### Objective

In this thesis, we want to test different CNN models on a particular, simple DNN accelerator architecture.
In particular, we want to test _AlexNet_, _VGG01_, _VGG02_ and _ResNet-18_ on a simple and fairly cheap architecture, in order to see whether such hardware is suitable for some advanced Computer Vision tasks.


### Accelerator architecture

We are going to try the ResNet-18 workload on some simple architectures, which will mainly consist of a main memory, a buffer and different processing elements (PEs).

The mapper configuration can be found [here](./ResNet18/mapper/mapper.yaml). We use eight threads in parallel to find the best parameters for optimizing _delay_ and _energy_, using a _random-pruned_ algorithm which stops once it has found 3000 configurations that perform worse than the best one found so far.
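
The stopping rule described above can be sketched as a toy random search. This is not Timeloop's implementation: it runs on a single thread, it optimizes a single made-up cost instead of delay and energy, and it assumes that the counter of worse configurations resets every time a better one is found. All names (`random_search`, `victory_condition`, the toy cost) are chosen for illustration.

```python
import random


def random_search(cost, sample, victory_condition, seed=0):
    """Sample random candidates until `victory_condition` candidates in a row
    fail to improve on the best one found so far (assumed stopping rule)."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    since_improvement = 0
    while since_improvement < victory_condition:
        candidate = sample(rng)
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
            since_improvement = 0  # a better candidate resets the counter
        else:
            since_improvement += 1
    return best, best_cost


# Toy problem: pick an integer in [0, 10] that minimizes (x - 3)^2.
best, best_cost = random_search(
    cost=lambda x: (x - 3) ** 2,
    sample=lambda rng: rng.randint(0, 10),
    victory_condition=3000,
)
print(best, best_cost)
```

A larger threshold makes the search more thorough at the price of a longer run time, which is the trade-off behind the value of 3000 used here.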

#### Configuration `base`

This configuration (based on the 45nm technology) consists of:
* __Main Memory__ of type _DRAM_ (with `width = 256`, `block-size = 16` and `word-bits = 8`).
* __Global Buffer__ of type _SRAM_ (with `depth = 12`, `width = 16`, `block-size = 16` and `word-bits = 1`).
* __Processing Elements__ _(PE)_ containing:
    * __Register File__, with `depth = 16`, `width = 8`, `block-size = 1` and `word-bits = 8`.
    * __MACC__ _(intmac)_, with `data-width = 8`. 
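
To get a feeling for how small these memories are, the sketch below computes the capacity of the two on-chip storage levels. It assumes that `depth` is the number of rows and `width` the number of bits per row; under this reading, the number of words per row matches the `block-size` values listed above. The `Storage` class is a hypothetical helper, not a Timeloop or Accelergy object.

```python
from dataclasses import dataclass


@dataclass
class Storage:
    name: str
    depth: int      # number of rows (assumed meaning)
    width: int      # bits per row (assumed meaning)
    word_bits: int  # bits per data word

    @property
    def capacity_bits(self) -> int:
        return self.depth * self.width

    @property
    def words_per_row(self) -> int:
        return self.width // self.word_bits


# Values taken from the `base` configuration above.
global_buffer = Storage("Global Buffer", depth=12, width=16, word_bits=1)
register_file = Storage("Register File", depth=16, width=8, word_bits=8)

for level in (global_buffer, register_file):
    print(f"{level.name}: {level.capacity_bits} bits, "
          f"{level.words_per_row} word(s) per row")
```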

Let's now run the simulation. Remember to __spin up Docker__, since we use Timeloop/Accelergy through it. We will also use _python_ to run the simulations for all the layers automatically. First, open a terminal window and start the container with the command `docker-compose run --rm exercises`. Then put the model folders inside the newly created `workspace` folder. The snippet of code below generates the bash file which is run inside the container shell to simulate all the layers with this first configuration. Before running it, you might need to run `chmod u+x bash_script.sh`.

```python
from utils import *

generate_bash_script()
```
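
The actual `generate_bash_script` lives in the project's `utils` module and is not reproduced here. As a rough idea of what such a helper can look like, the sketch below writes one command per layer to a bash script and marks it executable from Python, which has the same effect as `chmod u+x`. The function name `write_script`, the layer file names and the command template are all assumptions made for illustration.

```python
import stat
import tempfile
from pathlib import Path


def write_script(path, commands):
    """Write one command per line to a bash script and mark it executable."""
    script = Path(path)
    script.write_text("#!/bin/bash\nset -e\n" + "\n".join(commands) + "\n")
    # Add the owner-execute bit, the same effect as `chmod u+x`.
    script.chmod(script.stat().st_mode | stat.S_IXUSR)
    return script


# Hypothetical layer files and command template (assumptions, not the real utils).
layers = ["layer_01.yaml", "layer_02.yaml"]
template = "timeloop-mapper arch.yaml {layer} mapper.yaml"
commands = [template.format(layer=layer) for layer in layers]

# Written to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as workdir:
    script = write_script(Path(workdir) / "bash_script.sh", commands)
    print(script.read_text())
```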

As we can see, utilization is 1, which means that all PEs are being used. __Remember that each output folder contains a stats txt file with more detailed information.__

Next, we want to try other configurations: for example, let's increase and decrease the number of PEs and see how the stats change.