# Evaluation of a Convolutional Neural Network Architecture
Course project for the class _IoT Based Smart Systems_, carried out by _Riccardo Maria Pesce_ during Academic Year _2021-2022_, under the kind supervision of Professor _Maurizio Palesi_.

## Introduction

### Motivation
With the scientific and technological advances of the past few years, AI techniques have been employed in many different fields with great success.

While Deep Learning models are still trained in the cloud (using state-of-the-art machines with specialized hardware such as _GPUs_ or _TPUs_), it is becoming increasingly common to perform inference on the edge, i.e. on the device itself, so as to reduce latency and optimize the usage of bandwidth, a relevant issue for constrained devices.

### Objective
The objective of this thesis is to analyze the performance of a _CNN_ (Convolutional Neural Network) architecture on constrained hardware, and to see how the mapping affects performance in terms of throughput and energy consumption.
In particular, performance improvements are obtained in the following ways:

* By reducing data movements, since communication is nowadays more expensive than computation in terms of energy. We can reduce data movements either by reducing the number of times memory is accessed, employing for instance DRAM located near the _PEs_ (Processing Elements), or by compressing data so that it is represented with a smaller number of bits, thus making each data movement cheaper. From these observations, we notice how __memory is the main bottleneck__.

* By maximizing the parallelism of the PEs.

### Mapping
Mapping defines the order of execution of the MAC (multiply-accumulate) operations. The ordering can be either _temporal_, when operations are mapped serially onto the same PE (i.e. the temporal order of execution), or _spatial_, when operations are mapped onto multiple PEs to execute in parallel.
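
The difference between the two orderings can be sketched in a few lines of Python. This is only an illustration, not the way Timeloop represents mappings: the functions `temporal_mapping` and `spatial_mapping`, the round-robin assignment and the parameter `num_pes` are assumptions made for this example, and the cost of adding the partial sums together is ignored.

```python
def temporal_mapping(weights, inputs):
    """All MACs run serially on a single PE: one MAC per time step."""
    acc = 0
    steps = 0
    for w, x in zip(weights, inputs):
        acc += w * x
        steps += 1
    return acc, steps


def spatial_mapping(weights, inputs, num_pes):
    """MACs are spread round-robin over num_pes PEs that run in parallel.

    Each PE still executes its own share serially, so the number of
    time steps is set by the PE with the most MACs.
    """
    partial_sums = [0] * num_pes
    macs_per_pe = [0] * num_pes
    for i, (w, x) in enumerate(zip(weights, inputs)):
        pe = i % num_pes
        partial_sums[pe] += w * x
        macs_per_pe[pe] += 1
    return sum(partial_sums), max(macs_per_pe)


weights = [1, 2, 3, 4, 5, 6, 7, 8]
inputs = [8, 7, 6, 5, 4, 3, 2, 1]

temporal_result = temporal_mapping(weights, inputs)
spatial_result = spatial_mapping(weights, inputs, num_pes=4)
print("temporal (result, steps):", temporal_result)
print("spatial  (result, steps):", spatial_result)
```

Both orderings compute the same dot product (120), but the temporal one needs 8 steps on one PE, while the spatial one needs 2 steps on 4 PEs.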

### Introduction to Timeloop and Accelergy
In order to correctly design a DNN accelerator we need to cater for the different DNN architectures, and for each one of them we have to find an optimal mapping of these workloads onto specific hardware architectures.
This is what Timeloop and Accelergy do, and in particular:
* Timeloop generates a characterization of the energy efficiency of each workload, through a mapper which finds the optimal way to schedule operations on a specified architecture. To do so, Timeloop uses a concise and unified representation of the core elements which are generally found in DNN accelerators.
* Accelergy, on the basis of this characterization, provides an estimate of the energy consumption.

For a more complete introduction to Timeloop/Accelergy, please refer to [the official website](http://accelergy.mit.edu/tutorial.html).

## Simulation

### Objective

In this thesis, we want to find an efficient mapping of the _ResNet-18_ model onto different architectures.

### ResNet-18

We want to give a brief overview of the _ResNet-18_ architecture. 

![ResNet-18 Architecture](./assets/resnet18.png)

The main key points, as highlighted in the above picture, are:

* ResNet-18 architecture has __4 stages__.
* Input height and width must be multiples of 32, and the number of input channels must be equal to 3.
* The main innovation of the ResNet architecture is the introduction of the _Identity Connection_, which allows the input feature map of a layer to skip some blocks and be summed with the output feature map of the skipped layers before passing through the activation function (commonly the _ReLU_). This is a very successful strategy to improve accuracy, because it limits the _vanishing gradient_ problem, which often occurs in very deep neural networks.
* ResNet uses Batch Normalization to mitigate the _Covariate Shift_ problem.
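
The identity connection can be sketched with NumPy as follows. This is a simplified illustration, not the actual ResNet block: two dense layers with hypothetical weights `w1` and `w2` stand in for the convolutions and batch normalization, and the function name `residual_block` is chosen only for this example.

```python
import numpy as np


def relu(x):
    """Element-wise ReLU activation."""
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """Compute ReLU(F(x) + x), where F is a toy two-layer transform.

    The input x is added to F(x) *before* the final activation,
    which is the identity (skip) connection described above.
    """
    f = relu(x @ w1) @ w2
    return relu(f + x)


x = np.array([[1.0, -2.0, 3.0]])
zero = np.zeros((3, 3))

# With all-zero weights F(x) is zero, so only the skip path contributes.
out = residual_block(x, zero, zero)
print(out)
```

With all-zero weights the block reduces to `relu(x)`: the input still reaches the output through the skip path, which is the same path the gradient can follow during training.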

For a better understanding, please check out the original paper on [ArXiv](https://arxiv.org/abs/1512.03385).


### Accelerator architecture

We are going to try the ResNet-18 workload on different hardware architectures. In particular we are going to analyze energy consumption, inferences per second and occupied area.

We start with a basic architecture, which will contain a main memory, a global buffer and multiple processing elements. The features of each one of the elements will be changed, and for each configuration statistics will be shown.

The mapper configuration can be found [here](./ResNet18/mapper/mapper.yaml). We use eight parallel threads to search for the mapping that optimizes _delay_ and _energy_, using the _random-pruned_ algorithm, which stops once it has found 50 mappings that perform worse than the best one found so far.

#### Configuration 1

The first architecture uses a 45nm technology and is organized as follows:
* __Main Memory__: DRAM memory with `width = 512`, `block-size = 64` and `word-bits = 8`.
* __Global Buffer__: SRAM memory with `depth = 12`, `width = 16`, `block-size = 16` and `word-bits = 1`.
* __Processing Elements (16)__: made up of
    * __Register File__: with `depth = 16`, `width = 8`, `block-size = 1` and `word-bits = 8`.
    * __MACC (intmac)__: with `data-width = 16`.
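
As a quick sanity check on these parameters, the sketch below verifies that each memory's `width` equals `block-size × word-bits` and computes the storage capacity as `depth × width`. It assumes that `width` is the number of bits per row, `block-size` the number of words per row and `depth` the number of rows; this reading of the parameters, as well as the helper names, is an assumption made for the example.

```python
# Parameters copied from Configuration 1 (depth is None where none is given).
config = {
    "MainMemory":   {"depth": None, "width": 512, "block_size": 64, "word_bits": 8},
    "GlobalBuffer": {"depth": 12,   "width": 16,  "block_size": 16, "word_bits": 1},
    "RegisterFile": {"depth": 16,   "width": 8,   "block_size": 1,  "word_bits": 8},
}


def row_is_consistent(params):
    """A row of `width` bits should hold exactly `block_size` words."""
    return params["width"] == params["block_size"] * params["word_bits"]


def capacity_bits(params):
    """Capacity in bits, assuming depth = number of rows of `width` bits."""
    if params["depth"] is None:
        return None
    return params["depth"] * params["width"]


for name, params in config.items():
    print(name, "consistent:", row_is_consistent(params),
          "capacity (bits):", capacity_bits(params))
```

Under these assumptions the global buffer holds 192 bits and each register file holds 128 bits.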

Let's now run the simulation, remembering to __spin up Docker__, since we use Timeloop/Accelergy through it. We also use _Python_ to run the simulations for all the layers automatically. First, open a terminal window and start the container with the command `docker-compose run --rm exercises`. Then put the `ResNet18` folder inside the newly created `workspace` folder. The snippet below generates the bash script that will be run inside the container shell to run the simulations for this first configuration on all the layers. Before running it, you might need to run `chmod u+x bash_script.sh`.

In [1]:
import glob
from pathlib import Path

# All available architecture files and their names (without path and extension).
configurations = [f for f in glob.glob("workspace/ResNet18/arch/*") if f.endswith(".yaml")]
conf_names = [f.replace("workspace/ResNet18/arch/", "").replace(".yaml", "") for f in configurations]

# Configuration simulated in this section.
conf = "workspace/ResNet18/arch/configuration1.yaml"
conf_name = "configuration1"

with open("workspace/bash_script.sh", "w") as bash_script:
    bash_script.write("#!/bin/bash")
    bash_script.write("\n\n")
    bash_script.write("chmod -R 777 .")
    bash_script.write("\n\n")
    # Permissions must be given in octal (0o777); a plain 777 is a different mode.
    Path(f"workspace/ResNet18/output/conf-{conf_name}/").mkdir(mode=0o777, parents=True, exist_ok=True)
    Path(f"workspace/ResNet18/output/conf-{conf_name}/").chmod(0o777)
    for i in range(1, 22):  # layers 1..21
        Path(f"workspace/ResNet18/output/conf-{conf_name}/output{i}/").mkdir(mode=0o777, exist_ok=True)
        # Paths in the command are relative to the workspace folder inside the container.
        cmd = (
            f"timeloop-mapper {conf.replace('workspace/', '')} "
            "ResNet18/arch/components/*.yaml "
            f"ResNet18/prob/resnet18_layer{i}.yaml "
            "ResNet18/mapper/mapper.yaml "
            "ResNet18/constraints/*.yaml "
            f"-o ./ResNet18/output/conf-{conf_name}/output{i}"
        )
        bash_script.write(cmd)
        bash_script.write("\n\n")
        


Let's define some functions to parse the different statistics per layer.

In [2]:
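import re

def get_energy_breakdown_from_stats_txt(file_path):
    """Return {component: energy per MACC} for every non-zero entry of the stats file."""
    data = dict()
    with open(file_path, "r") as f:
        stats = f.read()
    # Skip the "MACCs" and "Total" rows; the decimal point is escaped so it only matches a literal dot.
    for m in re.findall(r"\b(?!\bMACCs|Total\b)([a-zA-Z<>=]+)\s+=\s+(?=.*[1-9])(\d+\.\d+)", stats):
        data[m[0]] = float(m[1])
    return data

def get_area_breakdown_from_stats_txt(file_path):
    """Return {component: area} for every "=== Component ===" section of the stats file."""
    data = dict()
    with open(file_path, "r") as f:
        stats = f.read()
    for m in re.findall(r"=== ([a-zA-Z]+) ===.*?Area .*?\s+:\s(\d+\.\d+)", stats, re.DOTALL):
        # Entries reported as exactly 1.00 are skipped.
        if m[1] != "1.00":
            data[m[0]] = float(m[1])
    return data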

Let's now see the energy consumption per MACC (computation). Note that it is expressed in __pJ/MACC__, since we are estimating the energy for each __MACC__ operation.

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

def compute_pj_macc_stats(output_path):
    """Collect the pJ/MACC breakdown of every layer found under output_path."""
    layer_paths = glob.glob(output_path + "/*/" + "timeloop-mapper.stats.txt")
    # The layer number is taken from the "output<N>" folder name.
    layers = [int(f.replace(output_path, "")
               .replace("timeloop-mapper.stats.txt", "")
               .replace("/", "")
               .replace("output", "")) for f in layer_paths]

    stats = {layer: get_energy_breakdown_from_stats_txt(layer_path) for layer, layer_path in zip(layers, layer_paths)}
    stats_df = pd.DataFrame(stats).T.sort_index().fillna(0)

    # Breakdown per layer, totals per element (columns) and totals per layer (rows).
    return stats_df, stats_df.sum(axis=0), stats_df.sum(axis=1)

# The function has its own name so that the result does not overwrite it.
pj_macc_stats, pj_macc_total_by_element, pj_macc_total_by_layer = compute_pj_macc_stats("workspace/ResNet18/output/conf-configuration1")
pj_macc_stats

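| Layer | MACC | RegisterFile | MainMemory | GlobalBuffer |
|-------|------|--------------|------------|--------------|
| 1     | 0.56 | 0.26         | 40.44      | 0.0          |
| 2     | 0.56 | 0.26         | 48.78      | 0.0          |
| 3     | 0.56 | 0.26         | 48.78      | 0.0          |
| 4     | 0.56 | 0.26         | 48.78      | 0.0          |
| 5     | 0.56 | 0.26         | 48.78      | 0.0          |
| 6     | 0.56 | 0.26         | 48.11      | 0.0          |
| 7     | 0.56 | 0.26         | 32.39      | 0.0          |
| 8     | 0.56 | 0.4          | 19.33      | 0.0          |
| 9     | 0.56 | 0.26         | 32.39      | 0.0          |
| 10    | 0.56 | 0.26         | 32.39      | 0.0          |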
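
To quantify how much of the energy per MACC is due to main memory, the following sketch recomputes the share of the `MainMemory` column for the ten layers listed in the table above. The values are copied from that table; the variable names are chosen only for this example.

```python
# pJ/MACC per layer, copied from the table above:
# (MACC, RegisterFile, MainMemory, GlobalBuffer)
breakdown = {
    1: (0.56, 0.26, 40.44, 0.0),
    2: (0.56, 0.26, 48.78, 0.0),
    3: (0.56, 0.26, 48.78, 0.0),
    4: (0.56, 0.26, 48.78, 0.0),
    5: (0.56, 0.26, 48.78, 0.0),
    6: (0.56, 0.26, 48.11, 0.0),
    7: (0.56, 0.26, 32.39, 0.0),
    8: (0.56, 0.40, 19.33, 0.0),
    9: (0.56, 0.26, 32.39, 0.0),
    10: (0.56, 0.26, 32.39, 0.0),
}

# Fraction of the per-MACC energy spent in main memory, per layer.
main_memory_share = {
    layer: values[2] / sum(values) for layer, values in breakdown.items()
}

for layer, share in main_memory_share.items():
    print(f"layer {layer:2d}: {100 * share:.1f}% of pJ/MACC in MainMemory")
```

For all ten layers main memory accounts for more than 95% of the energy per MACC, which is consistent with the observation that memory is the main bottleneck.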


Let's now see how much energy per MACC has been consumed by each layer.

In [4]:
pj_macc_total_by_layer

1     41.26
2     49.60
3     49.60
4     49.60
5     49.60
6     48.93
7     33.21
8     20.29
9     33.21
10    33.21
11    37.37
12    49.01
13    15.04
14    49.01
15    49.01
16    42.14
17    69.03
18    38.04
19    69.03
20    69.03
21    17.84
dtype: float64

Let's now get the overall summary statistics for each layer, remembering that __area is measured in µm<sup>2</sup>__ and that this time energy is measured in __µJ__.

In [6]:
def get_summary_stats(file_path):
  data = dict()
  with open(file_path, "r") as f:
    stats = f.read()
    initial_index = stats.index("Utilization:")
    end_index = stats.index("\n\nMACCs")
    cleaned = [c.replace(":", "") for c in stats[initial_index:end_index + 1].split("\n") if c != ""]
    summary = {s.split()[0]: float(s.split()[1]) for s in cleaned}
    
  return summary


def energy_stats(output_path):
  layer_paths = [f for f in glob.glob(output_path + "/*/" + "timeloop-mapper.stats.txt")]
  layers = [int(f.replace(output_path, "")
              .replace("timeloop-mapper.stats.txt", "")
              .replace("/", "")
              .replace("output", "")) for f in layer_paths]

  stats = {layer: {"Energy": get_summary_stats(layer_path)["Energy"]} for layer, layer_path in zip(layers, layer_paths)}
  stats_df = pd.DataFrame(stats).T.sort_index().fillna(0)

  return stats_df, stats_df.sum(axis=0).values.tolist()[0]

energy_by_layer, energy_total = energy_stats("workspace/ResNet18/output/conf-configuration1")
print(energy_by_layer)
print("---------")
print("Total: ", energy_total)

     Energy
1   4868.83
2   5734.05
3   5734.05
4   5734.05
5   5734.05
6   2828.41
7   3839.29
8    130.29
9   3839.29
10  3839.29
11  2160.27
12  5666.49
13    96.61
14  5666.49
15  5666.49
16  2435.82
17  7980.19
18   244.30
19  7980.19
20  7980.19
21  4675.94
---------
Total:  92834.58
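
To see which layers dominate the total, the sketch below recomputes the total from the per-layer values printed above and lists the three most expensive layers with their share. The values are copied from that output; the variable names are chosen only for this example.

```python
# Energy per layer in uJ, copied from the output above.
energy_uj = {
    1: 4868.83, 2: 5734.05, 3: 5734.05, 4: 5734.05, 5: 5734.05,
    6: 2828.41, 7: 3839.29, 8: 130.29, 9: 3839.29, 10: 3839.29,
    11: 2160.27, 12: 5666.49, 13: 96.61, 14: 5666.49, 15: 5666.49,
    16: 2435.82, 17: 7980.19, 18: 244.30, 19: 7980.19, 20: 7980.19,
    21: 4675.94,
}

total = sum(energy_uj.values())
share = {layer: energy / total for layer, energy in energy_uj.items()}

# Layers sorted from the most to the least energy-hungry.
top3 = sorted(energy_uj, key=energy_uj.get, reverse=True)[:3]

print(f"Total: {total:.2f} uJ")
for layer in top3:
    print(f"layer {layer}: {energy_uj[layer]:.2f} uJ ({100 * share[layer]:.1f}% of total)")
```

Layers 17, 19 and 20 are the most expensive ones, each accounting for about 8.6% of the total energy.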
