# Example: Code size comparison: muRISCV-NN vs. CMSIS-NN

While we usually focus on program runtime (in ms, cycles, or instructions), the memory demand of a given application should not be underestimated. Most of the ROM usage is probably fixed by the model weights, but the program code itself can also take up more than 100 kB, which might exceed the capacity of some edge ML devices.

## Supported components

**Models:** Any (`aww` and `resnet` used below)

**Frontends:** `tflite` only (because of the used backend)

**Frameworks/Backends:** `tflmi` or `tflmc` only

**Platforms/Targets:** Any target/platform supporting both `muriscvnn` as well as `cmsisnn` (spike used below)

**Features:** `muriscvnn` and `cmsisnn` features have to be enabled 

## Prerequisites

Set up MLonMCU as usual, i.e. initialize an environment and install all required dependencies. Feel free to use the following minimal `environment.yml.j2` template:

```yaml
---
home: "{{ home_dir }}"
logging:
  level: DEBUG
  to_file: false
  rotate: false
cleanup:
  auto: true
  keep: 10
paths:
  deps: deps
  logs: logs
  results: results
  plugins: plugins
  temp: temp
  models:
    - "{{ home_dir }}/models"
    - "{{ config_dir }}/models"
repos:
  tensorflow:
    url: "https://github.com/tensorflow/tflite-micro.git"
    ref: f050eec7e32a0895f7658db21a4bdbd0975087a5
  spike:
    url: "https://github.com/riscv-software-src/riscv-isa-sim.git"
    ref: 0bc176b3fca43560b9e8586cdbc41cfde073e17a
  spikepk:
    url: "https://github.com/riscv-software-src/riscv-pk.git"
    ref: 7e9b671c0415dfd7b562ac934feb9380075d4aa2
  cmsis:
    url: "https://github.com/PhilippvK/CMSIS_5.git"
    ref: ad1c3cad8f1240ef14a2c55381a78d792d76ec4d
  muriscvnn:
    url: "https://github.com/tum-ei-eda/muriscv-nn.git"
    ref: c023b80a51c1b48ec62b9b092d047e9ac0bab3e8
  mlif:
    url: "https://github.com/tum-ei-eda/mlonmcu-sw.git"
    ref: 4b9a32659f7c5340e8de26a0b8c4135ca67d64ac
frameworks:
  default: tflm
  tflm:
    enabled: true
    backends:
      default: tflmi
      tflmi:
        enabled: true
        features:
          debug_arena: true
    features:
      muriscvnn: true
      cmsisnn: true
frontends:
  tflite:
    enabled: true
    features: []
toolchains:
  gcc: true
platforms:
  mlif:
    enabled: true
    features: []
targets:
  default: spike
  spike:
    enabled: true
    features: []
```

Do not forget to set your `MLONMCU_HOME` environment variable first if not using the default location!
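
Before launching the cells below, it can help to check from Python whether the variable is actually visible to the process. The following is only a minimal sketch: the helper `describe_mlonmcu_home` is a hypothetical name introduced for illustration and is not part of MLonMCU.

```python
import os
from pathlib import Path


def describe_mlonmcu_home(environ=os.environ):
    """Report whether MLONMCU_HOME is set and points to an existing directory.

    Hypothetical helper for illustration only; not part of the MLonMCU API.
    """
    value = environ.get("MLONMCU_HOME")
    if not value:
        return "MLONMCU_HOME is not set, the default location will be used"
    if not Path(value).is_dir():
        return f"MLONMCU_HOME is set to {value}, but that directory does not exist"
    return f"MLONMCU_HOME points to {value}"


print(describe_mlonmcu_home())
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try out without touching the real environment.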

## Usage

The following experiments mainly discuss ROM usage, or more specifically code size (e.g. how large the `.text` ELF section is). Only the scalar (non-SIMD) versions of the libraries are discussed in the following!

*Warning:* While muRISCV-NN and CMSIS-NN share a very similar code base, differences in the observed ROM metrics are expected, especially when comparing different compilers (e.g. ARM GCC vs. RISC-V GCC) and possibly different optimization flags.

### A) Command Line Interface

First, we want to check whether the `muriscvnn` and `cmsisnn` features are working as expected with a simple (2 models, 1 target) benchmark configuration:

In [4]:
!python -m mlonmcu.cli.main flow run aww resnet -b tflmi -t spike \
        --feature-gen _ --feature-gen muriscvnn --feature-gen cmsisnn

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-354]  Processing stage LOAD
INFO - [session-354]  Processing stage BUILD
INFO - [session-354]  Processing stage COMPILE
INFO - [session-354]  Processing stage RUN
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-354] Done processing runs
INFO - Report:
   Session  Run   Model Frontend Framework Backend Platform Target     Cycles  Total ROM  Total RAM  ROM read-only  ROM code  ROM misc  RAM data  RAM zero-init data     Features                                             Config Postprocesses Comment
0      354    0     aww   tflite      tflm   tflmi     mlif  spike   47517926     155265      36220          63661     90396      1208      2816               33404           []  {'tflite.use_inout_data': False, 'tflite.visua...            []       -
1      354    1     aww   tflite      tflm   tflmi     mlif  spike   

Now let's focus on the reported ROM metrics. The `filter_cols` postprocess keeps only the relevant columns of the report:

In [5]:
!python -m mlonmcu.cli.main flow run aww resnet -b tflmi -t spike \
        --feature-gen _ --feature-gen muriscvnn --feature-gen cmsisnn \
        --postprocess filter_cols --config filter_cols.keep="Model,Cycles,Features,Total ROM,ROM read-only,ROM code, ROM misc"

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-355]  Processing stage LOAD
INFO - [session-355]  Processing stage BUILD
INFO - [session-355]  Processing stage COMPILE
INFO - [session-355]  Processing stage RUN
INFO - [session-355]  Processing stage POSTPROCESS
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-355] Done processing runs
INFO - Report:
    Model     Cycles  Total ROM  ROM read-only  ROM code     Features
0     aww   47517926     155265          63661     90396           []
1     aww   16472429     164716          63664     99844  [muriscvnn]
2     aww   16530136     166843          63661    101974    [cmsisnn]
3  resnet  155009373     199501         102493     95800           []
4  resnet   62430142     203378         102496     99674  [muriscvnn]
5  resnet   62563663     205381         102493    101680    [cmsisnn]
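
The ROM overhead of each library over the baseline can be read off this report directly. The sketch below recomputes it in plain Python; the `Total ROM` values are copied from the table above, while the dictionary layout and the function name `rom_overhead` are assumptions made for illustration.

```python
# Total ROM in bytes, copied from the -O3 report above: (model, features) -> bytes
TOTAL_ROM = {
    ("aww", "default"): 155265,
    ("aww", "muriscvnn"): 164716,
    ("aww", "cmsisnn"): 166843,
    ("resnet", "default"): 199501,
    ("resnet", "muriscvnn"): 203378,
    ("resnet", "cmsisnn"): 205381,
}


def rom_overhead(total_rom):
    """Return the extra ROM bytes of every non-default run over its model's baseline."""
    return {
        (model, features): size - total_rom[(model, "default")]
        for (model, features), size in total_rom.items()
        if features != "default"
    }


overhead = rom_overhead(TOTAL_ROM)
for (model, features), extra in overhead.items():
    print(f"{model:>6} + {features:<9}: +{extra} B ({extra / 1000:.1f} kB)")
```

For these two models the overhead lies between 3877 B (`resnet` with `muriscvnn`) and 11578 B (`aww` with `cmsisnn`).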


Above we have some preliminary results. It can be seen that the muRISCV-NN library adds roughly 4 to 9.5 kB of ROM (CMSIS-NN roughly 6 to 11.5 kB) on top of a baseline that is probably dominated by the TFLite Micro interpreter itself.
However, these programs were compiled for optimal performance (using the `-O3` compiler optimization flag). Maybe we can reduce the ROM usage by telling MLonMCU to optimize for size (`-Os`) instead?

In [6]:
!python -m mlonmcu.cli.main flow run aww resnet -b tflmi -t spike --config mlif.optimize=s \
        --feature-gen _ --feature-gen muriscvnn --feature-gen cmsisnn \
        --postprocess filter_cols --config filter_cols.keep="Model,Cycles,Features,Total ROM,ROM read-only,ROM code, ROM misc"

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-356]  Processing stage LOAD
INFO - [session-356]  Processing stage BUILD
INFO - [session-356]  Processing stage COMPILE
INFO - [session-356]  Processing stage RUN
INFO - [session-356]  Processing stage POSTPROCESS
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-356] Done processing runs
INFO - Report:
    Model     Cycles  Total ROM  ROM read-only  ROM code     Features
0     aww  153513868     144895          63701     79986           []
1     aww   16588118     157816          63704     92904  [muriscvnn]
2     aww   16645825     159943          63701     95034    [cmsisnn]
3  resnet  687837633     184631         102533     80890           []
4  resnet   62514256     192488         102536     88744  [muriscvnn]
5  resnet   62642467     194491         102533     90750    [cmsisnn]


Well, this looks better, but not optimal. One issue here is that CMSIS-NN lacks a way to receive optimization flags from another CMake project. Hence, in the end only the code outside of CMSIS-NN/muRISCV-NN was compiled with `-Os`.
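
To put numbers on the trade-off, the two reports can be compared run by run. This is a small sketch in plain Python: the `Cycles` and `ROM code` values are copied from the `-O3` and `-Os` reports above, while the data layout and the function name `compare` are assumptions made for illustration.

```python
# (model, features) -> (cycles, ROM code in bytes), copied from the reports above
REPORT_O3 = {
    ("aww", "default"): (47517926, 90396),
    ("aww", "muriscvnn"): (16472429, 99844),
    ("aww", "cmsisnn"): (16530136, 101974),
    ("resnet", "default"): (155009373, 95800),
    ("resnet", "muriscvnn"): (62430142, 99674),
    ("resnet", "cmsisnn"): (62563663, 101680),
}
REPORT_OS = {
    ("aww", "default"): (153513868, 79986),
    ("aww", "muriscvnn"): (16588118, 92904),
    ("aww", "cmsisnn"): (16645825, 95034),
    ("resnet", "default"): (687837633, 80890),
    ("resnet", "muriscvnn"): (62514256, 88744),
    ("resnet", "cmsisnn"): (62642467, 90750),
}


def compare(report_o3, report_os):
    """Return (ROM code bytes saved by -Os, cycle slowdown factor of -Os) per run."""
    result = {}
    for key, (cycles_o3, rom_o3) in report_o3.items():
        cycles_os, rom_os = report_os[key]
        result[key] = (rom_o3 - rom_os, cycles_os / cycles_o3)
    return result


for (model, features), (saved, slowdown) in compare(REPORT_O3, REPORT_OS).items():
    print(f"{model:>6} {features:<9}: -{saved} B code, x{slowdown:.2f} cycles")
```

Two things stand out in these numbers. The baseline saves the most code (10410 B for `aww`, 14910 B for `resnet`) but becomes roughly 3 to 4.5 times slower, while the library variants lose less than 1% in cycles. Within each model the `muriscvnn` and `cmsisnn` runs save exactly the same number of bytes (6940 B and 10930 B), which is consistent with the library kernels not being affected by `-Os`.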

### B) Python Scripting

To achieve the previous results with a Python script, only a few lines of code are required. Let's start with some imports:

In [18]:
from tempfile import TemporaryDirectory
from pathlib import Path

from mlonmcu.context.context import MlonMcuContext
from mlonmcu.session.run import RunStage

Benchmark Configuration

In [15]:
FRONTEND = "tflite"
MODELS = ["aww", "resnet"]
BACKEND = "tflmi"
PLATFORM = "mlif"
TARGET = "spike"
POSTPROCESSES = ["config2cols", "rename_cols", "filter_cols"]
FEATURES = [[], ["cmsisnn"], ["muriscvnn"]]
CONFIG = {
    "mlif.optimize": "s",
    "filter_cols.keep": ["Model", "Cycles", "ROM code", "Features"]
}
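
This configuration describes a small benchmark matrix: every model is combined with every feature set. The sketch below only enumerates that matrix with `itertools.product` and does not call MLonMCU at all; the function name `run_matrix` is hypothetical.

```python
from itertools import product

MODELS = ["aww", "resnet"]
FEATURES = [[], ["cmsisnn"], ["muriscvnn"]]


def run_matrix(models, feature_sets):
    """Return one (model, features) pair per run, models in the outer loop."""
    return [(model, list(features)) for model, features in product(models, feature_sets)]


for index, (model, features) in enumerate(run_matrix(MODELS, FEATURES)):
    print(index, model, features or "default")
```

With 2 models and 3 feature sets this yields 6 runs, and the empty feature set (the baseline) always comes first within each model.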

Initialize and run the benchmark session

In [38]:
with MlonMcuContext() as context:
    session = context.create_session()
    # Create one run per (model, feature set) combination
    for model in MODELS:
        for features in FEATURES:
            # Copy the config so that runs do not share one dictionary
            run = session.create_run(config=CONFIG.copy())
            run.add_features_by_name(features, context=context)
            run.add_frontend_by_name(FRONTEND, context=context)
            run.add_model_by_name(model, context=context)
            run.add_backend_by_name(BACKEND, context=context)
            run.add_platform_by_name(PLATFORM, context=context)
            run.add_target_by_name(TARGET, context=context)
            run.add_postprocesses_by_name(POSTPROCESSES)
    # Process all runs of the session, then collect the combined report
    session.process_runs(context=context)
    report = session.get_reports()
report.df

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-360] Processing all stages
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-360] Done processing runs


    Model     Cycles  ROM code     Features
0     aww  153513868     79986           []
1     aww   16645825     95034    [cmsisnn]
2     aww   16588118     92904  [muriscvnn]
3  resnet  687837633     80890           []
4  resnet   62642467     90750    [cmsisnn]
5  resnet   62514256     88744  [muriscvnn]


Here we have the report as a pandas DataFrame. Of course, we can also look at relative differences instead:

In [56]:
df = report.df
df.set_index('Features', inplace=True)
# Label each row by its feature names; an empty feature list is the baseline.
# Assumes the entries are lists of strings (the original check `len(x) > 2`
# tested the list length and therefore labelled every row "default").
df.index = df.index.map(lambda x: ", ".join(x) if len(x) > 0 else "default")
# Baseline value per model: the first row of each group (the run without features)
cycles_firsts = df.groupby('Model')['Cycles'].transform('first')
rom_firsts = df.groupby('Model')['ROM code'].transform('first')
# Both columns are baseline / value: >1 means fewer cycles, <1 means more code
df["Cycles (rel.)"] = cycles_firsts / df["Cycles"]
df["ROM code (rel.)"] = rom_firsts / df["ROM code"]
df

            Model     Cycles  ROM code  Cycles (rel.)  ROM code (rel.)
Features
default       aww  153513868     79986       1.000000         1.000000
cmsisnn       aww   16645825     95034       9.222365         0.841657
muriscvnn     aww   16588118     92904       9.254448         0.860953
default    resnet  687837633     80890       1.000000         1.000000
cmsisnn    resnet   62642467     90750      10.980373         0.891350
muriscvnn  resnet   62514256     88744      11.002892         0.911498
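
The `transform('first')` calls above broadcast the first value of each model group to all rows of that group, so every run is divided by its own baseline. The following sketch mimics that step in plain Python to make the mechanism explicit. The function name `first_per_group` is hypothetical, and the numbers are copied from the report above.

```python
def first_per_group(keys, values):
    """Mimic groupby(keys)[values].transform('first'):
    repeat the first value seen for each key on every row of that key."""
    firsts = {}
    broadcast = []
    for key, value in zip(keys, values):
        firsts.setdefault(key, value)  # only the first value per key is stored
        broadcast.append(firsts[key])
    return broadcast


models = ["aww", "aww", "aww", "resnet", "resnet", "resnet"]
rom_code = [79986, 95034, 92904, 80890, 90750, 88744]

baseline = first_per_group(models, rom_code)
rom_rel = [base / rom for base, rom in zip(baseline, rom_code)]

for model, rom, rel in zip(models, rom_code, rom_rel):
    print(f"{model:>6} {rom:>6} {rel:.6f}")
```

This only works because the baseline run comes first within each model. Since the ratio is baseline divided by value, an entry below 1.0 in `ROM code (rel.)` means that the run needs more code than the baseline.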
