In [None]:
from nbutils import get
import IPython
from utils import (
        tvmc_compile_and_unpack, 
        relay_soma_conv2d,
        create_demo_file, 
        parse_cli_options,
        load_or_create_random_array,
        )
import tvm
import tvm.relay as relay
import tvm.relay.transform as transform
from tvm.driver.tvmc.model import TVMCModel
from tvm.driver.tvmc.compiler import compile_model
from tvm.relay.backend import Executor, Runtime
import numpy as np

# Performance Evaluation on Diana

Getting good insight into how well your algorithm performs on Diana is crucial.

In this tutorial you will learn how to:
* add performance counters to your compiled code
* run an experiment with performance counters and read back the measured cycle counts
* plot and interpret achieved performance on Diana

This tutorial assumes that you:
* have already selected the algorithm you want to test
* are aware of the basic architecture of Diana
* are familiar with how code generation works for Diana, both for the RISC-V core (TVM) and the accelerators (Dory)


**Note that this notebook runs in an `IPython` interactive shell, but that commands with an exclamation mark (!) in front are passed on to the Linux shell (`bash` in this case).**

Also keep a terminal ready where you've gone to the same directory:
```bash
cd /tvm-fork/diana/tutorial/performance_evaluation_on_dianaRun
```

## Run a simple Relay model on Diana

To start, let's again compile a model with TVM into C code that can be built with PULP-SDK.
This time, we'll use the helper scripts inside the `/tvm-fork/diana/byoc` directory instead of writing all of the Relay code ourselves.

The cell below creates a small model with a single convolution layer:

In [None]:
def create_model():
    input_shape = (1, 3, 16, 16)
    x = relay.var("input", relay.TensorType(input_shape, 'int8'))

    weights_shape = (32, 3, 3, 3)
    special_data = load_or_create_random_array("weights.npy",
                                               weights_shape, np.int8)
    x, params1 = relay_soma_conv2d(x, 'conv1', weights_shape, 
                                   special_data,
                                   np.ones(weights_shape[0]).astype(np.int32), 
                                   act=False, shift_bits=4)
    params = params1

    # create an IR module from the relay expression
    mod = tvm.ir.IRModule()
    mod = mod.from_expr(x)

    return mod, params

mod, params = create_model()
print(mod)

In [None]:
model = TVMCModel(mod, params)
target = "soma_dory, c"
fuse_layers = True
tvmc_compile_and_unpack(model, target=target, fuse_layers=fuse_layers, build_path='build')
create_demo_file(mod, 'app/src/demo.c')

Before we can compile the generated C code we have to set up an application again where all the right dependencies are located:

In [None]:
APP_DST_DIR="app"
APP_SRC_DIR="../../byoc"
DORY_SRC_DIR="/dory"
DORY_DST_DIR="dory"
# Create the destination directories before copying files into them
!mkdir -p $APP_DST_DIR/src
!mkdir -p $APP_DST_DIR/include
!mkdir -p $DORY_DST_DIR/include
!mkdir -p $DORY_DST_DIR/src
!cp $APP_SRC_DIR/src/*.c $APP_DST_DIR/src
!cp $APP_SRC_DIR/include/*.h $APP_DST_DIR/include
!cp $DORY_SRC_DIR/dory/Hardware_targets/Diana/Backend_Kernels/dory-hal/include/*.h $DORY_DST_DIR/include
!cp $DORY_SRC_DIR/dory/Hardware_targets/Diana/Backend_Kernels/dory-hal/src/*.c $DORY_DST_DIR/src
!cp $DORY_SRC_DIR/dory/Hardware_targets/Diana/Diana_TVM/Utils_files/*.h $DORY_DST_DIR/include
!cp $DORY_SRC_DIR/dory/Hardware_targets/Diana/Diana_TVM/Utils_files/*.c $DORY_DST_DIR/src

We can now directly compile the generated code with:
```bash
make -f Makefile.pulprt all
```

The binary is now generated in `build/pulpissimo/demo/demo`.

INSERT PROCEDURE TO RUN THE BINARY

## Profiling the simple binary

It's great that we can run the binary now, but we still have no idea how well the network performs.

Luckily, PULPissimo contains a set of hardware performance counters that can count the number of elapsed clock cycles.
Diana also has access to these performance counters, so we can estimate performance by counting the cycles that elapse on the RISC-V core.

In `/tvm-fork/diana/byoc/src/pulp_rt_profiler_wrapper.c` and `/tvm-fork/diana/byoc/include/pulp_rt_profiler_wrapper.h` we provide C wrappers for this functionality.

The wrappers need to be called as follows:
1. Set up the performance counters once with `init_global_perf_counter()`.
2. Start a counter during program execution with `start_perf_counter()`.
3. Stop the counter and save its value with `int32 count = stop_perf_counter()`.
4. For subsequent measurements, call `start_perf_counter()` and `stop_perf_counter()` again.

Some guidelines for using these functions:
* Don't call `start_perf_counter()` twice in a row without a `stop_perf_counter()` in between.
* If you declare `count` as a global variable, you can read it out with gdb from any part of your program.
* If you spend more than $2^{32}$ cycles in your program, the cycle counter may overflow.
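
To get a feeling for the overflow guideline above, the following minimal sketch computes how long a 32-bit cycle counter runs before it wraps. The 100 MHz clock is only a placeholder assumption and not Diana's actual clock frequency; substitute the frequency your board runs at.

```python
def seconds_until_overflow(clock_hz: float, counter_bits: int = 32) -> float:
    """Return how many seconds a free-running cycle counter lasts before it wraps."""
    return (2 ** counter_bits) / clock_hz


# ASSUMPTION: placeholder clock frequency, not taken from the Diana documentation.
ASSUMED_CLOCK_HZ = 100e6

print(f"{seconds_until_overflow(ASSUMED_CLOCK_HZ):.1f} s")  # prints "42.9 s"
```

If a single measurement could take longer than this, consider measuring smaller parts of the program separately.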

While you could add these performance counters manually to your program, inserting and reading them out by hand is quite error-prone. To alleviate this problem, we have developed some wrapper utilities that go into the files generated by TVM and insert these counters automatically.

We have two different types of measurement:
* `global`: A measurement of the total number of cycles spent in the C function generated by TVM to run one forward pass of the neural network.
* `individual`: A measurement that inserts performance counters before and after each call to an individual kernel (layer) generated by TVM.
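
The difference between the two measurement types can be illustrated with a toy instrumenter. This is a simplified sketch and not the implementation of the profiler utilities: `MAIN_BODY` is a hypothetical stand-in for the body of the TVM-generated main function, the function name `instrument` is made up for this example, and kernel calls are assumed to be recognisable by the `tvmgen_default_` prefix.

```python
import re

# Hypothetical stand-in for the body of the TVM-generated main function.
MAIN_BODY = """\
if (tvmgen_default_fused_layout_transform(input, sid_1) != 0) return -1;
if (tvmgen_default_soma_dory_main_0(sid_1, sid_2) != 0) return -1;
if (tvmgen_default_fused_cast(sid_2, output) != 0) return -1;
"""

# ASSUMPTION: every kernel call starts with the prefix "tvmgen_default_".
KERNEL_CALL = re.compile(r"\btvmgen_default_\w+\s*\(")


def instrument(body: str, measurement: str) -> str:
    """Return `body` with start/stop counter calls inserted as C statements."""
    lines = body.splitlines()
    if measurement == "global":
        # One counter around the whole forward pass.
        out = ["start_perf_counter();", *lines, "perf_cyc = stop_perf_counter();"]
    elif measurement == "individual":
        # One counter per kernel call, each stored in its own variable.
        out, index = [], 0
        for line in lines:
            if KERNEL_CALL.search(line):
                out.append("start_perf_counter();")
                out.append(line)
                out.append(f"perf_cyc_tvm_{index} = stop_perf_counter();")
                index += 1
            else:
                out.append(line)
    else:
        raise ValueError(f"unknown measurement type: {measurement!r}")
    return "\n".join(out) + "\n"


print(instrument(MAIN_BODY, "individual"))
```

With `"global"` the sketch emits a single start/stop pair; with `"individual"` it emits one pair per kernel call, so the number of counter variables grows with the number of layers.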

### Adding global performance counters in the simple binary

Let's look into the main code generated by TVM:

In [None]:
print("'default_lib0.c' before adding performance counters:")
IPython.display.HTML(get("build/codegen/host/src/default_lib0.c"))

To add performance counters and globally defined variables in the generated C code you can run:


In [None]:
from profiler import insert_profiler

measurement="global"  # set profiling measurement to global
codegen_dir="./build/codegen/host/src/"  # set to TVM's codegen output
gdb_script_name="./gdb_profiler.sh"  # name of gdb script to be generated
gdb_log_name="./profile.txt"  # name of file gdb script will log its results to

interactive=False  # Don't use interactive mode for jupyter notebook
csv_file="profile.csv"  # name of file to log results to in interactive mode

insert_profiler(measurement=measurement,
                codegen_dir=codegen_dir,
                gdb_script_name=gdb_script_name,
                gdb_log_name=gdb_log_name,
                interactive=interactive,
                csv_file=csv_file)

With the `insert_profiler` function, we have adapted and created several files:
* We adapted `default_lib0.c` and inserted performance counters
   * We added a new global variable `perf_cyc` to contain the measurement
   * We added calls to `init_global_perf_counter()`, `start_perf_counter()`, and `stop_perf_counter()`.
* We created `gdb_profiler.sh` for running the binary in gdb.
    * `gdb_profiler.sh` will log to a file called `profile.txt`
    * In its log, it will print out `perf_cyc` after it has hit `gdb_anchor` in `app/src/demo.c`

Let's look into the files to verify that:




In [None]:
print("'default_lib0.c' after adding performance counters with 'insert_profiler':")
IPython.display.HTML(get("build/codegen/host/src/default_lib0.c"))

In [None]:
print("'gdb_profiler.sh', generated by 'insert_profiler':")
IPython.display.HTML(get("gdb_profiler.sh"))
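
Besides reading the instrumented file by eye, you could also check mechanically that start and stop calls alternate. The helper below is a hypothetical sketch and not part of the profiler utilities. It only looks at the textual order of the calls, so it ignores control flow and would also count calls that appear in comments or prototypes.

```python
import re

CALL = re.compile(r"\b(start_perf_counter|stop_perf_counter)\s*\(")


def counters_are_paired(c_source: str) -> bool:
    """Return True if each start call is followed by a stop call before the next start."""
    running = False
    for match in CALL.finditer(c_source):
        if match.group(1) == "start_perf_counter":
            if running:  # two starts in a row
                return False
            running = True
        else:
            if not running:  # stop without a preceding start
                return False
            running = False
    return not running  # a trailing start without a stop is also an error


print(counters_are_paired("start_perf_counter(); run(); perf_cyc = stop_perf_counter();"))  # True
```

To apply it to the generated code, read `build/codegen/host/src/default_lib0.c` into a string and pass that string to `counters_are_paired`.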

### Getting the global performance of the simple binary

To get the performance of the binary, we will run it on Diana under gdb using the `gdb_profiler.sh` script.  

We first have to rebuild the binary with the newly added performance counters.

In a terminal, again run:
```bash
make -f Makefile.pulprt all
```


Now you can run the binary on Diana with:

INSERT CODE TO RUN THE BINARY HERE:

After running the binary, you can inspect the contents of `profile.txt`. In a terminal, run:
```bash
cat profile.txt
```
For how many cycles did you run the forward pass of your network?
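
If you prefer to process the log in Python, a sketch like the one below could extract the values and convert them to time. It assumes that the log contains lines in the form gdb uses for `print` results (`$1 = <value>`); the exact contents of `profile.txt` may differ. The sample log and the 100 MHz clock are placeholders, not measurements or Diana specifications.

```python
import re
from typing import List

# Matches gdb value-history lines such as "$1 = 12345".
GDB_VALUE = re.compile(r"^\$(\d+) = (-?\d+)[ \t]*$", re.MULTILINE)


def parse_gdb_values(log_text: str) -> List[int]:
    """Return all integer values printed by gdb, in the order they appear."""
    return [int(value) for _, value in GDB_VALUE.findall(log_text)]


def cycles_to_ms(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count to milliseconds for a given clock frequency."""
    return 1e3 * cycles / clock_hz


# Hypothetical log contents, for illustration only.
sample_log = "Breakpoint 1, gdb_anchor ()\n$1 = 12345\n"

# ASSUMPTION: placeholder clock frequency.
ASSUMED_CLOCK_HZ = 100e6

for cycles in parse_gdb_values(sample_log):
    print(f"{cycles} cycles = {cycles_to_ms(cycles, ASSUMED_CLOCK_HZ):.5f} ms")
```

For a real run, replace `sample_log` with the text read from `profile.txt`.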

## Exercise: Inserting individual kernel performance counters on Diana

It's nice to know how many cycles it takes to run a forward pass of your network.
However, in a lot of cases you'll want to know how much time is spent in an individual layer of your network. This can again be done with `insert_profiler()`.

To get more familiar with the profiler flow, we'll leave this part as an exercise for you.
In the next few cells, do the following:

1. Run the cell below to recompile a clean version of your network C code with TVM
2. Add performance counters with `insert_profiler()`. TIP: try changing the `measurement` parameter.
3. Look inside `build/codegen/host/src/default_lib1.c`:
   1. In which C function did `insert_profiler()` add performance counters? What does this C function do?
   2. How many performance counters were inserted? Look into the generated gdb script. What will the output of running this in gdb be?
   3. Which layers were created by Dory, and which ones by TVM?
4. Recompile the C code and run the binary on Diana. What performance do you get for each counter?

In [None]:
tvmc_compile_and_unpack(model, target=target, fuse_layers=fuse_layers, build_path='build')
create_demo_file(mod, 'app/src/demo.c')

In [None]:
#insert your solutions here!

#### Solution

Unhide the cells below once you have finished the exercise.

2. Set the `measurement` parameter to `individual`:

In [None]:
%%capture --no-display --no-stderr
# %%capture will hide some output of this cell 
measurement="individual"

# Other parameters remain unchanged
tvmc_compile_and_unpack(model, target=target, fuse_layers=fuse_layers, build_path='build')
create_demo_file(mod, 'app/src/demo.c')
insert_profiler(measurement=measurement,
                codegen_dir=codegen_dir,
                gdb_script_name=gdb_script_name,
                gdb_log_name=gdb_log_name,
                interactive=interactive,
                csv_file=csv_file)

3. Solutions:
    1. The performance counters were inserted in `tvmgen_default___tvm_main__`, which runs the forward pass of the network.
    2. In this case there are three calls each to `start_perf_counter();` and `stop_perf_counter();`, and three global counter variables were generated: `perf_cyc_tvm_0`, `perf_cyc_tvm_1`, and `perf_cyc_tvm_2`. `gdb_profiler.sh` will print all of these.
    3. `tvmgen_default_soma_dory_main_0` was created by Dory, but the others starting with `tvmgen_default_fused...` were created by TVM.
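
Once you have read the individual counters, a quick way to interpret them is to look at each kernel's share of the total. The sketch below assumes you have copied the counter values into a dictionary by hand; the function name `cycle_shares` is made up for this example, and the numbers are placeholders chosen for easy arithmetic, not measurements from Diana.

```python
from typing import Dict


def cycle_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each counter's fraction of the summed cycle count."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("all counters are zero, nothing to compare")
    return {name: cycles / total for name, cycles in counts.items()}


# Placeholder values for illustration only; replace them with your own readings.
example_counts = {
    "perf_cyc_tvm_0": 100,
    "perf_cyc_tvm_1": 300,
    "perf_cyc_tvm_2": 100,
}

for name, share in cycle_shares(example_counts).items():
    print(f"{name}: {share:.0%}")
```

A kernel with a large share is the first candidate to look at when you want to speed up the network.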