# Evaluation Guide

If you've written a kernel and want to get estimates on how well it performs on the DECADES architecture, this is the guide for you!

We've wrapped up some nice presets, the compilation run, and the simulator run into two packages (one for the Numba route and one for the TensorFlow route). We've even got numbers from some baseline implementations that you can compare against.

We'll be using the two examples from the Numba example [documentation](Numba_Decades_Examples.ipynb), i.e. matrix multiplication and reduction. You would see a very similar interface and output for a TensorFlow application.

## Specification

Our evaluation tools expect your Python scripts to follow a certain specification so that we can keep things consistent between applications. The specification is:

`<app_name>.py`

Essentially, your script should not take in any arguments.

### Script and Input Naming

With respect to what you name the script: please use the application name as the script name _if you want to compare against a baseline_.

We use a simple text analysis to try to automatically determine which baseline you are aiming to implement, but it is not perfect.

For example, if you are programming the application "Scan Statistics", please make sure your script name contains "Scan_Statistics" in it somewhere.
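The evaluator's actual matching logic is not shown in this guide. Purely as an illustration of why the application name needs to appear in the script name, a name-based lookup could look something like the sketch below. `KNOWN_BASELINES` and `guess_baseline` are hypothetical names, and the list of applications is a placeholder rather than the real set of stored baselines.

```python
import os
import re

# Hypothetical list of application names with stored baselines (illustrative only).
KNOWN_BASELINES = ["Scan_Statistics", "mat_mul", "reduction"]


def _normalize(name):
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def guess_baseline(script_path):
    """Return the first known application name found in the script name, else None."""
    stem = os.path.splitext(os.path.basename(script_path))[0]
    normalized_stem = _normalize(stem)
    for app in KNOWN_BASELINES:
        if _normalize(app) in normalized_stem:
            return app
    return None


print(guess_baseline("my_Scan_Statistics_v2.py"))  # Scan_Statistics
print(guess_baseline("experiment.py"))             # None
```

A script called `experiment.py` gives a matcher like this nothing to work with, which is why the application name should appear somewhere in the file name.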

### Baselines

We currently have some baseline numbers hardcoded in for applications of interest. Please contact us ASAP if you believe you should have a baseline to compare against, but the program is unable to find one.

Finally, if you are running an application with the Numba framework, our evaluator will automatically set some default presets for you that we think will mirror our 2020 system. To use these presets, please remove all `DEC_Options` settings from your script (see [here](Decades_Numba_Pipeline.ipynb)) and add the single line: `DEC_Options.preset_config()`.

### Computation Representations

For many applications, it will be infeasible to run our cycle-accurate simulator to completion for the full application. When you want to run your application through the DECADES simulator for evaluation purposes, you should run a small representative computation. For example, a TensorFlow application should simulate only one item (e.g. one image) of computation (forward and backward propagation). The evaluation metric of interest, GOPs/Watt, is time-agnostic. This means that, assuming you have selected an appropriate computation sample, the measured average power of this computation remains the same over multiple samples and epochs. For Numba applications where multiple kernel iterations are involved, we recommend simulating one iteration of the kernel.
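To see why a representative sample is enough, here is a small arithmetic sketch. It assumes GOPs means giga-operations per second, so that dividing by average power in watts gives a ratio that does not change when the same sample is repeated. The helper `gops_per_watt` and all of the numbers are made up for illustration; this is not the evaluator's own calculation.

```python
def gops_per_watt(total_ops, runtime_s, avg_power_w):
    """Giga-operations per second, divided by average power in watts."""
    gops = total_ops / runtime_s / 1e9
    return gops / avg_power_w


# Made-up numbers for one representative sample (e.g. one kernel iteration).
ops_per_sample = 2.0e9
seconds_per_sample = 0.5
avg_power_w = 4.0

one_sample = gops_per_watt(ops_per_sample, seconds_per_sample, avg_power_w)

# Repeating the same sample 100 times scales operations and runtime together,
# while the average power stays the same, so the ratio is unchanged.
hundred_samples = gops_per_watt(
    100 * ops_per_sample, 100 * seconds_per_sample, avg_power_w
)

print(one_sample, hundred_samples)
```

Both calls give the same value, which is the sense in which the metric is time-agnostic. This only holds if the sample really is representative, i.e. its operation mix and average power match the full run.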

## Running the Evaluator 

If your script fits the above specification, then you can run it through the evaluator! We have two very similar programs: 1) `decades_numba_presets` for applications suitable for the DECADES Numba framework and 2) `decades_tf_presets` for DECADES TensorFlow applications. Instead of running `python`, you can now run one of these programs!

For example, we previously tested `mat_mul.py` by running:

`$ python mat_mul.py`

To run this example through the evaluator, we can now run:

`$ decades_numba_presets mat_mul.py`

Note: this application is small enough that we can run it in its entirety without needing to focus on a smaller computation representation.

This script does both a compiler and a simulator pass and may take quite some time to run, so _please be patient_. For extremely large datasets, this could take up to several hours. We hope you will try evaluating on smaller datasets first to grow comfortable with the system.

Luckily, our two examples are small enough to run in this framework in about a minute.

## Outputs

If we try running our matrix multiplication using the command:

`$ decades_numba_presets mat_mul.py`

We should see some outputs about running the application, compiler, simulator, etc. The important output at the end should look something like this:


So this means our DECADES architecture achieved approximately a 5x increase in energy efficiency over a baseline matrix multiplication run on the baseline system! Pretty cool!

You can play with the implementation of matrix multiplication to see if you can improve or worsen this value based on design decisions.

#### Note

The metric obtained here is just GOPs/Watt, or giga operations per watt. It's not necessarily fast processing that raises this number. For example, a multithreaded application that does not scale well might actually lower this number (due to more cores using more energy), even if it computes slightly faster.
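As a rough worked example of that trade-off, the sketch below compares a single-tile run with a four-tile run that finishes a little sooner but draws much more power. It treats the metric as giga-operations per joule (operations divided by energy, where energy is average power times runtime), which is an assumption about how the ratio is formed. The helper name and every number are invented for illustration.

```python
def giga_ops_per_joule(total_ops, runtime_s, avg_power_w):
    """Operations divided by energy used, where energy = power * time."""
    energy_j = avg_power_w * runtime_s
    return total_ops / energy_j / 1e9


total_ops = 1.0e9  # the same amount of work in both runs

# Invented numbers: four tiles finish 20% sooner but draw three times the power.
single_tile = giga_ops_per_joule(total_ops, runtime_s=1.0, avg_power_w=1.0)
four_tiles = giga_ops_per_joule(total_ops, runtime_s=0.8, avg_power_w=3.0)

print(f"single tile: {single_tile:.3f}")
print(f"four tiles:  {four_tiles:.3f}")
print("four tiles faster but less efficient:", four_tiles < single_tile)
```

With these numbers the four-tile run is faster yet scores lower, because the extra energy outweighs the time saved. An application that scales well would shrink the runtime enough to reverse that.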

Our DECADES chip uses a low-power, in-order processor design to help increase this metric. We encourage you to play with multithreaded examples and different implementation choices. 

#### Inputs

If you want to run your script on a different input, please go ahead and try! The only issue is that the baseline is stored only for one input per application. Thus the comparison will not be accurate at all, but read on to the next section to learn how to get raw outputs!

## Verbose Output

We think you might be interested in other metrics, even if you can't compare to the baseline, so we've added a way for you to do this. Simply use the same evaluator script, but set the environment variable `DECADES_VERBOSE` to the value `1`.
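If you launch the evaluator from another Python script, one way to set the variable is to pass a modified environment to `subprocess.run`. This is only a sketch: `verbose_env` and `run_verbose` are hypothetical helper names, and `run_verbose` assumes `decades_numba_presets` is on your `PATH`.

```python
import os
import subprocess


def verbose_env(base_env=None):
    """Copy an environment mapping and set DECADES_VERBOSE to "1"."""
    env = dict(os.environ if base_env is None else base_env)
    env["DECADES_VERBOSE"] = "1"
    return env


def run_verbose(script="mat_mul.py", evaluator="decades_numba_presets"):
    """Run the evaluator on a script with verbose output enabled.

    Assumes the evaluator executable is available on PATH.
    """
    return subprocess.run([evaluator, script], env=verbose_env(), check=True)


# Only the environment is built here; call run_verbose() to launch the evaluator.
print(verbose_env({})["DECADES_VERBOSE"])
```

From a POSIX shell, the usual equivalent is to prefix the command with the assignment: `DECADES_VERBOSE=1 decades_numba_presets mat_mul.py`.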

This produces a more interesting table, which includes other metrics such as the number of cycles and power measurements:

There are still a lot of `NA` values in there because a baseline value is not available, but this feature allows you to peek under the hood and see more metrics and gain more insight on how your application is running through the DECADES framework.

## Setting Multiple Tiles (Threads) for Numba Applications

The safest option for the evaluator is to run your kernel single-threaded. That's okay: there are still a lot of transparent optimizations that happen (e.g. different hardware technology and supply/compute decoupling).

If you are adventurous (and we hope you will be!), you can explicitly tell your script to run with multiple tiles. If your application scales well, then you will likely see these metrics change in your favor!

To do this, after the `DEC_Options.preset_config()`, simply add a call to `DEC_Options.set_num_tiles(N)`, where N is the number of tiles you want to run with.

If you don't remember the parallel framework, please jog your memory by looking at the kernel execution documentation [here](Decades_Numba_Pipeline.ipynb).

## Simulating Without the Evaluator

You can run the simulator (Pythia) without the evaluator scripts. The command is `pythiarun` and it will be in the Docker path. Running with the `-h` flag will show you a list of options. You can additionally examine the readme in: `/decades/simulator/README.md`

## Final Thoughts

Using our transparent optimizations (e.g. hardware technology and decoupling), multiple tiles, and clever algorithmic choices, we have been able to see impressive gains over the many baseline implementations (over 1000x in some cases)!

We hope you have a similar experience. Please contact us with any questions or comments!