![MinJie Cycle](../images/03-performance/00-intro/overview-en.png)

Apart from the functional verification introduced earlier, performance verification and optimization are also crucial parts of processor development.

So in this chapter, we will introduce the MinJie (or agile) performance verification approaches used by our team and provide some demonstrations.

Similar to functional verification, we also summarize the performance verification process as an iterative cycle, as shown in the picture above:

We implement the RTL, run tests, evaluate performance, and analyze the results.

Of these steps, RTL implementation and running tests are straightforward, so we won't detail them in this tutorial.

Next, I will introduce the powerful tools we use in performance evaluation and performance analysis, and how we can speed up the optimization process: SimPoint, XSPerf, Top-Down, and Constantin.

![intro](../images/03-performance/01-checkpoint/intro-en.png)

Let's start with checkpointing. Here's the story:

To evaluate performance, we usually run benchmark suites via simulation (i.e. software simulation using Verilator, or hardware-accelerated simulation using an FPGA or emulator).

However, existing approaches each have their own challenges:
- Software simulation is too slow. For a complex design like XiangShan, it can only run at a few kHz, so it takes too long to run a benchmark;
- FPGAs have limited on-chip resources, making them difficult to use for complex designs like XiangShan;
- Emulators are too expensive for us, and probably for most of academia.

We have seen some works that try to accelerate software simulation or improve FPGA usability.

These are great works, but we think there's a much simpler way: checkpointing.

![method](../images/03-performance/01-checkpoint/method-en.png)

Checkpointing simply means selecting some segments of a program's execution and saving the architectural state (i.e. registers and memory) at the beginning of each segment. Later, when we want to do performance evaluation, we can simply load the saved state and start the simulation from there.

This brings 2 main benefits:

1. This reduces the number of instructions that need to be simulated. We're not running the entire program from the start, but only some segments of it.
2. Different segments from the same program can be simulated in parallel, thus increasing simulation parallelism.

By taking a weighted average of the performance data collected from each segment, we can estimate the overall performance.
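As a rough illustration of the weighted average described above, here is a minimal Python sketch. The function name `weighted_cpi` and the sample numbers are hypothetical and are not part of the XiangShan scripts; it assumes each checkpoint reports its weight, simulated cycles, and simulated instructions.

```python
def weighted_cpi(segments):
    """Estimate whole-program CPI from (weight, cycles, instructions) tuples."""
    total_weight = sum(w for w, _, _ in segments)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weight each segment's CPI by the share of execution it represents.
    return sum(w * cycles / insts for w, cycles, insts in segments) / total_weight


# Hypothetical measurements: (weight, cycles, instructions) per checkpoint.
segments = [
    (0.5, 40_000, 20_000),  # this segment alone has CPI 2.0
    (0.3, 20_000, 20_000),  # CPI 1.0
    (0.2, 10_000, 20_000),  # CPI 0.5
]
cpi = weighted_cpi(segments)
print(f"estimated CPI = {cpi:.2f}, estimated IPC = {1 / cpi:.2f}")
```

The sketch averages CPI rather than IPC because the weights describe shares of executed instructions; the IPC estimate is then the reciprocal of the weighted CPI.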

This slide shows 2 common methods for selecting segments:
1. Uniform sampling, i.e., selecting a segment every fixed number of instructions;
2. SimPoint sampling, i.e., selecting segments that can represent the overall behavior of the program by profiling.
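
The two selection methods above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the helper names are hypothetical, the behavior vectors are made up, and the cluster labels are assumed to be already known (the real SimPoint tool computes them by clustering basic block vectors).

```python
import math


def uniform_sample(total_insts, interval, period):
    """Start offsets of every `period`-th interval of `interval` instructions."""
    n_intervals = total_insts // interval
    return [i * interval for i in range(0, n_intervals, period)]


def pick_representatives(vectors, labels):
    """For each cluster, pick the interval closest to the centroid.

    Returns {label: (interval_index, weight)}, where weight is the fraction
    of all intervals that belong to that cluster.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    result = {}
    for label, members in clusters.items():
        dim = len(vectors[members[0]])
        centroid = [sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)]
        best = min(members, key=lambda i: math.dist(vectors[i], centroid))
        result[label] = (best, len(members) / len(labels))
    return result


# Hypothetical per-interval behavior vectors and their cluster labels.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
           [0.0, 1.0], [0.2, 0.8], [0.1, 0.9]]
labels = [0, 0, 0, 1, 1, 1]

print(uniform_sample(total_insts=100, interval=10, period=3))
print(pick_representatives(vectors, labels))
```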

Next, we will demonstrate how SimPoint profiles a program, generates checkpoints, and runs simulations using checkpoints.

This section uses some paths and constants that differ from those in `../env.sh`. For convenience, we have created a `01-env.sh` script, which we will use to set environment variables throughout this section. You can run the following cell to view these environment variables.

In [None]:
%%bash
source 01-env.sh

env | grep WORKLOAD= # workload to be simulated / profiled / checkpointed
env | grep CHECKPOINT_INTERVAL=
env | grep NEMU=
env | grep _HOME | tail
env | grep _PATH | tail

The first step in checkpointing is to compile the SimPoint tool and NEMU (in checkpoint mode), and to generate a checkpoint restorer.

In [None]:
%%bash
source 01-env.sh

cd ${NEMU_HOME}
git submodule update --init

# Compile simpoint generator
cd ${NEMU_HOME}/resource/simpoint/simpoint_repo
make clean
make

# Compile NEMU in checkpoint mode
cd ${NEMU_HOME}
make clean
make riscv64-xs-cpt_defconfig
make -j8

# Generate checkpoint restorer for ${WORKLOAD}
cd ${NEMU_HOME}/resource/gcpt_restore
rm -rf ${GCPT_PATH}
make -C ${NEMU_HOME}/resource/gcpt_restore/ \
    O=${GCPT_PATH} \
    GCPT_PAYLOAD_PATH=$(get_asset workload/${WORKLOAD}.bin) \
    CROSS_COMPILE=riscv64-linux-gnu-

(run)

SimPoint is used to select representative segments.

NEMU is used to profile the program and generate checkpoints.

The restorer acts like a bootloader, which loads the saved memory from simulated flash to main memory, and recovers registers.

Next, we need to run the program to be checkpointed using NEMU to collect program behavior for profiling.

In [None]:
%%bash
source 01-env.sh

rm -rf ${RESULT_PATH}

_LOG_PATH=${LOG_PATH}/profiling
mkdir -p ${_LOG_PATH}

${NEMU} ${GCPT} \
    -w ${WORKLOAD} \
    -D ${RESULT_PATH} \
    -C profiling \
    -b \
    --simpoint-profile \
    --cpt-interval ${CHECKPOINT_INTERVAL} \
    > >(tee ${_LOG_PATH}/${WORKLOAD}-out.txt) 2> >(tee ${_LOG_PATH}/${WORKLOAD}-err.txt)

Then, use SimPoint to perform clustering analysis on the collected program behavior, selecting segments.

In [None]:
%%bash
source 01-env.sh

CLUSTER=${RESULT_PATH}/cluster/${WORKLOAD}
mkdir -p ${CLUSTER}

random1=$(head -20 /dev/urandom | cksum | cut -c 1-6)
random2=$(head -20 /dev/urandom | cksum | cut -c 1-6)

_LOG_PATH=${LOG_PATH}/cluster
mkdir -p ${_LOG_PATH}

${SIMPOINT} \
    -loadFVFile ${PROFILING_RESULT_PATH}/${WORKLOAD}/simpoint_bbv.gz \
    -saveSimpoints ${CLUSTER}/simpoints0 \
    -saveSimpointWeights ${CLUSTER}/weights0 \
    -inputVectorsGzipped \
    -maxK 3 \
    -numInitSeeds 2 \
    -iters 1000 \
    -seedkm ${random1} \
    -seedproj ${random2} \
    > >(tee ${_LOG_PATH}/${WORKLOAD}-out.txt) 2> >(tee ${_LOG_PATH}/${WORKLOAD}-err.txt) 

Finally, use NEMU to rerun the program that needs to be checkpointed to generate checkpoint files.

In [None]:
%%bash
source 01-env.sh

CLUSTER=${RESULT_PATH}/cluster
_LOG_PATH=${LOG_PATH}/checkpoint
mkdir -p ${_LOG_PATH}

${NEMU} ${GCPT} \
    -w ${WORKLOAD} \
    -D ${RESULT_PATH} \
    -C checkpoint \
    -b \
    -S ${CLUSTER} \
    --cpt-interval ${CHECKPOINT_INTERVAL} \
    > >(tee ${_LOG_PATH}/${WORKLOAD}-out.txt) 2> >(tee ${_LOG_PATH}/${WORKLOAD}-err.txt)


Under the directory `${RESULT_PATH}/checkpoint`, you can see the generated checkpoint files: one `.gz` file per cluster, with the weight of the checkpoint indicated in the file name.

In [None]:
%%bash
source 01-env.sh

ls ${RESULT_PATH}/checkpoint

We can use emu to run one of the generated checkpoints and see the effect.

In [None]:
%%bash
source 01-env.sh

CHECKPOINT=$(find ${RESULT_PATH}/checkpoint/${WORKLOAD} -type f -name "*_.gz" | tail -1)

$(get_asset emu-precompile/emu) \
    -i ${CHECKPOINT} \
    --diff $(get_asset emu-precompile/riscv64-nemu-interpreter-so) \
    --max-cycles=50000 \
    2>/dev/null


When emu detects that the file is a gzip-compressed checkpoint, it will automatically decompress it and restore the memory state and architectural state from the checkpoint.
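
How emu performs this detection is not shown here. One common approach, sketched below in Python purely as an assumption, is to check for the two-byte gzip magic number (`0x1f 0x8b`) at the start of the file and choose the reader accordingly. The helper names and the demo files are hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def load_image(path):
    """Read a memory image, decompressing it transparently if it is gzipped."""
    opener = gzip.open if is_gzip(path) else open
    with opener(path, "rb") as f:
        return f.read()


# Demo with a tiny made-up payload stored both raw and gzip-compressed.
demo_dir = tempfile.mkdtemp()
raw_path = os.path.join(demo_dir, "image.bin")
gz_path = os.path.join(demo_dir, "image.gz")
payload = bytes(range(16))
with open(raw_path, "wb") as f:
    f.write(payload)
with gzip.open(gz_path, "wb") as f:
    f.write(payload)

print(is_gzip(raw_path), is_gzip(gz_path))
print(load_image(raw_path) == load_image(gz_path))
```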

# Performance counters in XiangShan

Purpose: Collect performance events for analysis and tuning

XSPerf:
- Accumulate:
  ```c
  if (valid)
    counter += diff;
  ```
- Histogram:
  ```c
  if (valid)
    distribution[value / step] += 1;
  ```
- Rolling:
  ```c
  if (valid)
    counters[segment] += diff;
  if (++cycles == segment_size) { // each segment spans exactly segment_size cycles
    cycles = 0;
    segment++;
  }
  ```

While running benchmarks, we need to collect and record hardware behavior (performance events) for analysis and tuning. 

In XiangShan RTL, we have implemented three types of performance counters. Their pseudocode is shown above:

- Accumulate: Basic counter that accumulates whenever a performance event occurs;
- Histogram: Records the distribution of values when performance events occur;
- Rolling: Works like a segmented Accumulate-type counter. It tracks how the number of performance events changes from segment to segment throughout the entire run.
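
To make the three counter types concrete, here is a small software model in Python. It is a sketch for illustration only and does not reproduce the XSPerf RTL: the class names, the `tick` method, and the sample trace are hypothetical, and the histogram simply ignores out-of-range values.

```python
class Accumulate:
    """Adds `diff` to a single counter whenever the event is valid."""

    def __init__(self):
        self.counter = 0

    def tick(self, valid, diff=1):
        if valid:
            self.counter += diff


class Histogram:
    """Counts how often a value falls into each [start, stop) bucket of width `step`."""

    def __init__(self, start, stop, step):
        self.start, self.stop, self.step = start, stop, step
        self.buckets = [0] * ((stop - start + step - 1) // step)

    def tick(self, valid, value):
        if valid and self.start <= value < self.stop:
            self.buckets[(value - self.start) // self.step] += 1


class Rolling:
    """Accumulates per segment, closing a segment every `segment_size` cycles."""

    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.cycles = 0
        self.current = 0
        self.completed = []

    def tick(self, valid, diff=1):
        if valid:
            self.current += diff
        self.cycles += 1
        if self.cycles == self.segment_size:
            self.completed.append(self.current)
            self.current = 0
            self.cycles = 0


# Hypothetical trace: number of instructions committed in each of 8 cycles.
commits = [2, 0, 1, 3, 0, 0, 4, 1]
acc, hist, roll = Accumulate(), Histogram(0, 6, 2), Rolling(segment_size=4)
for n in commits:
    acc.tick(n > 0, n)
    hist.tick(n > 0, n)
    roll.tick(n > 0, n)

print(acc.counter, hist.buckets, roll.completed)
```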

## Accumulate & Histogram

These two types of performance counters are printed to stderr when the simulation ends. 

Example (Accumulate): the total number of instructions committed

```scala
def ifCommitReg(counter: UInt): UInt = Mux(isCommitReg, counter, 0.U)
XSPerfAccumulate("commitInstr", ifCommitReg(trueCommitCnt), XSPerfLevel.CRITICAL)
```

Example (Histogram): the distribution of L2 Cache acquire latency

```scala
XSPerfHistogram("acquire_period", acquire_period, acquire_period_en, 0, 30, 1, true, true)
XSPerfHistogram("acquire_period", acquire_period, acquire_period_en, 30, 100, 5, true, true)
XSPerfHistogram("acquire_period", acquire_period, acquire_period_en, 100, 200, 10, true, true)

```

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

$(get_asset emu-precompile/emu) \
    -i $(get_asset workload/hello-riscv64-xs.bin) \
    --no-diff 2>stderr.log | tail

echo "=== Last 10 lines:"
tail -n 10 stderr.log

echo "=== Example of XSPerfAccumulate: rob commitInstr"
grep -n "rob: commitInstr," stderr.log | tail

echo "=== Example of XSPerfHistogram: l2cache acquire period"
grep -n "l2cache.slices_0.mshrCtl: acquire_period" stderr.log | tail

You can run this cell to see examples of XSPerf.

## Rolling

![rolling](../images/03-performance/02-xsperf/rolling-en.png)

This type of performance counter utilizes the ChiselDB framework introduced in 02-functional/04-chiseldb to store the collected data into a SQLite3 database file.

To enable RollingDB, you need to specify `WITH_ROLLINGDB=1` during compilation and use the `--dump-db` parameter at runtime.

The previous two types of counters only report results for the whole run, so they cannot show how behavior differs between segments of a program's execution. As a result, we may miss the impact of a microarchitecture modification on a specific segment (a critical region). This is why we need rolling analysis.

⚠️Note: If you are reading this notebook on the tutorial demo server, please do not recompile XiangShan, as it will take a long time and consume a lot of computing resources.

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

mkdir -p ${WORK_DIR}/03-performance/02-xsperf

# for tutorial: copy a pre-generated rolling db to tutorial dir
cp $(get_asset emu-perf-result/xs-perf-rolling.db) \
    ${WORK_DIR}/03-performance/02-xsperf/xs-perf-rolling.db

Here we just copy the pre-generated rolling db file to our workspace. In the code-server you're using now, you can see the commands used to generate it.

After obtaining the database file, we use a Python script to analyze it.

In the following example, we use the rollingplot.py script to plot IPC data.

The RTL code that gathers IPC data in XiangShan is as follows:

```scala
// every 1000 cycles
XSPerfRolling("ipc", ifCommitReg(trueCommitCnt), 1000, clock, reset)
```
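
As a rough illustration of how such rolling samples turn into an IPC curve, the Python sketch below divides each segment's committed-instruction count by the segment length (1000 cycles in the snippet above). The function names and the sample counts are hypothetical; this is not the logic of rollingplot.py.

```python
def rolling_ipc(commits_per_segment, segment_cycles=1000):
    """Convert per-segment commit counts into per-segment IPC values."""
    return [count / segment_cycles for count in commits_per_segment]


def slowest_segment(ipc_values):
    """Index of the segment with the lowest IPC, a candidate critical region."""
    return min(range(len(ipc_values)), key=ipc_values.__getitem__)


# Hypothetical commit counts for three consecutive 1000-cycle segments.
samples = [1520, 2310, 480]
ipc = rolling_ipc(samples)
print(ipc, "slowest segment:", slowest_segment(ipc))
```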

In [None]:
%%bash
cd .. && source env.sh
cd ${WORK_DIR}/03-performance/02-xsperf

# Use python scripts to analyze the rolling db, for example, plot ipc
python3 ${NOOP_HOME}/scripts/rolling/rollingplot.py \
    ./xs-perf-rolling.db \
    ipc

ls -lh ${WORK_DIR}/03-performance/02-xsperf/results/perf.png

The script outputs the following image, showing the IPC changes of XiangShan over time while running this program:

![perf](../work/03-performance/02-xsperf/results/perf.png)

If the image does not load correctly, you can try closing the notebook and reopening it.

# Top-Down

Purpose: Organize perf events in hierarchical form

![Example](../images/03-performance/03-topdown/example-en.png)

Top-Down is a common performance analysis method that organizes fragmented performance events into a hierarchical form to more accurately analyze the impact of individual performance events on overall processor performance.
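
The hierarchical idea can be sketched in a few lines of Python. The category names below follow the generic Top-Down method and the slot counts are invented; they are assumptions for illustration and are not XiangShan's actual counters or results.

```python
def flatten(tree, total, depth=0):
    """Yield (depth, name, share_of_total) rows for a nested counter tree.

    `tree` maps a category name to (slots, children), where children is
    another tree that breaks the category down further.
    """
    for name, (slots, children) in tree.items():
        yield depth, name, slots / total
        yield from flatten(children, total, depth + 1)


# Hypothetical slot counts.
tree = {
    "Retiring": (450, {}),
    "BadSpeculation": (100, {}),
    "FrontendBound": (150, {}),
    "BackendBound": (300, {
        "MemoryBound": (220, {"L2Miss": (130, {}), "Other": (90, {})}),
        "CoreBound": (80, {}),
    }),
}

total = sum(slots for slots, _ in tree.values())
rows = list(flatten(tree, total))
for depth, name, share in rows:
    print(f"{'  ' * depth}{name:<16}{share:6.1%}")
```

Because every child is expressed as a share of the same total, a leaf such as the hypothetical `L2Miss` can be compared directly against top-level categories.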

Based on the XSPerfAccumulate introduced in the 02-xsperf section, we have implemented a set of Top-Down counters optimized for the XiangShan microarchitecture and RISC-V instruction set in RTL to help us better model the XiangShan microarchitecture and align it with XS-GEM5.

In the `${NOOP_HOME}/scripts/top-down` directory, we have also implemented some analysis scripts that you can run to extract Top-Down results, plot graphs, etc.

⚠️Note: If you are reading this notebook on the tutorial demo server, please do not run the analysis scripts as they will perform a large amount of disk access.

In [None]:
%%bash
cd .. && source env.sh

mkdir -p ${WORK_DIR}/03-performance/03-topdown
cd ${WORK_DIR}/03-performance/03-topdown

# for tutorial: copy analysis results
cp -r $(get_asset emu-spec-topdown-result/results) ./

echo === results ===
ls ./results

echo === first 10 lines of results.csv ===
head -n 10 ./results/results.csv

echo === first 10 lines of results-weighted.csv ===
head -n 10 ./results/results-weighted.csv

Again, we just use the pre-generated results here.

The script outputs the following image:

![perf](../work/03-performance/03-topdown/results/result.png)

If the image does not load correctly, you can try closing the notebook and reopening it.

# Constantin

Purpose: Speed up design space exploration (by reducing re-compilation)

![Overview](../images/03-performance/04-constantin/overview-en.png)

Sometimes we want to test performance under different parameters.

We may use a cycle-accurate simulator, but as it may not be 100% accurate, we sometimes want to test directly on RTL.

However, it is very time-consuming to recompile every time we adjust only 1 or 2 parameters, even though most of the RTL is unchanged. Is there any way to change parameters without recompiling?

We present Constantin, which is based on the DPI-C interface and uses C++ functions and Chisel's BlackBox mechanism to configure parameters during runtime initialization.

Replacing a Scala parameter with Constantin looks roughly like this:

```scala
/* *** w/o Constantin *** */
val enableSomeModule = WireInit(false.B) // change to true.B, re-compile and re-run

/* *** w/ Constantin *** */
// in RTL
val enableSomeModule = WireInit(Constantin.createRecord("enableSomeModule", initValue = false))
// in constantin.txt
enableSomeModule 0 // change to 1, re-run
```

To enable Constantin, you need to use the `WITH_CONSTANTIN=1` option when compiling emu.

Currently, the Constantin configuration file cannot be passed as an emu argument; it must be located at `${NOOP_HOME}/build/constantin.txt`.
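
When sweeping parameters, it can be handy to generate this file from a script. The Python sketch below writes one `name value` pair per line, matching the format used in this section. The helper names are hypothetical, and the demo writes to a temporary directory rather than to `${NOOP_HOME}/build`.

```python
import os
import tempfile


def write_constantin(path, params):
    """Write one `name value` pair per line."""
    with open(path, "w") as f:
        for name, value in params.items():
            f.write(f"{name} {int(value)}\n")


def read_constantin(path):
    """Parse a file written by write_constantin back into a dict."""
    with open(path) as f:
        pairs = (line.split() for line in f if line.strip())
        return {name: int(value) for name, value in pairs}


# Demo in a temporary directory; a real run would target
# ${NOOP_HOME}/build/constantin.txt instead.
demo_dir = tempfile.mkdtemp()
config_path = os.path.join(demo_dir, "constantin.txt")
write_constantin(config_path, {"enableSomeModule": True})
print(read_constantin(config_path))
```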


The following example uses Constantin to toggle a component of the branch predictor (the uBTB, via `enableUbtb`). You can run this cell to compare the results with it on and off.

In [None]:
%%bash
cd .. && source env.sh

mkdir -p ${NOOP_HOME}/build

# run with default parameter
rm -f ${NOOP_HOME}/build/constantin.txt || true
$(get_asset emu-precompile/emu-constantin) \
    -i $(get_asset workload/coremark-2-iteration.bin) \
    -C 10000 \
    --no-diff \
    2>/dev/null

# run with the uBTB in the branch predictor turned off
echo "enableUbtb 0" > ${NOOP_HOME}/build/constantin.txt
$(get_asset emu-precompile/emu-constantin) \
    -i $(get_asset workload/coremark-2-iteration.bin) \
    -C 10000 \
    --no-diff \
    2>/dev/null

## Autosolving
Purpose: automatically find the best parameter configuration under the current microarchitecture.

Steps:
- Enable in Constantin configuration (refer to `./04-autosolving.patch`);
- Compile emu with `WITH_CONSTANTIN=1`;
- Provide config file.
  - Parameter name, bit width, initial value.
  - Performance counter name, optimization strategy.
  - Workload, etc.

We also implemented Autosolving for Constantin, which can automatically find the best parameter configuration under the current microarchitecture.

To use Autosolving, you need to enable it in the Constantin configuration (refer to `./04-autosolving.patch`), and compile emu with `WITH_CONSTANTIN=1`.

After enabling Autosolving, emu will read the Constantin configuration from stdin instead of a txt file. This allows our Python script to run emu automatically with specific configurations and search for the optimal one.

You need to provide a configuration file for the script (refer to `04-autosolving-config.json`), including:
- Descriptions of configurable parameters
  - Parameter name
  - Bit width
  - Initial value
- Optimization goals
  - Performance counter name
  - Strategy (minimize, maximize)
  - Baseline
- Genetic algorithm parameters
- emu running parameters
  - workload
  - Maximum number of instructions
  - Number of threads

Our script uses a genetic algorithm for parameter exploration, and it is also easy to implement other algorithms (such as ant colony or particle swarm optimization).
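
For readers unfamiliar with genetic algorithms, here is a minimal, self-contained Python sketch of the idea. It is not the implementation in `constantHelper.py`: the function `genetic_search`, its parameters, and the toy fitness function are all hypothetical, and a real fitness function would run emu and read a performance counter instead.

```python
import random


def genetic_search(widths, fitness, generations=30, pop_size=12,
                   mutation_rate=0.2, seed=0):
    """Maximize `fitness` over integer parameters bounded by their bit widths."""
    rng = random.Random(seed)
    names = list(widths)

    def random_individual():
        return {n: rng.randrange(1 << widths[n]) for n in names}

    def crossover(a, b):
        # Uniform crossover: take each parameter from either parent.
        return {n: (a if rng.random() < 0.5 else b)[n] for n in names}

    def mutate(ind):
        for n in names:
            if rng.random() < mutation_rate:
                # Flip one random bit, so the value stays within its bit width.
                ind[n] ^= 1 << rng.randrange(widths[n])
        return ind

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]  # keep the better half unchanged
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    return max(population, key=fitness)


# Toy stand-in for "run emu and read a counter": best at a=5, b=3.
def toy_fitness(cfg):
    return -((cfg["a"] - 5) ** 2) - (cfg["b"] - 3) ** 2


widths = {"a": 4, "b": 3}  # hypothetical parameters with 4-bit and 3-bit ranges
best = genetic_search(widths, toy_fitness)
print(best, toy_fitness(best))
```

Since the better half of each generation is carried over unchanged, the best configuration found never gets worse from one generation to the next. In a real setup each fitness evaluation is a full simulation, so caching results per configuration would be worthwhile.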

In [None]:
%%bash
cd .. && source env.sh && cd - >/dev/null

echo === Patch ===
cat ./04-autosolving.patch

echo === Config ===
cat ./04-autosolving-config.json

echo === Run ===
mkdir -p ${NOOP_HOME}/build
cp $(get_asset emu-precompile/emu-autosolving) ${NOOP_HOME}/build/emu
python3 ${NOOP_HOME}/scripts/constantHelper.py ./04-autosolving-config.json

You can run this cell to see how Autosolving works.