## Setup instructions

Website @ HPCA'26: https://tutorial.xiangshan.cc/hpca26/hands_on/setup/

1. Open https://t.xiangshan.cc in your browser
2. Enter password: **TO BE DISCLOSED ON SITE**
3. By default you should see a terminal
   - If not, click Menu (triple dashes) on the top-left corner, then "Terminal > New Terminal"
4. Run `./start.sh` in the terminal
5. This will create a unique random id for you, and create your workspace at `/data/${random_id}`
6. When your workspace is created, it should be opened automatically
   - If not, click "Menu > File > Open Folder", enter `/data/${random_id}`, then click "OK"
7. If you run into any problem, feel free to ask the instructors for help at any time

## 00-Welcome to XiangShan Tutorial

In this section, we will give some notes on this tutorial.

- Cells that start with %%bash are Bash scripts; the rest are Python code.

- Lines that start with # are comments.

You can click ▶ in the top-left corner of a cell to run that single cell; the output will be displayed below the cell.

Running a cell requires selecting a Python environment. Choose Python 3.12.3 (the first one); you may need to select it again each time you open a new notebook.

In [None]:
%%bash
echo "Welcome to the XiangShan Tutorial!"

Each cell has its own working directory and environment variables, so some commands may need to be rerun. If you execute these commands directly in the shell, you can skip the repetitive parts, e.g. `source env.sh`.

In [None]:
%%bash
# Change the working directory in this cell.
cd ../
pwd

In [None]:
%%bash
# Changing the working directory in other cells does not affect this cell.
pwd
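
The isolation shown above can also be reproduced outside the notebook. The following is a minimal sketch, assuming that each `%%bash` cell runs as its own shell process; the helper name `run_cell` is ours and not part of the tutorial environment. A `cd` inside one process does not carry over to the next one.

```python
import os
import subprocess


def run_cell(script: str) -> str:
    """Run a script in a fresh shell process, similar to one %%bash cell."""
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


start_dir = os.path.realpath(os.getcwd())
first = run_cell("cd / && pwd -P")  # changes directory inside this process only
second = run_cell("pwd -P")         # a new process starts from the original directory

print(first)
print(second == start_dir)
```

This is also why `source env.sh` has to be repeated at the top of every cell: exported variables live only as long as the shell process that set them.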

XiangShan has design documentation synchronized with development; the GitHub repository is [https://github.com/OpenXiangShan/XiangShan-Design-Doc](https://github.com/OpenXiangShan/XiangShan-Design-Doc)

We have also deployed the design documentation website at [https://docs.xiangshan.cc/projects/design](https://docs.xiangshan.cc/projects/design)


## 01-First Run

In this section, we present the basic workflow for building and running XiangShan.

The bootcamp repository contains the environment setup scripts necessary for compiling and running XiangShan and can be cloned directly from GitHub.

In [None]:
%%bash
# For this tutorial, the local directory has been preconfigured;
# therefore, you do not need to execute these commands.
# The following commands are provided for reference.

# git submodule update --init --recursive # init submodule

Now we can get started!

The build and execution of XiangShan rely on specific environment variables, which are provisioned by the `env.sh` script. This script must be sourced whenever a new terminal session is started; to automate this, you can add it to your `.bashrc`. As shown in Section 00-welcome, within this tutorial each cell constitutes a fresh Bash environment; therefore, the script must be re-sourced in every cell.

In [None]:
%%bash
cd ../ && source env.sh

env | grep _HOME

Running the code block above completes the environment variable setup. After the setup, go to `$NOOP_HOME` (`xs-env/XiangShan`) to build XiangShan. 


The build parameters will be introduced later.

We can also use the tree command to view the project structure.

In [None]:
%%bash
tree -d -L 1 ..

XiangShan provides hundreds of user-configurable parameters, including:

- The processor core parameters. (`src/main/scala/top/Configs.scala`)
- The SoC parameters. (`src/main/resources/config/Default.yaml`)

Press Ctrl+P to open file search, then type the file name above to jump to it quickly.

With the configuration finalized, we can proceed to build XiangShan!

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

# Warning⚠️：Building XiangShan is highly resource‑intensive;
#            For this tutorial, we’ve prepared a precompiled binary for you.
# Reference setup: 16 CPU cores, 64 GB RAM.
# make emu -j16 CONFIG=MinimalConfig

# Additional build options
# CONFIG=MinimalConfig  XiangShan configuration
# EMU_THREADS=4         Simulation thread count
# EMU_TRACE=1           Enable waveforms
# WITH_DRAMSIM=1        Simulate DRAM with DRAMSim3
# WITH_CHISELDB=1       Enable ChiselDB
# WITH_CONSTANTIN=1     Enable Constantin

The commands above will generate outputs like `build/emu` and `build/rtl`.

- `build/rtl/*.sv` are the Verilog files generated by Chisel.
- `build/emu` is the simulation executable compiled from them with Verilator.

You can run `./build/emu` to simulate XiangShan. 

Since we haven’t built emu in this tutorial, we’ll use the precompiled emu.

We’ll introduce the run-time arguments later.

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

$(get_asset emu-precompile/emu) \
    -i $(get_asset workload/hello-riscv64-xs.bin) \
    --no-diff \
    2>/dev/null | outputBuffer

# Some key runtime parameters.
# -i                        Workload path
# -C / -I                   Maximum cycle count / Maximum instruction count
# --diff=PATH / --no-diff   Reference model path / Disable difftest

## 02-Build the workload using Nexus-AM


XiangShan is a bare-metal machine: running programs on it requires a runtime environment, which an operating system would normally provide. However, operating systems like Linux are too **heavyweight** and inconvenient for rapid testing and iteration.

**We present a bare metal runtime environment called Nexus-AM**

**Purpose**

- Generate workloads agilely without an OS

- Provide runtime framework for bare metal machines like XiangShan




Nexus-AM is a bare-metal runtime and test-generation environment. It is lightweight and easy to use, implements basic system call interfaces and exception handlers, and supports multiple ISAs and configurations. 

The `am/` directory contains the Nexus-AM framework sources; `apps/` and `tests/` hold common workload sources, and you can create your own apps and tests.

In [None]:
%%bash
cd .. && source env.sh
cd ${AM_HOME}

tree -d -L 1

echo apps: $(ls ./apps)
echo tests: $(ls ./tests)

We start with the "Hello, XiangShan" sample (`apps/hello`). Replace "Hello, XiangShan" with "Welcome to XiangShan Tutorial" 

Then compile Nexus-AM.

In [None]:
%%bash
cd ../ && source env.sh >/dev/null
cd $AM_HOME/apps/hello

# Use sed to replace "Hello, XiangShan" with "Welcome to XiangShan Tutorial".
sed -i 's/Hello, XiangShan/Welcome to XiangShan Tutorial/' hello.c

# compiling
make ARCH=riscv64-xs LINUX_GNU_TOOLCHAIN=1

# check output
ls -l build

After compilation, the following three files are generated:
- `hello-riscv64-xs.bin`: the program's binary image for emu (the ELF header and other metadata are removed).
- `hello-riscv64-xs.elf`: the program's ELF file.
- `hello-riscv64-xs.txt`: the program's disassembly, for debugging.

In [None]:
%%bash
cd ../ && source env.sh >/dev/null
cd $NOOP_HOME

# Use emu to run workload.
$(get_asset emu-precompile/emu) -i $AM_HOME/apps/hello/build/hello-riscv64-xs.bin --no-diff 2>/dev/null | outputBuffer

## 03-Run the RTL simulation

XiangShan's emu supports many options; run `emu --help` to see the usage.

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

$(get_asset emu-precompile/emu) --help | outputBuffer

Unlike the 01-first-run section, in this section we'll run a more complex program: CoreMark (2 iterations). The binary is already prepared in the `XiangShan/ready-to-run` folder.

Running the full CoreMark can take 5–10 minutes. To save time, use the `-C` option to cap the simulation at 20,000 cycles.

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

$(get_asset emu-precompile/emu) \
    -i $(get_asset workload/coremark-2-iteration.bin) \
    --no-diff \
    -C 20000 \
    2>/dev/null | outputBuffer

We've also prepared a fault-injection simulation program for XiangShan; feel free to try running it.

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

# err 1
$(get_asset emu-precompile/emu-alu-err) \
    -i $(get_asset workload/coremark-2-iteration.bin) \
    --no-diff \
    -C 20000 \
    2>/dev/null | outputBuffer || true

The faulty simulation program does not correctly print “Running CoreMark for 2 iterations,” and when it reaches the 20,000-cycle limit, the PC is 0xE, clearly not the behavior of a normal program.



<span style="color:red; font-weight:700; font-size:24px;">
Great! 

We have learned the basic simulation process of XiangShan.
</span>

<span style="color:black; font-weight:500; font-size:24px;">
    
Once we get the RTL simulator ready, what's next? 

How can we 

- find and fix functional bugs
- do performance analysis
- do research on micro-architecture


</span>

Getting the RTL simulator running is just the first step. We also want to find and fix functional bugs, analyze performance, and do research on micro-architecture.

The following sections present our solutions.


![MinJie Cycle](../images/02-functional/00-intro/overal-en.png)

As mentioned before, we use the MinJie platform to construct the workflow; the picture shows its functional verification toolchain.

The functional verification loop usually includes four parts, and we provide tools for each of them:

- Test generation, the head of the loop: Nexus-AM generates bare-metal tests.
- Bug detection: NEMU provides the golden results, and DiffTest compares them against the results of the design.
- Bug context preservation: LightSSS, whose usage we will demonstrate.
- Troubleshooting and bug fixing, the tail of the loop: waveforms and ChiselDB.


## 01-NEMU: ISA Reference

To address the issue “How does the simulation program know it has already encountered an error?”, we must first define what “correct” means; in other words, we need a reference model.

We developed NEMU, a Spike-like ISA simulator. With targeted optimizations, NEMU achieves QEMU-class performance and exposes APIs to compare and verify XiangShan's architectural state.

In this section, we demonstrate how to compile and run NEMU.


NEMU provides two kinds of default configurations, where `xxx` stands for the target (for example `riscv64-xs`):
- `xxx_defconfig`: the default configuration for standalone mode
- `xxx-ref_defconfig`: the default configuration for DiffTest co-simulation (reference) mode

In [None]:
%%bash
cd .. && source env.sh
cd ${NEMU_HOME}

make clean

# compile default config as standalone mode
make riscv64-xs_defconfig
make -j

make clean-softfloat

# compile default config as reference mode
make riscv64-xs-ref_defconfig
make -j

Execute CoreMark on NEMU.

Use the -b option to start NEMU in batch mode and avoid manually entering commands to run the workload.

In [None]:
%%bash
cd .. && source env.sh
cd ${NEMU_HOME}

./build/riscv64-nemu-interpreter \
    -b \
    $(get_asset workload/coremark-2-iteration.bin) | outputBuffer

## 02-Difftest: ISA Co-simulation framework

To detect when the simulation goes wrong, we introduce DiffTest, an ISA co-simulation framework. The flow is as follows: whenever the RTL core (DUT) commits an instruction or updates state, the ISA simulator (REF) executes the same instruction, and DiffTest compares the architectural state of the DUT and the REF. On any mismatch it halts and reports an error; otherwise it continues.
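
The comparison loop described above can be sketched in a few lines of Python. This is a toy model, not DiffTest's actual implementation: `ArchState`, `execute`, the single `addi`-style instruction, and the injected `alu_bug` are all hypothetical and only illustrate the step, compare, and halt flow.

```python
from dataclasses import dataclass, field


@dataclass
class ArchState:
    """Toy architectural state: a program counter and 32 integer registers."""
    pc: int = 0x80000000
    regs: list = field(default_factory=lambda: [0] * 32)


def execute(state, instr, alu_bug=False):
    """Apply one toy instruction ('addi', rd, rs1, imm) to the state."""
    _, rd, rs1, imm = instr
    value = state.regs[rs1] + imm
    if alu_bug and imm == 3:
        value ^= 0x2000  # hypothetical injected ALU fault
    if rd != 0:          # x0 stays hard-wired to zero
        state.regs[rd] = value
    state.pc += 4


def difftest(program, alu_bug=False):
    """Return None if DUT and REF agree, else (instruction index, register diffs)."""
    dut, ref = ArchState(), ArchState()
    for index, instr in enumerate(program):
        execute(dut, instr, alu_bug)  # the DUT commits an instruction
        execute(ref, instr)           # the REF executes the same instruction
        if dut != ref:                # compare the architectural state
            diffs = [(i, r, d)
                     for i, (r, d) in enumerate(zip(ref.regs, dut.regs)) if r != d]
            return index, diffs       # halt at the first mismatch
    return None


program = [("addi", 10, 0, 1), ("addi", 10, 10, 3), ("addi", 11, 10, 5)]
print(difftest(program))                # no mismatch on the correct model
print(difftest(program, alu_bug=True))  # mismatch reported at the faulty instruction
```

Stopping at the first mismatch is the important design choice: the reported instruction is the earliest point where the two models diverge, so later, secondary errors never hide the root cause.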

<div align="center">
  <img src="../images/02-functional/02-difftest/difftest-arch-en.png" alt="difftest-arch">
</div>


Run the workload on XiangShan and use NEMU for Difftest.

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

$(get_asset emu-precompile/emu) \
    -i $(get_asset workload/hello-riscv64-xs.bin) \
    --diff ${NEMU_HOME}/build/riscv64-nemu-interpreter-so \
    2>/dev/null | outputBuffer

You can run workloads on the prebuilt XiangShan processor with injected bugs and use NEMU for Difftest.

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

$(get_asset emu-precompile/emu-alu-err) \
    -i $(get_asset workload/hello-riscv64-xs.bin) \
    --diff ${NEMU_HOME}/build/riscv64-nemu-interpreter-so \
    2>/dev/null | tee emu_err.log > /dev/null || true # tutorial：add "|| true" to avoid notebook errors; It's not needed in real usage.

echo "Difftest directly points out the mismatching registers and PC"

tail -n 7 emu_err.log


At PC 0x0080000078, the REF and the DUT do not match: a0 is 0 in the REF, but 0x2000 in the DUT.

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

echo "----------------------------------------------------------------------------------------------"
echo "Difftest presents the register state, including integer/float/CSR registers"

tail -n 95 emu_err.log | head -n 19


In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

echo "-----------------------------------------------------------------------------------------------"
echo "Difftest shows which commit group the error occurs in"

tail -n 124 emu_err.log | head -n 11

After Difftest detects an error, we can rerun the simulation and enable waveform output around the failing cycle reported by Difftest.

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

mkdir -p build
rm -f ./build/*.vcd

$(get_asset emu-precompile/emu-alu-err) \
    -i $(get_asset workload/hello-riscv64-xs.bin) \
    --diff ${NEMU_HOME}/build/riscv64-nemu-interpreter-so \
    -b 8000 \
    -e 10000 \
    --dump-wave \
    2>/dev/null > /dev/null|| true

echo -n "Dump wave: "
realpath ./build/*.vcd

![LightSSS](../images/02-functional/03-lightSSS/lightSSS-overall-en.png)

Reproducing a bug by re-running the simulation is very time-consuming, especially for long workloads such as SPEC CPU. Snapshots are the way out: we have LightSSS, a lightweight simulation snapshot mechanism.

During simulation, LightSSS records snapshots of the simulation process with the `fork()` function.

When a bug is detected, a snapshot process is woken up and generates the waveform of the several cycles following the latest snapshot taken before the bug occurred.

LightSSS scales well because it can snapshot any external model (such as a model written in C++) without needing to understand the model's details.

The overhead of taking a snapshot is also low, only about 500 microseconds, far less than the overhead of RTL snapshots from Verilator.
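
The fork-based idea can be illustrated with a small Python sketch. It is not LightSSS itself: the toy `step` function stands in for the simulated design, "bug detected" is simply reaching `bug_cycle`, and the pipe-based wake-up protocol (`b"w"` to replay, `b"q"` to discard) is our own assumption. It needs a POSIX system and is best run as a plain script.

```python
import os


def step(state):
    """One cycle of a toy deterministic design (stand-in for the RTL model)."""
    return (state * 5 + 1) % 97


def discard(snapshot):
    """Tell a sleeping snapshot process to exit and release its pipes."""
    pid, wake_w, out_r = snapshot
    os.write(wake_w, b"q")
    os.waitpid(pid, 0)
    os.close(wake_w)
    os.close(out_r)


def replay(snapshot):
    """Wake the snapshot process and collect the trace it records."""
    pid, wake_w, out_r = snapshot
    os.write(wake_w, b"w")
    data = b""
    while chunk := os.read(out_r, 4096):
        data += chunk
    os.waitpid(pid, 0)
    os.close(wake_w)
    os.close(out_r)
    return data.decode().split(",")


def simulate(total_cycles, interval, bug_cycle):
    state, snapshot = 0, None
    for cycle in range(total_cycles):
        if cycle % interval == 0:
            wake_r, wake_w = os.pipe()
            out_r, out_w = os.pipe()
            pid = os.fork()
            if pid == 0:
                # Child: a frozen copy of the simulation that sleeps until woken.
                try:
                    if os.read(wake_r, 1) == b"w":
                        trace = []
                        for c in range(cycle, bug_cycle + 1):
                            state = step(state)
                            trace.append(f"{c}:{state}")  # "waveform" of the replay
                        os.write(out_w, ",".join(trace).encode())
                finally:
                    os._exit(0)
            os.close(wake_r)
            os.close(out_w)
            if snapshot is not None:
                discard(snapshot)  # keep only the most recent snapshot
            snapshot = (pid, wake_w, out_r)
        state = step(state)        # fast path: no tracing at all
        if cycle == bug_cycle:     # stand-in for a DiffTest mismatch
            return replay(snapshot)
    if snapshot is not None:
        discard(snapshot)
    return []


trace = simulate(total_cycles=20, interval=4, bug_cycle=9)
print(trace)
```

The parent never records a trace; only the woken child does, and only for the cycles between its snapshot and the bug. The snapshot code also never inspects `state`, which is why the approach carries over to arbitrary external models.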


If you see "the oldest checkpoint start to dump wave and dump nemu log...", LightSSS is active. The simulation will then restart from the latest snapshot and record waveforms.

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

mkdir -p build
rm -f ./build/*.vcd

$(get_asset emu-precompile/emu-alu-err) \
    -i $(get_asset workload/hello-riscv64-xs.bin) \
    --diff ${NEMU_HOME}/build/riscv64-nemu-interpreter-so \
    --enable-fork \
    2> /dev/null | outputBuffer || true

echo -n "Dump wave: "
realpath ./build/*.vcd

## 03-ChiselDB: Debug-friendly structured database

<!-- <div align="center">
  <img src="../images/02-functional/chiselDB-overview.png" alt="chiselDB-overview" style="width: auto; height: 50%;">
</div> -->

<img src="../images/02-functional/04-chiseldb/chiseldb-en.png" alt="chiselDB-overview" style="float:right; width:40%; margin-left:5px;">

**Motivation**

- Waveforms are large and hard to analyze further

- Need to analyze structured data such as memory transaction traces

**We propose ChiselDB for storage of structured data.**

**Highlights**

- Inserting probes between module interfaces in hardware

- DPI-C: Using C++ functions in Chisel code to transfer data

- Persist in database, SQL queries supported
<div style="clear:both;"></div>


LightSSS is powerful, but waveforms are still large and hard to analyze further, and we often want to analyze structured data such as memory transaction traces.
So we present ChiselDB, a debug-friendly structured database.
It inserts probes between module interfaces in hardware and uses DPI-C directly in Chisel code to transfer bundle information and data.
For bug analysis, SQL queries are supported, which makes it much easier to use than waveforms.


We provide a prebuilt simulator `emu-cdb-err` with an injected bug that forces all data released from L2 Cache to L3 Cache to a constant value.

Enable ChiselDB with `--dump-db` and turn on DiffTest; after running, DiffTest reports an error and a `.db` file is generated under `./build`.

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

rm -f ./build/*.db # clean old files
mkdir -p build

$(get_asset emu-precompile/emu-cdb-err) \
    -i $(get_asset workload/stream_100000.bin) \
    --diff $(get_asset emu-precompile/riscv64-nemu-interpreter-so) \
    --dump-db \
    2>linux.err || true

echo -n "Dump DB: "
realpath ./build/*.db

Then use SQLite to read the `.db` for analysis: query all TileLink transactions at address `0x80048f00`, and format the output with `./scripts/cache/convert_tllog.sh`.

In [None]:
%%bash
cd .. && source env.sh

DB=$(ls -t ${NOOP_HOME}/build/*db | head -n 1)

sqlite3 ${DB} "select * from TLLog where ADDRESS=0x80048f00" | sh ${NOOP_HOME}/scripts/cache/convert_tllog.sh | outputBuffer

Result: [Time/To_From/Channel/Opcode/Permission/Address/Data]

**Data successfully transferred from L1D to L2**

16171 L2_L1D_0 C ProbeAckData Shrink TtoN 0 5 80048f00 0000000080048f50 0000000080014328 0000000000000000 0000000000000000 user: 0 echo: 0 

16172 L2_L1D_0 C ProbeAckData Shrink TtoN 0 5 80048f00 0000000000000000 0000000000000000 000000008001e000 0000000080042060 user: 0 echo: 0 



**Data successfully transferred from L2 to L3**

16179 L3_L2_0 C ProbeAckData Shrink TtoN 0 2 80048f00 0000000080048f50 0000000080014328 0000000000000000 0000000000000000 user: 0 echo: 1 

16180 L3_L2_0 C ProbeAckData Shrink TtoN 0 2 80048f00 0000000000000000 0000000000000000 000000008001e000 0000000080042060 user: 0 echo: 1

**But when L1D acquires the error address again, the data loaded from L3 is wrong**

16457 L2_L1D_0 A AcquireBlock Grow NtoT 0 0 80048f00 0000000000000000 0000000000000000 0000000000000000 0000000000000000 user: 80048f07 echo: 0 

16463 L3_L2_0 A AcquireBlock Grow NtoT 0 0 80048f00 0000000000000000 0000000000000000 0000000000000000 0000000000000000 user: 0 echo: 1 

16486 L3_L2_0 D GrantData Cap toT 1 0 80048f00 **0000000000abcdef** 0000000000000000 0000000000000000 0000000000000000 user: 0 echo: 1 

**So there must be something wrong when L3 records Release Data**
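
The same kind of analysis can be scripted with Python's built-in `sqlite3` module. The sketch below uses an in-memory database so that it is self-contained. Only the table name `TLLog` and the `ADDRESS` column come from the query above; the other column names (`STAMP`, `SITE`, `CHANNEL`, `OPCODE`, `DATA`) and the textual opcodes are a simplified, hypothetical schema, so check the real `.db` file before reusing the query. The rows are abbreviated from the log above (first data word only), and the last row is invented to show the address filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TLLog (STAMP INTEGER, SITE TEXT, CHANNEL TEXT, "
    "OPCODE TEXT, ADDRESS INTEGER, DATA TEXT)"  # hypothetical, simplified schema
)
conn.executemany(
    "INSERT INTO TLLog VALUES (?, ?, ?, ?, ?, ?)",
    [
        (16171, "L2_L1D_0", "C", "ProbeAckData", 0x80048F00, "0000000080048f50"),
        (16179, "L3_L2_0", "C", "ProbeAckData", 0x80048F00, "0000000080048f50"),
        (16486, "L3_L2_0", "D", "GrantData", 0x80048F00, "0000000000abcdef"),
        (16500, "L2_L1D_0", "A", "AcquireBlock", 0x80050000, "0000000000000000"),
    ],
)


def query_address(connection, address):
    """Return all logged transactions for one address, ordered by time."""
    return connection.execute(
        "SELECT STAMP, SITE, CHANNEL, OPCODE, DATA FROM TLLog "
        "WHERE ADDRESS = ? ORDER BY STAMP",
        (address,),
    ).fetchall()


history = query_address(conn, 0x80048F00)
# Last data written down to L3 versus the data L3 later grants back.
written = [row for row in history
           if row[1] == "L3_L2_0" and row[3] == "ProbeAckData"][-1][4]
granted = [row for row in history if row[3] == "GrantData"][-1][4]

for row in history:
    print(row)
print("data corrupted between write and grant:", written != granted)
```

Comparing the last data written to a level with the data it later grants is a check that is easy to automate over every address in the database, which is hard to do with waveforms.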


![TL-TEST](../images/02-functional/05-tl_test/tl-test-overall-en.png)


Co-verification of the Cache system with upstream modules is complex and prevents rapid iteration.

To address this issue, we developed TL-Test: a unit-level cache-system verification framework that supports the TileLink protocol, cache-coherence checking, and randomized test-case generation.

Here is another example: detecting a cache coherence violation with TL-Test. We inject a bug that wrongly shifts the grant data.

TL-Test generates randomized tests and pinpoints a transfer problem at a specific address in our cache design. It logs all bus transactions; we then use `grep` to extract the relevant log entries for analysis.


In [None]:
%%bash
cd ../ && source env.sh

cat $(get_asset tltest-precompile/tlt_err.patch) | outputBuffer

You can run the demo on the prebuilt tl-test.

In [None]:
%%bash
cd ../ && source env.sh
# cd $TLT_HOME && make coupledL2-test-l2l3-v3 run THREADS_BUILD=16 CXX_COMPILER=clang++-17
# cd $TLT_HOME/run && ./tltest_v3lt 2>&1 | tee tltest_v3lt.log

get_asset tltest-precompile/tltest_err
mkdir -p ${WORK_DIR}/02-functional/05-tltest
cd ${WORK_DIR}/02-functional/05-tltest

cp -r $(get_asset tltest-precompile) ./ && cd ./tltest-precompile
./tltest_err 2>&1 | tee tltest_v3lt.log > /dev/null

tail -n 50 tltest_v3lt.log | head -n 15

Error address: 0x80

In [None]:
%%bash
cd ../ && source env.sh > /dev/null
# grep "addr: 0x80" $TLT_HOME/run/tltest_v3lt.log

cd ${WORK_DIR}/02-functional/05-tltest/tltest-precompile && grep "addr: 0x80," tltest_v3lt.log | head -n 10

Result: [Time/INFO-Level/Node-Idx/Core/Channel/Opcode/Source/Address/alias/Data]

**L1D acquires the error address**

[236] [tl-test-new-INFO] #0 L2[0].C[0] [fire A] [AcquirePerm NtoT] source: 0x3, addr: 0x80, alias: 0

**L1D releases the error address, and the data is successfully transferred from L1D to L2**

[806] [tl-test-new-INFO] #0 L2[0].C[0] [fire C] [ReleaseData TtoN] source: 0x3, addr: 0x80, alias: 0, data: [ c7 a5 ... ]

[808] [tl-test-new-INFO] #0 L2[0].C[0] [fire C] [ReleaseData TtoN] source: 0x3, addr: 0x80, alias: 0, data: [ fe 14 ... ]

**But when L2 grants the data of the error address, the data loaded from L2 is wrong**

[2036] [tl-test-new-INFO] #0 L2[0].C[0] [fire D] [GrantData toT] source: 0xf, addr: 0x80, alias: 0x1, data: [ 00 c7 ... ]

[2038] [tl-test-new-INFO] #0 L2[0].C[0] [fire D] [GrantData toT] source: 0xf, addr: 0x80, alias: 0x1, data: [ 00 fe ... ]

**So there must be something wrong when L2 grants data!**

![MinJie Cycle](../images/03-performance/00-intro/overview-en.png)

Apart from the functional verification introduced before, performance verification and optimization are also crucial parts of processor development.

So in this chapter, we will introduce the MinJie (agile) performance verification approaches used by our team and provide some demonstrations.

Similar to functional verification, we summarize the performance verification process as an iterative cycle, as shown in this picture: RTL implementation, running tests, performance evaluation, and performance analysis.

Of these steps, RTL implementation and running tests are straightforward, so we won't detail them in this tutorial.

Next, we will introduce the powerful tools we use for performance evaluation and performance analysis, and show how they speed up the optimization process: SimPoint, XSPerf, Top-Down, and Constantin.

![intro](../images/03-performance/01-checkpoint/intro-en.png)

Let's start with checkpointing. Here's the story:

To evaluate performance, we usually run benchmark suites in simulation (e.g. software simulation using Verilator, or hardware-accelerated simulation using an FPGA or an emulator).

However, existing approaches each have their own challenges:
- Software simulation is too slow. For a complex design like XiangShan, it runs at only a few kHz, so running a benchmark takes too long;
- FPGAs have limited on-chip resources, making them difficult to use for complex designs like XiangShan;
- Emulators are too expensive for us, and probably for most of academia.

We have seen some works that try to accelerate software simulation or improve FPGA usability.

These are great works, but we think there's a much simpler way: checkpointing.

![method](../images/03-performance/01-checkpoint/method-en.png)

Checkpointing simply means selecting some segments of a program's execution and saving the architectural state (i.e. registers and memory) at the beginning of each segment. Later, when we want to do performance evaluation, we can simply load the saved state and start the simulation from there.

This brings 2 main benefits:

1. This reduces the number of instructions that need to be simulated. We're not running the entire program from the start, but only some segments of it.
2. Different segments from the same program can be simulated in parallel, thus increasing simulation parallelism.

By taking a weighted average of the performance data collected from each segment, we can estimate the overall performance.
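
As a worked example of this weighted average, consider the minimal sketch below. The segment numbers are made up, and it assumes that each weight is the fraction of the program represented by that segment and that all segments contain the same number of instructions. Under those assumptions the quantity to average is CPI (cycles per instruction); the overall IPC is its reciprocal.

```python
def weighted_cpi(segments):
    """Estimate overall CPI from (weight, cycles, instructions) tuples."""
    total_weight = sum(weight for weight, _, _ in segments)
    weighted_sum = sum(weight * cycles / instructions
                       for weight, cycles, instructions in segments)
    return weighted_sum / total_weight


# Hypothetical results of three checkpoints: (weight, cycles, instructions).
segments = [
    (0.5, 20_000, 10_000),  # CPI 2.0, represents 50% of the program
    (0.3, 10_000, 10_000),  # CPI 1.0, represents 30%
    (0.2, 40_000, 10_000),  # CPI 4.0, represents 20%
]

cpi = weighted_cpi(segments)
ipc = 1 / cpi
print(f"estimated CPI = {cpi:.3f}, estimated IPC = {ipc:.3f}")
```

Averaging IPC values directly with the same weights would give a different (and, under these assumptions, biased) estimate, because the weights describe shares of instructions rather than shares of cycles.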

There are two common methods for selecting segments:
1. Uniform sampling, i.e., selecting a segment every fixed number of instructions;
2. SimPoint sampling, i.e., selecting segments that can represent the overall behavior of the program by profiling.
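
The idea behind SimPoint sampling can be sketched as follows: describe each execution interval by how often it runs each basic block (a basic block vector, BBV), cluster similar intervals, and simulate only one representative per cluster. The code below is a simplified illustration using plain k-means with fixed seeds and invented BBVs; as far as we know, the real SimPoint tool additionally applies random projection, tries several seeds, and chooses the number of clusters automatically.

```python
def normalize(vector):
    """Scale a basic block vector so that its entries sum to 1."""
    total = sum(vector)
    return [x / total for x in vector]


def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Plain k-means with deterministic, evenly spaced initial centroids."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignment = None
    for _ in range(iters):
        new_assignment = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def pick_simpoints(bbvs, k):
    """Return (representative interval index, weight) for each cluster."""
    points = [normalize(v) for v in bbvs]
    assignment, centroids = kmeans(points, k)
    result = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if not members:
            continue
        representative = min(members, key=lambda i: dist2(points[i], centroids[c]))
        result.append((representative, len(members) / len(points)))
    return result


# Invented BBVs: six intervals, three basic blocks, two obvious program phases.
bbvs = [[90, 10, 0], [80, 20, 0], [85, 15, 0],
        [0, 10, 90], [0, 20, 80], [5, 15, 80]]
simpoints = pick_simpoints(bbvs, k=2)
print(simpoints)
```

Each weight is the share of intervals that fall into the cluster, which is what makes the weighted average of the per-checkpoint results meaningful.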

Next, we will demonstrate how SimPoint profiles a program, generates checkpoints, and runs simulations using checkpoints.

This section will use some paths and constants different from `../env.sh`. For convenience, we have created a `01-env.sh`. In this section, we will use this script to set environment variables. You can run the following cell to view these environment variables.

In [None]:
%%bash
source 01-env.sh

env | grep WORKLOAD= # workload to be simulated / profiled / checkpointed
env | grep CHECKPOINT_INTERVAL=
env | grep NEMU=
env | grep _HOME | tail
env | grep _PATH | tail

The first step in checkpointing is to compile the SimPoint tool and NEMU (in checkpoint mode), and to generate a checkpoint restorer.

In [None]:
%%bash
source 01-env.sh

cd ${NEMU_HOME}
git submodule update --init

# Compile simpoint generator
cd ${NEMU_HOME}/resource/simpoint/simpoint_repo
make clean
make

# Compile NEMU in checkpoint mode
cd ${NEMU_HOME}
make clean
make riscv64-xs-cpt_defconfig
make -j8

# Generate checkpoint restorer for ${WORKLOAD}
cd ${NEMU_HOME}/resource/gcpt_restore
rm -rf ${GCPT_PATH}
make -C ${NEMU_HOME}/resource/gcpt_restore/ \
    O=${GCPT_PATH} \
    GCPT_PAYLOAD_PATH=$(get_asset workload/${WORKLOAD}.bin) \
    CROSS_COMPILE=riscv64-linux-gnu-

SimPoint is used to select representative segments.

NEMU is used to profile the program and generate checkpoints.

The restorer acts like a bootloader, which loads the saved memory from simulated flash to main memory, and recovers registers.

Next, we need to run the program to be checkpointed using NEMU to collect program behavior for profiling.

In [None]:
%%bash
source 01-env.sh

rm -rf ${RESULT_PATH}

_LOG_PATH=${LOG_PATH}/profiling
mkdir -p ${_LOG_PATH}

${NEMU} ${GCPT} \
    -w ${WORKLOAD} \
    -D ${RESULT_PATH} \
    -C profiling \
    -b \
    --simpoint-profile \
    --cpt-interval ${CHECKPOINT_INTERVAL} \
    > >(tee ${_LOG_PATH}/${WORKLOAD}-out.txt) 2> >(tee ${_LOG_PATH}/${WORKLOAD}-err.txt)

Then, use SimPoint to perform clustering analysis on the collected program behavior, selecting segments.

In [None]:
%%bash
source 01-env.sh

CLUSTER=${RESULT_PATH}/cluster/${WORKLOAD}
mkdir -p ${CLUSTER}

random1=`head -20 /dev/urandom | cksum | cut -c 1-6`
random2=`head -20 /dev/urandom | cksum | cut -c 1-6`

_LOG_PATH=${LOG_PATH}/cluster
mkdir -p ${_LOG_PATH}

${SIMPOINT} \
    -loadFVFile ${PROFILING_RESULT_PATH}/${WORKLOAD}/simpoint_bbv.gz \
    -saveSimpoints ${CLUSTER}/simpoints0 \
    -saveSimpointWeights ${CLUSTER}/weights0 \
    -inputVectorsGzipped \
    -maxK 3 \
    -numInitSeeds 2 \
    -iters 1000 \
    -seedkm ${random1} \
    -seedproj ${random2} \
    > >(tee ${_LOG_PATH}/${WORKLOAD}-out.txt) 2> >(tee ${_LOG_PATH}/${WORKLOAD}-err.txt) 

Finally, use NEMU to rerun the program that needs to be checkpointed to generate checkpoint files.

In [None]:
%%bash
source 01-env.sh

CLUSTER=${RESULT_PATH}/cluster
_LOG_PATH=${LOG_PATH}/checkpoint
mkdir -p ${_LOG_PATH}

${NEMU} ${GCPT} \
    -w ${WORKLOAD} \
    -D ${RESULT_PATH} \
    -C checkpoint \
    -b \
    -S ${CLUSTER} \
    --cpt-interval ${CHECKPOINT_INTERVAL} \
    > >(tee ${_LOG_PATH}/${WORKLOAD}-out.txt) 2> >(tee ${_LOG_PATH}/${WORKLOAD}-err.txt)


In the directory `${RESULT_PATH}/checkpoint/${WORKLOAD}` you can find the generated checkpoint files: one `.gz` file per cluster, with the weight of the checkpoint indicated in the file name.

In [None]:
%%bash
source 01-env.sh

find "${RESULT_PATH}/checkpoint/${WORKLOAD}" -type f -name "*_.gz" | tail

We can use emu to run one of the generated checkpoints and see the effect.

In [None]:
%%bash
source 01-env.sh

CHECKPOINT=$(find ${RESULT_PATH}/checkpoint/${WORKLOAD} -type f -name "*_.gz" | tail -1)

$(get_asset emu-precompile/emu) \
    -i ${CHECKPOINT} \
    --diff $(get_asset emu-precompile/riscv64-nemu-interpreter-so) \
    --max-cycles=50000 \
    2>/dev/null


When emu detects that the file is a gzip-compressed checkpoint, it will automatically decompress it and restore the memory state and architectural state from the checkpoint.

# Performance counters in XiangShan

Purpose: Collect performance events for analysis and tuning

XSPerf:
- Accumulate:
  ```c
  if (valid)
    counter += diff;
  ```
- Histogram:
  ```c
  if (valid)
    distribution[value / step] += 1;
  ```
- Rolling:
  ```c
  if (valid)
    counters[segment] += diff;
  if (++cycles == segment_size) {
    cycles = 0;
    segment++;
  }
  ```

While running benchmarks, we need to collect and record hardware behavior (performance events) for analysis and tuning. 

In XiangShan's RTL, we have implemented three types of performance counters; their pseudocode is shown above:

- Accumulate: Basic counter that accumulates whenever a performance event occurs;
- Histogram: Records the distribution of values when performance events occur;
- Rolling: Works like a segmented Accumulate-type counter, it tracks the changes in the number of performance events in each segment throughout the entire run.
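
To make the three counter types concrete, here is a small software model of them. It is a sketch for illustration only, not the XSPerf implementation: the class names, the handling of out-of-range histogram values (dropped here), and the per-cycle commit counts are our own assumptions.

```python
class Accumulate:
    """Adds `diff` to a single counter whenever the event is valid."""
    def __init__(self):
        self.counter = 0

    def update(self, valid, diff=1):
        if valid:
            self.counter += diff


class Histogram:
    """Counts how often a value falls into each bucket of width `step`."""
    def __init__(self, start, stop, step):
        self.start, self.stop, self.step = start, stop, step
        self.bins = [0] * ((stop - start + step - 1) // step)

    def update(self, valid, value):
        # Assumption: values outside [start, stop) are simply dropped.
        if valid and self.start <= value < self.stop:
            self.bins[(value - self.start) // self.step] += 1


class Rolling:
    """Like Accumulate, but starts a new counter every `segment_size` cycles."""
    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.segments = [0]
        self.cycles = 0

    def update(self, valid, diff=1):
        if valid:
            self.segments[-1] += diff
        self.cycles += 1
        if self.cycles == self.segment_size:
            self.cycles = 0
            self.segments.append(0)


# Hypothetical number of instructions committed in each of 8 cycles.
commits_per_cycle = [2, 0, 1, 3, 0, 0, 4, 1]

total = Accumulate()
widths = Histogram(start=0, stop=6, step=2)
rolling = Rolling(segment_size=4)
for commits in commits_per_cycle:
    total.update(True, commits)
    widths.update(True, commits)
    rolling.update(True, commits)

print("total commits:", total.counter)
print("commit-width histogram:", widths.bins)
print("IPC per segment:", [s / rolling.segment_size for s in rolling.segments[:-1]])
```

The total alone hides that the two halves of the run behave differently; the per-segment IPC makes that visible, which is the motivation for the Rolling type.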

## Accumulate & Histogram

These two types of performance counters are printed to stderr when the simulation ends. 

Example (Accumulate): the total number of instructions committed

```scala
def ifCommitReg(counter: UInt): UInt = Mux(isCommitReg, counter, 0.U)
XSPerfAccumulate("commitInstr", ifCommitReg(trueCommitCnt), XSPerfLevel.CRITICAL)
```

Example (Histogram): the distribution of L2 Cache acquire latency

```scala
XSPerfHistogram("acquire_period", acquire_period, acquire_period_en, 0, 30, 1, true, true)
XSPerfHistogram("acquire_period", acquire_period, acquire_period_en, 30, 100, 5, true, true)
XSPerfHistogram("acquire_period", acquire_period, acquire_period_en, 100, 200, 10, true, true)

```

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

$(get_asset emu-precompile/emu) \
    -i $(get_asset workload/hello-riscv64-xs.bin) \
    --no-diff 2>stderr.log | tail

echo "=== Last 10 lines:"
tail -n 10 stderr.log

echo "=== Example of XSPerfAccumulate: rob commitInstr"
grep -n "rob: commitInstr," stderr.log | tail

echo "=== Example of XSPerfHistogram: l2cache acquire period"
grep -n "l2cache.slices_0.mshrCtl: acquire_period" stderr.log | tail

You can run this cell to see examples of XSPerf.

## Rolling

![rolling](../images/03-performance/02-xsperf/rolling-en.png)

This type of performance counter utilizes the ChiselDB framework introduced in 02-functional/04-chiseldb to store the collected data into a SQLite3 database file.

To enable RollingDB, you need to specify `WITH_ROLLINGDB=1` during compilation and use the `--dump-db` parameter at runtime.

The previous two types of counters cannot reflect the characteristic differences of different segments during the execution of a program, so we may miss the impact of a certain microarchitecture modification on a specific segment (critical region). Therefore, we need rolling analysis.

⚠️Note: If you are reading this notebook on the tutorial demo server, please do not recompile XiangShan, as it will take a long time and consume a lot of computing resources.

In [None]:
%%bash
cd .. && source env.sh
cd ${NOOP_HOME}

mkdir -p ${WORK_DIR}/03-performance/02-xsperf

# for tutorial: copy a pre-generated rolling db to tutorial dir
cp $(get_asset emu-perf-result/xs-perf-rolling.db) \
    ${WORK_DIR}/03-performance/02-xsperf/xs-perf-rolling.db

Here we just copy the pre-generated rolling DB file to our workspace. In the code-server you're using now, you can see the commands used to generate it.

After obtaining the database file, we use a Python script to analyze it.

In the following example, we use the `rollingplot.py` script to plot the IPC data.

The IPC data is collected in XiangShan's RTL code as follows:

```scala
// every 1000 cycles
XSPerfRolling("ipc", ifCommitReg(trueCommitCnt), 1000, clock, reset)
```

In [None]:
%%bash
cd .. && source env.sh
cd ${WORK_DIR}/03-performance/02-xsperf

# Use python scripts to analyze the rolling db, for example, plot ipc
python3 ${NOOP_HOME}/scripts/rolling/rollingplot.py \
    ./xs-perf-rolling.db \
    ipc

ls -lh ${WORK_DIR}/03-performance/02-xsperf/results/perf.png

The script outputs the following image, showing the IPC changes of XiangShan over time while running this program:

<img src="../images/03-performance/02-xsperf/result-example.png" alt="Result Example" width="70%" />

(at ../work/03-performance/02-xsperf/results/perf.png)

If the image does not load correctly, you can try closing the notebook and reopening it.
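Because the rolling database is an ordinary SQLite3 file, it can also be inspected directly with Python's standard `sqlite3` module. The sketch below only lists table names and makes no assumption about the RollingDB schema; `list_tables` and the demo table `demo_counter` are hypothetical names introduced for illustration.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in an existing SQLite database file."""
    # Open read-only so that a mistyped path raises instead of creating a file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Demo on a throwaway database; `demo_counter` is a made-up table name.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.db")
    conn = sqlite3.connect(demo_path)
    conn.execute("CREATE TABLE demo_counter (step INTEGER, value INTEGER)")
    conn.commit()
    conn.close()
    print(list_tables(demo_path))
```

In a Python cell of this notebook, calling `list_tables` on the copied `xs-perf-rolling.db` should show which tables are available before plotting.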

# Top-Down

Purpose: Organize perf events in hierarchical form

[1]: Yasin A. A top-down method for performance analysis and counters architecture\[C\]//2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2014: 35-44.

<img src="../images/03-performance/03-topdown/example-en.png" alt="Example" width="70%" />

Top-Down is a common performance analysis method that organizes fragmented performance events into a hierarchical form to more accurately analyze the impact of individual performance events on overall processor performance.
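As a rough illustration of the hierarchy, the sketch below computes the generic level-1 breakdown described in [1] from a handful of slot counts. It is not XiangShan's counter set: the function name, the argument names, and the example numbers are all hypothetical, and the real counters are tailored to the XiangShan microarchitecture.

```python
def topdown_level1(total_slots, fetch_bubbles, issued_uops,
                   retired_uops, recovery_bubbles):
    """Split pipeline slots into the four level-1 Top-Down categories.

    All arguments are slot counts over the same interval, where
    total_slots = pipeline width * cycles. Names are illustrative only.
    """
    if total_slots <= 0:
        raise ValueError("total_slots must be positive")
    frontend_bound = fetch_bubbles / total_slots
    bad_speculation = (issued_uops - retired_uops + recovery_bubbles) / total_slots
    retiring = retired_uops / total_slots
    # Whatever is not explained by the other three is attributed to the backend.
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {
        "frontend_bound": frontend_bound,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend_bound,
    }


# Made-up numbers: a 4-wide pipeline observed for 1000 cycles.
breakdown = topdown_level1(
    total_slots=4 * 1000,
    fetch_bubbles=800,
    issued_uops=2400,
    retired_uops=2000,
    recovery_bubbles=200,
)
for category, fraction in breakdown.items():
    print(f"{category}: {fraction:.1%}")
```

Each category can then be split further at the next level, which is what makes the method hierarchical.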

Based on the XSPerfAccumulate introduced in the 02-xsperf section, we have implemented a set of Top-Down counters optimized for the XiangShan microarchitecture and RISC-V instruction set in RTL to help us better model the XiangShan microarchitecture and align it with XS-GEM5.

In the `${NOOP_HOME}/scripts/top-down` directory, we have also implemented some analysis scripts that you can run to extract Top-Down results, plot graphs, etc.

⚠️Note: If you are reading this notebook on the tutorial demo server, please do not run the analysis scripts as they will perform a large amount of disk access.

In [None]:
%%bash
cd .. && source env.sh

mkdir -p ${WORK_DIR}/03-performance/03-topdown
cd ${WORK_DIR}/03-performance/03-topdown

# for tutorial: copy analysis results
cp -r $(get_asset emu-spec-topdown-result/results) ./

echo === results ===
ls ./results

echo === first 10 lines of results.csv ===
head -n 10 ./results/results.csv

echo === first 10 lines of results-weighted.csv ===
head -n 10 ./results/results-weighted.csv

Again, we just use the pre-generated results here.

The script outputs the following image:

<img src="../images/03-performance/03-topdown/result-example.png" alt="Result Example" width="70%" />

(at ../work/03-performance/03-topdown/results/result.png)

If the image does not load correctly, you can try closing the notebook and reopening it.

# Constantin

Purpose: Speed up design space exploration (by reducing recompilation)

![Overview](../images/03-performance/04-constantin/overview-en.png)

Sometimes we want to test performance under different parameters.

We could use a cycle-accurate simulator, but since it may not be 100% accurate, we sometimes want to test directly on RTL.

However, recompiling every time we adjust only one or two parameters is very time-consuming, even though most of the RTL is unchanged. Is there any way to change parameters without recompiling?

We present Constantin, which is based on the DPI-C interface and uses C++ functions and Chisel's BlackBox mechanism to configure parameters during runtime initialization.

Replacing a Scala parameter with Constantin looks roughly like this:

```scala
/* *** w/o Constantin *** */
val enableSomeModule = WireInit(false.B) // change to true.B, re-compile and re-run

/* *** w/ Constantin *** */
// in RTL
val enableSomeModule = WireInit(Constantin.createRecord("enableSomeModule", initValue = false))
// in constantin.txt
enableSomeModule 0 // change to 1, re-run
```

To enable Constantin, you need to use the `WITH_CONSTANTIN=1` option when compiling emu.

Currently, the Constantin configuration file cannot be specified through an emu argument; it must be located at `${NOOP_HOME}/build/constantin.txt`.
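When several parameters are being swept, generating the file from a script may be more convenient than editing it by hand. The sketch below assumes the `name value` line format shown above, with one parameter per line (the multi-parameter layout is an assumption); `render_constantin` is a hypothetical helper, not part of XiangShan's scripts.

```python
def render_constantin(params):
    """Render a {name: value} mapping as text in `name value` line format.

    Booleans are written as 0/1. One parameter per line is an assumption
    based on the single-line examples in this tutorial.
    """
    lines = []
    for name, value in params.items():
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid parameter name: {name!r}")
        lines.append(f"{name} {int(value)}")
    return "\n".join(lines) + "\n"


# `enableSomeModule` is the placeholder name used in the Scala example above.
text = render_constantin({"enableSomeModule": True})
print(text, end="")  # enableSomeModule 1
```

The returned text can then be written to `${NOOP_HOME}/build/constantin.txt` before launching emu.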


The following example uses Constantin to switch the branch predictor on and off. You can run this cell to compare the results of the two runs.

In [None]:
%%bash
cd .. && source env.sh

mkdir -p ${NOOP_HOME}/build

# run with default parameter
rm -f ${NOOP_HOME}/build/constantin.txt || true
$(get_asset emu-precompile/emu-constantin) \
    -i $(get_asset workload/coremark-2-iteration.bin) \
    -C 10000 \
    --no-diff \
    2>/dev/null

# run with Bpu turned off (falls back to static not-taken prediction)
echo "enableUbtb 0" > ${NOOP_HOME}/build/constantin.txt
$(get_asset emu-precompile/emu-constantin) \
    -i $(get_asset workload/coremark-2-iteration.bin) \
    -C 10000 \
    --no-diff \
    2>/dev/null

## Autosolving
Purpose: Automatically find the best parameter configuration under the current microarchitecture.

Steps:
- Enable in Constantin configuration (refer to `./04-autosolving.patch`);
- Compile emu with `WITH_CONSTANTIN=1`;
- Provide config file.
  - Parameter name, bit width, initial value.
  - Performance counter name, optimization strategy.
  - Workload, etc.

We also implemented Autosolving for Constantin, which can automatically find the best parameter configuration under the current microarchitecture.

To use Autosolving, you need to enable it in the Constantin configuration (refer to `./04-autosolving.patch`), and compile emu with `WITH_CONSTANTIN=1`.

After enabling Autosolving, emu reads the Constantin configuration from stdin instead of a txt file. This allows our Python script to run emu automatically with specific configurations and search for the optimal one.
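Feeding a configuration through stdin from Python can be sketched with the standard `subprocess` module. The exact text that emu expects on stdin is not described here, so the payload below is treated as an opaque string, and a small Python one-liner stands in for emu; `run_with_stdin` is a hypothetical helper, not the actual script.

```python
import subprocess
import sys


def run_with_stdin(cmd, stdin_text, timeout=None):
    """Run `cmd`, feed `stdin_text` to its stdin, and return its stdout."""
    result = subprocess.run(
        cmd,
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
        timeout=timeout,
    )
    return result.stdout


# Stand-in for emu: a child process that echoes whatever it reads on stdin.
stand_in = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read())",
]
print(run_with_stdin(stand_in, "placeholder configuration\n"), end="")
```

A search script would call such a helper once per candidate configuration and parse the performance counter of interest from the captured output.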

You need to provide a configuration file for the script (refer to `04-autosolving-config.json`), including:
- Descriptions of configurable parameters
  - Parameter name
  - Bit width
  - Initial value
- Optimization goals
  - Performance counter name
  - Strategy (minimize, maximize)
  - Baseline
- Genetic algorithm parameters
- emu running parameters
  - workload
  - Maximum number of instructions
  - Number of threads

Our script uses a genetic algorithm for parameter exploration, and it is also easy to implement other algorithms (such as ant colony or particle swarm optimization).
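The general shape of such a search can be sketched as follows. This is a minimal genetic algorithm under stated assumptions, not the implementation in `constantHelper.py`: parameters are unsigned integers bounded by their bit width, and a toy `fitness` function stands in for running emu and reading a performance counter. All names (`genetic_search`, `paramA`, `paramB`) are hypothetical.

```python
import random


def genetic_search(bit_widths, fitness, generations=30, population=16,
                   mutation_rate=0.2, seed=0):
    """Search for the parameter assignment that maximizes `fitness`.

    bit_widths: {parameter name: bit width (>= 1)}
    fitness:    callable taking {name: value} and returning a score
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    names = list(bit_widths)

    def random_individual():
        return {n: rng.randrange(1 << bit_widths[n]) for n in names}

    def crossover(a, b):
        # Uniform crossover: each parameter comes from either parent.
        return {n: (a[n] if rng.random() < 0.5 else b[n]) for n in names}

    def mutate(individual):
        child = dict(individual)
        for n in names:
            if rng.random() < mutation_rate:
                # Flip one bit, which keeps the value within its bit width.
                child[n] ^= 1 << rng.randrange(bit_widths[n])
        return child

    pop = [random_individual() for _ in range(population)]
    for _ in range(generations):
        # Keep the better half (elitism) and refill with mutated offspring.
        pop.sort(key=fitness, reverse=True)
        elite = pop[: population // 2]
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite)))
            for _ in range(population - len(elite))
        ]
        pop = elite + children
    return max(pop, key=fitness)


def toy_fitness(individual):
    """Toy objective whose optimum is paramA = 5, paramB = 3."""
    return -(abs(individual["paramA"] - 5) + abs(individual["paramB"] - 3))


best = genetic_search({"paramA": 4, "paramB": 3}, toy_fitness)
print(best, toy_fitness(best))
```

In a real run each fitness evaluation means a full emu simulation, so a practical script would cache scores and evaluate candidates in parallel rather than re-scoring on every sort as this sketch does.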

In [None]:
%%bash
cd .. && source env.sh && cd - >/dev/null

echo === Patch ===
cat ./04-autosolving.patch

echo === Config ===
cat ./04-autosolving-config.json

echo === Run ===
mkdir -p ${NOOP_HOME}/build
cp $(get_asset emu-precompile/emu-autosolving) ${NOOP_HOME}/build/emu
python3 ${NOOP_HOME}/scripts/constantHelper.py ./04-autosolving-config.json

You can run this cell to see how Autosolving works.