



# **XiangShan: An Open-Source High-Performance RISC-V Processor and Infrastructure for Architecture Research**

#### The XiangShan Team

Institute of Computing Technology (ICT)
Chinese Academy of Sciences (CAS)

HPCA'24 @ Edinburgh, Scotland, March 2, 2024

#### What we will cover in this tutorial

Highlights: XiangShan on Chips and FPGAs

#### CPU Microarchitecture (60 minutes)

- Design and implementation
- How to implement novel ideas on XiangShan
- Frontend: branch prediction and instruction fetch
- Backend: out-of-order scheduler, execution units
- Load/Store Unit: LSQ, pipelines, TLBs, data caches
- L2/L3 caches and prefetchers

#### Development workflows (60 minutes)

- Introduction to the tools and their usage
- How to develop on XiangShan with MinJie
- Simulation and Debugging
- Research Demo

# Tutorial@HPCA'24 Schedule

| Time (AM)     | Topic                                       |
|---------------|---------------------------------------------|
| 8:30 - 9:00   | Introduction of the XiangShan Project       |
| 9:05 - 10:00  | Microarchitecture Design and Implementation |
| 10:05 - 10:50 | Hands-on Development                        |
|               | Coffee Break                                |
| 11:20 - 12:00 | Hands-on Development & Discussions (Cont.)  |



# Roadmap: MinJie Development Flows and Tools








## In this tutorial, we will

- Provide you with access to cloud servers
- Prepare the XiangShan development environment for you
- Go through the development workflows including:

#### **Simulation**

- Environment
- RTL Generation
- RTL Simulation
- ...

#### **Func. Verification**

- Nexus-AM
- NEMU
- Difftest
- ...

#### **Perf. Verification**

- Perf. counter
- Constantin
- Checkpoint
- ...

#### **Research Demo**

- Oracle BP
- Top-Down
- ...

Required: machines with networking and SSH

#### Demo Instructions

• Interactive shell commands are highlighted in a brown box with the prefix `$`

```
$ echo "Hello, XiangShan"
$ echo "Have a nice day"
```

• Descriptions and notes are presented in a grey box with the prefix `#`

```
# Please prepare a laptop with an SSH client.
# Next, let's start the demo session!
```

#### Let's Start!

### Prerequisites

- Log in to the provided cloud server
  - Windows (Windows Terminal / PowerShell is recommended)

```
PS > ssh guest@8.208.78.58
# Password: openxiangshan
```

  - Linux / Mac

```
$ ssh guest@8.208.78.58
# Password: openxiangshan
```

For **offline users**, please refer to https://github.com/OpenXiangShan/xs-env/tree/tutorial-new

## Prerequisites

```
# Copy tutorial environment to your dir based on your name
$ cp -r /opt/xs-env ~/<YOUR_NAME>
# Enter your dir
$ cd ~/<YOUR_NAME>
# Set up environment variables
# DO IT AGAIN when opening a new terminal
$ source env.sh
# SET XS_PROJECT_ROOT:                 /home/guest/YOUR_NAME
# SET NOOP_HOME (XiangShan RTL Home):  $XS_PROJECT_ROOT/XiangShan
# SET NEMU_HOME:                       $XS_PROJECT_ROOT/NEMU
# SET AM_HOME:                         $XS_PROJECT_ROOT/nexus-am
```

# Prerequisites

```
# Project Structure
$ tree -d -L 1
    env-scripts    Useful scripts
    NEMU           NEMU
    XiangShan      XiangShan processor
    nexus-am       Abstract machine
    tutorial       Scripts and patches for tutorial
# Enter XiangShan directory
$ cd XiangShan
```

# Chisel Compilation



Compile RTL and build simulator with Verilator

```
$ make emu CONFIG=MinimalConfig MFC=1 -j8
Options:
    CONFIG=MinimalConfig      Configuration of XiangShan
    MFC=1                     Enable MLIR FIRRTL Compiler
 // EMU_THREADS=2             Simulation threads
 // EMU_TRACE=1               Enable waveform dump
 // WITH_DRAMSIM=1            Enable DRAMSim3 for DRAM simulation
 // WITH_CHISELDB=1           Enable ChiselDB feature
 // WITH_CONSTANTIN=1         Enable Constantin feature
```

• Compilation might take about 10 minutes

# Open Another Terminal

```
# login to the cloud server again
$ ssh guest@8.208.78.58
# Password: openxiangshan
# Come back to XiangShan directory
$ cd ~/<YOUR_NAME> && source env.sh && cd XiangShan
```

# Run RTL Simulation with Verilator

After building, we can run the simulator

#### **Great!**

We have learned the basic simulation process of XiangShan.

Once we get the RTL simulator ready, what's next?

How can we

- generate desired workloads
- find and fix functional bugs
- do performance analysis
- do research on micro-architecture



### MinJie Functional Verification



#### **Bare Metal Runtime: Nexus-AM**

#### Purpose

- Generate workloads agilely without an OS
- Provide runtime framework for bare metal machines like XiangShan

#### We present a bare metal runtime called Nexus-AM

- Light-weight, easy to use
- Implement basic system call interfaces and exception handlers
- Support multiple ISAs and configurations

#### Nexus-AM

- Hands-on: build Coremark workload
- Showcase

```
$ cd ~/<YOUR_NAME> && source env.sh && cd tutorial
$ bash build-am.sh
```

```
# cd $AM_HOME/apps/coremark
# make ARCH=riscv64-xs
# ls -l build
#   coremark-riscv64-xs.bin    Binary image of the program
#   coremark-riscv64-xs.elf    ELF of the program
#   coremark-riscv64-xs.txt    Disassembly of the program
```






#### **ISA REF: NEMU**

#### Purpose

- Golden model for verification
- Simple yet high performance

#### We present NEMU

- An instruction set simulator, similar to Spike
- Achieves performance similar to QEMU through optimization techniques
- A set of APIs is provided to assist XiangShan in comparing and verifying the architectural states

### **NEMU**

#### Showcase

\$ bash build-nemu.sh

```
# cd $NEMU_HOME
# make clean
# make riscv64-xs_defconfig
# make -j                          # build NEMU as a bare-metal machine that can run the Coremark from the previous step
# make clean-softfloat
# make riscv64-xs-ref_defconfig
# make                             # build NEMU as the reference model for XiangShan
```

### **NEMU**

- Hands-on: run Coremark workload on NEMU
- Showcase

\$ bash run-nemu.sh

```
# cd $NEMU_HOME
# ./build/riscv64-nemu-interpreter -b \                       # run in batch mode, faster
#     $AM_HOME/apps/coremark/build/coremark-riscv64-xs.bin    # use Coremark as the workload
```

# ISA co-simulation using Difftest

#### Basic flows

- The DUT commits instructions or updates other states
- The reference simulator (REF) executes the same instructions
- Compare the architectural states between DUT and REF
- Abort on a mismatch, otherwise continue

#### Provided as verification infrastructure for processors

- APIs for HDLs such as Chisel/Verilog
- RTL simulators such as Verilator, VCS
- RISC-V ISS such as NEMU, Spike

#### SMP-Difftest: co-simulation on an SMP processor

- SMP Linux kernel and multi-threading programs
- Online checking of cache coherency and memory consistency



#### **Basic architecture of DiffTest**

```
while (1) {
    icnt = cpu_step();
    nemu_step(icnt);
    r1s = cpu_getregs();
    r2s = nemu_getregs();
    if (r1s != r2s) { abort(); }
}
```
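The loop above can be made concrete with a toy model. The following is a minimal Python sketch, not the Difftest implementation: `ToyCore`, its one-instruction-per-step behavior, the injected bug, and the report format are all assumptions made for illustration. The real flow steps the REF by the number of instructions the DUT committed in that cycle.

```python
class ToyCore:
    """Hypothetical register machine used as both DUT and REF."""

    def __init__(self, program, bug_at=None):
        self.program = program      # list of (rd, imm), meaning regs[rd] += imm
        self.bug_at = bug_at        # pc at which this core computes a wrong result
        self.pc = 0
        self.regs = [0, 0, 0, 0]

    def step(self):
        rd, imm = self.program[self.pc]
        value = self.regs[rd] + imm
        if self.pc == self.bug_at:
            value = 0               # injected bug: the result is lost
        self.regs[rd] = value
        self.pc += 1


def difftest(dut, ref, steps):
    """Step DUT and REF in lockstep; return a report on the first mismatch."""
    for _ in range(steps):
        pc = dut.pc
        dut.step()
        ref.step()
        for idx, (wrong, right) in enumerate(zip(dut.regs, ref.regs)):
            if wrong != right:
                return f"x{idx} different at pc = {pc}, right = {right}, wrong = {wrong}"
    return None


program = [(1, 5), (2, 7), (1, 3)]
report = difftest(ToyCore(program, bug_at=2), ToyCore(program), len(program))
print(report)  # x1 different at pc = 2, right = 8, wrong = 0
```

Comparing after every step is what lets the check stop at the first diverging instruction instead of at the end of the workload.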

#### **Online checking**

- Hands-on: run Coremark workload on XiangShan, difftest with NEMU
- Showcase

```
# Running Coremark takes about 2 minutes
```

\$ bash run-emu.sh

```
# ./emu \                                                        # once the simulator is compiled, you can use $NOOP_HOME/build/emu
#     -i $AM_HOME/apps/coremark/build/coremark-riscv64-xs.bin \  # use coremark.bin
#     --diff $NEMU_HOME/build/riscv64-nemu-interpreter-so \      # difftest with NEMU
#     2>perf.out                                                 # redirect stderr to file
```

```
./emu -i $AM_HOME/apps/coremark/build/coremark-riscv64-xs.bin --diff $NEMU_HOME/build/riscv64-nemu-interpreter-so
Emu compiled at Mar 24 2023, 20:33:03
Using simulated 32768B flash
[warning]no valid flash bin path, use preset flash instead
The image is /home/guest/xs-env-asplos2023/nexus-am/apps/coremark/build/coremark-riscv64-xs.bin
Using simulated 8192MB RAM
NemuProxy using /home/guest/xs-env-asplos2023/NEMU/build/riscv64-nemu-interpreter-so
The first instruction of core 0 has committed. Difftest enabled.
[NEMU] Can not find flash image: (null)
[NEMU] Use built-in image instead
[src/device/io/mmio.c:38,add mmio map] Add mmio map 'flash' at [0x0000000010000000, 0x00000000100fffff]
Running CoreMark for 1 iterations
2K performance run parameters for coremark.
CoreMark Size : 666
Total time (ms) : 4288
Iterations : 1
Compiler version : GCC9.4.0
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0xe714
Finised in 4288 ms.
CoreMark Iterations/Sec 233
Core 0: HIT GOOD TRAP at pc = 0x80001c52
total guest instructions = 398,372
instrCnt = 398,372, cycleCnt = 517,099, IPC = 0.770398
Seed=0 Guest cycle spent: 517,103 (this will be different from cycleCnt if emu loads a snapshot)
Host time spent: 77,559ms
```
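The summary line at the end of the log is enough to recompute the IPC. Below is a small Python sketch that parses it; the helper name `parse_ipc` is hypothetical, and the regular expression only assumes the `instrCnt = ..., cycleCnt = ...` layout shown in the log above.

```python
import re


def parse_ipc(line):
    """Extract instruction and cycle counts from an emu summary line."""
    match = re.search(r"instrCnt = ([\d,]+), cycleCnt = ([\d,]+)", line)
    if match is None:
        raise ValueError("no instrCnt/cycleCnt pair found")
    # The counts are printed with thousands separators, so strip the commas.
    instrs, cycles = (int(group.replace(",", "")) for group in match.groups())
    return instrs, cycles, instrs / cycles


summary = "instrCnt = 398,372, cycleCnt = 517,099, IPC = 0.770398"
instrs, cycles, ipc = parse_ipc(summary)
print(f"{instrs} instructions / {cycles} cycles = IPC {ipc:.6f}")
```

Recomputing the ratio from the raw counters is a quick sanity check when comparing runs of different configurations.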

#### Hands-on:

- We injected a bug into a pre-built XiangShan processor
- Now, let's trigger it on the pre-built buggy XiangShan with Difftest

#### Showcase

```
$ bash run-emu-diff.sh

# ./emu-bug \
   -i $AM_HOME/apps/coremark/build/coremark-riscv64-xs.bin \
   --diff $NEMU_HOME/build/riscv64-nemu-interpreter-so # difftest with NEMU
```



- Difftest reports the error as soon as a check fails
  - Dump info like PC, GPRs, etc.
  - Point out the differences

```
======= Commit Group Trace (Core 0) =========
commit group [00]: pc 0080000118 cmtcnt 7
commit group [01]: pc 008000012c cmtcnt 6
commit group [02]: pc 00800013b2 cmtcnt 6
commit group [03]: pc 0080000124 cmtcnt 7
commit group [04]: pc 0080000136 cmtcnt 6
commit group [05]: pc 00800013c0 cmtcnt 6
commit group [06]: pc 00800013da cmtcnt 6
commit group [07]: pc 0080001400 cmtcnt 7
commit group [08]: pc 0080001414 cmtcnt 7
commit group [09]: pc 0080001428 cmtcnt 6
commit group [10]: pc 0080001436 cmtcnt 6
commit group [11]: pc 00800017fe cmtcnt 6
commit group [12]: pc 008000180c cmtcnt 6 <--
commit group [13]: pc 008000139a cmtcnt 6
commit group [14]: pc 0080000124 cmtcnt 7
commit group [15]: pc 0080000154 cmtcnt 6
```

```
gp: 0x0000000000000000 t2: 0x0000000000000000
ft3: 0xffffffff00
fs9: 0xffffffff00000000 fs10: 0xffffffff00000000
ft9: 0xffffffff00000000 ft10: 0xffffffff00000000 ft11: 0xffffffff00000000
pc: 0x000000008000143a mstatus: 0x8000000a00006000 mcause: 0x00000000000000 mepc: 0x0000004fe6a8c3dc
sstatus: 0x8000000200006000 scause: 0x00000000000000 sepc: 0x00000000000000
satp: 0x00000000000000000
mideleg: 0x0000000000000000 medeleg: 0x0000000000000000
mtval: 0x000000000000000 stval: 0x9738662f59a3300f mtvec: 0x0000000000000 stvec: 0x0000000000000
privilege mode: 3 pmp: below
 4: cfg:0x00 addr:0x0000000000000000| 5: cfg:0x00 addr:0x000000000000000
 6: cfg:0x00 addr:0x0000000000000000| 7: cfg:0x00 addr:0x0000000000000000
10: cfg:0x00 addr:0x0000000000000000|11: cfg:0x00 addr:0x000000000000000
12: cfg:0x00 addr:0x0000000000000000|13: cfg:0x00 addr:0x0000000000000000
priviledgeMode: 3
   a3 different at pc = 0x008000180c, right= 0x00000000000002, wrong = 0x0000000000000000
Core 0:
```
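When the report is long, the mismatch line and the commit group trace can be correlated with a script. The sketch below is a hypothetical Python helper (`locate_mismatch` is not part of Difftest) whose regular expressions assume only the two line formats shown in the dumps above.

```python
import re

MISMATCH = re.compile(
    r"(\w+) different at pc = (0x[0-9a-f]+), "
    r"right\s*=\s*(0x[0-9a-f]+), wrong\s*=\s*(0x[0-9a-f]+)"
)
GROUP = re.compile(r"commit group \[(\d+)\]: pc ([0-9a-f]+) cmtcnt (\d+)")


def locate_mismatch(log_text):
    """Return the mismatching register, its values, and the matching commit groups."""
    match = MISMATCH.search(log_text)
    if match is None:
        return None
    pc, right, wrong = (int(value, 16) for value in match.group(2, 3, 4))
    groups = [
        int(group.group(1))
        for group in GROUP.finditer(log_text)
        if int(group.group(2), 16) == pc
    ]
    return {"reg": match.group(1), "pc": pc, "right": right, "wrong": wrong,
            "groups": groups}


log = """\
commit group [11]: pc 00800017fe cmtcnt 6
commit group [12]: pc 008000180c cmtcnt 6 <--
commit group [13]: pc 008000139a cmtcnt 6
   a3 different at pc = 0x008000180c, right= 0x00000000000002, wrong = 0x0000000000000000
"""
info = locate_mismatch(log)
print(info["reg"], hex(info["pc"]), info["groups"])
```

For the excerpt above, the helper points at commit group 12, the same entry the trace marks with `<--`.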

- Hands-on: Dump waveform after Difftest detects the bug
- Showcase

```
$ bash run-emu-dumpwave.sh
$ ls ../XiangShan/build | grep "vcd"   # There should be a .vcd waveform
```

```
# ./emu-bug \
#     -i $AM_HOME/apps/coremark/build/coremark-riscv64-xs.bin \
#     --diff $NEMU_HOME/build/riscv64-nemu-interpreter-so \
#     -b 12000 \       # the begin cycle of the wave; you may find that XiangShan traps at around cycle 12103 in the previous step
#     -e 13000 \       # the end cycle of the wave
#     --dump-wave      # set this option to dump the wave; you will find *.vcd in $NOOP_HOME/build
```






# Light-weight Simulation SnapShot: LightSSS

- Re-running a simulation is time-consuming!
  - Snapshots are the way out
- LightSSS: snapshots of the process with fork()
  - Relies on copy-on-write in the Linux kernel
- Advantage 1: Good portability and scalability
  - Snapshots for any external models (such as C++)
  - No need to understand details of external models
- Advantage 2: Low overhead of snapshots
  - ~500us for taking a snapshot
  - Far less overhead than RTL snapshots by Verilator
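
The fork()-based idea can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the LightSSS code: the function names, the single retained snapshot, and the pipe protocol (`R` to replay, `X` to discard) are all hypothetical, and the recorded cycle list stands in for a waveform dump. It needs a POSIX system because it calls `os.fork()`.

```python
import os


def _discard(snapshot):
    """Tell a sleeping snapshot process to exit and reap it."""
    pid, wake_fd = snapshot
    os.write(wake_fd, b"X")
    os.waitpid(pid, 0)
    os.close(wake_fd)


def run_with_snapshots(total_cycles, interval, bug_cycle):
    """Run a toy simulation; on a bug, replay from the latest fork() snapshot.

    Returns the cycles re-executed by the woken snapshot (empty if no bug).
    """
    result_r, result_w = os.pipe()   # the replaying child reports through this pipe
    snapshot = None                  # (pid, wake_fd) of the sleeping snapshot
    replaying = False
    trace = []
    for cycle in range(total_cycles):
        if not replaying and cycle % interval == 0:
            wake_r, wake_w = os.pipe()
            pid = os.fork()          # copy-on-write copy of the whole process
            if pid == 0:
                # Child = snapshot: sleep until the parent sends a command.
                if os.read(wake_r, 1) != b"R":
                    os._exit(0)      # discarded: a newer snapshot replaced it
                replaying = True     # woken up: re-run from here with tracing
            else:
                os.close(wake_r)
                if snapshot is not None:
                    _discard(snapshot)
                snapshot = (pid, wake_w)
        if replaying:
            trace.append(cycle)      # stands in for dumping a waveform
        if cycle == bug_cycle:
            if replaying:
                os.write(result_w, ",".join(map(str, trace)).encode())
                os._exit(0)
            os.write(snapshot[1], b"R")   # wake the snapshot for a replay
            os.close(result_w)
            with os.fdopen(result_r) as pipe:
                data = pipe.read()        # returns once the child has exited
            os.waitpid(snapshot[0], 0)
            os.close(snapshot[1])
            return [int(c) for c in data.split(",")]
    if snapshot is not None:
        _discard(snapshot)
    os.close(result_r)
    os.close(result_w)
    return []


replayed = run_with_snapshots(total_cycles=100, interval=10, bug_cycle=37)
print(replayed)
```

With a snapshot every 10 cycles and a bug at cycle 37, only cycles 30 to 37 are re-executed with tracing enabled, which is why no external model has to support save and restore itself.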



# LightSSS for on-demand debugging (waveform)

- Hands-on: use LightSSS to dump waveform
- Showcase

```
$ bash run-emu-lightsss.sh
```

```
# ./emu-bug \
#     -i $AM_HOME/apps/coremark/build/coremark-riscv64-xs.bin \
#     --diff $NEMU_HOME/build/riscv64-nemu-interpreter-so \
#     --enable-fork \      # no need for complex parameters any more
#     2>lightsss.err
```

- LightSSS is working if you see:

```
[FORK_INFO pid(NUMBER)] the oldest checkpoint start to dump wave and dump nemu log...
[FORK_INFO pid(NUMBER)] dump wave to /SOME/PATH/XXX.vcd...
```

- After that, the simulation restarts from the latest existing snapshot






### Debug-friendly Structured Database: ChiselDB

#### Motivation:

- Waveforms are large and hard to analyze further
- Need to analyze structured data like memory transaction trace

#### We present ChiselDB

#### Highlights:

- Inserting probes between module interfaces in hardware
- DPI-C: call C++ functions from Chisel code to transfer data
- Persist in database, SQL queries supported

#### ChiselDB

- **Design source code:** XiangShan/utility/src/main/scala/utility/ChiselDB.scala
- **Usage:** Create table

```
// API: def createTable[T <: Record](tableName: String, hw: T): Table[T]
import chisel3._
import huancun.utils.ChiselDB

class MyBundle extends Bundle {
  val fieldA = UInt(10.W)
  val fieldB = UInt(20.W)
}
val table = ChiselDB.createTable("MyTableName", new MyBundle)
```

### ChiselDB

- **Usage:** Register probes

```
/* APIs
def log(data: T, en: Bool, site: String = "", clock: Clock, reset: Reset): Unit
def log(data: Valid[T], site: String, clock: Clock, reset: Reset): Unit
def log(data: DecoupledIO[T], site: String, clock: Clock, reset: Reset): Unit
*/
table.log(
  data = my_data /* hardware of type T */,
  en = my_cond, site = "MyCallSite",
  clock = clock, reset = reset
)
```

### ChiselDB

- Example: XiangShan/src/main/scala/xiangshan/cache/mmu/L2TLB.scala

```
class L2TlbMissQueueDB(implicit p: Parameters) extends TlbBundle {
  val vpn = UInt(vpnLen.W)
}
val L2TlbMissQueueInDB, L2TlbMissQueueOutDB = Wire(new L2TlbMissQueueDB)
L2TlbMissQueueInDB.vpn := missQueue.io.in.bits.vpn
L2TlbMissQueueOutDB.vpn := missQueue.io.out.bits.vpn
val L2TlbMissQueueTable = ChiselDB.createTable(
  "L2TlbMissQueue_hart" + p(XSCoreParamsKey).HartId.toString, new L2TlbMissQueueDB)
L2TlbMissQueueTable.log(L2TlbMissQueueInDB, missQueue.io.in.fire, "L2TlbMissQueueIn", clock, reset)
L2TlbMissQueueTable.log(L2TlbMissQueueOutDB, missQueue.io.out.fire, "L2TlbMissQueueOut", clock, reset)
```

# ChiselDB

- Hands-on: Analysis of Cache Coherence Violation
  - Add ChiselDB
  - Inject bug
  - Compile and run
  - Analyze ChiselDB

# ChiselDB

Hands-on: Analysis of Cache Coherence Violation

```
# Inject bug: we set all Release Data to a constant value
$ cd chiseldb && cat cc_err.patch
--- a/src/main/scala/huancun/noninclusive/SinkC.scala
+++ b/src/main/scala/huancun/noninclusive/SinkC.scala
@@ -93,7 +93,7 @@ class SinkC(implicit p: Parameters) extends BaseSinkC {
-      buffer(insertIdx)(count) := c.bits.data
+      buffer(insertIdx)(count) := 0xABCDEF.U
```

# ChiselDB

Hands-on: Analysis of Cache Coherence Violation

```
# simulate using pre-built emu (ChiselDB enabled)

$ bash step1-run.sh
```

```
# ./emu-cc-err -i $NOOP_HOME/ready-to-run/microbench.bin \
#     --diff $NOOP_HOME/ready-to-run/riscv64-nemu-interpreter-so \
#     --dump-db \          # set this argument to dump the database
#     2>linux.err
```

# ChiselDB

Hands-on: Analysis of Cache Coherence Violation

```
# Analyze
$ bash step2-analyze.sh
# 1. Use an sqlite query to get all transactions on the erroneous address
# 2. Use a script to parse TLLog
# sqlite3 $NOOP_HOME/build/*.db \
#   "select * from TLLOG where ADDRESS=0x80008fc0" | \
#   sh $NOOP_HOME/scripts/utils/parseTLLog.sh
```

# ChiselDB

```
# Result: [Time | To_From | Channel | Opcode | Permission | Address | Data(0)]
# Data successfully transferred from L1D to L2
186297 L2_L1D_0 C ReleaseData Shrink TtoN 80008fc0 000000000000000
# But when L1D acquires addr again, data loaded from L2 is wrong
187022 L2_L1D_0 A AcquireBlock Grow NtoB 80008fc0
187029 L2_L1D_0 D GrantData Cap toT 80008fc0 0000000000abcdef
# So there must be something wrong when L2 records Release Data
```
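The same check can be scripted against the database. The sketch below uses Python's standard `sqlite3` module on an in-memory table; the column names (`STAMP`, `SITE`, `CHANNEL`, `OPCODE`, `ADDRESS`, `DATA`) are assumptions for illustration and may not match the real `TLLOG` schema, and the three rows are copied from the result above.

```python
import sqlite3


def find_corrupted_grants(conn, address):
    """Return (time, data) of GrantData beats that differ from the released data."""
    history = conn.execute(
        "SELECT STAMP, OPCODE, DATA FROM TLLOG WHERE ADDRESS = ? ORDER BY STAMP",
        (address,),
    ).fetchall()
    released = None
    suspicious = []
    for stamp, opcode, data in history:
        if opcode == "ReleaseData":
            released = data            # last value written back by L1D
        elif opcode == "GrantData" and released is not None and data != released:
            suspicious.append((stamp, data))
    return suspicious


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TLLOG (STAMP INTEGER, SITE TEXT, CHANNEL TEXT, "
    "OPCODE TEXT, ADDRESS INTEGER, DATA INTEGER)"
)
conn.executemany(
    "INSERT INTO TLLOG VALUES (?, ?, ?, ?, ?, ?)",
    [
        (186297, "L2_L1D_0", "C", "ReleaseData", 0x80008FC0, 0x0),
        (187022, "L2_L1D_0", "A", "AcquireBlock", 0x80008FC0, None),
        (187029, "L2_L1D_0", "D", "GrantData", 0x80008FC0, 0xABCDEF),
    ],
)
bad = find_corrupted_grants(conn, 0x80008FC0)
print([(stamp, hex(data)) for stamp, data in bad])
```

Keeping the trace in a database means this kind of rule can be rerun on every failing simulation instead of being read off a waveform by hand.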



## MinJie Performance Verification



# **Checkpoints**

- Uniform Checkpoint
  - Divide a program into small parts that can be simulated in parallel
- Simpoint Checkpoint
  - Select representative parts of a program with cluster analysis





step0: Prepare NEMU environment for checkpoints

```
$ cd ../checkpoint    # tutorial/checkpoint
$ bash uniform-step0-prepare.sh
```

```
# uniform-step0-prepare.sh
# cd $NEMU_HOME
# git checkout asplos2023-tutorial
# git submodule update --init
# make clean
# make tutorial_defconfig
# make -j4
# cd resource/gcpt_restore && make -j4    # generate gcpt restorer binary
```



Basic format of RISC-V Checkpoint

step1: Generate uniform checkpoints in NEMU

\$ bash uniform-step1-gen.sh

```
# uniform-step1-gen.sh
# cd $NEMU_HOME
# rm -rf tutorial_uniform
# ./build/riscv64-nemu-interpreter \
#     --cpt-interval 2000000 \                       # the profiling period for simpoint
#     -u \                                           # uniformly take cpt with fixed interval
#     -b \                                           # run with batch mode
#     -D tutorial_uniform \                          # directory where the checkpoint is generated
#     -C try \                                       # name of this task
#     -w linux \                                     # name of workload
#     -r ./resource/gcpt_restore/build/gcpt.bin \    # binary of gcpt restorer
#     --dont-skip-boot \                             # checkpoint immediately after boot
#     -I 3000000 \                                   # max number of instructions executed
#     ./ready-to-run/linux-0xa0000.bin
```
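With a fixed interval and an instruction limit, the expected checkpoint positions follow directly. The Python sketch below is only arithmetic on the two values passed above (`--cpt-interval` and `-I`); the function name is hypothetical, and NEMU's actual trigger point can differ by a few instructions, as the log reports 2000002 rather than exactly 2000000.

```python
def uniform_checkpoints(interval, max_insts):
    """Nominal instruction counts at which uniform checkpoints are taken."""
    return list(range(interval, max_insts + 1, interval))


points = uniform_checkpoints(interval=2_000_000, max_insts=3_000_000)
print(points)  # [2000000]
```

With these settings only one checkpoint fits before the 3,000,000-instruction limit, which matches the single `.gz` file produced in this step.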



```
[    0.000000] Dentry cache hash table entries: 4096 (order: 3, 32768 bytes)
[    0.000000] ... hash table entries: 2048 (order: 2, 16384 bytes)
[    0.000000] Sorting __ex_table...
# --- Take a checkpoint from here ---
[src/checkpoint/serializer.cpp:255,shouldTakeCpt] Should take cpt now: 2000002
[src/checkpoint/path_manager.cpp:77,setOutputDir] Created tutorial_uniform/try/linux/0/
# --- Save int and float registers ---
[src/checkpoint/serializer.cpp:127,serializeRegs] Writing int registers to checkpoint memory @[0x80001000, 0x80001100) [0x1000, 0x1100)
[src/checkpoint/serializer.cpp:137,serializeRegs] Writing float registers to checkpoint memory @[0x80001100, 0x80001200) [0x1100, 0x1200)
# --- Save CSR registers ---
[src/checkpoint/serializer.cpp:168,serializeRegs] CSR 0x142: 0xc
[src/checkpoint/serializer.cpp:168,serializeRegs] CSR 0x143: 0x80200098
...
[src/checkpoint/serializer.cpp:168,serializeRegs] CSR 0x342: 0x9
[src/checkpoint/serializer.cpp:168,serializeRegs] CSR 0x3a0: 0x1f0800
...
[src/checkpoint/serializer.cpp:168,serializeRegs] CSR 0x5c4: 0x3
[src/checkpoint/serializer.cpp:171,serializeRegs] Writing CSR to checkpoint memory @[0x80001300, 0x80009300) [0x1300, 0x9300)
# --- Save other information and gcpt restorer ---
...
[src/checkpoint/serializer.cpp:192,serializeRegs] Record time: 0x1 at addr 0x80000f18
[src/checkpoint/serializer.cpp:81,serializePMem] Put gcpt restorer ./resource/gcpt_restore/build/gcpt.bin to start of pmem
# --- Write checkpoints to gz file ---
Opening tutorial_uniform/try/linux/0/_2000002_.gz as checkpoint output file
[src/checkpoint/serializer.cpp:110,serializePMem] Written 0x7fffffff bytes
[src/checkpoint/serializer.cpp:110,serializePMem] Written 0x7fffffff bytes
[src/checkpoint/serializer.cpp:110,serializePMem] Written 0x7fffffff bytes
[src/checkpoint/serializer.cpp:110,serializePMem] Written 0x7fffffff bytes
[src/checkpoint/serializer.cpp:110,serializePMem] Written 0x4 bytes
[src/checkpoint/serializer.cpp:263,notify_taken] Taking checkpoint @ instruction count 2000002
[src/cpu/cpu-exec.c:203,per_bb_profile] Should take checkpoint on pc 0xffffffff80003e4c
[    0.000000] Memory: 25548K/30720K available (699K kernel code, 78K rwdata, 102K rodata, 3648K init, 98K bss, 5172K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS: 0, nr_irqs: 0, preallocated irqs: 0
```

- step2: Run the uniform checkpoint in NEMU or XiangShan
- Here we take NEMU as an example

```
$ bash uniform-step2-run-nemu.sh
```



```
[src/monitor/monitor.c:103,parse_args] Restoring from checkpoint
# --- Restore from checkpoint and load gz files ---
[src/monitor/image_loader.c:69,load_img] Loading Gcpt file form cmdline: tutorial_uniform/try/linux/0/_2000002_.gz
[src/monitor/image_loader.c:77,load_img] Loading GZ image tutorial_uniform/try/linux/0/_2000002_.gz
[src/monitor/image_loader.c:69,load_img] Loading Gcpt restorer form cmdline: (null)
[src/monitor/image_loader.c:71,load_img] No image is given. Use the default built-in image/restorer.
[src/device/io/port-io.c:30,add_pio_map] Add port-io map 'uartlite' at [0x000000000000003f8, 0x00000000000404]
[src/device/io/mmio.c:38,add_mmio_map] Add mmio map 'uartlite' at [0x000000040600000, 0x00000004060000c]
[src/device/io/port-io.c:30,add_pio_map] Add port-io map 'rtc' at [0x0000000000000048, 0x000000000004f]
[src/device/io/mmio.c:38,add_mmio_map] Add mmio map 'rtc' at [0x00000000a1000048, 0x00000000a100004f]
[src/device/io/port-io.c:30,add_pio_map] Add port-io map 'screen' at [0x0000000000000100, 0x00000000000107]
[src/device/io/mmio.c:38,add_mmio_map] Add mmio map 'screen' at [0x0000000040001000, 0x0000000040001007]
[src/device/io/mmio.c:38,add_mmio_map] Add mmio map 'vmem' at [0x0000000050000000, 0x0000000500752ff]
[src/device/io/port-io.c:30,add_pio_map] Add port-io map 'keyboard' at [0x00000000000000000, 0x00000000000003]
[src/device/io/mmio.c:38,add_mmio_map] Add mmio map 'keyboard' at [0x00000000a1000060, 0x00000000a1000063]
[src/device/io/mmio.c:38,add_mmio_map] Add mmio map 'sdhci' at [0x0000000040002000, 0x000000004000207f]
[src/device/sdcard.c:137,init_sdcard] Can not find sdcard image:
[src/monitor/monitor.c:44,welcome] Debug: OFF
[src/monitor/monitor.c:49,welcome] Build time: 19:35:10, Mar 25 2023
# --- NEMU will start execution from the moment of that checkpoint ---
Welcome to ...-NEMU!
For help, type "help"
[    0.000000] Memory: 25548K/30720K available (699K kernel code, 78K rwdata, 102K rodata, 3648K init, 98K bss, 5172K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] NR_IRQS: 0, nr_irqs: 0, preallocated irqs: 0
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x1d854df40, max_idle_ns: 3526361616960 ns
[    0.000000] console [hvc0] enabled
[    0.000000] console [hvc0] enabled
[    0.000000] bootconsole [early0] disabled
[    0.000000] bootconsole [early0] disabled
[    0.000000] Calibrating delay loop (skipped), value calculated using timer frequency.. 2.00 BogoMIPS (lpj=10000)
[    0.000000] pid_max: default: 4096 minimum: 301
[    0.000000] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.000000] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.010000] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.020000] clocksource: Switched to clocksource riscv_clocksource
```

step0: Prepare NEMU environment for Simpoint checkpoint

\$ bash simpoint-step0-prepare.sh

```
# simpoint-step0-prepare.sh
# cd $NEMU_HOME
# git checkout cpt-bk
# git submodule update --init
# cd resource/simpoint/simpoint_repo/analysiscode && make simpoint -j4    # generate simpoint generator binary
# cd resource/gcpt_restore && make -j4                                    # generate gcpt restorer binary
# cd $NEMU_HOME
# make clean
# make ISA=riscv64 XIANGSHAN=1 -j4                                        # compile NEMU
```

step1: Execute the workload and collect program behavior

\$ bash simpoint-step1-profiling.sh

```
# simpoint-step1-profiling.sh
# cd $NEMU_HOME
# rm -rf tutorial_simpoint
# ./build/riscv64-nemu-interpreter \
#     -b \                                  # run with batch mode
#     -D $NEMU_HOME/tutorial_simpoint \     # directory where the checkpoint is generated
#     -C profiling \                        # name of this task
#     -w stream \                           # name of workload
#     --simpoint-profile \                  # is simpoint profiling
#     --interval 50000000 \                 # simpoint interval instructions
#     $XS_PROJECT_ROOT/tutorial/bin/stream-0xa0000.bin
```



```
[src/isa/riscv64/exec/special.c,38,exec_nemu_trap] Start profiling, resetting inst count
# --- NEMU will do profiling while executing workload STREAM ---
STREAM version $Revision: 5.10 $
This system uses 4 bytes per array element.
Array size = 2000000 (elements), Offset = 0 (elements)
Memory per array = 7.6 MiB (= 0.0 GiB).
Total memory required = 22.9 MiB (= 0.0 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
Your clock granularity/precision appears to be 2047 microseconds.
Each test below will take on the order of 102400 microseconds.
   (= 50 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:              710.2     0.023438     0.022528     0.024576
Scale:             181.7     0.089884     0.088064     0.092160
Add:               217.0     0.112868     0.110592     0.114688
Triad:             172.3     0.140402     0.139264     0.141312
Solution Validates: avg error less than 1.000000e-06 on all three arrays
[src/monitor/cpu-exec.c,110,cpu_exec] nemu: HIT GOOD TRAP at pc = 0x0000002aaaaaa5f6
```
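The "Best Rate" column can be reproduced from the minimum times in the output above. The Python sketch below applies the usual STREAM accounting (Copy and Scale touch two arrays, Add and Triad touch three) to the array size and element width printed in the log; the function name is hypothetical.

```python
ARRAY_ELEMS = 2_000_000   # "Array size = 2000000 (elements)"
BYTES_PER_ELEM = 4        # "This system uses 4 bytes per array element."


def best_rate_mb_s(arrays_touched, min_time_s):
    """Bandwidth in MB/s (1 MB = 1e6 bytes) from the best kernel time."""
    bytes_moved = arrays_touched * ARRAY_ELEMS * BYTES_PER_ELEM
    return 1.0e-6 * bytes_moved / min_time_s


# (arrays touched per iteration, min time in seconds from the log)
kernels = {
    "Copy": (2, 0.022528),
    "Scale": (2, 0.088064),
    "Add": (3, 0.110592),
    "Triad": (3, 0.139264),
}
rates = {name: round(best_rate_mb_s(n, t), 1) for name, (n, t) in kernels.items()}
print(rates)
```

The recomputed values agree with the reported 710.2, 181.7, 217.0 and 172.3 MB/s, which confirms how the table columns line up.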

step2: Cluster and obtain multiple representative slices

\$ bash simpoint-step2-cluster.sh

```
# simpoint-step2-cluster.sh
# cd $NEMU_HOME
# mkdir -p $NEMU_HOME/tutorial_simpoint/cluster/stream
# export CLUSTER=$NEMU_HOME/tutorial_simpoint/cluster/stream
# ./resource/simpoint/simpoint_repo/bin/simpoint \
#     -loadFVFile ./tutorial_simpoint/profiling/simpoint_bbv.gz \   # load a frequency vector file
#     -saveSimpoints $CLUSTER/simpoints0 \                          # file to save simpoints
#     -saveSimpointWeights $CLUSTER/weights0 \                      # file to save simpoint weights
#     -inputVectorsGzipped \                                        # input vectors have been compressed with gzip
#     -maxK 3 \                                                     # maximum number of clusters to use
#     -numInitSeeds 2 \                                             # number of random initializations per run; only the best clustering is kept
#     -iters 1000 \                                                 # maximum number of k-means iterations to perform
#     -seedkm 123456 \                                              # random seed for choosing initial k-means centers
#     -seedproj 654321                                              # random seed for random linear projection
```



```
Run number 3 of at most 4, k = 2
# --- Compute the information of K-means clustering ---
  Initialization seed trial #1 of 2; initialization seed = 123460
  Initialized k-means centers using random sampling: 2 centers
  Number of k-means iterations performed: 1
  BIC score: 123.793
  Distortion: 1.61246
  Distortions/cluster: 1.54841 0.0640483
  Variance: 0.161246
  Variances/cluster: 1.54841 0.00711647
  Initialization seed trial #2 of 2; initialization seed = 123461
  Initialized k-means centers using random sampling: 2 centers
  Number of k-means iterations performed: 1
  BIC score: 123.793
  Distortion: 1.61246
  Distortions/cluster: 0.0640483 1.54841
  Variance: 0.161246
  Variances/cluster: 0.00711647 1.54841
  The best initialization seed trial was #1
# --- Obtain multiple representative program checkpoints ---
Post-processing runs
  For the BIC threshold, the best clustering was run 3 (k = 2)
  Post-processing run 3 (k = 2)
```
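To make the clustering result concrete, here is a minimal Python sketch of what happens once k-means has assigned every interval to a cluster: the interval closest to each cluster centroid becomes the representative slice, and its weight is the share of intervals in that cluster. The function name `pick_simpoints` and the toy vectors are hypothetical; the real SimPoint tool also applies random projection and BIC-based selection of k, which this sketch leaves out.

```python
from collections import defaultdict
from math import dist


def pick_simpoints(vectors, labels):
    """Return {cluster: (representative_interval_index, weight)}."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    result = {}
    for label, idxs in members.items():
        dim = len(vectors[idxs[0]])
        # Centroid = per-dimension mean of the cluster's vectors.
        centroid = [sum(vectors[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        # Representative = member interval closest to the centroid.
        rep = min(idxs, key=lambda i: dist(vectors[i], centroid))
        result[label] = (rep, len(idxs) / len(vectors))
    return result


# Toy basic-block vectors for six intervals, already labelled by k-means (k = 2).
bbvs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.3, 0.7]]
labels = [0, 0, 0, 1, 1, 1]
simpoints = pick_simpoints(bbvs, labels)
for cluster, (interval, weight) in sorted(simpoints.items()):
    print(f"cluster {cluster}: interval {interval}, weight {weight}")
```

The representative index and weight pairs play the role of the `simpoints0` and `weights0` files written by the script.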

step3: Generate corresponding Checkpoints based on clustering results

```
$ bash simpoint-step3-take-cpt.sh # It may take about 3 minutes
```

```
# simpoint-step3-take-cpt.sh
# cd $NEMU_HOME
# rm -rf $NEMU_HOME/tutorial_simpoint/stream/take_cpt
# ./build/riscv64-nemu-interpreter
#     -b                                  # run with batch mode
#     -D tutorial_simpoint                # directory where the checkpoint is generated
#     -C take_cpt                         # name of this task
#     -w stream                           # name of workload
#     -S ./tutorial_simpoint/cluster      # simpoints dir
#     --checkpoint-interval 50000000      # simpoint interval (instructions)
#     $XS_PROJECT_ROOT/tutorial/bin/stream-0xa0000.bin
```





- step4: Run the simpoint checkpoint in NEMU or XiangShan
- Here we take XiangShan as an example

```
$ bash simpoint-step4-run-xs.sh
```

```
# simpoint-step4-run-xs.sh

$XS_PROJECT_ROOT/tutorial/emu \
   -i `find $NEMU_HOME/tutorial_simpoint/ -type f -name "*.gz" | tail -1` \
   --diff $NOOP_HOME/ready-to-run/riscv64-nemu-interpreter-so \
   --max-cycles=100000 \
   2>simpoint.err
```
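Once every checkpoint has been simulated, the per-checkpoint results are usually combined with the SimPoint weights to estimate whole-program performance. The sketch below shows that arithmetic in Python under the assumption that all slices contain the same number of instructions, so CPI (rather than IPC) is averaged linearly. The function name `weighted_cpi` and all numbers are made up for illustration; they are not XiangShan measurements.

```python
def weighted_cpi(results):
    """results: list of (weight, instructions, cycles), one entry per checkpoint."""
    total_weight = sum(w for w, _, _ in results)
    # Every slice has the same instruction count, so CPI averages linearly.
    return sum(w * cycles / instrs for w, instrs, cycles in results) / total_weight


# Hypothetical numbers for two checkpoints.
checkpoints = [
    (0.75, 1_000_000, 2_000_000),  # weight, instructions, cycles (CPI 2.0)
    (0.25, 1_000_000, 4_000_000),  # CPI 4.0
]
cpi = weighted_cpi(checkpoints)
print(f"estimated CPI = {cpi:.3f}, IPC = {1 / cpi:.3f}")
```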

# Simulation Performance Counter

### Two common types:

- Accumulation: basic accumulator counter
- **Histogram**: count the distribution

Accumulation:

- [PERF][time=10000] ctrlBlock.rob: commitInstr, 1500
- [PERF][time=10000] ctrlBlock.rob: waitLoadCycle, 800

Histogram:

- [PERF][time=10000] l2cache: acquire\_period\_10\_20, 15
- [PERF][time=10000] l2cache: acquire\_period\_20\_30, 200
- [PERF][time=10000] l2cache: acquire\_period\_30\_40, 60
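The difference between the two counter types can be modelled in a few lines of Python. This is only a behavioural sketch: the class names `Accumulator` and `Histogram` and the bucket parameters `start`, `stop`, and `step` are assumptions made for illustration, not the Chisel API.

```python
from collections import Counter


class Accumulator:
    """Adds the sampled value to a running total every cycle."""

    def __init__(self, name):
        self.name, self.total = name, 0

    def sample(self, value):
        self.total += value


class Histogram:
    """Counts how often a sampled value falls into each [lo, hi) bucket."""

    def __init__(self, name, start, stop, step):
        self.name, self.start, self.stop, self.step = name, start, stop, step
        self.buckets = Counter()

    def sample(self, value):
        if self.start <= value < self.stop:
            lo = self.start + (value - self.start) // self.step * self.step
            self.buckets[f"{self.name}_{lo}_{lo + self.step}"] += 1


commit = Accumulator("commitInstr")
period = Histogram("acquire_period", start=10, stop=40, step=10)
# Each tuple is one sampled cycle: (instructions committed, observed latency).
for committed, latency in [(2, 12), (0, 25), (4, 27), (1, 38)]:
    commit.sample(committed)
    period.sample(latency)

print(commit.name, commit.total)
for bucket, count in sorted(period.buckets.items()):
    print(bucket, count)
```

An accumulator answers "how many in total", while a histogram answers "how are the values distributed", at the cost of one counter per bucket.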


### Usage:

- Add XSPerfAccumulate("name", signal) or XSPerfHistogram wherever you want a counter
- By default, counters are printed when the simulation finishes
- Set DebugOptions(EnablePerfDebug = false) to turn them off

### Showcase

```
$ cd ..
$ bash run-emu-perf.sh

# ./emu -i hello.bin \
   --diff $NEMU_HOME/build/riscv64-nemu-interpreter-so 2>perf.err

$ less perf.err
```


```
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.ctrlBlock.decode: utilization,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.ctrlBlock.decode: waitInstr,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.ctrlBlock.decode: stall_cycle,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend: FrontendBubble,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.ibuffer: utilization,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.ibuffer: util_0_1,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.ibuffer: util_1_2,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.ibuffer: util_2_3,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.ibuffer: util_3_4,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.ibuffer: util_4_5,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.ibuffer: util_5_6,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.ibuffer: util_6_7,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.ibuffer: util_7_8,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.ibuffer: full,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.ibuffer: exHalf,

[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.bpu.predictors.components_3.tables_0.wrbypass: wrbypass_hit,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.bpu.predictors.components_3.tables_0.wrbypass: wrbypass_miss,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.bpu.predictors.components_3.tables_1.wrbypass: wrbypass_hit,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.bpu.predictors.components_3.tables_1.wrbypass: wrbypass_miss,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.exuBlocks.scheduler.rs_4.staRS_0.oldestSelection: oldest_override_0,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.exuBlocks.scheduler.rs_4.staRS_0.oldestSelection: oldest_same_as_selected_0,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.exuBlocks.scheduler.rs_4.staRS_0.oldestSelection: oldest_override_1,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.exuBlocks.scheduler.rs_4.staRS_0.oldestSelection: oldest_same_as_selected_1,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.memBlock.dtlb_st_tlb_st.entries: port0_np_sp_multi_hit,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.memBlock.dtlb_st_tlb_st.entries: port1_np_sp_multi_hit,

[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.icache.missUnit.entries_1: entryPenalty1,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.icache.missUnit.entries_1: entryReq1,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.icache.missUnit.entries_0: entryPenalty0,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.core.frontend.icache.missUnit.entries_0: entryReq0,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.misc.busPMU: dcache_bank_0_A_channel_fire,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.misc.busPMU: dcache_bank_0_A_channel_stall,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.misc.busPMU: dcache_bank_0_A_channel_PutFullData_fire,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.misc.busPMU: dcache_bank_0_A_channel_PutFullData_stall,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.misc.busPMU: dcache_bank_0_A_channel_PutPartialData_fire,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.misc.busPMU: dcache_bank_0_A_channel_PutPartialData_stall,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.misc.busPMU: dcache_bank_0_A_channel_ArithmeticData_fire,
[PERF ][time=                2893] TOP.SimTop.l_soc.core_with_l2.misc.busPMU: dcache_bank_0_A_channel_ArithmeticData_stall,
```

Find more parsing scripts at https://github.com/OpenXiangShan/env-scripts/blob/main/perf/perf.py
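For a quick look at a log without the full script, the counter lines can be parsed with a regular expression. The following is a minimal Python sketch, not the `perf.py` script linked above: the pattern is inferred from the sample lines in this tutorial, and `parse_perf` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# "[PERF ][time=   2893] module.path: counter_name, value"
PERF_LINE = re.compile(r"\[PERF\s*\]\[time=\s*(\d+)\]\s*([\w.]+):\s*(\w+),\s*(\d+)")


def parse_perf(lines):
    """Return {module_path: {counter_name: value}} for well-formed PERF lines."""
    counters = defaultdict(dict)
    for line in lines:
        match = PERF_LINE.search(line)
        if match:
            _time, module, name, value = match.groups()
            counters[module][name] = int(value)
    return counters


# Sample lines in the format shown earlier; the values are illustrative.
sample = [
    "[PERF][time=10000] ctrlBlock.rob: commitInstr, 1500",
    "[PERF][time=10000] ctrlBlock.rob: waitLoadCycle, 800",
    "[PERF ][time=               10000] l2cache: acquire_period_10_20, 15",
    "some unrelated log line",
]
perf = parse_perf(sample)
rob = perf["ctrlBlock.rob"]
print("commitInstr per cycle:", rob["commitInstr"] / 10000)
```

In practice the input would be the lines of `perf.err`, for example `parse_perf(open("perf.err"))`.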

### Motivation

- We want to test performance under different parameters
- But re-compilation is time-consuming
- Can we change parameters (constants) at runtime?
- We present Constantin (Constant in)
- Design/Highlights:
  - DPI-C: call C++ functions from Chisel code (through a Verilog BlackBox)
  - Signal values are configured at runtime rather than at compile time




### Constants stay constant only within a single simulation run



- Usage: Create signal
- XiangShan/coupledL2/src/main/scala/coupledL2/prefetch/TemporalPrefetch.scala

```
require(cacheParams.hartIds.size == 1)
val hartid = cacheParams.hartIds.head
// 0 / 1: whether to enable temporal prefetcher
private val enableTP = WireInit(Constantin.createRecord("enableTP"+hartid.toString, initValue = 1.U))
// 0 ~ N: throttle cycles for each prefetch request
private val tpThrottleCycles = WireInit(Constantin.createRecord("tp_throttleCycles"+hartid.toString, initValue = 4.U(3.W)))
// 0 / 1: whether request to set as trigger on meta hit
private val hitAsTrigger = WireInit(Constantin.createRecord("tp_hitAsTrigger"+hartid.toString, initValue = 1.U))
// 1 ~ triggerQueueDepth: enqueue threshold for triggerQueue
private val triggerThres = WireInit(Constantin.createRecord("tp_triggerThres"+hartid.toString, initValue = 1.U(3.W)))
// 1 ~ tpEntryMaxLen: record threshold for recorder and sender (storage size will not be affected)
private val recordThres = WireInit(Constantin.createRecord("tp_recordThres"+hartid.toString, initValue = tpEntryMaxLen.U))
// 0 / 1: whether to train on vaddr
private val trainOnVaddr = WireInit(Constantin.createRecord("tp_trainOnVaddr"+hartid.toString, initValue = 0.U))
// 0 / 1: whether to eliminate L1 prefetch request training
private val trainOnL1PF = WireInit(Constantin.createRecord("tp_trainOnL1PF"+hartid.toString, initValue = 0.U))
```

• Hands-on: Pass constant via standard input stream

```
$ cd constantin    # now in tutorial/constantin
# Build emu with constantin (time-consuming, use pre-built emu instead)
# bash step0-build.sh
# Run emu and pass constant
$ bash step1-basic.sh
please input total constant number
please input each constant ([constant name] [value])
tp_recordThres0 14
tp_throttleCycles0 3

The reference model is ./ready-to-run/riscv64-nemu-interpreter-so
The first instruction of core 0 has commited. Difftest enabled.
Core 0: EXCEEDING CYCLE/INSTR LIMIT at pc = 0x800000e4c
instrCnt = 1,000, cycleCnt = 10,235, IPC = 0.097704
```


```
emu-constantin compiled at Oct 28 2023, 17:38:43
please input total constant number
please input each constant ([constant name] [value])
tp_recordThres0 14
tp_throttleCycles0 3
Using simulated 32768B flash
isWriteICacheTable0 = 0
isWriteFetchToIBufferTable0 = 0
isWriteIfuWbToFtqTable0 = 0
isWritePrefetchPtrTable0 = 0
isWriteFTQTable0 = 0
isWriteBankConflictTable0 = 0
isWriteL1MissQMissTable0 = 0
isWriteLoadMissTable0 = 0
isFirstHitWrite0 = 0
isWriteLoadAccessTable0 = 0
isWriteL2TlbPrefetchTable0 = 0
isWriteL1TlbTable0 = 0
isWritePageCacheTable0 = 0
isWritePTWTable0 = 0
isWriteL2TlbMissQueueTable0 = 0
CorrectMissTrain0 = 0
isWriteInstInfoTable0 = 0
l1_stride_ratio0 = 2
l2_stride_ratio0 = 5
tp_trainOnL1PF0 = 0
tp_hitAsTrigger0 = 1
tp_triggerThres0 = 1
enableTP0 = 1
enableL3StreamPrefetch0 = 0
enableL1StreamPrefetcher0 = 1
nMaxPrefetchEntry0 = 14
always_update0 = 1
ForceWriteUpper_0 = 60
ForceWriteLower_0 = 55
tp_recordThres0 = 14
tp_throttleCycles0 = 3
tp_trainOnVaddr0 = 0
StoreWaitThreshold_0 = 0
ColdDownThreshold_0 = 12
enableDynamicPrefetcher0 = 1
StoreBufferThreshold_0 = 7
StoreBufferBase_0 = 4
l2DepthRatio0 = 2
l3DepthRatio0 = 3
depth0 = 32
Using simulated 8192MB RAM
The image is ./ready-to-run/linux.bin
```
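The interaction above suggests a simple text protocol: a count, followed by one `name value` pair per line, with every other constant keeping its initial value. Here is a minimal Python model of that behaviour. The function `read_constants`, the default for `tp_recordThres0`, and the choice to reject unknown names are assumptions of this sketch, not a description of how the emulator is implemented.

```python
import io


def read_constants(stream, defaults):
    """Read '<count>' then '<name> <value>' lines; override the matching defaults."""
    constants = dict(defaults)
    count = int(stream.readline())
    for _ in range(count):
        name, value = stream.readline().split()
        if name not in constants:
            # Sketch-only policy: fail loudly on a name no signal was created for.
            raise KeyError(f"unknown constant: {name}")
        constants[name] = int(value)
    return constants


# Defaults stand in for the initValue of each created record (16 is a placeholder).
defaults = {"tp_recordThres0": 16, "tp_throttleCycles0": 4, "enableTP0": 1}
user_input = io.StringIO("2\ntp_recordThres0 14\ntp_throttleCycles0 3\n")
constants = read_constants(user_input, defaults)
for name, value in constants.items():
    print(f"{name} = {value}")
```

Passing `sys.stdin` instead of the `StringIO` object would read the same protocol from the terminal.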

• Feature: Auto Solving of better constant values

### Process

- Enable AutoSolving & prepare signals
- Compile
- Run basic demonstration
- Set parameters for AutoSolving
- Auto solve
- Cleanup


```
# Automatic parameter solver configuration file
$ cat constantin.json
```

- Parameters defined
  - properties of signals
  - target of Auto Solving
  - parameters of genetic algorithm
  - parameters of XS simulator

Usage: Auto Solving of better constant values

```
# Run auto solver, output will be the optimal constant currently found
$ bash step2-solve.sh

# The solver generates optimal parameters
opt constant for gene algrithom is
[['block_cycles_cache_0', XXX], ['block_cycles_cache_1', XXX],
['block_cycles_cache_2', XXX], ['block_cycles_cache_3', XXX]] fitness 0
```
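The search idea behind such a solver can be sketched with a small genetic algorithm in Python. This is an illustration only, not the solver shipped with the tutorial: `genetic_search` and `toy_fitness` are hypothetical names, and the toy fitness function stands in for a simulator run that would return a metric such as IPC.

```python
import random


def genetic_search(fitness, bounds, generations=30, population=12, seed=0):
    """Maximise fitness over integer vectors; bounds holds one (lo, hi) per gene."""
    rng = random.Random(seed)
    pop = [[rng.randint(lo, hi) for lo, hi in bounds] for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: population // 2]  # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < population:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(bounds))  # single-point crossover
            child = a[:cut] + b[cut:]
            gene = rng.randrange(len(bounds))  # mutate one gene
            lo, hi = bounds[gene]
            child[gene] = rng.randint(lo, hi)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


# Stand-in fitness: the real flow would launch the simulator with these constants.
def toy_fitness(constants):
    target = [3, 14]  # hypothetical best values
    return -sum((c - t) ** 2 for c, t in zip(constants, target))


best = genetic_search(toy_fitness, bounds=[(0, 7), (1, 16)])
print(best, toy_fitness(best))
```

Because the fitter half of each generation is carried over unchanged, the best fitness found never decreases from one generation to the next.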



# Top-Down: Method for Performance Analysis

- Organize scattered performance events in a hierarchical manner
- Accurately calculate one event's impact on processor performance



- We apply the Top-Down approach to XiangShan
  - Adapt the method specifically to the RISC-V instruction set
  - Optimize **performance counters** according to the XiangShan microarchitecture
  - Further refine the hierarchical design of the Top-Down model
  - Without missing or duplicating any events
  - Without assuming the blocking cycles of performance events in advance
- Due to limited time, there is no hands-on session for this part
  - Please refer to https://github.com/OpenXiangShan/XiangShan/tree/master/scripts/top-down

# **Top-Down**

• Use Top and Backend as examples






Setting Performance Counters in RTL Code

```
val stallReason = Wire(chiselTypeOf(io.stallReason.reason))
// ...
TopDownCounters.values.foreach(ctr =>
    XSPerfAccumulate(ctr.toString(), PopCount(stallReason.map(_ === ctr.id.U)))
)
```
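What the Chisel snippet counts can be modelled in Python: in every cycle each slot reports exactly one stall reason, and each counter accumulates the number of slots reporting its reason. The enum below is a hypothetical three-entry subset chosen for illustration, and `accumulate` is not part of XiangShan.

```python
from collections import Counter
from enum import Enum


class TopDownCounters(Enum):  # hypothetical subset of stall reasons
    NoStall = 0
    OverrideBubble = 1
    IntDqStall = 2


def accumulate(cycles):
    """cycles: one list of per-slot stall reasons for every simulated cycle."""
    totals = Counter({ctr.name: 0 for ctr in TopDownCounters})
    for stall_reason in cycles:
        for ctr in TopDownCounters:
            # Python counterpart of PopCount(stallReason.map(_ === ctr.id.U)).
            totals[ctr.name] += sum(1 for slot in stall_reason if slot == ctr.value)
    return totals


trace = [[0, 0, 1], [2, 2, 2], [0, 1, 1]]  # three cycles, three slots each
totals = accumulate(trace)
print(dict(totals))
```

Since every slot carries exactly one reason per cycle, the counters always sum to slots times cycles, so no event is missed or counted twice.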

• Run the simulation and collect performance counter data


Analyze through Top-Down method

```
# env-scripts: perf/top_down/configs.py
# Attribute perf counters to the Top-Down hierarchy

xs_coarse_rename_map = {
    'OverrideBubble': 'MergeFrontend',
    'FtqFullStall': 'MergeFrontend',
    'IntDqStall': 'MergeCoreDQStall',
    'FpDqStall': 'MergeCoreDQStall'
}
```

Obtain analysis results and make targeted optimizations
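A rename map like the one in the `configs.py` excerpt can be applied by summing all counters that share a target category and then normalising. The Python sketch below shows this under stated assumptions: `coarse_breakdown` is a hypothetical helper, and the counter values (including the `NoStall` entry) are invented for illustration.

```python
from collections import Counter

# Mapping copied from the configs.py excerpt.
xs_coarse_rename_map = {
    "OverrideBubble": "MergeFrontend",
    "FtqFullStall": "MergeFrontend",
    "IntDqStall": "MergeCoreDQStall",
    "FpDqStall": "MergeCoreDQStall",
}


def coarse_breakdown(raw_counters, rename_map):
    """Merge fine-grained counters into coarse categories and normalise to fractions."""
    merged = Counter()
    for name, value in raw_counters.items():
        merged[rename_map.get(name, name)] += value
    total = sum(merged.values())
    return {name: value / total for name, value in merged.items()}


# Hypothetical counter values, for illustration only.
raw = {"OverrideBubble": 100, "FtqFullStall": 50, "IntDqStall": 200,
       "FpDqStall": 50, "NoStall": 600}
breakdown = coarse_breakdown(raw, xs_coarse_rename_map)
for name, share in breakdown.items():
    print(f"{name:18s} {share:.1%}")
```

Counters without an entry in the map keep their own name, so the fractions still add up to one.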



# Summary: MinJie Development Flows and Tools









# Thanks!