# Ventus tutorial @ MICRO 2025

Simulation Part

20251018

## Welcome to Ventus GPGPU

Let's start with configuring the simulation environment.

All Ventus-related repositories are available under [GitHub THU-DSP-LAB](https://github.com/THU-DSP-LAB).  
The [ventus-env](https://github.com/THU-DSP-LAB/ventus-env/tree/MICRO2025) repository aggregates all subprojects,
including the Chisel RTL, simulators, compiler, and software stack.  

This tutorial will be conducted in the ventus-env repository.
Use the following commands to fetch and initialize the repository:

In [None]:
# Not needed today, as you are already in ventus-env

# git clone https://github.com/THU-DSP-LAB/ventus-env -b MICRO2025
# cd ventus-env
# make init

Use the script `build-ventus.sh` to build all projects in one step and install them under `./install/`.   
You can pass `--build` to build a single component; use `--help` for more details.

In [None]:
# Not needed today, as everything has already been built

# bash build-ventus.sh

In [None]:
# But let's try building a single project
bash build-ventus.sh --build gpgpu  # Chisel RTL + Verilator simulation
# It's normal to see little output, because everything was pre-built.

In [None]:
# Build results are installed to `./install` folder
ls install/

## First testcase
Let's start with a simple testcase from POCL: matrix addition.

In [None]:
# Set environment variables to let OpenCL App use Ventus
# Note: this is needed every time you start a new terminal
source env.sh

In [None]:
# Run functional (ISA-level) simulation in ventus-spike
./pocl/build/examples/matadd/matadd

At the end of the output you should see `OK`, indicating the Ventus result matches the CPU result for this test.

You will also see information about the test case, for example:
```bash
numw:1   # There is 1 wavefront (warp) in each workgroup (thread block)
numt:32  # There are 32 workitems (threads) in each wavefront
numwg:1  # There is 1 workgroup (thread block) in this kernel
kernelx:1,kernely:1,kernelz:1  # The workgroups are arranged as a 1x1x1 grid across the three dimensions
```
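The three counters multiply to give the total number of workitems launched. The sketch below assumes the summary lines have the `name:value` form shown above; `total_workitems` is a hypothetical helper name, not part of Ventus.

```shell
# Sketch: derive the total workitem count from the launch summary.
# Assumption: each counter appears on its own line as "name:value".
total_workitems() {
    # $1 = text containing numw:, numt:, and numwg: lines
    printf '%s\n' "$1" | awk -F'[: ]+' '
        $1 == "numw"  { numw  = $2 }   # wavefronts per workgroup
        $1 == "numt"  { numt  = $2 }   # workitems per wavefront
        $1 == "numwg" { numwg = $2 }   # workgroups in the kernel
        END { print numw * numt * numwg }
    '
}

summary='numw:1
numt:32
numwg:1'
total_workitems "$summary"   # 1 * 32 * 1
```

For the matadd values above this prints 32.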

## Functional setup
### Multiple simulation backends

The Ventus driver supports switching among multiple simulation backends:
* Verilator-based RTL simulator (`rtl`)
* SystemC-based cycle-level simulator (`cycle`)
* Spike-based ISA simulator (`isa` or `spike`)
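The backend is selected through the `VENTUS_BACKEND` environment variable, using the names in parentheses above. As a minimal sketch, a small wrapper can run the same command on every backend and keep one log per backend; `run_on_backends` is a hypothetical helper, not something shipped with ventus-env.

```shell
# Sketch: run the given command once per backend.
# VENTUS_BACKEND is set for each run; output goes to <backend>.log
# in the current directory.
run_on_backends() {
    for backend in isa cycle rtl; do
        if VENTUS_BACKEND=$backend "$@" > "$backend.log" 2>&1; then
            echo "$backend sim ok"
        else
            echo "$backend sim FAILED"
        fi
    done
}

# Usage (from the directory of an OpenCL example):
#   run_on_backends ./vecadd 128 64
```

The RTL and cycle-level runs take noticeably longer than the ISA run, so expect the loop to spend most of its time in those two backends.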

In [None]:
# Now let's try vector add testcase on these simulation backends
cd pocl/build/examples/vecadd
VENTUS_BACKEND=isa   ./vecadd 128 64 &> isa.log    && echo isa sim ok
VENTUS_BACKEND=cycle ./vecadd 128 64 &> cycle.log  && echo cycle sim ok
VENTUS_BACKEND=rtl   ./vecadd 128 64 &> rtl.log    && echo rtl sim ok
cd - > /dev/null

In [None]:
# What's in the log?
cd pocl/build/examples/vecadd
echo -e "\nISA SIM OUTPUT:" && egrep "arg gpgpu"       isa.log
echo -e "\nISA SIM LOG:"    && egrep "endprg|finished" vecadd_0.log
echo -e "\nCYCLE SIM LOG:"  && egrep "endprg|finished" cycle.log
echo -e "\nRTL SIM LOG:"    && egrep "endprg|finished" rtl.log
cd - > /dev/null

All three simulators can emit per-instruction logs. See the simulators’ READMEs for details.

From the logs above, we know that this test has 2 workgroups, each containing 2 wavefronts.  
In the cycle and RTL simulator outputs, you can locate the exit points (`endprg`) and corresponding timestamps for these 4 wavefronts.  
* The cycle simulator finishes at 25,375 ns; with 10 ns per cycle, that is about 2,537 cycles.
* The RTL simulation finishes at time 7005; with 10 time units per cycle, that is about 700 cycles.
* The main difference comes from the cycle simulator integrating a Ramulator-based DDR timing model, which increases the simulated runtime.
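The timestamp-to-cycle conversion above is a plain integer division by the clock period. A minimal sketch, assuming the 10-unit period stated above (`to_cycles` is a hypothetical helper name):

```shell
# Sketch: convert a simulator timestamp to whole cycles.
# $1 = finish timestamp, $2 = time units per cycle (defaults to 10)
to_cycles() {
    echo $(( $1 / ${2:-10} ))
}

to_cycles 25375   # cycle simulator finish time in ns
to_cycles 7005    # RTL simulator finish time
```

This prints 2537 and 700; shell arithmetic truncates the half cycle left over in each division.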

In [None]:
# Turn off DDR timing in cycle-level simulation
cd pocl/build/examples/vecadd
VENTUS_BACKEND=cycle VENTUS_TIMING_DDR=0 ./vecadd 128 64 |& grep "vecadd finished"
cd - > /dev/null

After turning off DDR timing simulation, cyclesim reports a timing result much closer to that of rtlsim.  
(cyclesim: 25375 → 5775, versus about 7000 for rtlsim)

### Tools

Several environment variables control simulator functionality, for example:
* Setting `VENTUS_WAVEFORM=1` makes the rtlsim backend dump FST waveforms and the cyclesim backend dump VCD waveforms.
* Setting `VENTUS_DUMP_RESULT=filename.json` saves all data that OpenCL programs copy from the device to the host, along with the device addresses, into the specified JSON file for debugging.
* Setting `VENTUS_TIMING_DDR=0` stops the cyclesim backend from calculating DDR timing with Ramulator, which is enabled by default. The RTL backend does not currently support DDR timing.

Dumping waveforms significantly slows down simulation.  
The RTL simulation supports dumping waveforms only for a selected time window to reduce overhead.

In [None]:
# Dump waveform and result in RTL simulation
cd pocl/build/examples/matadd
VENTUS_BACKEND=rtl VENTUS_WAVEFORM=1 VENTUS_DUMP_RESULT=matadd.rtl.json ./matadd &> matadd.rtl.log
ls -alFh waveform.rtl.fst matadd.rtl.json
cat matadd.rtl.json
cd - > /dev/null

In [None]:
# Dump waveform and result in cycle-level simulation
cd pocl/build/examples/matadd
VENTUS_BACKEND=cycle VENTUS_WAVEFORM=1 VENTUS_DUMP_RESULT=matadd.cycle.json ./matadd &> matadd.cycle.log
ls -alFh waveform.cycle.vcd matadd.cycle.json
cd - > /dev/null

## Testcases and regression 

Ventus currently runs a subset of the gpu-rodinia benchmark suite correctly (in `rodinia/opencl/`).  
We have also written several typical OpenCL testcases under `testcases/`.  

We modified the Rodinia cases to compare Ventus outputs with CPU/NVIDIA outputs to form regression tests.  
`ventus-env` provides a regression script that can run on all three simulation backends.

Before running, we recommend tuning the following arguments of `regression-test.py` to match your machine’s performance:
* `-t TIMEOUT_SCALE`: Timeout scale (default: 1). Increase this to allow testcases to run longer. Timeouts are treated as failures.
* `-j JOBS`: Number of parallel processes (default: auto). Note that in RTL simulation, each process uses 8 threads by default.
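One way to choose `-j` for RTL runs is to keep the number of processes times the threads per process within the available cores. This is only a heuristic sketch; `suggest_jobs` is a hypothetical helper, and the script's own `auto` default may use a different rule.

```shell
# Heuristic (an assumption, not part of regression-test.py):
# choose the job count so that jobs * threads_per_job <= CPU cores.
suggest_jobs() {
    # $1 = CPU cores, $2 = threads per process (8 for RTL simulation by default)
    n=$(( $1 / $2 ))
    if [ "$n" -lt 1 ]; then
        n=1   # always run at least one job
    fi
    echo "$n"
}

suggest_jobs 32 8   # a 32-core machine running the RTL backend

# Possible usage:
#   VENTUS_BACKEND=rtl python3 regression-test.py -t 2 -j "$(suggest_jobs "$(nproc)" 8)"
```

On a 32-core machine this suggests 4 parallel jobs.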

In [None]:
VENTUS_BACKEND=isa python3 regression-test.py -t 1

# These are slow
# VENTUS_BACKEND=cycle python3 regression-test.py
# VENTUS_BACKEND=rtl   python3 regression-test.py

In [None]:
# regression test logs are saved
ls regression-test-logs/

To run a testcase manually:
1. Go into the testcase's directory.
2. Run `make`.
3. Run the executable, or use `./run` if it exists.

A `run` script is provided for executables that need arguments. These steps apply to both rodinia and ventus-opencl-testcase.
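The choice in the last step can be scripted. The sketch below prefers the `run` script when the testcase provides one; `launcher_for` and its arguments are hypothetical, and the executable name has to be supplied by the caller because it differs per testcase.

```shell
# Sketch: print the command that launches a built testcase.
# Prefers ./run when the directory has a run script,
# otherwise falls back to the executable itself.
launcher_for() {
    # $1 = testcase directory, $2 = executable name
    if [ -e "$1/run" ]; then
        echo "./run"
    else
        echo "./$2"
    fi
}

# Possible usage, from inside a testcase directory after `make`:
#   $(launcher_for . conv.out)
```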

In [None]:
# Let's have a look at the MNIST_conv testcase (3-layer CNN digit recognition)

# cd testcases/_get_case/MNIST_conv         # This is slow
cd testcases/_get_case/MNIST_conv_tiny      # Smaller mnist testcase for quick demo
make &> /dev/null
VENTUS_BACKEND=rtl ./conv.out |& sed -n '/CONV3/,$p' | head -n 14
cd - > /dev/null

## OpenCL Conformance Test Suite (CTS)

The Ventus software stack passes most tests in the OpenCL CTS regression suite.

The OpenCL CTS is large and time-consuming under simulation; here we demonstrate running a single test on Spike.

In [None]:
cd OpenCL-CTS/build/test_conformance/
# ./basic/test_basic --help                   # check which tests are available
./basic/test_basic intmath_int4 |& tail -n 4  # run a specific test
cd - > /dev/null

# Optional part
## RTL specification change
The hardware configuration can be modified in `gpgpu/ventus/src/top/parameters.scala`.   
For example:
```diff
--- a/ventus/src/top/parameters.scala
+++ b/ventus/src/top/parameters.scala
@@ -4,9 +4,9 @@ import L2cache.{CacheParameters, InclusiveCacheMicroParameters, InclusiveCachePa
 import chisel3.util._
 
 object parameters { //notice log2Ceil(4) returns 2.that is ,n is the total num, not the last idx.
-  def num_sm = 2
-  var num_warp = 8
-  var num_thread = 32
+  def num_sm = 1
+  var num_warp = 4
+  var num_thread = 16
   val SINGLE_INST: Boolean = false
   val SPIKE_OUTPUT: Boolean = true
   val INST_CNT: Boolean = true
```
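To get a feel for what this change means, the sketch below computes the number of hardware threads each configuration provides. It assumes `num_warp` counts warps per SM and `num_thread` counts threads per warp; `hw_threads` is a hypothetical helper name.

```shell
# Sketch: total hardware threads for a given configuration.
# Assumption: num_warp is per SM and num_thread is per warp.
hw_threads() {
    # $1 = num_sm, $2 = num_warp, $3 = num_thread
    echo $(( $1 * $2 * $3 ))
}

hw_threads 2 8 32   # original configuration
hw_threads 1 4 16   # modified configuration
```

This prints 512 and 64, so under these assumptions the modified configuration has one eighth of the thread capacity. The RTL simulator presumably has to be rebuilt for the change to take effect, for example with `bash build-ventus.sh --build gpgpu`.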