2-Wide Out-of-Order Superscalar RISC-V (RV32I) CPU
A fully functional dual-issue, out-of-order execution RISC-V CPU built in
Verilog, developed incrementally for teaching and learning purposes.
4,754 lines of synthesizable RTL across 18 modules.
Architecture Overview
| Feature | Implementation |
|---|---|
| ISA | RV32I (40 instructions) |
| Pipeline | 8-stage: IF → ID1 → ID2 → EX1 → EX2 → AGU → MEM → WB |
make # compile
make run # run default test (test01)
make wave # open waveform in GTKWave
make robust # wrong-path store regression
make clean # clean build artifacts
Run a Specific Program
make run PROG=tb/programs/bench/fib_20.hex
make run PROG=tb/programs/generated/rv32_rand_042.hex
Test Infrastructure
Random Regression (500 RV32I tests)
# Generate 500 random tests
python3 tools/gen_rv32_tests.py --count 500 --out-dir tb/programs/generated
# Run all
python3 tools/run_generated_tests.py --dir tb/programs/generated -j 1
Pass criteria: the testbench detects x31 = 0xCAFEBABE, the run does not time out, and the
register state matches the ISA simulator built into tools/gen_rv32_tests.py.
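
The three pass criteria can be combined into a single check. The sketch below is illustrative only: `RunResult`, `check_pass`, and the field names are hypothetical and are not taken from tools/run_generated_tests.py. It assumes the runner can read back the 32 architectural registers and a timeout flag.

```python
from dataclasses import dataclass
from typing import List

PASS_MAGIC = 0xCAFEBABE  # value the README says must be detected in x31


@dataclass
class RunResult:
    """Hypothetical summary of one simulation run (not the real runner's type)."""
    regs: List[int]   # x0..x31 as read back from the simulated CPU
    timed_out: bool   # True if the simulation hit its time limit


def check_pass(result: RunResult, expected_regs: List[int]) -> List[str]:
    """Return the list of failure reasons; an empty list means PASS."""
    failures = []
    if result.timed_out:
        failures.append("timeout")
    if result.regs[31] != PASS_MAGIC:
        failures.append(f"x31 = {result.regs[31]:#010x}, expected {PASS_MAGIC:#010x}")
    # Compare every register against the reference (ISA simulator) state.
    for i, (got, want) in enumerate(zip(result.regs, expected_regs)):
        if got != want:
            failures.append(f"x{i}: got {got:#010x}, want {want:#010x}")
    return failures
```

Returning the reasons rather than a bare boolean makes a batch log easier to triage, since a timeout and a register mismatch usually point at different bugs.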
Performance Benchmarks
# Generate and run all 9 benchmarks
python3 tools/run_benchmarks.py --regen
# Run a single kernel with extended timeout
make run PROG=tb/programs/bench/matmul_4x4.hex TIMEOUT_NS=8000000
# Filter benchmarks by name
python3 tools/run_benchmarks.py --filter "matmul_*"
Available Benchmarks
| Kernel | Description | Primary Stress |
|---|---|---|
| fib_20 | Iterative Fibonacci(20) | Short RAW chain, EX→EX forwarding |
| sum_1_to_100 | Sum 1..100 | Predictable backward branch |
| memcpy_64w | Copy 64 words | LW/SW pairs, D-Cache bandwidth |
| popcount_64 | Popcount over 1..64 | Nested loop, data-dependent termination |
| bsort_16 | Bubble sort 16 ints | Adjacent LW/SW + data-dependent branches |
| crc32_64b | CRC32 over 64 bytes | Bit-level loop, hard-to-predict branches |
| bsearch_64 | Binary search in 64-element array | Data-dependent branches |
| dotprod_32 | Dot product (len 32, software multiply) | Function call pattern (JAL/JALR) |
| matmul_4x4 | 4×4 integer matmul (software multiply) | Triple loop + strided memory access |
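
RV32I has no multiply instruction (MUL belongs to the M extension), so the dotprod_32 and matmul_4x4 kernels must compute products in software. The README does not say which algorithm they use. The sketch below shows one common choice, shift-and-add, as a Python reference model; `soft_mul32` and `dot_product` are hypothetical names used only for illustration.

```python
MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits, like an RV32 register


def soft_mul32(a: int, b: int) -> int:
    """Shift-and-add multiply: returns the low 32 bits of a * b."""
    a &= MASK32
    b &= MASK32
    product = 0
    while b:
        if b & 1:                        # add the shifted multiplicand when the bit is set
            product = (product + a) & MASK32
        a = (a << 1) & MASK32            # corresponds to a left shift (SLLI) in RV32I
        b >>= 1                          # corresponds to a logical right shift (SRLI)
    return product


def dot_product(xs, ys) -> int:
    """32-bit wrapping dot product built on the software multiply."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = (acc + soft_mul32(x, y)) & MASK32
    return acc
```

Because only the low 32 bits are kept, the same routine gives the correct two's-complement result for signed operands. The loop runs once per significant bit of the multiplier, so its trip count depends on the data.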
Adding a New Benchmark
Edit tools/gen_benchmarks.py:
1. Write the kernel using the Asm helper (tools/asm.py) with label support.
2. End with append_epilogue(a) for the testbench PASS detector.
3. Add to the BENCHMARKS dict.
4. Run python3 tools/run_benchmarks.py --regen.
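
To illustrate what "label support" involves, here is a minimal two-pass assembler sketch for three RV32I instructions. It is not the API of tools/asm.py, whose interface is not shown in this README: the class `MiniAsm`, its methods, and `build_sum_loop` are all hypothetical. Only the instruction encodings follow the RV32I base formats.

```python
def enc_r(funct7, rs2, rs1, funct3, rd, opcode):
    """R-type encoding."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def enc_i(imm, rs1, funct3, rd, opcode):
    """I-type encoding with a 12-bit signed immediate."""
    if not -2048 <= imm <= 2047:
        raise ValueError(f"I-type immediate out of range: {imm}")
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def enc_b(offset, rs2, rs1, funct3, opcode):
    """B-type encoding; the 13-bit signed offset is scattered across the word."""
    if offset % 2 or not -4096 <= offset <= 4094:
        raise ValueError(f"branch offset out of range or odd: {offset}")
    imm = offset & 0x1FFF
    return ((((imm >> 12) & 1) << 31) | (((imm >> 5) & 0x3F) << 25) | (rs2 << 20)
            | (rs1 << 15) | (funct3 << 12) | (((imm >> 1) & 0xF) << 8)
            | (((imm >> 11) & 1) << 7) | opcode)


class MiniAsm:
    """Hypothetical label-aware assembler (NOT the interface of tools/asm.py)."""

    def __init__(self):
        self.items = []    # (mnemonic, operands) in program order
        self.labels = {}   # label name -> byte address

    def label(self, name):
        self.labels[name] = 4 * len(self.items)  # address of the next instruction

    def addi(self, rd, rs1, imm):
        self.items.append(("addi", (rd, rs1, imm)))

    def add(self, rd, rs1, rs2):
        self.items.append(("add", (rd, rs1, rs2)))

    def bne(self, rs1, rs2, target):
        self.items.append(("bne", (rs1, rs2, target)))

    def assemble(self):
        """Second pass: every label is known now, so branch offsets can be resolved."""
        words = []
        for index, (op, args) in enumerate(self.items):
            pc = 4 * index
            if op == "addi":
                rd, rs1, imm = args
                words.append(enc_i(imm, rs1, 0b000, rd, 0b0010011))
            elif op == "add":
                rd, rs1, rs2 = args
                words.append(enc_r(0, rs2, rs1, 0b000, rd, 0b0110011))
            elif op == "bne":
                rs1, rs2, target = args
                words.append(enc_b(self.labels[target] - pc, rs2, rs1, 0b001, 0b1100011))
        return words


def build_sum_loop():
    """Sum 100 down to 1 into x1, using a backward branch to a label."""
    a = MiniAsm()
    a.addi(1, 0, 0)       # x1 = 0   (running sum)
    a.addi(2, 0, 100)     # x2 = 100 (loop counter)
    a.label("loop")
    a.add(1, 1, 2)        # x1 += x2
    a.addi(2, 2, -1)      # x2 -= 1
    a.bne(2, 0, "loop")   # repeat while x2 != 0
    return a.assemble()
```

Deferring offset resolution to `assemble()` is what lets a branch refer to a label defined later in the program, since the label table is complete by the time any offset is computed.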
Directory Layout
rtl/
  core/                    Pipeline and OoO control RTL (cpu_top, RS, ROB, rename, BPU, PRF, ...)
  mem/                     Cache modules (I-Cache, D-Cache)
tb/
  tb_cpu.v                 Testbench with observability counters
  programs/
    bench/                 Performance benchmark hex files
    generated/             Random regression hex files
    micro_hazard/          Targeted hazard test cases
    micro_slot1/           Slot1 pairing test cases
sim/                       Simulation artifacts (gitignored)
docs/                      Architecture planning documents
tools/
  asm.py                   Label-based RV32I assembler
  gen_benchmarks.py        Benchmark hex generator
  gen_rv32_tests.py        Random test generator + ISA simulator
  run_benchmarks.py        Benchmark runner with CPI/branch/cache reporting
  run_generated_tests.py   Batch regression runner