`hapenny`: a half-width RISC-V

hapenny is a 32-bit RISC-V CPU implementation that operates internally on 16-bit chunks. This means it takes longer to do things, but uses less space.

This approach was inspired by the MC68000 (1979), which also implemented a 32-bit instruction set using a 16-bit datapath. (hapenny uses about half as many cycles per instruction as the MC68000, after optimization.)

hapenny was written to evaluate the Amaranth HDL.

(The current hapenny was formerly version 2; once it became mature enough I removed version 1.)

Bullet points

- Over 12M inst/sec on iCE40 HX1K, while occupying under 800 LCs, or less than 63% of the chip. (Throughput compares favorably to some 32-bit implementations occupying twice the area.)
- Native 16-bit bus allows for simpler peripherals and external RAMs. (Can run out of external 16-bit SRAM with no penalty.)
- Parameterized with knobs for trading off size vs capability.
- Implements the RV32I unprivileged instruction set (currently missing FENCE and SYSTEM).
- Optional interrupt support in the older core. (yet to come in the revised one)
- Written in Python using Amaranth.

But why

There are a bazillion open-source RISC-V CPU implementations out there, which is what happens when you release a well-designed and free-to-implement instruction set spec -- nerds like me will crank out implementations.

I wrote hapenny as an experiment to see if I could target the space between the PicoRV32 core and the SERV core, in terms of size and performance. I specifically wanted to produce a CPU with decent performance that could fit into an iCE40 HX1K part (like on the Icestick evaluation board) with enough space left over for useful logic. PicoRV32 doesn't quite fit on that chip; SERV fits but takes 32-64 cycles per instruction.

| Property | PicoRV32-small | `hapenny` | SERV |
| --- | --- | --- | --- |
| Datapath width (bits) | 32 | 16 | 1 |
| External data bus width | 32 | 16 | 32 |
| Average cycles per instruction | 5.429 | 5.525 | 40-ish |
| Minimal size on iCE40 (LCs) | 1500-ish | 796 | 200-ish |
| Typical MHz on iCE40 | 40s? | 72+ | 40s? |

(Cycles/instruction is measured on Dhrystone. Minimal size is the output produced by the icestick-smallest.py script. I would appreciate help getting apples-to-apples comparison numbers!)

So, basically,

- hapenny is significantly smaller than a similarly-configured PicoRV32 core for only 1.7% less performance per clock. (Of course, PicoRV32 is a far more general and well-tested processor, and in practice you'd configure it with performance-enhancing features like a dual-port register file and faster shifts.)
- hapenny is much faster than SERV, but also about 4x larger. (SERV is also better tested than hapenny.)
- hapenny is easy to interface to 16-bit peripherals and external memory with no (additional) performance loss. This can result in smaller overall designs and simpler boards. For instance, hapenny can run at full rate out of the 16-bit SRAM on the Icoboard.
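
As a quick sanity check on the numbers above, here is a small sketch that turns cycles-per-instruction into relative per-clock performance and raw throughput. The function names are made up for this example; the inputs are the CPI figures from the table and the "72+" MHz clock listed there.

```python
def relative_perf_per_clock(cpi_a, cpi_b):
    """Performance of core A relative to core B at the same clock (1.0 = equal)."""
    return cpi_b / cpi_a


def instructions_per_second(clock_hz, cpi):
    """Average instruction throughput for a given clock and CPI."""
    return clock_hz / cpi


HAPENNY_CPI = 5.525          # from the comparison table (Dhrystone)
PICORV32_SMALL_CPI = 5.429   # from the comparison table (Dhrystone)

rel = relative_perf_per_clock(HAPENNY_CPI, PICORV32_SMALL_CPI)
mips = instructions_per_second(72e6, HAPENNY_CPI) / 1e6

print(f"hapenny per-clock performance vs PicoRV32-small: {rel:.3f}")
print(f"hapenny at 72 MHz: {mips:.1f} M inst/sec")
```

The ratio comes out a little over 0.98, which is where the "1.7% less performance per clock" figure comes from, and 72 MHz at 5.525 cycles per instruction works out to roughly 13 million instructions per second.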

Independent from the datapath width, I also did some fairly aggressive manual register retiming in the decoder and datapath, which means hapenny can often close timing at higher Fmax than other simple RV32 cores. (I miss automatic retiming from ASIC toolchains.)

Details

hapenny executes (most of) the RV32I instruction set in 16-bit pieces. It uses 16-bit memory, a 16-bit (single-ported) register file, and a 16-bit ALU. To perform 32-bit operations, it uses the same techniques a programmer might use in software on a 16-bit computer, e.g. "chaining" operations using preserved carry/zero bits.
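
To make the "chaining" idea concrete, here is a minimal software sketch of a 32-bit add and a 32-bit zero test built only from 16-bit operations. It illustrates the general technique, not hapenny's actual datapath, and the function names are invented for this example.

```python
MASK16 = 0xFFFF


def add16(a, b, carry_in=0):
    """One pass through a 16-bit adder: returns (sum, carry_out)."""
    total = (a & MASK16) + (b & MASK16) + carry_in
    return total & MASK16, total >> 16


def add32_in_halves(x, y):
    """32-bit add built from two chained 16-bit adds."""
    lo, carry = add16(x, y)                   # low halves first, remember the carry
    hi, _ = add16(x >> 16, y >> 16, carry)    # high halves plus the preserved carry
    return (hi << 16) | lo


def is_zero32_in_halves(x):
    """32-bit zero test: combine the zero flags of the two halves."""
    zero_lo = (x & MASK16) == 0
    zero_hi = ((x >> 16) & MASK16) == 0
    return zero_lo and zero_hi


print(hex(add32_in_halves(0x0000FFFF, 0x00000001)))
```

The carry out of the low half is the only state that has to survive between the two passes, which is why a 16-bit ALU plus a saved carry bit is enough for 32-bit arithmetic.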

All memory interfaces in hapenny are synchronous, including the register file, which is another reason why operations take more cycles. The RV32I register file is comparatively large (at 1024 bits), and using a synchronous register file ensures that it can be mapped into an FPGA block RAM if desired.
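
The sketch below is a toy model of what "synchronous 16-bit register file" implies: 32 registers of 32 bits stored as 64 halfwords, with read data only available after a clock edge. The class, the half-adjacent layout, the port arrangement, and the handling of x0 are all assumptions made for illustration; hapenny's real register file may differ.

```python
class SyncRegFile16:
    """Toy model: 32 x 32-bit registers stored as 64 x 16-bit halfwords."""

    DEPTH = 64   # 32 registers x 2 halves
    WIDTH = 16   # bits per entry; DEPTH * WIDTH == 1024

    def __init__(self):
        self.mem = [0] * self.DEPTH
        self.read_data = 0  # registered output, valid the cycle AFTER a read

    @staticmethod
    def index(reg, high):
        # Assumed layout: the two halves of a register sit in adjacent entries.
        return (reg << 1) | int(high)

    def tick(self, read_index=None, write_index=None, write_data=0):
        """One clock edge. A read issued now shows up in read_data afterwards."""
        if read_index is not None:
            self.read_data = self.mem[read_index]  # sees pre-write contents
        if write_index is not None and (write_index >> 1) != 0:
            # Assumption: writes to x0 are dropped so it always reads as zero.
            self.mem[write_index] = write_data & 0xFFFF


def read_reg32(rf, reg):
    """Reading a full 32-bit register costs two read cycles, one per half."""
    rf.tick(read_index=rf.index(reg, high=False))
    lo = rf.read_data
    rf.tick(read_index=rf.index(reg, high=True))
    hi = rf.read_data
    return (hi << 16) | lo
```

Because every read needs its own clock edge, fetching both halves of rs1 and rs2 costs four read cycles in this model, which lines up with the cost of synchronous memories described above.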

Here's what the CPU does during the timing of a typical instruction like ADD. I've color/brightness-coded three different executions that are in flight during this diagram.

- The "FD-Box" is responsible for fetch and decode, and is always working on the next instruction. It requires three cycles to fetch both halfwords of an instruction, and then uses the DECODE cycle to do initial instruction decoding and start the read of rs1's low half. (It spends one cycle out of four essentially idle to make the state machines line up conveniently.)
- The "EW-Box" is responsible for execute and writeback. It goes through at least four states in every instruction:
  - R2L starts the load of the low half of rs2 from the register file.
  - OPL operates on the low halves of rs1 and rs2 (or rs1 and an immediate), and also starts the load of the high half of rs1.
  - R2H and OPH do the same thing for the high half.

Most instructions take four cycles, as shown in that diagram. Some take more if they need to do additional things (by adding states), or if they change control flow such that the FD-Box's speculative fetch was wrong. The CPU test bench (sim-cpu.py) measures the cycle timing for every instruction; here's where things currently stand:

| Instruction | Cycles | Notes |
| --- | --- | --- |
| AUIPC | 4 | |
| LUI | 4 | |
| JAL | 8 | Includes four-cycle re-fetch penalty |
| JALR | 8 | Includes four-cycle re-fetch penalty |
| Branch | 5/10 | Not Taken / Taken |
| Load | 6 | |
| SW | 5 | |
| SB/SH | 4 | |
| SLT(I)(U) | 6 | |
| Shift | 6 + N | N is number of bits shifted |
| Other ALU op | 4 | |

On the instruction mix in Dhrystone, this yields an average of 5.525 cycles/instruction.
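
The timing table can be read as a simple cost model. The sketch below encodes it as a lookup and averages it over an instruction trace. The category names and the five-instruction trace are hypothetical and chosen only to exercise each rule; they are not the Dhrystone mix, so the result is not expected to match 5.525.

```python
# Cycle counts copied from the timing table above.
BASE_CYCLES = {
    "AUIPC": 4, "LUI": 4, "JAL": 8, "JALR": 8,
    "LOAD": 6, "SW": 5, "SB": 4, "SH": 4,
    "SLT": 6,   # covers the SLT(I)(U) row
    "ALU": 4,   # "Other ALU op"
}


def cycles(kind, taken=False, shift_amount=0):
    """Cycles for one instruction, following the table's rules."""
    if kind == "BRANCH":
        return 10 if taken else 5
    if kind == "SHIFT":
        return 6 + shift_amount
    return BASE_CYCLES[kind]


def average_cpi(trace):
    """Mean cycles per instruction over a list of instruction descriptions."""
    return sum(cycles(**step) for step in trace) / len(trace)


# Hypothetical trace, for illustration only.
trace = [
    {"kind": "LOAD"},
    {"kind": "ALU"},
    {"kind": "SHIFT", "shift_amount": 2},
    {"kind": "SW"},
    {"kind": "BRANCH", "taken": True},
]
print(average_cpi(trace))
```

Feeding a real instruction histogram into `average_cpi` would be one way to estimate how a given workload behaves before running it through the simulator.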

Interfaces

hapenny uses a very simple bus interface with up to 32-bit addressing. In practice, applications will wire up fewer than 32 address lines, which saves space.

| Signal | Driver | Width | Description |
| --- | --- | --- | --- |
| `addr` | CPU | up to 31 | addresses a halfword, i.e. the byte address with its LSB dropped |
| `data_out` | CPU | 16 | carries data for a write |
| `lanes` | CPU | 2 | signals a write of either or both bytes in a halfword; zero means a load |
| `valid` | CPU | 1 | when high, indicates that the signals above are valid and starts a bus transaction |
| `response` | device | 16 | on the cycle after a load, carries back data from the addressed device |
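
To show how little a device needs to do on this bus, here is a behavioral sketch of a 16-bit RAM that follows the signals above. It is plain Python rather than Amaranth, and it makes two assumptions the table does not state: that bit 0 of `lanes` selects the low byte and bit 1 the high byte, and that `response` is zero when no load was issued.

```python
class HalfwordRam:
    """Toy model of a 16-bit RAM on the bus described above."""

    def __init__(self, halfwords):
        self.mem = [0] * halfwords
        self.response = 0  # what the device drives on the cycle after a load

    def tick(self, addr=0, data_out=0, lanes=0b00, valid=False):
        """Advance one clock cycle with the given CPU-driven signals."""
        next_response = 0  # assumption: idle response value is zero
        if valid:
            if lanes == 0:
                # Load: data comes back on the following cycle.
                next_response = self.mem[addr]
            else:
                word = self.mem[addr]
                if lanes & 0b01:  # assumption: bit 0 writes the low byte
                    word = (word & 0xFF00) | (data_out & 0x00FF)
                if lanes & 0b10:  # assumption: bit 1 writes the high byte
                    word = (word & 0x00FF) | (data_out & 0xFF00)
                self.mem[addr] = word
        self.response = next_response
        return self.response


ram = HalfwordRam(16)
ram.tick(addr=3, data_out=0xBEEF, lanes=0b11, valid=True)  # full halfword store
ram.tick(addr=3, data_out=0x0012, lanes=0b01, valid=True)  # byte store, low lane
print(hex(ram.tick(addr=3, valid=True)))                   # load
```

Note that `addr` here is already a halfword index; a CPU byte address would be shifted right by one before reaching the device.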

The PC can be shrunk separately from the address bus if you know that all program memory appears in e.g. the bottom half of the address space. This further saves space.

The bus interface does not support wait states, to reduce complexity. This makes it difficult to interface to things like XIP SPI Flash or SDRAM. hapenny is really intended for applications that don't rely on such things.

hapenny exposes a fairly flexible debug interface capable of inspecting processor state and reading and writing the register file. These features are only available when the processor is halted, which can be achieved by holding `halt_request` high until the processor confirms (at the next instruction boundary) by asserting `halted`. Release `halt_request` to resume.
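
The halt handshake described above can be sketched as a tiny state model. This is an illustration of the protocol as stated, not the core's implementation; the class name, the `at_instruction_boundary` input, and the exact cycle on which the core resumes are assumptions.

```python
class HaltModel:
    """Toy model of the halt_request / halted handshake."""

    def __init__(self):
        self.halted = False

    def tick(self, halt_request, at_instruction_boundary):
        """Advance one cycle and return the value of `halted` afterwards."""
        if self.halted:
            # Assumption: dropping halt_request resumes on the next cycle.
            self.halted = bool(halt_request)
        elif halt_request and at_instruction_boundary:
            # The request is only honored between instructions.
            self.halted = True
        return self.halted


cpu = HaltModel()
history = [
    cpu.tick(halt_request=True, at_instruction_boundary=False),  # mid-instruction
    cpu.tick(halt_request=True, at_instruction_boundary=True),   # boundary: halts
    cpu.tick(halt_request=True, at_instruction_boundary=False),  # stays halted
    cpu.tick(halt_request=False, at_instruction_boundary=False), # resumes
]
print(history)
```

A debugger driving the real interface would hold `halt_request` high and poll `halted` before touching the register file, since the request may take a few cycles to be acknowledged.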

Finally, hapenny has an RVFI (RISC-V Formal Interface) trace port for generating a trace of instruction effects, though I haven't wired up the actual test suite.

Interrupt options

Currently, hapenny does not support interrupts, but I'm planning on changing this. (An earlier version did; support was removed when I rearchitected the core for v2.)

Drawbacks

- Written by someone who pretends to be an electrical engineer as a way to procrastinate finishing his slides for a talk.
- Used for exactly one thing so far, so not exactly battle-hardened.
- Less general than more mature implementations like PicoRV32 -- e.g. no support for wait states, hardware multiply, coprocessors, or (currently) interrupts.
- 16-bit external data bus means that, currently, 32-bit reads/writes are not atomic -- a problem when interfacing with peripherals with 32-bit memory-mapped registers. (Peripherals with 16-bit memory-mapped registers work fine, however.)
- Not exactly well factored/commented.
- Written in Python, so chances are pretty good the code won't keep working across OS updates / minor runtime versions.
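
The non-atomic 32-bit access drawback listed above is easiest to see with a worked example. The sketch below models a hypothetical peripheral, a free-running 32-bit counter, read as two 16-bit halves with the counter advancing in between. The peripheral and the low-half-first read order are assumptions made for illustration.

```python
class Counter32:
    """Hypothetical peripheral: a 32-bit counter exposed as two 16-bit halves."""

    def __init__(self, value):
        self.value = value & 0xFFFFFFFF

    def read_half(self, high):
        half = (self.value >> 16) if high else self.value
        return half & 0xFFFF

    def tick(self):
        self.value = (self.value + 1) & 0xFFFFFFFF


def read32_split(dev, ticks_between=1):
    """Read a 32-bit register as two 16-bit accesses, low half first (assumed)."""
    lo = dev.read_half(high=False)
    for _ in range(ticks_between):
        dev.tick()  # the peripheral keeps running between the two accesses
    hi = dev.read_half(high=True)
    return (hi << 16) | lo


dev = Counter32(0x0000FFFF)
torn = read32_split(dev)
print(hex(torn))
```

Starting from 0x0000FFFF, the low half is read as 0xFFFF, the counter rolls over to 0x00010000, and the high half is then read as 0x0001. The combined result, 0x0001FFFF, is a value the counter never held. Software can work around this by re-reading until two consecutive reads agree, or by using peripherals with 16-bit registers.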

What's with the name

hapenny is implemented using about half the logic of other cheap RV32 cores.

The half-penny, or "ha'penny," is a historical English coin worth (as the name implies) half a penny. So if the other cheap cores cost a penny, this is a ha'penny.
