
RISC V OLD

Julian Kemmerer edited this page Oct 28, 2022 · 1 revision

OLD ARTICLE

This page previously was a partially complete example of 'a CPU copied from C++ code' that has since been revived in the form of a custom PipelineC implementation instead. Below is the old article:


This page describes the process of taking Alex Lao's LittleRiscy RISC-V emulator (#TODO git link?), written in C++, and converting it into a fully synthesizable PipelineC RISC-V design. This example was synthesized with #pragma PART "xc7a35ticsg324-1l", a Xilinx Artix 7 35T.

Original C++ Emulator and RISCV Core

I can't thank Alex Lao enough for sharing the original LittleRiscy C++ project. Please visit their blog over at Voltage|Divide for a bunch of hardware and software excellence.

Conversion to PipelineC

The emulator C++ was ever so nicely organized into classes mimicking hardware modules. So the first step was flattening all those classes into a single C file of just functions, structs, and enums, without namespaces. switch statements needed to be converted to if-else chains since break; and case fall through are not implemented yet. cout output needed to be switched to printf calls. Instead of being loaded from a file, instructions are hard coded into array init values (with some helper scripts). And finally the data and instruction memory sizes were shrunk down from 32768 to a more manageable 128 elements.
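As a rough illustration of the syntax-level changes described above, here is a minimal sketch of one flattened function. The enum, function name, and operations are hypothetical and are not taken from LittleRiscy; the point is only the switch to if-else and cout to printf rewrites.

```c
#include <stdint.h>
#include <stdio.h>

// Hypothetical opcode enum; names are illustrative only.
typedef enum { OP_ADD, OP_SUB, OP_AND, OP_OR } alu_op_t;

// C++ style before conversion (for comparison):
//   switch (op) { case OP_ADD: result = a + b; break; /* ... */ }
//   std::cout << "alu result " << result << std::endl;
uint32_t alu(alu_op_t op, uint32_t a, uint32_t b)
{
  uint32_t result = 0;
  // The switch becomes an if-else chain, so no break/fall through is needed.
  if (op == OP_ADD) {
    result = a + b;
  } else if (op == OP_SUB) {
    result = a - b;
  } else if (op == OP_AND) {
    result = a & b;
  } else if (op == OP_OR) {
    result = a | b;
  }
  // cout replaced by printf.
  printf("alu result %u\n", (unsigned)result);
  return result;
}
```

Because each branch only assigns result, the chain describes a mux over the candidate values, which is presumably why the if-else form maps easily onto hardware.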

The next phase was converting the logical execution (as opposed to just the syntax) of the original C++ into something synthesizable. The structure of this emulator's main function was already an instruction-by-instruction while loop. So that while loop was removed and the sequence of CPU stages remained as the PipelineC main function, executing one instruction per cycle. Some main functionality was wrapped in helper functions so as to be more easily identified in schematic/synthesis.
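The loop removal can be sketched in plain C as follows. This is an assumption-laden toy, not the actual design: the struct, the helper names, and the four placeholder instruction words are all hypothetical, and the only "execution" done is advancing the program counter. One call to riscv_main stands in for one clock cycle.

```c
#include <stdint.h>

#define IMEM_SIZE 4

// Placeholder instruction words, hard coded as array init values.
static const uint32_t imem[IMEM_SIZE] = {
  0x00000013u, 0x00100093u, 0x00200113u, 0x002081B3u
};

typedef struct {
  uint32_t program_counter;
} cpu_state_t;

// Hypothetical stage helpers, pass by value in and out.
static uint32_t fetch(cpu_state_t state)
{
  return imem[(state.program_counter >> 2) % IMEM_SIZE];
}

static cpu_state_t advance(cpu_state_t state)
{
  state.program_counter = state.program_counter + 4;
  return state;
}

// Emulator form (for comparison):
//   while (running) { instruction = fetch(state); state = advance(state); }
// Hardware-style form: no loop, each call is one cycle / one instruction.
uint32_t riscv_main(void)
{
  // Static local keeps its value between calls, like a register between clocks.
  static cpu_state_t state;
  uint32_t instruction = fetch(state);
  state = advance(state);
  return instruction;
}
```

Calling riscv_main repeatedly from a test harness plays the role the while loop used to play in the emulator.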

Functions (modules in PipelineC) need to be converted to pass by value (input wires in PipelineC). Any values modified by reference were converted to pass-by-value inputs and included in the return output. Luckily this emulator was again nicely structured, with a state variable being modified by individual stage functions in just this way.
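A minimal sketch of that reference-to-value conversion, assuming a hypothetical cpu_state_t and write_back stage (neither name comes from the original emulator):

```c
#include <stdint.h>

typedef struct {
  uint32_t program_counter;
  uint32_t regs[32];
} cpu_state_t;

// C++ style before conversion (for comparison):
//   void write_back(cpu_state_t &state, uint8_t rd, uint32_t value);
// After conversion: state goes in by value and the modified copy is returned.
cpu_state_t write_back(cpu_state_t state, uint8_t rd, uint32_t value)
{
  uint8_t index = rd & 31; // keep the index inside the 32-entry register file
  if (index != 0) {        // x0 is hard-wired to zero in RISC-V
    state.regs[index] = value;
  }
  state.program_counter = state.program_counter + 4;
  return state;
}
```

The caller then writes state = write_back(state, rd, value); which reads as wires into and out of a module rather than a side effect on shared storage.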

//TODO CODE

Single Cycle #1 (Simple, inefficient)

This initial version does a few inefficient things. The data memory and register file are implemented as arrays of registers/arrays of signals, lacking the proper structure for the synthesis tool to infer register files or block RAM. This results in a large number of registers for the data memory, with lots of muxing LUTs. This shows up in the utilization report in the execute_instruction (seems to be fetch, decode, execute all in one giant arithmetic and muxing thing) and execute_memory_operation (again more muxing, to and from the data memory registers) functions.
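To make the problem concrete, here is a guess at the general shape of such a memory, not the actual code: the struct and function names are hypothetical. The memory is just an array inside a state struct that is passed around by value and indexed with a variable address.

```c
#include <stdint.h>

#define DMEM_SIZE 128

// Data memory as a plain array inside a state struct.
// In hardware each of the 128 elements becomes its own 32-bit register.
typedef struct {
  uint32_t dmem[DMEM_SIZE];
} mem_state_t;

// Variable-index write: an address decoder plus a write mux per element.
mem_state_t store_word(mem_state_t mem, uint32_t addr, uint32_t data)
{
  mem.dmem[(addr >> 2) % DMEM_SIZE] = data;
  return mem;
}

// Variable-index read: a 128-to-1 mux of 32-bit values.
uint32_t load_word(mem_state_t mem, uint32_t addr)
{
  return mem.dmem[(addr >> 2) % DMEM_SIZE];
}
```

Nothing in this form tells the synthesis tool that only one address is touched per cycle, so it has little choice but to build the registers and muxes literally.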

1util

The critical path of this design is provided by the PipelineC tool (via Vivado) and shows a maximum operating frequency of ~54 MHz.

Current MHz: 54.87871803314674
Problem computation path:
START:  riscv/registers_r_reg[state_regs][program_counter][7] =>
 ~ 18.222 ns of logic ~
END: => riscv/registers_r_reg[state_regs][program_counter][0]
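The reported frequency is just the reciprocal of the path delay: 1000 / 18.222 ns is about 54.88 MHz. A tiny helper (the name is made up for this page) shows the conversion:

```c
// Convert a critical path delay in nanoseconds to a maximum clock in MHz.
// 1 / (delay_ns * 1e-9 s) Hz == 1000 / delay_ns MHz.
double max_clock_mhz(double path_delay_ns)
{
  return 1000.0 / path_delay_ns;
}
```

For example, max_clock_mhz(18.222) gives roughly 54.88, matching the report above.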

Vivado can provide more details on the path, e.g. that it passes through 22 levels of logic.

1time

The PipelineC tool produces a (currently crude) visual aid 'pipeline map' for identifying the critical path delay through your code. The clock cycle delay goes from top to bottom. The Y axis size is the delay relative to the entire path - good for seeing the order of operations and how much delay they have. The X axis size is a poor visual representation of roughly how much is going on in parallel. Now would be a good time to mention I am very open to code contributors if they wanted to spruce up some of this stuff with some quick Python jazz.

1pipemap

Now that we know this design is realizable in hardware, let's simulate it to make sure it's actually correct.

The PipelineC tool is able to start ModelSim with VHDL files compiled and ready to simulate from various points in the tool flow.

1model

1model2

This is where those printf debug statements come in handy. On the right is the original LittleRiscy C++ emulator. On the left is ModelSim clock by clock printf's from the generated PipelineC VHDL modules.

1emu

Interesting to notice that because of the inherent parallelism of what is being described the simulator produces an arbitrary looking order of print outs from various modules/functions in the code.

Single Cycle #2 (Simple, less inefficient)

The first optimization in this round of changes was to refactor the code such that register file reads of the left and right operand values were done only once in execute_instruction as opposed to multiple times in each execute sub-function/module. This was likely creating a-bunch-o-logic. The utilization shows about 3K fewer LUTs and 1K fewer muxes after this change. In theory this should be something a compiler could detect and optimize away. Anyone want to help with graph optimizations?
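A simplified sketch of the refactor, assuming hypothetical sub-function names and a two-operation ALU (the real execute_instruction is of course much larger):

```c
#include <stdint.h>

typedef struct {
  uint32_t x[32];
} regfile_t;

// Sub-functions now take operand values instead of reading the register file.
static uint32_t execute_add(uint32_t left, uint32_t right) { return left + right; }
static uint32_t execute_sub(uint32_t left, uint32_t right) { return left - right; }

uint32_t execute_instruction(regfile_t rf, uint8_t rs1, uint8_t rs2, uint8_t is_sub)
{
  // One pair of register file reads, shared by every sub-function.
  // Before the refactor each sub-function did its own rf.x[...] lookups,
  // i.e. its own copy of the read muxes.
  uint32_t left = rf.x[rs1 & 31];
  uint32_t right = rf.x[rs2 & 31];

  uint32_t result;
  if (is_sub) {
    result = execute_sub(left, right);
  } else {
    result = execute_add(left, right);
  }
  return result;
}
```

Hoisting the reads by hand is essentially common subexpression elimination, which is the kind of thing a graph optimization pass could do automatically.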

By opening the PipelineC generated Vivado files I was able to look at the synthesis results without flattening the design hierarchy. This gave the least optimized but most accurate relative utilization numbers for optimizing the PipelineC code.

2noflat

And... we have a problem with execute_memory_operation. This contains a 128x32b array of registers acting as the data memory, with a bunch of muxing logic to and from those registers doing the read and write enable logic. A compiler should be able to detect this and reorganize it into a synthesis-tool-compatible register file / RAM structure - anyone want to help with graph optimizations and stuff? The solution for now is to use PipelineC built-in pragmas and generated RAM primitive function calls that ensure proper inference by the synthesis tool. The utilization report shows ~75% fewer resources overall (and a ~58 MHz fmax, no big change).
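The access pattern that RAM inference wants can be modeled in plain C as a single function with one address, one write port, and one read result per call. This is only a behavioral model under assumptions: the function name is hypothetical and it is not PipelineC's generated RAM primitive, and the read-first behavior is a choice made for this sketch.

```c
#include <stdint.h>

#define DMEM_SIZE 128

// Behavioral model of a single-port, read-first RAM:
// one address per call, old contents returned, optional write afterwards.
uint32_t dmem_single_port(uint32_t addr, uint32_t write_data, uint8_t write_enable)
{
  static uint32_t ram[DMEM_SIZE];     // the only place the storage is touched
  uint32_t index = addr % DMEM_SIZE;
  uint32_t read_data = ram[index];    // read first: value before this cycle's write
  if (write_enable) {
    ram[index] = write_data;
  }
  return read_data;
}
```

Funneling every load and store through one call site like this is what makes the memory look like a RAM rather than 128 independent registers.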

TODO CODE

utilwram
