Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Asking about system bus protocol and simulation instructions. #59

Open
jeras opened this issue Mar 10, 2022 · 12 comments
Open

Asking about system bus protocol and simulation instructions. #59

jeras opened this issue Mar 10, 2022 · 12 comments

Comments

@jeras
Copy link

jeras commented Mar 10, 2022

What I am asking for is some details on how to use system bus backpressure signals, so I can integrate quark into my FPGA synthesis and Verilator simulation (ISA tests). I will get through anyway but I like asking in advance so I have a communication channel if I get stuck.

I am developing my own RISC-V CPU in SystemVerilog, and I would like to compare it against femtorv32_quark.v.
I used an unusual coding style for the instruction decoder which is causing logic consumption and timing issues. I am unable to predict which decoder change would provide improvements, what I think would be an optimization results in worse area and timing. I have not used one-hot decoding yet, since I forgot about it after many years working on designs where optimization was not a priority.
Comparing with quark which is codded with a more conventional style would help me understand and debug those issues.

Initially I am using FPGA vendor synthesis tools due to SystemVerilog requirements, but I will try yosys with SystemVerilog enhancements after I fix the major issues. My final target is to use open source tools for both FPGA and ASIC.

If you are still reading, I have a few questions and comments regarding the quark RTL, which I must say after reviewing it has a low WTF to code line ratio :).
I noticed separate adders are used for addition/subtraction and then the two results are multiplexed. There is also a separate adder for the load/store address. And as I remember a separate PC incrementer and an adder for branches/jumps.
Did you try different adder configurations to optimize timing and FPGA resource utilization?

While I have problems optimizing my unusual instruction decoder, I had some success optimizing adders and datapath multiplexers. My CPU is designed to execute all instructions in a single clock cycle without exceptions. At a minimum I need 2 adders, one for PC increment and branch, and another for everything else, ADD/SUB, branch compare, load/store address and jumps. I was able to see some good area vs. timing compromises by adding extra adders for each immediate value (branch, load, store).

I will run some of those optimization experiment on quark too and will give you some feedback.

@Mecrisp
Copy link
Contributor

Mecrisp commented Mar 10, 2022

The busy signals work this way:

Processor sets address/data lines and pulses RSTRB or WMASK for exactly one clock cycle.

In the cycle following the RSTRB or WMASK pulse, the processor checks for WBUSY/RBUSY and if high, waits until they go low again.

Memory or peripherals therefore can raise RBUSY or WBUSY one clock cycle after they received the request; if no busy signaling is necessary, peripherals can ignore these lines, which can even be tied low permanently. The processor holds address/data lines stable as long as the busy lines are set.

Oh well, yes, we certainly did experiment with different possibilities for internal adders/mux combinations and optimised Quark for size! Follow the complete history here: #1 Quark is the result of the most useful compromise fitting into the Icestick with HX1K FPGA.

If you want to go faster with Quark in terms of maximum frequency, choose Tachyon.

https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_tachyon.v

If you want to go faster with Quark in terms of cycles/instruction, which gives more gain in performance at the cost of size, try these:

http://mecrisp.sourceforge.net/2022_02_11-femtorv32_quark_barrel_shifter.v
http://mecrisp.sourceforge.net/2022_02_11-femtorv32_quark_bicycle.v

@Mecrisp
Copy link
Contributor

Mecrisp commented Mar 10, 2022

Processor reads data from peripherals/memories one cycle after the RSTRB pulse or later when RBUSY goes low again.

@Mecrisp
Copy link
Contributor

Mecrisp commented Mar 10, 2022

@BrunoLevy Shall I prepare a pull request to add Quark variant with barrel shifter and Quark variant with barrel shifter and two-cycle operation to maybe RTL/PROCESSOR/EXPERIMENTAL ? Individua is still WIP, but these are ready :-)

@BrunoLevy
Copy link
Owner

Hi @Mecrisp, I've added it to RTL/PROCESSOR (all of them are experimental in fact :-) It's super cool, love it !

@BrunoLevy
Copy link
Owner

@Mecrisp, Regarding how to include your new versions, it's as you prefer (I can include them / update them when you tell me, but for sure pull requests can make it simpler/easier)

@jeras
Copy link
Author

jeras commented Mar 12, 2022

I integrated quark into my testbench, yes, it was rather trivial.
Then I run riscv-arch-test and got the next results.
At least some issues are probably related to the environment (Verilog testbench, linker script, memory size, ...), the first thing I had to change was the timeout, since quark takes more cycles than my CPU.
Before you look too much into the results, please note, I did not try to debug any of the tests yet, I just run them.

Compare to reference files ... 

Check add-01                    ... OK 
Check addi-01                   ... OK 
Check and-01                    ... OK 
Check andi-01                   ... OK 
Check auipc-01                  ... FAIL 
Check beq-01                    ... OK 
Check bge-01                    ... OK 
Check bgeu-01                   ... OK 
Check blt-01                    ... OK 
Check bltu-01                   ... OK 
Check bne-01                    ... OK 
Check fence-01                  ... OK 
Check jal-01                    ... FAIL 
Check jalr-01                   ... OK 
Check lb-align-01               ... FAIL 
Check lbu-align-01              ... FAIL 
Check lh-align-01               ... FAIL 
Check lhu-align-01              ... FAIL 
Check lui-01                    ... OK 
Check lw-align-01               ... OK 
Check or-01                     ... OK 
Check ori-01                    ... OK 
Check sb-align-01               ... FAIL 
Check sh-align-01               ... FAIL 
Check sll-01                    ... OK 
Check slli-01                   ... OK 
Check slt-01                    ... OK 
Check slti-01                   ... OK 
Check sltiu-01                  ... OK 
Check sltu-01                   ... OK 
Check sra-01                    ... OK 
Check srai-01                   ... OK 
Check srl-01                    ... OK 
Check srli-01                   ... OK 
Check sub-01                    ... OK 
Check sw-align-01               ... OK 
Check xor-01                    ... OK 
Check xori-01                   ... OK 
--------------------------------
 FAIL: 8/38 RISCV_TARGET=r5p RISCV_DEVICE=I XLEN=32 

I suspect the *-align-01 tests fail due to how I connected the system bus byte enable bits.
I have no ideas yet why auipc-01 and jal-01 would fail, I will have a look at it and probably end up fixing a few more issues with my test environment.

Still I wonder if you are running riscv-arch-test regularly to check for regressions.

I also tried to update riscv-arch-test to the latest version (mine is a few months old) and I got many more test failing on both my CPU and quark, I will look into this too.
If you are not running this tests, I could try to help. But note, the make files are hard to follow, so at least for me this was not a trivial task.

@Mecrisp
Copy link
Contributor

Mecrisp commented Mar 12, 2022

Thanks for your efforts!

Failing of almost all alignment edge cases is no surprise for me, as they are implemented in a very small, but quirky way internally.

But I am really surprised on fails in auipc and jal. Could you elaborate on this and provide more data/insight/logs what exactly is going wrong, please? Maybe this is due to our limited ADDR_WIDTH; did you set it to 32 bits for testing? Default is 24 bits only on the address bus which is non-conformant.

@jeras
Copy link
Author

jeras commented Mar 12, 2022

I looked at the code now, and the errors seem to be some kind of sign extension issue.
The adder used to calculate both jal and auipc is wire [ADDR_WIDTH-1:0] PCplusImm, while the result is XLEN.
At least for some failures in auipc tests, the MSB bits are 1 in the reference and 0 in quark.
There are some different errors which I did not look into yet, but the testbench does not always compare the tested instruction results directly, so this might be the same issue.
I used the main XLEN sized ALU adder for this two instructions.

@Mecrisp
Copy link
Contributor

Mecrisp commented Mar 12, 2022

Set parameter ADDR_WIDTH to 32.

@BrunoLevy
Copy link
Owner

Yes, there are good chances it will fix the problem ! (We have created a possibly smaller internal address bus / address adder to save LUTs in the smallers FPGA like Ice40HX1K, but then it does not exactly follow the norm).

@jeras
Copy link
Author

jeras commented Mar 12, 2022

I would assume the sign extension on the immediate is not properly transferred to the result.

To add the test environment to FemtoRV I would need a step by step walk-through of current Verilator testbench scripts.
I tried to read the scripts, but got stuck with the firmware.hex (if I remember the name correctly) not being found.

I would probably have to modify the Verilator testbench code handling memory loads to use $value$plusargs for the firmware file name.
https://github.com/jeras/rp32/blob/master/hdl/tbn/r5p_tb.sv#L214-L225
I started having issues in December 2021, with make complaining about test scripts not handling multi-threading properly. Apparently it was about all tests reading the same hex file path into memories. I fixed this by passing file paths directly into the riscv-arch-test directory hierarchy. I provide the file names (firmware hex input and test signature dump as output) as argument to the Verilator generated simulation executable at runtime.
Using parameters for file names was my first idea, but it would mean recompiling the HDL for each test.
https://github.com/jeras/rp32/blob/master/test/r5p/device/rv32i_m/Makefile.include#L45

@BrunoLevy
Copy link
Owner

The configuration used for testbench is defined in this file:
RTL/CONFIGS/bench_config.v

See also this file:
RTL/CONFIGS/femtosoc_config.v
that loads the core definition depending on what was defined in bench_config.v

To configure the firmware for this core:

$ make BENCH.firmware_config

Then

$ cd FIRMWARE/EXAMPLES
$ make xxxx.hex

where xxxx is the name of the program you want to compile (pick tinyraytracer.hex for instance). It will copy the
generated ROM to firmware.hex.

Then to start simulation:

$ cd ../..
$ make BENCH

If everything goes well, you will see some raytracing computed by the RV32F core.
It works also for programs in assembly:

$ cd FIRMWARE/ASM_EXAMPLES
$ make mandelbrot_terminal.hex
$ cd ../..
$ make BENCH

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants