
RV32IMC with IBusCachedPlugin #93

Open
MarekPikula opened this issue Oct 16, 2019 · 11 comments

@MarekPikula
Contributor

Hi, first of all, congratulations on your impressive work. I'm currently evaluating different RISC-V cores, and VexRiscv is my favourite so far.

I have a problem when trying to configure Briey for the IMC ISA. I tried Murax before with IBusSimplePlugin(compressedGen = true) and it worked just fine, but compressed support doesn't seem to work with IBusCachedPlugin. When I set compressedGen = true in Briey, I get the following error:

[Runtime] SpinalHDL v1.3.6    git head : 9bf01e7f360e003fac1dd5ca8b8f4bffec0e52b8
[Runtime] JVM max memory : 2444.5MiB
[Runtime] Current date : 2019.10.16 15:19:49
[Progress] at 0.000 : Elaborate components
PcManagerSimplePlugin is now useless

**********************************************************************************************
[Warning] Elaboration failed (0 error).
          Spinal will restart with scala trace to help you to find the problem.
**********************************************************************************************

[Progress] at 1.023 : Elaborate components
PcManagerSimplePlugin is now useless
Exception in thread "main" java.lang.Exception: Missing inserts : INSTRUCTION_ANTICIPATED
	at vexriscv.Pipeline$class.build(Pipeline.scala:95)
	at vexriscv.VexRiscv.build(VexRiscv.scala:86)
	at vexriscv.Pipeline$$anonfun$1.apply$mcV$sp(Pipeline.scala:161)
	at vexriscv.Pipeline$$anonfun$1.apply(Pipeline.scala:161)
	at vexriscv.Pipeline$$anonfun$1.apply(Pipeline.scala:161)
	at spinal.core.ClockDomain.apply(ClockDomain.scala:306)
	at spinal.core.Component$$anonfun$prePop$1.apply(Component.scala:124)
	at spinal.core.Component$$anonfun$prePop$1.apply(Component.scala:123)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at spinal.core.Component.prePop(Component.scala:123)
	at spinal.core.Component.delayedInit(Component.scala:138)
	at vexriscv.VexRiscv.<init>(VexRiscv.scala:86)
	at vexriscv.demo.Briey$$anon$3$$anon$4.<init>(Briey.scala:400)
	at vexriscv.demo.Briey$$anon$3.delayedEndpoint$vexriscv$demo$Briey$$anon$3$1(Briey.scala:395)
	at vexriscv.demo.Briey$$anon$3$delayedInit$body.apply(Briey.scala:346)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at spinal.core.ClockingArea.delayedInit(Area.scala:84)
	at vexriscv.demo.Briey$$anon$3.<init>(Briey.scala:346)
	at vexriscv.demo.Briey.delayedEndpoint$vexriscv$demo$Briey$1(Briey.scala:346)
	at vexriscv.demo.Briey$delayedInit$body.apply(Briey.scala:270)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at spinal.core.Component.delayedInit(Component.scala:131)
	at vexriscv.demo.Briey.<init>(Briey.scala:270)
	at vexriscv.demo.Briey$$anonfun$main$1.apply(Briey.scala:497)
	at vexriscv.demo.Briey$$anonfun$main$1.apply(Briey.scala:496)
	at spinal.core.internals.PhaseCreateComponent.impl(Phase.scala:1920)
	at spinal.core.internals.PhaseContext.doPhase(Phase.scala:195)
	at spinal.core.internals.SpinalVerilogBoot$$anonfun$singleShot$10.apply(Phase.scala:2156)
	at spinal.core.internals.SpinalVerilogBoot$$anonfun$singleShot$10.apply(Phase.scala:2154)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at spinal.core.internals.SpinalVerilogBoot$.singleShot(Phase.scala:2154)
	at spinal.core.internals.SpinalVerilogBoot$.apply(Phase.scala:2090)
	at spinal.core.Spinal$.apply(Spinal.scala:311)
	at spinal.core.SpinalConfig.generateVerilog(Spinal.scala:142)
	at vexriscv.demo.Briey$.main(Briey.scala:496)
	at vexriscv.demo.Briey.main(Briey.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.rt.execution.application.AppMainV2.main(AppMainV2.java:131)

I'm using the current master with the IntelliJ IDE.

@Dolu1990
Member

Thanks :)

It isn't really a bug, but a combination of incompatible features.
Basically, with the default Briey configuration, the cache takes two cycles (https://github.com/SpinalHDL/VexRiscv/blob/master/src/main/scala/vexriscv/demo/Briey.scala#L69), which delivers the instruction in the decode stage. There is also INSTRUCTION_ANTICIPATED, which provides the future value of the decode-stage instruction; it allows the register file to use a synchronous RAM read, with INSTRUCTION_ANTICIPATED as the address, to produce RS1/RS2 in the decode stage.

The issue with RVC in that case is that RVC decompression is only done in the decode stage, which does not allow the INSTRUCTION_ANTICIPATED value used by the register file to be produced.
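The timing constraint can be sketched in SpinalHDL terms. This is an illustrative fragment only, not the actual VexRiscv plugin code; `instructionAnticipated` and `decodeInstruction` are hypothetical signal names standing in for the real pipeline stageables:

```scala
import spinal.core._

// Hypothetical signals: next decode instruction vs. current decode instruction.
val instructionAnticipated = Bits(32 bits)
val decodeInstruction      = Bits(32 bits)

val regFile = Mem(Bits(32 bits), wordCount = 32)

// Sync read (block RAM): the address is sampled on the clock edge and the
// data arrives one cycle later. To have RS1 ready in the decode stage, the
// address must come from the *next* decode instruction, i.e. INSTRUCTION_ANTICIPATED.
val rs1Sync = regFile.readSync(instructionAnticipated(19 downto 15).asUInt)

// Async read (distributed RAM): the decode-stage instruction itself is enough,
// which is why switching the RegFilePlugin to ASYNC sidesteps the problem.
val rs1Async = regFile.readAsync(decodeInstruction(19 downto 15).asUInt)
```

Bits 19 downto 15 are the RS1 field of a 32-bit RISC-V instruction, which is why the (decompressed) instruction has to exist before the synchronous read can be issued.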

There are multiple ways to work around that.
If you are using an FPGA with distributed-RAM capability, I would just switch the RegFilePlugin from SYNC to ASYNC.
Alternatively, you can set the IBusCachedPlugin twoCycleRam and twoCycleCache options to false.
Or you can set https://github.com/SpinalHDL/VexRiscv/blob/master/src/main/scala/vexriscv/plugin/IBusCachedPlugin.scala#L36 to true in the CPU config, which adds an additional stage to the fetch pipeline and allows INSTRUCTION_ANTICIPATED to be generated.
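As a rough sketch, the workarounds look like this in a Briey/VexRiscv plugin list. The cache sizing values are illustrative, and the exact placement of the parameters (plugin vs. InstructionCacheConfig) may differ between VexRiscv versions, so check them against your checkout:

```scala
// Config fragment (illustrative), assuming the parameter names discussed above.
val cpuPlugins = ArrayBuffer(
  new IBusCachedPlugin(
    resetVector = 0x80000000L,
    compressedGen = true,            // enable RVC decompression
    config = InstructionCacheConfig(
      cacheSize = 4096,
      bytePerLine = 32,
      wayCount = 1,
      addressWidth = 32,
      cpuDataWidth = 32,
      memDataWidth = 32,
      catchIllegalAccess = true,
      catchAccessFault = true,
      asyncTagMemory = false,
      twoCycleRam   = false,         // workaround: single-cycle cache RAM
      twoCycleCache = false          // workaround: single-cycle cache
    )
  ),
  new RegFilePlugin(
    regFileReadyKind = plugin.ASYNC, // alternative workaround: distributed-RAM read
    zeroBoot = false
  )
  // ... remaining plugins unchanged
)
```

With an ASYNC register file, the SYNC two-cycle cache settings can stay as they are; pick whichever trade-off suits your FPGA's RAM primitives.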

@MarekPikula
Contributor Author

After disabling twoCycleRam and twoCycleCache it works like a charm 👍

I'd propose adding a note about this to the README, and possibly adding a cached IMC variant to the standard configurations and the size/performance breakdown, since IMC is quite a popular configuration.

Besides, what makes IMC less performant than IM? I'm running CoreMark and it's about 5% slower. It's not a huge difference, but I'm curious, especially since for the Syntacore scr1 it's the other way around (IM is about 2% less performant than IMC). These are not huge differences, but it's still interesting to know the cause. VexRiscv is the first SpinalHDL project I've looked at, so I don't have enough competence to dig through the code and figure it out myself.

@Dolu1990
Member

A performance hit due to RVC which should impact most implementations (and at least VexRiscv) is that a branch/jump to an unaligned 32-bit instruction forces the fetcher to fetch two words before it can deliver that 32-bit instruction.

You can imagine other cases where it requires reading two different cache lines, or even two different MMU TLB entries.
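The alignment cases can be sketched with a small helper (illustrative Scala, not VexRiscv code):

```scala
// With RVC enabled, instructions are halfword-aligned. A 32-bit instruction
// whose PC ends in 0b10 starts in the upper half of a 32-bit fetch word and
// spills into the next one, so the fetcher needs two word fetches (and
// possibly two cache lines, or even two TLB entries) before delivering it.
def needsTwoFetches(pc: Long): Boolean = (pc & 0x3) == 2

assert(!needsTwoFetches(0x80000000L)) // word-aligned 32-bit op: one fetch
assert( needsTwoFetches(0x80000002L)) // halfword-aligned 32-bit op: two fetches
```

Without RVC, every instruction is word-aligned, so the straddling case never occurs.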

I don't really know the Syntacore scr1 architecture.
The RVC performance improvement on CoreMark could come from op-fusion? But otherwise I don't really see any reason, except maybe less i$ thrashing, although CoreMark should normally fit entirely into a 4 KB i$/d$. But scr1 is cacheless, right?

@MarekPikula
Contributor Author

To be honest, I didn't dig too deep into the scr1 architecture, but from what I can see there is no caching.

FYI, I'm using scr1 inside PULPissimo with their logarithmic interconnect for memory, which is very fast (so caching is not as necessary). I'll check whether there is any performance difference for Ibex, which I was also evaluating.

@MarekPikula
Contributor Author

Same story for Ibex: IM is about 1.5% less performant than IMC. If I find some time, maybe I'll try to fit a cacheless VexRiscv into PULPissimo and run some benchmarks to see what the difference is.

@Dolu1990
Member

logarithmic interconnect for memory

What is that :D ?

IM is about 1,5% less performant than IMC

That's weird, I'm curious to understand why XD
I mean, technically speaking, RVC is a subset of RV. So maybe that's down to compiler creativity?

@MarekPikula
Contributor Author

You can read more about the log interconnect in the article "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices" (here is a copy I've found, if you don't have access to IEEE).

Compiler creativity has nothing to do with it in this particular case, because on all the platforms I'm testing I execute the exact same code, compiled with the exact same compiler and the exact same flags, so the only difference is the core and the SoC platform.

@tomverbeure
Contributor

I noticed a similar case on the picorv32: when RVC is enabled, Dhrystone performance is slightly lower when running an identical RV32IM binary.

@MarekPikula
Contributor Author

For me it's exactly the same core (so full IMC), running either IMC or IM code, so the only variable is the compiled code.

@MarekPikula
Contributor Author

@Dolu1990 for your reference, here is the log interconnect code: pulp-platform/L2_tcdm_hybrid_interco.

@MarekPikula
Contributor Author

And more about log interconnect: https://iis-people.ee.ethz.ch/~arahimi/papers/DATE11.pdf

xobs added a commit to xobs/VexRiscv-verilog that referenced this issue Nov 12, 2019
According to SpinalHDL/VexRiscv#93 compressed
mode doesn't work with a sync register file, and with
twoCycleRam/twoCycleCache.  Fix compressed instructions by moving to an
async register file.

Signed-off-by: Sean Cross <sean@xobs.io>