Support multicore configurations #85

Closed
mithro opened this issue Sep 15, 2019 · 8 comments

@mithro
Contributor

mithro commented Sep 15, 2019

It would be good to support VexRISCV in multicore configurations.

With the low resource usage of VexRISCV, supporting 2- or 4-core complexes on cheap boards would be very possible. We could then use it at litex-hub/linux-on-litex-vexriscv#47.

As VexRISCV is now being used to run Linux, SMP support would hopefully improve performance.

@SpinalHDL / @Dolu1990 - What would be needed to make this happen? I assume a bunch of stuff around atomics and cache coherence?

@Dolu1990
Member

As far as I know, these would be the changes required:

  • As you said, a coherent data cache. Possibly adding write-back behaviour in addition to the current write-through, to avoid producing too many write transactions (see the sketch after this list).
  • No changes required for atomics: basically, if the data cache is coherent, the atomics, which are done in the data cache itself, become coherent as well.
  • It would also require a coherent memory interconnect, maybe a coherent L2 depending on the design chosen.
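
A back-of-the-envelope plain-Scala sketch of the write-traffic point: the object and numbers here are hypothetical, not VexRiscv code; it only counts how many memory write transactions a write-through D$ produces versus a write-back one for stores that all hit the same line.

```scala
// Hypothetical traffic model, not VexRiscv code: count memory write
// transactions produced by stores that all hit the same cache line.
object WritePolicySketch extends App {
  val lineBytes = 32
  // 4 passes writing every word of the same 32-byte line (32 stores total).
  val stores: Seq[Int] = (0 until 4).flatMap(_ => 0 until lineBytes by 4)

  // Write-through: every store is also forwarded to memory.
  val writeThroughTraffic = stores.size

  // Write-back: stores only dirty the line; memory sees it once,
  // when the dirty line is eventually evicted (or probed by another core).
  val writeBackTraffic = stores.map(_ / lineBytes).distinct.size

  println(s"stores executed       : ${stores.size}")       // 32
  println(s"write-through traffic : $writeThroughTraffic") // 32 memory writes
  println(s"write-back traffic    : $writeBackTraffic")    // 1 memory write (on eviction)
}
```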

Basically, here is a list of things to do on VexRiscv to improve various aspects:

  • Increasing the data width of the I$ and D$ memory busses to allow faster miss refill; currently they are only 32 bits wide. This would boost each core, as the main bottleneck currently seems to be the I$ miss penalty. This kind of thing does not show up when running benchmarks such as Dhrystone and CoreMark, as they work on a very small code base, but running Linux is an I$ hell ^^
  • Data cache coherency for SMP
  • Having xtval, xtvec, xepc and xscratch in a block RAM instead of raw registers; this would save area and avoid having to reduce the feature set in machine mode (done to save area)
  • FPU

Currently I'm slammed by third-party obligations, but in March I should have much more free time to move things forward.

@WillGreen

Would you mind elaborating on the use-case for SMP @mithro?

My gut feeling is that SMP support would be impressive, but not that useful. There are more straightforward ways to boost performance that benefit a wide range of uses, not just SMP-aware operating systems. For example, increasing the width of cache busses, as @Dolu1990 has already mentioned.

I see CPUs on FPGAs as providing control and performing high-complexity operations, such as division and square root. If you’re doing something parallel and performance-critical, you can do it in a co-processor or separate logic block, without complicating the core VexRiscv design. If there is going to be a substantial addition of functionality, then I think an FPU is more desirable than additional integer cores.

Another approach is to place multiple VexRiscv cores on one FPGA and link them with a bus. Such a design wouldn’t “just work” in Linux, but would allow custom designs to perform more CPU operations in parallel if required.

A big part of what makes VexRiscv unique is its clean, elegant design. I’d hate to lose that.

@Dolu1990
Member

Dolu1990 commented Nov 7, 2019

I'm currently crushed by non-SpinalHDL/VexRiscv obligations :(
So my hope is that in March I will be free of most of them and can move things forward.

@WillGreen @mithro Maybe SMP would be possible in some kind of FPGA-friendly / clean way:

  • Keeping the memory bus untouched by the SMP stuff
  • Having one coherency directory which would manage and track cache line states (invalid, shared, unique)
  • To write, a CPU should have the line in the unique state
  • To read, a CPU should have the line in a shared or unique state

To negotiate line states, there would be 2 channels between each CPU and the directory:

Channel 1

  • Directory -> CPU: to ask to invalidate a line (allowing the line to be made unique for another CPU) or to make it shared (which would imply the CPU isn't allowed to keep the line dirty)
  • CPU -> Directory: to notify when one of the above transactions is completed

Channel 2

  • CPU -> Directory: to request a line in the unique or shared state
  • Directory -> CPU: to notify when one of the above transactions is done

Channel 1 should have priority over channel 2.
The memory channel shares no logic with channels 1 and 2, as it isn't connected in any way to the directory, so nothing specific there.

Then all data transactions would still be done by the same bus as currently. This would allow the system to be brought onto all sorts of memory systems without specific requirements.
It would also avoid duplicating the data path, as I would like to avoid TileLink-like things.

That's just my current unrefined idea of how SMP could be done, but I'm not an expert in that field XD
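
A behavioral plain-Scala sketch of the directory idea above, just to make the state transitions concrete. Everything here (`LineState`, the `probe`/`acquire` functions, two CPUs) is a hypothetical model, not VexRiscv code: channel 1 corresponds to the `probe` call, channel 2 to the `acquire` call.

```scala
// Hypothetical behavioral model of the directory proposal, not VexRiscv code.
object DirectorySketch extends App {
  sealed trait LineState
  case object Invalid extends LineState
  case object Shared  extends LineState
  case object Unique  extends LineState

  val cpuCount = 2
  // The directory's view of each cache line's state, per CPU.
  val state = Array.fill(cpuCount)(
    collection.mutable.Map[Long, LineState]().withDefaultValue(Invalid)
  )

  // Channel 1: directory -> CPU probe, asking to invalidate a line or to
  // downgrade it to shared; the CPU acks once its cache has applied it.
  def probe(cpu: Int, line: Long, toShared: Boolean): Unit =
    state(cpu)(line) = if (toShared) Shared else Invalid

  // Channel 2: CPU -> directory request for a line in shared or unique state.
  // The directory first probes the other CPUs (channel 1 has priority over
  // channel 2), then grants the request.
  def acquire(cpu: Int, line: Long, unique: Boolean): Unit = {
    for (other <- 0 until cpuCount if other != cpu && state(other)(line) != Invalid) {
      if (unique) probe(other, line, toShared = false)                            // others must drop the line
      else if (state(other)(line) == Unique) probe(other, line, toShared = true)  // others may keep a shared copy
    }
    state(cpu)(line) = if (unique) Unique else Shared
  }

  // CPU 0 writes line 0x40 (needs Unique), then CPU 1 reads it (needs Shared).
  acquire(cpu = 0, line = 0x40L, unique = true)
  acquire(cpu = 1, line = 0x40L, unique = false)
  println(s"cpu0: ${state(0)(0x40L)}, cpu1: ${state(1)(0x40L)}") // cpu0: Shared, cpu1: Shared
}
```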

@Dolu1990
Member

Dolu1990 commented Nov 7, 2019

Another solution would be more AXI ACE-like, with 3 channels:

Channel 1 (write) :

  • cpu -> interconnect : write cmd + data
  • interconnect -> cpu : ack

Channel 2 (read / reserve) :

  • cpu -> interconnect : ask to read and/or make a given address unique/shared
  • interconnect -> cpu : read data
  • cpu -> interconnect : ack

Channel 3 (probe) :

  • interconnect -> cpu : address to invalidate / make shared
  • cpu -> interconnect : ack

This would result in 3 streams carrying an address and 2 streams carrying data.
The CPU can write the data back to memory when it gets a probe hit.
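
A rough SpinalHDL-flavoured sketch of how those streams might be declared, assuming one stream per direction per channel; all bundle and field names here are invented for illustration and are not the interface VexRiscv actually adopted (the real draft appears later in this thread).

```scala
// Illustrative SpinalHDL sketch; bundle/field names are assumptions,
// not the actual VexRiscv SMP memory interface.
import spinal.core._
import spinal.lib._

// Channel 1 (write): cmd + data from the CPU, ack back.
case class WriteCmd() extends Bundle { val address = UInt(32 bits); val data = Bits(32 bits) }
case class WriteRsp() extends Bundle { val error = Bool() }
// Channel 2 (read / reserve): ask to read and/or make an address unique/shared.
case class ReadCmd()  extends Bundle { val address = UInt(32 bits); val unique = Bool() }
case class ReadRsp()  extends Bundle { val data = Bits(32 bits); val last = Bool() }
case class ReadAck()  extends Bundle { val dummy = Bool() } // placeholder bit for the final ack
// Channel 3 (probe): interconnect asks to invalidate or share an address.
case class ProbeCmd() extends Bundle { val address = UInt(32 bits); val toShared = Bool() }
case class ProbeRsp() extends Bundle { val hit = Bool() }

case class CoherentBus() extends Bundle with IMasterSlave {
  val writeCmd = Stream(WriteCmd())   // channel 1: cpu -> interconnect
  val writeRsp = Stream(WriteRsp())   // channel 1: interconnect -> cpu
  val readCmd  = Stream(ReadCmd())    // channel 2: cpu -> interconnect
  val readRsp  = Stream(ReadRsp())    // channel 2: interconnect -> cpu
  val readAck  = Stream(ReadAck())    // channel 2: cpu -> interconnect
  val probeCmd = Stream(ProbeCmd())   // channel 3: interconnect -> cpu
  val probeRsp = Stream(ProbeRsp())   // channel 3: cpu -> interconnect

  override def asMaster(): Unit = {   // directions as seen from the CPU side
    master(writeCmd, readCmd, readAck, probeRsp)
    slave(writeRsp, readRsp, probeCmd)
  }
}
```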

@Dolu1990
Member

Dolu1990 commented Nov 9, 2019

A variation of the above proposal could be that a CPU cache reacts to a probe request in one of the following ways (sketched in code below):

  • cache miss => channel 3 rsp ack
  • cache hit => write back to memory on channel 1 + channel 3 rsp ack
  • cache hit (and this one is special) => channel 1 probe response with the related data, no channel 3 rsp ack. This would allow data to be moved between caches at a cheap cost while avoiding adding a new datapath.
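
A tiny plain-Scala transcription of those three reactions, just to spell out the decision; `ProbeReaction`, `onProbe` and the `canForward` flag are invented names, and whether forwarding is permitted would be a property of the interconnect, which is not modelled here.

```scala
// Hypothetical model of the probe-reaction variation, not VexRiscv code.
object ProbeVariation extends App {
  sealed trait ProbeReaction
  case object AckOnly                      extends ProbeReaction // channel 3 rsp ack only
  case class  WriteBackThenAck(data: Long) extends ProbeReaction // channel 1 write-back + channel 3 ack
  case class  ForwardData(data: Long)      extends ProbeReaction // channel 1 probe response carrying the data, no channel 3 ack

  def onProbe(hit: Boolean, canForward: Boolean, lineData: Long): ProbeReaction =
    if (!hit)            AckOnly                     // cache miss
    else if (canForward) ForwardData(lineData)       // special case: cache-to-cache transfer on the existing datapath
    else                 WriteBackThenAck(lineData)  // plain write-back, then ack

  println(onProbe(hit = true, canForward = true, lineData = 0x12345678L)) // ForwardData(305419896)
}
```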

@Dolu1990
Member

Dolu1990 commented Feb 14, 2020

There is a draft of the coherency interface spec here:
https://github.com/SpinalHDL/VexRiscv/blob/dev/doc/smp/smp.md

@Dolu1990 Dolu1990 pinned this issue Apr 10, 2020
@Dolu1990
Member

Dolu1990 commented Apr 10, 2020

Quite some progress here.

dev branch :
https://github.com/SpinalHDL/VexRiscv/tree/smp

Basically, the aim is to implement a write-through invalidate coherency protocol for the CPU L1 D$.
There are a few reasons not to adopt (yet) a write-back based coherency protocol for the L1:

  • Write-back penalties / workaround cost
  • Write-allocate penalties / workaround cost
  • Write-allocate virtually reduces the cache size / number of ways (e.g. a single-way D$ doing a memory copy can end up in severe cache thrashing if source and destination are aligned; see the sketch after this list)
  • Interconnect complexity; not that it is particularly heavy for a write-back proposal, but still, quite some complexity added
  • Latency added on the interconnect
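
A small plain-Scala model of the third bullet (invented names, not VexRiscv code): a direct-mapped (single-way) D$ copies one cache-sized buffer to a destination that maps onto the same sets, and we count line refills with and without write-allocate.

```scala
// Hypothetical thrashing model, not VexRiscv code: a 1-way D$ copying a
// buffer to a destination that maps onto the same cache sets.
object WriteAllocateThrashing extends App {
  val cacheBytes = 4096
  val lineBytes  = 32
  val sets       = cacheBytes / lineBytes
  def setOf(addr: Int): Int = (addr / lineBytes) % sets

  val src = 0x00010000 // the buffers are a multiple of the cache size apart,
  val dst = 0x00020000 // so src+i and dst+i always land in the same set

  def memcpyRefills(writeAllocate: Boolean): Int = {
    val tags    = Array.fill(sets)(-1) // line currently held in each set (-1 = empty)
    var refills = 0
    def access(addr: Int, isStore: Boolean): Unit = {
      val line = addr / lineBytes
      val set  = setOf(addr)
      val hit  = tags(set) == line
      if (!hit && (!isStore || writeAllocate)) { // a store miss only refills when write-allocate is on
        refills += 1
        tags(set) = line
      }
    }
    for (i <- 0 until cacheBytes by 4) { // word-by-word copy of one cache-sized buffer
      access(src + i, isStore = false)
      access(dst + i, isStore = true)
    }
    refills
  }

  println(s"line refills with write-allocate   : ${memcpyRefills(writeAllocate = true)}")  // 2048: every access misses
  println(s"line refills without write-allocate: ${memcpyRefills(writeAllocate = false)}") // 128: only the source lines are fetched
}
```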

Currently, all the CPU-side stuff is implemented.
In a single-core config, with random invalidations coming from the testbench, it can boot Linux.

LR/SC now use the memory bus "exclusive" feature, something similar to the AXI4 one, while AMOs are emulated in the CPU hardware using those same LR/SC memory bus requests.
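
A behavioral plain-Scala sketch of that scheme: an AMO expressed as an LR/SC retry loop over exclusive memory accesses. The reservation model and the names (`loadReserved`, `storeConditional`, `amoAdd`) are illustrative assumptions, not the actual VexRiscv implementation.

```scala
// Hypothetical behavioral model, not VexRiscv code: AMO emulated with
// LR/SC-style exclusive accesses on the memory bus.
object AmoViaLrSc extends App {
  val hartCount   = 2
  val mem         = collection.mutable.Map[Long, Int]().withDefaultValue(0)
  val reservation = Array.fill[Option[Long]](hartCount)(None) // one exclusive monitor per hart

  def write(hart: Int, addr: Long, data: Int): Unit = {
    mem(addr) = data
    // Any store kills other harts' reservations on that address.
    for (h <- 0 until hartCount if h != hart && reservation(h).contains(addr)) reservation(h) = None
  }

  def loadReserved(hart: Int, addr: Long): Int = { reservation(hart) = Some(addr); mem(addr) }

  def storeConditional(hart: Int, addr: Long, data: Int): Boolean = {
    val ok = reservation(hart).contains(addr)
    if (ok) write(hart, addr, data)
    reservation(hart) = None
    ok
  }

  // amoadd emulated as an LR/SC retry loop, retrying until the store-conditional succeeds.
  def amoAdd(hart: Int, addr: Long, increment: Int): Int = {
    var old = loadReserved(hart, addr)
    while (!storeConditional(hart, addr, old + increment)) old = loadReserved(hart, addr)
    old
  }

  write(hart = 1, addr = 0x100L, data = 40)
  println(amoAdd(hart = 0, addr = 0x100L, increment = 2)) // returns the old value: 40
  println(mem(0x100L))                                    // memory now holds 42
}
```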

Synthesis looks good: FMax is only 5% lower (without any time spent to improve that), and LUT occupancy is 2% higher.

The only requirements for the interconnect are to implement exclusive accesses and to propagate write requests as invalidation requests to the other CPUs.

@Dolu1990 Dolu1990 unpinned this issue May 2, 2020
@Dolu1990
Member

Dolu1990 commented May 2, 2020

Done :)

https://github.com/enjoy-digital/litex_vexriscv_smp
https://github.com/SpinalHDL/VexRiscv/blob/smp/src/main/scala/vexriscv/demo/smp/VexRiscvSmpCluster.scala#L21

But I have to document things.

@Dolu1990 Dolu1990 closed this as completed May 2, 2020