
Wishbone burst access #693

Open
stokdam opened this issue Sep 27, 2023 · 8 comments
Labels
enhancement New feature or request HW hardware-related


@stokdam

stokdam commented Sep 27, 2023

I'm opening this issue since I have not found a dedicated thread on this topic. It has already been discussed in #573 (comment).

I'll summarize it in the following lines.

As stated by @stnolting in #573 (comment), bus accesses are currently the main performance bottleneck of the architecture. This is because any request is asserted only after the previous one has been acknowledged. Especially with high-latency memory, this behavior crushes performance. For example, SDRAMs have a certain CAS latency (CL), which is added to each bus transaction. However, most SDRAMs offer burst accesses, which means you wait CL only once for the entire burst (e.g. 8 words) and then receive 1 word per clock cycle. This behavior suits caches well, in particular the instruction cache.

@NikLeberg, in #573 (reply in thread), proposed two ways in which burst accesses can be implemented:

  • Registered feedback mode:

CTI_IO() cycle type identifier and BTE_IO() burst type extension. A CTI value of 000 indicates a classic access, whereas 001 means a constant-address burst and 010 means an incrementing-address burst. The BTE would only be used for the incrementing burst to indicate after how many beats the burst should wrap around to the start address.

The advantage of this approach is that wrapping is actually supported by SDRAM bursts. It can be useful to reduce the first-word access latency after a miss. For example, if address 0x5 is required, you start an 8-word burst access at address 0x5; after receiving the word at address 0x7, the burst wraps around and sends you the word at address 0x0, and so on through 0x4, which is the last one. This allows the icache to always perform bursts that are aligned with cache blocks while still minimizing the miss penalty. The disadvantage is the necessity to implement the above-mentioned control signals.
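The wrap-around addressing described above can be sketched in VHDL roughly like this (a minimal illustration, not neorv32 code; all entity and signal names are made up, and an 8-word burst of 32-bit words is assumed):

```vhdl
-- Sketch: address generation for an 8-word wrapping burst (BTE = "10"
-- in Wishbone registered-feedback terms). Only the low 3 word-address
-- bits increment, so a burst starting at word 0x5 visits the words at
-- 0x5, 0x6, 0x7, then wraps to 0x0 and ends at 0x4.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity wrap_burst_addr is
  port (
    clk_i       : in  std_ulogic;
    start_i     : in  std_ulogic;                      -- begin a new burst
    base_addr_i : in  std_ulogic_vector(31 downto 0);  -- first (missed) address
    ack_i       : in  std_ulogic;                      -- one beat transferred
    adr_o       : out std_ulogic_vector(31 downto 0)
  );
end entity;

architecture rtl of wrap_burst_addr is
  signal base : std_ulogic_vector(31 downto 0);
  signal widx : unsigned(2 downto 0); -- word index inside the 8-word block
begin
  process(clk_i)
  begin
    if rising_edge(clk_i) then
      if start_i = '1' then
        base <= base_addr_i;
        widx <= unsigned(base_addr_i(4 downto 2)); -- start at the missed word
      elsif ack_i = '1' then
        widx <= widx + 1; -- wraps naturally after "111"
      end if;
    end if;
  end process;
  -- upper bits stay fixed on the 32-byte block; low bits count and wrap
  adr_o <= base(31 downto 5) & std_ulogic_vector(widx) & "00";
end architecture;
```

The point of the wrap is visible in the last line: the block-selecting upper bits are frozen for the whole burst, so the slave can stream the aligned block while the master still gets the missed word first.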

  • Using the real pipelined mode:

Instead of asserting STB basically identically to CYC and requiring that the address and data bits do not change, new valid data and a new address are presented on the bus on each clock where STB is asserted. The slave is expected to buffer them internally (if it needs wait states) and respond asynchronously with an ACK for each cycle. But now the slave also gets a new output signal STALL, with which it tells the master that the current STB cycle will be ignored and the STB must be repeated/kept.
Slaves could always assume an incrementing access and just stall the master if the address is not the expected increment (and load the correct value on the next clock).

This approach may be easier to implement on the CPU side.
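As a rough illustration of the master side of such a pipelined burst (a VHDL-2008 sketch with made-up names, not neorv32 code): requests advance whenever the slave does not stall, while acknowledges are counted independently.

```vhdl
-- Sketch of a pipelined Wishbone burst master: STB may present a new
-- address every cycle unless STALL is asserted; ACKs arrive later, one
-- per accepted request. A fixed 8-beat incrementing burst is assumed.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pipelined_master is
  port (
    clk_i   : in  std_ulogic;
    start_i : in  std_ulogic;                     -- kick off an 8-beat burst
    addr_i  : in  std_ulogic_vector(31 downto 0); -- burst start address
    stall_i : in  std_ulogic;
    ack_i   : in  std_ulogic;
    cyc_o   : out std_ulogic;
    stb_o   : out std_ulogic;
    adr_o   : out std_ulogic_vector(31 downto 0)
  );
end entity;

architecture rtl of pipelined_master is
  signal adr       : unsigned(31 downto 0);
  signal reqs_left : unsigned(3 downto 0) := (others => '0'); -- STBs still to issue
  signal acks_left : unsigned(3 downto 0) := (others => '0'); -- ACKs still expected
begin
  process(clk_i)
  begin
    if rising_edge(clk_i) then
      if start_i = '1' then
        adr       <= unsigned(addr_i);
        reqs_left <= to_unsigned(8, 4);
        acks_left <= to_unsigned(8, 4);
      else
        if stb_o = '1' and stall_i = '0' then -- beat accepted by the slave
          adr       <= adr + 4;
          reqs_left <= reqs_left - 1;
        end if;
        if ack_i = '1' then                   -- an earlier beat completed
          acks_left <= acks_left - 1;
        end if;
      end if;
    end if;
  end process;
  stb_o <= '1' when reqs_left /= 0 else '0';
  cyc_o <= '1' when acks_left /= 0 else '0'; -- CYC held until all ACKs are in
  adr_o <= std_ulogic_vector(adr);
end architecture;
```

Note that CYC stays asserted after the last STB until the final ACK returns, which is exactly what makes the in-flight beats part of one transaction.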

@NikLeberg
Collaborator

Hi there @stokdam!
I totally agree, the bus interface can be improved. I have not touched the topic since the discussions in #573, though, i.e. I have not implemented any VHDL code for it (so far).

But regarding Wishbone registered feedback mode vs. pipelined mode: I'd propose we look into pipelined. According to ZipCPU, it seems to be simple but efficient. The only problem is that an eventual error on the bus can no longer (safely) be assigned to an individual read or write, but only to the whole transaction that was in flight. Meaning that, for example, a burst of 16 bytes that triggers a slave error after 4 bytes may have to be restarted from byte 0.

Some more comments/ideas:

  • the CPU can't issue burst reads/writes, there is simply no RISC-V instruction for that
  • the only components that (currently) could profit from burst accesses are the caches and the DMA (which would help a lot!)
  • if the whole bus is to be switched to a pipelined mode, then not all neorv32 peripherals must necessarily be updated: we could move them behind a simple gateway that translates pipelined to classic accesses.
  • keep in mind that the neorv32 internal bus is not stock Wishbone though; it is very close but not the same

Do you @stokdam have by any chance experience with pipelined Wishbone?

@stokdam
Author

stokdam commented Sep 28, 2023

Hi @NikLeberg!

The only problem is that an eventual error on the bus can no longer (safely) be assigned to an individual read or write, but only to the whole transaction that was in flight

I do not understand this point. Aren't individual accesses independent from each other with the pipelined approach? The slave just assumes the master always wants to perform an incrementing burst (and stalls the pipeline if that is actually not the case), but it can raise an err/ack for each transferred word.

Do you @stokdam have by any chance experience with pipelined Wishbone?

Unfortunately, I first encountered Wishbone within this project; I have always worked with AMBA AHB/APB buses.

EDIT:

if the whole bus is to be switched to a pipelined mode, then not all neorv32 peripherals must necessarily be updated: we could move them behind a simple gateway that translates pipelined to classic accesses

A non-pipelined slave can interface with a pipelined master by properly asserting the stall signal (Paragraph 5.2 in the WB specs).
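For a classic slave that acknowledges each request individually, that translation can be sketched as a single gate (made-up signal names; see Wishbone B4, chapter 5.2, for the normative rule):

```vhdl
-- Sketch: let a classic (non-pipelined) slave live on a pipelined bus by
-- stalling the master whenever a request is in flight. This limits the
-- pipeline depth to one, so the slave never has to buffer requests.
stall_o <= cyc_i and stb_i and (not ack_o);
```

The master then degenerates to one outstanding access toward this slave, while fully pipelined slaves elsewhere on the bus remain unaffected.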

@NikLeberg
Collaborator

Let's assume the following burst write transaction:
[wavedrom timing diagram]

The master is writing five words of data over the bus, incrementing adr by 4 each clock.
The slave has a delay of 2, meaning the dat for the white cycle is acknowledged by the slave during the orange cycle. A delay of 1 is required by the pipelined nature of the bus. But a slave may acknowledge the transaction whenever it wants (as long as it does not assert stall, it has to buffer the incoming packets, though).
After a few more packets the slave asserts err. Now the problem is that Wishbone B4 does not clearly define what happens. What is defined is that the master drops cyc and the transaction is aborted. But which write triggered the error? Was it the blue write, or the orange one? And which packets were successful? Were the acked white and yellow writes really successful, or were they also invalidated?

For neorv32 we have to define exactly how the slaves shall behave. And how the master reacts to error conditions.

I'd handle/define it the same way you understood it initially: treat any err the same as an ack, except that the transaction is prematurely aborted. Meaning that the err is asserted by the slave with the same delay as it asserted the ack signal. In the example above this would mean that the error happened with the orange packet and the following blue and cyan packets were dropped.

I'm by no means an expert in this! I'm very open to discussion and love that you have started the initiative. 😃

@NikLeberg
Collaborator

I checked the HDL and found that almost all slaves on the internal bus seem to only ever have a latency of 1 from re or we to ack or err. Exceptions are:

  • neorv32_crc: latency of 5 clocks
  • neorv32_cfs: user defined, in the example 1 cycle though
  • neorv32_dcache: many, at least 2 * length of cache line
  • neorv32_icache: many, at least 2 * length of cache line
  • neorv32_xip: depends on flash chip

Masters that could profit from a burst access are:

  • cpu frontend (filling the prefetch buffer in one swoop)
  • icache
  • dcache
  • dma (would need an internal fifo though)

@stnolting can you validate/check my findings?

To support pipelined bus transactions the internal bus does not necessarily need to be implemented as a wishbone bus. But at least two more signals would need to be added. Something similar to the wishbone stall and cyc.
(This would make the internal bus virtually identical to wishbone. Only differences: internal_we <= wishbone_stb and wishbone_we, internal_re <= wishbone_stb and not wishbone_we and internal_rsvo. Other signals are equal.)

@stokdam
Author

stokdam commented Oct 2, 2023

A delay of 1 is required by the pipelined nature of the bus

You don't need to insert any extra cycle; the concept of a pipeline here means that you can assert a new request before the previous one has been acknowledged. However, the slave is allowed to answer in the clock cycle just after the one in which STB is asserted.

But a slave may acknowledge the transaction whenever it wants (as long as it does not assert stall, it has to buffer the incoming packets, though).

Hmm, I don't like the possibility of pushing an undefined number of transactions before getting any acknowledge. I think WB does not forbid this since it wants to be as flexible as possible. AHB, for example, provides the HREADY signal, which functions as an acknowledge as well as a stall request, thus allowing a maximum of one "pending" transaction.

I think it would be overkill to support multiple pending transactions on the master side, and also not so useful, since it would not guarantee better throughput. This means that, on the master side, the ACK_I signal also provides some kind of stall-request functionality. There are two possibilities:

  • Allow a maximum of two "pending" requests. Looking at your diagram, this means the following:
    • CC1: assert the white transaction
    • CC2: assert the yellow one and observe the ack signal
    • CC3: assert the orange transaction only if the ack signal was asserted during CC2
  • Allow a maximum of one "pending" request:
    • CC1: assert the white transaction
    • CC2: assert the yellow one only if the ack signal is asserted in the same clock cycle.

In order to implement the latter, the STB signal must be generated combinatorially from ACK_I; however, this would allow easier bridging with AMBA buses.
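The one-pending variant could be sketched roughly like this (a fragment with hypothetical std_ulogic signals `pending`, `more_beats`, `stb_o`, `ack_i`; not neorv32 code):

```vhdl
-- Sketch: at most one outstanding request. A new STB is issued either
-- when nothing is pending or, back-to-back, in the same cycle the
-- previous request is acknowledged.
process(clk_i)
begin
  if rising_edge(clk_i) then
    if stb_o = '1' and ack_i = '0' then
      pending <= '1'; -- request issued, its ACK still outstanding
    elsif ack_i = '1' and stb_o = '0' then
      pending <= '0'; -- drained without a new request being issued
    end if;
  end if;
end process;

-- combinatorial: ACK_I directly gates the next STB (the price of this scheme)
stb_o <= more_beats and ((not pending) or ack_i);
```

The last assignment makes the trade-off explicit: the combinatorial ACK_I-to-STB path is exactly what an AHB-style single-pending scheme requires.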

@stnolting
Owner

stnolting commented Oct 4, 2023

Hey everyone!

Burst accesses would be a great thing to have! My current evaluation setup uses an external DDR RAM, which is horribly slow when using single accesses. Hence, I am using a cache IP block between the core's Wishbone interface and the actual SDRAM controller.

Adding burst support to the NEORV32 itself has been on my to-do list for quite some time. But there are some difficulties here. As @NikLeberg has already correctly identified, there are several modules that could benefit from burst accesses:

  • Caches are the perfect candidates for burst accesses as they access memory block-based. However, we would first need to modify the data cache, which uses a "write-through" strategy right now (we would need "write-back" to fully utilize bursts).
  • In contrast to the caches, the DMA also requires constant-address bursts (for example for copying a memory block to a single IO device register like UART).
  • Adding burst accesses to the CPU front-end is a little bit more complex, as it might emit bursts that are not naturally aligned to their size (e.g. after executing a branch instruction).

To make things even more complicated, all the burst-requesting modules would need to be able to handle all kinds of responses, especially from:

  • modules that do not support bursts at all
  • modules with variable latency

Adding a dedicated stall signal like the Wishbone protocol suggests is something I would like to avoid, as it can result in a very long combinatorial feedback path, making timing closure more complicated (imagine a large Wishbone network with lots of muxes and switches; the stall signal would have to be passed asynchronously through all elements).

To support pipelined bus transactions the internal bus does not necessarily need to be implemented as a wishbone bus.

I agree. Actually, I am reworking the internal bus protocol right now to make it more Wishbone-like (replacing we and re with Wishbone's stb and we):

```vhdl
-- bus request --
type bus_req_t is record
  addr : std_ulogic_vector(31 downto 0); -- access address
  data : std_ulogic_vector(31 downto 0); -- write data
  ben  : std_ulogic_vector(03 downto 0); -- byte enable
  stb  : std_ulogic; -- request strobe (single-shot)
  rw   : std_ulogic; -- 0=read, 1=write
  src  : std_ulogic; -- access source (1=instruction fetch, 0=data access)
  priv : std_ulogic; -- set if privileged (machine-mode) access
  rvso : std_ulogic; -- set if reservation set operation (atomic LR/SC)
end record;
```

I like the idea of adding a buffer to the Wishbone interface that converts bursts to single accesses and vice versa, but I am not sure if that would really help to increase performance, as the transfers between the Wishbone port and the caches would still be single accesses only.

This is a complex topic and I am happy about any kind of help/feedback/ideas! 👍 Adding bursts to the system will help increase performance, especially when talking to external memory.

@stnolting added the enhancement (New feature or request) and HW (hardware-related) labels on Oct 4, 2023
@NikLeberg
Collaborator

NikLeberg commented Oct 4, 2023

  • In contrast to the caches, the DMA also requires constant-address bursts (for example for copying a memory block to a single IO device register like UART).

That is true. I think the best option would be to let the bus slave assume a specific burst mode. Meaning a FIFO or register may assume that a burst has a constant address, while a good old chunk of memory may assume incrementing accesses. If that assumption is not correct, the slave could stall the producer and correct its mode. Whatever that would mean for the slave...

  • Adding burst accesses to the CPU front-end is a little bit more complex, as it might emit bursts that are not naturally aligned to their size (e.g. after executing a branch instruction).

I don't really see a particular problem with non-naturally-aligned bursts... Can't it just issue a burst starting from the address it wants? It's up to the slave whether it can handle that or not. A slave can always just fall back to the single-transfer pattern.

Adding a dedicated stall signal like the Wishbone protocol suggests is something I would like to avoid, as it can result in a very long combinatorial feedback path, making timing closure more complicated (imagine a large Wishbone network with lots of muxes and switches; the stall signal would have to be passed asynchronously through all elements).

How about skid buffers? They provide the functionality of a stall signal but break up the combinatorial path. They basically register the stall signal and temporarily store the in-flight request that would otherwise get dropped due to the stall. Dan's implementation seems to have no added latency when the slave is not stalled.
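For reference, a one-entry skid buffer on the request path could look roughly like this (a sketch loosely following the idea from ZipCPU's write-up; all names are made up and this is not neorv32 code):

```vhdl
-- One-entry skid buffer: the upstream stall output comes from a flip-flop
-- instead of a long combinatorial path. The single request that is
-- accepted while the downstream side stalls is caught in the buffer, so
-- nothing is lost and pass-through adds zero latency when not stalled.
library ieee;
use ieee.std_logic_1164.all;

entity skid_buffer is
  port (
    clk_i, rstn_i : in  std_ulogic;
    -- upstream (master side)
    s_valid : in  std_ulogic;
    s_data  : in  std_ulogic_vector(31 downto 0);
    s_stall : out std_ulogic; -- registered: no combinatorial feedback
    -- downstream (slave side)
    m_valid : out std_ulogic;
    m_data  : out std_ulogic_vector(31 downto 0);
    m_stall : in  std_ulogic
  );
end entity;

architecture rtl of skid_buffer is
  signal buf_full : std_ulogic;
  signal buf_data : std_ulogic_vector(31 downto 0);
begin
  process(clk_i)
  begin
    if rising_edge(clk_i) then
      if rstn_i = '0' then
        buf_full <= '0';
      elsif m_stall = '1' and s_valid = '1' and buf_full = '0' then
        buf_full <= '1';  -- catch the request accepted during the stall
        buf_data <= s_data;
      elsif m_stall = '0' then
        buf_full <= '0';  -- buffered request drains first
      end if;
    end if;
  end process;
  s_stall <= buf_full;              -- stall upstream only while buffer busy
  m_valid <= s_valid or buf_full;
  m_data  <= buf_data when buf_full = '1' else s_data;
end architecture;
```

The key detail: in the cycle the downstream stall arrives, `s_stall` is still '0', so the upstream beat counts as accepted and is captured; the master then holds its *next* beat until the buffer drains. That answers the zero-delay question: the pass-through is purely combinatorial (`m_data <= s_data`), no latch involved.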

Having a stall-like signal is unavoidable, I think... 🤔

To support pipelined bus transactions the internal bus does not necessarily need to be implemented as a wishbone bus.

I agree. Actually, I am reworking the internal bus protocol right now to make it more Wishbone-like (replacing we and re with Wishbone's stb and we):

I like it! To support bursts we basically just need to add a cyc signal... and change a whole lot of bus slaves. 😅

I like the idea of adding a buffer to the Wishbone interface that converts bursts to single accesses and vice versa, but I am not sure if that would really help to increase performance, as the transfers between the Wishbone port and the caches would still be single accesses only.

I think this option would be more about not having to change the whole system at once. By putting the simpler components that do not profit from burst access that much behind a simple converter we could reduce the changes necessary.

Please correct me @stnolting, but if I'm not mistaken, the CPU waits for a write to ack before it issues the next instruction. This is partly because it can then directly issue a store exception if it must. Having a d$ does not change this, as it is write-through, right?
If we change to write-back, then that data might get lost when the write-back fails. So the store exception would be generated with quite a bit of delay, if it is generated at all. The same goes for pipelined accesses with a large latency: the CPU could go forward and issue the next instruction while the previous write has not completed. Is this something you would want the CPU to support? Or keep it more to the roots of the README:

Special focus is paid on execution safety to provide defined and predictable behavior at any time. Therefore, the CPU ensures that all memory accesses are properly acknowledged [...]. README.md

@stnolting
Owner

That is true. I think the best option would be to let the bus slave assume a specific burst mode. Meaning a FIFO or register may assume that a burst has a constant address, while a good old chunk of memory may assume incrementing accesses. If that assumption is not correct, the slave could stall the producer and correct its mode. Whatever that would mean for the slave...

Stalling again... 😅 I really don't feel comfortable with that. All bus controllers would need to support stalling, which would require additional logic. Furthermore, what if a faulty device never releases the stall signal again? Is this something the bus monitor would need to keep track of?

I don't really see a particular problem with non-naturally-aligned bursts... Can't it just issue a burst starting from the address it wants? It's up to the slave whether it can handle that or not. A slave can always just fall back to the single-transfer pattern.

Naturally aligned memory blocks are easier to handle in terms of hardware arbitration logic (just mask out log2(size_of_block) bits of the address). And I also think that a DDR controller would suffer from unaligned bursts, as several banks would need to be opened and closed. 🤔
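For illustration (made-up names, not neorv32 code): with a naturally aligned 32-byte block, "masking out log2(size)" bits reduces the arbitration decision to a single compare of the upper address bits:

```vhdl
-- Sketch: block decode for a naturally aligned 32-byte (8-word) block.
-- log2(32) = 5 low address bits are ignored, so every beat of an aligned
-- burst matches the same comparison; names are hypothetical.
block_match <= '1' when addr_i(31 downto 5) = block_base_c(31 downto 5) else '0';
```

With an unaligned burst, the upper bits change mid-burst, so this single compare no longer holds for the whole transaction.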

How about skid buffers? They provide the functionality of a stall signal but break up the combinatorial path. They basically register the stall signal and temporarily store the in-flight request that would otherwise get dropped due to the stall. Dan's implementation seems to have no added latency when the slave is not stalled.

I'll have a look at this. But how do they achieve zero delay on unstalled accesses? Is there a latch somewhere?

I think a plain FIFO would be much more generic (and maybe even cleaner) if we want to relax the critical path for bus systems that support stalling.

Having a stall-like signal is unavoidable, I think... 🤔

🙈 😅

I like it! To support bursts we basically just need to add a cyc signal... and change a whole lot of bus slaves. 😅

For the new bus protocol the cyc signal would contain only redundant information. But maybe this is truly required for bursts to identify all the accesses belonging to one burst.

I think this option would be more about not having to change the whole system at once. By putting the simpler components that do not profit from burst access that much behind a simple converter we could reduce the changes necessary.

The important question is (again): which modules would benefit from burst accesses? The CPU requires (at least) two cycles to access memory. If we just access internal memory, then there is no real gain from having bursts; even the caches do not bring any performance boost here.

If we ignore the DMA for a moment then only the XIP and the Wishbone interface (or external memory) would benefit from bursts/caches.

Please correct me @stnolting, but if I'm not mistaken, the CPU waits for a write to ack before it issues the next instruction. This is partly because it can then directly issue a store exception if it must. Having a d$ does not change this, as it is write-through, right?

That's correct. By having only one memory request in flight the CPU always knows which access went rogue and can issue precise exceptions.

If we change to write-back, then that data might get lost when the write-back fails. So the store exception would be generated with quite a bit of delay, if it is generated at all. The same goes for pipelined accesses with a large latency: the CPU could go forward and issue the next instruction while the previous write has not completed.

That's a good point I was not thinking about...

Is this something you would want the CPU to support? Or keep it more to the roots of the README:

To be honest, I would rather keep the "safety" approach than optimize for maximum bus throughput. This safety thing has become something of a niche application for this core. 😉
