Add a high performance, general purpose RSP command queue engine #253

Merged on Mar 27, 2022 (2 commits)

Conversation

snacchus
Contributor

This PR introduces the rspq library (short for "RSP command queue"), which provides the basic infrastructure for using the RSP coprocessor very efficiently. On the CPU side, it implements an API to enqueue "commands" for the RSP into a ring buffer that the RSP consumes concurrently in the background. On the RSP side, it provides the core loop that reads and executes the queue prepared by the CPU, plus the infrastructure to write "RSP overlays": libraries that plug into the RSP command queue to perform actual RSP jobs (e.g. 3D graphics, audio).

The library is extremely efficient. It is designed for very high throughput and low latency, as the RSP pulls from the queue concurrently while the CPU fills it. Through careful synchronization, both CPU and RSP run fully lockless: they never need to explicitly synchronize with each other (unless requested by the user). The CPU can keep filling the queue and only has to wait for the RSP if the queue becomes full; conversely, the RSP can keep processing the queue without ever talking to the CPU.

The library is designed so that thousands of RSP commands can be enqueued per frame without measurable overhead, which should be more than enough for most use cases.
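To illustrate the producer/consumer relationship described above, here is a minimal sketch of a single-producer, single-consumer ring buffer in portable C11. It is not the rspq implementation (the real queue lives in RDRAM and is consumed by RSP ucode via DMA); all `demo_*` names and the capacity are hypothetical. It only shows why the producer needs to wait exclusively when the queue is full, and why the consumer never needs to talk back to the producer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capacity for the sketch; must be a power of two so that the
   unsigned position counters can wrap around safely. */
#define DEMO_QUEUE_WORDS 8u
_Static_assert((DEMO_QUEUE_WORDS & (DEMO_QUEUE_WORDS - 1u)) == 0,
               "capacity must be a power of two");

typedef struct {
    uint32_t words[DEMO_QUEUE_WORDS];
    atomic_uint write_pos;  /* advanced only by the producer (the "CPU") */
    atomic_uint read_pos;   /* advanced only by the consumer (the "RSP") */
} demo_queue_t;

static void demo_queue_init(demo_queue_t *q)
{
    assert(q != 0);
    atomic_init(&q->write_pos, 0u);
    atomic_init(&q->read_pos, 0u);
}

/* Producer side. Returns false only when the queue is full, which is the
   one case where the producer would have to wait for the consumer. */
static bool demo_queue_push(demo_queue_t *q, uint32_t word)
{
    unsigned w = atomic_load_explicit(&q->write_pos, memory_order_relaxed);
    unsigned r = atomic_load_explicit(&q->read_pos, memory_order_acquire);
    if (w - r == DEMO_QUEUE_WORDS)
        return false;
    q->words[w % DEMO_QUEUE_WORDS] = word;
    /* Publish the word only after it has been written. */
    atomic_store_explicit(&q->write_pos, w + 1u, memory_order_release);
    return true;
}

/* Consumer side. Returns false when there is nothing to process yet. */
static bool demo_queue_pop(demo_queue_t *q, uint32_t *word)
{
    unsigned r = atomic_load_explicit(&q->read_pos, memory_order_relaxed);
    unsigned w = atomic_load_explicit(&q->write_pos, memory_order_acquire);
    if (w == r)
        return false;
    *word = q->words[r % DEMO_QUEUE_WORDS];
    atomic_store_explicit(&q->read_pos, r + 1u, memory_order_release);
    return true;
}
```

Each side writes only its own position counter and merely reads the other one, so neither side ever takes a lock.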

Commands

Each command in the queue is made of one or more 32-bit words (currently up to 15). The most significant byte of the first word is the command ID. Its upper 4 bits are the "overlay ID", which identifies the overlay able to execute the command; its lower 4 bits are the command index, which identifies the command within the overlay. For instance, command ID 0x37 is command index 7 in overlay 3.

As the RSP executes the queue, it parses the command ID and dispatches the command for execution. When required, the RSP automatically loads the overlay needed to execute a command. In the previous example, the RSP loads overlay 3 into IMEM/DMEM (unless it is already loaded) and then dispatches command 7 to it.
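The decoding and the "load only if not already loaded" behavior can be sketched in plain C as follows. This is a CPU-side model for illustration only: the real dispatch happens in RSP ucode, and the `demo_*` types and functions are hypothetical. How the remaining 24 bits of the first word are used is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t overlay_id;  /* upper 4 bits of the command ID */
    uint8_t cmd_index;   /* lower 4 bits of the command ID */
} demo_cmd_id_t;

/* Split the most significant byte of the first word into its two fields. */
static demo_cmd_id_t demo_decode(uint32_t first_word)
{
    uint8_t cmd_id = (uint8_t)(first_word >> 24);
    demo_cmd_id_t id;
    id.overlay_id = (uint8_t)(cmd_id >> 4);
    id.cmd_index  = (uint8_t)(cmd_id & 0x0F);
    return id;
}

/* Hypothetical model of the dispatcher state. */
typedef struct {
    int loaded_overlay;      /* -1 while no overlay is resident */
    unsigned load_count;     /* how many overlay loads were needed */
    uint8_t last_cmd_index;  /* index of the last dispatched command */
} demo_rsp_t;

/* Load the overlay only when it differs from the resident one, then
   "dispatch" the command by recording its index. */
static void demo_dispatch(demo_rsp_t *rsp, uint32_t first_word)
{
    assert(rsp != 0);
    demo_cmd_id_t id = demo_decode(first_word);
    if (rsp->loaded_overlay != (int)id.overlay_id) {
        rsp->loaded_overlay = (int)id.overlay_id;
        rsp->load_count++;
    }
    rsp->last_cmd_index = id.cmd_index;
}
```

Consecutive commands for the same overlay pay the load cost once, which is why grouping commands by overlay is cheap in this model.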

Higher-level libraries and overlays

Higher-level libraries that come with their own RSP ucode can be designed to use the RSP command queue, so that they coexist efficiently with all other RSP libraries provided by libdragon. By using the overlay mechanism, each library obtains its own overlay ID and enqueues commands for the RSP through the same shared queue. Overlay IDs are allocated dynamically by rspq in registration order, to avoid conflicts between libraries.

End users can then use all these libraries at the same time, without having to arrange for complex RSP synchronization or asynchronous execution, or to plan for efficient context switching. In fact, they don't even need to be aware that the libraries are using the RSP. Through the unified command queue, the RSP can be used efficiently and effortlessly, without idle time and without wasting CPU cycles waiting for one task to complete before switching to another.

Higher-level libraries that are designed to use the RSP command queue must:

  • Call rspq_init at initialization. The function can be called multiple times by different libraries, with no side effects.
  • Call rspq_overlay_register to register an rsp_ucode_t as an RSP command queue overlay, obtaining an overlay ID to use.
  • Provide higher-level APIs that, when required, call rspq_write and rspq_flush to enqueue commands for the RSP. For instance, a matrix library might provide a "matrix_mult" function that internally calls rspq_write to enqueue a command for the RSP to perform the calculation.
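The first two requirements (an init function that is safe to call repeatedly, and overlay IDs handed out in registration order) can be modeled with a small standalone registry. This is a sketch and does not use the libdragon API: every `demo_*` name is hypothetical, the limit of 16 follows from the 4-bit overlay ID, and starting at ID 1 is an assumption of this model rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_OVERLAYS 16  /* a 4-bit overlay ID allows at most 16 values */
#define DEMO_FIRST_ID     1   /* assumption: the model treats ID 0 as reserved */

/* Must be zero-initialized before the first call to demo_init. */
typedef struct {
    bool initialized;
    const void *overlays[DEMO_MAX_OVERLAYS];
    uint8_t next_id;
} demo_registry_t;

/* Safe to call from several libraries: only the first call does any work. */
static void demo_init(demo_registry_t *reg)
{
    assert(reg != NULL);
    if (reg->initialized)
        return;
    for (size_t i = 0; i < DEMO_MAX_OVERLAYS; i++)
        reg->overlays[i] = NULL;
    reg->next_id = DEMO_FIRST_ID;
    reg->initialized = true;
}

/* Hands out IDs in registration order; returns -1 when no ID is left. */
static int demo_overlay_register(demo_registry_t *reg, const void *ucode)
{
    assert(reg != NULL && reg->initialized);
    assert(ucode != NULL);
    if (reg->next_id >= DEMO_MAX_OVERLAYS)
        return -1;
    reg->overlays[reg->next_id] = ucode;
    return reg->next_id++;
}
```

Because IDs are only known at runtime, a library in this model would store the returned ID and combine it with its command index each time it builds a command ID.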

To be compatible with the queue engine, a ucode must include rsp_queue.inc at the top of the file and define a header and a command table at the beginning of its data section using RSPQ_BeginOverlayHeader, RSPQ_DefineCommand and RSPQ_EndOverlayHeader. An overlay ucode doesn't have a single entry point: it exposes multiple functions bound to different commands, which the queue engine calls when it reaches the corresponding commands in the queue. See tests/rsp_test.S for an example.

Blocks

A block (rspq_block_t) is a prerecorded sequence of RSP commands that can be played back. Blocks can be created via rspq_block_begin / rspq_block_end, and then executed by rspq_block_run. It is also possible to do nested calls (a block can call another block), up to 8 levels deep.

A block is very efficient to run because it is played back by the RSP itself. The CPU just enqueues a single command that "calls" the block. It is thus much faster than enqueuing the same commands every frame.

Notice that this library does not support static (compile-time) blocks. Blocks must always be created at runtime once (eg: at init time) before being used.
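The record-once, play-back-many idea, including nested calls and the depth limit, can be sketched as a small interpreter in C. This is only a model: in rspq the playback is done by the RSP, not by a CPU loop, and the `demo_*` names, the fixed block capacity, and the convention that the top-level block counts as depth 1 are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_NESTING    8   /* nesting limit quoted above */
#define DEMO_BLOCK_CAPACITY 16  /* hypothetical fixed size for the sketch */

typedef struct demo_block demo_block_t;

/* An entry is either a plain command word or a call to another block. */
typedef struct {
    uint32_t word;
    const demo_block_t *call;  /* NULL for a plain command */
} demo_entry_t;

struct demo_block {
    demo_entry_t entries[DEMO_BLOCK_CAPACITY];
    size_t count;
};

static void demo_block_add_cmd(demo_block_t *b, uint32_t word)
{
    assert(b->count < DEMO_BLOCK_CAPACITY);
    b->entries[b->count].word = word;
    b->entries[b->count].call = NULL;
    b->count++;
}

static void demo_block_add_call(demo_block_t *b, const demo_block_t *callee)
{
    assert(b->count < DEMO_BLOCK_CAPACITY);
    b->entries[b->count].word = 0;
    b->entries[b->count].call = callee;
    b->count++;
}

/* Plays back a block starting at the given depth (1 for a top-level run).
   Returns the number of plain commands executed, or -1 if the nesting
   limit is exceeded. "Executing" a command just adds its word to *sum. */
static int demo_block_run(const demo_block_t *b, int depth, uint32_t *sum)
{
    if (depth > DEMO_MAX_NESTING)
        return -1;
    int executed = 0;
    for (size_t i = 0; i < b->count; i++) {
        const demo_entry_t *e = &b->entries[i];
        if (e->call != NULL) {
            int n = demo_block_run(e->call, depth + 1, sum);
            if (n < 0)
                return -1;
            executed += n;
        } else {
            *sum += e->word;
            executed++;
        }
    }
    return executed;
}
```

The caller's cost is a single "run this block" request regardless of how many commands the block contains, which is the property that makes blocks cheaper than re-enqueuing the same commands every frame.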

Syncpoints

The RSP command queue is designed to be fully lockless, but sometimes the CPU needs to know whether the RSP has actually executed an enqueued command (e.g. to use its result). For this purpose, the library offers a synchronization primitive called a "syncpoint" (rspq_syncpoint_t). A syncpoint is created via rspq_syncpoint and records the current writing position in the queue. It is then possible to call rspq_check_syncpoint to check whether the RSP has reached that position, or rspq_wait_syncpoint to wait until it does.

Syncpoints are implemented using RSP interrupts, so their overhead is small but still measurable. They should not be abused.
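One way to picture a syncpoint is as a pair of counters: one counts syncpoints created by the CPU, the other counts syncpoints the RSP has passed (bumped from the interrupt handler). The sketch below models that idea in standalone C. It is an illustration, not the code in src/rspq.c; the `demo_*` names are hypothetical and the real bookkeeping may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t demo_syncpoint_t;

typedef struct {
    uint32_t created;  /* syncpoints enqueued so far (CPU side) */
    uint32_t done;     /* syncpoints already reached (RSP side) */
} demo_sync_t;

/* CPU side: mark the current writing position and return its identifier. */
static demo_syncpoint_t demo_syncpoint_new(demo_sync_t *s)
{
    assert(s != 0);
    return ++s->created;
}

/* Models the interrupt raised when the RSP reaches a syncpoint. */
static void demo_rsp_reached_syncpoint(demo_sync_t *s)
{
    assert(s->done != s->created);  /* cannot pass a syncpoint never created */
    s->done++;
}

/* Non-blocking check. The difference is computed in unsigned arithmetic and
   then viewed as signed, so the comparison keeps working when the counters
   wrap around. */
static bool demo_syncpoint_check(const demo_sync_t *s, demo_syncpoint_t sp)
{
    return (int32_t)(s->done - sp) >= 0;
}
```

In this model, waiting on a syncpoint amounts to polling `demo_syncpoint_check` until it returns true, and an earlier syncpoint stays satisfied once a later one has been reached.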

High-priority queue

This library offers a mechanism to preempt RSP execution and give priority to very urgent tasks: the high-priority queue. From the moment a high-priority queue is created via rspq_highpri_begin, the RSP suspends execution of the standard command queue and switches to the high-priority queue, waiting for commands. All commands added via the standard APIs (rspq_write) are then directed to the high-priority queue until rspq_highpri_end is called. Once the RSP has finished executing all the commands enqueued in the high-priority queue, it resumes execution of the standard queue.

If required, it is possible to call rspq_highpri_sync to wait for the high-priority queue to be fully executed.
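The redirection of writes and the preemption order can be modeled with two FIFOs. This standalone C sketch simplifies the behavior described above: the consumer simply prefers the high-priority FIFO whenever it holds pending commands, instead of modeling the exact moment the RSP switches queues. All `demo_*` names and the fixed capacity are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FIFO_CAPACITY 16  /* hypothetical; no wraparound in this sketch */

typedef struct {
    uint32_t items[DEMO_FIFO_CAPACITY];
    size_t head;  /* next item to consume */
    size_t tail;  /* next free slot */
} demo_fifo_t;

/* Must be zero-initialized before use. */
typedef struct {
    demo_fifo_t normal;
    demo_fifo_t highpri;
    bool highpri_open;  /* true between begin and end */
} demo_rspq_t;

static void demo_highpri_begin(demo_rspq_t *q)
{
    assert(!q->highpri_open);
    q->highpri_open = true;
}

static void demo_highpri_end(demo_rspq_t *q)
{
    assert(q->highpri_open);
    q->highpri_open = false;
}

/* The same write call targets whichever queue is currently selected. */
static void demo_write(demo_rspq_t *q, uint32_t cmd)
{
    demo_fifo_t *f = q->highpri_open ? &q->highpri : &q->normal;
    assert(f->tail < DEMO_FIFO_CAPACITY);
    f->items[f->tail++] = cmd;
}

/* Consumer step: drain high-priority commands first, then resume the
   standard queue. Returns false when both queues are empty. */
static bool demo_rsp_step(demo_rspq_t *q, uint32_t *cmd)
{
    demo_fifo_t *f = (q->highpri.head < q->highpri.tail) ? &q->highpri
                                                         : &q->normal;
    if (f->head == f->tail)
        return false;
    *cmd = f->items[f->head++];
    return true;
}
```

Commands already sitting in the standard queue are not lost: they are executed after the high-priority ones, in their original order.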

Final notes

For an explanation of implementation details, see documentation in src/rspq.c.

In addition to the core library, this PR also ports the existing RSP mixer library to be compatible with rspq.

The implementation of this library was a combined effort between @rasky and me. I prototyped the command queue, implemented overlay support and ported the mixer library. Rasky added the blocks and highpri queue, drove the API design and wrote most of the absolutely incredible rspq ucode, which is both extremely fast and as efficient on instruction size as humanly possible, leaving plenty of IMEM for overlays.

Collaborator

anacierdem left a comment

This is really well implemented and documented! Loved it!

I was only able to review the header files for now (and a few other places here and there) and tried to write down my questions as they came to mind. I will continue with the implementation, but that's something I'm not very proficient in. I will eventually go over the whole thing, but I'm OK to merge once someone goes over my current comments if I lag behind.

This is huge! Congrats to both of you @rasky & @snacchus! Can't tell how happy I am having you building upon frankenGAS :)

The only downside is we have more to do on the pre-emptive multithreading now :) One step at a time though.

@snacchus
Contributor Author

snacchus commented Mar 19, 2022

As mentioned in a reply above, I revised some of the API in rsp_queue.inc to make writing overlays a little easier and less error-prone. The location of the saved state no longer needs to be passed explicitly to the RSPQ_BeginOverlayHeader macro. Instead, you just place the data that needs to be persisted between a pair of RSPQ_BeginSavedState and RSPQ_EndSavedState. The overlay registration now also asserts that the size of the saved state is greater than zero; previously, a zero-sized saved state caused undefined behavior. If no data needs to be persisted, RSPQ_EmptySavedState can be used instead.

I also added a bunch of documentation to the macros in rsp_queue.inc and a short guide on how to write overlays.

Collaborator

anacierdem left a comment

I went over the implementation rather quickly as well. Please forgive me if some of the questions do not make sense. I believe most of these things might feel trivial to you in hindsight.

Can't believe how much work you have put into this!

I'm ready to merge this whenever you feel comfortable. I will also run the tests on my hardware, just in case we find something different with the setup.

@snacchus
Contributor Author

@meeq @Ryzee119 since this PR also affects the mixer, it would be fantastic if you guys could test it in your games to verify we didn't break anything. It should just work out of the box, no additional code required. And sound the same of course :)

snacchus and others added 2 commits March 27, 2022 15:54
Co-authored-by: Giovanni Bajo <rasky@develer.com>
rasky merged commit 558399b into DragonMinded:trunk Mar 27, 2022