Using DMA-enabled SPI more effectively? #772

Open
henrygab opened this Issue Apr 14, 2019 · 6 comments

2 participants
@henrygab
henrygab commented Apr 14, 2019

Can FastLED be … faster? I think it can, at least when the hardware SPI supports DMA and the chip has some extra RAM....

I'd love to know that this is already done. :) The goal is to be able to prepare a "next" update buffer while a current update buffer is still in the process of being sent via SPI DMA hardware.

Today, even with hardware SPI with DMA capability, bytes are still sent one (or two) at a time. This is a waste of CPU cycles! The DMA engine provides greater benefits with larger buffers. Even without interrupts, just changing when one waits for a larger SPI DMA operation to complete should enable parallelization of some processing.

Strawman flow:

When adding LEDs that use SPI with hardware DMA:

  1. Calculate the largest buffer required to send an entire update (full-update buffer)
  2. Allocate N (three?) of those full-update buffers (thus the extra RAM)

When FastLED.show(), for each set of LEDs that use this method:

  1. Select the least-recently used full-update buffer (cycle between them)
  2. Apply all color corrections, dimming, etc. to that full-update buffer
  3. Hand that full-update buffer to the hardware-DMA-SPI-enabled class, along with the length of bytes that must be sent from that buffer.

What the FastSPI class would then do:

  1. Check and wait as needed for the ability to set up a new SPI DMA transaction
    • e.g., on some chipsets, this just means TX has started on the prior transaction
    • e.g., on other chipsets, this would be the same as the prior transaction completing
  2. Set up the new SPI chipset transaction, using the provided full-update buffer / length
  3. Check and wait as needed for the ability to start a new SPI DMA transaction (and clear the STARTED/COMPLETED events)
  4. Start the new SPI DMA transaction
  5. Return immediately, without waiting for TX to start or the transaction to complete
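
A minimal sketch of the strawman flow above, run against a simulated peripheral. Everything here is hypothetical and for illustration only: `MockSpiDma`, `BufferRing`, and `show` are not FastLED classes, and the brightness multiply is just a stand-in for the real color correction and dimming.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a DMA-capable SPI peripheral. On real hardware,
// busy() would poll a status/event register and start() would program the
// DMA pointer and length registers; here both are simulated.
struct MockSpiDma {
    const uint8_t* active = nullptr;  // buffer currently "on the wire"
    std::size_t activeLen = 0;
    int transfersStarted = 0;

    bool busy() const { return active != nullptr; }

    // Simulates the hardware finishing the in-flight transfer.
    void completeTransfer() { active = nullptr; activeLen = 0; }

    // Starts a transfer and returns immediately (no wait for completion).
    void start(const uint8_t* data, std::size_t len) {
        assert(!busy());
        active = data;
        activeLen = len;
        ++transfersStarted;
    }
};

// Rotates through N full-update buffers so the CPU can fill one while the
// peripheral is still sending another.
template <std::size_t N>
class BufferRing {
public:
    explicit BufferRing(std::size_t bytesPerUpdate) {
        for (auto& b : buffers_) b.resize(bytesPerUpdate);
    }

    // Returns the least-recently used buffer (simple round-robin).
    std::vector<uint8_t>& next() {
        std::vector<uint8_t>& b = buffers_[index_];
        index_ = (index_ + 1) % N;
        return b;
    }

private:
    std::array<std::vector<uint8_t>, N> buffers_;
    std::size_t index_ = 0;
};

// One show() call: scale pixels into the next buffer, then hand it to the
// peripheral. The only wait is for the previous transfer to release the bus.
template <std::size_t N>
void show(BufferRing<N>& ring, MockSpiDma& spi,
          const std::vector<uint8_t>& pixels, uint8_t brightness) {
    std::vector<uint8_t>& buf = ring.next();
    assert(pixels.size() <= buf.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        // Stand-in for color correction and dimming.
        buf[i] = static_cast<uint8_t>((pixels[i] * (brightness + 1)) >> 8);
    }
    while (spi.busy()) {
        spi.completeTransfer();  // simulation only; real code would poll
    }
    spi.start(buf.data(), pixels.size());
}
```

The point of the ordering in `show` is that the scaling loop runs before the wait, so it overlaps with whatever transfer is still in flight from the previous call.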

Other potential benefits

Interrupts need not be disabled.
Reliable transmission.
Clockless via SPI?

Once the full-update buffer is set up, the SPI chipset will ensure the bits are sent at the selected clock speed, increasing reliability of SPI updates for long LED strings, even when other non-blockable interrupts occur (I'm looking at those shiny new "integrated BLE" chipsets <cough>nRF52</cough>).

If the SPI chipset supports the exact speed needed for a clockless chip, it may be possible to reliably control the clockless chip, even with interrupts enabled.
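
As a rough illustration of the clockless-via-SPI idea, one widely used approach (an assumption here, not something taken from this thread or from FastLED) is to expand every data bit into three SPI bits, so that a 1 becomes `110` and a 0 becomes `100`. Whether the resulting pulse widths fall inside a given chip's tolerances depends on the SPI clock chosen and on the chip, so treat this as a sketch. The function name `encodeClocklessAsSpi` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expands each data bit (MSB first) into three SPI bits:
//   data 1 -> 110 (long high pulse), data 0 -> 100 (short high pulse).
// Each input byte therefore becomes exactly three output bytes.
std::vector<uint8_t> encodeClocklessAsSpi(const std::vector<uint8_t>& data) {
    std::vector<uint8_t> out;
    out.reserve(data.size() * 3);
    for (uint8_t byte : data) {
        uint32_t bits = 0;  // collects the 24 encoded bits for this byte
        for (int i = 7; i >= 0; --i) {
            bits = (bits << 3) | (((byte >> i) & 1) ? 0x6u : 0x4u);
        }
        out.push_back(static_cast<uint8_t>(bits >> 16));
        out.push_back(static_cast<uint8_t>(bits >> 8));
        out.push_back(static_cast<uint8_t>(bits));
    }
    assert(out.size() == data.size() * 3);
    return out;
}
```

The encoded buffer is three times the size of the pixel data, which is part of why the extra RAM mentioned earlier matters.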

Question....

Does this framework (or something similar) already exist? I am only familiar with the nRF51 codebase (and the nRF52 chips), but it appears to support only single-byte SPI transactions....

Thanks!

@focalintent

Member

focalintent commented Apr 14, 2019

This is basically what the OctoWS2811 driver does - but for clockless chipsets. I do want to do a framework for a variety of DMA based outputs - but that's further down the line.

What I found, though, is that with chipsets like the APA102 the time spent scaling/dithering was roughly in line with the time it took to write out a byte. The Teensy 3.x SPI output does a lot of this: it pushes out data, then scales/dithers the next chunk while that data is being written out, basically parallelizing the scaling/dithering with the output writing. For the most part, the benefit that would've come from using DMA was more theoretical than not, even more so considering that the clockless chips still reign supreme and APA102 hasn't quite managed to kick them out. With the clockless chips, parallel output is a far better option (doubly so if you can get DMA'd parallel output, like with the OctoWS2811 on the Teensy 3.x).

Since I'm completely redoing the controller architecture as part of the rgbw/16-bit work, I'm going to be putting in a framework for playing with DMA output on platforms/systems that would support it.

(Also - anyone who has tried to drive APA102s at over 12MHz is laughing hysterically at the "reliable transmission" line : )

Also also - the DMA mechanism for every chipset is wildly different in configuration/operation - it makes GPIO access look downright uniformly defined!

@henrygab

Author

henrygab commented Apr 14, 2019

Heh, reliable as in clocking, and even 4MHz SPI is a win. 8MHz is expected to be more reliable on the nRF52 by setting a higher-strength drive mode (both to zero and to one), to tighten timing. Wiring, pin selection, and the like are another thing … but getting clocking out of the way?

Teensy 3... arm\KL66? That state-machine based Write template is scary excellent! 👍 I can see so many opportunities for the compiler to collapse instructions... Impressive!

Comments to understand what the code is doing, what the flags correspond to, and what the (implied) contracts are … those are unfortunately missing. This makes it really hard for someone who isn't deeply familiar with the K66 registers to make a mental model of what's happening. I hope you won't mind questions?

writeWord() appears to do the following:

  1. if PRE wait state, loops until bitflag 0x4000 in register SPIX.SR is clear.
    a. Is SPIX.SR essentially the "status register" for the SPI transfer?
    b. What does bitflag 0x4000 correspond to for SPIX.SR? (transfer complete? transfer started?)
  2. set SPIX.PUSHR register to a value, based on two of the template parameters + 16-bit value to be written.
    a. Is SPIX.PUSHR a configuration register that modifies how the next SPI byte is transmitted?
    b. What does the "EOQ" in SPI_PUSHR_EOQ stand for?
    c. What does the "CONT" in SPI_PUSHR_CONT stand for / what is the purpose of ECont?
    d. Is it correct that ELast refers to whether the callback is the last one in a SPI transaction?
    e. Is it correct that SPI_PUSHR_CTAS(N) returns a value (for SPIX.PUSHR) corresponding to the number of bytes to transfer? (used only with N=0 or N=1)
  3. unconditionally set bitflag SPI_SR_TCF in SPIX.SR
    a. What does the "TCF" in SPI_SR_TCF stand for?
  4. if POST wait state, loops until bitflag 0x4000 in register SPIX.SR is clear.

In writeBit(), it states to clear the FMSZ bits. What in the world is an FMSZ bit? What's the point of the writeBit() function, since it appears to be writing an entire word?

The <D>writeBytes() function appears to:

  1. For each current byte in the buffer:
    a. Calling D::Adjust(currentByte) -- is this for color correction?
    b. Passing the adjusted byte to writeByte() -- why is there no wait here?
  2. Calling D::postBlock(len) -- what is this for?
  3. waitFully(), which is a mystery in itself... is this equivalent to waiting for all the bytes to have been both sent and received?

Final q: wait1() appears to wait until at least one of the flags 0x8000, 0x4000, or 0x2000 in SPIX_SR is set.... but what does that represent? (e.g., transfer started, tx complete, transfer complete?)

So many great optimization tricks in that code. What I write will be unlikely to enable the compiler to collapse as much as K66's code -- I can see it when it's written, but lack the experience to know the tricks to try. Hopefully, having a slower working (correct) code base will enable the optimizations to be added later, by optimization experts such as yourself. :)

@focalintent

Member

focalintent commented Apr 14, 2019

Unfortunately - I wrote register-heavy code like that with the KL66 data sheet in one hand, a scope in the other, and a keyboard/compiler in the rest. I'll answer some of these at random, but don't have the time at the moment to go digging through the code and the data sheet :)

FMSZ -- "FraMe SiZe" -- the writeBit method resets the frame size to 1 bit and then writes that bit out. This is necessary for irritating chipsets like the SM16756 (I think that's the one) that has a 1-bit header at the start of each frame facepalm

D::postBlock is there to account for chipsets that have different end-frame activities (before releasing the SPI hardware).

Also - on reliable clocking - the problem with the APA102 is the signal regeneration happening in each chip - in which case, it doesn't matter what you do with the clocking you give it, it's going to mess you up if your strip is long enough : )

The thing that the KL66 has is a small set (8 bytes, I forget?) of bytes that you can pre-buffer for the SPI output - think of it as small-scale DMA - trying to optimize this is where I discovered that the scaling/dithering was the bottleneck more than writing the LED data, which made fully implementing DMA seem less urgent.

waitFully is doing two things - the first is waiting for the queued-up bytes to be fully sent over to the shift register for the SPI output - then once that's done it waits for the shift register to be fully written out. This is as opposed to wait - which just waits for there to be room in the queue to add data to get shifted out.

writeByte does the waiting itself:

	static void writeByte(uint8_t b) __attribute__((always_inline)) { wait(); SPIX.PUSHR = SPI_PUSHR_CTAS(0) | (b & 0xFF); SPIX.SR |= SPI_SR_TCF;}

Note that it waits for there to be space to write something - it doesn't wait before returning (because, well, time spent waiting is time that'd better be spent prepping the next set of data for writing : )

@henrygab

Author

henrygab commented Apr 14, 2019

Wonderful! Thank you. Knowing the implied contracts is super-helpful:

  • wait() <== wait until can add buffer to SPI queue
  • waitFully() <== wait for ALL buffered SPI transfers to complete (queue empty, SPI transaction complete)

Of course, the buffer here is either one or two bytes at a time.
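
To make the two contracts concrete, here is a small simulation. It is only a sketch: `MockSpi`, its queue depth, and the `tick()` model are invented for illustration and do not describe the real K66 registers or FIFO.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Simulated SPI peripheral with a small TX queue feeding a shift register.
// The queue depth and tick() model are illustrative, not K66 behavior.
class MockSpi {
public:
    explicit MockSpi(std::size_t depth) : depth_(depth) {}

    // Blocks only until the queue has room for one more byte.
    void wait() {
        while (queue_.size() >= depth_) tick();
    }

    // Blocks until the queue is drained and the shift register is idle.
    void waitFully() {
        while (!queue_.empty() || shifting_) tick();
    }

    // Queues a byte and returns without waiting for it to be sent.
    void writeByte(uint8_t b) {
        wait();
        queue_.push_back(b);
    }

    bool idle() const { return queue_.empty() && !shifting_; }
    std::size_t queued() const { return queue_.size(); }
    const std::vector<uint8_t>& sent() const { return sent_; }

private:
    // Advances the simulated hardware by one byte time.
    void tick() {
        if (shifting_) {
            sent_.push_back(current_);  // shift register finished a byte
            shifting_ = false;
        }
        if (!queue_.empty()) {
            current_ = queue_.front();  // load next byte from the queue
            queue_.pop_front();
            shifting_ = true;
        }
    }

    std::size_t depth_;
    std::deque<uint8_t> queue_;
    std::vector<uint8_t> sent_;
    uint8_t current_ = 0;
    bool shifting_ = false;
};
```

In this model, `writeByte` returns as soon as the byte is queued, so the caller can prepare the next byte while earlier ones are still being shifted out; only `waitFully` guarantees that everything has left the wire.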

IIRC, the APA102's signal regeneration inverts the clock signal … a neat hack. It's a long shot, as I'd be shocked if not true, but … does FastLED ensure both the high and low clock signal times are substantially equivalent? If not, I could see that causing all sorts of corruption, as every other APA102 chip might be dealing with timings that are somewhat off?

Do I recall correctly that the APA102 chips also need the clock to continue for a bit after the data was sent? I seem to recall something quirky there, especially when exceeding 64(?) LEDs....

@focalintent

Member

focalintent commented Apr 14, 2019

For hardware SPI - that's up to the SPI hardware to define the shape of the clock - but https://www.pjrc.com/why-apa102-leds-have-trouble-at-24-mhz/ talks more about what's happening with the APA102's clock.

As for the clock cycling at the end of a frame, that's what the APA102 chipset end boundary is for:

https://github.com/FastLED/FastLED/blob/master/chipsets.h#L210
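
For a sense of scale, the "64 LEDs" recollection above lines up with a commonly cited rule of thumb: if each APA102 delays the data by half a clock, then n LEDs need at least n/2 extra clock pulses after the last LED frame, and a 32-bit end frame covers 64 LEDs. The sketch below only encodes that rule of thumb. It is an assumption, not a copy of what the linked FastLED code does, and `apa102EndFrameBytes` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Rule-of-thumb assumption: each APA102 delays the data by half a clock,
// so n LEDs need at least n/2 extra clock pulses after the last LED frame.
// Returns that count rounded up to whole bytes (8 clocks per byte).
std::size_t apa102EndFrameBytes(std::size_t numLeds) {
    std::size_t extraClocks = (numLeds + 1) / 2;  // ceil(n / 2)
    return (extraClocks + 7) / 8;                 // ceil(clocks / 8)
}
```

Under this assumption a 32-LED strip needs 2 end-frame bytes and a 64-LED strip needs 4, so strips longer than 64 LEDs are where a fixed 4-byte end frame would stop being enough.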

@henrygab

Author

henrygab commented Apr 16, 2019

FastPin PR incoming.
That's a great writeup on what's going on with the APA102 clock.
I'm not going to hit those limits, as my longest planned section is 32 LEDs, and I don't intend to go above 12MHz SPI.
