Using DMA-enabled SPI more effectively? #772
Can FastLED be … faster? I think it can, at least when the hardware SPI supports DMA and the chip has some extra RAM....
I'd love to know that this is already done. :) The goal is to be able to prepare a "next" update buffer while a current update buffer is still in the process of being sent via SPI DMA hardware.
Today, even with hardware SPI with DMA capability, bytes are still sent one (or two) at a time. This is a waste of CPU cycles! The DMA engine provides greater benefits with larger buffers. Even without interrupts, just changing when one waits for a larger SPI DMA operation to complete should enable parallelization of some processing.
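To make the idea concrete, here is a minimal double-buffering sketch. Everything in it is hypothetical: `FakeSpiDma` is a stand-in that simulates a DMA-capable SPI peripheral, and `DoubleBufferedOutput` is not an existing FastLED class. It only illustrates the point that the wait moves to the moment the next frame is handed over.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an SPI peripheral with DMA. On real hardware,
// start() would program the DMA registers and busy() would poll a status flag.
struct FakeSpiDma {
    const uint8_t* active = nullptr;  // buffer currently owned by the DMA engine
    size_t remaining = 0;             // bytes still to be clocked out
    std::vector<uint8_t> wire;        // bytes sent so far (for inspection only)

    void start(const uint8_t* data, size_t len) {
        assert(!busy() && "previous transfer must finish first");
        active = data;
        remaining = len;
    }
    bool busy() const { return remaining != 0; }
    // Simulates the hardware making progress while the CPU does other work.
    void tick(size_t bytes) {
        while (bytes-- && remaining) { wire.push_back(*active++); --remaining; }
    }
};

// Two frame buffers: one is being sent while the other is being filled.
class DoubleBufferedOutput {
public:
    DoubleBufferedOutput(FakeSpiDma& spi, size_t frameBytes)
        : spi_(spi),
          bufs_{std::vector<uint8_t>(frameBytes), std::vector<uint8_t>(frameBytes)} {}

    // Buffer the CPU may write into right now (never the one in flight).
    std::vector<uint8_t>& back() { return bufs_[backIndex_]; }

    // Hand the filled buffer to the DMA engine and swap roles. The only wait
    // is here, and only if the previous frame is still going out.
    void show() {
        while (spi_.busy()) spi_.tick(1);  // real code would poll a flag instead
        spi_.start(bufs_[backIndex_].data(), bufs_[backIndex_].size());
        backIndex_ ^= 1;
    }

private:
    FakeSpiDma& spi_;
    std::vector<uint8_t> bufs_[2];
    int backIndex_ = 0;
};
```

In use, the caller fills `back()`, calls `show()`, and can immediately start filling `back()` again for the next frame while the first is still being clocked out. The cost is RAM for a second full frame buffer.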
When adding LEDs that use SPI with hardware DMA:
When FastLED.show(), for each set of LEDs that use this method:
What the FastSPI class would then do:
Other potential benefits
Interrupts need not be disabled.
Once the full-update buffer is set up, the SPI hardware will ensure the bits are sent at the selected clock speed, increasing reliability of SPI updates for long LED strings, even when other non-blockable interrupts occur (I'm looking at those shiny new "integrated BLE" chipsets <cough>nRF52</cough>).
If the SPI chipset supports the exact speed needed for a clockless chip, it may be possible to reliably control the clockless chip, even with interrupts enabled.
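One common way to do this is to expand every data bit into a fixed pattern of SPI bits, so that the SPI clock defines the pulse widths. The sketch below assumes a WS2812-style chip, an SPI clock near 2.4 MHz (three SPI bits per 1.25 µs data bit), and the patterns `110` for a one and `100` for a zero. The function name is made up for illustration, and real hardware would also need MOSI to idle low and a low period after the data for the latch/reset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand each data bit (MSB first) into three SPI bits: 1 -> 110, 0 -> 100.
// The output is three times the size of the input.
std::vector<uint8_t> encodeClocklessAsSpi(const uint8_t* data, size_t len) {
    assert(data != nullptr || len == 0);
    std::vector<uint8_t> out;
    out.reserve(len * 3);
    uint32_t acc = 0;  // bit accumulator; only the low `nbits` bits are pending
    int nbits = 0;
    for (size_t i = 0; i < len; ++i) {
        for (int bit = 7; bit >= 0; --bit) {
            uint32_t pattern = ((data[i] >> bit) & 1) ? 0x6u : 0x4u;
            acc = (acc << 3) | pattern;
            nbits += 3;
            while (nbits >= 8) {  // flush whole bytes as they become available
                out.push_back(static_cast<uint8_t>(acc >> (nbits - 8)));
                nbits -= 8;
            }
        }
    }
    return out;
}
```

The encoded buffer could then be handed to a DMA transfer as a single block, which is where the RAM cost (3x per LED byte) comes from.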
Does this framework (or something similar) already exist? I'm only familiar with the nRF51 codebase (and the nRF52 chips), but it appears to support only single-byte SPI transactions....
This is basically what the OctoWS2811 driver does - but for clockless chipsets. I do want to do a framework for a variety of DMA based outputs - but that's further down the line.
What I found, though, is that with chipsets like the APA102, the time spent scaling/dithering was roughly in line with the time it took to write out a byte. The Teensy 3.x SPI output does a lot of this: it pushes out data, then scales/dithers the next chunk while that data is being written out, basically parallelizing the scaling/dithering with the output writing. For the most part, the benefit that would've come from using DMA was more theoretical than not, even more so when considering that the clockless chips still reign supreme and the APA102 hasn't quite managed to kick them out - and with the clockless chips, parallel output is a far better option (doubly so if you can get DMA'd parallel output, like with the OctoWS2811 on the Teensy 3.x).
Since I'm completely redoing the controller architecture as part of the rgbw/16-bit work, I'm going to be putting in a framework for playing with DMA output on platforms/systems that would support it.
(Also - anyone who has tried to drive APA102s at over 12MHz is laughing hysterically at the "reliable transmission" line : )
Also also - the DMA mechanism for every chipset is wildly different in configuration/operation - it makes GPIO access look downright uniformly defined!
Heh, reliable as in clocking, and even 4MHz SPI is a win. 8MHz is expected to be more reliable on the nRF52 by setting a higher-strength drive mode (both to zero and to one), to tighten timing. Wiring, pin selection, and the like are another thing … but getting clocking out of the way?
Teensy 3... arm\KL66? That state-machine based Write template is scary excellent!
Comments to understand what the code is doing, what the flags correspond to, and what the (implied) contracts are … those are unfortunately missing. This makes it really hard for someone who isn't deeply familiar with the K66 registers to make a mental model of what's happening. I hope you won't mind questions?
writeWord() appears to do the following:
Final q: wait1() appears to wait until at least one of the flags 0x8000, 0x4000, or 0x2000 in SPIX_SR is set.... but what does that represent? (e.g., transfer started, tx complete, transfer complete?)
So many great optimization tricks in that code. What I write will be unlikely to enable the compiler to collapse as much as K66's code -- I can see it when it's written, but lack the experience to know the tricks to try. Hopefully, having a slower working (correct) code base will enable the optimizations to be added later, by optimization experts such as yourself. :)
Unfortunately - I did a lot of the work on register-heavy code like that with the KL66 data sheet in one hand, a scope in the other, and a keyboard/compiler in the rest. I'll answer some of these at random, but don't have the time at the moment to go digging through the code and the data sheet :)
FMSZ -- "FraMe SiZe" -- the writeBit method resets the frame size to 1 bit and then writes that bit out. This is necessary for irritating chipsets like the SM16756 (I think that's the one) that have a 1-bit header at the start of each frame (facepalm)
D::PostBlock is to account for different chipsets that might have different end-frame activities (before releasing the SPI hardware).
Also - on reliable clocking - the problem with the APA102 is the signal regeneration happening in each chip - in which case, it doesn't matter what you do with the clocking you give it, it's going to mess you up if your strip is long enough : )
What the KL66 has is a small queue (8 bytes? I forget) that you can pre-buffer for the SPI output - think of it as small-scale DMA. Trying to optimize this is where I discovered that the scaling/dithering was more of a bottleneck than writing the LED data, which made fully implementing DMA seem less urgent.
waitFully is doing two things: first it waits for the queued-up bytes to be fully sent over to the shift register for the SPI output, then once that's done it waits for the shift register to be fully written out. This is as opposed to wait, which just waits for there to be room in the queue to add data to get shifted out.
writeByte does the waiting itself:
Note that it waits for there to be space to write something - it doesn't wait before returning (because, well, time spent waiting is time that'd better be spent prepping the next set of data for writing : )
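The difference between the two waits can be shown with a toy model. This is not the K66 register interface or the FastLED code: `ToySpi`, its queue depth of 4, and the `tick()` method that simulates hardware progress are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Toy model: a small transmit queue feeding a one-byte shift register.
struct ToySpi {
    static constexpr size_t kDepth = 4;  // assumed queue depth
    std::deque<uint8_t> queue;           // bytes waiting for the shift register
    bool shifting = false;               // a byte is currently being clocked out
    uint8_t shiftReg = 0;
    std::vector<uint8_t> wire;           // bytes that have fully left the chip
    size_t stallTicks = 0;               // how long the CPU spent waiting

    // One unit of simulated hardware progress: finish the byte in flight,
    // then load the next queued byte into the shift register.
    void tick() {
        if (shifting) { wire.push_back(shiftReg); shifting = false; }
        if (!queue.empty()) {
            shiftReg = queue.front();
            queue.pop_front();
            shifting = true;
        }
    }

    // wait(): return as soon as there is room for one more byte.
    void wait() {
        while (queue.size() >= kDepth) { tick(); ++stallTicks; }
    }

    // waitFully(): queue drained AND the last byte shifted out.
    void waitFully() {
        while (!queue.empty() || shifting) { tick(); ++stallTicks; }
    }

    // writeByte(): wait for room, enqueue, and return without waiting again,
    // so the caller can prepare the next byte while this one goes out.
    void writeByte(uint8_t b) {
        wait();
        queue.push_back(b);
        assert(queue.size() <= kDepth);
    }
};
```

In this model, the first four `writeByte` calls return without stalling, the fifth stalls for one tick, and only `waitFully()` guarantees that every byte has actually reached the wire.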
Wonderful! Thank you. Knowing the implied contracts is super-helpful:
Of course, the buffer here is either one or two bytes at a time.
IIRC, the APA102's signal regeneration inverts the clock signal … a neat hack. It's a long shot, as I'd be shocked if it weren't true, but … does FastLED ensure both the high and low clock signal times are substantially equivalent? If not, I could see that causing all sorts of corruption, as every other APA102 chip might be dealing with timings that are somewhat off?
Do I recall correctly that the APA102 chips also need the clock to continue for a bit after the data was sent? I seem to recall something quirky there, especially when exceeding 64(?) LEDs....
For hardware SPI - that's up to the SPI hardware to define the shape of the clock - but https://www.pjrc.com/why-apa102-leds-have-trouble-at-24-mhz/ talks more about what's happening with the APA102's clock.
As for the clock cycling at the end of a frame, that's what the APA102 chipset end boundary is for:
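For a rough sense of the numbers, here is an illustrative sketch (not the FastLED end-boundary code). It assumes the commonly cited explanation that each APA102 delays the data by half a clock period, so n LEDs need about n/2 extra clock cycles after the last pixel; a 32-bit end frame would then cover up to 64 LEDs. The function names and the 0xFF fill byte (taken from the datasheet's end frame) are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes to clock out after the last LED, assuming n LEDs need at least
// n/2 extra clock cycles. Never fewer than the datasheet's 32-bit end frame.
size_t apa102EndFrameBytes(size_t numLeds) {
    size_t extraBits = (numLeds + 1) / 2;  // ceil(n / 2)
    size_t bytes = (extraBits + 7) / 8;    // round up to whole bytes
    return bytes < 4 ? 4 : bytes;
}

// Build a full frame: 32-bit start frame, one 4-byte word per LED, end frame.
// `rgb` holds numLeds triples in R, G, B order; brightness5 is 0..31.
std::vector<uint8_t> apa102Frame(const uint8_t* rgb, size_t numLeds,
                                 uint8_t brightness5) {
    assert(brightness5 <= 31);
    assert(rgb != nullptr || numLeds == 0);
    std::vector<uint8_t> out(4, 0x00);  // start frame: 32 zero bits
    for (size_t i = 0; i < numLeds; ++i) {
        out.push_back(static_cast<uint8_t>(0xE0 | brightness5));  // 111 + brightness
        out.push_back(rgb[3 * i + 2]);  // blue
        out.push_back(rgb[3 * i + 1]);  // green
        out.push_back(rgb[3 * i + 0]);  // red
    }
    out.insert(out.end(), apa102EndFrameBytes(numLeds), static_cast<uint8_t>(0xFF));
    return out;
}
```

Under these assumptions, a 144-LED strip needs 72 extra clock cycles, which rounds up to 9 end-frame bytes rather than the 4 in the datasheet.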