-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
spi: Add SPI block-write to C++ and HAL for performance #4207
Conversation
+1 for the change, however this is a breaking change (should be noted here and even in the commit message), so users updating to the newer version would be aware of this. What about the rest of drivers? I2C has the same problem (write for one byte locks/unlocks). |
Is the point to then allow external synchronization around CS and override lock/unlock? The performance increase doesn't look much better but I don't know what test was used. Looks like write still calls acquire which still calls lock/unlock on a per write basis. Would having a block write addition not provide more performance? |
A nit but might consider updating the protected member naming typo (not your addition)? https://github.com/geky/mbed/blob/863717f427cb3859de5bd7091200c52d09f6ebc1/drivers/SPI.h#L263 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I would be curious to see what the performance of a block write function is compared to the single with and without a lock.
@0xc0170 This might not be as critical for I2C since it typically runs at a lower frequency than SPI.
drivers/SPI.cpp
Outdated
#if MBED_CONF_RTOS_PRESENT | ||
MBED_ASSERT(!_mutex_owner || _mutex_owner == rtos::Thread::gettid()); | ||
#endif | ||
|
||
aquire(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this also locks and unlocks the mutex. You might be able to get more performance if the lock is used only if owner is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, looks like you and @sg- are right, will update to your suggestion
863717f
to
d883f7c
Compare
Sorry, I should have clarified, this was tested by timing the With the locks removed from acquire this is now:
I2C should probably be updated as well for consistency, but it is less priority for the reason @c1728p9 mentioned. I'll see what benefit there might be for a block write for the K64F, but that would be a much bigger HAL change. |
@geky I didn't mean test block write at the HAL level. I just meant calling spi_master_write in a loop with a block write function. That way the lock, acquire and class call are needed only once for the block. |
drivers/SPI.h
Outdated
@@ -115,7 +115,12 @@ class SPI { | |||
* | |||
* @returns | |||
* Response from the SPI slave | |||
*/ | |||
* | |||
* @note |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this note also relevant for transfer() function? As acquire was changed?
drivers/SPI.cpp
Outdated
SingletonPtr<PlatformMutex> SPI::_mutex; | ||
osThreadId _mutex_owner = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this global and not static
? _mutex
is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh huh, I must have just missed that completely. Can update for consistency.
@sg- and @c1728p9 What do you think about this patch (after all the comments), what's the step forward? We shall address this low speed problem. As mentioned already we got some options.
This would be one - having block write as i2c has that would have lock/unlock. Or as is it here, lock/unlock is up to an app via lock/unlock methods, they are not invoked within write and acquire.
Would it? Having write() method that just wraps around single-byte write() with locking? |
As @c1728p9 pointed out, I was confused on what layer the block write was inteded for. Here's the measurements for the
The block-writes and byte-writes without locking are comparable. There are still some patterns that don't map to block writes, such as sync patterns and commands, but these are a low percentage of SPI traffic. I personally think taking in both removing the locking from byte-writes and the block-writes would be beneficial to have, so I don't really have a preference. Do we want to go with the block-writes instead? |
Thanks @geky for testing all options ! Having extending SPI by block write would be a way to go now (not a breaking change, current users who are affected by this slow speed would use it) and leaving byte write/read as they are for now (note can be added to the docs if we havent yet, something along the lines - byte writes in the loops might be slow due to locking, use block write instead). +1 for all this work ! |
bump @geky Where is this going? |
Yep, can this make mbedOS 5.5 release? We should try to. |
you should also call acquire in the method lock, i have a bug (weird signal while switching owner of spi) |
Over in atmel-rf-driver, we're having problems with too-slow SPI, and in our case we have already disabled the locks via override, but we're still suffering from the time to setup the hardware on each SPI call to the ST driver. From that, it appears we do require block read/write, but actually extended into the HAL itself, to avoid programming lots of SPI peripheral registers per byte transferred. The extension could be signalled by adding a DEVICE_SPI_BLOCK_WRITE, and giving SPI.cpp a compatibility fallback via spi_master_write if not provided. (We'll only be doing 2-byte transfers for register accesses, but still it would be a significant gain over 1-byte transfers). So my request is that if you're adding that new write method, please provide the hook for a spi_master_block_write(). |
@geky bump |
Do we need another HAL function if we already got 2 that might be sufficient for this (1byte blocking write, block async write ) or not? Would write block method using async HAL spi write function be slow as well? |
If you can write your block API in terms of the async HAL API, then I guess that's fine. I haven't really dug through to figure out what the potential traps are there, if any. Would it preclude usage from interrupt? (Not that we do that any more). In our particular case, it turns out all transfers in the time-critical path are only 2 bytes, so if there was significant setup overhead for the async, it might not work out. But for the ST driver it seems that their sync and async HAL calls are very similar (which is indeed why the sync setup is so slow), so maybe it would be equivalent. As noted on the other discussion, the other pitfall is we need inter-byte gaps during bursts (250ns - about 2 bits). There's no current API to control that, afaict, so we'd probably still need to tweak the HAL. |
Thanks @kjbracey-arm for the inputs, this should allows us to find a solution to this. I believe async would be too overkill, same for the new HAL function.
This is an issue for ST spi implementation in this case, I assume that other targets do not have this problem or might have. Thus we will have a look at this (we can write simple test to measure time how long it takes for the transmission for targets). The assumption for 1 byte spi write should be - this function should be fairly small/fast and in most cases possible to be inlined (assuming lock done, SPI peripheral configured (we got that SPI* owner there in SPI class)). In case it is slow, hitting this type of problems, the implementation should be looked at. We had something similar with GPIO, when some targets have drivers that takes long times to set gpio output high/low. @Sissors made a very nice timing GPIO table. We might need something like that for SPI byte write. Based on that table, some targets were improved. The way forward then would be:
Does it sound good? |
Sounds reasonable to me. If you look at the discussion on the Atmel driver PR, ST have already managed some prototype improvements - enough to suffice for our application, I believe. The graphs there comparing K64F and ST are nice to visualise. We do have some larger transfers - the radio data itself - 6 to 125 bytes. Those aren't speed-critical though, and possibly would be better served as async anyway. |
d883f7c
to
5256742
Compare
This looks nice! b as in bits or b as in Bytes?
|
Why was These were options I believe @geky considered:
The speed increase in K64F for block write is because their API is suited for block writes (invoking quite some state machine there and all the preparation for it to set up transfer attributes), thus invoking DSPI_MasterTransferBlocking - is not using DMA, please correct the commit. This is polling blocking call. they provide own DMA (its separate implementation) and I believe we don't want to add anything dma related before we got DMA story ready. |
b as in bits
Ah! Looks like you're right! My bad. That's a surprising speedup for software. It looks like they are pushing 16-bits to the hardware at a time so that may be some of it. I'll correct the commit message.
I'll try to add more to the commits, the HAL function ( There is an easy default implementation of
That sounds like a good idea. I was trying to only change a small amount, but I can update the pr to deprecate |
Lets not deprecate HAL things until we actually have a specification to work against. I'd consider this benign addition as welcomed given good doxygen description and that support was added for all targets. |
Thanks for the explanation. Agree that deprecation later, just adding it into consideration ! LGTM Please @geky fix the CI, this would be nice to have for mbed OS 5.5 To notify, please read the above few comments what this brings into HAL and how the speed can be improved by new implementation: cc @LMESTM @stevew817 @nvlsianpu @TomoYamanaka @khj098765 @kgills @ccli8 @radhika-raghavendran |
33e9129
to
a2f480f
Compare
virtual int write(const char *tx_buffer, int tx_length, char *rx_buffer, int rx_length); The main benefit of block-level SPI writes is the performance improvement from not acquiring a mutex lock between each byte sent on the SPI bus. The block write may also be poked through the hal level for additional speed improvements.
Poking the block-level SPI writes through the HAL level gives more freedom to targets to improve SPI transaction speed.
There is an easy default implementation of spi_master_block_write that just calls spi_master_write in a loop, so the default implementation of spi_master_block_write has been added to all targets.
93244a3
to
7369919
Compare
performance improvements: naive block writes 3.997Mbps DSPI block writes 17.809Mbps
7369919
to
e352f1b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi - it looks to me like target_stm changes are missing:
targets/TARGET_STM/stm_spi_api.c (pending action to rename the file). Anyway that's a single file to update for covering all the STM32 targets.
There is an easy default implementation of spi_master_block_write that just calls spi_master_write in a loop, so the default implementation of spi_master_block_write has been added to all targets.
/morph test |
Result: FAILUREYour command has finished executing! Here's what you wrote!
OutputExample Build failed! |
/morph test |
1 similar comment
/morph test |
Result: SUCCESSYour command has finished executing! Here's what you wrote!
OutputAll builds and test passed! |
This is based on previous work done by @simonqhughes to speed up the SD card implementation, it has a wider impact than just the SD card, so I'm making this pr against the current SPI implementation.
Replaced with assertion on current mutex owner. This requires locking in the caller, but without external locking the current behaviour is still at risk of contention for transactions > 1 byte.
Tested with sd-driver basic tests:
Note: This does create the concern that single-byte writes are no longer thread-safe when used across multiple threads. However, there is very few use cases where single-byte SPI transactions are used, and most transactions already have external locking.
cc @c1728p9, @sg-
EDIT:
Sorry about the delay, got dragged into some other issues.
I have more data for you all, there's still quite a few patches I need to put up places but I have a better test that measure block writes vs block reads vs spi traffic (this is all with the SD card). I was also tinkering with the K64F's
dmadspi api and put together a block_write like the one @kjbracey-arm was suggesting that takes advantage of the K64F'sdmadspi api (this was before the talk of using the async api).Anyways here's what I found with GCC_ARM and K64F at 40MHz:
There is definitely motivation for block writes at the hal level in terms of raw SPI throughput.
As for why we don't see a proportional increase in throughput on the SD card, well I think this image sums it up quite nicely:
![image](https://cloud.githubusercontent.com/assets/1075160/26221504/624a3568-3bdc-11e7-8bc9-53aecc682151.png)
EDIT: EDIT:
I've gone ahead and updated this pr with the hal-level block writes, a default implementation in all of the target's spi implementations, and the example block write implementation using
DMAthe DSPI API in the k64f (33e9129) that I used for evaluation.I've gone ahead and updated the PR description and title to reflect this.