spi: Add SPI block-write to C++ and HAL for performance #4207

geky · 2017-04-20T23:03:45Z

This is based on previous work done by @simonqhughes to speed up the SD card implementation, it has a wider impact than just the SD card, so I'm making this pr against the current SPI implementation.

Replaced with assertion on current mutex owner. This requires locking in the caller, but without external locking the current behaviour is still at risk of contention for transactions > 1 byte.

Tested with sd-driver basic tests:

	1MHz SPI	40 MHz SPI
with byte-locking	32.911s	25.386s
without byte-locking	25.386s	15.977s

Note: This does create the concern that single-byte writes are no longer thread-safe when used across multiple threads. However, there is very few use cases where single-byte SPI transactions are used, and most transactions already have external locking.

cc @c1728p9, @sg-

EDIT:

Sorry about the delay, got dragged into some other issues.

I have more data for you all, there's still quite a few patches I need to put up places but I have a better test that measure block writes vs block reads vs spi traffic (this is all with the SD card). I was also tinkering with the K64F's ~~dma~~ dspi api and put together a block_write like the one @kjbracey-arm was suggesting that takes advantage of the K64F's ~~dma~~ dspi api (this was before the talk of using the async api).

Anyways here's what I found with GCC_ARM and K64F at 40MHz:

	SD write speed	SD read speed	SPI speed
with byte-writes	463.263Kbps	792.419Kbps	913.783Kbps
with block-writes	757.669Kbps	2.338Mbps	3.997Mbps
with hal block-writes	878.342Kbps	4.239Mbps	17.809Mbps

There is definitely motivation for block writes at the hal level in terms of raw SPI throughput.

As for why we don't see a proportional increase in throughput on the SD card, well I think this image sums it up quite nicely:

EDIT: EDIT:

I've gone ahead and updated this pr with the hal-level block writes, a default implementation in all of the target's spi implementations, and the example block write implementation using ~~DMA~~ the DSPI API in the k64f (33e9129) that I used for evaluation.

I've gone ahead and updated the PR description and title to reflect this.

0xc0170 · 2017-04-21T12:04:13Z

+1 for the change, however this is a breaking change (should be noted here and even in the commit message), so users updating to the newer version would be aware of this.

What about the rest of drivers? I2C has the same problem (write for one byte locks/unlocks).

sg- · 2017-04-21T13:43:20Z

Is the point to then allow external synchronization around CS and override lock/unlock? The performance increase doesn't look much better but I don't know what test was used. Looks like write still calls acquire which still calls lock/unlock on a per write basis.

Would having a block write addition not provide more performance?

sg- · 2017-04-21T13:52:40Z

A nit but might consider updating the protected member naming typo (not your addition)? https://github.com/geky/mbed/blob/863717f427cb3859de5bd7091200c52d09f6ebc1/drivers/SPI.h#L263

c1728p9

Looks good to me. I would be curious to see what the performance of a block write function is compared to the single with and without a lock.

@0xc0170 This might not be as critical for I2C since it typically runs at a lower frequency than SPI.

c1728p9 · 2017-04-21T17:57:54Z

drivers/SPI.cpp

+#if MBED_CONF_RTOS_PRESENT
+    MBED_ASSERT(!_mutex_owner || _mutex_owner == rtos::Thread::gettid());
+#endif
+
    aquire();


I think this also locks and unlocks the mutex. You might be able to get more performance if the lock is used only if owner is wrong.

Ah, looks like you and @sg- are right, will update to your suggestion

geky · 2017-04-21T20:57:28Z

Sorry, I should have clarified, this was tested by timing the sd-driver-features-tests-filesystem-basic set of tests, so there is other filesystem operations going on:
https://github.com/ARMmbed/sd-driver/blob/master/features/TESTS/filesystem/basic/basic.cpp

With the locks removed from acquire this is now:

	1MHz SPI	40 MHz SPI
with byte-locking	32.911s	21.264s
without byte-locking	25.386s	10.816s

I2C should probably be updated as well for consistency, but it is less priority for the reason @c1728p9 mentioned.

I'll see what benefit there might be for a block write for the K64F, but that would be a much bigger HAL change.

c1728p9 · 2017-04-23T19:43:54Z

@geky I didn't mean test block write at the HAL level. I just meant calling spi_master_write in a loop with a block write function. That way the lock, acquire and class call are needed only once for the block.

0xc0170 · 2017-04-24T07:05:00Z

drivers/SPI.h

@@ -115,7 +115,12 @@ class SPI {
     *
     *  @returns
     *    Response from the SPI slave
-    */
+     *
+     *  @note


Is this note also relevant for transfer() function? As acquire was changed?

0xc0170 · 2017-04-24T07:09:43Z

drivers/SPI.cpp

 SingletonPtr<PlatformMutex> SPI::_mutex;
+osThreadId _mutex_owner = NULL;


Why is this global and not static ? _mutex is.

Oh huh, I must have just missed that completely. Can update for consistency.

0xc0170 · 2017-04-24T14:51:24Z

@sg- and @c1728p9 What do you think about this patch (after all the comments), what's the step forward? We shall address this low speed problem. As mentioned already we got some options.

Would having a block write addition not provide more performance?

This would be one - having block write as i2c has that would have lock/unlock. Or as is it here, lock/unlock is up to an app via lock/unlock methods, they are not invoked within write and acquire.

I'll see what benefit there might be for a block write for the K64F, but that would be a much bigger HAL change.

Would it? Having write() method that just wraps around single-byte write() with locking?

geky · 2017-04-25T00:50:13Z

As @c1728p9 pointed out, I was confused on what layer the block write was inteded for.

Here's the measurements for the sd-driver-features-tests-filesystem-basic tests with a block write implementation in the SPI class (here):

	1MHz SPI	40 MHz SPI
with byte-locking	32.911s	21.264s
without byte-locking	25.386s	10.816s
with block-writes	24.655s	10.792s

The block-writes and byte-writes without locking are comparable. There are still some patterns that don't map to block writes, such as sync patterns and commands, but these are a low percentage of SPI traffic.

I personally think taking in both removing the locking from byte-writes and the block-writes would be beneficial to have, so I don't really have a preference. Do we want to go with the block-writes instead?

0xc0170 · 2017-04-25T08:29:19Z

Thanks @geky for testing all options !

Having extending SPI by block write would be a way to go now (not a breaking change, current users who are affected by this slow speed would use it) and leaving byte write/read as they are for now (note can be added to the docs if we havent yet, something along the lines - byte writes in the loops might be slow due to locking, use block write instead).

+1 for all this work !

theotherjimmy · 2017-05-02T16:12:51Z

bump @geky Where is this going?

JanneKiiskila · 2017-05-04T11:28:44Z

Yep, can this make mbedOS 5.5 release? We should try to.

YarivCol · 2017-05-12T11:53:03Z

you should also call acquire in the method lock, i have a bug (weird signal while switching owner of spi)
this bug is caused from selecting device(cs is 0) and than acquiring the spi bus this result in weird signal(change of spi format) seen by the device to solve this issue i suggest to call acquire in the method lock to insure that acquire of the spi bus will happen before device is selected ( cs =0)

kjbracey · 2017-05-15T13:28:50Z

Over in atmel-rf-driver, we're having problems with too-slow SPI, and in our case we have already disabled the locks via override, but we're still suffering from the time to setup the hardware on each SPI call to the ST driver.

From that, it appears we do require block read/write, but actually extended into the HAL itself, to avoid programming lots of SPI peripheral registers per byte transferred.

The extension could be signalled by adding a DEVICE_SPI_BLOCK_WRITE, and giving SPI.cpp a compatibility fallback via spi_master_write if not provided.

(We'll only be doing 2-byte transfers for register accesses, but still it would be a significant gain over 1-byte transfers).

So my request is that if you're adding that new write method, please provide the hook for a spi_master_block_write().

adbridge · 2017-05-17T14:40:41Z

@geky bump

0xc0170 · 2017-05-18T09:06:53Z

So my request is that if you're adding that new write method, please provide the hook for a spi_master_block_write().

Do we need another HAL function if we already got 2 that might be sufficient for this (1byte blocking write, block async write ) or not? Would write block method using async HAL spi write function be slow as well?

kjbracey · 2017-05-18T09:37:30Z

If you can write your block API in terms of the async HAL API, then I guess that's fine. I haven't really dug through to figure out what the potential traps are there, if any. Would it preclude usage from interrupt? (Not that we do that any more).

In our particular case, it turns out all transfers in the time-critical path are only 2 bytes, so if there was significant setup overhead for the async, it might not work out. But for the ST driver it seems that their sync and async HAL calls are very similar (which is indeed why the sync setup is so slow), so maybe it would be equivalent.

As noted on the other discussion, the other pitfall is we need inter-byte gaps during bursts (250ns - about 2 bits). There's no current API to control that, afaict, so we'd probably still need to tweak the HAL.

0xc0170 · 2017-05-18T13:54:05Z

Thanks @kjbracey-arm for the inputs, this should allows us to find a solution to this. I believe async would be too overkill, same for the new HAL function.

Over in atmel-rf-driver, we're having problems with too-slow SPI, and in our case we have already disabled the locks via override, but we're still suffering from the time to setup the hardware on each SPI call to the ST driver.

This is an issue for ST spi implementation in this case, I assume that other targets do not have this problem or might have. Thus we will have a look at this (we can write simple test to measure time how long it takes for the transmission for targets). The assumption for 1 byte spi write should be - this function should be fairly small/fast and in most cases possible to be inlined (assuming lock done, SPI peripheral configured (we got that SPI* owner there in SPI class)). In case it is slow, hitting this type of problems, the implementation should be looked at.

We had something similar with GPIO, when some targets have drivers that takes long times to set gpio output high/low. @Sissors made a very nice timing GPIO table. We might need something like that for SPI byte write. Based on that table, some targets were improved.

The way forward then would be:

implement SPI write block method (this PR)
timing for SPI byte write for targets for comparison (to be done). Based on this, we can look at targets that are much slower compare to others

Does it sound good?

kjbracey · 2017-05-18T14:05:25Z

Sounds reasonable to me. If you look at the discussion on the Atmel driver PR, ST have already managed some prototype improvements - enough to suffice for our application, I believe. The graphs there comparing K64F and ST are nice to visualise.

We do have some larger transfers - the radio data itself - 6 to 125 bytes. Those aren't speed-critical though, and possibly would be better served as async anyway.

geky · 2017-05-18T20:15:04Z

Sorry about the delay, got dragged into some other issues.

I have more data for you all, there's still quite a few patches I need to put up places but I have a better test that measure block writes vs block reads vs spi traffic (this is all with the SD card). I was also tinkering with the K64F's dma and put together a block_write like the one @kjbracey-arm was suggesting that takes advantage of the K64F's dma (this was before the talk of using the async api).

Anyways here's what I found with GCC_ARM and K64F at 40MHz:

	SD write speed	SD read speed	SPI speed
with byte-writes	463.263Kbps	792.419Kbps	913.783Kbps
with block-writes	757.669Kbps	2.338Mbps	3.997Mbps
with hal block-writes	878.342Kbps	4.239Mbps	17.809Mbps

There is definitely motivation for block writes at the hal level in terms of raw SPI throughput.

As for why we don't see a proportional increase in throughput on the SD card, well I think this image sums it up quite nicely:

sg- · 2017-05-25T02:52:36Z

This looks nice! b as in bits or b as in Bytes?

DMA block writes 17.809Mbps

0xc0170 · 2017-05-25T11:21:51Z

Why was write method changed to use HAL (later added HAL write block) ? There's a lot of copy-paste now. What's the story there? (I am missing it from this PR/commits). Should we consider deprecating spi_master_write as it does not make sense much to have it, this new API supersedes it once implemented for all HAL?

These were options I believe @geky considered:

provide new blocking HAL master write blocks - this is what is now in this PR, and possibly deprecate spi_master_write (I haven't checked all targets, I assume those that use own drivers, would have block write rather than single write. Thus in many cases would be duplicated effort when porting a new target)
make spi_master_write fast as much as possible and use that (this one is tough would require some changes thus the first one is more appealing)

The speed increase in K64F for block write is because their API is suited for block writes (invoking quite some state machine there and all the preparation for it to set up transfer attributes), thus invoking spi_master_write for each byte takes too much time. This is a nice showcase that they could have implemented it via lower level to make it faster but that would loose benefit of having own driver.

DSPI_MasterTransferBlocking - is not using DMA, please correct the commit. This is polling blocking call. they provide own DMA (its separate implementation) and I believe we don't want to add anything dma related before we got DMA story ready.

geky · 2017-05-25T16:11:23Z

This looks nice! b as in bits or b as in Bytes?

b as in bits

DSPI_MasterTransferBlocking - is not using DMA, please correct the commit.

Ah! Looks like you're right! My bad. That's a surprising speedup for software. It looks like they are pushing 16-bits to the hardware at a time so that may be some of it. I'll correct the commit message.

Why was write method changed to use HAL (later added HAL write block) ? There's a lot of copy-paste now. What's the story there?

I'll try to add more to the commits, the HAL function (spi_master_block_write) was added to give targets the ability to optimize block writes (example k64f, and as @kjbracey-arm noted this is useful for ST). I left the commit that just adds the C++ block write since the HAL addition isn't required for the C++ api change.

There is an easy default implementation of spi_master_block_write that just calls spi_master_write in a loop, so I added the default spi_master_block_write to all of the targets. These can be left up to the targets to optimize (for example k64f has been updated to use DSPI_MasterTransferBlocking).

Should we consider deprecating spi_master_write as it does not make sense much to have it, this new API supersedes it once implemented for all HAL?

That sounds like a good idea. I was trying to only change a small amount, but I can update the pr to deprecate spi_master_write. Since this is in the HAL, should it still use MBED_DEPRECATED_SINCE?

sg- · 2017-05-25T16:31:54Z

That sounds like a good idea. I was trying to only change a small amount, but I can update the pr to deprecate spi_master_write. Since this is in the HAL, should it still use MBED_DEPRECATED_SINCE?

Lets not deprecate HAL things until we actually have a specification to work against. I'd consider this benign addition as welcomed given good doxygen description and that support was added for all targets.

0xc0170 · 2017-05-25T16:46:36Z

Thanks for the explanation. Agree that deprecation later, just adding it into consideration !

LGTM

Please @geky fix the CI, this would be nice to have for mbed OS 5.5

To notify, please read the above few comments what this brings into HAL and how the speed can be improved by new implementation: cc @LMESTM @stevew817 @nvlsianpu @TomoYamanaka @khj098765 @kgills @ccli8 @radhika-raghavendran

virtual int write(const char *tx_buffer, int tx_length, char *rx_buffer, int rx_length); The main benefit of block-level SPI writes is the performance improvement from not acquiring a mutex lock between each byte sent on the SPI bus. The block write may also be poked through the hal level for additional speed improvements.

Poking the block-level SPI writes through the HAL level gives more freedom to targets to improve SPI transaction speed.

There is an easy default implementation of spi_master_block_write that just calls spi_master_write in a loop, so the default implementation of spi_master_block_write has been added to all targets.

performance improvements: naive block writes 3.997Mbps DSPI block writes 17.809Mbps

LMESTM

hi - it looks to me like target_stm changes are missing:
targets/TARGET_STM/stm_spi_api.c (pending action to rename the file). Anyway that's a single file to update for covering all the STM32 targets.

There is an easy default implementation of spi_master_block_write that just calls spi_master_write in a loop, so the default implementation of spi_master_block_write has been added to all targets.

0xc0170 · 2017-05-31T09:59:14Z

/morph test

mbed-bot · 2017-05-31T11:11:48Z

Result: FAILURE

Your command has finished executing! Here's what you wrote!

/morph test

Output

mbed Build Number: 377

Example Build failed!

sg- · 2017-05-31T16:14:06Z

/morph test

studavekar · 2017-06-01T05:55:29Z

/morph test

mbed-bot · 2017-06-01T08:05:28Z

Result: SUCCESS

Your command has finished executing! Here's what you wrote!

/morph test

Output

mbed Build Number: 393

All builds and test passed!

geky added the needs: review label Apr 20, 2017

TomoYamanaka mentioned this pull request Apr 21, 2017

Communication speed is very slow due to lock / unlock in 1 byte unit. #4009

Closed

c1728p9 approved these changes Apr 21, 2017

View reviewed changes

geky force-pushed the spi-remove-byte-locking branch from 863717f to d883f7c Compare April 21, 2017 20:50

0xc0170 reviewed Apr 24, 2017

View reviewed changes

theotherjimmy added needs: work and removed needs: review labels Apr 24, 2017

geky mentioned this pull request Apr 26, 2017

Poor stdio performance on μARM #440

Closed

0xc0170 mentioned this pull request May 15, 2017

ACK are missed on STM boards because of SPI exchange that are no fast enough PelionIoT/atmel-rf-driver#53

Closed

geky force-pushed the spi-remove-byte-locking branch from d883f7c to 5256742 Compare May 24, 2017 23:26

geky changed the title ~~drivers: Remove locking from SPI individual byte-writes for performance reasons~~ spi: Added SPI block-write to C++ and HAL for performance May 24, 2017

geky changed the title ~~spi: Added SPI block-write to C++ and HAL for performance~~ spi: Add SPI block-write to C++ and HAL for performance May 24, 2017

sg- added the release-version: 5.5.0-rc1 label May 25, 2017

geky force-pushed the spi-remove-byte-locking branch from 33e9129 to a2f480f Compare May 25, 2017 17:04

geky added 3 commits May 25, 2017 12:04

spi: Added block-level SPI writes at the HAL level

31bf0d5

Poking the block-level SPI writes through the HAL level gives more freedom to targets to improve SPI transaction speed.

spi: Added default spi_master_block_write implementation to all targets

c1de19e

There is an easy default implementation of spi_master_block_write that just calls spi_master_write in a loop, so the default implementation of spi_master_block_write has been added to all targets.

geky force-pushed the spi-remove-byte-locking branch 2 times, most recently from 93244a3 to 7369919 Compare May 25, 2017 17:08

k64f: Added SPI block write using DSPI_MasterTransferBlocking

e352f1b

performance improvements: naive block writes 3.997Mbps DSPI block writes 17.809Mbps

geky force-pushed the spi-remove-byte-locking branch from 7369919 to e352f1b Compare May 25, 2017 17:08

LMESTM suggested changes May 29, 2017

View reviewed changes

spi: Added default spi_master_block_write implementation to stm targets

e65bb8d

There is an easy default implementation of spi_master_block_write that just calls spi_master_write in a loop, so the default implementation of spi_master_block_write has been added to all targets.

LMESTM approved these changes May 31, 2017

View reviewed changes

0xc0170 added needs: CI and removed needs: work labels May 31, 2017

0xc0170 added ready for merge and removed needs: CI labels Jun 1, 2017

0xc0170 merged commit e229a49 into ARMmbed:master Jun 1, 2017

sg- removed the ready for merge label Jun 1, 2017

yogpan01 mentioned this pull request Jul 5, 2017

SD-Driver in mbed OS is having performance issue #4702

Closed

geky mentioned this pull request Dec 2, 2022

Discussion: SPI vs SDIO when using littlefs littlefs-project/littlefs#749

Closed

		SingletonPtr<PlatformMutex> SPI::_mutex;
		osThreadId _mutex_owner = NULL;

spi: Add SPI block-write to C++ and HAL for performance #4207

spi: Add SPI block-write to C++ and HAL for performance #4207

Conversation

geky commented Apr 20, 2017 • edited Loading

0xc0170 commented Apr 21, 2017

sg- commented Apr 21, 2017 • edited Loading

sg- commented Apr 21, 2017

c1728p9 left a comment

Choose a reason for hiding this comment

c1728p9 Apr 21, 2017

Choose a reason for hiding this comment

geky Apr 21, 2017

Choose a reason for hiding this comment

geky commented Apr 21, 2017

c1728p9 commented Apr 23, 2017

0xc0170 Apr 24, 2017

Choose a reason for hiding this comment

0xc0170 Apr 24, 2017

Choose a reason for hiding this comment

geky Apr 24, 2017

Choose a reason for hiding this comment

0xc0170 commented Apr 24, 2017 • edited Loading

geky commented Apr 25, 2017 • edited Loading

0xc0170 commented Apr 25, 2017

theotherjimmy commented May 2, 2017

JanneKiiskila commented May 4, 2017

YarivCol commented May 12, 2017

kjbracey commented May 15, 2017

adbridge commented May 17, 2017

0xc0170 commented May 18, 2017

kjbracey commented May 18, 2017

0xc0170 commented May 18, 2017 • edited Loading

kjbracey commented May 18, 2017

geky commented May 18, 2017 • edited Loading

sg- commented May 25, 2017

0xc0170 commented May 25, 2017

geky commented May 25, 2017

sg- commented May 25, 2017

0xc0170 commented May 25, 2017

LMESTM left a comment

Choose a reason for hiding this comment

0xc0170 commented May 31, 2017

mbed-bot commented May 31, 2017

Result: FAILURE

Output

sg- commented May 31, 2017

studavekar commented Jun 1, 2017

mbed-bot commented Jun 1, 2017

Result: SUCCESS

Output

geky commented Apr 20, 2017 •

edited

Loading

sg- commented Apr 21, 2017 •

edited

Loading

0xc0170 commented Apr 24, 2017 •

edited

Loading

geky commented Apr 25, 2017 •

edited

Loading

0xc0170 commented May 18, 2017 •

edited

Loading

geky commented May 18, 2017 •

edited

Loading