Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spi: Add SPI block-write to C++ and HAL for performance #4207

Merged
merged 5 commits into from
Jun 1, 2017

Conversation

geky
Copy link
Contributor

@geky geky commented Apr 20, 2017

This is based on previous work done by @simonqhughes to speed up the SD card implementation, it has a wider impact than just the SD card, so I'm making this pr against the current SPI implementation.

Replaced with assertion on current mutex owner. This requires locking in the caller, but without external locking the current behaviour is still at risk of contention for transactions > 1 byte.

Tested with sd-driver basic tests:

1MHz SPI 40 MHz SPI
with byte-locking 32.911s 25.386s
without byte-locking 25.386s 15.977s

Note: This does create the concern that single-byte writes are no longer thread-safe when used across multiple threads. However, there is very few use cases where single-byte SPI transactions are used, and most transactions already have external locking.

cc @c1728p9, @sg-


EDIT:

Sorry about the delay, got dragged into some other issues.

I have more data for you all, there's still quite a few patches I need to put up places but I have a better test that measure block writes vs block reads vs spi traffic (this is all with the SD card). I was also tinkering with the K64F's dma dspi api and put together a block_write like the one @kjbracey-arm was suggesting that takes advantage of the K64F's dma dspi api (this was before the talk of using the async api).

Anyways here's what I found with GCC_ARM and K64F at 40MHz:

SD write speed SD read speed SPI speed
with byte-writes 463.263Kbps 792.419Kbps 913.783Kbps
with block-writes 757.669Kbps 2.338Mbps 3.997Mbps
with hal block-writes 878.342Kbps 4.239Mbps 17.809Mbps

There is definitely motivation for block writes at the hal level in terms of raw SPI throughput.

As for why we don't see a proportional increase in throughput on the SD card, well I think this image sums it up quite nicely:
image


EDIT: EDIT:

I've gone ahead and updated this pr with the hal-level block writes, a default implementation in all of the target's spi implementations, and the example block write implementation using DMA the DSPI API in the k64f (33e9129) that I used for evaluation.

I've gone ahead and updated the PR description and title to reflect this.

@0xc0170
Copy link
Contributor

0xc0170 commented Apr 21, 2017

+1 for the change, however this is a breaking change (should be noted here and even in the commit message), so users updating to the newer version would be aware of this.

What about the rest of drivers? I2C has the same problem (write for one byte locks/unlocks).

@sg-
Copy link
Contributor

sg- commented Apr 21, 2017

Is the point to then allow external synchronization around CS and override lock/unlock? The performance increase doesn't look much better but I don't know what test was used. Looks like write still calls acquire which still calls lock/unlock on a per write basis.

Would having a block write addition not provide more performance?

@sg-
Copy link
Contributor

sg- commented Apr 21, 2017

A nit but might consider updating the protected member naming typo (not your addition)? https://github.com/geky/mbed/blob/863717f427cb3859de5bd7091200c52d09f6ebc1/drivers/SPI.h#L263

Copy link
Contributor

@c1728p9 c1728p9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I would be curious to see what the performance of a block write function is compared to the single with and without a lock.

@0xc0170 This might not be as critical for I2C since it typically runs at a lower frequency than SPI.

drivers/SPI.cpp Outdated
#if MBED_CONF_RTOS_PRESENT
MBED_ASSERT(!_mutex_owner || _mutex_owner == rtos::Thread::gettid());
#endif

aquire();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this also locks and unlocks the mutex. You might be able to get more performance if the lock is used only if owner is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, looks like you and @sg- are right, will update to your suggestion

@geky
Copy link
Contributor Author

geky commented Apr 21, 2017

Sorry, I should have clarified, this was tested by timing the sd-driver-features-tests-filesystem-basic set of tests, so there is other filesystem operations going on:
https://github.com/ARMmbed/sd-driver/blob/master/features/TESTS/filesystem/basic/basic.cpp

With the locks removed from acquire this is now:

1MHz SPI 40 MHz SPI
with byte-locking 32.911s 21.264s
without byte-locking 25.386s 10.816s

I2C should probably be updated as well for consistency, but it is less priority for the reason @c1728p9 mentioned.

I'll see what benefit there might be for a block write for the K64F, but that would be a much bigger HAL change.

@c1728p9
Copy link
Contributor

c1728p9 commented Apr 23, 2017

@geky I didn't mean test block write at the HAL level. I just meant calling spi_master_write in a loop with a block write function. That way the lock, acquire and class call are needed only once for the block.

drivers/SPI.h Outdated
@@ -115,7 +115,12 @@ class SPI {
*
* @returns
* Response from the SPI slave
*/
*
* @note
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this note also relevant for transfer() function? As acquire was changed?

drivers/SPI.cpp Outdated
SingletonPtr<PlatformMutex> SPI::_mutex;
osThreadId _mutex_owner = NULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this global and not static ? _mutex is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh huh, I must have just missed that completely. Can update for consistency.

@0xc0170
Copy link
Contributor

0xc0170 commented Apr 24, 2017

@sg- and @c1728p9 What do you think about this patch (after all the comments), what's the step forward? We shall address this low speed problem. As mentioned already we got some options.

Would having a block write addition not provide more performance?

This would be one - having block write as i2c has that would have lock/unlock. Or as is it here, lock/unlock is up to an app via lock/unlock methods, they are not invoked within write and acquire.

I'll see what benefit there might be for a block write for the K64F, but that would be a much bigger HAL change.

Would it? Having write() method that just wraps around single-byte write() with locking?

@geky
Copy link
Contributor Author

geky commented Apr 25, 2017

As @c1728p9 pointed out, I was confused on what layer the block write was inteded for.

Here's the measurements for the sd-driver-features-tests-filesystem-basic tests with a block write implementation in the SPI class (here):

1MHz SPI 40 MHz SPI
with byte-locking 32.911s 21.264s
without byte-locking 25.386s 10.816s
with block-writes 24.655s 10.792s

The block-writes and byte-writes without locking are comparable. There are still some patterns that don't map to block writes, such as sync patterns and commands, but these are a low percentage of SPI traffic.


I personally think taking in both removing the locking from byte-writes and the block-writes would be beneficial to have, so I don't really have a preference. Do we want to go with the block-writes instead?

@0xc0170
Copy link
Contributor

0xc0170 commented Apr 25, 2017

Thanks @geky for testing all options !

Having extending SPI by block write would be a way to go now (not a breaking change, current users who are affected by this slow speed would use it) and leaving byte write/read as they are for now (note can be added to the docs if we havent yet, something along the lines - byte writes in the loops might be slow due to locking, use block write instead).

+1 for all this work !

@theotherjimmy
Copy link
Contributor

bump @geky Where is this going?

@JanneKiiskila
Copy link
Contributor

Yep, can this make mbedOS 5.5 release? We should try to.

@YarivCol
Copy link
Contributor

you should also call acquire in the method lock, i have a bug (weird signal while switching owner of spi)
this bug is caused from selecting device(cs is 0) and than acquiring the spi bus this result in weird signal(change of spi format) seen by the device to solve this issue i suggest to call acquire in the method lock to insure that acquire of the spi bus will happen before device is selected ( cs =0)

@kjbracey
Copy link
Contributor

Over in atmel-rf-driver, we're having problems with too-slow SPI, and in our case we have already disabled the locks via override, but we're still suffering from the time to setup the hardware on each SPI call to the ST driver.

From that, it appears we do require block read/write, but actually extended into the HAL itself, to avoid programming lots of SPI peripheral registers per byte transferred.

The extension could be signalled by adding a DEVICE_SPI_BLOCK_WRITE, and giving SPI.cpp a compatibility fallback via spi_master_write if not provided.

(We'll only be doing 2-byte transfers for register accesses, but still it would be a significant gain over 1-byte transfers).

So my request is that if you're adding that new write method, please provide the hook for a spi_master_block_write().

@adbridge
Copy link
Contributor

@geky bump

@0xc0170
Copy link
Contributor

0xc0170 commented May 18, 2017

So my request is that if you're adding that new write method, please provide the hook for a spi_master_block_write().

Do we need another HAL function if we already got 2 that might be sufficient for this (1byte blocking write, block async write ) or not? Would write block method using async HAL spi write function be slow as well?

@kjbracey
Copy link
Contributor

If you can write your block API in terms of the async HAL API, then I guess that's fine. I haven't really dug through to figure out what the potential traps are there, if any. Would it preclude usage from interrupt? (Not that we do that any more).

In our particular case, it turns out all transfers in the time-critical path are only 2 bytes, so if there was significant setup overhead for the async, it might not work out. But for the ST driver it seems that their sync and async HAL calls are very similar (which is indeed why the sync setup is so slow), so maybe it would be equivalent.

As noted on the other discussion, the other pitfall is we need inter-byte gaps during bursts (250ns - about 2 bits). There's no current API to control that, afaict, so we'd probably still need to tweak the HAL.

@0xc0170
Copy link
Contributor

0xc0170 commented May 18, 2017

Thanks @kjbracey-arm for the inputs, this should allows us to find a solution to this. I believe async would be too overkill, same for the new HAL function.

Over in atmel-rf-driver, we're having problems with too-slow SPI, and in our case we have already disabled the locks via override, but we're still suffering from the time to setup the hardware on each SPI call to the ST driver.

This is an issue for ST spi implementation in this case, I assume that other targets do not have this problem or might have. Thus we will have a look at this (we can write simple test to measure time how long it takes for the transmission for targets). The assumption for 1 byte spi write should be - this function should be fairly small/fast and in most cases possible to be inlined (assuming lock done, SPI peripheral configured (we got that SPI* owner there in SPI class)). In case it is slow, hitting this type of problems, the implementation should be looked at.

We had something similar with GPIO, when some targets have drivers that takes long times to set gpio output high/low. @Sissors made a very nice timing GPIO table. We might need something like that for SPI byte write. Based on that table, some targets were improved.

The way forward then would be:

  • implement SPI write block method (this PR)
  • timing for SPI byte write for targets for comparison (to be done). Based on this, we can look at targets that are much slower compare to others

Does it sound good?

@kjbracey
Copy link
Contributor

Sounds reasonable to me. If you look at the discussion on the Atmel driver PR, ST have already managed some prototype improvements - enough to suffice for our application, I believe. The graphs there comparing K64F and ST are nice to visualise.

We do have some larger transfers - the radio data itself - 6 to 125 bytes. Those aren't speed-critical though, and possibly would be better served as async anyway.

@geky
Copy link
Contributor Author

geky commented May 18, 2017

Sorry about the delay, got dragged into some other issues.

I have more data for you all, there's still quite a few patches I need to put up places but I have a better test that measure block writes vs block reads vs spi traffic (this is all with the SD card). I was also tinkering with the K64F's dma and put together a block_write like the one @kjbracey-arm was suggesting that takes advantage of the K64F's dma (this was before the talk of using the async api).

Anyways here's what I found with GCC_ARM and K64F at 40MHz:

SD write speed SD read speed SPI speed
with byte-writes 463.263Kbps 792.419Kbps 913.783Kbps
with block-writes 757.669Kbps 2.338Mbps 3.997Mbps
with hal block-writes 878.342Kbps 4.239Mbps 17.809Mbps

There is definitely motivation for block writes at the hal level in terms of raw SPI throughput.

As for why we don't see a proportional increase in throughput on the SD card, well I think this image sums it up quite nicely:
image

@geky geky force-pushed the spi-remove-byte-locking branch from d883f7c to 5256742 Compare May 24, 2017 23:26
@geky geky changed the title drivers: Remove locking from SPI individual byte-writes for performance reasons spi: Added SPI block-write to C++ and HAL for performance May 24, 2017
@geky geky changed the title spi: Added SPI block-write to C++ and HAL for performance spi: Add SPI block-write to C++ and HAL for performance May 24, 2017
@sg-
Copy link
Contributor

sg- commented May 25, 2017

This looks nice! b as in bits or b as in Bytes?

DMA block writes 17.809Mbps

@0xc0170
Copy link
Contributor

0xc0170 commented May 25, 2017

Why was write method changed to use HAL (later added HAL write block) ? There's a lot of copy-paste now. What's the story there? (I am missing it from this PR/commits). Should we consider deprecating spi_master_write as it does not make sense much to have it, this new API supersedes it once implemented for all HAL?

These were options I believe @geky considered:

  • provide new blocking HAL master write blocks - this is what is now in this PR, and possibly deprecate spi_master_write (I haven't checked all targets, I assume those that use own drivers, would have block write rather than single write. Thus in many cases would be duplicated effort when porting a new target)
  • make spi_master_write fast as much as possible and use that (this one is tough would require some changes thus the first one is more appealing)

The speed increase in K64F for block write is because their API is suited for block writes (invoking quite some state machine there and all the preparation for it to set up transfer attributes), thus invoking spi_master_write for each byte takes too much time. This is a nice showcase that they could have implemented it via lower level to make it faster but that would loose benefit of having own driver.

DSPI_MasterTransferBlocking - is not using DMA, please correct the commit. This is polling blocking call. they provide own DMA (its separate implementation) and I believe we don't want to add anything dma related before we got DMA story ready.

@geky
Copy link
Contributor Author

geky commented May 25, 2017

This looks nice! b as in bits or b as in Bytes?

b as in bits

DSPI_MasterTransferBlocking - is not using DMA, please correct the commit.

Ah! Looks like you're right! My bad. That's a surprising speedup for software. It looks like they are pushing 16-bits to the hardware at a time so that may be some of it. I'll correct the commit message.

Why was write method changed to use HAL (later added HAL write block) ? There's a lot of copy-paste now. What's the story there?

I'll try to add more to the commits, the HAL function (spi_master_block_write) was added to give targets the ability to optimize block writes (example k64f, and as @kjbracey-arm noted this is useful for ST). I left the commit that just adds the C++ block write since the HAL addition isn't required for the C++ api change.

There is an easy default implementation of spi_master_block_write that just calls spi_master_write in a loop, so I added the default spi_master_block_write to all of the targets. These can be left up to the targets to optimize (for example k64f has been updated to use DSPI_MasterTransferBlocking).

Should we consider deprecating spi_master_write as it does not make sense much to have it, this new API supersedes it once implemented for all HAL?

That sounds like a good idea. I was trying to only change a small amount, but I can update the pr to deprecate spi_master_write. Since this is in the HAL, should it still use MBED_DEPRECATED_SINCE?

@sg-
Copy link
Contributor

sg- commented May 25, 2017

That sounds like a good idea. I was trying to only change a small amount, but I can update the pr to deprecate spi_master_write. Since this is in the HAL, should it still use MBED_DEPRECATED_SINCE?

Lets not deprecate HAL things until we actually have a specification to work against. I'd consider this benign addition as welcomed given good doxygen description and that support was added for all targets.

@0xc0170
Copy link
Contributor

0xc0170 commented May 25, 2017

Thanks for the explanation. Agree that deprecation later, just adding it into consideration !

LGTM

Please @geky fix the CI, this would be nice to have for mbed OS 5.5

To notify, please read the above few comments what this brings into HAL and how the speed can be improved by new implementation: cc @LMESTM @stevew817 @nvlsianpu @TomoYamanaka @khj098765 @kgills @ccli8 @radhika-raghavendran

@geky geky force-pushed the spi-remove-byte-locking branch from 33e9129 to a2f480f Compare May 25, 2017 17:04
geky added 3 commits May 25, 2017 12:04
    virtual int write(const char *tx_buffer, int tx_length,
            char *rx_buffer, int rx_length);

The main benefit of block-level SPI writes is the performance
improvement from not acquiring a mutex lock between each byte sent on
the SPI bus. The block write may also be poked through the hal level for
additional speed improvements.
Poking the block-level SPI writes through the HAL level gives more
freedom to targets to improve SPI transaction speed.
There is an easy default implementation of spi_master_block_write that
just calls spi_master_write in a loop, so the default implementation
of spi_master_block_write has been added to all targets.
@geky geky force-pushed the spi-remove-byte-locking branch 2 times, most recently from 93244a3 to 7369919 Compare May 25, 2017 17:08
performance improvements:
naive block writes    3.997Mbps
DSPI block writes    17.809Mbps
@geky geky force-pushed the spi-remove-byte-locking branch from 7369919 to e352f1b Compare May 25, 2017 17:08
Copy link
Contributor

@LMESTM LMESTM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi - it looks to me like target_stm changes are missing:
targets/TARGET_STM/stm_spi_api.c (pending action to rename the file). Anyway that's a single file to update for covering all the STM32 targets.

There is an easy default implementation of spi_master_block_write that
just calls spi_master_write in a loop, so the default implementation
of spi_master_block_write has been added to all targets.
@0xc0170
Copy link
Contributor

0xc0170 commented May 31, 2017

/morph test

@mbed-bot
Copy link

Result: FAILURE

Your command has finished executing! Here's what you wrote!

/morph test

Output

mbed Build Number: 377

Example Build failed!

@sg-
Copy link
Contributor

sg- commented May 31, 2017

/morph test

1 similar comment
@studavekar
Copy link
Contributor

/morph test

@mbed-bot
Copy link

mbed-bot commented Jun 1, 2017

Result: SUCCESS

Your command has finished executing! Here's what you wrote!

/morph test

Output

mbed Build Number: 393

All builds and test passed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet