Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize tickless tick computation #9786

Merged
merged 2 commits into from Feb 27, 2019

Conversation

Projects
None yet
8 participants
@c1728p9
Copy link
Contributor

commented Feb 21, 2019

Description

Optimize the tick computation by computing the elapsed ticks based on relative time rather than absolute time. On the NUCLEO_L073RZ this reduces the execution time of SysTimer::resume from >35us to less than 15us 20us.

Also, to ensure the relative time calculations do not overflow set the max time the OS can suspend to 1 hour.

Pull request type

[x ] Fix
[ ] Refactor
[ ] Target update
[ ] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Reviewers

@pan-

@ciarmcom ciarmcom requested review from pan- and ARMmbed/mbed-os-maintainers Feb 21, 2019

@ciarmcom

This comment has been minimized.

Copy link
Member

commented Feb 21, 2019

@c1728p9, thank you for your changes.
@pan- @ARMmbed/mbed-os-core @ARMmbed/mbed-os-maintainers please review.

@ciarmcom ciarmcom requested a review from ARMmbed/mbed-os-core Feb 21, 2019

@kjbracey-arm
Copy link
Contributor

left a comment

This saves a fair bit on the M0, where it converts a 64-bit software division to a 32-bit software division, but on M3 and M4 it should save a lot more - it should be optimised into a "multiply by reciprocal" using UMULL. (I know GCC and ARMC5 do this, hopefully IAR and ARMC6 do too).

This is slightly at odds with something I'm looking to do with SysTimer - that was actually going to make it more 64-bit, to help support a manual suspend API, where I was anticipating long sleep times. But I think I could deal with that by having conditional behaviour - switch between optimised 32-bit and heavy 64-bit computations depending on whether deep sleep is locked. If deep sleep is permitted, we have a free reign on interrupt latency.

Show resolved Hide resolved rtos/TARGET_CORTEX/SysTimer.cpp Outdated
@jeromecoutant

This comment has been minimized.

Copy link
Contributor

commented Feb 21, 2019

@janjongboom
Maybe you can also have a look as I think you are using tickless for L072 Lora board ?

@LMESTM

@c1728p9 c1728p9 force-pushed the c1728p9:tickless_optimization branch Feb 21, 2019

rtos/TARGET_CORTEX/SysTimer.cpp Outdated
// 32 bit division is used here for better performance since this is in a critical section.
const timestamp_t elapsed_us = (timestamp_t)(ticker_read_us(_ticker_data) - _time_us);
uint32_t elapsed_ticks = (elapsed_us + _remainder_us) / US_IN_TICK;
_remainder_us = elapsed_us - elapsed_ticks * US_IN_TICK;

This comment has been minimized.

Copy link
@c1728p9

c1728p9 Feb 21, 2019

Author Contributor

I think this remainder calculation is wrong.

@c1728p9 c1728p9 force-pushed the c1728p9:tickless_optimization branch Feb 22, 2019

@c1728p9

This comment has been minimized.

Copy link
Contributor Author

commented Feb 22, 2019

I updated this PR to preserver the long sleep times without having an impact on interrupt latency. Bringing back 64 bit division brought execution time of SysTimer::resume up to to a bit under 20us. I also added a separate commit to remove the critical section in places where it wasn't needed, such as SysTimer::resume.

@cmonr cmonr requested a review from kjbracey-arm Feb 22, 2019


uint64_t new_tick = (ticker_read_us(_ticker_data) - _start_time) * OS_TICK_FREQ / 1000000;
if (new_tick > _tick) {
uint64_t elapsed_ticks = (ticker_read_us(_ticker_data) - _time_us) / US_IN_TICK;

This comment has been minimized.

Copy link
@kjbracey-arm

kjbracey-arm Feb 22, 2019

Contributor

Gah!!! I briefly wondered whether this might be a solution, but immediately shot it down because I assumed SysTimer::resume was being called in a critical section from outside. This is great.

I think it's still worth optimising for the common short case - if this is taking 20us every millisecond, that is 2% of CPU time just for the division. (On a 4MHz system I saw about 10% of CPU time being eaten up by the millisecond tickers.).

How about:

uint64_t elapsed_us = ticker_read_us(_ticker_data) - _time_us;
uint64_t elapsed_ticks;
if (elapsed_us > 0xFFFFFFFF) {
    // This will typically invoke full 64/64 software division
    elapsed_ticks = elapsed_us / US_IN_TICK;
} else {
    // This will will invoke 32/32 software division on ARMv6-M or ARMv8-M baseline, and
    // on ARMv7-M or ARMv8-M mainline can be optimised to a UMULL by 2^n/US_IN_TICK, or
    // at worst be a UDIV instruction. Thus it will be faster, maybe greatly.
    elapsed_ticks = (uint32_t) elapsed_us / US_IN_TICK;
}

This comment has been minimized.

Copy link
@c1728p9

c1728p9 Feb 22, 2019

Author Contributor

I assumed SysTimer::resume was being called in a critical section from outside.

It used to be in a critical section, but had to be changed when the RTX API it relied on became unavailable. I missed it at this time as well.

Initially I had it avoid division all together in the fast case - when only a single tick had elapsed:

if (elapsed_time < US_IN_TICK) {
    elapsed_ticks = 0;
} else if (elapsed_time < US_IN_TICK * 2) {
    elapsed_ticks = 1;
} else {
    elapsed_ticks = elapsed_time / US_IN_TICK;
}

I ended up going with the simpler implementation you see now since I was just worried about reducing worst case execution time. Also, I was only setup to profile only worst-case execution time, which did not change much from this.

It sounds like you have profiled the average execution time on a couple of devices. Would you be able to profile the different optimizations and create a PR for the one which looks the most promising?

This comment has been minimized.

Copy link
@kjbracey-arm

kjbracey-arm Feb 25, 2019

Contributor

I'll have to see if those extra cases help too. My feeling is that division tends to have early termination, so might not help much, but worth checking.

I was just worried about reducing worst case execution time.

But we don't care about worst-case any more if we're outside of the critical section, right?

This comment has been minimized.

Copy link
@c1728p9

c1728p9 Feb 25, 2019

Author Contributor

But we don't care about worst-case any more if we're outside of the critical section, right?

True, worst case doesn't matter anymore. At the time I was only setup to profile worst-case, so I wasn't able to evaluate the impact of this change.

Show resolved Hide resolved rtos/TARGET_CORTEX/SysTimer.cpp Outdated

@c1728p9 c1728p9 force-pushed the c1728p9:tickless_optimization branch Feb 22, 2019

@c1728p9

This comment has been minimized.

Copy link
Contributor Author

commented Feb 22, 2019

Rebased to master and updated US_IN_TICK calculation/compiler assert.

c1728p9 added some commits Feb 20, 2019

Optimize tickless tick computation
Optimize the tick computation in the following ways:
1. Use relative time rather than absolute time
2. Replace multiplication/division pair with just multiplication
    or division:
    "* 1000000ULL / OS_TICK_FREQ"   ->   "* US_IN_TICK"
    "* OS_TICK_FREQ / 1000000"      ->   "/ US_IN_TICK"
Optimize tickless interrupt latency
Remove unnecessary critical sections from the SysTimer code since
the access should already be serialized.

@c1728p9 c1728p9 force-pushed the c1728p9:tickless_optimization branch to f6ed7ce Feb 23, 2019

@janjongboom

This comment has been minimized.

Copy link
Contributor

commented Feb 25, 2019

@jeromecoutant I don't feel comfortable reviewing this at this level. But I trust @c1728p9

@cmonr

This comment has been minimized.

Copy link
Contributor

commented Feb 25, 2019

CI started whilst waiting on reviews.

@mbed-ci

This comment has been minimized.

Copy link

commented Feb 26, 2019

Test run: SUCCESS

Summary: 12 of 12 test jobs passed
Build number : 1
Build artifacts

@cmonr

This comment has been minimized.

Copy link
Contributor

commented Feb 26, 2019

@mbed-os-core @pan- Thoguhts?

@cmonr cmonr merged commit 3352b43 into ARMmbed:master Feb 27, 2019

27 checks passed

continuous-integration/jenkins/pr-head This commit looks good
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
jenkins-ci/build-ARM Success
Details
jenkins-ci/build-ARMC6 Success
Details
jenkins-ci/build-GCC_ARM Success
Details
jenkins-ci/build-IAR Success
Details
jenkins-ci/cloud-client-test Success
Details
jenkins-ci/dynamic-memory-usage RTOS ROM(+0 bytes) RAM(+0 bytes)
Details
jenkins-ci/exporter Success
Details
jenkins-ci/greentea-test Success
Details
jenkins-ci/mbed2-build-ARM Success
Details
jenkins-ci/mbed2-build-GCC_ARM Success
Details
jenkins-ci/mbed2-build-IAR Success
Details
jenkins-ci/unittests Success
Details
travis-ci/astyle Local astyle testing has passed
Details
travis-ci/docs Local docs testing has passed
Details
travis-ci/doxy-spellcheck Local doxy-spellcheck testing has passed
Details
travis-ci/events Passed, runtime is 9026 cycles (-1181 cycles)
Details
travis-ci/gitattributestest Local gitattributestest testing has passed
Details
travis-ci/include_check Local include_check testing has passed
Details
travis-ci/licence_check Local licence_check testing has passed
Details
travis-ci/littlefs Passed, code size is 8408B (+0.00%)
Details
travis-ci/psa-autogen Local psa-autogen testing has passed
Details
travis-ci/tools-py2.7 Local tools-py2.7 testing has passed
Details
travis-ci/tools-py3.5 Local tools-py3.5 testing has passed
Details
travis-ci/tools-py3.6 Local tools-py3.6 testing has passed
Details
travis-ci/tools-py3.7 Local tools-py3.7 testing has passed
Details

@cmonr cmonr removed the needs: review label Feb 27, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.