New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update idle loop to reduce calls to suspend #6534

Merged
merged 4 commits into from Apr 25, 2018

Conversation

@c1728p9
Contributor

c1728p9 commented Apr 3, 2018

Change tickless handling to use public RTX calls and to prevent unnecessary calls to suspend/resume by looping until the kernel needs to be resumed.

@cmonr cmonr added the needs: review label Apr 3, 2018

@c1728p9

This comment has been minimized.

Contributor

c1728p9 commented Apr 3, 2018

This change also allows #6536 to work correctly, as the low power timer no longer needs to written to on every wakeup.

@c1728p9

This comment has been minimized.

Contributor

c1728p9 commented Apr 3, 2018

@kjbracey-arm

Not familiar with the SysTimer, but I've looked at the RTX end.

if (!osRtxInfo.kernel.pendSV) { // check devices didn't run any interrupt handlers kicking the OS between suspend and critical section
sleep(); // uses WFI - will wake as soon as any pending interrupts
}
core_util_critical_section_exit();

This comment has been minimized.

@kjbracey-arm

kjbracey-arm Apr 4, 2018

Contributor

There might be call for an ISB here - you're wanting pending interrupts to trigger between this line and the next one, and an ISB should make sure that happens.

In theory I think you could redisable interrupts before any have a chance to trigger - formally the sequence "enable IRQs / disable IRQs" needs an ISB somewhere in the middle to guarantee they have time to trigger.

Relevant ARM ARM quote:

If execution of a CPS instruction:

  • Increases the execution priority, the CPS execution serializes that change to the instruction stream.
  • Decreases the execution priority, the architecture guarantees only that the new priority is visible to instructions executed after either executing an ISB, or performing an exception entry or exception return.

(In practice they're likely to trigger before you hit the next disable, which means you'd see pendSV set by the next time you hit line 104.)

os_timer->cancel_tick();
// calculate how long we slept
elapsed_ticks = os_timer->update_tick();
if (osRtxInfo.kernel.pendSV) {

This comment has been minimized.

@kjbracey-arm

kjbracey-arm Apr 4, 2018

Contributor

This could just be part of the while condition? Could be true on first entry to the while

@c1728p9 c1728p9 force-pushed the c1728p9:rtos_suspend branch from 2521a6c to 2f61a93 Apr 4, 2018

Update idle loop to reduce calls to suspend
Change tickless handling to use public RTX calls and to prevent
unnecessary calls to suspend/resume by looping until the kernel
needs to be resumed.

@c1728p9 c1728p9 force-pushed the c1728p9:rtos_suspend branch from 2f61a93 to 953bbeb Apr 4, 2018

@0xc0170 0xc0170 requested a review from bulislaw Apr 5, 2018

while (!os_timer->suspend_time_passed() && !event_pending) {

core_util_critical_section_enter();
if (osRtxInfo.kernel.pendSV) {

This comment has been minimized.

@JonatanAntoni

JonatanAntoni Apr 5, 2018

Member

osRtxInfo.kernel.pendSV is not public API. Perhaps @RobertRostohar can justify if this solution is solid and a small API change would be valuable.

This comment has been minimized.

@0xc0170

0xc0170 Apr 5, 2018

Member

Might help with the review, details are here #6273 (comment)

This comment has been minimized.

@kjbracey-arm

kjbracey-arm Apr 5, 2018

Contributor

Agree that there should be better way - pendSV peeking was the best I could think of with current APIs.

The "simple" API change I would suggest is to permit osKernelSuspend/Resume to be callable from critical section (which is what happens via direct svcKernelSuspend calls at the moment, and this replaces), or to have them or equivalents actually return from the suspend call such that the thread is now in a critical section.

I guess in general one also needs to think about priviliged/unprivileged operation, in case a system is using User mode - system suspension, critical sections and the WFI instruction are privileged operations anyway, so this whole "normal thread code does suspend" only works if the idle thread is privileged (or the suspend call makes you privileged?), or you have a WFE-only version (which I believe is less flexible and can reduce power saving opportunities).

This comment has been minimized.

@c1728p9

c1728p9 Apr 5, 2018

Contributor

@JonatanAntoni when I was initially working on tickless I proposed an extension to the RTOS to allow this, which you might be able to use:
https://github.com/ARMmbed/mbed-os/pull/5031/files#diff-415fd494349b5414064e0cc0d9f0d596R189

As for this PR, how do you want to proceed? Can we move forward using this internal API until RTX has a public API which supports this?

This comment has been minimized.

@JonatanAntoni

JonatanAntoni Apr 5, 2018

Member

Sure, for the time being you can access this internal bit. But as it is not public API it might change in the future without being warned. So we should align the low power operation, soon. We need to address this privileged/unprivileged issue nonetheless talking about MPU protection.

@c1728p9

This comment has been minimized.

Contributor

c1728p9 commented Apr 6, 2018

/morph build

@mbed-ci

This comment has been minimized.

mbed-ci commented Apr 6, 2018

@0xc0170

This comment has been minimized.

Member

0xc0170 commented Apr 6, 2018

mbedmicro-rtos-mbed needs an update?

@0xc0170 0xc0170 added needs: work and removed needs: review labels Apr 6, 2018

@LMESTM LMESTM referenced this pull request Apr 9, 2018

Closed

Waking STM32L073 from sleep #6362

@cmonr

This comment has been minimized.

Contributor

cmonr commented Apr 9, 2018

@0xc0170 Would appear to be the case.

Update the SysTimer test
Update the SysTimer test to match the updated API. Changes are:
- increment_tick() renamed to _increment_tick() and explicitly
    synchronized
-update_tick() replaced with resume() and a call to suspend() was added
    before this
@c1728p9

This comment has been minimized.

Contributor

c1728p9 commented Apr 10, 2018

Even with the test update, this PR intermittently fails on test_update_tick on the nrf51. From profiling this test, I found that the call to suspend is taking at times +500us. I suspect that a tick is passing and that is why 1 is returned from suspend rather than 0 - FAIL: Expected 0 Was 1.

If I pull in #5629 suspend takes around 133us to complete, which is still a lot, but the test consistently passes. @fkjagodzinski you were the author of this test right? Do you have any recommendations on how I can get this test to pass consistently?

@c1728p9

This comment has been minimized.

Contributor

c1728p9 commented Apr 10, 2018

/morph build

@mbed-ci

This comment has been minimized.

mbed-ci commented Apr 10, 2018

Build : SUCCESS

Build number : 1714
Build artifacts/logs : http://mbed-os.s3-website-eu-west-1.amazonaws.com/?prefix=builds/6534/

Triggering tests

/morph test
/morph uvisor-test
/morph export-build
/morph mbed2-build

@mbed-ci

This comment has been minimized.

@fkjagodzinski

This comment has been minimized.

Member

fkjagodzinski commented Apr 11, 2018

@c1728p9 test_update_tick() is supposed to check just the update_tick() method with, and without a delay in between calls. Since update_tick() functionality was moved to resume() it's enough to update method name in the test. No calls to suspend() needed.

With this patch 5bfaf52 all tests-mbedmicro-rtos-mbed-systimer consistently pass on NRF51_DK.

@0xc0170

This comment has been minimized.

Member

0xc0170 commented Apr 11, 2018

With this patch 5bfaf52 all tests-mbedmicro-rtos-mbed-systimer consistently pass on NRF51_DK.

Shall this be added here?

@c1728p9

This comment has been minimized.

Contributor

c1728p9 commented Apr 17, 2018

Thanks for the fix @fkjagodzinski. I pulled it into this PR.

@cmonr

This comment has been minimized.

Contributor

cmonr commented Apr 17, 2018

/morph build

@mbed-ci

This comment has been minimized.

mbed-ci commented Apr 17, 2018

Build : SUCCESS

Build number : 1775
Build artifacts/logs : http://mbed-os.s3-website-eu-west-1.amazonaws.com/?prefix=builds/6534/

Triggering tests

/morph test
/morph uvisor-test
/morph export-build
/morph mbed2-build

@mbed-ci

This comment has been minimized.

@mbed-ci

This comment has been minimized.

@bulislaw

The idle loop and systimer changes looks good.

@0xc0170

This comment has been minimized.

Member

0xc0170 commented Apr 18, 2018

Ah too slow to remove the comment to trigger another build :-) I invoked build instead of test to retest 😖

@mbed-ci

This comment has been minimized.

mbed-ci commented Apr 18, 2018

Build : SUCCESS

Build number : 1789
Build artifacts/logs : http://mbed-os.s3-website-eu-west-1.amazonaws.com/?prefix=builds/6534/

Triggering tests

/morph test
/morph uvisor-test
/morph export-build
/morph mbed2-build

@mbed-ci

This comment has been minimized.

@mbed-ci

This comment has been minimized.

@fkjagodzinski

This comment has been minimized.

Member

fkjagodzinski commented Apr 19, 2018

tests-mbed_drivers-timeout / 'Multiple timeouts running in parallel (attach)'test case compiled with ARM toolchain failed again on NRF52_DK board.

I'm investigating. So far, I can tell that the issue is also present on current master branch. #6515 might help here -- with this PR merged I couldn't reproduce the issue locally any more.

@0xc0170

This comment has been minimized.

Member

0xc0170 commented Apr 19, 2018

I'm investigating. So far, I can tell that the issue is also present on current master branch. #6515 might help here -- with this PR merged I couldn't reproduce the issue locally any more.

Interesting, and that fix is blocked by another failure 😕

@0xc0170

This comment has been minimized.

Member

0xc0170 commented Apr 19, 2018

I'm investigating. So far, I can tell that the issue is also present on current master branch.

@adbridge @cmonr Are you aware of this failure? I don't recall seeing this in CI, I'll look at the logs. @fkjagodzinski If this is present on master, and these tests were added not long time ago, were you aware of this issue?

@fkjagodzinski

This comment has been minimized.

Member

fkjagodzinski commented Apr 19, 2018

@0xc0170 No, not until I saw morph failures for this PR. It didn't look related to this PR, that's why I checked master. The issue does not show up very often, but can be reproduced.

@cmonr

This comment has been minimized.

Contributor

cmonr commented Apr 24, 2018

/morph test

@fkjagodzinski

This comment has been minimized.

Member

fkjagodzinski commented Apr 24, 2018

I've created an issue #6732 based on the failing test case tests-mbed_drivers-timeout / 'Multiple timeouts running in parallel (attach)'.

@mbed-ci

This comment has been minimized.

@cmonr cmonr added ready for merge and removed needs: work labels Apr 24, 2018

@0xc0170 0xc0170 merged commit 675528b into ARMmbed:master Apr 25, 2018

12 checks passed

AWS-CI uVisor Build & Test Success
Details
ci-morph-build build completed
Details
ci-morph-exporter build completed
Details
ci-morph-mbed2-build build completed
Details
ci-morph-test test completed
Details
continuous-integration/jenkins/pr-head This commit looks good
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
travis-ci/docs Local docs testing has passed
Details
travis-ci/events Passed, runtime is 8903 cycles
Details
travis-ci/gitattributestest Local gitattributestest testing has passed
Details
travis-ci/littlefs Passed, code size is 10092B
Details
travis-ci/tools Local tools testing has passed
Details
void SysTimer::cancel_tick()
{
remove();
schedule_tick(ticks);

This comment has been minimized.

@stevew817

stevew817 May 25, 2018

Contributor

@c1728p9 Why is there no remove() call here? I'm seeing issues on my system because of this.

The SysTimer is based on a TimerEvent class, which has a static member for the timer's event data struct. Additionally, the call to suspend may happen before the previously set timer expires (for example when all threads go idle and the idle handler gets activated). At that point, the timer event data is still present in the timer queue. Calling schedule_tick here will insert the same event structure again in the queue. And since the queue is by reference, that'll lead to a recursive timer that breaks the system.

This comment has been minimized.

@c1728p9

c1728p9 May 26, 2018

Contributor

Hi @stevew817 I gave this a try but was unable to reproduce the failure. Do you have example code which reproduces the issue?

From both code analysis and running the code can I cant find a sequence which causes this. The basic flow of the idle loop should unconditionally call remove before each insert as indicated below:

while (true) {
    osKernelSuspend();      // removes event
    os_timer->suspend();    // inserts event
    sleep();
    os_timer->resume();     // removes event
    osKernelResume();       // inserts event
}

With a debugger I can confirm the following sequence is being met by setting a breakpoint in TimerEvent::remove, TimerEvent::insert and TimerEvent::insert_absolute. Below is the callstack from each:

osKernelSuspend()
    svcRtxKernelSuspend()
        KernelBlock()
            OS_Tick_Disable()
                os_timer->cancel_tick()
                    remove()

os_timer->suspend()
    schedule_tick()
        insert_absolute()

os_timer->resume()
    remove()

osKernelResume()
    svcRtxKernelResume()
        KernelUnblock()
            OS_Tick_Enable()
                schedule_tick()
                    insert_absolute();

This comment has been minimized.

@stevew817

stevew817 May 26, 2018

Contributor

I ran across this when running cloud-client-example using Thread as a sleepy node in tickless mode. The code would hang, and attaching a debugger confirmed it was stuck in the tick handler IRQ because somehow the tick event had managed to add itself recursively (the next pointer of the tick event data was pointing to the same event data, causing the remove call to not do anything since the head just always ands up being the tick event data).

I fixed the issue by adding a cancel_tick() call before the schedule_ticks call in SysTimer::suspend. I didn't investigate further after solving my issue, so I'm afraid I don't have a reproducer ready.

This comment has been minimized.

@c1728p9

c1728p9 May 27, 2018

Contributor

@stevew817 I'm still unable to reproduce this problem, but I created #7027 just to be safe. Could you provide further details on how to replicate this failure (such as git shas used in cloud-client-example and mbed-os along with config, any local mods you made and the hardware you were using) or an example application?

@cmonr

This comment has been minimized.

Contributor

cmonr commented May 25, 2018

@c1728p9 Please take a look at this and if this needs a fix and/or update, we can take a fix in for RC2.

CC: @ARMmbed/mbed-os-maintainers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment