New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove windup behavior from break_dispatch #6238

Merged
merged 4 commits into from Mar 15, 2018

Conversation

Projects
None yet
8 participants
@pauluap

pauluap commented Feb 28, 2018

Change break_dispatch from a counting semaphore which needs the same number of calls to dispatch_forever() as number of calls to break_dispatch() to a squashable flag which will only break once for any number of calls to break_dispatch() between calls to dispatch_forever()

Resolves #6204

Pull request type

  • Fix
  • Refactor
  • New target
  • Feature
  • Breaking change
@kjbracey-arm

This comment has been minimized.

Contributor

kjbracey-arm commented Mar 1, 2018

Looks fine to me - outstanding thought from #6204 - should the "break" flag also be cleared if it exits due to timeout, which is tested before the break flag?

@@ -73,7 +73,7 @@ int equeue_create_inplace(equeue_t *q, size_t size, void *buffer) {
q->queue = 0;
q->tick = equeue_tick();
q->generation = 0;
q->breaks = 0;
q->breaks = false;

This comment has been minimized.

@kjbracey-arm

kjbracey-arm Mar 1, 2018

Contributor

It's now "break", rather than "breaks", but that's a keyword. How about break_requested? Always good to name booleans as something that makes sense to be true or false.

This comment has been minimized.

@geky

geky Mar 1, 2018

Member

Ah! Good catch, that would explain why the CI was failing

@@ -366,7 +366,7 @@ void equeue_cancel(equeue_t *q, int id) {
void equeue_break(equeue_t *q) {
equeue_mutex_lock(&q->queuelock);
q->breaks++;
q->breaks = true;

This comment has been minimized.

@kjbracey-arm

kjbracey-arm Mar 1, 2018

Contributor

Could avoid semaphore signalling if break already set. (Avoiding pointless ISR queue use, as per my suggestion ARM-software/CMSIS_5#283 )

@geky

This comment has been minimized.

Member

geky commented Mar 1, 2018

I think this looks great!

@pauluap

This comment has been minimized.

pauluap commented Mar 1, 2018

Ha! Of course, all the other issues now emerge from the woodwork.

So, so far we have:

  • variable rename - break_requested sounds good to me
  • squash with timeout
  • avoid semaphore signaling

Would it make sense for me to do those as well?

I did take a look at the failing CI test, but it's not too clear to me why the failure occurred, is there anything else I need to do to make the tests happy?

@kjbracey-arm

This comment has been minimized.

Contributor

kjbracey-arm commented Mar 2, 2018

Didn't think it would be that easy, did you?

I'd like @geky's thought on the timeout clear. It seems correct to me, but my brain sometimes malfunctions. If we're going to change break behaviour, probably best to make 1 combined change.

Variable should be renamed because breaks no longer fits.

I think avoiding the semaphore signal is worthwhile, just because we have recurring bug reports of that ISR overflow problem. Although the real place that occurs here would be in equeue_post (assuming your equeue is bigger than RTX's ISR queue, so you hit RTX's limit first). So maybe leave it until a future combined patch that does it for both post and break. (Do we have any open issues for this?)

@geky

This comment has been minimized.

Member

geky commented Mar 2, 2018

If you have a linux machine handy, you can run the tests locally with make test inside the equeue directory. In this case the break test (here) is segfaulting, so unfortunately it needs to be ran inside a debugger to see what's going wrong. make DEBUG=1 test ; gdb tests/tests will get you to where the failure is happening.

I think I found the issue, will comment on the line.

I'd like @geky's thought on the timeout clear.

I think you're right we need timeout to also clear. That's the only way to guarantee that a break from inside the queue is cleared before dispatch exits.

Here's the other return statement, should just need a breaks = false before it:
https://github.com/pauluap/mbed-os/blob/c00061a2f8f5a2e5d0f7ec452c43ceafe3de761e/events/equeue/equeue.c#L421

Sidenote: There is no similar guarantee for a break from a different thread, since it's a tossup if the break could land after dispatch exits.

I think avoiding the semaphore signal is worthwhile [..] Although the real place that occurs here would be in equeue_post

IMO this sounds like an RTOS issue if binary semaphores have overflow problems. If we need a workaround I'd prefer that goes in the porting layer (with a flag?) so this issue is fixed for all calls to equeue_sema_signal:
https://github.com/ARMmbed/mbed-os/blob/master/events/equeue/equeue_mbed.cpp#L111

Also break_requested sounds good to me too.

equeue_mutex_unlock(&q->queuelock);
return;
}
q->breaks = false;

This comment has been minimized.

@geky

geky Mar 2, 2018

Member

So, unfortunately this logic is subtlely complex...

if (q->breaks) {    // <- This condition is outside of mutex so it can't be
                    // trusted. Even if this if is true, we're only _probably_
                    // sure that a break was signalled (concurrent dispatch calls).
                    // The reason for the extra if statement is so that we can avoid
                    // getting a mutex in the common case (no break events).
                    
    equeue_mutex_lock(&q->queuelock);
    if (q->breaks > 0) { // <- Once we're in the mutex, we confirm that break
                         // was signalled
        q->breaks--;
        equeue_mutex_unlock(&q->queuelock);
        return;          // <- This return actually exits equeue_dispatch. Otherwise
                         // the queue happily continues to run.
    }
    equeue_mutex_unlock(&q->queuelock);
}

So two problems:

  1. You still need the second if statement inside the mutex
  2. You still need the return statement to actually break out of the dispatch loop

@pauluap pauluap force-pushed the pauluap:break_dispatch_flag branch from c00061a to 5d98d22 Mar 2, 2018

Paul Thompson added some commits Mar 2, 2018

@pauluap pauluap force-pushed the pauluap:break_dispatch_flag branch from ffbf969 to bb0d540 Mar 2, 2018

@pauluap

This comment has been minimized.

pauluap commented Mar 2, 2018

Allright, I chose to amend my commit because I didn't want any false noise around the return cleanup.

I also added the clearing on timeout, but I also chose to put in within the mutex lock.

I went ahead and added two tests - one for the break_dispatch windup behavior and the other for clearing break_request on timeout. Good thing that I did because I originally put the timeout clearing code within the background update mutex which didn't fire, of course.

@geky

It looks great 👍 Thanks for adding tests. I noticed some style difference if you are able to fix those, but then it should be good to go in.

void simple_breaker(void *p) {
CountAndQueue* caq = (CountAndQueue*)p;
equeue_break(caq->q);
usleep(10000);

This comment has been minimized.

@geky

geky Mar 2, 2018

Member

nit: It looks like there's a mixture of tabs + spaces here. Can you change this to just 4 spaces for indention.

equeue_t* q;
};
typedef struct sCaQ CountAndQueue;

This comment has been minimized.

@geky

geky Mar 2, 2018

Member

nit: This is inconsistent with other structs in this file. Could you change this to struct count_and_queue without a typedef.

Paul Thompson
@pauluap

This comment has been minimized.

pauluap commented Mar 2, 2018

grr, that's what I get for editor-hopping, I fall in the space-over-tabs camp, so thanks for letting me know :)

@geky

geky approved these changes Mar 2, 2018

No problem, looks good to me 👍

Next step is @kjbracey-arm's review, CI, and then it should be merged shortly after.

@@ -418,6 +418,7 @@ void equeue_dispatch(equeue_t *q, int ms) {
q->background.active = true;
equeue_mutex_unlock(&q->queuelock);
}
q->break_requested = false;

This comment has been minimized.

@kjbracey-arm

kjbracey-arm Mar 5, 2018

Contributor

I feel a bit nervous about actually writing to this variable outside the mutex - one step worse than reading it. In both cases it has no synchronisation protection, which could in principle cause problems with multiple CPUs. Some C11/C++11 atomic would sort it out, but we don't have that available.

However, I can't actually construct a plausible failure mode looking at it - all the mutexes and semaphores that do exist around it add quite a lot of ordering constraints. And I guess we assume only 1 thread ever dispatches the event queue - we're synchronising between multiple queuers and 1 dispatcher, not multiple dispatchers?

This comment has been minimized.

@pauluap

pauluap Mar 5, 2018

I did originally move it to be within the mutex block in 413-419, but of course that led to failing tests because that mutex is only entered if there's an update object attached.

How about if I expand the mutex lock to outside the update conditional? It hurts the normal case, but doesn't impact the worst case.

As in...

        // check if we should stop dispatching soon
        if (ms >= 0) {
            deadline = equeue_tickdiff(timeout, tick);
            if (deadline <= 0) {
                equeue_mutex_lock(&q->queuelock);
                // update background timer if necessary
                if (q->background.update) {
                    if (q->background.update && q->queue) {
                        q->background.update(q->background.timer,
                                equeue_clampdiff(q->queue->target, tick));
                    }
                    q->background.active = true;
                }
                q->break_requested = false;
                equeue_mutex_unlock(&q->queuelock);
                return;
            }
        }

This comment has been minimized.

@geky

geky Mar 5, 2018

Member

The original implementation seemed fine to me. The case we're concerned about is a break from inside the event queue, which by definition is synchronous.

With multiple dispatchers all we're gauranteed is that at least one thread will break, which is already protected by a mutex. This race condition can only happen after a thread has already decided to exit, but even with a critical section it's still a race since you don't know if the break will be consumed before or after the timeout.

I'd be fine with extending the mutex to be safe, except it does look like it has a negative performance impact (using make prof):

beginning profiling...
baseline_prof: 23 cycles (+0%)
equeue_tick_prof: 43 cycles (+0%)
equeue_alloc_prof: 61 cycles (+0%)
equeue_post_prof: 227 cycles (+0%)
equeue_post_future_prof: 227 cycles (+0%)
equeue_dispatch_prof: 312 cycles (-34%)   <-- ouch, was 232 cycles
equeue_cancel_prof: 121 cycles (+0%)
equeue_alloc_many_prof: 64 cycles (+0%)
equeue_post_many_prof: 221 cycles (+2%)
equeue_post_future_many_prof: 220 cycles (+3%)
equeue_dispatch_many_prof: 6952 cycles (+0%)
equeue_cancel_many_prof: 122 cycles (+0%)
equeue_alloc_size_prof: 56 bytes (+0%)
equeue_alloc_many_size_prof: 64000 bytes (+0%)
equeue_alloc_fragmented_size_prof: 64000 bytes (+0%)
done!

Although note this is on Linux, where mutex acquisition isn't cheap.

This comment has been minimized.

@kjbracey-arm

kjbracey-arm Mar 6, 2018

Contributor

Makes sense to me - leave it as is.

@cmonr

This comment has been minimized.

Contributor

cmonr commented Mar 6, 2018

/morph build

@cmonr cmonr added needs: CI and removed needs: review labels Mar 6, 2018

@mbed-ci

This comment has been minimized.

mbed-ci commented Mar 6, 2018

Build : SUCCESS

Build number : 1357
Build artifacts/logs : http://mbed-os.s3-website-eu-west-1.amazonaws.com/?prefix=builds/6238/

Triggering tests

/morph test
/morph uvisor-test
/morph export-build
/morph mbed2-build

@mbed-ci

This comment has been minimized.

@mbed-ci

This comment has been minimized.

@studavekar

This comment has been minimized.

Collaborator

studavekar commented Mar 6, 2018

/morph mbed2-build

@cmonr cmonr added ready for merge and removed needs: CI labels Mar 7, 2018

@cmonr

This comment has been minimized.

Contributor

cmonr commented Mar 7, 2018

@adbridge @0xc0170 Marking this as targeting 5.9.0-rc1 since according to the issue this will fix, behavior will change, but I'm not sure about merging a PR that won't be put in use for a long time.

@adbridge

This comment has been minimized.

Contributor

adbridge commented Mar 7, 2018

@pauluap Please provide a proper description of this fix and the impact. Also only one checkbox should be ticked. Thanks

@adbridge

This comment has been minimized.

Contributor

adbridge commented Mar 7, 2018

@cmonr Whether we should merge for 5.9 so far in advance really depends on the likelihood of the files in this PR being touched again for other bug fixes... @0xc0170 what is your opinion?

@pauluap

This comment has been minimized.

pauluap commented Mar 7, 2018

I updated the description, but I wasn't the one who added the flags, dunno which one to pick.

Behavior did change, but it's arguable that the previous behavior wasn't the intended behavior in the first place. I'm okay either way, if it's designated as a breaking change, then clearly the fix flag has to go.

@0xc0170

This comment has been minimized.

Member

0xc0170 commented Mar 7, 2018

I updated the description, but I wasn't the one who added the flags, dunno which one to pick.

I edited it , it's now fixed. Please use PR type for future.

@cmonr Whether we should merge for 5.9 so far in advance really depends on the likelihood of the files in this PR being touched again for other bug fixes... @0xc0170 what is your opinion?

Changes like this should be fine.

@cmonr cmonr merged commit 1580774 into ARMmbed:master Mar 15, 2018

19 checks passed

AWS-CI uVisor Build & Test Success
Details
ci-morph-build build completed
Details
ci-morph-exporter build completed
Details
ci-morph-mbed2-build build completed
Details
ci-morph-test test completed
Details
continuous-integration/jenkins/pr-head This commit looks good
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
travis-ci/docs Local docs testing has passed
Details
travis-ci/events Local events testing has passed
Details
travis-ci/littlefs Passed, code size is 10060B (+0.00%)
Details
travis-ci/mbed2-ATMEL Local mbed2-ATMEL testing has passed
Details
travis-ci/mbed2-MAXIM Local mbed2-MAXIM testing has passed
Details
travis-ci/mbed2-NORDIC Local mbed2-NORDIC testing has passed
Details
travis-ci/mbed2-NUVOTON Local mbed2-NUVOTON testing has passed
Details
travis-ci/mbed2-NXP Local mbed2-NXP testing has passed
Details
travis-ci/mbed2-RENESAS Local mbed2-RENESAS testing has passed
Details
travis-ci/mbed2-SILICON_LABS Local mbed2-SILICON_LABS testing has passed
Details
travis-ci/mbed2-STM Local mbed2-STM testing has passed
Details
travis-ci/tools Local tools testing has passed
Details

@cmonr cmonr removed the ready for merge label Mar 15, 2018

@pauluap pauluap deleted the pauluap:break_dispatch_flag branch Apr 16, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment