Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ble advertising logic #625

Merged
merged 7 commits into from
Sep 13, 2021
Merged

Conversation

evergreen22
Copy link
Contributor

This is my WIP attempt to resolve the ble re-connect issue and get more reliable ble behavior. Comments, discussion, and suggestions are welcome. I'm testing on both my watch and dev-kit. The watch has been connected now for 3+ days. The dev-kit still fails periodically. I'm using both nRFConnect and GB for testing. I am (still) waiting on the breakout to connect my Segger JLink and gaze into the dumpster fire that is ble.

  • Refactor ble advertising based on ble standards and conventions.
  • Changes are based on the bleprph example code, bluetooth docs, and nimble docs.

Refactor ble advertising based on ble standards and conventions.
Changes are based on the bleprph example code, bluetooth docs, and nimble docs.
@evergreen22
Copy link
Contributor Author

I re-enabled the assert statements that have been commented out in the advertising function and I have not observed them fire.

My nRFConnect shows the watch advertising in 64ms intervals. I set the interval in the call to ble_gap_adv_start to 5000 based on my nRFConnect logs which showed 5000 as the agreed upon parameters. I;m not clear on why the agreed connection parameters are 5000 ms and the actual interval on the watch is 64 ms.

@evergreen22
Copy link
Contributor Author

I implemented bonding with a passkey in my local branch and tested it to see if that impacted the re-connection issue. The only change I observed was that everytime the watch restarted advertising (i.g. central out or range) I would have to delete the bond to successfully re-connect. I believe this can be fixed by persisting the keys on the watch.

@geekbozu
Copy link
Member

On Sun, 29 Aug 2021, Tim Keller wrote: Not sure if this is the PR to do it but after some time spent advertising at 64ms, the spec suggests dropping to a much lower advertisement interval. Maybe take a timestamp on the last disconnect and clear it on connect. Setting adv_params.itvl_min is the parameter to change that i believe
I see where that is suggested in the docs. First, I'm trying to resolve the re-connect issue then I thought we could design an appropriate algorithm for backing off on the adv interval.

Yeah getting proper bonding working would be great! Feel free to nag me on the discord as well I'm almost always around :)

@evergreen22
Copy link
Contributor Author

evergreen22 commented Aug 30, 2021 via email

@evergreen22
Copy link
Contributor Author

After 4 days, 18 hours, and some minutes of continous bluetooth connectivity my watch and GB disconnected. I noticed the bluetooth icon was gone on the watch while looking at the time and pulled up GB to see "waiting for reconnect". I opened NRFConnect and the watch was not advertising. :-(

I'm confident the watch received a disconnect event from nimble (because the ble icon was off) and it should have been advertising (the code says so). This is very frustrating.

@evergreen22
Copy link
Contributor Author

Another unexplained disconnect today that could only be fixed by rebooting the watch.

I started this WIP to discover the root cause of the disconnects and now I do not believe this PR will resolve that issue. However, based on the specs, docs, and example nimble apps, I do feel that this is the right way to manage advertising.

Hopefully @JF002 will discover a useful clue with the logic probe and sniffer. I am in need of some inspiration ;-|

@geekbozu
Copy link
Member

geekbozu commented Sep 2, 2021

I'm in agreement,
Sad that its not solving the issue and that it lies elsewhere...however the changes are still very much appreciated and needed. Just means we have to keep looking!

Do you think its in the scope of this PR to limit the advertising speed after a few attempts? To "save" power when its just idling advertising.

@evergreen22
Copy link
Contributor Author

evergreen22 commented Sep 3, 2021 via email

@lman0
Copy link

lman0 commented Sep 4, 2021

hi @evergreen22 ,

here https://github.com/espruino/Espruino/tree/master/targetlibs/nrf5x_12/components/ble is the bangle.js source code related to Bluetooth ,
It might help you since pinetime and banglejs smartwatch share nearly the same component .

it's based on open source nordic source code in order to achieve proper ble stack and may help to understand what pinetime lack.

@JF002 since we use nordic , why we don't use their stack? since from i can see it's open source (otherwise esprino would not be able to modifiy it and share it) , and would resolve some ble problem.
Otherwise i think it should be possible to use the nnordic source code used/modified by esprino for nimble in pinetime since it's not forbidden by the license.
At very least it may help people to create a better algo for ble .

best regards

@evergreen22
Copy link
Contributor Author

evergreen22 commented Sep 4, 2021

The "linear decrease..." commit c32ba84 slows down the rate of advertisements to conserve battery when no connection is available. See the commit message for details on how it works. This chart visually describes the advertisement intervals versus time (smaller interval is faster advertising).

interval_chart

I've tested this on my dev-kit and sealed unit with numerous connect-disconnect and power cycles and it is working.

@evergreen22
Copy link
Contributor Author

Thanks @lman0.

Start advertising aggressively when powered on then
slow down linearly over 75 seconds. This will conserve
battery by not advertising rapidly the whole time we
are seeking a connection. The slowest rate is
approximately once every 4.5 seconds to balance
responsiveness and battery life.

We use a fixed advertising duration of 5 seconds and start
with a 62.5 ms advertising interval. Every 5 seconds
(the advertising duration) we step up to a larger
advertising interval (slower advertising). We continue
to increase the advertising interval linearly for
75 seconds from the start of advertising. At 75 seconds
we have an advertising interval of 4.44 seconds which we
keep until connected. A reboot will restart the sequence.

When we receive a disconnect event we restart the sequence
with fast advertising and then slow down as described above.
Note that we are not using the BLE high duty cycle setting to
change the advertising rate. The rate is managed by repeatedly
setting the minimum and maximum intervals.

The linear rate of decrease and the slowest interval size
were determined experimentally by the author. The 5.3 Core
spec suggests that you not advertise slower than once
every 1.2 seconds to preserve responsiveness but we
ignored that suggestion.
@geekbozu
Copy link
Member

geekbozu commented Sep 4, 2021

I feel like we should be doing the 62.5 Advertising interval for about 30 seconds every-time the screen turns on since its a "priority" style thing. And then back off either linearly or in a jump after that. I'm not sure of the benefit of the linear drop...Some rough numbers are ~1.35mA -> 1.18mA after it fully scales down.
Which is ~5days and 6 hours to 6 days or about 18 hours extra life compared to full bore advertising!

Any input on why you picked a linear drop? thanks for putting this effort in :)

@evergreen22
Copy link
Contributor Author

Note to @geekbozu, somehow I've broken my client, so I'm recreating my response to you here (assuming you got the response in the first place).

Thanks for the feedback and the discussion. I do appreciate it.

I feel like we should be doing the 62.5 Advertising interval for about 30 seconds every-time the
screen turns on since its a "priority" style thing. And then back off either linearly or in a jump
after that.

I don't see the point of advertising when we are already connected. We can only have one connection at a time on the watch.

What do you mean by "priority style thing"?

Some rough numbers are ~1.35mA -> 1.18mA after it fully scales down.
Which is ~5days and 6 hours to 6 days or about 18 hours extra life compared to full bore
advertising!

Thanks for measuring this! 18 hours is nice and more than I expected. If anything, this confirms my intution that advertising is fairly energy efficient on this hardware.

Any input on why you picked a linear drop?

Linear is simple. Simple code is easier for humans to understand and less likely to contain defects.

My first approach was based on Ethernet's exponential backoff algorithm, However, that was too fast of a decrease imo for this application (and unnecessarily complex). This specific linear curve gives folks a full minute of fairly quick advertising to get connected but still feels responsive even after the 75 seconds. The worst case, after the advertising reaches the slowest interval, is that a connection will take 4.43 seconds. The average case will be 2.2 seconds and both feel pretty snappy imo.

@geekbozu
Copy link
Member

geekbozu commented Sep 5, 2021

Ahh I explained poorly then.

So all of this only applies when not connected and we should be advertising.

So when not connected, We initially advertise at the 62.5 Interval for about 30 seconds. After those 30 seconds we back off to a more power friendly speed. Say 1.2 seconds like recommended by the spec. Or less if we determine its snappy enough.

Now while not connected, When the screen turns on. We could consider going back up to the 62.5interval for responsiveness and compatibility. (Thats what I meant by "priority" )

Yeah 18 hours is quite a bit id say its a good trade off for the watch always advertising.

@evergreen22
Copy link
Contributor Author

evergreen22 commented Sep 5, 2021 via email

@evergreen22
Copy link
Contributor Author

Now while not connected, When the screen turns on. We could consider going back up to the
62.5interval for responsiveness and compatibility. (Thats what I meant by "priority" )

While not connected is the key part I missed above.

I think I see what you're saying now.

  • When powered on, advertise at 62.5ms interval for 30 seconds then
    change the interval to 1.2 seconds until connected.

  • When a disconnect event is received, restart the two step sequence.

  • When NOT connected AND the screen is on, restart the sequence.

On power up, advertise aggressively for at least 30 seconds then switch
to a longer interval to conserve battery life. This fast/slow pattern
is designed to balance connection response time and battery life.

When a disconnect event is received restart the fast/slow pattern.

When a failed connect event is received, restart the fast/slow pattern.

When the screen is activated and ble is not connected, restart the fast/slow pattern.

This pattern is consistent with Apple's BLE developer standards (QA 1931).
@evergreen22
Copy link
Contributor Author

I've pushed the reworked advertising backoff scheme to PR #625. It now advertises agressively for at least 30 seconds then backs down to about one advertisement a second. It also restarts aggressive advertising under selected conditions. The chosen advertising intervals are consistent with Apple's developers guidelines for BLE and since we have a iOS app now that seems right.

I'm ready to remove the 'WIP' designation if a dev is willing to give it an 'approve' review.

Thanks to @geekbozu and @hubmartin for the feedback.

@kieranc
Copy link
Contributor

kieranc commented Sep 6, 2021

I'm ready to remove the 'WIP' designation if a dev is willing to give it an 'approve' review.

I'm not qualified to provide a review however I'm testing this PR (plus #621) and will report back once I have > 24h uptime

@evergreen22
Copy link
Contributor Author

evergreen22 commented Sep 6, 2021 via email

@JF002 JF002 added this to the 1.5.0 milestone Sep 6, 2021
@geekbozu
Copy link
Member

geekbozu commented Sep 7, 2021

Code looks great! I will actually pull some Power plots of this shortly so we can calculate a new "average" battery life!

@geekbozu
Copy link
Member

geekbozu commented Sep 8, 2021

So i have been running this as a "daily driver' for a very small sample size of 2 days and its working great. Pairing is way more consistent and much more responsive. I use a self compiled version of gadget-bridge with the Auto Reconnect on Out of range enabled and I have yet to notice BT going down/not coming back after those events either now. Where before it would sometimes take a good minute to actually connect.

Does not solve the 18Hour bug as i have come to call it, But hey still super nice!

@lman0
Copy link

lman0 commented Sep 10, 2021

hi, @evergreen22

i have tested your pull and it do advertising faster
but only once i reset the smartwatch ,
because some time after be connected , if it disconnect
and try for example to connect the pinetime to an Bluetooth hr app
the advertising is not working anymore ( i can't see the infinitime in the app but i do see the miwatch)

since it not possible to sknow if the advertising is indeed restarted
can you add an icon like
((-))
when it advertising in pinetime top bar?
it would show that it truly do advertising
and when connected it would show the bluethooth icon instead .

it would help other people , to help them unserstand that it do adversting..
(and if it not showing when it should , that would mean there is a effective bug /regression

thanks for your understanding

@geekbozu
Copy link
Member

geekbozu commented Sep 10, 2021

lman0 there is a bug with InfiniTime where ALL BLE stops working after about 18 hours of runtime.
There are a few of us actively trying to hunt this bug down as its quite frustrating, as far as InfiniTime is concerned it is still advertising. But the radio just stops actually sending data out.

As in such we have no easy way to know in firmware when this issue happens. As it would still show as advertising. Even thought bluetooth is broken.
Currently you can assume, If BLE is not broken, and you are not connected. We are advertising with this pr. As that is the goal of it!

@lman0
Copy link

lman0 commented Sep 10, 2021

lman0 there is a bug with InfiniTime where ALL BLE stops working after about 18 hours of runtime.

, @geekbozu, in my case , it's not in 18 h but
in mere 1-2h after i reset the pinetime , that the advertising stop working
(and since there is no icon that show that it's do advertising , i think there is a bug lurking in this pull)

that's why i wish that @evergreen22 add an icon when it's advertising

so i can reliably said that the icon is not here when it should
allowing me to help him resolve the bug

otherwise it will be word again word , and none wil be able to help properly , since there is no way to send log
by regular user.
and no visual hint (like an advertising icon) , that prove the pinetime do advertise

@lman0
Copy link

lman0 commented Sep 10, 2021

Currently you can assume, If BLE is not broken, and you are not connected. We are advertising with this pr. As that is the goal of it!

then i can say that , sadly , that this pull is not working properly on the long run but only a few times after reset
and it my word against yours
since
there is no visual way , on pinetime, to prove me or you @geekbozu (and @evergreen ) otherwise
that the advertising have truly restarted after a disconnect
(and a race condtion /bug have not made it not starting)

@lman0
Copy link

lman0 commented Sep 10, 2021

by the way , @evergreen22 , if like @geekbozu say , the bluetooth stop working after 18h of runtime ,
you still can show the icon if the uptime < 18h of runtime in code
so we can still be able to check the adversting is restarting properly in the time the ble is working

@evergreen22
Copy link
Contributor Author

evergreen22 commented Sep 10, 2021 via email

@geekbozu
Copy link
Member

Yeah...there is no way to know when we are advertising. In the code, on any connection drop or advertisement finish event finish we start advertising again.

evergreen22@22571d4#diff-c69ced976eada3d4c4a15fd5ce542651ff381e21e2bb8cee746ac198c8847f89R152
https://github.com/evergreen22/InfiniTime/blob/22571d4b384e40d647cd994202956f08ed32d925/src/components/ble/NimbleController.cpp#L160
https://github.com/evergreen22/InfiniTime/blob/22571d4b384e40d647cd994202956f08ed32d925/src/components/ble/NimbleController.cpp#L176

We check if the call to advertise works, When it does we continue. If it does not we just reset the watch completely. So we have no way of knowing if we are actually advertising or not. Because as far as InfiniTime is concerned we are advertising. So an indicator will show as advertising even when its not. Since we actually have no way of knowing what the radio it self is actually doing.

@lman0
Copy link

lman0 commented Sep 10, 2021

at very least @evergreen22 , you could make a icon showing when the "startAdvetising()" is launched , right?
(and not showing when it connected or stopped),

it should help regard less , to pin point use case that you may overlooked...
because in my case the pull seem not working properly

@lman0
Copy link

lman0 commented Sep 10, 2021

by the way , @geekbozu what you say should be true then with the blutooth icon when we are connected ,
yet it's still shown , so i think @evergreen22 that an advertising icon is still legit

@evergreen22
Copy link
Contributor Author

evergreen22 commented Sep 10, 2021 via email

@lman0
Copy link

lman0 commented Sep 10, 2021

how i can prove my , claim if i have no way to prove that the the advertising have been started?
again it's my word against yours .... and i have no visual way way to prove otherwise...
it's should not that hard of a dev of your calibre to add such icon ....

if you don't do it , so be it ,
but you @evergreen22 should not think that yours test are perfect
because even big corporation can overlook an use case in their unit test

once a dev , think his code or test are 'perfect' and reject any way to prove otherwise
then user can assume he will always say that the user is at fault....
and none people want to talk to such stubborn guy.

i pray that you will not become like that

@evergreen22
Copy link
Contributor Author

evergreen22 commented Sep 10, 2021 via email

@kieranc
Copy link
Contributor

kieranc commented Sep 10, 2021

at very least @evergreen22 , you could make a icon showing when the "startAdvetising()" is launched , right?
(and not showing when it connected or stopped),

You seem to understand this pretty well, it should be fairly trivial to display an icon, why don't you do it?

@lman0
Copy link

lman0 commented Sep 10, 2021

i'm an user @evergreen22 , not a dev , and i feel like talking to support that say their program is 'perfect"'
and if i want them to add an icon that show that the programm is connected
they say i need to prove them ithe code (that i don't understand) to found where is the bug , that prove the the program is disconnected
then they say : you can't found it , so we won't do it ....because our unit test prove use otherwise that it's not disconnect
(that is until the big chief , force them to do it because he , as well, found the bug)

@lman0
Copy link

lman0 commented Sep 10, 2021

at very least @evergreen22 , you could make a icon showing when the "startAdvetising()" is launched , right?
(and not showing when it connected or stopped),

You seem to understand this pretty well, it should be fairly trivial to display an icon, why don't you do it?

i think @kieranc that @evergreen22 is good dev since he tackle a problem know to be very difficult to resolve
that why i thinked that he could be able to do it as well

even more when he say that his unit test / use case are perfect ...

@kieranc
Copy link
Contributor

kieranc commented Sep 10, 2021

even more when they say that their unit test / use case are perfect ...

Who said that?

@lman0
Copy link

lman0 commented Sep 10, 2021

when he say , @kieranc , that his assert are working properly
and say i have to prove that in my use case , the advertising is not working
it's like he mean that his assert are perfect , and i have to prove him otherwise....

by the way , @evergreen22 say "Neither of my test units are restarting. No one else has reported their watch is restarting in the advertising function. "
when i didn't say , even once , that the watch is restarting by itself
but only that the advertising stop working after 1 -2 hour and i need to reset the watch to see it again

@kieranc
Copy link
Contributor

kieranc commented Sep 10, 2021

when he say that his assert are working properly
and say i have to prove that my use case , the advertising is not working
it's like he mean that his assert are perfect , and i have to prove him otherwise....

Its not 'his' assert, its an assert, one of hundreds in the code, and if it wasn't working as expected then many many other things would be broken.

by the way , @evergreen22 say "Neither of my test units are restarting. No one else has reported their watch is restarting in the advertising function. "
when i didn't say , even once , that the watch is restarting by itself
but only that the advertising stop working after 1 -2 hour and i need to reset the watch to see it again

He means he has 2 watches running code which should reboot the watch if advertising is not running (ie, the same approach as display an icon if its running/not) and no watches are rebooting, so as far as the code is concerned, the watch is advertising. We know the advertising is not working correctly, and while this pr doesn't fix it, it fixes other things which will hopefully make finding the issue with advertising easier.

@Riksu9000
Copy link
Contributor

I don't really know how this works, but obviously StartAdvertising() must have succeeded because of the asserts. The only other thing I can think of is if StartAdvertising() somehow doesn't get called. This could stop the advertising from working until reset if there are no new events that call it right? Previously waking the watch up would restart advertising. However with how BLE_GAP_EVENT_DISCONNECT and BLE_GAP_EVENT_ADV_COMPLETE are set up, it would at least seem like it wouldn't fail right?

when he say that his assert are working properly
and say i have to prove that my use case , the advertising is not working
it's like he mean that his assert are perfect , and i have to prove him otherwise....

StartAdvertising() hasn't changed significantly. It still works basically the same way as before, but with more asserts.

by the way , @evergreen22 say "Neither of my test units are restarting. No one else has reported their watch is restarting in the advertising function. "
when i didn't say , even once , that the watch is restarting by itself
but only that the advertising stop working after 1 -2 hour and i need to reset the watch to see it again

If it isn't restarting, that means StartAdvertising() is working.

@evergreen22
Copy link
Contributor Author

You seem to understand this pretty well, it should be fairly trivial to display an icon, why don't you do it?

As explained previously, the 'icon' would be on all the time.

@lman0 is taking this personally and I don't mean it that way. I assumed they were a programmer and would understand my ask, However, since @lman0 is not a programmer, I concede it is an unreasonable request and I apologize.

Thank you @lman0. You have inspired a new line of thought about troubleshooting the disconnects. My test builds enable the assertions. They are not enabled in the production build. You have caused me to wonder if they are enabled in nimble - I shall investigate - thank you.

@lman0
Copy link

lman0 commented Sep 10, 2021

thanks @evergreen22 i wish you good luck , in your investigation !
and thanks as well to @kieranc , @Riksu9000 and @geekbozu for them to trying to explain me what i didn't understood.
and thanks all of you for your patience when talking to me

@evergreen22
Copy link
Contributor Author

evergreen22 commented Sep 10, 2021 via email

@JF002
Copy link
Collaborator

JF002 commented Sep 11, 2021

@evergreen22 Thank you so much for this PR! Even if it doesn't fix all our BLE issues, it'll probably improve the reliability of the (re)connection to the phone 👍

I'll try to answer to some of the comments I've read in this review.

@lman0

@JF002 since we use nordic , why we don't use their stack? since from i can see it's open source (otherwise esprino would not be able to modifiy it and share it) , and would resolve some ble problem.

Nordic stack (SoftDevice) is not open source, but the SDK is. The SoftDevice is provided as a binary file (/components/softdevice/s132/hex/s132_nrf52_6.1.1_softdevice.hex) you have to flash at a specific location in flash. Then, the software (using the open source SDK) is able to call functions from this binary file using some linker script magic. This is a smart way to allow open source software to use a proprietary stack, as the software does not link directly with the proprietary code.

At the beginning, InfiniTime used the NRF SoftDevice as BLE stack, but I decided to swtich to NimBLE because I wanted InfiniTime to be a fully FOSS firmware ;) The switch was done in InfiniTime 0.5.

@evergreen22 @geekbozu Regarding fast/slow advertising, I think you chose the best compromise (fast advertising for 30s, then slow advertising). 👍

@lman0 Regarding the advertising icon, it could probably be useful to tell the user the watch is actively advertising and trying to the connect to a host device. Maybe some users will find this additional feedback useful, maybe others will just find it confusing... I don't really know. However, in this case, it won't be useful at all as our BLE issues cause inconsistencies between InfiniTime and the radio : InfiniTime (and NimBLE) really think that the watch is advertising while the radio outputs nothing into the air. Which mean that the advertising icon will be displayed even when advertising stopped working for any reason.
Anyway, I think this subject is closed, and I thank everyone for keeping a cordial and respectful tone in this discussion!

It's a bit late, I'll continue the review tomorrow !

@kieranc
Copy link
Contributor

kieranc commented Sep 12, 2021

I'm not qualified to provide a review however I'm testing this PR (plus #621) and will report back once I have > 24h uptime

FWIW, I've been running this PR for nearly a week and haven't noticed any regressions, it seems to reconnect automatically more reliably than before

@JF002 JF002 merged commit 9afa412 into InfiniTimeOrg:develop Sep 13, 2021
@JF002
Copy link
Collaborator

JF002 commented Sep 13, 2021

Here's a DFU file for testing this PR (built on 6356c7f).
pinetime-mcuboot-app-dfu-1.4.99.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants