Battery drains because of infinite retries in case of failures #1

ThomasFarstrike · 2023-09-02T07:20:16Z

Throughout the piggy code, there are several places where an operation is retried indefinitely, and this can drain the battery if the operation keeps failing.

Examples:

wifi signal issue: if the access point has very poor signal, the ESP32 will keep trying to connect, forever
wifi misconfigurations: if the user configured a wrong wifi SSID or password, it will also keep trying to connect
wifi issue on ESP32: the ESP32 has good, but not great, wifi, which from time to time fails to connect, no matter how long it tries, often with "Reason: 4 - ASSOC_EXPIRE" or "Reason: 2 - AUTH_EXPIRE", sometimes with other errors like 4WAY_HANDSHAKE_TIMEOUT. After a restart, it usually works fine.
lnbits protocol: if the server is down, or sends some invalid response, our HTTPS handling code can go in an infinite loop
lnbits replies: sometimes, the lnbits server might accept the TCP connection, without replying, resulting in an infinite wait
update checker: if for some reason the server doesn't reply, it will hang
(possibly more)

To fix these issues, better error handling of these specific cases would be good, where possible. For example, if the wifi credentials are wrong, the user should be notified on the screen.
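As an illustration of bounding the retries listed above, here is a minimal host-side sketch of a retry helper that gives up after a fixed number of attempts and backs off between them. All names (`retry_bounded`, `RetryResult`, the `wait_ms` callback) are hypothetical and not taken from the piggy code; on the device, `attempt` would wrap something like a wifi connect or an HTTPS request, and `wait_ms` a delay or light sleep.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>

// Outcome of a bounded retry loop (hypothetical names, not from the piggy code).
enum class RetryResult { Succeeded, GaveUp };

// Calls `attempt` until it returns true, but at most `max_attempts` times.
// Between attempts, `wait_ms` is called with a delay that doubles each time
// and is capped at `max_delay_ms`.
RetryResult retry_bounded(const std::function<bool()>& attempt,
                          const std::function<void(uint32_t)>& wait_ms,
                          int max_attempts,
                          uint32_t first_delay_ms,
                          uint32_t max_delay_ms) {
    assert(max_attempts > 0);  // zero attempts would silently do nothing
    uint32_t delay_ms = std::min(first_delay_ms, max_delay_ms);
    for (int i = 0; i < max_attempts; ++i) {
        if (attempt()) {
            return RetryResult::Succeeded;
        }
        if (i + 1 < max_attempts) {  // no point in waiting after the last attempt
            wait_ms(delay_ms);
            // Double the delay without overflowing, capped at max_delay_ms.
            delay_ms = (delay_ms > max_delay_ms / 2) ? max_delay_ms : delay_ms * 2;
        }
    }
    return RetryResult::GaveUp;
}
```

When the helper returns `RetryResult::GaveUp`, the caller can show an error on the screen or go to sleep instead of looping forever.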

Additionally, to prevent anything from causing the ESP32 to get stuck forever, the ESP32 watchdog should be activated and programmed. This will ensure the board reboots in case of an exceptionally long action.

To do it properly and prevent infinite watchdog reboots from draining the battery, a watchdog reboot counter should be kept somewhere. This counter should be incremented whenever the reset cause is "watchdog", and reset to 0 on a regular (non-watchdog triggered) reboot.

If the watchdog reboot counter exceeds some configured value (example: 3) then the device should immediately go into a long sleep/hibernate (example: 6 hours) so that it wakes up at a time when whatever is causing the problem might hopefully be resolved.
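The counter policy described above can be sketched as a pure function, so it can be checked on a host machine. This is only an illustration: `ResetCause`, `BootAction`, `BootDecision` and `on_boot` are hypothetical names, and how the reset cause is read and where the counter is stored on the ESP32 are left out.

```cpp
#include <cassert>
#include <cstdint>

// Simplified reset causes; the real firmware would derive this from the chip.
enum class ResetCause { Normal, Watchdog };
enum class BootAction { RunNormally, LongSleep };

struct BootDecision {
    uint32_t counter;   // value to store back as the watchdog reboot counter
    BootAction action;  // what the firmware should do next
};

// Applies the policy: count consecutive watchdog reboots, reset the count on
// any other reboot, and request a long sleep once the count exceeds the limit.
BootDecision on_boot(ResetCause cause, uint32_t stored_counter,
                     uint32_t max_watchdog_reboots) {
    assert(max_watchdog_reboots > 0);
    if (cause != ResetCause::Watchdog) {
        return {0, BootAction::RunNormally};
    }
    // Saturate instead of wrapping around to 0 on overflow.
    uint32_t counter =
        (stored_counter < UINT32_MAX) ? stored_counter + 1 : stored_counter;
    if (counter > max_watchdog_reboots) {
        return {counter, BootAction::LongSleep};
    }
    return {counter, BootAction::RunNormally};
}
```

With a limit of 3, the first three watchdog reboots continue normally and the fourth requests the long sleep. Waking from that sleep counts as a regular boot in this sketch, so the counter starts again at 0.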

ThomasFarstrike · 2024-02-19T07:44:49Z

With the deadline of Feb 25, 2024 only 6 days away and to have this ready in time for Bitcoin Atlantis, I'm thinking of starting work on this issue. Unless there are others working on it? Please speak up!

This brings it in line with the issue description at #1 "If the watchdog reboot counter exceeds some configured value (example: 3) then the device should immediately go into a long sleep/hibernate (example: 6 hours) so that it wakes up at a time when whatever is causing the problem might hopefully be resolved."

ThomasFarstrike · 2024-02-24T17:21:02Z

https://github.com/LightningPiggy/lightning-piggy/tree/issue-1-battery-drain

ThomasFarstrike · 2024-02-25T13:09:00Z

I implemented the above.

Initially, I used the typical ESP32 "task" watchdog, but if that one triggers a restart, it's not knowable from rtc_get_reset_reason(). So I switched to the more unusual and convoluted "RTC watchdog", which is normally used by the lower-level ESP32 boot functions to detect hung boots, but can be repurposed.

More info: https://docs.espressif.com/projects/esp-idf/en/stable/esp32/api-reference/system/wdts.html

I also spent a lot of effort on getting it to work without writing any state (boot counters etc.) to the flash memory, because flash has a limited number of write cycles (possibly as low as 10k?). NVM and EEPROM were also out of the question, because on the ESP32 these are implemented in dedicated flash regions as well.

I found "noinit DRAM" in the docs, which is an area of RAM that is preserved across watchdog restarts BUT not across deep sleeps. Then I found RTC_DATA_ATTR memory, which is preserved across deep sleeps, but not across watchdog restarts. In the end, I used both of these concepts in tandem, moving state from one variable to the other at the right times, to achieve persistence across both kinds of restart.
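To make the two-region idea concrete, here is a small host-side model, not ESP32 code and not the actual implementation from the branch. It assumes exactly the behaviour described above: one region loses its contents on deep sleep, the other on a watchdog restart. The names (`Memory`, `simulate_restart`, `store_counter`, `restore_counter`, `kLost`) are hypothetical, and this sketch keeps both copies in sync on every write, which is only one possible way of "moving state at the right times".

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of the two memory regions (not ESP32 code).
enum class Wake { WatchdogReset, DeepSleepWake };

struct Memory {
    uint32_t noinit_copy;  // models "noinit DRAM": survives watchdog restarts only
    uint32_t rtc_copy;     // models RTC_DATA_ATTR: survives deep sleeps only
};

// Stands in for "contents can no longer be trusted" in this model.
constexpr uint32_t kLost = 0xDEADBEEF;

// Models what a restart does to each region, per the description above.
void simulate_restart(Memory& m, Wake w) {
    if (w == Wake::WatchdogReset) {
        m.rtc_copy = kLost;
    } else {
        m.noinit_copy = kLost;
    }
}

// Firmware side: every write goes to both regions.
void store_counter(Memory& m, uint32_t value) {
    m.noinit_copy = value;
    m.rtc_copy = value;
}

// Firmware side, early in boot: copy the surviving value over the lost one.
uint32_t restore_counter(Memory& m, Wake w) {
    if (w == Wake::WatchdogReset) {
        m.rtc_copy = m.noinit_copy;
    } else {
        m.noinit_copy = m.rtc_copy;
    }
    assert(m.noinit_copy == m.rtc_copy);  // both copies agree again
    return m.noinit_copy;
}
```

The key point is that the firmware must know which kind of restart happened before it decides which copy to trust, which is why a reliably detectable reset reason matters.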

More info: https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/memory-types.html

Then I also did a lot of wireless network testing with wrong access point names, wrong passwords, wrong encryption types, and after a lot of wireless event callback parsing, was able to convert those protocol-level issues into usable feedback for the user, on the display. This is a bit out of scope for this issue, but it should help the users debug the most common wifi issues more easily.
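A rough sketch of turning wifi disconnect reasons into display text could look like the following. It is an assumption-heavy illustration: the function name, the messages, and the display width are hypothetical; the numeric codes are the values I understand ESP-IDF's `wifi_err_reason_t` to use (2, 4 and 15 match the reasons quoted earlier in this issue) and should be checked against the headers; and which code a wrong password actually produces may vary by access point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical limit: longest text that fits on one line of the display.
constexpr std::size_t kMaxLineChars = 32;

// Maps a wifi disconnect reason code to a short hint for the user.
// Codes are assumed to follow ESP-IDF's wifi_err_reason_t numbering.
std::string wifi_reason_to_message(uint8_t reason) {
    std::string msg;
    switch (reason) {
        case 2:   msg = "WiFi auth timed out"; break;        // AUTH_EXPIRE
        case 4:   msg = "WiFi assoc timed out"; break;       // ASSOC_EXPIRE
        case 15:  msg = "Check WiFi password"; break;        // 4WAY_HANDSHAKE_TIMEOUT
        case 201: msg = "WiFi network not found"; break;     // NO_AP_FOUND
        case 202: msg = "WiFi auth failed"; break;           // AUTH_FAIL
        default:
            // Unknown reasons still show the raw code, which helps debugging.
            msg = "WiFi error, reason " + std::to_string(static_cast<int>(reason));
            break;
    }
    assert(msg.size() <= kMaxLineChars);  // every message must fit on one line
    return msg;
}
```

Keeping the mapping in one function makes it easy to extend as more failure modes show up in testing.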

ThomasFarstrike · 2024-02-26T13:23:09Z

This is ready and deployed in v2.0.0 in the webinstaller.

ThomasFarstrike added the "bug" label Sep 2, 2023

ThomasFarstrike closed this as completed Feb 26, 2024
