CORRUPT HEAP with 0.14.1-b2, ESP32, maybe related to MQTT #3637

Open
1 task done
jwaeles opened this issue Jan 3, 2024 · 19 comments
Labels
bug external Not part of WLED itself - an external plugin/remote etc. waiting for feedback addition information needed to better understand the issue

Comments

@jwaeles

jwaeles commented Jan 3, 2024

What happened?

Hello,

All my 5 ESP32's running WLED_0.14.1-b2_ESP32.bin keep rebooting randomly, sometimes after only a few hours. They're all connected to my MQTT broker, with moderate traffic.

I have another ESP32 with WLED_0.14.1-b2_ESP32_audioreactive.bin, on that one MQTT isn't enabled, and since the upgrade from 0.14 to 0.14.1-b2, it's stable so far (3 days uptime).

I have managed to capture a stack trace, but I don't know how to decode it.

This stacktrace was generated from WLED_0.14.1-b2_ESP32.bin, at least the binary from install.wled.me

To Reproduce Bug

  • Install WLED_0.14.1-b2_ESP32.bin on ESP32
  • Led type: WS2812b
  • Connect to WiFi
  • Enable MQTT (not secure), connect to broker with login/password
  • Set an effect, for example "Chase" with 3 colors
  • After a few hours, WLED crashes and returns to static orange

Expected Behavior

No crash

Install Method

Binary from WLED.me

What version of WLED?

WLED 0.14.1-b2 (build 2312290)

Which microcontroller/board are you seeing the problem on?

ESP32

Relevant log/trace output

CORRUPT HEAP: Bad head at 0x3ffd7838. Expected 0xabba1234 got 0x3ffd7864
abort() was called at PC 0x4008eb39 on core 0

ELF file SHA256: 0000000000000000

Backtrace: 0x40089af8:0x3ffb5d10 0x40089e55:0x3ffb5d30 0x4008eb39:0x3ffb5d50
0x4008543a:0x3ffb5d70 0x40085805:0x3ffb5d90 0x4000bec7:0x3ffb5db0
0x4016def2:0x3ffb5dd0 0x4016df29:0x3ffb5df0 0x4014974d:0x3ffb5e10
0x4008b89e:0x3ffb5e40

Rebooting...
ets Jul 29 2019 12:21:46

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DOUT, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:1044
load:0x40078000,len:10124
load:0x40080400,len:5828
entry 0x400806a8
Ada

Anything else?

Thank you for your help!

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jwaeles jwaeles added the bug label Jan 3, 2024
@blazoncek
Collaborator

Please use debug build and exception decoder to help out with crash diagnosis.

@jwaeles
Author

jwaeles commented Jan 3, 2024

Please use debug build and exception decoder to help out with crash diagnosis.

Is there a debug build binary, or do I have to build it myself? I looked at the exception decoder, and it seems to be an Arduino plugin. But I also looked at building WLED, and it seems that's now only supported on PlatformIO? So how can I decode a PlatformIO binary with the Arduino IDE? I'm confused and need some pointers.

@blazoncek
Collaborator

You'll need to compile yourself to get a meaningful output from exception decoder.
Use something similar in your platformio.ini file:

[env:debug]
extends = env:esp32dev
monitor_filters = esp32_exception_decoder
build_flags = ${common.build_flags_esp32}
  -D WLED_DEBUG
  ; ... any other build flags

Then just use PIO's monitor tool.
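
For a trace that was captured outside PIO's monitor, the same decoding can be approximated by hand. The following is a minimal sketch, not part of WLED or PlatformIO: it pulls the program-counter half out of each `PC:SP` pair in a `Backtrace:` line and prints an addr2line command. The ELF path and the toolchain binary name are assumptions (PlatformIO's usual output location for an environment named `debug`, and the Xtensa ESP32 toolchain being on PATH). The addresses only resolve meaningfully against the ELF of the exact build that crashed.

```python
import re
import shlex

# Assumptions (not from the thread): the usual PlatformIO output path for an
# environment named "debug", and the Xtensa toolchain's addr2line on PATH.
ELF_PATH = ".pio/build/debug/firmware.elf"
ADDR2LINE = "xtensa-esp32-elf-addr2line"

# Each backtrace entry is printed as <program counter>:<stack pointer>.
PAIR = re.compile(r"(0x[0-9a-fA-F]{8}):(0x[0-9a-fA-F]{8})")


def backtrace_pcs(text):
    """Return the program-counter half of every PC:SP pair in the text."""
    return [pc for pc, _sp in PAIR.findall(text)]


def addr2line_command(text, elf=ELF_PATH):
    """Build (but do not run) an addr2line command line for the backtrace."""
    args = [ADDR2LINE, "-pfiaC", "-e", elf] + backtrace_pcs(text)
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    line = ("Backtrace: 0x40089af8:0x3ffb5d10 0x40089e55:0x3ffb5d30 "
            "0x4008eb39:0x3ffb5d50")
    print(addr2line_command(line))
```

The sketch deliberately stops at printing the command, so it can be checked before anything is run against a firmware file.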

@softhack007
Collaborator

also looked at building wled, and now it's only supported on platformIO?

Yes, you need VSCode+platformio for building + installing wled from source code. The KB has some guidance for getting started:

@jwaeles
Author

jwaeles commented Jan 3, 2024

Thank you, it's built with debug enabled and running, we should soon find out whether it produces any useful output. I do see some extra logging, so that's a good start...

@jwaeles
Author

jwaeles commented Jan 4, 2024

I had to disable the debug output, as I suspect it delayed execution a little and prevented the bug from reproducing. It ran for 24 hours and lost WiFi a bunch of times (another issue?), but I didn't see a crash.

Until I removed WLED_DEBUG.

Then, only 2 hours in, got this nice stacktrace

Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC      : 0x40157372  PS      : 0x00060930  A0      : 0x8014e915  A1      : 0x3ffb5da0
A2      : 0x3ffdae70  A3      : 0x00450008  A4      : 0x0000004e  A5      : 0x00000000
A6      : 0x3ffb52a8  A7      : 0x00000000  A8      : 0x00702903  A9      : 0x00702929
A10     : 0x00000000  A11     : 0x0000030a  A12     : 0x0070261f  A13     : 0x00000b38
A14     : 0x00060920  A15     : 0x00000000  SAR     : 0x00000019  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00450018  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff

ELF file SHA256: 0000000000000000

Backtrace: 0x4015736f:0x3ffb5da0 0x4014e912:0x3ffb5dd0 0x4011e33a:0x3ffb5df0 0x4014974d:0x3ffb5e10 0x4008b89e:0x3ffb5e40
  #0  0x4015736f:0x3ffb5da0 in tcp_output at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/tcp_out.c:1025
  #1  0x4014e912:0x3ffb5dd0 in tcp_recved at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/tcp.c:1765
  #2  0x4011e33a:0x3ffb5df0 in _tcp_recved_api(tcpip_api_call_data*) at .pio\libdeps\debug\AsyncTCP\src/AsyncTCP.cpp:1153
  #3  0x4014974d:0x3ffb5e10 in tcpip_thread at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/api/tcpip.c:483
  #4  0x4008b89e:0x3ffb5e40 in vPortTaskWrapper at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c:355 (discriminator 1)

Rebooting...
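
The register dump above can also be read mechanically. The sketch below is an illustration only: it parses the `NAME : 0x...` pairs and lists any general-purpose register whose value lies just below the faulting address in EXCVADDR. The helper names and the 0x100-byte window are arbitrary choices for this example. For this dump it reports A3 at an offset of 0x10, which suggests (but does not prove) that `tcp_output` read a field through an invalid pointer held in A3.

```python
import re

# EXCCAUSE 28 (0x1c) is LoadProhibited on Xtensa, matching the panic header.
LOAD_PROHIBITED = 0x1C

# Matches "NAME : 0x12345678" pairs as printed by the panic handler.
REG = re.compile(r"\b([A-Z][A-Z0-9]*)\s*:\s*0x([0-9a-fA-F]{8})")

# Register lines copied from the dump in the thread.
DUMP = """
PC      : 0x40157372  PS      : 0x00060930  A0      : 0x8014e915  A1      : 0x3ffb5da0
A2      : 0x3ffdae70  A3      : 0x00450008  A4      : 0x0000004e  A5      : 0x00000000
A6      : 0x3ffb52a8  A7      : 0x00000000  A8      : 0x00702903  A9      : 0x00702929
A10     : 0x00000000  A11     : 0x0000030a  A12     : 0x0070261f  A13     : 0x00000b38
A14     : 0x00060920  A15     : 0x00000000  SAR     : 0x00000019  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00450018  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff
"""


def parse_registers(dump):
    """Map register names (PC, A0..A15, EXCVADDR, ...) to integer values."""
    return {name: int(value, 16) for name, value in REG.findall(dump)}


def nearby_base_registers(regs, max_offset=0x100):
    """Return {register: offset} for A-registers just below EXCVADDR."""
    fault = regs["EXCVADDR"]
    return {
        name: fault - value
        for name, value in regs.items()
        if re.fullmatch(r"A\d+", name) and 0 <= fault - value <= max_offset
    }


if __name__ == "__main__":
    regs = parse_registers(DUMP)
    print("load fault:", regs["EXCCAUSE"] == LOAD_PROHIBITED)
    for name, offset in nearby_base_registers(regs).items():
        print(f"{name} + {offset:#x} == EXCVADDR")
```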

@blazoncek
Collaborator

Not in WLED code. Check your MQTT broker. There was an issue with an old Windows implementation of the Mosquitto broker in the past.

@blazoncek blazoncek added the external Not part of WLED itself - an external plugin/remote etc. label Jan 4, 2024
@jwaeles
Author

jwaeles commented Jan 4, 2024

Here is another one, still from the same build

CORRUPT HEAP: Bad head at 0x3ffde190. Expected 0xabba1234 got 0x3ffde608
abort() was called at PC 0x4008eb39 on core 0

ELF file SHA256: 0000000000000000

Backtrace: 0x40089af8:0x3ffb5d10 0x40089e55:0x3ffb5d30 0x4008eb39:0x3ffb5d50 0x4008543a:0x3ffb5d70 0x40085805:0x3ffb5d90 0x4000bec7:0x3ffb5db0 0x4016def2:0x3ffb5dd0 0x4016df29:0x3ffb5df0 0x4014974d:0x3ffb5e10 0x4008b89e:0x3ffb5e40
  #0  0x40089af8:0x3ffb5d10 in invoke_abort at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c:648
  #1  0x40089e55:0x3ffb5d30 in abort at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c:648
  #2  0x4008eb39:0x3ffb5d50 in multi_heap_free at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/heap/multi_heap_poisoning.c:321
  #3  0x4008543a:0x3ffb5d70 in heap_caps_free at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/heap/heap_caps.c:232
  #4  0x40085805:0x3ffb5d90 in _free_r at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/newlib/syscalls.c:42
  #5  0x4000bec7:0x3ffb5db0 in ?? ??:0
  #6  0x4016def2:0x3ffb5dd0 in _udp_pcb_deinit at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/mdns/mdns_networking.c:202
  #7  0x4016df29:0x3ffb5df0 in _mdns_pcb_deinit_api at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/mdns/mdns_networking.c:267
  #8  0x4014974d:0x3ffb5e10 in tcpip_thread at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/api/tcpip.c:483
  #9  0x4008b89e:0x3ffb5e40 in vPortTaskWrapper at /home/cschwinne/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c:355 (discriminator 1)

Rebooting...
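
The `CORRUPT HEAP` line itself carries a little information. The decoded trace points at `multi_heap_poisoning.c`, where each allocation is prefixed with a head canary (0xabba1234 according to the log) that is checked on free. The sketch below is an illustration under one stated assumption: that internal data RAM on the classic ESP32 spans 0x3FFAE000 to 0x3FFFFFFF. It parses the message and reports whether the value found in place of the canary looks like a pointer into that range. In both heap crashes in this thread it does, which is consistent with a use-after-free or double free, though it does not prove one.

```python
import re

HEAD_CANARY = 0xABBA1234  # the value the log says it expected
# Assumption: internal data RAM range on the classic ESP32.
DRAM_START, DRAM_END = 0x3FFAE000, 0x3FFFFFFF

MSG = re.compile(
    r"CORRUPT HEAP: Bad head at 0x([0-9a-fA-F]+)\. "
    r"Expected 0x([0-9a-fA-F]+) got 0x([0-9a-fA-F]+)"
)


def parse_bad_head(line):
    """Parse a 'Bad head' message; return None if the line does not match."""
    m = MSG.search(line)
    if m is None:
        return None
    at, expected, got = (int(g, 16) for g in m.groups())
    return {
        "at": at,
        "expected": expected,
        "got": got,
        # True when the overwritten canary looks like a data-RAM pointer.
        "got_is_dram_pointer": DRAM_START <= got <= DRAM_END,
        # Distance from the block header to the address found in it.
        "got_minus_at": got - at,
    }


if __name__ == "__main__":
    for line in (
        "CORRUPT HEAP: Bad head at 0x3ffd7838. Expected 0xabba1234 got 0x3ffd7864",
        "CORRUPT HEAP: Bad head at 0x3ffde190. Expected 0xabba1234 got 0x3ffde608",
    ):
        print(parse_bad_head(line))
```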

What confuses me is that WLED_0.13.3_ESP32 doesn't crash and doesn't disconnect from the WiFi. I have 8 different WLED 0.13.3 instances running on ESP32, and they all have an uptime of 14 days (since the last power outage). All 3-4 of my WLED 0.14 instances reboot, drop off the network, or glitch on the output after a couple of days. They all use the same WiFi access point and the same broker (mosquitto 2.0.15 on Linux, in Docker).

@jwaeles
Author

jwaeles commented Jan 4, 2024

I've just updated mosquitto to the latest version available on dockerhub, 2.0.18, we'll see if WLED still crashes

@blazoncek
Collaborator

Very likely related to #3641

@jwaeles
Author

jwaeles commented Jan 9, 2024

Since I updated mosquitto, I haven't seen any stack trace. However, I can't test for long: after a few hours, I always lose WiFi connectivity with 0.14*. I will check later whether someone has reported a bug on that. Once WiFi is lost, obviously there is no chance for the network packets to be malformed or misread, since none reach the IP stack.

However, the corrupted heap crash was occurring much earlier than the WiFi drop, so the stability issues are probably indeed related to the broker and/or to mDNS (which I saw mentioned in one of the stack traces).

@blazoncek
Collaborator

blazoncek commented Jan 9, 2024

If not yet, please use 0.14.1-b3
It may be relevant.

EDIT: WiFi issues are not WLED related but rather your network set-up/hardware.

@jwaeles
Author

jwaeles commented Jan 9, 2024

EDIT: WiFi issues are not WLED related but rather your network set-up/hardware.

I get your point of view, but like I said earlier, I have 8 WLED instances on 0.13.x that have been running for a very long time without any WiFi connectivity issues. My 4 ESP32s with 0.14.1-b2 all consistently drop off the WiFi. The ESP32s are sourced from 3 different vendors (some are standard ESP32 dev boards, 3 are QuinLED Dig-Uno/Quad, 3 are my own PCB with an ESP32 assembled by JLCPCB). Only those with 0.14 lose connectivity and can't recover until I power-cycle them.

I know my network has a very short DHCP lease time (15 minutes), left over from some older network experiments, but all the ESP32 & ESP8266 devices running WLED 0.13.x, plus Shelly, Amazon Echo, Sonos and others on this network, have happily stayed connected forever. The exception is every occurrence of WLED 0.14, which consistently loses WiFi after a few hours. I don't want to change the DHCP lease time until I get to the bottom of this issue.
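
To put the 15-minute lease in numbers, here is a small arithmetic sketch. It assumes the default DHCP timer fractions from RFC 2131 (renew at 50% of the lease, rebind at 87.5%); whether the ESP32's DHCP client uses exactly these defaults is an assumption, and nothing in this thread establishes the lease time as the cause of the disconnects.

```python
def dhcp_timers(lease_seconds, t1_fraction=0.5, t2_fraction=0.875):
    """Return (T1, T2) in seconds using the RFC 2131 default fractions."""
    return lease_seconds * t1_fraction, lease_seconds * t2_fraction


def renewals_per_day(lease_seconds):
    """Renewals per 24 hours if every renewal attempt at T1 succeeds."""
    t1, _t2 = dhcp_timers(lease_seconds)
    return 86400 / t1


if __name__ == "__main__":
    lease = 15 * 60  # the 15-minute lease mentioned above
    t1, t2 = dhcp_timers(lease)
    print(f"renew after {t1:.0f} s, rebind after {t2:.1f} s")
    print(f"{renewals_per_day(lease):.0f} renewals per day")
```

Under these assumptions each device renews every 7.5 minutes, so a device that drops off "after a few hours" has already completed a few dozen renewals.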

Access points are Ubiquiti UniFi 6 something; I can't remember the exact model, but they were pretty much top of the line 2 years ago, and my WiFi coverage has indeed been pretty solid since I installed them. The router/DHCP server is Netgate, also pretty much top of the line. Both the access points and the router were recently rebooted and seem snappy and happy.

I know you get a lot of users with weird setups coming to nag here, but please don't dismiss this so fast, because from my analysis, all clues point to the version of WLED running on the ESP32.

@softhack007
Collaborator

softhack007 commented Jan 9, 2024

Hi, the two crashes both happen deep inside the TCP and UDP core, without any WLED source code in the trace.

The second crash (with multi_heap_free()) could be a consequence of low memory and heap fragmentation. WLED 0.14.x needs more RAM than 0.13.x - due to added features.

To preserve memory, it usually helps to disable some "bells and whistles" - like

  • disable AP always (select "no connection after boot")
  • try with/without "disable wifi sleep"
  • disable mDNS, by clearing the mDNS address field
  • disable MQTT, Alexa, realtime UDP, and usermods (if any)
  • disable Arduino OTA (security & updates)
  • build with -DWLED_DISABLE_WEBSOCKETS -DWLED_DISABLE_ADALIGHT -DWLED_DISABLE_MQTT -DWLED_DISABLE_ESPNOW
  • Don't use HomeAssistant (really)
  • disable "Use global LED buffer" (LED preferences)

👉 Did you try with the latest beta 0.14.1-b3? We have fixed some use-after-free problems recently, so the latest beta might behave better.

As a last resort, you could wipe your device completely with esptool erase_flash, then re-install from the development environment. This sometimes improves wifi connectivity - don't know why but it sometimes helps. Make sure to back up config & presets before esptool erase_flash.
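
The backup step can be scripted before wiping. This is a minimal sketch, assuming the device serves its stored configuration and presets at `/cfg.json` and `/presets.json` over plain HTTP; those paths and the example address are assumptions to verify against your own device, and the function names are made up for this example.

```python
import urllib.request
from pathlib import Path

# Assumption: the device serves its stored files at these paths.
BACKUP_FILES = ("cfg.json", "presets.json")


def backup_urls(host):
    """Return the URLs to download for one device."""
    return [f"http://{host}/{name}" for name in BACKUP_FILES]


def backup(host, dest_dir="."):
    """Download each backup file into dest_dir, prefixed with the host."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in backup_urls(host):
        target = dest / f"{host}_{url.rsplit('/', 1)[1]}"
        with urllib.request.urlopen(url, timeout=10) as resp:
            target.write_bytes(resp.read())
        saved.append(target)
    return saved


# Example call (hypothetical address):
#   backup("192.168.1.50", "wled-backups")
```

Check that both files were written and are non-empty before running erase_flash, since the wipe cannot be undone.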

Finally, some wifi problems go away when using a newer espressif framework - buildenv esp32dev_V4_dio80.
The "V4" environment is still experimental for classic esp32, due to limited testing. It also increases firmware size by 300kB, so it might not always fit into 4MB flash.

@blazoncek
Collaborator

I did not dismiss you out of the blue.
I have 30+ WLED instances, from ESP01 to ESP32 (including variants C3 and S2) on various controllers including QuinLED, Shields and other pre-assembled devices. And except for one ESP01, 30cm from the AP (UAP-AC-M), all have zero WiFi issues and never lose connectivity. My network consists of, like yours, Ubiquiti UniFi and EdgeRouter.

So I will insist on WiFi or other network traffic issues which WLED cannot solve. For clarification: the network parts have not been modified since 0.12. The only addition was a signal strength fix for newer ESP32 models like C3, S2 & S3, which is a compile-time option.

@blazoncek
Collaborator

blazoncek commented Jan 9, 2024

FYI, having "Fast roaming" or BSS Transition enabled is known to cause issues with non-compliant hardware. WLED does not support those protocols.

@blazoncek
Collaborator

This sometimes improves wifi connectivity - don't know why but it sometimes helps. Make sure to back up config & presets before esptool erase_flash.

A newer bootloader may be needed, as it initialises hardware prior to firmware. If your devices have an old bootloader (pre 0.13), they may need a bootloader update.

@softhack007 softhack007 added the waiting for feedback addition information needed to better understand the issue label Jan 10, 2024
@jwaeles
Author

jwaeles commented Feb 9, 2024

As an update, I have used https://wled-install.github.io/ and flashed the version "Standard version 0.14.1 V4 (ESP IDF 4.4.3 based, experimental, should resolve reboot issues)" and so far it seems stable. I was losing connectivity or seeing reboots much much faster, and so far it's running 24h and still online, responsive and snappy.

@jwaeles
Author

jwaeles commented Feb 14, 2024

6 days uptime and going strong, I think this is it.

3 participants