Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

esp32 heap corruption (caused by brightness calculation?) #734

Closed
sansillusion opened this issue Feb 28, 2020 · 35 comments
Closed

esp32 heap corruption (caused by brightness calculation?) #734

sansillusion opened this issue Feb 28, 2020 · 35 comments
Labels
bug stale This issue will be closed soon because of prolonged inactivity

Comments

@sansillusion
Copy link

HI, thanks for the great code !
I run a esp8266 with 120 ws2812b and file server enabled without issues.
I aslo run a esp32 with 89 ws2812b anf file server enabled but get random reboots.
I have just tried to pull latest mster and recompile using cpu at 160mhz instead of 240 like before and flash at 40mhz instead of 80mhz. It did not fix the issue unfortunately but I had pulled arduino serial monitor just for fun and did not expect anything since I dont have debug enabled but got the following after a little while (mabe 20min)...

CORRUPT HEAP: Bad head at 0x3ffd9fcc. Expected 0xabba1234 got 0x3ffda740
abort() was called at PC 0x40083809 on core 0

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x4008a3f8:0x3ffaf780 0x4008a675:0x3ffaf7a0 0x40083809:0x3ffaf7c0 0x4008392d:0x3ffaf7f0 0x4011797f:0x3ffaf810 0x40113d11:0x3ffafad0 0x40108c8d:0x3ffafb20 0x40092031:0x3ffafb50 0x40083e82:0x3ffafb70 0x40083749:0x3ffafb90 0x4000bec7:0x3ffafbb0 0x401422a7:0x3ffafbd0 0x4014232b:0x3ffafbf0 0x40142373:0x3ffafc10 0x4014341a:0x3ffafc30 0x40143482:0x3ffafc50 0x400f31f5:0x3ffafc70 0x4013d82c:0x3ffafc90 0x4008f431:0x3ffafcc0

Rebooting...
ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:1044
load:0x40078000,len:10044
load:0x40080400,len:5872
entry 0x400806ac
Ada

I will try erasing flash just in case but I suspect it might be esp32 related.

@sansillusion
Copy link
Author

backtrace decoded :

Decoding 19 results
0x4008a3f8: invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 695
0x4008a675: abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 695
0x40083809: lock_acquire_generic at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/locks.c line 143
0x4008392d: _lock_acquire_recursive at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/locks.c line 171
0x4011797f: _vfiprintf_r at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdio/../../../.././newlib/libc/stdio/vfprintf.c line 860 (discriminator 2)
0x40113d11: fiprintf at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdio/../../../.././newlib/libc/stdio/fiprintf.c line 50
0x40108c8d: __assert_func at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdlib/../../../.././newlib/libc/stdlib/assert.c line 59 (discriminator 8)
0x40092031: multi_heap_free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/multi_heap_poisoning.c line 315
0x40083e82: heap_caps_free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/heap_caps.c line 232
0x40083749: _free_r at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/syscalls.c line 42
0x401422a7: mem_free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/mem.c line 176
0x4014232b: do_memp_free_pool at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/memp.c line 459
0x40142373: memp_free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/memp.c line 489
0x4014341a: tcp_close_shutdown at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/tcp.c line 1759
0x40143482: tcp_close at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/tcp.c line 1759
0x400f31f5: _tcp_close_api(tcpip_api_call_data*) at C:\Users\Steve\Documents\Arduino\libraries\AsyncTCP-master\src/AsyncTCP.cpp line 874
0x4013d82c: tcpip_thread at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/api/tcpip.c line 483
0x4008f431: vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c line 355 (discriminator 1)

@sansillusion
Copy link
Author

Still get reboots after erased flash. "Lake" effect seems to speed up crash.
I was on lake clicked "preview" preview loaded ok, clicked again to stop preview and it crashed.

@sansillusion
Copy link
Author

I have compiled without the file server option and it seems to be stable for now, will report back if it crashes.

@weini22
Copy link

weini22 commented Feb 28, 2020

I´m experiencing reboots with the provide esp32 binary since release 0.9.1. I had no such issues with 0.9 b1.

@sansillusion
Copy link
Author

Still crashed even with file server disabled. Just took a little longer.

@Aircoookie
Copy link
Owner

@sansillusion sorry for the issue!

Are you using MQTT or the Alexa integration? Sometimes the "Enable MQTT" checkbox in Sync settings is active even though no server is configured. For Alexa, changing the Invocation name from "Light" to something else may help. If you don't use it, I would even recommend completely disabling it.

@sansillusion
Copy link
Author

None of these settings are activated. I only use sync to sync the esp8266. I will compile a version with everything disabled just to see but I do plan on using other functionality in the future.

@weini22
Copy link

weini22 commented Mar 1, 2020

I now tested with MQTT disabled: Same issue, it reboots after 3-10h. I even reboots when it is switchedd to "off" in the GUI. Not sure, if it processes the animations in this case at all.

@weini22
Copy link

weini22 commented Mar 1, 2020

If you want to replicate the issue: The time until reboot seems to depend on the effect chosen. When I run the "Pride 2015", it reboots after 1-2h. When I use other, less resource intensive effects, it takes much longer.

@sansillusion
Copy link
Author

Still crarshes even when all features disabled in compile time. Just take a little longer. Looks like a potential memory leak or maybe multitasking issue. I am just an amateur so I can't really help except by testing.

@sansillusion
Copy link
Author

I find it funny that my esp8266 is not crashing even with more features enabled.

@Aircoookie
Copy link
Owner

Aircoookie commented Mar 2, 2020

I'll have to do some serious debugging to do on this one....
@sansillusion what is your esp32 arduino core version?

@sansillusion
Copy link
Author

I will have to check once I get back home.
I had pulled the latest master Saturday.

@sansillusion
Copy link
Author

The esp8266 is still going strong without crashes but the esp32 in unstable unfortunately. Current esp32 core version : v3.3.1-61-g367c3c09c

@sansillusion
Copy link
Author

I have tried multiple different configuration options and it always crashes.
from the decoded backtrace it looks like it would be AsyncTCP that is bugged but I am no programmer and dont really know where to look for a solution.

@sansillusion
Copy link
Author

I also got these in the console :
19:27:26.160 -> [W][AsyncTCP.cpp:630] _poll(): rx timeout 4
19:27:43.706 -> [W][AsyncTCP.cpp:630] _poll(): rx timeout 4

@sansillusion
Copy link
Author

These seem to have appeared when I enabled NTP time server.

@weini22
Copy link

weini22 commented Mar 19, 2020

Impressive, looks to me you identified the root cause. I disabled the NTP option and have now 10h uptime with continous usage. That was not possible before. I´ll keep you posted how this develops further on.

@Aircoookie
Copy link
Owner

Interesting, my ESP32 testing device has 26 days uptime with NTP disabled. Just enabled it and it synced the time successfully. Now we'll see if it reboots.

@sansillusion
Copy link
Author

sansillusion commented Mar 20, 2020 via email

@Aircoookie
Copy link
Owner

@sansillusion you are right. After enabling NTP yesterday it did not crash for 14 more hours. However I noticed that the free heap was 177776 so I rebooted and right after rebooting the heap was reported as 202028, a 24kB difference that can only be explained with a memory leak. Can you send me the link to that closed issue?

@sansillusion
Copy link
Author

sansillusion commented Mar 20, 2020 via email

@weini22
Copy link

weini22 commented Mar 20, 2020

I see, we´re not yet there. My esp32 rebootet after about 26h (with NTP disabled).

@weini22
Copy link

weini22 commented May 7, 2020

It looks to me like the issue is partially gone with 0.10!
If I disable the ntp server, I´ve 3 days uptime now. That was unthinkable with 0.9.1. However, when I´ve the ntp server activated, it still reboots after a few hours.
Free heap size is always around 194-196k on my esp32.

As I don´t need the ntp server, I´m more than happy personally!

@Aircoookie: Thanks a lot again for your great work!

@sansillusion
Copy link
Author

Ok I think I finally found what was my problem. Up till recently they where still crashing regularly. When I changed router to see if it was the issue I also changed some settings in one of my wled device but not the second one. Well turns out deactivating "Auto brightness adjustment" fixed my issue. No more reboot or hang in more than a week after these changes. I just changed the setting in my second device since it was still crashing/rebooting often. Will report back in a while to see if it worked on the second one too this would confirm that for me.

@sansillusion
Copy link
Author

The only side effect that gives me (it's not a problem at all) is that home assistant reports a current usage of 0ma. Now even running Hyperion for a week it still respond all the time. Btw thanks for your great work again ! You have made such a complete solution for lighting i love it !

@Aircoookie Aircoookie added the bug label Jul 10, 2020
@Aircoookie Aircoookie changed the title esp32 heap corruption esp32 heap corruption (caused by brightness calculation?) Jul 10, 2020
@Aircoookie
Copy link
Owner

Aircoookie commented Jul 10, 2020

Thank you for finding a link to the brightness limiter and awesome that you seem to be able to get rid of the problem!
Yeah, the current calculation is only active if the limiter is enabled. What current limit did you have the brightness adjuster set to when the issues occured?

@sansillusion
Copy link
Author

Go in settings/led preferences and uncheck "Enable automatic brightness limiter" I am noe at over a week without rebooting/hanging.

@sansillusion
Copy link
Author

@Aircoookie I had it set at 7000 since my power supply is 8A (kept it safe) but realised that with only 61 LED's I could run it at full brightness without risking going over. I decided to turn it off to try to maximize cpu cycles so Hyperion would run as smoothly as possible. Now no more reboot/hang and Hyperion is smooth as f%#&.

@sansillusion
Copy link
Author

Yep ! I have not had any crash with none of my esp32's ! So it does seem that the feature that calculate the current consumption is part of the issue. If you think you get the issue fixed just let me know I'll give it a test. But I am now extremely happy with WLED ! It's such a complete solution ! And now that I can have home assistant and Hyperion working without a single issue I can keep improving my setup and add more controllers to the mix ! Thanks a lot for your great work. Will do another donation soon. (Next month)

@Legsmaniac
Copy link

This is really strange. I don't understand how a simple current calculation can cause a crash. I've been having some crash and WiFi drop problems too so nevertheless, I decided to try this. Sadly, it didn't make any difference for me but what did was turning off Alexa and/or making sure the MCU is properly cleared prior to a clean upload of WLED. More research needed I think.

@weini22
Copy link

weini22 commented Jul 23, 2020

I fully agree with @Legsmaniac: My reboot issues are gone since I disabled the "send to Alexa" and "send to hue". I still have the current calculation active all the time and have now uptimes of 20+ days. I also experienced the situation that I had reboot issues even with Alexa disabled after changing and saving some settings. Simply changes something -> changing back -> saving again, solved that issue.
And I always had at least 185k of heap space available on my NodeMCU when I checked the setting.

My personal conclusions:

  1. I don´t believe that this is a heap issue, it looks more to me that a certain combination of features is causing the issue when they´re activated at the same time
  2. There seems to be a sporadic issue with saved settings are causing stability problems. It seems to be not based on the settings, but looks more like the saved representation is causing the issue.

Having a backup/restore option for the full set of configuration data would be a great support to further track this down.
However, I have a fully stable setup now with 2 NodeMCUs, so I´m more than happy.

@sansillusion
Copy link
Author

I am not saying that it's a definitive solution for everyone but for me disabling Alexa or Hue, in fact I tried disabling everything I could and nothing worked for me. But now that I have disabled the current limiting feature, everything else is enabled and the esp32's I am running are all stable and very responsive now. Before disabling the current limiting feature, I had some stiffness in Hyperion but now It's live and smooth.

@sansillusion
Copy link
Author

I must add that after disabling current limiting, my second esp crashed a few hours after but I had not rebooted the device after the change. Once rebooted it stayed connected till today (my fault I rebooted my router trying to find what was that new esp connected to my router, turns out it was my new sonoff switch lol)

@stale
Copy link

stale bot commented Nov 21, 2020

Hey! This issue has been open for quite some time without any new comments now. It will be closed automatically in a week if no further activity occurs.
Thank you for using WLED!

@stale stale bot added the stale This issue will be closed soon because of prolonged inactivity label Nov 21, 2020
@stale stale bot closed this as completed Nov 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug stale This issue will be closed soon because of prolonged inactivity
Projects
None yet
Development

No branches or pull requests

4 participants