Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential lwip / networking issue? #2654

Closed
pljakobs opened this issue Jul 24, 2023 · 18 comments · Fixed by #2658
Closed

Potential lwip / networking issue? #2654

pljakobs opened this issue Jul 24, 2023 · 18 comments · Fixed by #2658

Comments

@pljakobs
Copy link
Contributor

while trying to port my project to Sming5, I ran into a problem:

when building for the esp8266, accessing the config page on the SoftAP does not work.

How to reproduce:

  • Build and flash httpServer_ConfigNetwork to an esp8266
  • connect to the "SmingConfiguration" accesspoint
  • browse to http://192.168.4.1 and watch the page not load.

To verify:
change the code so it immediately connects to your local wlan and access the ip it gets there. watch the correct page load.

What I know so far:

@pljakobs
Copy link
Contributor Author

can really nobody reproduce this issue?

@mikee47
Copy link
Contributor

mikee47 commented Jul 30, 2023

I can confirm this behaviour. FTP works fine with this sample, and fetching 192.168.4.1/wifi-sprites.png works (1769 bytes).

@pljakobs
Copy link
Contributor Author

pljakobs commented Jul 30, 2023

thank you @mikee47, I was wondering if it was something in my setup.
I am thinking that, the way this presents, it might even be all the way down in the wifi code in esp-idf? Since I don't see packets being sent from the esp, nor the send buffer to be ever freed, it might be a bug in the soft AP code. However, that has worked for so so long.
Btw, I'm using esp-idf-4.4
Just checked it with idf 5.0 - still seems to be the same behaviour

@mikee47
Copy link
Contributor

mikee47 commented Jul 30, 2023

Not sure what's going on here... NB. ESP-IDF isn't used for ESP8266, it runs on the NON-OS SDK (Sming/Arch/Esp8266/Components/esp8266/sdk).

@mikee47
Copy link
Contributor

mikee47 commented Jul 30, 2023

@slaff Any thoughts?

@pljakobs
Copy link
Contributor Author

oh, my fault then, I thought of idf due to the build system using it's components

@slaff
Copy link
Contributor

slaff commented Aug 1, 2023

Any thoughts?

Have to check if this is reproducible in the Host arch... Will try to allocate some time later this week.

@pljakobs
Copy link
Contributor Author

pljakobs commented Aug 2, 2023

I would tend to think this happens below sming, given my debug results on the link above.
The problem only manifests when connected to the AP, and I've traced it back into lwip.
A few years ago, I built about 1500 LED controllers running with the original firmware and we had what probably was the same issue but not very often (maybe 5%?) and relatively spurious, so it would sometimes happen and other times it would not, making us think back then that it might be something about the HF environment. This now, at least here, seems to be 100% reproducible, so it may or may not be the same thing.
If it is in or below lwip, I wonder what has so drastically changed over the years. esp nonos should have been mostly frozen for the last few years, and lwip has been very stable itself.
in short: I have no good idea. I was thinking to add an espconn_regist_sentcb callback to see, if anything is being sent, but then again, I'm pretty sure the answer will be no.

@pljakobs
Copy link
Contributor Author

pljakobs commented Aug 3, 2023

so, while trying to add some debug to the actual lwip / espconn portion, I accessed the device yesterday with my tablet, that, I think, has never accessed this content before and - on the first attempt - got it. I was very puzzled, but it clearly did work.
I disconnected from the AP, reconnected and - same behavior as initially, no content.
Makes me think it might have something to do with caching, but then again, I don't see anything being sent, not even an unmodified cache-control header.

@pljakobs
Copy link
Contributor Author

pljakobs commented Aug 9, 2023

slowly digging a little deeper (and I think this is something I saw already early on) - the issue is, as assumed, with sending larger amounts of data, requiring tcp segmentation.

unsent_frame_len: 151

[Arch/Esp8266/Components/esp-open-lwip/esp-open-lwip/lwip/core/tcp_out.c:435] 
pcb->mss: 1240
optlen: 0
space: 1089
[Arch/Esp8266/Components/esp-open-lwip/esp-open-lwip/lwip/core/tcp_out.c:446] 
pcb unsent: 1239
oversize: 1239

[Arch/Esp8266/Components/esp-open-lwip/esp-open-lwip/lwip/core/tcp_out.c:427] 
unsent_frame_len: 1175

[Arch/Esp8266/Components/esp-open-lwip/esp-open-lwip/lwip/core/tcp_out.c:435] 
pcb->mss: 1240
optlen: 0
space: 65
[Arch/Esp8266/Components/esp-open-lwip/esp-open-lwip/lwip/core/tcp_out.c:446] 
pcb unsent: 215
oversize: 215

[Arch/Esp8266/Components/esp-open-lwip/esp-open-lwip/lwip/core/tcp_out.c:427] 
unsent_frame_len: 2199

[Arch/Esp8266/Components/esp-open-lwip/esp-open-lwip/lwip/core/tcp_out.c:435] 
pcb->mss: 1240
optlen: 0
space: 64577
[Arch/Esp8266/Components/esp-open-lwip/esp-open-lwip/lwip/core/tcp_out.c:446] 
pcb unsent: 581
oversize: 366

so, from what I can see, the last unsent frame is reported to be 2199 bytes long whith the mss being 1240 bytes which leads to an overflow in the available space for the frame.

I will have to dig further into why this oversize frame is not broken up, but this seems to be close to the issue.

tcp_out.c

  if (pcb->unsent != NULL) {
    u16_t space;
    u16_t unsent_optlen;

    /* @todo: this could be sped up by keeping last_unsent in the pcb */
    debug_i("\nunsent_frame_len: %i\n", pcb->unsent->len);
    for (last_unsent = pcb->unsent; last_unsent->next != NULL;last_unsent = last_unsent->next){
      debug_i("unsent_frame_len: %i\n",last_unsent->len);
    }

    /* Usable space at the end of the last unsent segment */
    unsent_optlen = LWIP_TCP_OPT_LENGTH(last_unsent->flags);
    space = pcb->mss - (last_unsent->len + unsent_optlen);
    debug_i("\npcb->mss: %i\noptlen: %i\nspace: %i",pcb->mss,unsent_optlen,space);
    /*
     * Phase 1: Copy data directly into an oversized pbuf.
     *
     * The number of bytes copied is recorded in the oversize_used
     * variable. The actual copying is done at the bottom of the
     * function.
     */
#if TCP_OVERSIZE
#if TCP_OVERSIZE_DBGCHECK
    /* check that pcb->unsent_oversize matches last_unsent->unsent_oversize */
    debug_i("\npcb unsent: %i\noversize: %i\n", pcb->unsent_oversize,last_unsent->oversize_left);
    LWIP_ASSERT("unsent_oversize mismatch (pcb vs. last_unsent)",
                pcb->unsent_oversize == last_unsent->oversize_left);
#endif /* TCP_OVERSIZE_DBGCHECK */

@pljakobs
Copy link
Contributor Author

pljakobs commented Aug 9, 2023

ok, I think I found the bug:
in tcp_out.c, space is the amount of space left in the current segment. It is declared as

u16_t space;

however, later, there's a condition (space > 0) that does not make sense for a uint, also, in my debugging it clearly shows that it wraps around when the bug hits.
Declaring it instead as

int16_t space;

fixes the bug as far as I can see.

pull request here: lwip-tcpip/lwip#18

@mikee47
Copy link
Contributor

mikee47 commented Aug 9, 2023

Did a manual comparison with more recent v2 lwip, found the offiical fix here lwip-tcpip/lwip@8e8571d. This works for me.

@mikee47
Copy link
Contributor

mikee47 commented Aug 9, 2023

esp-open-lwip is pretty old so there may be other bugs lurking...

@mikee47
Copy link
Contributor

mikee47 commented Aug 9, 2023

@mikee47
Copy link
Contributor

mikee47 commented Aug 9, 2023

@pljakobs Good find!

@pljakobs
Copy link
Contributor Author

pljakobs commented Aug 9, 2023

whoa, this has been around since 2016?
is sming dependent on the old code, or can we pull in the more recent stuff? While finding this was fun, debugging it was not (albeit I gained quite some respect for lwip, amazingly fully featured and flexible for such a small platform)

slaff pushed a commit to slaff/esp-open-lwip that referenced this issue Aug 10, 2023
slaff pushed a commit to slaff/Sming that referenced this issue Aug 10, 2023
Original fix is here: lwip-tcpip/lwip@8e8571d

Issue spotted and documented here: SmingHub#2654

PR submitted to upstream: pfalcon/esp-open-lwip#10
@pljakobs
Copy link
Contributor Author

looking at this fix, it seems that "space" is still declared as unsigned, and while this

 LWIP_ASSERT("inconsistent oversize vs. space", oversize_used <= space);

makes sure that it is not decremented to <0, in line 488, there's still a comparison if space is greater than zero. I have not fully looked at this, but there still seems to be an edge case issue if space should ever wrap around, although I think it's all guarded by ASSERTS.

at least, it would be much better readable (and potentially safe) to explicitly test space for 0 here:

    if ((pos < len) && (space > 0) && (last_unsent->len > 0)) {

as in

    if ((pos < len) && !(space == 0) && (last_unsent->len > 0)) {

the same is probably true for last_unsent, but that might be a stylistic choice.

@mikee47
Copy link
Contributor

mikee47 commented Aug 10, 2023

whoa, this has been around since 2016? is sming dependent on the old code, or can we pull in the more recent stuff? While finding this was fun, debugging it was not (albeit I gained quite some respect for lwip, amazingly fully featured and flexible for such a small platform)

You can use lwip version 2 make ENABLE_CUSTOM_LWIP=2. This is still old though, from 2017.

slaff added a commit that referenced this issue Aug 10, 2023
Original fix is here: lwip-tcpip/lwip@8e8571d

Issue spotted and documented here: #2654

PR submitted to upstream: pfalcon/esp-open-lwip#10
xeonqq pushed a commit to xeonqq/Sming that referenced this issue Sep 11, 2023
Original fix is here: lwip-tcpip/lwip@8e8571d

Issue spotted and documented here: SmingHub#2654

PR submitted to upstream: pfalcon/esp-open-lwip#10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants