Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LWIP TCP socket close - disconnecting fix #10476

Merged
merged 1 commit into from May 21, 2019

Conversation

Projects
None yet
7 participants
@tymoteuszblochmobica
Copy link
Contributor

commented Apr 25, 2019

Description

Currently LWIP TCP socket close doesn't care TCP close handshaking is properly completed.
If network interface disconnect follows immediately after TCP socket_close and TCP FSM is in
ESTABLISHED state this causes too early eth/wifi driver stop.
In result FIN ACK sequence is being corrupted and TCP FSM stucks in FIN_WAIT_1 state until LWIP netif is deleted. This is a reason of failing TCP tests due to lack of memory because TCP PCB mempool is not freed on time.

This fix check if socket is TCP and FSM is in ESTABLISHED state.
Then give extra time for connection close handshaking until TIME_WAIT state or timeout occurs.

Pull request type

[x ] Fix
[ ] Refactor
[ ] Target update
[ ] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Reviewers

@SeppoTakalo
@mikaleppanen
@kjbracey-arm

Release Notes

@ciarmcom ciarmcom requested review from kjbracey-arm, mikaleppanen, SeppoTakalo and ARMmbed/mbed-os-maintainers Apr 25, 2019

@ciarmcom

This comment has been minimized.

Copy link
Member

commented Apr 25, 2019

@SeppoTakalo

This comment has been minimized.

Copy link
Contributor

commented Apr 26, 2019

@mikaleppanen @kjbracey-arm Any opinions about this?

For me, this sounds like a dirty workaround. We are going to do a busy loop, where non-blockiness is expected.

This surely fix the issue, but I wonder if there is any more elegant solution?

do {
wait_ms(5);
check_limit--;
} while (tcp != NULL && tcp->state < TIME_WAIT && check_limit);

This comment has been minimized.

Copy link
@kjbracey-arm

kjbracey-arm Apr 26, 2019

Contributor

This is potentially racey - the PCB that tcp is pointing at could be deallocated/reused while you're waiting, because you called delete.

It seems that lwIP does have netconn_shutdown - use that prior to the wait, and then keep monitoring s->conn->pcb.tcp. (I'm guessing that can become null due to the shutdown):

Something like:

if (NETCONNTYPE_GROUP(s->conn->type) == NETCONN_TCP) {
    netconn_shutdown(s->conn, true, true);
    while (s->conn->pcb.tcp && s->conn->pcb.tcp->state < TIME_WAIT && check_limit) {
          wait();
    }
}

Still got a minor/theoretical thread-safety issue on polling s->conn->pcb.tcp though - not sure about how we lock against the tcpip thread. Is there a callback we could get to know when the state changes?

Alternatively, have you considered configuring SO_LINGER in lwIP? Even if we don't expose the option via NSAPI's setsockopt, you could probably get lwIP to give this wait behaviour by setting it based on your config option.

I think you'd have to poke the option flag and linger timeout value directly in, imitating the code in lwip_getsockopt_impl, and then mark the socket blocking before the delete to activate it.

@kjbracey-arm

This comment has been minimized.

Copy link
Contributor

commented Apr 26, 2019

@SeppoTakalo - feels a bit icky to me too, but probably not that bad in practice. Don't forget that lwip's connect is also blocking when it shouldn't be...

And many off-board stacks probably do stall for a while in their "close" operations.

This is at least controllable via an option, and doesn't preclude adding more direct control later like public SO_LINGER setsockopt API.

@0xc0170 0xc0170 added needs: work and removed needs: review labels Apr 29, 2019

@0xc0170

This comment has been minimized.

Copy link
Member

commented May 2, 2019

@tymoteuszblochmobica Please update

@tymoteuszblochmobica

This comment has been minimized.

Copy link
Contributor Author

commented May 2, 2019

I checked that

if (NETCONNTYPE_GROUP(s->conn->type) == NETCONN_TCP) {
netconn_shutdown(s->conn, true, true);
while (s->conn->pcb.tcp && s->conn->pcb.tcp->state < TIME_WAIT && check_limit) {
wait();
}
}

doesn't work due to :

netconn_shutdown also calls lwip_netconn_do_close_internal which sets

conn->pcb.tcp to NULL

So it doesn't provide useful information about PCB tcp FSM and waiting loop exits immediately.

Unfortunately there is no callback from tcp FSM state change.

Making socket blocking and enabling LINGER option also cannot work without LWIP code modification.
SO_LINGER requires unsent or unacked data to trigge tcp_pooling
if ((conn->linger >= 0) && (conn->pcb.tcp->unsent || conn->pcb.tcp->unacked)) {

I this case there is no unsent and unacked data and tcp_shutdown sends FIN to server and immediately returns.

Maybe my solution proposal is ugly but i wanted to keep LWIP code unchanged.

@kjbracey-arm

This comment has been minimized.

Copy link
Contributor

commented May 3, 2019

Okay, that shutdown implementation is unfortunate. As the comment in "netconn_close" says "shutting down both ends is the same as closing". :( Why???

You could just shutdown write and then poll - that is what actually is important in terms of the TCP protocol and state machine - it sends the FIN. Shutting down read is an assertion that you don't expect any more data (just a FIN) and makes it send RST if they do (to indicate what they sent has been ignored). That's less critical - it doesn't directly affect the state machine, except in that "drop with reset" case where more data arrives.

If data does come in while you're polling, because you only shutdown write, I believe it should still send a RST when you close without having read it. (rst_on_unacked_data in tcp_close_shutdown).

@tymoteuszblochmobica tymoteuszblochmobica force-pushed the tymoteuszblochmobica:closetcp branch from c7c5bc3 to 9bb4526 May 6, 2019

@tymoteuszblochmobica

This comment has been minimized.

Copy link
Contributor Author

commented May 6, 2019

If only shutdown write is selected netconn_shutdown(s->conn, false, true) works now as we expected.
It doesn't set conn->pcb.tcp to NULL until call lwip_netconn_do_close, and now waiting loop works fine.

@kjbracey-arm
Copy link
Contributor

left a comment

Glad that technique works. Should be fine. Bit more refinement.

@@ -100,6 +100,10 @@
"help": "Maximum number of retransmissions of SYN segments. Current default (used if null here) is set to 6 in opt.h",
"value": null
},
"tcp-close-timeout": {
"help": "Maximum number of 5ms intervals for TCP close handshaking timeout",

This comment has been minimized.

Copy link
@kjbracey-arm

kjbracey-arm May 7, 2019

Contributor

Make the config be milliseconds - exposing 5ms interval multiple is a bit weird. Even if you have 5ms granularity, you can subtract 5 from check_limit the loop (but make it signed).

Or I'd be quite happy with 1ms polling anyway; I think being more responsive to close - shutting down 4ms faster - is probably more of a power saving than only checking every 5 milliseconds.

if (NETCONNTYPE_GROUP(s->conn->type) == NETCONN_TCP && s->conn->pcb.tcp->state == ESTABLISHED) {
netconn_shutdown(s->conn, false, true);
while (s->conn->pcb.tcp && s->conn->pcb.tcp->state < TIME_WAIT && check_limit) {
wait_ms(5);

This comment has been minimized.

Copy link
@kjbracey-arm

kjbracey-arm May 7, 2019

Contributor

ThisThread::sleep_for(5) is better - wait_ms is on the way to deprecation.

Although, I wonder if you now get an event from lwIP when the state changes, so you can do an EventFlags::wait(TCP_CLOSE_TIMEOUT) rather than a loop. Probably no biggie.

@tymoteuszblochmobica tymoteuszblochmobica force-pushed the tymoteuszblochmobica:closetcp branch from 9bb4526 to c1230bd May 9, 2019

@tymoteuszblochmobica

This comment has been minimized.

Copy link
Contributor Author

commented May 9, 2019

Replaced wait_ms ThisThread::sleep_for(1 ms)

Tried to use callback for EventFlags eg. my_event_member::set(TCP_FLAG) and use my_event_member.wait_any(TCP_FLAG,timeout) instead of waiting loop.

Calling sequence:
tcp_input
TCP_EVENT_CLOSED( )
pcb->recv assigned recv_tcp

recv_tcp calls
sys_mbox_trypost(&conn->recvmbox ) -not used yet it requires more code modify
or
API_EVENT(conn->NETCONN_EVT_RCV_PLUS
netconn->callback assigned LWIP::socket_callback for setting flag my_event_member::set(TCP_FLAG)

It exits my_event_member.wait_any(TCP_FLAG only on timeout and LWIP::socket_callback is blocked (but its called from tcpip_thread context ).
Do i miss something?

@kjbracey-arm

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

Not quite following your comment. Sounds like you attempted what I suggested - setting an event flag in socket_callback. My hope was that the callback would be called on the state transition you cared about so you could do something like if (nc is TCP and finished) EventFlags::set(FLAG_TCP_FINISHED) in socket_callback.

That should work, but only if lwIP is actually sending an event at the right time. Maybe it isn't. If not, not worth messing with it.

@tymoteuszblochmobica tymoteuszblochmobica force-pushed the tymoteuszblochmobica:closetcp branch 2 times, most recently from 1806702 to 7fb87bd May 10, 2019

Fixed TCP connection close.
If TCP FSM is in ESTABLISHED state,  waits  for TCP close handshaking until TIME_WAIT
The purpose is to prevent eth/wifi driver stop and  FIN ACK corrupt.
This may happend if network interface disconnect follows immediately after socket_close.

@tymoteuszblochmobica tymoteuszblochmobica force-pushed the tymoteuszblochmobica:closetcp branch from 7fb87bd to 51610cc May 10, 2019

@tymoteuszblochmobica

This comment has been minimized.

Copy link
Contributor Author

commented May 10, 2019

After code cleanup and applying EventFlag again it started working as expected.
Probably code revert removed spurious changes used for tests.

@tymoteuszblochmobica

This comment has been minimized.

Copy link
Contributor Author

commented May 15, 2019

Kevin could You check is it OK?

@adbridge adbridge added needs: CI and removed needs: work labels May 17, 2019

@0xc0170

This comment has been minimized.

Copy link
Member

commented May 20, 2019

CI started

@mbed-ci

This comment has been minimized.

Copy link

commented May 20, 2019

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number : 1
Build artifacts

@0xc0170 0xc0170 merged commit 02eaad4 into ARMmbed:master May 21, 2019

26 checks passed

continuous-integration/jenkins/pr-head This commit looks good
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
jenkins-ci/build-ARM Success
Details
jenkins-ci/build-GCC_ARM Success
Details
jenkins-ci/build-IAR Success
Details
jenkins-ci/cloud-client-test Success
Details
jenkins-ci/dynamic-memory-usage Success
Details
jenkins-ci/exporter Success
Details
jenkins-ci/greentea-test Success
Details
jenkins-ci/mbed2-build-ARM Success
Details
jenkins-ci/mbed2-build-GCC_ARM Success
Details
jenkins-ci/mbed2-build-IAR Success
Details
jenkins-ci/unittests Success
Details
travis-ci/astyle Success!
Details
travis-ci/docs Success!
Details
travis-ci/doxy-spellcheck Success!
Details
travis-ci/events Success! Runtime is 8638 cycles.
Details
travis-ci/gitattributestest Success!
Details
travis-ci/include_check Success!
Details
travis-ci/licence_check Success!
Details
travis-ci/littlefs Success! Code size is 8448B.
Details
travis-ci/psa-autogen Success!
Details
travis-ci/tools-py2.7 Success!
Details
travis-ci/tools-py3.5 Success!
Details
travis-ci/tools-py3.6 Success!
Details
travis-ci/tools-py3.7 Success!
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.