Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gnrc_tcp: gnrc_tcp_recv() never generates -ECONNABORTED #17896

Open
benpicco opened this issue Apr 2, 2022 · 7 comments
Open

gnrc_tcp: gnrc_tcp_recv() never generates -ECONNABORTED #17896

benpicco opened this issue Apr 2, 2022 · 7 comments
Assignees
Labels
Area: network Area: Networking Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Comments

@benpicco
Copy link
Contributor

benpicco commented Apr 2, 2022

Description

The connection timeout (_sched_connection_timeout()) is only scheduled inside functions and removed when exiting the function, e.g in gnrc_tcp_recv().

That means that if the user timeout supplied is smaller than the connection timeout, the connection timeout will never be generated.

This is a problem if the user timeout is set to a low value for interactive applications.
Those can never detect if the connection is lost.

Expected results

If there was no interaction with the remote within CONFIG_GNRC_TCP_CONNECTION_TIMEOUT_DURATION_MS, the connection must be terminated.

Actual results

The connection is kept open indefinitely.

Steps to reproduce the issue

Consider the telnet server:

while (1) {
    res = sock_tcp_read(client, rx_buf, sizeof(rx_buf), 50);
    if (res == -ETIMEDOUT) {
        /* no character received, try again */
        continue;
    }

    if (res < 0) {
        /* break the loop, terminate the connection */
        break;
    }

To ensure interactivity there is a low user timeout (otherwise the sock_tcp_read() call will block until the whole rx_buf is filled). If the remote connection is lost, the connection will be terminated to listen for a new connection, otherwise try to read more bytes.

For easier reproducibility I lowered CONFIG_GNRC_TCP_CONNECTION_TIMEOUT_DURATION_MS to 10s.

I ran examples/telnet_server on same54-xpro. The board has an Ethernet interface, so loss of network connection can be forced by disconnecting the cable.

  • connect to the board with e.g telnet fe80::fec2:3dff:fe23:22df%eno1
  • disconnect the Ethernet cable
  • close the telnet application
  • wait ~10s (or whatever you have set for CONFIG_GNRC_TCP_CONNECTION_TIMEOUT_DURATION_MS)
  • connect the ethernet cable again
  • try to connect again with e.g. telnet fe80::fec2:3dff:fe23:22df%eno1
    • the connection can not be established as the server is still connected to the stale socket

Versions

@benpicco benpicco added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: network Area: Networking labels Apr 2, 2022
@brummer-simon
Copy link
Member

I'll take a look into this

@brummer-simon
Copy link
Member

Hi @benpicco,

After taking a look into the code this behavior works as intended and is inline with the TCP RFC. In TCP, it is valid behavior if there is silence on the wire for long periods of time. There is no mechanism intended to check if the endpoints are available outside of the execution time of the given TCP functions.

Receive acts only on incoming data, it does not send anything. For a receive call, it is therefore not possible to determine if the remote endpoint is available. Each time a user calls receive, it implied that there was or will be a data transmission in the near future this leads to the following behaviors:

  1. No Timeout (non-block): Return immediately. It is implied that the connection is still available.
  2. User-Timeout: A user specified timeout. If this times out, it is implied that the connection is still available.
  3. No Timeout: Wait until there is data. This call returns after the connection timeout to prevent infinite blocking. It is implied that if this timeout fires, the remote Endpoint is not available anymore.

This basically means: If you either receive non-blocking or with a user-timeout, it is the users responsibility to figure out if the remote endpoint is still there. One simple way to figure this out, is a call to send() with an empty payload. This causes the remote Endpoint to send an acknowledge as in reply. If this acknowledge was received (and send therefore does not timeout), we know that the endpoint is available.

That being said, after taking a look into the gnrc_tcp code base, send returns immediately if the payload is empty. From my point of view, this is the real Issue here and this should be solved. Any thoughts on that?

@benpicco
Copy link
Contributor Author

benpicco commented Apr 3, 2022

I tried with LWIP (#17899) and it manages to detect that the remote disconnected after ~10s.
It is however somehow broken in different ways (only sends data after data has been received).
There seem to be some inconsistencies with the two TCP stacks.

@brummer-simon
Copy link
Member

It might be possible to accumulate timeouts between multiple receive attempts and to check if the accumulated value is greater than the connect timeout, but I need to think this through if this is a legit way to solve this.

@benpicco Can LWIP detect a disconnect if you read non-blocking? Might be interesting.

@benpicco
Copy link
Contributor Author

benpicco commented Apr 3, 2022

Can LWIP detect a disconnect if you read non-blocking? Might be interesting.

Yes.

--- a/sys/net/application_layer/telnet/telnet_server.c
+++ b/sys/net/application_layer/telnet/telnet_server.c
@@ -204,7 +204,7 @@ static void *telnet_thread(void *arg)
         uint8_t is_option = 0;
         while (1) {
             _acquire();
-            res = sock_tcp_read(client, rx_buf, sizeof(rx_buf), SOCK_TCP_TIMEOUT_MS);
+            res = sock_tcp_read(client, rx_buf, sizeof(rx_buf), SOCK_NO_TIMEOUT);
             _release();
             if (res == -ETIMEDOUT) {
                 continue;
2022-04-03 20:26:22,970 # RIOT telnet example application
2022-04-03 20:26:22,984 # ╔═══════════════════════════════════════════════════╗
2022-04-03 20:26:22,989 # ║telnet is entirely unencrypted and unauthenticated.║
2022-04-03 20:26:22,994 # ║Do not use this on public networks.                ║
2022-04-03 20:26:23,008 # ╚═══════════════════════════════════════════════════╝
2022-04-03 20:26:23,013 # Iface ET0 HWaddr: fc:c2:3d:23:22:df Link: up State: up
2022-04-03 20:26:23,015 #         Link type: wired
2022-04-03 20:26:23,020 #         inet6 addr: fe80:0:0:0:fec2:3dff:fe23:22df scope: link
2022-04-03 20:26:23,024 # All up, awaiting connection on port 23
2022-04-03 20:26:26,518 # Local shell disabled> fe80::fec2:3dff:fe23:22df connected
# disconnect ethernet cable
2022-04-03 20:26:37,994 # disconnected

@brummer-simon
Copy link
Member

Interesting. I am reading the TCP RFC and as far as I can see the timeout handling here is up to the implementation. Therefore the above approach seems to be valid. I'll try to come up with a solution although the next weeks are busy :/.

Thanks for finding this.

@benpicco
Copy link
Contributor Author

benpicco commented Dec 1, 2023

Looks like TCP Keep-Alive packets are a thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: network Area: Networking Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

No branches or pull requests

2 participants