[BUG]: tcp_client_endpoint must unset is_sending_ during restart #668

joeyoravec · 2024-04-10T15:24:37Z

vSomeip Version

v3.4.10

Boost Version

1.82

Environment

Android and QNX

Describe the bug

I've used 3.1.20 and 3.3.8 successfully, but starting with 3.4.10 I'm seeing an occasional failure on a two-node network. Each node is running routingmanagerd, communicating over TCP.

Most of the time the system works properly, but when the failure occurs:

tcpdump shows Service Discovery packets working properly between the nodes in terms of: offer, subscribe, and subscribeack
netstat shows that a TCP socket is established properly from routingmanagerd on nodeA to routingmanagerd on nodeB
SOME/IP packets are transmitted from routingmanagerd on nodeA and received by routingmanagerd on nodeB
SOME/IP packets are not transmitted from routingmanagerd on nodeB

Because data flows in only one direction over the TCP socket, the end user observes that:

notifications (e.g. attribute updates) are successfully transmitted from service on nodeA and received by client on nodeB because the packets flowed properly in this direction
method calls from client on nodeB to service on nodeA get a REMOTE_ERROR because the packets did not flow in this direction

In this situation we were able to use strace and confirm that nodeB routingmanagerd was not even making a sendto() call on the file descriptor of the TCP socket.

I have not yet observed this behavior on older vsomeip versions 3.1.20 or 3.3.8, only on the newer 3.4.10. Note the capi and boost versions also vary across these observations.

Reproduction Steps

I reproduce this behavior with:

vsomeip 3.4.10
capicxx-someip-runtime 3.2.3r8
capicxx-core-runtime 3.2.3.r7
boost 1.82

My two nodes are:

nodeA == QNX using a replacement, completely custom boost-asio reactor based on ionotify
nodeB == Android effectively unmodified stock

I've already ruled out my custom QNX boost-asio reactor because I've observed this problem on both nodes, in both directions. I've separately seen nodeA (QNX) and nodeB (Android) each get "stuck" in the same way, where routingmanagerd will receive but not transmit. It was helpful to learn that two machines running totally different operating systems both reproduce the problem.

The reproduction rate is very low. I cannot reproduce on-demand on my desk.

Expected behaviour

No response

Logs and Screenshots

No response


joeyoravec · 2024-05-04T00:13:28Z

See linked pull request with a proposed solution. After adding some debug prints, I concluded that it's possible for is_sending_ to be true across restart() of a TCP client endpoint:

routingmanagerd: tcp_client_endpoint receive_cbk restarting.
routingmanagerd: [TCP-DEBUG:cei::shutdown_and_close_socket_unlocked] is_sending_=0
routingmanagerd: [TCP-DEBUG:tcei::wait_until_sent] Calling restart
routingmanagerd: tce::restart: dropping message: remote:10.6.0.3:30510 (30fd): [0402.0001.018e] size: 23
routingmanagerd: [TCP-DEBUG:cei::connect_cbk] not calling send_queued, is_sending_=1

This is a problem later in connect_cbk() when:

auto its_entry = get_front();
if (its_entry.first) {
    is_sending_ = true;
    strand_.dispatch(std::bind(&client_endpoint_impl::send_queued,
            this->shared_from_this(), its_entry));
    VSOMEIP_WARNING << __func__ << ": resume sending to: "
            << get_remote_information();
}

there's nothing queued, so there's no reason to call send_queued, and no handler will ever clear the flag. The train logic then remains blocked from sending. From the user's perspective, the client endpoint keeps receiving notifications but never transmits anything again.

It's necessary to clear this flag during restart() when the queue is drained.
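
To illustrate the stuck state and the proposed fix, here is a minimal toy model. It is not vsomeip code: `toy_client_endpoint`, `clear_flag_on_restart_`, `start_send()`, `send_cbk()` and `messages_sent_after_restart()` are hypothetical names made up for this sketch, and the real endpoint is asynchronous and considerably more involved. The model only assumes that a send is started when no other send is in flight, that the write completion handler is what normally clears `is_sending_`, and that `restart()` drops the queue before that handler runs.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Toy model only (hypothetical, NOT the vsomeip implementation).
class toy_client_endpoint {
public:
    explicit toy_client_endpoint(bool clear_flag_on_restart)
        : clear_flag_on_restart_(clear_flag_on_restart) {}

    // Queue a message; start a write only if none is in flight.
    void send(const std::string& msg) {
        queue_.push_back(msg);
        if (connected_ && !is_sending_) {
            start_send();
        }
    }

    // Stand-in for the write completion handler: the normal place
    // where the flag is cleared again.
    void send_cbk() {
        assert(is_sending_ && !queue_.empty());
        queue_.pop_front();
        is_sending_ = false;
        if (!queue_.empty()) {
            start_send();
        }
    }

    // Connection reset: queued messages are dropped. With the proposed
    // fix, the in-flight flag is cleared as well.
    void restart() {
        connected_ = false;
        queue_.clear();
        if (clear_flag_on_restart_) {
            is_sending_ = false;
        }
    }

    // Reconnect: resume sending only if nothing is marked as in flight.
    void connect_cbk() {
        connected_ = true;
        if (!is_sending_ && !queue_.empty()) {
            start_send();
        }
    }

    std::size_t sent_count() const { return wire_.size(); }

private:
    void start_send() {
        is_sending_ = true;
        wire_.push_back(queue_.front()); // pretend the bytes hit the socket
    }

    bool clear_flag_on_restart_;
    bool connected_ = false;
    bool is_sending_ = false;
    std::deque<std::string> queue_;
    std::vector<std::string> wire_;
};

// Returns how many messages reach the wire after a restart that
// interrupts an in-flight send (send_cbk never runs for it).
std::size_t messages_sent_after_restart(bool clear_flag_on_restart) {
    toy_client_endpoint ep(clear_flag_on_restart);
    ep.connect_cbk();
    ep.send("request-1"); // write in flight, is_sending_ == true
    ep.restart();         // queue dropped before the completion handler runs
    ep.connect_cbk();     // connection re-established
    const std::size_t before = ep.sent_count();
    ep.send("request-2");
    return ep.sent_count() - before;
}
```

In this model, `messages_sent_after_restart(false)` yields 0 because the stale flag blocks every later send, while `messages_sent_after_restart(true)` yields 1. That matches the observed symptom (receive works, transmit never resumes) under the assumptions stated above.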

joeyoravec added the bug label Apr 10, 2024

joeyoravec linked a pull request May 4, 2024 that will close this issue

unset is_sending_ during tce restart #689

Draft

joeyoravec changed the title ~~[BUG]: vsomeip 3.4.10 occasionally stops transmitting~~ [BUG]: tcp_client_endpoint must unset is_sending_ during restart May 4, 2024

joeyoravec mentioned this issue May 13, 2024

[BUG]: vsomeip slow to restart due to VSOMEIP_MAX_TCP_SENT_WAIT_TIME #701

Open
