
Allow MLLP connections to be proactively monitored with is_closed?/1 #65

Conversation

@jgautsch (Contributor)

We have a simple service built around this library that translates HTTP API calls into MLLP messages.

On the receiving end of these messages, our peer sometimes hangs up. We want to reconnect as quickly as possible (and notify ourselves whenever the connection disconnects) instead of waiting until we attempt to send another message.

To do so, we need to introspect the status of a client connection. This PR adds a function to Client to do so.

Requested points of feedback/review:

  • Is checking the TCP connection status with recv(socket, _length = 0, _timeout = 1) appropriate? Does it create a possible race condition whereby checking the status accidentally intercepts bytes from the socket that should be read elsewhere? (A sketch of the kind of check we mean follows this list.)
  • Is it necessary to also add a similar is_closed? function to MLLP.TLS?
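
For what it's worth, here is a minimal sketch of the kind of check the first bullet describes, assuming a passive-mode socket and plain `:gen_tcp`; the module name and exact clauses are illustrative, not the code in this PR:

```elixir
defmodule ConnectionProbe do
  @moduledoc """
  Illustrative only -- not the PR's code. Probes a passive-mode TCP socket by
  asking for whatever bytes are available (length 0) with a 1 ms timeout.
  """

  def closed?(socket) do
    case :gen_tcp.recv(socket, 0, 1) do
      {:error, :closed} -> true
      {:error, :timeout} -> false  # nothing pending; the connection still looks alive
      {:ok, _bytes} -> false       # data was waiting -- a naive probe throws these bytes away
      {:error, _reason} -> true    # treat any other error as a dead connection
    end
  end
end
```

The `{:ok, _bytes}` clause is exactly the race the first bullet asks about: anything read there is data the rest of the client never sees.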

cc @mfos239

@starbelly (Contributor)

> We have a simple service built around this library that translates HTTP API calls into MLLP messages.
>
> On the receiving end of these messages, our peer sometimes hangs up. We want to reconnect as quickly as possible (and notify ourselves whenever the connection disconnects) instead of waiting until we attempt to send another message.

Is it possible you're not using the latest version? We do provide auto-connect features. If the client is truly closing the connection, we would automatically re-connect by default.

> To do so, we need to introspect the status of a client connection. This PR adds a function to Client to do so.
>
> Requested points of feedback/review:
>
> * Is checking the TCP connection status with `recv(socket, _length = 0, _timeout = 1)` appropriate? Does it create a possible race condition whereby checking the status accidentally intercepts bytes from the socket that should be read elsewhere?

There is a rub in here. Let's say you've sent a message but (seemingly) gotten no response; you now call is_closed? and read off a single byte. That byte may be the start of a reply, and in the current implementation in this PR it would get thrown away.

I should also note there are two recent PRs (#63 and #64), both of which deal with edge / corner cases possibly related to your case. In a nutshell, sometimes it's best to assume you've entered a bad state and hang up the phone :)

> * Is it necessary to also add a similar `is_closed?` function to `MLLP.TLS`?

It might be better to just put this on MLLP.Client rather than on the TCP / TLS transports, since we dynamically dispatch to the transport module anyway and both would behave the same way regardless.

@jgautsch (Contributor, Author)

> We do provide auto-connect features. If the client is truly closing the connection, we would automatically re-connect by default.

Unfortunately we're working with a connection peer that simply disappears rather than sending us a FIN or RST packet. I just tested with the latest on main branch, and the following reproduction/simulation steps do not result in the MLLP.Client attempting to reconnect until attempting to send a new message:

# in terminal 1:
iex> {:ok, r4090} = MLLP.Receiver.start(port: 4090, dispatcher: MLLP.EchoDispatcher)

# in terminal 2:
iex> {:ok, s1} = MLLP.Client.start_link({127,0,0,1}, 4090)
iex> MLLP.Client.is_connected?(s1)
true

# terminal 1: now ctrl+c to kill the MLLP.Receiver

# terminal 2
iex> MLLP.Client.is_connected?(s1)
true

The MLLP.Client still thinks it's connected because it never learned that the receiver is no longer listening. Proactively checking the connection allows us to initiate reconnect before we have a new message to attempt to send.

> There is a rub in here. Let's say you've sent a message but (seemingly) gotten no response; you now call is_closed? and read off a single byte. That byte may be the start of a reply, and in the current implementation in this PR it would get thrown away.

Taking a closer look, it looks like this isn't actually the case, at least for sync sends (I could certainly be wrong/misunderstanding, so your review is much appreciated!). A given MLLP.Client GenServer process will only handle one call at a time, so recv(socket, _length = 0, _timeout = 1) will never happen in between the client sending and receiving bytes from the connection in the following snippet (thus there's no opportunity to "intercept", right?):

# lib/mllp/client.ex

def handle_call({:send, message, options}, _from, state) do
    ...
    case state.tcp.send(state.socket, payload) do
      :ok ->
        # there's no opportunity to intercept bytes right here...
        ...
        case recv_ack(state, timeout) do

@starbelly (Contributor)

> > We do provide auto-connect features. If the client is truly closing the connection, we would automatically re-connect by default.
>
> Unfortunately we're working with a connection peer that simply disappears rather than sending us a FIN or RST packet. I just tested with the latest on main branch, and the following reproduction/simulation steps do not result in the MLLP.Client attempting to reconnect until attempting to send a new message:
>
> # in terminal 1:
> iex> {:ok, r4090} = MLLP.Receiver.start(port: 4090, dispatcher: MLLP.EchoDispatcher)
>
> # in terminal 2:
> iex> {:ok, s1} = MLLP.Client.start_link({127,0,0,1}, 4090)
> iex> MLLP.Client.is_connected?(s1)
> true
>
> # terminal 1: now ctrl+c to kill the MLLP.Receiver
>
> # terminal 2
> iex> MLLP.Client.is_connected?(s1)
> true

Right. That's expected, but I wasn't following you before. A FIN is actually sent, but since we're not polling the socket (we're not in any active mode) we don't get a nice little tcp_closed message from gen_tcp; it's instead on us.
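
For anyone following along, a rough sketch (not the library's code) of what active mode gives you: the controlling process receives the close as a message instead of having to call `recv` itself.

```elixir
# Illustrative only: connect in active mode and let the VM deliver socket events
# as messages, including the close notification when the peer's FIN arrives.
{:ok, socket} = :gen_tcp.connect({127, 0, 0, 1}, 4090, [:binary, active: true])

receive do
  {:tcp, ^socket, data} -> IO.inspect(data, label: "bytes from peer")
  {:tcp_closed, ^socket} -> IO.puts("peer closed the connection")
  {:tcp_error, ^socket, reason} -> IO.inspect(reason, label: "socket error")
end
```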

> The MLLP.Client still thinks it's connected because it never learned that the receiver is no longer listening. Proactively checking the connection allows us to initiate reconnect before we have a new message to attempt to send.

Correct, 100%. You will not know until you try to send again, which I suppose is problematic for your case.

> There is a rub in here. Let's say you've sent a message but (seemingly) gotten no response; you now call is_closed? and read off a single byte. That byte may be the start of a reply, and in the current implementation in this PR it would get thrown away.

> Taking a closer look, it looks like this isn't actually the case, at least for sync sends (I could certainly be wrong/misunderstanding, so your review is much appreciated!). A given MLLP.Client GenServer process will only handle one call at a time, so recv(socket, _length = 0, _timeout = 1) will never happen in between the client sending and receiving bytes from the connection in the following snippet (thus there's no opportunity to "intercept", right?):
>
> # lib/mllp/client.ex
>
> def handle_call({:send, message, options}, _from, state) do
>     ...
>     case state.tcp.send(state.socket, payload) do
>       :ok ->
>         # there's no opportunity to intercept bytes right here...
>         ...
>         case recv_ack(state, timeout) do

That is correct. I was thinking of an edge case, which is a moot point anyway, since a) we don't expose a recv function right now and b) we're about to merge a new default behaviour into main.

I'm tracking what you're saying now, though, and it sounds like you're trying to avoid some latency cost, which makes sense. I think I'm fine with adding this function; it probably should be moved to MLLP.Client though.

A slight twist on this might simply be to do this check as part of the maintain-reconnect timer polling (a rough sketch follows below). Then it would just do the "right thing", and ideally the user doesn't need to check anything. However, that may be worth deferring, as at this point we've just about re-implemented gen_tcp's active behaviour, which may be worth actually doing :)
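
As a rough illustration of that twist (the module, option names, and interval are invented for this sketch and are not MLLP.Client's implementation), the periodic check might look something like this:

```elixir
defmodule PollingClientSketch do
  @moduledoc "Illustrative only: reconnect proactively from a periodic timer."
  use GenServer

  @poll_interval 5_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # Crash if the first connect fails; real code would retry with backoff.
    {:ok, socket} = connect(opts)
    schedule_check()
    {:ok, %{socket: socket, opts: opts}}
  end

  @impl true
  def handle_info(:check_connection, state) do
    state =
      case :gen_tcp.recv(state.socket, 0, 1) do
        {:error, :timeout} -> state   # nothing pending; still connected as far as we can tell
        {:ok, _bytes} -> state        # data pending; a real client must buffer this, not drop it
        {:error, _reason} -> reconnect(state)
      end

    schedule_check()
    {:noreply, state}
  end

  defp schedule_check, do: Process.send_after(self(), :check_connection, @poll_interval)

  defp connect(opts), do: :gen_tcp.connect(opts[:address], opts[:port], [:binary, active: false])

  defp reconnect(state) do
    case connect(state.opts) do
      {:ok, socket} -> %{state | socket: socket}
      {:error, _reason} -> state      # keep the old state and try again on the next tick
    end
  end
end
```

Usage would be something like `PollingClientSketch.start_link(address: {127, 0, 0, 1}, port: 4090)`; the point is just that the timer, rather than the next send, is what notices the dead socket.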

I'd like to hear @vikas15bhardwaj 's thoughts on this as well.

@vikas15bhardwaj (Contributor)

@jgautsch @starbelly

Couple of points:

  • We already have an API function is_connected?; I think we should change this function to make sure the socket is connected, rather than adding two more functions.
  • Also, changing maintain_reconnect_timer/1 won't help because we only use it in case of errors. If we need to add this to MLLP.Client we would need to check this constantly. For now, updating the is_connected? function and letting the client check and reconnect early if required sounds good to me. If we change this implementation to use the gen_tcp active option then it would be handled in a different way.

WDYT?

@starbelly (Contributor)

> Couple of points:
>
> * We already have an API function `is_connected?`; I think we should change this function to make sure the socket is connected, rather than adding two more functions.

Agreed

> * Also, changing `maintain_reconnect_timer/1` won't help because we only use it in case of errors. If we need to add this to `MLLP.Client` we would need to check this constantly. For now, updating the `is_connected?` function and letting the client check and reconnect early if required sounds good to me. If we change this implementation to use the `gen_tcp` `active` option then it would be handled in a different way.

Agreed.

@mfos239 force-pushed the mfoster/bears-180-proactively-monitor-mllp-connection branch 2 times, most recently from 4a8f8ee to 20272ec on April 27, 2023 at 19:17
@mfos239 force-pushed the mfoster/bears-180-proactively-monitor-mllp-connection branch from 20272ec to 1c2036c on April 27, 2023 at 19:33
@bokner (Contributor) commented Jul 11, 2023

> Unfortunately we're working with a connection peer that simply disappears rather than sending us a FIN or RST packet. I just tested with the latest on main branch, and the following reproduction/simulation steps do not result in the MLLP.Client attempting to reconnect until attempting to send a new message:
>
> # in terminal 1:
> iex> {:ok, r4090} = MLLP.Receiver.start(port: 4090, dispatcher: MLLP.EchoDispatcher)
>
> # in terminal 2:
> iex> {:ok, s1} = MLLP.Client.start_link({127,0,0,1}, 4090)
> iex> MLLP.Client.is_connected?(s1)
> true
>
> # terminal 1: now ctrl+c to kill the MLLP.Receiver
>
> # terminal 2
> iex> MLLP.Client.is_connected?(s1)
> true

@jgautsch, PR #68 might (at least partially) address the issue. The socket in active state will poll the connection, which should result in the last line of your snippet returning false.
It would be great to get your feedback if you could test that PR against the real thing :-)

@mfos239 commented Jul 14, 2023

@bokner Nice work! With bokner:active_socket, the client seems to detect a disconnect and reconnect more reliably now. I managed to get this warning when a send is attempted while the connection to the peer is active and the peer hangs / stops responding rather than disconnecting:

[warning] Event: {:call, {#PID<0.604.0>, #Reference<0.2742583075.1029963780.188054>}} in state receiving. Unknown message received => :is_connected

@bokner (Contributor) commented Jul 15, 2023

@mfos239 Thank you, good catch! That was a bug, which has hopefully been fixed.
The test case:
6f6383f

@starbelly (Contributor)

@jgautsch Thank you once again for opening this PR and bringing these issues to our attention. I think you've found that Boris's refactor / rewrite of the client solves probably all of your issues, so I'm closing this in favor of #68; let's take any remaining bits of the conversation there 😀

Thank you! ❤️🧡💛💚💙💜

@starbelly closed this Jul 15, 2023
@jgautsch (Contributor, Author) commented Jul 15, 2023 via email
