
Allow MLLP connections to be proactively monitored with is_closed?/1 #65

Conversation

@jgautsch (Contributor)

We have a simple service built around this library that translates HTTP API calls into MLLP messages.

On the receiving end of these messages, our peer sometimes hangs up. We want to reconnect as quickly as possible (and notify ourselves whenever the connection disconnects) instead of waiting until we attempt to send another message.

To do so, we need to introspect the status of a client connection. This PR adds a function to Client to do so.

Requested points of feedback/review:

  • Is checking the TCP connection status with recv(socket, _length = 0, _timeout = 1) appropriate? Does it create a possible race condition whereby checking the status accidentally intercepts bytes from the socket that should be read elsewhere? (A sketch of the kind of check we mean follows this list.)
  • Is it necessary to also add a similar is_closed? function to MLLP.TLS?
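
For what it's worth, here is a minimal sketch of the kind of check the first bullet describes, assuming a passive-mode socket and plain `:gen_tcp`; the module name and exact clauses are illustrative, not the code in this PR:

```elixir
defmodule ConnectionProbe do
  @moduledoc """
  Illustrative only -- not the PR's code. Probes a passive-mode TCP socket by
  asking for whatever bytes are available (length 0) with a 1 ms timeout.
  """

  def closed?(socket) do
    case :gen_tcp.recv(socket, 0, 1) do
      {:error, :closed} -> true
      {:error, :timeout} -> false  # nothing pending; the connection still looks alive
      {:ok, _bytes} -> false       # data was waiting -- a naive probe throws these bytes away
      {:error, _reason} -> true    # treat any other error as a dead connection
    end
  end
end
```

The `{:ok, _bytes}` clause is exactly the race the first bullet asks about: anything read there is data the rest of the client never sees.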

cc @mfos239

@starbelly (Contributor)

> We have a simple service built around this library that translates HTTP API calls into MLLP messages.
>
> On the receiving end of these messages, our peer sometimes hangs up. We want to reconnect as quickly as possible (and notify ourselves whenever the connection disconnects) instead of waiting until we attempt to send another message.

Is it possible you're not using the latest version? We do provide auto-connect features. If the client is truly closing the connection, we would automatically re-connect by default.

> To do so, we need to introspect the status of a client connection. This PR adds a function to Client to do so.
>
> Requested points of feedback/review:
>
> * Is checking the TCP connection status with `recv(socket, _length = 0, _timeout = 1)` appropriate? Does it create a possible race condition whereby checking the status accidentally intercepts bytes from the socket that should be read elsewhere?

There is a rub in here. Let's say you've sent a message but (seemingly) gotten no response; you now call is_closed? and read off a single byte. That byte may be the start of a reply, and in the current implementation in this PR it would get thrown away.

I should also note there are two recent PRs (#63 and #64), both of which deal with edge / corner cases possibly related to your case. In a nutshell, sometimes it's best to assume you've entered a bad state and hang up the phone :)

> * Is it necessary to also add a similar `is_closed?` function to `MLLP.TLS`?

It might be better to just put this on MLLP.Client rather than on the TCP / TLS transports, since we dynamically dispatch to the transport module anyway and both would behave the same way regardless.

@jgautsch (Contributor, Author)

> We do provide auto-connect features. If the client is truly closing the connection, we would automatically re-connect by default.

Unfortunately we're working with a connection peer that simply disappears rather than sending us a FIN or RST packet. I just tested with the latest on main branch, and the following reproduction/simulation steps do not result in the MLLP.Client attempting to reconnect until attempting to send a new message:

# in terminal 1:
iex> {:ok, r4090} = MLLP.Receiver.start(port: 4090, dispatcher: MLLP.EchoDispatcher)

# in terminal 2:
iex> {:ok, s1} = MLLP.Client.start_link({127,0,0,1}, 4090)
iex> MLLP.Client.is_connected?(s1)
true

# terminal 1: now ctrl+c to kill the MLLP.Receiver

# terminal 2
iex> MLLP.Client.is_connected?(s1)
true

The MLLP.Client still thinks it's connected because it never learned that the receiver is no longer listening. Proactively checking the connection allows us to initiate reconnect before we have a new message to attempt to send.

> There is a rub in here. Let's say you've sent a message but (seemingly) gotten no response; you now call is_closed? and read off a single byte. That byte may be the start of a reply, and in the current implementation in this PR it would get thrown away.

Taking a closer look, it looks like this isn't actually the case, at least for sync sends (I could certainly be wrong/misunderstanding, so your review is much appreciated!). A given MLLP.Client GenServer process will only handle one call at a time, so recv(socket, _length = 0, _timeout = 1) will never happen in between the client sending and receiving bytes from the connection in the following snippet (thus there's no opportunity to "intercept", right?):

# lib/mllp/client.ex

def handle_call({:send, message, options}, _from, state) do
    ...
    case state.tcp.send(state.socket, payload) do
      :ok ->
        # there's no opportunity to intercept bytes right here...
        ...
        case recv_ack(state, timeout) do

@starbelly (Contributor)

> > We do provide auto-connect features. If the client is truly closing the connection, we would automatically re-connect by default.
>
> Unfortunately we're working with a connection peer that simply disappears rather than sending us a FIN or RST packet. I just tested with the latest on main branch, and the following reproduction/simulation steps do not result in the MLLP.Client attempting to reconnect until attempting to send a new message:
>
> # in terminal 1:
> iex> {:ok, r4090} = MLLP.Receiver.start(port: 4090, dispatcher: MLLP.EchoDispatcher)
>
> # in terminal 2:
> iex> {:ok, s1} = MLLP.Client.start_link({127,0,0,1}, 4090)
> iex> MLLP.Client.is_connected?(s1)
> true
>
> # terminal 1: now ctrl+c to kill the MLLP.Receiver
>
> # terminal 2
> iex> MLLP.Client.is_connected?(s1)
> true

Right. That's expected, but I wasn't following you before. A FIN is actually sent, but since we're not polling the socket (we're not in any active mode) we don't get a nice little tcp_closed message from gen_tcp; it's instead on us.
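
For anyone following along, a rough sketch (not the library's code) of what active mode gives you: the controlling process receives the close as a message instead of having to call `recv` itself.

```elixir
# Illustrative only: connect in active mode and let the VM deliver socket events
# as messages, including the close notification when the peer's FIN arrives.
{:ok, socket} = :gen_tcp.connect({127, 0, 0, 1}, 4090, [:binary, active: true])

receive do
  {:tcp, ^socket, data} -> IO.inspect(data, label: "bytes from peer")
  {:tcp_closed, ^socket} -> IO.puts("peer closed the connection")
  {:tcp_error, ^socket, reason} -> IO.inspect(reason, label: "socket error")
end
```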

> The MLLP.Client still thinks it's connected because it never learned that the receiver is no longer listening. Proactively checking the connection allows us to initiate reconnect before we have a new message to attempt to send.

Correct, 100%. You will not know until you try to send again, which I suppose is problematic for your case.

> There is a rub in here. Let's say you've sent a message but (seemingly) gotten no response; you now call is_closed? and read off a single byte. That byte may be the start of a reply, and in the current implementation in this PR it would get thrown away.

> Taking a closer look, it looks like this isn't actually the case, at least for sync sends (I could certainly be wrong/misunderstanding, so your review is much appreciated!). A given MLLP.Client GenServer process will only handle one call at a time, so recv(socket, _length = 0, _timeout = 1) will never happen in between the client sending and receiving bytes from the connection in the following snippet (thus there's no opportunity to "intercept", right?):
>
> # lib/mllp/client.ex
>
> def handle_call({:send, message, options}, _from, state) do
>     ...
>     case state.tcp.send(state.socket, payload) do
>       :ok ->
>         # there's no opportunity to intercept bytes right here...
>         ...
>         case recv_ack(state, timeout) do

That is correct. I was thinking of an edge case, which is a moot point anyway, since a) we don't expose a recv function right now and b) we're about to merge a new default behaviour into main.

I'm tracking what you're saying now, though, and it sounds like you're trying to avoid some latency cost, which makes sense. I think I'm fine with adding this function; it probably should be moved to MLLP.Client though.

A slight twist on this might simply be to do this check as part of the maintain-reconnect timer polling (a rough sketch follows below). Then it would just do the "right thing", and ideally the user doesn't need to check anything. However, that may be worth deferring, as at this point we've just about re-implemented gen_tcp's active behaviour, which may be worth actually doing :)
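
As a rough illustration of that twist (the module, option names, and interval are invented for this sketch and are not MLLP.Client's implementation), the periodic check might look something like this:

```elixir
defmodule PollingClientSketch do
  @moduledoc "Illustrative only: reconnect proactively from a periodic timer."
  use GenServer

  @poll_interval 5_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # Crash if the first connect fails; real code would retry with backoff.
    {:ok, socket} = connect(opts)
    schedule_check()
    {:ok, %{socket: socket, opts: opts}}
  end

  @impl true
  def handle_info(:check_connection, state) do
    state =
      case :gen_tcp.recv(state.socket, 0, 1) do
        {:error, :timeout} -> state   # nothing pending; still connected as far as we can tell
        {:ok, _bytes} -> state        # data pending; a real client must buffer this, not drop it
        {:error, _reason} -> reconnect(state)
      end

    schedule_check()
    {:noreply, state}
  end

  defp schedule_check, do: Process.send_after(self(), :check_connection, @poll_interval)

  defp connect(opts), do: :gen_tcp.connect(opts[:address], opts[:port], [:binary, active: false])

  defp reconnect(state) do
    case connect(state.opts) do
      {:ok, socket} -> %{state | socket: socket}
      {:error, _reason} -> state      # keep the old state and try again on the next tick
    end
  end
end
```

Usage would be something like `PollingClientSketch.start_link(address: {127, 0, 0, 1}, port: 4090)`; the point is just that the timer, rather than the next send, is what notices the dead socket.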

I'd like to hear @vikas15bhardwaj 's thoughts on this as well.

@vikas15bhardwaj (Contributor)

@jgautsch @starbelly

Couple of points:

  • We already have an API function is_connected?; I think we should change this function to make sure the socket is connected, rather than adding two more functions.
  • Also, changing maintain_reconnect_timer/1 won't help because we only use it in case of errors. If we need to add this to MLLP.Client we would need to check this constantly. For now, updating the is_connected? function and letting the client check and reconnect early if required sounds good to me. If we change this implementation to use the gen_tcp active option then it would be handled in a different way.

WDYT?

@starbelly (Contributor)

> Couple of points:
>
> * We already have an API function `is_connected?`; I think we should change this function to make sure the socket is connected, rather than adding two more functions.

Agreed

> * Also, changing `maintain_reconnect_timer/1` won't help because we only use it in case of errors. If we need to add this to `MLLP.Client` we would need to check this constantly. For now, updating the `is_connected?` function and letting the client check and reconnect early if required sounds good to me. If we change this implementation to use the `gen_tcp` `active` option then it would be handled in a different way.

Agreed.

@mfos239 force-pushed the mfoster/bears-180-proactively-monitor-mllp-connection branch 2 times, most recently from 4a8f8ee to 20272ec on April 27, 2023 at 19:17
@mfos239 force-pushed the mfoster/bears-180-proactively-monitor-mllp-connection branch from 20272ec to 1c2036c on April 27, 2023 at 19:33
@bokner (Contributor) commented Jul 11, 2023

> Unfortunately we're working with a connection peer that simply disappears rather than sending us a FIN or RST packet. I just tested with the latest on main branch, and the following reproduction/simulation steps do not result in the MLLP.Client attempting to reconnect until attempting to send a new message:
>
> # in terminal 1:
> iex> {:ok, r4090} = MLLP.Receiver.start(port: 4090, dispatcher: MLLP.EchoDispatcher)
>
> # in terminal 2:
> iex> {:ok, s1} = MLLP.Client.start_link({127,0,0,1}, 4090)
> iex> MLLP.Client.is_connected?(s1)
> true
>
> # terminal 1: now ctrl+c to kill the MLLP.Receiver
>
> # terminal 2
> iex> MLLP.Client.is_connected?(s1)
> true

@jgautsch, PR #68 might (at least partially) address the issue. The socket in active state will poll the connection, which should result in the last line of your snippet returning false.
It would be great to get your feedback if you could test that PR against the real thing :-)

@mfos239 commented Jul 14, 2023

@bokner Nice work! With bokner:active_socket, the client seems to detect a disconnect and reconnect more reliably now. I managed to get this warning when a send is attempted while the connection to the peer is active and the peer hangs / stops responding rather than disconnecting:

[warning] Event: {:call, {#PID<0.604.0>, #Reference<0.2742583075.1029963780.188054>}} in state receiving. Unknown message received => :is_connected

@bokner (Contributor) commented Jul 15, 2023

@mfos239 Thank you, good catch! That was a bug, which has hopefully been fixed.
The test case:
6f6383f

@starbelly (Contributor)

@jgautsch Thank you once again for opening this PR and bringing these issues to our attention. I think you've found that Boris's refactor / rewrite of the client solves probably all of your issues, so I'm closing this in favor of #68; let's take any remaining bits of the conversation there 😀

Thank you! ❤️🧡💛💚💙💜

@starbelly closed this Jul 15, 2023
@jgautsch (Contributor, Author) commented Jul 15, 2023 via email
