Persistent subscriptions in flight msgs #1971
Also corrects the total in flight messages stat to reflect the number of messages in flight on the push clients.
Reproduction steps: These are the steps that I've found most reliably reproduce this issue.
Result:
If testing on the first commit of this PR, the […]
If testing on current master, the […]
Some of the connections will have in flight messages that just never complete processing.
Multiple connections is totally fine on a retry... in fact it's by design. Removing from all seems legit... "This can lead to the connection's buffer filling up with events it can never complete." How can it never complete them?! I am a bit confused on this part.
@gregoryyoung Sorry, I worded that badly. It's not that the messages would never get 'completed', more that they would never be removed from the client's outstanding queue.
The PersistentSubscription tracks all the outstanding messages, and handles them when they are acked/nacked/timed out.
Each PersistentSubscriptionClient (connection) keeps a list of the messages that it has sent to its client, and relies on the PersistentSubscription to tell it when they have been handled so they can be removed. (This list of outstanding messages is used to determine how many available slots there are in the client's buffer.)
In the case that I've described here, multiple connections have the same message, but the PersistentSubscription is only telling one of them to remove it from their list. This leaves the outstanding message on the other connections.
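A rough sketch of that relationship (Python, with hypothetical names chosen just for illustration; the actual server code differs):

```python
class PersistentSubscriptionClient:
    """One connection; tracks messages sent to its client but not yet settled."""
    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.outstanding = set()

    def available_slots(self):
        # Outstanding messages count against the client's buffer capacity.
        return self.buffer_size - len(self.outstanding)


class PersistentSubscription:
    """Tracks all outstanding messages and tells the owning client to
    drop a message once it has been acked/nacked/timed out."""
    def __init__(self, clients):
        self.clients = clients
        self.owner = {}  # message -> client it was last sent to

    def send(self, client, message):
        client.outstanding.add(message)
        self.owner[message] = client

    def handled(self, message):
        # Only the recorded owner is told to remove the message; if the
        # message meanwhile ended up on another client, that copy stays.
        self.owner.pop(message).outstanding.discard(message)
```

In this sketch, `handled` is the single point where a client's outstanding list shrinks, which is why a stale copy on another connection is never cleaned up.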
Ah this makes more sense.
That's a good catch, nice work 👍
After inspecting the event buffers, it appears that the root cause is not an event being present on two clients at the same time. The issue can manifest as follows:
Consider a scenario with 2 clients C1 and C2 connected to a persistent subscription with a message timeout of 0.5s and a single event E in the stream.
- Event E is pushed to C1
- C1 takes 1 second to process event E, then acknowledges it.
- During this 1-second period, event E will time out on the server side and be retried by the persistent subscription. Retrying removes event E from C1's buffer, and it may then be added again to either C1's or C2's buffer. Let's assume it's added to C2's buffer.
- The acknowledgement from C1 for event E is now received. The persistent subscription will attempt to remove event E from C1's buffer, but it's no longer there. Event E will thus stay in client C2's buffer forever.
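The four steps above can be simulated with a toy model (Python, hypothetical names; only the bookkeeping is modelled, not the timing):

```python
class Client:
    def __init__(self, name):
        self.name = name
        self.in_flight = set()  # events pushed to this client, not yet settled


def push(client, event):
    client.in_flight.add(event)


def retry(event, from_client, to_client):
    # Retrying moves the event: it is removed from the original client's
    # buffer and pushed again, possibly to a different client.
    from_client.in_flight.discard(event)
    to_client.in_flight.add(event)


def ack_buggy(client, event):
    # Buggy behaviour: only the acking client's buffer is cleaned up.
    client.in_flight.discard(event)


c1, c2 = Client("C1"), Client("C2")

push(c1, "E")          # E is pushed to C1
retry("E", c1, c2)     # E times out on the server and is retried on C2
ack_buggy(c1, "E")     # C1's late ack arrives; C1 has nothing to remove

print(c2.in_flight)    # {'E'} — E is stuck in C2's buffer forever
```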
The fix appears correct under the above scenario.
In terms of performance: although removing from all clients is a linear operation, I assume that the number of clients connected to a persistent subscription will usually be small.
What is the consequence of this event remaining in the buffer? Would this take up one of the 'in flight' messages or similar?
@samholder Yes, it would take up one of the 'in flight' messages on the connection. If enough events get stuck in this way and completely fill the buffer, the subscription connection will not be able to process any more events.
ok, thanks!
This is possibly related to issue #1392.
I'm keeping this PR in 2 commits for now, as the first commit makes the issue more obvious for testing.
The first commit in this PR:
- adds `outstandingMessagesCount` to the persistent subscription stats
- corrects `totalInFlightMessages` to be the total count of in flight messages on the push clients

Ideally, these two should be the same number; however, it is possible for them to get out of sync, which causes problems.
It appears that the original connection does not always remove an event when it is retried.
If another connection then handles and acks that event, it is only removed from the second connection.
The original connection will still report the event as "in flight" and subtract it from its capacity.
This can lead to the connection's buffer filling up with events it can never complete.
The fix that I've proposed in this PR is to always remove completed events from all connections.
I'm not sure if this is the correct fix, but I have not been able to reproduce the issue with the fix applied.
There is also likely to be an underlying issue that causes the events to be on multiple connections at once.
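As a sketch, the proposed fix amounts to broadcasting the removal to every connection rather than only the one that settled the event (Python, hypothetical names, continuing the toy model above):

```python
class Client:
    def __init__(self, name, buffer_size=10):
        self.name = name
        self.buffer_size = buffer_size
        self.in_flight = set()

    def free_slots(self):
        return self.buffer_size - len(self.in_flight)


def complete_event(clients, event):
    # Proposed fix: on ack/nack, remove the event from every connection's
    # outstanding list, not just the one that completed it. Stale copies
    # left behind by a retry are cleaned up as a side effect.
    for client in clients:
        client.in_flight.discard(event)


c1, c2 = Client("C1"), Client("C2")
c1.in_flight.add("E")   # stale copy left behind by a retry
c2.in_flight.add("E")   # copy that actually gets acked

complete_event([c1, c2], "E")
print(c1.free_slots(), c2.free_slots())  # 10 10 — no slots leak
```

Because `set.discard` is a no-op when the event is absent, removing from all clients is safe even for connections that never held the event.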