Persistent subscriptions in flight msgs #1971

Merged
merged 2 commits into master from persistent-subscriptions-in-flight-msgs on Aug 21, 2019

Conversation

hayley-jean
Member

This is possibly related to issue #1392.
I'm keeping this PR in 2 commits for now, as the first commit makes the issue more obvious for testing.

The first commit in this PR:

  • Adds outstandingMessagesCount to persistent subscription stats
  • Changes totalInFlightMessages to be the total count of in flight messages on the push clients.

Ideally, these two should be the same number; however, it is possible for them to get out of sync, which causes problems.


It appears that the original connection does not always remove an event when it is retried.
If another connection then handles and acks that event, it is only removed from the second connection.
The original connection will still report the event as "in flight" and subtract it from its capacity.

This can lead to the connection's buffer filling up with events it can never complete.


The fix that I've proposed in this PR is to always remove completed events from all connections.
I'm not sure whether this is the correct fix, but I have not been able to reproduce the issue with it applied.
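
To make the intent concrete, here is a minimal sketch of the idea behind the fix (hypothetical type and method names, not the actual EventStore.Core classes): when a message is completed, every push client is asked to drop it, so a stale copy cannot linger on a connection that was not the one to ack it.

using System;
using System.Collections.Generic;

// Hypothetical sketch only; the real PersistentSubscription/PersistentSubscriptionClient
// classes differ. The point is that completion removes the event from all clients.
class PushClientSketch
{
    private readonly HashSet<Guid> _inFlight = new HashSet<Guid>();

    public void Push(Guid eventId) => _inFlight.Add(eventId);

    // Returns false (and does nothing) if this client never had the event
    public bool Remove(Guid eventId) => _inFlight.Remove(eventId);
}

class SubscriptionSketch
{
    private readonly List<PushClientSketch> _clients = new List<PushClientSketch>();

    public void MarkCompleted(Guid eventId)
    {
        // Remove the completed event from every connection, not only the one that acked it
        foreach (var client in _clients)
            client.Remove(eventId);
    }
}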

There is also likely to be an underlying issue that causes the events to be on multiple connections at once.

  • Is there any case where having the same event on multiple connections would be expected/desirable?
  • What should happen in the cases where an event is on two connections, and one connection acks it?

Also correct the total in flight messages stat to the messages in flight
on the push clients.
@hayley-jean
Member Author

Reproduction steps:

These are the steps that I've found most reliably reproduce this issue.
Even then, it's still not guaranteed. If you don't see the issue after a reasonable period, it sometimes helps to create a fast subscriber that acks events immediately to drain the queue, and then try again.

  1. Create a persistent subscription with these settings:
 connection.CreatePersistentSubscriptionAsync("test-stream", "test-sub",
                PersistentSubscriptionSettings.Create()
                .CheckPointAfter(TimeSpan.FromSeconds(31))
                .MaximumCheckPointCountOf(30)
                .MinimumCheckPointCountOf(1)
                .ResolveLinkTos()
                .PreferRoundRobin()
                .StartFromBeginning()
                .WithBufferSizeOf(500)
                .WithLiveBufferSizeOf(500)
                .WithMaxRetriesOf(30)
                .WithReadBatchOf(20)
                .WithMessageTimeoutOf(TimeSpan.FromSeconds(30))
                .Build(),
                userCredentials).Wait();
  2. Create 3 "Retry Clients" that subscribe but let the events time out:
var subscription = connection.ConnectToPersistentSubscription("test-stream", "test-sub",
    (s, e) => {
        // Do not ack: let the message time out on the server
        Console.WriteLine($"Timing out event {e.OriginalEventNumber}");
        return;
    },
    SubscriptionDropped, userCredentials, bufferSize, false);
  3. Create 1 "Stable Client" that acks the events after a second:
var subscription = connection.ConnectToPersistentSubscription("test-stream", "test-sub",
    (s, e) => {
        Console.WriteLine($"Processing event {e.OriginalEventNumber}");
        Thread.Sleep(1000);
        s.Acknowledge(e);
    },
    SubscriptionDropped, userCredentials, bufferSize, false);
  4. Write around 600 events to test-stream.

  5. Wait some time, and check the stats periodically:

curl -i http://localhost:2113/subscriptions/test-stream/test-sub/info -H "accept:application/json" -u admin:changeit

Result:

If testing on the first commit of this PR, the outstandingMessagesCount and totalInFlightMessages stats should get out of sync.
The number of outstanding messages ends up being less than the total in flight messages.

If testing on current master, totalInFlightMessages should end up being less than the actual sum of the connections' inFlightMessages counts:

"totalInFlightMessages": 8,
"outstandingMessagesCount": 6,

Some of the connections will have in flight messages that just never complete processing.
If a connection fills up with these in flight messages, then the connection will not receive any more events.

@gregoryyoung
Contributor

Having the same event on multiple connections is totally fine on a retry... in fact, it's by design.

Removing from all seems legit...

"This can lead to the connection's buffer filling up with events it can never complete."

How can it never complete them?! I am a bit confused on this part.

@hayley-jean
Member Author

hayley-jean commented Aug 6, 2019

@gregoryyoung Sorry, I worded that badly. It's not that the messages would never get 'completed', more that they would never be removed from the client's outstanding queue.

The PersistentSubscription tracks all the outstanding messages, and handles them when they are acked/nacked/timed out.

Each PersistentSubscriptionClient (connection) keeps a list of the messages that it has sent to its client, and relies on the PersistentSubscription to tell it when they have been handled so they can be removed. (This list of outstanding messages is used to determine how many available slots there are in the client's buffer)

In the case that I've described here, multiple connections have the same message, but the PersistentSubscription only tells one of them to remove it from its list.
This leaves the outstanding message on the other connections, with no way for them to remove it in the future.
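
As a rough illustration of this bookkeeping (hypothetical names; the actual PersistentSubscriptionClient differs), the connection's free capacity is derived from its outstanding list, so an entry that is never confirmed permanently costs a buffer slot:

using System;
using System.Collections.Generic;

// Rough sketch of the per-connection bookkeeping described above
// (hypothetical names, not the actual PersistentSubscriptionClient).
class ClientBufferSketch
{
    private readonly int _bufferSize;
    private readonly HashSet<Guid> _outstanding = new HashSet<Guid>();

    public ClientBufferSketch(int bufferSize) => _bufferSize = bufferSize;

    // Free slots are derived from the outstanding list
    public int AvailableSlots => _bufferSize - _outstanding.Count;

    // Called when the subscription pushes a message to this connection
    public void Push(Guid eventId) => _outstanding.Add(eventId);

    // Called by the PersistentSubscription once the message has been acked/nacked/timed out;
    // if this never happens for a pushed event, the slot is lost for good.
    public void Confirm(Guid eventId) => _outstanding.Remove(eventId);
}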

@gregoryyoung
Contributor

gregoryyoung commented Aug 6, 2019 via email

@hayley-jean force-pushed the persistent-subscriptions-in-flight-msgs branch from 66b06bc to be24823 on August 8, 2019 10:29
@hayley-jean marked this pull request as ready for review on August 8, 2019 10:46
@shaan1337 self-requested a review on August 15, 2019 06:52
Member

@shaan1337 left a comment


That's a good catch, nice work 👍

After inspection of the event buffers, it appears that the root cause is not an event being present on two clients at the same time. The issue can manifest itself as follows:

Consider a scenario with 2 clients C1 and C2 connected to a persistent subscription with a message timeout of 0.5s and a single event E in the stream.

  • Event E is pushed to C1
  • C1 takes 1 second to process event E, then acknowledges it.
  • During this 1 second period, event E will time out on the server side and will be retried by the persistent subscription. Retrying removes event E from C1's buffer, after which it may be added again to either C1's or C2's buffer. Let's assume it's added to C2's buffer.
  • The acknowledgement from C1 for event E is now received. The persistent subscription will attempt to remove event E from C1's buffer, but it's no longer there. Event E will thus stay in client C2's buffer forever.

The fix appears correct under the above scenario.
In terms of performance: although removing from all clients is a linear operation, I assume that the number of clients connected to a persistent subscription will usually be small.
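
For what it's worth, here is a tiny self-contained model of the timeline above (plain sets standing in for the client buffers; not the real server code) that reproduces the stale entry and shows how removing from all clients clears it:

using System;
using System.Collections.Generic;

class RaceSketch
{
    static void Main()
    {
        var e = Guid.NewGuid();
        var c1 = new HashSet<Guid> { e };   // event E is pushed to C1 first
        var c2 = new HashSet<Guid>();

        // The message timeout (0.5s) fires before C1's 1s handler finishes:
        // the retry removes E from C1 and pushes it to C2.
        c1.Remove(e);
        c2.Add(e);

        // C1's late ack arrives. Old behaviour: remove only from the acking client.
        c1.Remove(e);                                                          // no-op, E already gone
        Console.WriteLine($"old behaviour, E stuck on C2: {c2.Contains(e)}");  // True

        // Behaviour with this PR: remove the completed event from all clients.
        foreach (var buffer in new[] { c1, c2 })
            buffer.Remove(e);
        Console.WriteLine($"with the fix, E stuck on C2: {c2.Contains(e)}");   // False
    }
}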

@hayley-jean merged commit e8f15b6 into master on Aug 21, 2019
@hayley-jean deleted the persistent-subscriptions-in-flight-msgs branch on August 21, 2019 12:14
pgermishuys pushed a commit that referenced this pull request Oct 23, 2019
…flight-msgs

Persistent subscriptions in flight msgs
@samholder

samholder commented Nov 4, 2019

What is the consequence of this event remaining in the buffer?

Would this take up one of the 'in flight' messages or similar?

@hayley-jean
Member Author

@samholder Yes, it would take up one of the 'in flight' messages on the connection.

If enough events get stuck in this way and completely fill the buffer, the subscription connection will not be able to process any more events.

@samholder

ok, thanks!
