EventStore hangs / TCP connectivity errors / Duplicate events on CatchUp #1732

Closed
haf opened this issue Sep 21, 2018 · 7 comments
Labels
action/investigate (For future investigation) · detectability/unknown (Unclear if negative impact will be detected) · impact/medium (Medium prospect of negative impact) · kind/bug (Issues which are a software defect)

Comments

haf commented Sep 21, 2018

I've finally isolated our issues with EventStore, reporting findings here as I debug the client/server.

We have a lot of CatchUp subscriptions running. Our code assumes they don't duplicate events, but they do, and here's how that shows up in the TCP connection state.

At some point (in this repro, at 11:30:22) the TCP connection dies.

[screenshot: 2018-09-21 at 19:29:34]

The server doesn't notice; it doesn't log the connection dying:

[screenshot: 2018-09-21 at 19:29:52]

However, something's wrong with the server, and it ends up failing its health checks in k8s and gets restarted 2 minutes later:

[screenshot: 2018-09-21 at 19:30:34]

For the duration of those two minutes, the system is down, because we can't receive events.

You can see the client lib heroically trying to reconnect, until the server gets rebooted.

[screenshot: 2018-09-21 at 19:31:57]

At this point, the catch-up subscriptions are hosed and some or all of them start receiving duplicate events, which they act on and write into ES on another stream (in our system):

[screenshot: 2018-09-21 at 19:32:42]

Now, since we track our subscription positions with the above stream, the next time our app reboots and reads where it should start consuming from, that position will be multiplied by n, where n is the duplication factor.

This means we don't receive events for a while, until the stream catches up to that inflated position, and then we start receiving events without ever having received their predecessors; cue invariant failures en masse.
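
To make the broken invariant concrete, here's a minimal sketch of the kind of guard a consumer needs if duplicates can arrive. The class and names are illustrative, not our actual code; the point is to checkpoint by the event's own number rather than by counting callbacks:

```csharp
using EventStore.ClientAPI;

public sealed class DedupingCheckpoint
{
    private long _lastProcessed = -1; // -1 == nothing processed yet

    // Returns true if the event is new; false if it is a duplicate redelivery.
    public bool TryAccept(ResolvedEvent e)
    {
        if (e.OriginalEventNumber <= _lastProcessed)
            return false;                       // duplicate: ignore, don't advance
        _lastProcessed = e.OriginalEventNumber; // checkpoint by the event's own number,
                                                // never by counting callbacks
        return true;
    }

    public long LastProcessed => _lastProcessed;
}
```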

So: the root cause is that the server pod dies and stops receiving/sending TCP data. This should be classified as a major bug.

Second root cause is that the catch up subscriptions don't properly handle disconnects without duplicating inputs to their callbacks. This should be classified as a major bug.

A cascade effect is that our invariant that "subscriptions don't give us duplicate events, so we can do event-counting" is violated.

Another cascade effect is that when we reboot, we resubscribe from the wrong position.

I haven't discovered why EventStore's pod dies; it's not the node that goes down, it's something else. I also haven't discovered why the subscriptions duplicate input. I should be able to work around this by abandoning catch-up subscriptions and polling instead, and also by keeping track of the event ids whose results I persist, to trigger the idempotent write mechanism.
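
A rough sketch of that polling workaround, assuming the ClientAPI's ReadStreamEventsForwardAsync; the stream name, batch size, and delay are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using EventStore.ClientAPI;

public static class PollingReader
{
    // Reads "my-stream" forward from a checkpoint instead of using a catch-up
    // subscription. In real code the checkpoint would be loaded from and persisted
    // to durable storage together with the handler's results.
    public static async Task PollForever(IEventStoreConnection conn, Func<ResolvedEvent, Task> handle)
    {
        long next = 0;
        while (true)
        {
            var slice = await conn.ReadStreamEventsForwardAsync("my-stream", next, 200, true);
            foreach (var e in slice.Events)
                await handle(e); // handler is idempotent, keyed on e.Event.EventId

            next = slice.NextEventNumber;
            if (slice.IsEndOfStream)
                await Task.Delay(TimeSpan.FromMilliseconds(500)); // back off when caught up
        }
    }
}
```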

This client is the latest netcore client.

Another cause leading up to this failure: there's no documented way of monitoring the resource demands of the ES server, so when a resource is exhausted, I don't know which one it was. This could be classified as a minor documentation bug.

Also, I don't know if it's the TCP connection lingering and then disconnecting when there are no events, or something else.

Summary: the server dies, and cascading failures in the client plus the lack of monitoring docs lead to application invariant failures.

Swiss cheese

ODE (ObjectDisposedException)

This should be an exception resulting in a 503 Service Unavailable from an HTTP API, or it should effectively "pause" the current execution, allowing callers to resume when the TCP socket is back up.

api           | System.AggregateException: One or more errors occurred. (Cannot access a disposed object.
api           | Object name: 'ES-b97c607f-c302-4174-9966-a804ae697e1a'.)
api           | --- End of exception stack trace ---
api           | System.ObjectDisposedException: Cannot access a disposed object.
api           | Object name: 'ES-b97c607f-c302-4174-9966-a804ae697e1a'.
api           |   at EventStore.ClientAPI.Internal.EventStoreNodeConnection.ReadEventAsync(String stream, Int64 eventNumber, Boolean resolveLinkTos, UserCredentials userCredentials) in /libs/EventStore.Client/Internal/EventStoreNodeConnection.cs:line 189

This happens WITH "KeepReconnecting" enabled, and it leaves the ES connection in a failed state indefinitely.
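
One way to get the 503/pause behaviour described above would be to wrap the read in a translation layer. This is only a sketch, and EventStoreUnavailableException is a made-up type that the HTTP layer would map to 503:

```csharp
using System;
using System.Threading.Tasks;
using EventStore.ClientAPI;

// Hypothetical exception type; the HTTP layer maps it to 503 Service Unavailable.
public sealed class EventStoreUnavailableException : Exception
{
    public EventStoreUnavailableException(Exception inner)
        : base("The EventStore connection is currently unavailable.", inner) { }
}

public static class SafeReads
{
    // Wraps ReadEventAsync so a disposed/dead connection surfaces as a typed
    // "unavailable" error instead of an ObjectDisposedException bubbling up.
    public static async Task<EventReadResult> ReadEventOr503(
        IEventStoreConnection conn, string stream, long eventNumber)
    {
        try
        {
            return await conn.ReadEventAsync(stream, eventNumber, false);
        }
        catch (ObjectDisposedException ex)
        {
            throw new EventStoreUnavailableException(ex);
        }
    }
}
```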


This was mitigated by crashing the app any time "error occurred" happens. The mitigation leads to potentially lost requests and harsher idempotence requirements throughout (it requires that all POST requests carry a RequestId and are retried, and that all commands are identified by their request ids). This is an expensive mitigation, dev-time wise. It also leads to unavailability of our public service and should be classified as a major bug.
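
The RequestId-based idempotence looks roughly like this: a hedged sketch (names are illustrative) that derives a deterministic EventId from the request id, relying on the server's EventId-based idempotency (best-effort with ExpectedVersion.Any) to deduplicate retried POSTs:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;
using EventStore.ClientAPI;

public static class IdempotentWrites
{
    // Derives a stable EventId from the caller's RequestId, so a retried POST
    // produces byte-for-byte the same EventData and the server can deduplicate it.
    public static Guid EventIdFor(string requestId)
    {
        using (var md5 = MD5.Create())
            return new Guid(md5.ComputeHash(Encoding.UTF8.GetBytes(requestId)));
    }

    public static Task AppendCommand(IEventStoreConnection conn, string stream,
                                     string requestId, byte[] jsonPayload)
    {
        var data = new EventData(EventIdFor(requestId), "command-received", true, jsonPayload, null);
        return conn.AppendToStreamAsync(stream, ExpectedVersion.Any, data);
    }
}
```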

Authentication failure

When the ES server is not up and running and/or is crashing, it sometimes returns "authentication failure" despite correct credentials being used. Only in the Message property can callers read "Authentication timed out." This comes after the server comes back up, but before the server is ready to authenticate (I presume). This is a red herring in this context, since the issue here is that the server dies and the client can't handle the resulting errors.


Since connection retrying stops after an "authentication failure" despite that failure being a timeout (which happens, we're talking distributed systems here!), this violates the invariant that KeepReconnecting configurations keep reconnecting, and it should be classified as a major bug.
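
For diagnosing this, something like the following could at least surface the real reason strings. A sketch assuming the connection-level events exposed by IEventStoreConnection (AuthenticationFailed, ErrorOccurred, Reconnecting, Closed) and their Reason/Exception properties:

```csharp
using System;
using EventStore.ClientAPI;

public static class ConnectionDiagnostics
{
    // Logs the connection-level events, so an "authentication failure" that is
    // really a timeout at least shows its underlying reason somewhere.
    public static void Attach(IEventStoreConnection conn, Action<string> log)
    {
        conn.AuthenticationFailed += (_, e) => log($"auth failed: {e.Reason}");
        conn.ErrorOccurred        += (_, e) => log($"error occurred: {e.Exception}");
        conn.Reconnecting         += (_, e) => log("reconnecting...");
        conn.Closed               += (_, e) => log($"closed: {e.Reason}");
    }
}
```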

Ideas

Heartbeat

I've played around a lot with ES on Docker on my laptop, with docker-compose. I can get into a state where events aren't flowing after sleeping the laptop and then waking it up again. It would be good if there were a heartbeat mechanism in the client to avoid TCP hangs like this; a mechanism that could be monitored (and with the monitoring documented).
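
For reference, if I read the ClientAPI right, the connection settings builder does expose heartbeat tuning; a minimal sketch (values are arbitrary, and this doesn't address the monitoring/documentation gap):

```csharp
using System;
using EventStore.ClientAPI;
using EventStore.ClientAPI.SystemData;

public static class Connections
{
    // Keep reconnecting/retrying forever and use short TCP heartbeats so a
    // silently dead socket is detected and torn down sooner. Values are arbitrary.
    public static IEventStoreConnection Build()
    {
        var settings = ConnectionSettings.Create()
            .KeepReconnecting()
            .KeepRetrying()
            .SetHeartbeatInterval(TimeSpan.FromSeconds(2))
            .SetHeartbeatTimeout(TimeSpan.FromSeconds(5))
            .SetDefaultUserCredentials(new UserCredentials("admin", "changeit"));

        return EventStoreConnection.Create(settings, new Uri("tcp://localhost:1113"));
    }
}
```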

haf commented Sep 25, 2018

More information:

[screenshot: 2018-09-25 at 10:56:31]

The EventStore server crashes about 2 times per 24 hours normally.

The above bug report is now complete.

gregoryyoung (Contributor) commented

"Second root cause is that the catch up subscriptions don't properly handle disconnects without duplicating inputs to their callbacks. This should be classified as a major bug."

Catch-up subscriptions and their logic for getting/pushing data are quite basic (it's literally just a checkpoint) and have been around for many years without others having issues. Maybe you can describe more or provide a sample case?

On a side note, a catch-up subscription can always duplicate events up to whatever checkpoint interval is used, on restarts etc. I have not heard of any case of duplicated events over a catch-up subscription outside of where it's known that they can be duplicated. Perfect example: you receive event 3 over the subscription, you write to some db, you go to update the checkpoint, and the power is pulled; when it comes back up, event 3 will show up again as it was never checkpointed.
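
In code, that example looks roughly like this; the store interface is hypothetical, and the point is that applying the event and updating the checkpoint either happen atomically or the handler tolerates redelivery:

```csharp
using System.Threading.Tasks;
using EventStore.ClientAPI;

// Hypothetical store: apply + checkpoint commit together (same transaction),
// so "applied event N" and "checkpoint = N" can never diverge.
public interface IProjectionStore
{
    Task<long> LoadCheckpoint(string projection);
    Task ApplyAndCheckpoint(string projection, ResolvedEvent e);
}

public sealed class IdempotentProjector
{
    private readonly IProjectionStore _store;
    public IdempotentProjector(IProjectionStore store) => _store = store;

    // If the power is pulled between "write to db" and "update checkpoint",
    // event 3 is redelivered on restart; comparing against the stored checkpoint
    // turns the redelivery into a no-op instead of a duplicate side effect.
    public async Task Handle(ResolvedEvent e)
    {
        var checkpoint = await _store.LoadCheckpoint("my-projection");
        if (e.OriginalEventNumber <= checkpoint) return; // already applied
        await _store.ApplyAndCheckpoint("my-projection", e);
    }
}
```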

haf commented Sep 28, 2018

Catch-up subscriptions and their logic for getting/pushing data are quite basic (it's literally just a checkpoint) and have been around for many years without others having issues. Maybe you can describe more or provide a sample case?
...

I'd love to, but I can't reproduce this other than by waiting for the server to hang; only then does it happen. We can't crash it with stress tests otherwise, and it's not the client side that must hang, it's the server side. It's also not enough to simply kill the pod the server is running in. Perhaps https://istio.io/docs/tasks/traffic-management/fault-injection/ can help us in a few months when we're technologically there.

If the server would automatically log bad contextual conditions, such as out of memory, out of IO ops, or out of handles/ulimit errors, then perhaps the logs would tell us under what conditions we can reproduce the error quicker (than the 12-16 hours it takes now). Simply put, I'm not a good enough developer to reproduce this right now.

I do know replay is possible, and we do have those checks peppered all over the software, which is part of why it's been so hard to find this problem; a lot of the code handles it well. I'm sure there are places where I've forgotten replay protection, like when I interact with external APIs that don't take request ids; in those cases I've reasoned that a one-in-a-million replay would be alright, but this happens every day for us.

haf commented Oct 5, 2018

It's now been two weeks since this public report and more than six months since the initial contact with EventStore. This is a polite inquiry into how you are doing on handling the above-mentioned issues.

haf commented Nov 1, 2018

Related:

haf commented Dec 15, 2018

It's now been over two months since this was reported? What are you guys doing? This is bad.

shaan1337 added the kind/bug, impact/medium, and detectability/unknown labels on Mar 8, 2019
jageall added the action/investigate label on Oct 21, 2019
jageall added this to the Event Store v6.1 milestone on Oct 21, 2019
jageall removed this from the Event Store 20.02.01 milestone on Apr 22, 2020
jageall commented Apr 23, 2020

Fixed by #2367. Auto-reconnect was using the wrong position; catch-up subscriptions were the only subscription type doing that, and it caused all sorts of weird edge cases. The SubscriptionDropped handler is now called when the connection fails, and it is expected that the handler reconnects using the checkpoint it has processed up to. See #2367.
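
A rough sketch of the expected pattern, assuming the 5.x ClientAPI where eventAppeared is a Task-returning callback; checkpoint persistence and reconnect backoff are left out:

```csharp
using System;
using System.Threading.Tasks;
using EventStore.ClientAPI;

public sealed class ResilientCatchUp
{
    private readonly IEventStoreConnection _conn;
    private long? _checkpoint; // last event number fully processed; null = from the start

    public ResilientCatchUp(IEventStoreConnection conn) => _conn = conn;

    public void Start()
    {
        _conn.SubscribeToStreamFrom(
            "my-stream",
            _checkpoint,
            CatchUpSubscriptionSettings.Default,
            eventAppeared: async (sub, e) =>
            {
                await Task.CompletedTask;            // handle the event here
                _checkpoint = e.OriginalEventNumber; // remember how far we got
            },
            subscriptionDropped: (sub, reason, ex) =>
            {
                // The client no longer resubscribes from a stale position behind
                // your back; the handler resubscribes from its own checkpoint
                // (a real implementation would add backoff here).
                Console.WriteLine($"dropped ({reason}): {ex}");
                Start();
            });
    }
}
```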

jageall closed this as completed on Apr 23, 2020
EventStore deleted a comment from haf on Apr 23, 2020
EventStore locked the issue as too heated and limited conversation to collaborators on Apr 23, 2020