EventStore hangs / TCP connectivity errors / Duplicate events on CatchUp #1732

Closed
haf opened this issue Sep 21, 2018 · 7 comments
Labels
action/investigate (For future investigation) · detectability/unknown (Unclear if negative impact will be detected) · impact/medium (Medium prospect of negative impact) · kind/bug (Issues which are a software defect)

Comments

haf commented Sep 21, 2018

I've finally isolated our issues with EventStore, reporting findings here as I debug the client/server.

We have a lot of CatchUp subscriptions running. Our code assumes they don't duplicate events, but they do, and here's how that shows up in the TCP connection state.

At some point (in this repro, at 11:30:22) the TCP connection dies.

[screenshot: 2018-09-21 at 19:29:34]

The server doesn't notice; it doesn't log the connection dying:

[screenshot: 2018-09-21 at 19:29:52]

However, something's wrong with the server, and it ends up failing its health checks in k8s and gets restarted 2 minutes later:

[screenshot: 2018-09-21 at 19:30:34]

For the duration of those two minutes, the system is down, because we can't receive events.

You can see the client lib heroically trying to reconnect, until the server gets rebooted.

[screenshot: 2018-09-21 at 19:31:57]

At this point, the catch-up subscriptions are hosed and some or all of them start receiving duplicate events, which they act on and write into ES on another stream (in our system):

[screenshot: 2018-09-21 at 19:32:42]

Now, since we track our subscription positions with the above stream, the next time our app reboots and reads where it should start consuming from, that position will be multiplied by n, where n is the duplication factor.

This means we don't receive events for a while, until the stream catches up to that inflated position, and then we start receiving events without ever having received their predecessors; cue invariant failures en masse.
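
To make the broken invariant concrete, here's a minimal sketch of the kind of guard a consumer needs if duplicates can arrive. The class and names are illustrative, not our actual code; the point is to checkpoint by the event's own number rather than by counting callbacks:

```csharp
using EventStore.ClientAPI;

public sealed class DedupingCheckpoint
{
    private long _lastProcessed = -1; // -1 == nothing processed yet

    // Returns true if the event is new; false if it is a duplicate redelivery.
    public bool TryAccept(ResolvedEvent e)
    {
        if (e.OriginalEventNumber <= _lastProcessed)
            return false;                       // duplicate: ignore, don't advance
        _lastProcessed = e.OriginalEventNumber; // checkpoint by the event's own number,
                                                // never by counting callbacks
        return true;
    }

    public long LastProcessed => _lastProcessed;
}
```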

So: the root cause is that the server pod dies and stops receiving/sending TCP data. This should be classified as a major bug.

Second root cause is that the catch up subscriptions don't properly handle disconnects without duplicating inputs to their callbacks. This should be classified as a major bug.

A cascade effect is that our invariant that "subscriptions don't give us duplicate events, so we can do event-counting" is violated.

Another cascade effect is that when we reboot, we resubscribe from the wrong position.

I haven't discovered why EventStore's pod dies; it's not the node that goes down, it's something else. I also haven't discovered why the subscriptions duplicate input. I should be able to work around this by abandoning catch-up subscriptions and polling instead, and also by keeping track of the event ids whose results I persist, to trigger the idempotent write mechanism.
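
A rough sketch of that polling workaround, assuming the ClientAPI's ReadStreamEventsForwardAsync; the stream name, batch size, and delay are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using EventStore.ClientAPI;

public static class PollingReader
{
    // Reads "my-stream" forward from a checkpoint instead of using a catch-up
    // subscription. In real code the checkpoint would be loaded from and persisted
    // to durable storage together with the handler's results.
    public static async Task PollForever(IEventStoreConnection conn, Func<ResolvedEvent, Task> handle)
    {
        long next = 0;
        while (true)
        {
            var slice = await conn.ReadStreamEventsForwardAsync("my-stream", next, 200, true);
            foreach (var e in slice.Events)
                await handle(e); // handler is idempotent, keyed on e.Event.EventId

            next = slice.NextEventNumber;
            if (slice.IsEndOfStream)
                await Task.Delay(TimeSpan.FromMilliseconds(500)); // back off when caught up
        }
    }
}
```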

This client is the latest netcore client.

Another cause leading up to this failure: there's no documented way of monitoring the resource demands of the ES server, so when a resource is exhausted, I don't know which one it was. This could be classified as a minor documentation bug.

Also, I don't know if it's the TCP connection lingering and then disconnecting when there are no events, or something else.

Summary: the server dies, and cascading failures in the client plus the lack of monitoring docs lead to application invariant failures.

Swiss cheese

ODE (ObjectDisposedException)

This should be an exception resulting in a 503 Service Unavailable from an HTTP API, or it should effectively "pause" the current execution, allowing callers to resume when the TCP socket is back up.

api           | System.AggregateException: One or more errors occurred. (Cannot access a disposed object.
api           | Object name: 'ES-b97c607f-c302-4174-9966-a804ae697e1a'.)
api           | --- End of exception stack trace ---
api           | System.ObjectDisposedException: Cannot access a disposed object.
api           | Object name: 'ES-b97c607f-c302-4174-9966-a804ae697e1a'.
api           |   at EventStore.ClientAPI.Internal.EventStoreNodeConnection.ReadEventAsync(String stream, Int64 eventNumber, Boolean resolveLinkTos, UserCredentials userCredentials) in /libs/EventStore.Client/Internal/EventStoreNodeConnection.cs:line 189

This happens WITH "KeepReconnecting" enabled, and it leaves the ES connection in a failed state indefinitely.
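
One way to get the 503/pause behaviour described above would be to wrap the read in a translation layer. This is only a sketch, and EventStoreUnavailableException is a made-up type that the HTTP layer would map to 503:

```csharp
using System;
using System.Threading.Tasks;
using EventStore.ClientAPI;

// Hypothetical exception type; the HTTP layer maps it to 503 Service Unavailable.
public sealed class EventStoreUnavailableException : Exception
{
    public EventStoreUnavailableException(Exception inner)
        : base("The EventStore connection is currently unavailable.", inner) { }
}

public static class SafeReads
{
    // Wraps ReadEventAsync so a disposed/dead connection surfaces as a typed
    // "unavailable" error instead of an ObjectDisposedException bubbling up.
    public static async Task<EventReadResult> ReadEventOr503(
        IEventStoreConnection conn, string stream, long eventNumber)
    {
        try
        {
            return await conn.ReadEventAsync(stream, eventNumber, false);
        }
        catch (ObjectDisposedException ex)
        {
            throw new EventStoreUnavailableException(ex);
        }
    }
}
```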


This was mitigated by crashing the app any time "error occurred" happens. The mitigation leads to potentially lost requests and harsher idempotence requirements throughout (it requires that all POST requests carry a RequestId and are retried, and that all commands are identified by their request ids). This is an expensive mitigation, dev-time wise. It also leads to unavailability of our public service and should be classified as a major bug.
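
The RequestId-based idempotence looks roughly like this: a hedged sketch (names are illustrative) that derives a deterministic EventId from the request id, relying on the server's EventId-based idempotency (best-effort with ExpectedVersion.Any) to deduplicate retried POSTs:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;
using EventStore.ClientAPI;

public static class IdempotentWrites
{
    // Derives a stable EventId from the caller's RequestId, so a retried POST
    // produces byte-for-byte the same EventData and the server can deduplicate it.
    public static Guid EventIdFor(string requestId)
    {
        using (var md5 = MD5.Create())
            return new Guid(md5.ComputeHash(Encoding.UTF8.GetBytes(requestId)));
    }

    public static Task AppendCommand(IEventStoreConnection conn, string stream,
                                     string requestId, byte[] jsonPayload)
    {
        var data = new EventData(EventIdFor(requestId), "command-received", true, jsonPayload, null);
        return conn.AppendToStreamAsync(stream, ExpectedVersion.Any, data);
    }
}
```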

Authentication failure

When the ES server is not up and running and/or is crashing, it sometimes returns "authentication failure" despite correct credentials being used. Only in the Message property can callers read "Authentication timed out." This comes after the server comes back up, but before the server is ready to authenticate (I presume). This is a red herring in this context, since the issue here is that the server dies and the client can't handle the resulting errors.


Since connection retrying stops after an "authentication failure" despite that failure being a timeout (which happens, we're talking distributed systems here!), this violates the invariant that KeepReconnecting configurations keep reconnecting, and it should be classified as a major bug.
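
For diagnosing this, something like the following could at least surface the real reason strings. A sketch assuming the connection-level events exposed by IEventStoreConnection (AuthenticationFailed, ErrorOccurred, Reconnecting, Closed) and their Reason/Exception properties:

```csharp
using System;
using EventStore.ClientAPI;

public static class ConnectionDiagnostics
{
    // Logs the connection-level events, so an "authentication failure" that is
    // really a timeout at least shows its underlying reason somewhere.
    public static void Attach(IEventStoreConnection conn, Action<string> log)
    {
        conn.AuthenticationFailed += (_, e) => log($"auth failed: {e.Reason}");
        conn.ErrorOccurred        += (_, e) => log($"error occurred: {e.Exception}");
        conn.Reconnecting         += (_, e) => log("reconnecting...");
        conn.Closed               += (_, e) => log($"closed: {e.Reason}");
    }
}
```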

Ideas

Heartbeat

I've played around a lot with ES on Docker on my laptop, with docker-compose. I can get into a state where events aren't flowing after sleeping the laptop and then waking it up again. It would be good if there were a heartbeat mechanism in the client to avoid TCP hangs like this; a mechanism that could be monitored (and with the monitoring documented).
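
For reference, if I read the ClientAPI right, the connection settings builder does expose heartbeat tuning; a minimal sketch (values are arbitrary, and this doesn't address the monitoring/documentation gap):

```csharp
using System;
using EventStore.ClientAPI;
using EventStore.ClientAPI.SystemData;

public static class Connections
{
    // Keep reconnecting/retrying forever and use short TCP heartbeats so a
    // silently dead socket is detected and torn down sooner. Values are arbitrary.
    public static IEventStoreConnection Build()
    {
        var settings = ConnectionSettings.Create()
            .KeepReconnecting()
            .KeepRetrying()
            .SetHeartbeatInterval(TimeSpan.FromSeconds(2))
            .SetHeartbeatTimeout(TimeSpan.FromSeconds(5))
            .SetDefaultUserCredentials(new UserCredentials("admin", "changeit"));

        return EventStoreConnection.Create(settings, new Uri("tcp://localhost:1113"));
    }
}
```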

haf commented Sep 25, 2018

More information:

[screenshot: 2018-09-25 at 10:56:31]

The EventStore server crashes about 2 times per 24 hours normally.

The above bug report is now complete.

gregoryyoung (Contributor) commented

"Second root cause is that the catch up subscriptions don't properly handle disconnects without duplicating inputs to their callbacks. This should be classified as a major bug."

Catch-up subscriptions and their logic for getting/pushing data are quite basic (it's literally just a checkpoint) and have been around for many years without others having issues. Maybe you can describe more or provide a sample case?

On a side note, a catch-up subscription can always duplicate events up to whatever checkpoint interval is used, on restarts etc. I have not heard of any case of duplicated events over a catch-up subscription outside of where it's known that they can be duplicated. Perfect example: you receive event 3 over the subscription, you write to some db, you go to update the checkpoint, and the power is pulled; when it comes back up, event 3 will show up again as it was never checkpointed.
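
In code, that example looks roughly like this; the store interface is hypothetical, and the point is that applying the event and updating the checkpoint either happen atomically or the handler tolerates redelivery:

```csharp
using System.Threading.Tasks;
using EventStore.ClientAPI;

// Hypothetical store: apply + checkpoint commit together (same transaction),
// so "applied event N" and "checkpoint = N" can never diverge.
public interface IProjectionStore
{
    Task<long> LoadCheckpoint(string projection);
    Task ApplyAndCheckpoint(string projection, ResolvedEvent e);
}

public sealed class IdempotentProjector
{
    private readonly IProjectionStore _store;
    public IdempotentProjector(IProjectionStore store) => _store = store;

    // If the power is pulled between "write to db" and "update checkpoint",
    // event 3 is redelivered on restart; comparing against the stored checkpoint
    // turns the redelivery into a no-op instead of a duplicate side effect.
    public async Task Handle(ResolvedEvent e)
    {
        var checkpoint = await _store.LoadCheckpoint("my-projection");
        if (e.OriginalEventNumber <= checkpoint) return; // already applied
        await _store.ApplyAndCheckpoint("my-projection", e);
    }
}
```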

haf commented Sep 28, 2018

Catch-up subscriptions and their logic for getting/pushing data are quite basic (it's literally just a checkpoint) and have been around for many years without others having issues. Maybe you can describe more or provide a sample case?
...

I'd love to, but I can't reproduce this other than by waiting for the server to hang; only then does it happen. We can't crash it with stress tests otherwise, and it's not the client side that must hang, it's the server side. It's also not enough to simply kill the pod the server is running in. Perhaps https://istio.io/docs/tasks/traffic-management/fault-injection/ can help us in a few months when we're technologically there.

If the server would automatically log bad contextual conditions, such as out of memory, out of IO ops, or out of handles/ulimit errors, then perhaps the logs would tell us under what conditions we can reproduce the error quicker (than the 12-16 hours it takes now). Simply put, I'm not a good enough developer to reproduce this right now.

I do know replay is possible, and we do have those checks peppered all over the software, which is part of why it's been so hard to find this problem; a lot of the code handles it well. I'm sure there are places where I've forgotten replay protection, like when I interact with external APIs that don't take request ids; in those cases I've reasoned that a one-in-a-million replay would be alright, but this happens every day for us.

haf commented Oct 5, 2018

It's now been two weeks since this public report and more than six months since the initial contact with EventStore. This is a polite inquiry into how you are doing on handling the above-mentioned issues.

haf commented Nov 1, 2018

Related:

haf commented Dec 15, 2018

It's now been over two months since this was reported? What are you guys doing? This is bad.

shaan1337 added the kind/bug, impact/medium, and detectability/unknown labels on Mar 8, 2019
jageall added the action/investigate label on Oct 21, 2019
jageall added this to the Event Store v6.1 milestone on Oct 21, 2019
jageall removed this from the Event Store 20.02.01 milestone on Apr 22, 2020
jageall commented Apr 23, 2020

Fixed by #2367. Auto-reconnect was using the wrong position; catch-up subscriptions were the only subscription type doing that, and it caused all sorts of weird edge cases. The SubscriptionDropped handler is now called when the connection fails, and it is expected that the handler reconnects using the checkpoint it has processed up to. See #2367.
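
A rough sketch of the expected pattern, assuming the 5.x ClientAPI where eventAppeared is a Task-returning callback; checkpoint persistence and reconnect backoff are left out:

```csharp
using System;
using System.Threading.Tasks;
using EventStore.ClientAPI;

public sealed class ResilientCatchUp
{
    private readonly IEventStoreConnection _conn;
    private long? _checkpoint; // last event number fully processed; null = from the start

    public ResilientCatchUp(IEventStoreConnection conn) => _conn = conn;

    public void Start()
    {
        _conn.SubscribeToStreamFrom(
            "my-stream",
            _checkpoint,
            CatchUpSubscriptionSettings.Default,
            eventAppeared: async (sub, e) =>
            {
                await Task.CompletedTask;            // handle the event here
                _checkpoint = e.OriginalEventNumber; // remember how far we got
            },
            subscriptionDropped: (sub, reason, ex) =>
            {
                // The client no longer resubscribes from a stale position behind
                // your back; the handler resubscribes from its own checkpoint
                // (a real implementation would add backoff here).
                Console.WriteLine($"dropped ({reason}): {ex}");
                Start();
            });
    }
}
```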

jageall closed this as completed on Apr 23, 2020
EventStore deleted a comment from haf on Apr 23, 2020
EventStore locked the issue as too heated and limited conversation to collaborators on Apr 23, 2020