EventStore hangs / TCP connectivity errors / Duplicate events on CatchUp #1732
"Second root cause is that the catch up subscriptions don't properly handle disconnects without duplicating inputs to their callbacks. This should be classified as a major bug." Catch-up subscriptions and their logic for getting/pushing data are quite basic (it's literally just a checkpoint) and have been around for many years without others having issues. Maybe you can describe it more or provide a sample case? On a side note, a catch-up subscription can always duplicate events up to whatever checkpoint interval is used, on restarts etc. I have not heard of any case of duplicated events over a catch-up subscription outside of the situations where it's known that they can be duplicated. Perfect example: you receive event 3 over the subscription, you write to some db, and before you update the checkpoint the power gets pulled; when it comes back up, event 3 will show up again, as it was never checkpointed.
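The "perfect example" above can be sketched in a few lines. This is a hypothetical illustration of interval-based checkpointing (not EventStore client code): side effects happen per event, but the checkpoint is only stored every few events, so a crash replays everything since the last checkpoint.

```python
# Illustrative sketch: at-least-once delivery with an interval checkpoint.
# All names here are hypothetical, not the EventStore client API.

class CatchUpConsumer:
    def __init__(self, checkpoint_interval=2):
        self.checkpoint = -1                     # last durably stored position
        self.processed = []                      # side effects (e.g. db writes)
        self.checkpoint_interval = checkpoint_interval

    def handle(self, event_number):
        self.processed.append(event_number)      # side effect happens first...
        if event_number % self.checkpoint_interval == 0:
            self.checkpoint = event_number       # ...checkpoint only sometimes

    def restart_position(self):
        return self.checkpoint + 1               # resume after last checkpoint

consumer = CatchUpConsumer()
for ev in [0, 1, 2, 3]:
    consumer.handle(ev)

# Crash here: event 3 was processed but never checkpointed (checkpoint == 2),
# so on restart the subscription replays from event 3 and the callback
# sees it a second time.
print(consumer.restart_position())  # → 3
```

This is the expected, documented kind of duplication; the report below describes duplication beyond this window.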
I'd love to, but I can't reproduce this other than by waiting for the server to hang; only then does it happen. We can't crash it with stress tests otherwise, and it's not the client side that must hang but the server side. It's also not enough to simply kill the pod the server is running in. Perhaps https://istio.io/docs/tasks/traffic-management/fault-injection/ can help us in a few months when we're technologically there. If the server automatically logged bad contextual conditions, such as out-of-memory, out-of-IO-ops, or out-of-handles/ulimit errors, then perhaps the logs would tell us under what conditions we could reproduce the error quicker (than the 12-16 hours it takes now). Simply put, I'm not a good enough developer to reproduce this right now. I do know replay is possible, and we have those checks peppered all over the software, which is part of why it's been so hard to find this problem; a lot of the code handles it well. I'm sure there are places where I've forgotten replay protection, like when I interact with external APIs that don't take request ids; in those cases I've reasoned that a one-in-a-million replay would be alright, but this happens every day for us.
It's now been two weeks since this public report and more than six months since the initial contact with EventStore. This is a polite inquiry into how you are doing on handling the above-mentioned issues.
It's now been over two months since this was reported. What are you guys doing? This is bad.
Fixed by #2367. Auto-reconnect was using the wrong position; catch-up subscriptions were the only subscription type doing that, and it caused all sorts of weird edge cases. The SubscriptionDropped handler is now called when the connection fails, and it is expected that the handler reconnects using the checkpoint that it has processed up to. #2367
I've finally isolated our issues with EventStore, reporting findings here as I debug the client/server.
We have a lot of CatchUp subscriptions running. Our code assumes they don't duplicate events, but they do, and here's how that shows up in the TCP connection state.
At some point (in this repro, at 11:30:22) the TCP connection dies.
The server doesn't care about that and doesn't log the connection dying:
However, something's wrong with the server, and it ends up failing its health checks in k8s and gets restarted 2 minutes later:
During those two minutes, the system is down, because we can't receive events.
You can see the client lib heroically trying to reconnect, until the server gets rebooted.
At this point, the catch-up subscriptions are hosed, and some or all of them start receiving duplicate events, which they act on and send into ES on another stream (in our system):
Now, since we track our subscription positions with the above stream, the next time our app reboots and reads where it should start consuming from, that position will be inflated by a factor of n, where n is the duplication factor.
This means we don't get events for a while, not until the stream reaches the position our duplicated writes inflated the checkpoint to, and then we'll start receiving events without first receiving their predecessors, cueing invariant failures en masse.
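The position arithmetic above can be made concrete. This is a hypothetical sketch (assumed from the report, not actual application code) of an app that derives its resume position by counting the events it wrote downstream:

```python
# Hypothetical arithmetic: the app tracks its resume position by counting
# events written to its own stream. If the subscription delivers each event
# twice, the stored position overshoots the source stream by the
# duplication factor.

true_events = 5          # events actually in the source stream
duplication_factor = 2   # each event delivered twice after the disconnect

writes = true_events * duplication_factor   # events the app wrote downstream
stored_position = writes                    # position the app resumes from

# On reboot the app subscribes from position 10 of a 5-event stream: it
# receives nothing until the stream grows past position 10, then starts
# mid-stream without the preceding events.
missed_events = stored_position - true_events
print(missed_events)  # → 5
```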
So: the first root cause is that the server pod dies and stops receiving/sending TCP data. This should be classified as a major bug.
The second root cause is that the catch-up subscriptions don't properly handle disconnects without duplicating inputs to their callbacks. This should also be classified as a major bug.
Cascade effect: our invariant that "subscriptions don't give us duplicate events, so we can do event counting" is violated.
Cascade effect: when we reboot, we resubscribe from the wrong position.
I haven't discovered why EventStore's pod dies; it's not the node that goes down, it's something else. I also haven't discovered why the subscriptions duplicate input. I should be able to work around this by abandoning catch-up subscriptions and polling instead, and also by keeping track of the event ids whose results I persist, to trigger the idempotent-write mechanism.
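The workaround described above (an idempotent write guard keyed on event id) can be sketched minimally. Names here are illustrative, not the EventStore client API; in production the seen-id set would be persisted atomically with the writes.

```python
# Minimal sketch of an idempotent consumer: duplicate deliveries of the
# same event id are ignored, so replays after a reconnect are harmless.

processed_ids = set()   # would be persisted alongside the results in practice
results = []

def handle_once(event_id, payload):
    """Apply the event only if its id has not been seen before."""
    if event_id in processed_ids:
        return False                 # duplicate delivery: ignore
    processed_ids.add(event_id)
    results.append(payload)          # the real side effect (db write, etc.)
    return True

# The subscription redelivers event "b" after a reconnect:
for event_id, payload in [("a", 1), ("b", 2), ("b", 2), ("c", 3)]:
    handle_once(event_id, payload)

print(results)  # → [1, 2, 3]
```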
This client is the latest netcore client.
Another cause leading up to this failure: there's no documented way of monitoring the resource demands of the ES server, so when a resource is exhausted, I don't know which one it was. This could be classified as a minor documentation bug.
Also, I don't know if it's the TCP connection lingering and then disconnecting when there are no events, or if it's something else.
Summary: the server dies, and cascading failures in the client plus a lack of monitoring docs lead to application invariant failures.
Swiss cheese
ODE
This should be an exception resulting in a 503 Service Unavailable from an HTTP API, or it should effectively "pause" current execution, allowing callers to resume when the TCP socket is back up.
This happens WITH `KeepReconnecting` enabled, and causes the ES connection to go into a failed state indefinitely.
This was mitigated by crashing the app any time "error occurred" happens. The mitigation leads to potentially lost requests and harsher requirements for idempotence throughout (it requires that all POST requests carry a RequestId and are retried, and that all commands are identified by their request ids). This is an expensive mitigation, dev-time wise. It also leads to unavailability of our public service and should be classified as a major bug.
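The request-id idempotence the mitigation demands looks roughly like this. A hypothetical sketch (not our actual service code): the server remembers the outcome per RequestId, so a client retry after a crash cannot apply the same command twice.

```python
# Illustrative sketch of request-id idempotence: every POST carries a
# RequestId; the handler stores the response per id and returns the stored
# response on retry instead of re-executing the command.

responses = {}     # RequestId -> stored response
applied = []       # commands actually executed

def handle_post(request_id, command):
    if request_id in responses:      # retry of a request we already served
        return responses[request_id]
    applied.append(command)          # execute the command exactly once
    responses[request_id] = {"status": 201, "command": command}
    return responses[request_id]

first = handle_post("req-1", "create-order")
retry = handle_post("req-1", "create-order")   # client retried after a crash

print(len(applied))  # → 1: the retry was served from the stored response
```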
Authentication failure
When the ES server is not up and running and/or is crashing, it sometimes gives "authentication failure" despite correct credentials being used. Only in the `Message` property can callers read "Authentication timed out." This comes after the server comes back up, but before the server is ready to authenticate (I presume). It's a red herring in this context, since the issue here is that the server dies and the client can't handle the resulting errors. Since retrying the connection stops after getting "authentication failure", despite that failure being a timeout (which happens; we're talking dist. systems here!), it violates the invariant that `KeepReconnecting` configurations keep reconnecting, and should be classified as a major bug.
Ideas
Heartbeat
I've played around a lot with ES in Docker on my laptop, with docker-compose. I can get into a state where events aren't flowing after sleeping the laptop and then waking it up again. It would be good if there were a heartbeat mechanism in the client to avoid TCP hangs like this; a mechanism that could be monitored (and with that monitoring documented).
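The heartbeat idea above amounts to a watchdog that declares the connection dead when nothing (events or heartbeat responses) has arrived within a timeout. A hedged sketch with illustrative timings; this is not the EventStore client API:

```python
# Hypothetical client-side heartbeat watchdog: if neither an event nor a
# heartbeat response arrives within the timeout, treat the TCP connection
# as hung and force a reconnect.

class HeartbeatWatchdog:
    def __init__(self, timeout_seconds=5.0):
        self.timeout = timeout_seconds
        self.last_seen = 0.0

    def record_activity(self, now):
        """Call on every received event or heartbeat response."""
        self.last_seen = now

    def is_alive(self, now):
        return (now - self.last_seen) <= self.timeout

watchdog = HeartbeatWatchdog(timeout_seconds=5.0)
watchdog.record_activity(now=100.0)
print(watchdog.is_alive(now=103.0))   # → True: within the timeout
print(watchdog.is_alive(now=110.0))   # → False: connection considered hung
```

Exposing `last_seen` as a metric would also give the monitoring hook asked for above.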