Persistent subscriptions are quietly stopping #1392
Comments
Thank you for submitting this, @mpielikis. Would you be able to provide the logs for the master node at the time? If you don't want to share them here, you can email them to hayley at geteventstore.com
Any follow-up here?
Hello @mpielikis, can you please let us know if you're still facing the same issue in the latest version of EventStore (if you've upgraded)? Do you remember whether all the subscriptions that were stuck were subscribing from the same or different streams? Thanks
Hello @shaan1337
Yes, it still happens. Additional debug-level logging of persistent subscriptions could bring more details, IMO. The current version is 4.0.3.0.
Different subscriptions for different streams. I could not find any correlations. Every time, the frozen subscriptions were different, one or many. We use the DispatchToSingle type of subscription the most. Three clients connect to many DispatchToSingle subscriptions (37) and to a couple of RoundRobin ones.
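For readers landing here later: a minimal sketch of how such a group is typically created with the EventStore.ClientAPI .NET client, assuming the DispatchToSingle consumer strategy mentioned above. The stream/group names, connection string, and credentials are placeholders (not from this issue), and the builder method names are as I recall them from the client:

```csharp
using System;
using System.Threading.Tasks;
using EventStore.ClientAPI;
using EventStore.ClientAPI.SystemData;

class SubscriptionSetup
{
    // Sketch only: names and connection details are placeholders, not taken from the issue.
    static async Task CreateGroupAsync()
    {
        var conn = EventStoreConnection.Create(new Uri("tcp://admin:changeit@localhost:1113"));
        await conn.ConnectAsync();

        PersistentSubscriptionSettings settings = PersistentSubscriptionSettings.Create()
            .ResolveLinkTos()
            .StartFromCurrent()
            .PreferDispatchToSingle();   // strategy discussed above; .PreferRoundRobin() is the alternative

        await conn.CreatePersistentSubscriptionAsync(
            "my-stream", "my-group", settings,
            new UserCredentials("admin", "changeit"));
    }
}
```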
Just from looking at the log messages, is it possible that a timeout is occurring when writing the checkpoint?
I was analyzing a log and didn't find any suspicious errors or timeouts, just the usual messages. If there is a timeout, shouldn't there be a retry?
There is a retry, up to a certain depth.
I also noticed that when a subscription "freezes", the number of clients connected to a DispatchToSingle subscription becomes 0 or more than 1, when it should normally be exactly one. It seems as if the client socket connections become inactive or stalled, and the only way to reset them is to restart the server.
the last "freeze". As you see there is 7 identical connections
|
These are the changes I've made to 4.0.3.0 to debug this issue:
Thanks @mpielikis for all the details, it definitely helps!
There are retries.
One thing that looks odd in the status is "totalInFlightMessages": 5, but none of the connections have any in-flight messages?
No, there is one.
Are they being ack'ed?
You mean TCP ACK? I don't know. What I know is that restarting that client keeps these connections. I can bring more details on the next "freeze"; just tell me what I should check for or what actions I should take to help investigate this. Also, the interesting part is this:
Messages are pushed, but event batch reading has stopped.
No, there is an explicit ack/nak message from the client on processing of an event (it happens automatically by default, but you can also do it manually). Event batch reading only happens if the subscription is not caught up.
The last result is [Sent]. Does it ACK?
If you mean on the client side, then yes, there is an explicit ack.
The reason I ask is that there are outstanding messages on the subscription. Is it possible that an exception etc. is being thrown in the client and the messages are not being ack'ed/nak'ed because of it? I am guessing that it's not sending because the buffer size is 5 and it's full.
If a message is not ack'ed/nak'ed, the buffer will fill and no more messages will be sent. On a side note, you really don't want a buffer size of 5 unless these messages take around 1 second each to process. A more likely value would be two to three orders of magnitude larger (500-5000).
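To make the buffer-size advice concrete: assuming the EventStore.ClientAPI .NET client (exact delegate signatures vary slightly between client versions), both the buffer size and the auto-ack behaviour are parameters of the connect call. A sketch applying the suggestion above, where `conn`, `HandleEvent`, and the stream/group names are placeholders:

```csharp
// Sketch: bufferSize raised well above the default, autoAck disabled so the
// handler acks/naks explicitly. Names and credentials are placeholders.
var subscription = conn.ConnectToPersistentSubscription(
    "my-stream",
    "my-group",
    (sub, resolvedEvent) => HandleEvent(sub, resolvedEvent),   // eventAppeared
    (sub, reason, exception) =>                                // subscriptionDropped
        Console.WriteLine($"Subscription dropped: {reason} {exception}"),
    new UserCredentials("admin", "changeit"),
    bufferSize: 500,    // two to three orders of magnitude larger than 5, per the advice above
    autoAck: false);
```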
The handler on a client looks like this (a sketch follows below):
Even if an exception happened on a client, it shouldn't block the subscription. Another interesting fact is that not one but many subscriptions freeze at the same time, with different clients from different machines. The handlers are really lightweight and just push messages onto an in-memory queue.
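The handler snippet referenced above didn't survive into this thread. Purely as an illustration (not the author's actual code), a lightweight handler that just pushes onto an in-memory queue and acks explicitly might look roughly like this:

```csharp
using System;
using System.Collections.Concurrent;
using EventStore.ClientAPI;

class LightweightHandler
{
    // Hypothetical reconstruction: the real handler is not shown in this thread.
    static readonly ConcurrentQueue<ResolvedEvent> Pending = new ConcurrentQueue<ResolvedEvent>();

    static void HandleEvent(EventStorePersistentSubscriptionBase sub, ResolvedEvent evt)
    {
        try
        {
            Pending.Enqueue(evt);   // lightweight: hand off to an in-memory queue for later processing
            sub.Acknowledge(evt);   // explicit ack (only needed when autoAck is false)
        }
        catch (Exception ex)
        {
            // nak with Retry so the message doesn't occupy the in-flight buffer indefinitely
            sub.Fail(evt, PersistentSubscriptionNakEventAction.Retry, ex.Message);
        }
    }
}
```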
Hello @mpielikis, can you please send us 4-5 snapshots (at intervals of around 10-15 seconds) of the query you ran when there is a freeze? It'll help us know which of the values are changing. Thanks
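The exact query isn't shown above, but the "totalInFlightMessages" value quoted earlier looks like the per-subscription info payload from the HTTP API, so the snapshots could be taken with something like the following. The host, stream, and group names are placeholders, and the /subscriptions/{stream}/{group}/info path is my assumption about the query used:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class SubscriptionSnapshots
{
    // Assumption: the per-subscription info endpoint of the HTTP API; adjust host/stream/group as needed.
    static async Task TakeSnapshotsAsync()
    {
        using (var http = new HttpClient())
        {
            for (var i = 0; i < 5; i++)
            {
                var json = await http.GetStringAsync(
                    "http://localhost:2113/subscriptions/my-stream/my-group/info");
                Console.WriteLine($"{DateTime.UtcNow:O} {json}");
                await Task.Delay(TimeSpan.FromSeconds(15));   // 4-5 snapshots, 10-15 seconds apart
            }
        }
    }
}
```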
Hi @mpielikis, I've gone through a review of the persistent subscription code. Based on the data you provided, it looks like it can be narrowed down to the following 2 possibilities:
For no. 1, can you please verify if it's possible that your client code is calling If no. 1 isn't fruitful, for no. 2, can you please add a Thanks
Hi @mpielikis, do you have any feedback regarding the above, please? Thanks
Hi @shaan1337, yes. I was just waiting for this freeze to appear, and now I got it. I made four snapshots of 32 subscriptions: the first when there was a freeze, the second when the clients were closed, the third when the clients were up again, and the fourth after the ES restart. netstat -na output is also included from all machines. IP addresses are masked. For example, the states of one subscription:
On Freeze
On Clients Closed (12 non-existent connections according to netstat)
On Clients Up (12 non-existent and 1 existing connection, 10.98.40.137:40570, according to netstat)
On ES Restart (one healthy connection 10.98.40.137:42525)
I can send the full snapshot zip file.
What do the ES logs say about connections?
Three grepped connection log entries from a 5-day log:
10.98.40.137:61476
10.98.40.137:62433
10.98.40.137:62441
A new freeze here. Four states from one of the stuck persistent subscriptions.
On Freeze: 2 connections, 1 lost and 1 active
netstat on 10.98.40.138
netstat on 10.98.40.137
On Client Close: 1 lost connection
netstat on 10.98.40.138
On Client Up: 2 connections, 1 lost and 1 active
netstat on 10.98.40.138
After EventStore Restart: 1 live connection
netstat on 10.98.40.138
Hello @mpielikis, sorry for the delay in getting back to you, and thanks for all the details. Can you please send me your log files related to the above runs to shaan [at] geteventstore.com? Thanks
Hello @mpielikis, for the new freeze, it seems that the last notify live message time is stuck at:
This is a few minutes after some of the connections were lost: 10.98.40.137:61476, 10.98.40.137:62433, 10.98.40.137:62441 on the 20th (as you sent earlier). Can you please also send me the logs for the 20th of January? Thanks
I think we're experiencing exactly the same problem. It has caused massive problems for us on version 3.9.4 of ES.
Has this issue been addressed? I've been playing with persistent subscriptions and wonder if it is ready for prime time.
Hello, any update on this issue? We want to use EventStore, but this could be a blocker.
We still struggle with this issue in production about once a week. It is the major problem we have. We hope PR #1640 will address it; there are some similar symptoms.
This is likely due to the connection ending up in an invalid state; the ES connection doesn't always reconnect successfully after network splits. If you want to triage and debug this huge issue, I'd focus on introducing a spotty network with alternating partitions and a single master ES server that you bring up and down, while tracking the events fired and the TCP connection state of the ES connection implementation. It's not the server's connection that's buggy; it's the client's. If you set up a continuous restart of the server, you'll worsen the problem, because the client will get into invalid states and throw internal exceptions, such as ObjectDisposedException, or just silently stop sending data to the application. For us the issue is primarily catch-up subscriptions, of which we have about 40 in a single process, not persistent subscriptions.

Also noteworthy is that the ES server's logs are useless for debugging this. The ES client's logs, unless the verbosity is increased, are also useless. We're using the latest, untagged (!), Docker version of the server now, and the .NET Core client. We've also had race conditions in the client connection's authenticate call, despite using TCP; and if it's ever unauthenticated at runtime, we crash our service and let k8s restart it.

We've tried to notify ES of these problems, but they say they can't reproduce them, despite having been given full access to our environment and code base, and despite us having purchased commercial support. If we give them access to Stackdriver logging with ALL logs, neatly ordered by timestamp in GCP (synchronised by Google's NTP servers), they still insist on us mounting a persistent volume and shipping text-file-based logs to them over e-mail. Their latest suggestion is to ask us to manually handle the TCP connection state of the ES connection by "avoiding connecting to subscriptions while it's disconnected"; however, since the TCP state is managed on another thread, it's impossible to know for sure from the caller's thread what state the TCP socket is in. Pointing out that the TCP handling runs on another thread, and that any assumption about the socket's state made from another thread is therefore a race condition, is met with repeated assertions to "just do as I say and handle it manually and it will be fine". As for the unauthenticated issues that occur every other launch, they simply say they cannot reproduce them, despite full access to logs and machines.

When asked how to monitor the EventStore server with Grafana/Prometheus/Influx or another modern stack, there's no real response either (no Prometheus endpoint, no metrics shipper, no real answer), so there are no metrics for ES that let us debug the problem. This is particularly funny because I remember @gregoryyoung talking at Öredev in 2009, bragging about how, when his self-proclaimed SLA had been violated, he told the surprised customer: and this was possible because he had metrics. And yet, he tweets about completely different things other than fixing his database's client lib.
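Not an official fix, but for anyone hitting the client-side reconnect behaviour described above: one possible mitigation is to wire up the connection's lifecycle events and recreate catch-up subscriptions explicitly when they are dropped, instead of trusting the client to heal itself. Event and method names below are from the EventStore.ClientAPI .NET client as I recall them (delegate signatures differ slightly in newer client versions); the checkpoint handling and Handle method are application-specific placeholders:

```csharp
using System;
using EventStore.ClientAPI;

class CatchUpWatchdog
{
    // Sketch only: _checkpoint persistence and Handle(...) are placeholders, not a definitive implementation.
    long? _checkpoint;

    public void Wire(IEventStoreConnection conn, string stream)
    {
        conn.Disconnected += (s, e) => Console.WriteLine($"Disconnected from {e.RemoteEndPoint}");
        conn.Reconnecting += (s, e) => Console.WriteLine("Reconnecting...");
        conn.AuthenticationFailed += (s, e) => Environment.Exit(1); // crash and let the orchestrator restart us
        conn.Closed += (s, e) => Console.WriteLine($"Connection closed: {e.Reason}");
        Subscribe(conn, stream);
    }

    void Subscribe(IEventStoreConnection conn, string stream)
    {
        conn.SubscribeToStreamFrom(stream, _checkpoint, CatchUpSubscriptionSettings.Default,
            (sub, evt) =>
            {
                Handle(evt);
                _checkpoint = evt.OriginalEventNumber;   // remember progress for resubscription
            },
            subscriptionDropped: (sub, reason, ex) => Subscribe(conn, stream));   // recreate from the last checkpoint
    }

    void Handle(ResolvedEvent evt) { /* application-specific processing */ }
}
```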
@mpielikis By any chance, are you using
No, we don't. The freeze likely happens at a moment that combines slow stream reading with high persistent subscription usage. We sometimes saw high peaks in the StorageReaderQueue, with several thousand items in the queue, and then the freeze happened. To lower the impact of this issue, we divided the tasks between two EventStore instances with separate databases, so two disks were used for reading instead of one.
Thanks for the details @mpielikis. I earlier saw that you implemented a I've also done a code review again a few days ago and can deduce the following:
I assumed that the persistent subscription was live and that, due to the freeze,
Based on the above, I think that one of the premises #1 to #5 must be false, because it seems to lead to an impossible situation (or I'm missing something). Can you please confirm whether #3 was true? We could also set some additional (manually activated) logging between Thanks
We've been having very similar problems ourselves. We have found that editing the subscription through the UI, changing nothing, and just saving it wakes the subscription up again. Any movement on this since November?
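For what it's worth, the "edit, change nothing, save" workaround above can presumably also be scripted. Assuming the EventStore.ClientAPI .NET client, re-applying the group's current settings through the update call should have the same effect; the settings and names below are placeholders and should mirror whatever the group is actually configured with:

```csharp
// Sketch: nudging a stuck group by re-applying its (unchanged) settings.
// Call this from an async method; conn is an existing IEventStoreConnection.
PersistentSubscriptionSettings settings = PersistentSubscriptionSettings.Create()
    .ResolveLinkTos()
    .StartFromCurrent();

await conn.UpdatePersistentSubscriptionAsync(
    "my-stream", "my-group", settings,
    new UserCredentials("admin", "changeit"));
```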
This related fix may address some of the problems here: #1936
We got the same issue in production today. We restarted the master node and it resolved the issue.
Thank you for letting us know @alexeyzimarev
@hayley-jean it's 5.0.2 on CentOS
From the 3.8 versions to the recent 4.0.1.4 version, we see accidental (roughly once or twice a week) stopping/freezing of persistent subscriptions on a three-node cluster. We can't detect from the logs or from the index.html#/subscriptions page that there is a problem. We see that the stream is not being consumed, and restarting a client does not help. The only way to wake up these subscriptions is to restart the EventStore master. After the master reset/failover, all the subscriptions become active again instantly.
The last incident happened tonight, after an upgrade from 3.9.2 to 4.0.1.4. The subscription stopped around 2017.08.17 23:21:00 UTC.
Master log at that time
Cluster stats