This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[stable/elasticsearch] Terminating current master pod causes cluster outage of more than 30 seconds #8785

Closed
desaintmartin opened this issue Oct 26, 2018 · 18 comments

@desaintmartin
Collaborator

desaintmartin commented Oct 26, 2018

BUG REPORT

I have a funny issue with elasticsearch 6.4.2 when I delete the current master Pod.

In theory there should be a brief downtime of about 3 to 5 seconds, but I get ~35 seconds of downtime.

If I manually kill the java process (kubectl exec -it ... kill 1), the downtime is less than 5 seconds, which is expected.

The reason is quite simple to understand but does not make the fix obvious:
1/ I ask for Pod termination
2/ SIGTERM is sent to java, causing the ES master to shut down properly and to notify the other nodes
3/ the Docker container is stopped, and its IP is decommissioned from the network layer (Calico in my case)
4/ the new master election process starts
5/ the other nodes try to ping the dead master, but instead of getting "connection refused" as they would have in the "old" world outside of kubernetes/containers, they get nothing and time out after 30 seconds. This means no operation (not even a read) can be done during those 30 seconds.

When I kill the process instead of terminating the pod, steps 1 to 4 are the same, but at step 5 the other nodes try to ping the dead master, get "connection refused", remove the dead master from their state and continue the election.

Possible fixes are:
1/ Fix within elasticsearch: a master broadcasting that it is leaving the cluster should cause the other nodes to remove it immediately instead of trying to ping it
2/ Wait for a few seconds AFTER java has exited before terminating the container (and the network associated with it), so that the other nodes get "connection refused". This is what is done in https://github.com/mintel/es-image/pull/3/files, but the official ES image directly runs a script that execs into the java process, and it looks like a workaround rather than a proper fix (see the sketch after this list)
3/ Reduce the timeout (dangerous?)
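
A minimal sketch of what option 2 could look like, assuming the official 6.x image (the entrypoint path, sleep duration and surrounding StatefulSet fragment are illustrative, not what the chart currently ships):

```yaml
# StatefulSet container fragment: wrap the entrypoint so that the container
# (and therefore the pod IP) outlives the java process by a few seconds,
# giving the other nodes "connection refused" instead of a 30s connect timeout.
containers:
  - name: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.4.2
    command:
      - /bin/bash
      - -c
      - |
        /usr/local/bin/docker-entrypoint.sh eswrapper &
        es_pid=$!
        # Forward SIGTERM so Elasticsearch still shuts down cleanly.
        trap 'kill -TERM "$es_pid" 2>/dev/null' TERM
        # The first wait is interrupted by the trap; loop until java has really exited.
        while kill -0 "$es_pid" 2>/dev/null; do
          wait "$es_pid"
        done
        # Keep the network endpoint alive a little longer.
        sleep 10
```

Note that terminationGracePeriodSeconds has to cover the Elasticsearch shutdown plus the extra sleep, and a preStop hook cannot be used for this, because it runs before SIGTERM is delivered rather than after java exits.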

What do you think?

cc @simonswine @icereval @rendhalver @andrenarchy

Here is a log from another node that sees the current master "master-2" leave the cluster and then blocks everything for 30 seconds (all the other nodes do the same), resulting in a complete cluster lock-down (even for reads).

[2018-10-26T08:32:39,832][INFO ][o.e.d.z.ZenDiscovery     ] [test-elasticsearch-master-1] master_left [{test-elasticsearch-master-2}{0-1vwNxxSo6QZZTdk-yXCg}{-j2DNUo9SFyW14T0fewv8Q}{10.233.120.10}{10.233.120.10:9300}], reason [shut_down]
[2018-10-26T08:32:39,834][WARN ][o.e.d.z.ZenDiscovery     ] [test-elasticsearch-master-1] master left (reason = shut_down), current nodes: nodes: 
   {test-elasticsearch-data-1}{sSF8x02zQDeREJHw-NeHZg}{DvOX5HawRWS77hMPgjT07g}{10.233.105.88}{10.233.105.88:9300}
   {test-elasticsearch-data-0}{85jAAsnFQHm8SGJDQ4zWCw}{T0Bo1sGpSoWkEUvLFINhxA}{10.233.87.101}{10.233.87.101:9300}
   {test-elasticsearch-client-6d4bf78f76-4fn8n}{moF0fXIpTUq27RoGEfmnMg}{pjAv07guQHyC_9yn8xlsgw}{10.233.70.71}{10.233.70.71:9300}
   {test-elasticsearch-client-6d4bf78f76-w26dc}{A2OXcw9fRX664WjlsQa_QA}{_JTUm437QmaAelpvJZiBKQ}{10.233.105.85}{10.233.105.85:9300}
   {test-elasticsearch-master-1}{yg1ELS1fRlKximCnpgXo_w}{_Wlx6e_7Tf-WOhYBoxA_QA}{10.233.109.200}{10.233.109.200:9300}, local
   {test-elasticsearch-master-2}{0-1vwNxxSo6QZZTdk-yXCg}{-j2DNUo9SFyW14T0fewv8Q}{10.233.120.10}{10.233.120.10:9300}, master
   {test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300}

[2018-10-26T08:32:43,003][INFO ][o.e.c.s.ClusterApplierService] [test-elasticsearch-master-1] detected_master {test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300}, reason: apply cluster state (from master [master {test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300} committed version [8]])
[2018-10-26T08:33:11,481][WARN ][o.e.c.NodeConnectionsService] [test-elasticsearch-master-1] failed to connect to node {test-elasticsearch-master-2}{0-1vwNxxSo6QZZTdk-yXCg}{-j2DNUo9SFyW14T0fewv8Q}{10.233.120.10}{10.233.120.10:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [test-elasticsearch-master-2][10.233.120.10:9300] connect_timeout[30s]
	at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:643) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:542) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:329) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:316) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:153) [elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:180) [elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.4.2.jar:6.4.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:844) [?:?]
[2018-10-26T08:33:41,584][WARN ][o.e.c.s.ClusterApplierService] [test-elasticsearch-master-1] cluster state applier task [apply cluster state (from master [master {test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300} committed version [8]])] took [58.5s] above the warn threshold of 30s
[2018-10-26T08:33:42,844][INFO ][o.e.d.z.ZenDiscovery     ] [test-elasticsearch-master-1] failed to send join request to master [{test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-10-26T08:33:47,605][INFO ][o.e.c.s.ClusterApplierService] [test-elasticsearch-master-1] removed {{test-elasticsearch-master-2}{0-1vwNxxSo6QZZTdk-yXCg}{-j2DNUo9SFyW14T0fewv8Q}{10.233.120.10}{10.233.120.10:9300},}, reason: apply cluster state (from master [master {test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300} committed version [9]])
[2018-10-26T08:34:17,656][INFO ][o.e.c.s.ClusterApplierService] [test-elasticsearch-master-1] added {{test-elasticsearch-master-2}{0-1vwNxxSo6QZZTdk-yXCg}{ioZPkwyLSXaOWhXVwXBwqQ}{10.233.120.29}{10.233.120.29:9300},}, reason: apply cluster state (from master [master {test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300} committed version [10]])
[2018-10-26T08:34:18,099][WARN ][o.e.t.TransportService   ] [test-elasticsearch-master-1] Received response for a request that has timed out, sent [94096ms] ago, timed out [64096ms] ago, action [internal:discovery/zen/fd/master_ping], node [{test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300}], id [2232]
@desaintmartin desaintmartin changed the title [stable/elasticsearch] Delete current master node causes cluster outage of more than 30 seconds [stable/elasticsearch] Terminating current master pod causes cluster outage of more than 30 seconds Oct 26, 2018
@stale

stale bot commented Nov 25, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 25, 2018
@desaintmartin
Collaborator Author

Not solved.

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 25, 2018
@kimxogus
Contributor

I think this is related to this post.

I'm trying solution 3, which I can test by myself. Do you know which config sets the ping timeout for discovery/zen/fd/master_ping? discovery.zen.fd.ping_timeout didn't help.

@desaintmartin
Collaborator Author

The documentation mentions discovery.zen.ping_timeout [0], but I haven't tested it yet.

[0] https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#master-election

@kimxogus
Contributor

I set both discovery.zen.ping_timeout and discovery.zen.fd.ping_timeout to 3s, but it didn't help to solve this issue.
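
For reference, a sketch of how those settings were applied (elasticsearch.yml, values as described above):

```yaml
# elasticsearch.yml -- zen discovery timeouts as tested; valid 6.x settings,
# but as noted they did not resolve the outage on their own.
discovery.zen.ping_timeout: 3s
discovery.zen.fd.ping_timeout: 3s
```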

Any HTTP request to a node in the cluster only got a response after the logs below were written, and the outage was almost 2 minutes.

[2018-12-19T09:04:43,113][WARN ][o.e.t.TransportService   ] [es-monitoring-elasticsearch-data-1] Received response for a request that has timed out, sent [119179ms] ago, timed out [116179ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-monitoring-elasticsearch-master-0}{K6kMktL9QJC2sc7K-35McA}{YS5J3gLxRUGzm1wCYa7QNw}{100.96.165.136}{100.96.165.136:9300}], id [342396]
[2018-12-19T09:04:43,113][WARN ][o.e.t.TransportService   ] [es-monitoring-elasticsearch-data-1] Received response for a request that has timed out, sent [116179ms] ago, timed out [113179ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-monitoring-elasticsearch-master-0}{K6kMktL9QJC2sc7K-35McA}{YS5J3gLxRUGzm1wCYa7QNw}{100.96.165.136}{100.96.165.136:9300}], id [342397]
[2018-12-19T09:04:43,113][WARN ][o.e.t.TransportService   ] [es-monitoring-elasticsearch-data-1] Received response for a request that has timed out, sent [113179ms] ago, timed out [110179ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-monitoring-elasticsearch-master-0}{K6kMktL9QJC2sc7K-35McA}{YS5J3gLxRUGzm1wCYa7QNw}{100.96.165.136}{100.96.165.136:9300}], id [342398]

It seems discovery/zen/fd/master_ping doesn't honor discovery.zen.ping_timeout.

@kimxogus
Contributor

I tried many times after making sure all nodes have the ping_timeout config.
The outage time was reduced to about 1m, and nodes responded before the master_ping timeout log appeared.

@desaintmartin
Collaborator Author

desaintmartin commented Dec 19, 2018

1m seems to be expected, since 3s * 20 tries = 1 minute. This is why I considered that option dangerous: to get a quick timeout (~10 seconds), we would have to set this option to ~500ms, which is dangerously low.

@kimxogus
Contributor

kimxogus commented Dec 19, 2018

Ahh, you're right. Since master re-election is done in 3~5 seconds, I thought this was unrelated to master election.

Setting 3s seems OK for now in my cluster, but master election seemed to have trouble with 1s. I agree that ~500ms is dangerously low.

@kimxogus
Contributor

kimxogus commented Dec 19, 2018

I made a feature request elastic/elasticsearch#36822

@desaintmartin
Collaborator Author

I wonder one thing: why do the other ES masters still try to ping the master that left (see my point 1/)?

@DaveCTurner

DaveCTurner commented Dec 19, 2018

The default value for net.ipv4.tcp_retries2 is 15, which IMO is far too high for the kinds of environments in which Elasticsearch operates. This setting controls how many times a TCP packet is retransmitted before the connection is considered dropped, and there's exponential backoff involved, so 15 retries take many minutes to detect a dropped connection. Elasticsearch reacts quickly to a connection drop. One solution could be to set this lower - see elastic/elasticsearch#34405 (comment) for more details and for links to at least two other systems that recommend reducing it to 3, so that a connection drop is detected within a second or so.
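
One way to apply this on Kubernetes is an init container, assuming the cluster allows privileged (or unsafe-sysctl) pods; the fragment below is illustrative and not part of this chart:

```yaml
# Pod spec fragment: lower tcp_retries2 in the pod's network namespace before
# Elasticsearch starts. This sysctl is not in the kubelet's default "safe" list,
# hence the privileged init container; on kernels where it is not namespaced
# it would have to be set on the host instead.
initContainers:
  - name: tcp-retries
    image: busybox:1.31
    command: ["sysctl", "-w", "net.ipv4.tcp_retries2=3"]
    securityContext:
      privileged: true
```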

@kimxogus
Contributor

kimxogus commented Dec 21, 2018

As per elastic/elasticsearch#36822 (comment): setting net.ipv4.tcp_retries2=3 and transport.tcp.connect_timeout to 2~3 seconds brought the outage down to 8~10 seconds.
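
The Elasticsearch side of that change is a one-line setting (6.x name; the sysctl side is shown in the init-container sketch above):

```yaml
# elasticsearch.yml -- give up on connection attempts to the vanished master IP
# after a few seconds instead of the default 30s seen in the stack trace above.
transport.tcp.connect_timeout: 3s
```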

kimxogus added a commit to kimxogus/charts that referenced this issue Jan 16, 2019
kimxogus added a commit to kimxogus/charts that referenced this issue Jan 16, 2019
kimxogus added a commit to kimxogus/charts that referenced this issue Jan 16, 2019
kimxogus added a commit to kimxogus/charts that referenced this issue Jan 16, 2019
@kimxogus
Contributor

I think I fixed this issue in #10687

@stale

stale bot commented Feb 15, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 15, 2019
@stale

stale bot commented Mar 1, 2019

This issue is being automatically closed due to inactivity.

@pavdmyt
Contributor

pavdmyt commented Jul 24, 2019

Running an Elasticsearch cluster v6.7.1 and having the same issue: after deleting the elasticsearch master pod, the cluster is stuck for approx 65-70 seconds. Setting net.ipv4.tcp_retries2=3 and transport.connect_timeout=3s reduces the outage to 33-37 seconds, but I'm still not able to reach the 8-10 seconds mentioned by @kimxogus.

Setting discovery.zen.fd.ping_timeout=3s has no effect.

@kimxogus
Contributor

@pavdmyt I recommend elastic's official helm chart. This issue is fixed there via elastic/helm-charts#119: setting masterTerminationFix: true handles the problem automatically.
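
For reference, that is a single values override in the official chart (available from the chart version that includes the linked PR):

```yaml
# values.yaml for the elastic/elasticsearch Helm chart
masterTerminationFix: true
```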

@pavdmyt
Contributor

pavdmyt commented Jul 25, 2019

@kimxogus Thanks for the reference! I should give it a try.
