This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[stable/elasticsearch] Terminating current master pod causes cluster outage of more than 30 seconds #8785

Closed
desaintmartin opened this issue Oct 26, 2018 · 18 comments

@desaintmartin
Collaborator

desaintmartin commented Oct 26, 2018

BUG REPORT

I have a funny issue with elasticsearch 6.4.2 when I delete the current master Pod.

In theory there should be a brief downtime of about 3 to 5 seconds, but I get ~35 seconds of downtime.

If I manually kill the java process (kubectl exec -it ... kill 1), the downtime is less than 5 seconds, which is expected.

The reason is quite simple to understand but does not make the fix obvious:
1/ I ask for Pod termination
2/ SIGTERM is sent to java, causing the ES master to shut down properly and to notify the other nodes
3/ the Docker container is stopped, and its IP is decommissioned from the network layer (Calico in my case)
4/ the new master election process starts
5/ the other nodes try to ping the dead master, but instead of getting "connection refused" as they would have in the "old" world outside of kubernetes/containers, they get nothing and time out after 30 seconds. This means no operation (not even a read) can be done during those 30 seconds.

When I kill the process instead of terminating the pod, steps 1 to 4 are the same, but at step 5 the other nodes try to ping the dead master, get "connection refused", remove the dead master from their state and continue the election.

Possible fixes are:
1/ Fix within elasticsearch: a master broadcasting that it is leaving the cluster should cause the other nodes to remove it immediately instead of trying to ping it
2/ Wait for a few seconds AFTER java has exited before terminating the container (and the network associated with it), so that the other nodes get "connection refused". This is what is done in https://github.com/mintel/es-image/pull/3/files, but the official ES image directly runs a script that execs into the java process, and it looks like a workaround rather than a proper fix (see the sketch after this list)
3/ Reduce the timeout (dangerous?)
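
A minimal sketch of what option 2 could look like, assuming the official 6.x image (the entrypoint path, sleep duration and surrounding StatefulSet fragment are illustrative, not what the chart currently ships):

```yaml
# StatefulSet container fragment: wrap the entrypoint so that the container
# (and therefore the pod IP) outlives the java process by a few seconds,
# giving the other nodes "connection refused" instead of a 30s connect timeout.
containers:
  - name: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.4.2
    command:
      - /bin/bash
      - -c
      - |
        /usr/local/bin/docker-entrypoint.sh eswrapper &
        es_pid=$!
        # Forward SIGTERM so Elasticsearch still shuts down cleanly.
        trap 'kill -TERM "$es_pid" 2>/dev/null' TERM
        # The first wait is interrupted by the trap; loop until java has really exited.
        while kill -0 "$es_pid" 2>/dev/null; do
          wait "$es_pid"
        done
        # Keep the network endpoint alive a little longer.
        sleep 10
```

Note that terminationGracePeriodSeconds has to cover the Elasticsearch shutdown plus the extra sleep, and a preStop hook cannot be used for this, because it runs before SIGTERM is delivered rather than after java exits.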

What do you think?

cc @simonswine @icereval @rendhalver @andrenarchy

Here is a log from another node that sees the current master "master-2" leave the cluster and then blocks everything for 30 seconds (all the other nodes do the same), resulting in a complete cluster lock-down (even for reads).

[2018-10-26T08:32:39,832][INFO ][o.e.d.z.ZenDiscovery     ] [test-elasticsearch-master-1] master_left [{test-elasticsearch-master-2}{0-1vwNxxSo6QZZTdk-yXCg}{-j2DNUo9SFyW14T0fewv8Q}{10.233.120.10}{10.233.120.10:9300}], reason [shut_down]
[2018-10-26T08:32:39,834][WARN ][o.e.d.z.ZenDiscovery     ] [test-elasticsearch-master-1] master left (reason = shut_down), current nodes: nodes: 
   {test-elasticsearch-data-1}{sSF8x02zQDeREJHw-NeHZg}{DvOX5HawRWS77hMPgjT07g}{10.233.105.88}{10.233.105.88:9300}
   {test-elasticsearch-data-0}{85jAAsnFQHm8SGJDQ4zWCw}{T0Bo1sGpSoWkEUvLFINhxA}{10.233.87.101}{10.233.87.101:9300}
   {test-elasticsearch-client-6d4bf78f76-4fn8n}{moF0fXIpTUq27RoGEfmnMg}{pjAv07guQHyC_9yn8xlsgw}{10.233.70.71}{10.233.70.71:9300}
   {test-elasticsearch-client-6d4bf78f76-w26dc}{A2OXcw9fRX664WjlsQa_QA}{_JTUm437QmaAelpvJZiBKQ}{10.233.105.85}{10.233.105.85:9300}
   {test-elasticsearch-master-1}{yg1ELS1fRlKximCnpgXo_w}{_Wlx6e_7Tf-WOhYBoxA_QA}{10.233.109.200}{10.233.109.200:9300}, local
   {test-elasticsearch-master-2}{0-1vwNxxSo6QZZTdk-yXCg}{-j2DNUo9SFyW14T0fewv8Q}{10.233.120.10}{10.233.120.10:9300}, master
   {test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300}

[2018-10-26T08:32:43,003][INFO ][o.e.c.s.ClusterApplierService] [test-elasticsearch-master-1] detected_master {test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300}, reason: apply cluster state (from master [master {test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300} committed version [8]])
[2018-10-26T08:33:11,481][WARN ][o.e.c.NodeConnectionsService] [test-elasticsearch-master-1] failed to connect to node {test-elasticsearch-master-2}{0-1vwNxxSo6QZZTdk-yXCg}{-j2DNUo9SFyW14T0fewv8Q}{10.233.120.10}{10.233.120.10:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [test-elasticsearch-master-2][10.233.120.10:9300] connect_timeout[30s]
	at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:643) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:542) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:329) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:316) ~[elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:153) [elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:180) [elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.4.2.jar:6.4.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.4.2.jar:6.4.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:844) [?:?]
[2018-10-26T08:33:41,584][WARN ][o.e.c.s.ClusterApplierService] [test-elasticsearch-master-1] cluster state applier task [apply cluster state (from master [master {test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300} committed version [8]])] took [58.5s] above the warn threshold of 30s
[2018-10-26T08:33:42,844][INFO ][o.e.d.z.ZenDiscovery     ] [test-elasticsearch-master-1] failed to send join request to master [{test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-10-26T08:33:47,605][INFO ][o.e.c.s.ClusterApplierService] [test-elasticsearch-master-1] removed {{test-elasticsearch-master-2}{0-1vwNxxSo6QZZTdk-yXCg}{-j2DNUo9SFyW14T0fewv8Q}{10.233.120.10}{10.233.120.10:9300},}, reason: apply cluster state (from master [master {test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300} committed version [9]])
[2018-10-26T08:34:17,656][INFO ][o.e.c.s.ClusterApplierService] [test-elasticsearch-master-1] added {{test-elasticsearch-master-2}{0-1vwNxxSo6QZZTdk-yXCg}{ioZPkwyLSXaOWhXVwXBwqQ}{10.233.120.29}{10.233.120.29:9300},}, reason: apply cluster state (from master [master {test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300} committed version [10]])
[2018-10-26T08:34:18,099][WARN ][o.e.t.TransportService   ] [test-elasticsearch-master-1] Received response for a request that has timed out, sent [94096ms] ago, timed out [64096ms] ago, action [internal:discovery/zen/fd/master_ping], node [{test-elasticsearch-master-0}{1uOUCcrjRrW8746WjVz_0A}{qZN2rnxNQVGEXumoj2NZ9A}{10.233.105.95}{10.233.105.95:9300}], id [2232]
@desaintmartin desaintmartin changed the title [stable/elasticsearch] Delete current master node causes cluster outage of more than 30 seconds [stable/elasticsearch] Terminating current master pod causes cluster outage of more than 30 seconds Oct 26, 2018
@stale

stale bot commented Nov 25, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 25, 2018
@desaintmartin
Collaborator Author

Not solved.

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 25, 2018
@kimxogus
Contributor

I think this is related to this post.

I'm trying solution 3, which I can test by myself. Do you know which config sets the ping timeout for discovery/zen/fd/master_ping? discovery.zen.fd.ping_timeout didn't help.

@desaintmartin
Collaborator Author

The documentation mentions discovery.zen.ping_timeout [0], but I haven't tested it yet.

[0] https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#master-election

@kimxogus
Contributor

I set both discovery.zen.ping_timeout and discovery.zen.fd.ping_timeout to 3s, but it didn't help to solve this issue.
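
For reference, a sketch of how those settings were applied (elasticsearch.yml, values as described above):

```yaml
# elasticsearch.yml -- zen discovery timeouts as tested; valid 6.x settings,
# but as noted they did not resolve the outage on their own.
discovery.zen.ping_timeout: 3s
discovery.zen.fd.ping_timeout: 3s
```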

Any HTTP request to a node in the cluster only got a response after the logs below were written, and the outage was almost 2 minutes.

[2018-12-19T09:04:43,113][WARN ][o.e.t.TransportService   ] [es-monitoring-elasticsearch-data-1] Received response for a request that has timed out, sent [119179ms] ago, timed out [116179ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-monitoring-elasticsearch-master-0}{K6kMktL9QJC2sc7K-35McA}{YS5J3gLxRUGzm1wCYa7QNw}{100.96.165.136}{100.96.165.136:9300}], id [342396]
[2018-12-19T09:04:43,113][WARN ][o.e.t.TransportService   ] [es-monitoring-elasticsearch-data-1] Received response for a request that has timed out, sent [116179ms] ago, timed out [113179ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-monitoring-elasticsearch-master-0}{K6kMktL9QJC2sc7K-35McA}{YS5J3gLxRUGzm1wCYa7QNw}{100.96.165.136}{100.96.165.136:9300}], id [342397]
[2018-12-19T09:04:43,113][WARN ][o.e.t.TransportService   ] [es-monitoring-elasticsearch-data-1] Received response for a request that has timed out, sent [113179ms] ago, timed out [110179ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-monitoring-elasticsearch-master-0}{K6kMktL9QJC2sc7K-35McA}{YS5J3gLxRUGzm1wCYa7QNw}{100.96.165.136}{100.96.165.136:9300}], id [342398]

It seems discovery/zen/fd/master_ping doesn't honor discovery.zen.ping_timeout.

@kimxogus
Contributor

I tried many times after making sure all nodes have the ping_timeout config.
The outage time was reduced to about 1m, and nodes responded before the master_ping timeout log appeared.

@desaintmartin
Collaborator Author

desaintmartin commented Dec 19, 2018

1m seems to be expected, since 3s * 20 tries = 1 minute. This is why I considered that option dangerous: to get a quick timeout (~10 seconds), we would have to set this option to ~500ms, which is dangerously low.

@kimxogus
Contributor

kimxogus commented Dec 19, 2018

Ahh, you're right. Since master re-election is done in 3~5 seconds, I thought this was unrelated to master election.

Setting 3s seems OK for now in my cluster, but master election seemed to have trouble with 1s. I agree that ~500ms is dangerously low.

@kimxogus
Contributor

kimxogus commented Dec 19, 2018

I made a feature request elastic/elasticsearch#36822

@desaintmartin
Collaborator Author

I wonder one thing: why do the other ES masters still try to ping the master that left (see my point 1/)?

@DaveCTurner

DaveCTurner commented Dec 19, 2018

The default value for net.ipv4.tcp_retries2 is 15, which IMO is far too high for the kinds of environments in which Elasticsearch operates. This setting controls how many times a TCP packet is retransmitted before the connection is considered dropped, and there's exponential backoff involved, so 15 retries take many minutes to detect a dropped connection. Elasticsearch reacts quickly to a connection drop. One solution could be to set this lower - see elastic/elasticsearch#34405 (comment) for more details and for links to at least two other systems that recommend reducing it to 3, so that a connection drop is detected within a second or so.
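
One way to apply this on Kubernetes is an init container, assuming the cluster allows privileged (or unsafe-sysctl) pods; the fragment below is illustrative and not part of this chart:

```yaml
# Pod spec fragment: lower tcp_retries2 in the pod's network namespace before
# Elasticsearch starts. This sysctl is not in the kubelet's default "safe" list,
# hence the privileged init container; on kernels where it is not namespaced
# it would have to be set on the host instead.
initContainers:
  - name: tcp-retries
    image: busybox:1.31
    command: ["sysctl", "-w", "net.ipv4.tcp_retries2=3"]
    securityContext:
      privileged: true
```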

@kimxogus
Contributor

kimxogus commented Dec 21, 2018

As per elastic/elasticsearch#36822 (comment): setting net.ipv4.tcp_retries2=3 and transport.tcp.connect_timeout to 2~3 seconds brought the outage down to 8~10 seconds.
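
The Elasticsearch side of that change is a one-line setting (6.x name; the sysctl side is shown in the init-container sketch above):

```yaml
# elasticsearch.yml -- give up on connection attempts to the vanished master IP
# after a few seconds instead of the default 30s seen in the stack trace above.
transport.tcp.connect_timeout: 3s
```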

kimxogus added a commit to kimxogus/charts that referenced this issue Jan 16, 2019
kimxogus added a commit to kimxogus/charts that referenced this issue Jan 16, 2019
kimxogus added a commit to kimxogus/charts that referenced this issue Jan 16, 2019
kimxogus added a commit to kimxogus/charts that referenced this issue Jan 16, 2019
@kimxogus
Contributor

I think I fixed this issue in #10687

@stale

stale bot commented Feb 15, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 15, 2019
@stale

stale bot commented Mar 1, 2019

This issue is being automatically closed due to inactivity.

@pavdmyt
Contributor

pavdmyt commented Jul 24, 2019

Running an Elasticsearch cluster v6.7.1 and having the same issue: after deleting the elasticsearch master pod, the cluster is stuck for approx 65-70 seconds. Setting net.ipv4.tcp_retries2=3 and transport.connect_timeout=3s reduces the outage to 33-37 seconds, but I'm still not able to reach the 8-10 seconds mentioned by @kimxogus.

Setting discovery.zen.fd.ping_timeout=3s has no effect.

@kimxogus
Contributor

@pavdmyt I recommend elastic's official helm chart. This issue is fixed there via elastic/helm-charts#119: setting masterTerminationFix: true handles the problem automatically.
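
For reference, that is a single values override in the official chart (available from the chart version that includes the linked PR):

```yaml
# values.yaml for the elastic/elasticsearch Helm chart
masterTerminationFix: true
```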

@pavdmyt
Contributor

pavdmyt commented Jul 25, 2019

@kimxogus Thanks for the reference! I should give it a try.
