Describe the feature:
- Configuration to customize `discovery/zen/fd/master_ping` behavior: a config option to make Elasticsearch skip pinging and waiting for the old master before electing a new one.
In a Kubernetes environment, the IP of each member node in the cluster belongs to a pod, which is a Docker container. When a pod (node) is terminated, pings to the old master's address time out, because the newly created pod (node) gets a different IP address. In this situation the cluster outage lasts roughly `discovery.zen.join_timeout`, which defaults to 20 times the ping timeout (as per the [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#master-election)), i.e. more than a minute. Reducing `ping_timeout` below 1 second is too dangerous (it may break master election), and making Elasticsearch wait several seconds after SIGTERM just to keep the pod IP reachable for pings does not seem like a proper solution either. As in [this discussion](https://discuss.elastic.co/t/timed-out-waiting-for-all-nodes-to-process-published-state-and-cluster-unavailability/138590), I believe that adding a config option to make Elasticsearch skip pinging and waiting for the old master before electing a new one would be a good solution. (A sketch of the relevant timing settings follows the references below.)
- Reference: helm/charts#8785 ([stable/elasticsearch] Terminating current master pod causes cluster outage of more than 30 seconds), https://discuss.elastic.co/t/timed-out-waiting-for-all-nodes-to-process-published-state-and-cluster-unavailability/138590
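For context, a minimal sketch of the timing involved, assuming a 6.x node started directly with the values from this report (the settings and the 20x default come from the zen discovery docs; the proposed skip option does not exist today):

```sh
# With discovery.zen.ping_timeout=3s, discovery.zen.join_timeout defaults to
# 20 * ping_timeout = 20 * 3s = 60s, which matches the ~1 minute window during
# which nodes keep trying to reach the old master's now-unreachable pod IP.
bin/elasticsearch \
  -E discovery.zen.ping_timeout=3s \
  -E discovery.zen.fd.ping_timeout=3s \
  -E discovery.zen.fd.ping_retries=3 \
  -E discovery.zen.join_timeout=60s   # explicit here; defaults to 20 * ping_timeout if unset
```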
Elasticsearch version (`bin/elasticsearch --version`): 6.2.3
Plugins installed: [ingest-geoip, ingest-user-agent, repository-s3]
JVM version (`java -version`):
openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
OS version (`uname -a` if on a Unix-like system): Linux {HOSTNAME} 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Steps to reproduce:
Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query, etc. The easier you make it for
us to reproduce, the more likely it is that somebody will take the time to look at it.
- Deploy an Elasticsearch cluster in Kubernetes (via a Helm chart in my case)
- Terminate the current master pod (node)
- A new master is elected within 3-5 seconds, but no member node in the cluster responds to HTTP requests for about 1 minute (with `discovery.zen.ping_timeout=3s` and `discovery.zen.fd.ping_timeout=3s`). A rough reproduction sketch follows below.
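For reference, a rough sketch of how I reproduce this, assuming the stable/elasticsearch Helm chart and a client service reachable as `elasticsearch:9200` (the release name, pod name, and service address are assumptions and will differ per deployment):

```sh
# Install the chart (Helm 2 syntax) and wait for the cluster to form.
helm install --name es-monitoring stable/elasticsearch

# See which node is currently master, then delete that pod.
curl -s http://elasticsearch:9200/_cat/master
kubectl delete pod es-monitoring-elasticsearch-master-1

# Poll cluster health every second; every node stops answering HTTP
# requests for roughly a minute before the new master takes over.
while true; do
  date
  curl -s -m 2 'http://elasticsearch:9200/_cluster/health?pretty' | grep status
  sleep 1
done
```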
Provide logs (if relevant):
[2018-12-19T09:12:33,326][INFO ][o.e.c.s.ClusterApplierService] [es-monitoring-elasticsearch-master-0] detected_master {es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}, added {{es-monitoring-elasticsearch-client-57654b8f98-p47cm}{HJlePFqgQxq_wmFDEDNQEw}{Thx48_UDSL2CwzLrg0NL2w}{100.96.161.172}{100.96.161.172:9300},{es-monitoring-elasticsearch-master-2}{v3FjSTfcQ4OHAzXCzDcKFQ}{e1X2hVV8SIOkDk1wvE3LKw}{100.96.162.240}{100.96.162.240:9300},{es-monitoring-elasticsearch-data-2}{V2meIqpNTQOH8zY4PCtQ7g}{Pr9uoG03Qc6Xx2h4x-o62A}{100.96.162.225}{100.96.162.225:9300},{es-monitoring-elasticsearch-data-1}{mqfXo0yqTaCcEc956tVmpA}{NQehgvsvQq2Kh1K6tKZaxA}{100.96.161.175}{100.96.161.175:9300},{es-monitoring-elasticsearch-data-0}{rn-v-yB8RbeoHXovkC4UYQ}{vF1s-vC7TheNqloIyEJg4A}{100.96.165.88}{100.96.165.88:9300},{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300},{es-monitoring-elasticsearch-client-57654b8f98-dgvxm}{462pBrdyScC9WgmlkJr8ug}{vv_jrSxbTHi3wo-r03k0fQ}{100.96.166.205}{100.96.166.205:9300},}, reason: apply cluster state (from master [master {es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300} committed version [367]])
[2018-12-19T09:12:43,331][INFO ][o.e.d.z.ZenDiscovery ] [es-monitoring-elasticsearch-master-0] master_left [{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}], reason [failed to ping, tried [3] times, each with maximum [3s] timeout]
[2018-12-19T09:12:43,332][WARN ][o.e.d.z.ZenDiscovery ] [es-monitoring-elasticsearch-master-0] master left (reason = failed to ping, tried [3] times, each with maximum [3s] timeout), current nodes: nodes:
{es-monitoring-elasticsearch-client-57654b8f98-p47cm}{HJlePFqgQxq_wmFDEDNQEw}{Thx48_UDSL2CwzLrg0NL2w}{100.96.161.172}{100.96.161.172:9300}
{es-monitoring-elasticsearch-master-2}{v3FjSTfcQ4OHAzXCzDcKFQ}{e1X2hVV8SIOkDk1wvE3LKw}{100.96.162.240}{100.96.162.240:9300}
{es-monitoring-elasticsearch-data-2}{V2meIqpNTQOH8zY4PCtQ7g}{Pr9uoG03Qc6Xx2h4x-o62A}{100.96.162.225}{100.96.162.225:9300}
{es-monitoring-elasticsearch-master-0}{K6kMktL9QJC2sc7K-35McA}{srwO3u3SS9GYAWYeLyUn-g}{100.96.165.141}{100.96.165.141:9300}, local
{es-monitoring-elasticsearch-data-1}{mqfXo0yqTaCcEc956tVmpA}{NQehgvsvQq2Kh1K6tKZaxA}{100.96.161.175}{100.96.161.175:9300}
{es-monitoring-elasticsearch-data-0}{rn-v-yB8RbeoHXovkC4UYQ}{vF1s-vC7TheNqloIyEJg4A}{100.96.165.88}{100.96.165.88:9300}
{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}, master
{es-monitoring-elasticsearch-client-57654b8f98-dgvxm}{462pBrdyScC9WgmlkJr8ug}{vv_jrSxbTHi3wo-r03k0fQ}{100.96.166.205}{100.96.166.205:9300}
[2018-12-19T09:12:57,612][INFO ][o.e.d.z.ZenDiscovery ] [es-monitoring-elasticsearch-master-0] failed to send join request to master [{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-12-19T09:13:01,851][INFO ][o.e.c.s.ClusterApplierService] [es-monitoring-elasticsearch-master-0] detected_master {es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}, reason: apply cluster state (from master [master {es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300} committed version [368]])
[2018-12-19T09:13:01,857][WARN ][o.e.t.TransportService ] [es-monitoring-elasticsearch-master-0] Received response for a request that has timed out, sent [27532ms] ago, timed out [24531ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}], id [29]
[2018-12-19T09:13:01,857][WARN ][o.e.t.TransportService ] [es-monitoring-elasticsearch-master-0] Received response for a request that has timed out, sent [24529ms] ago, timed out [21529ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}], id [30]
[2018-12-19T09:13:01,858][WARN ][o.e.t.TransportService ] [es-monitoring-elasticsearch-master-0] Received response for a request that has timed out, sent [21530ms] ago, timed out [18530ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}], id [31]
[2018-12-19T09:15:54,284][INFO ][o.e.c.s.ClusterApplierService] [es-monitoring-elasticsearch-master-0] removed {{es-monitoring-elasticsearch-client-57654b8f98-dgvxm}{462pBrdyScC9WgmlkJr8ug}{vv_jrSxbTHi3wo-r03k0fQ}{100.96.166.205}{100.96.166.205:9300},}, reason: apply cluster state (from master [master {es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300} committed version [369]])