[Feature Request] Configuration to customize discovery/zen/fd/master_ping #36822

Closed
@kimxogus

Describe the feature:

  • Configuration to customize discovery/zen/fd/master_ping: a config option that makes Elasticsearch skip pinging and waiting for the old master before switching to the new master.

In a Kubernetes environment, each Elasticsearch node runs in a pod (a Docker container) with its own IP address. When a pod (node) is terminated, pings to the old master's address time out, because the newly created pod (node) gets a different IP address. In this situation the cluster is unavailable for roughly `discovery.zen.join_timeout`, which defaults to 20 times `discovery.zen.ping_timeout` (see the [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#master-election)), so about a minute or more. Reducing `ping_timeout` below 1 second is too dangerous (it may cause problems in master election), and delaying shutdown for several seconds after SIGTERM so that the pod keeps its IP and can still answer pings doesn't seem to be a proper solution. As in [this discussion](https://discuss.elastic.co/t/timed-out-waiting-for-all-nodes-to-process-published-state-and-cluster-unavailability/138590), I believe that a config option that makes Elasticsearch skip pinging and waiting for the old master before switching to the new master would be a good solution.
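
To make the timeout arithmetic concrete, here is a minimal sketch. It assumes the default described in the linked master-election documentation, where `discovery.zen.join_timeout` is 20 times `discovery.zen.ping_timeout`; the function name is hypothetical and only for illustration.

```python
def default_join_timeout_s(ping_timeout_s: float, multiplier: int = 20) -> float:
    """Return the default join timeout implied by a given ping timeout.

    Assumption: join_timeout defaults to `multiplier` (20) times
    discovery.zen.ping_timeout, as stated in the linked documentation.
    """
    return ping_timeout_s * multiplier


# With the 3s ping timeout used in this report:
window_s = default_join_timeout_s(3.0)
print(f"default join_timeout for ping_timeout=3s: {window_s:.0f}s")
```

With `ping_timeout=3s` this comes to 60 seconds, which matches the roughly one-minute outage I observe.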

Elasticsearch version (bin/elasticsearch --version): 6.2.3

Plugins installed: [ingest-geoip, ingest-user-agent, repository-s3]

JVM version (java -version):

openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)

OS version (uname -a if on a Unix-like system): Linux {HOSTNAME} 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Steps to reproduce:

  1. Deploy an Elasticsearch cluster in Kubernetes (via a Helm chart in my case).
  2. Terminate the current master pod (node).
  3. A new master is elected within 3~5 seconds, but no node in the cluster
    responds to HTTP requests for about 1 minute (with discovery.zen.ping_timeout=3s and discovery.zen.fd.ping_timeout=3s).
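
One way to measure the unavailability in step 3 is to poll an HTTP endpoint while the master pod is being terminated. The following is only a sketch: `http_ok` and `longest_outage` are hypothetical helper names, and the URL in the usage comment assumes the cluster is reachable on `localhost:9200`.

```python
import http.client
import time
import urllib.request
from typing import Callable, Optional


def http_ok(url: str, timeout_s: float = 1.0) -> bool:
    """Return True if the URL answers with HTTP 200 within timeout_s."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts and non-2xx responses count as "down".
        return False


def longest_outage(
    probe: Callable[[], bool],
    duration_s: float,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll probe() for duration_s; return the longest failure window in seconds."""
    start = clock()
    down_since: Optional[float] = None
    longest = 0.0
    while True:
        now = clock()
        if now - start >= duration_s:
            break
        if probe():
            if down_since is not None:
                # Service recovered: close the current failure window.
                longest = max(longest, now - down_since)
                down_since = None
        elif down_since is None:
            # First failed probe of a new failure window.
            down_since = now
        sleep(interval_s)
    if down_since is not None:
        # Still down when polling stopped.
        longest = max(longest, clock() - down_since)
    return longest


# Example usage (assumed address; start this, then terminate the master pod):
#   url = "http://localhost:9200/_cluster/health"
#   print(longest_outage(lambda: http_ok(url), duration_s=180))
```

The resolution of the measurement is limited by `interval_s` plus the probe timeout, so the result is an approximation of the real outage window.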

Provide logs (if relevant):

[2018-12-19T09:12:33,326][INFO ][o.e.c.s.ClusterApplierService] [es-monitoring-elasticsearch-master-0] detected_master {es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}, added {{es-monitoring-elasticsearch-client-57654b8f98-p47cm}{HJlePFqgQxq_wmFDEDNQEw}{Thx48_UDSL2CwzLrg0NL2w}{100.96.161.172}{100.96.161.172:9300},{es-monitoring-elasticsearch-master-2}{v3FjSTfcQ4OHAzXCzDcKFQ}{e1X2hVV8SIOkDk1wvE3LKw}{100.96.162.240}{100.96.162.240:9300},{es-monitoring-elasticsearch-data-2}{V2meIqpNTQOH8zY4PCtQ7g}{Pr9uoG03Qc6Xx2h4x-o62A}{100.96.162.225}{100.96.162.225:9300},{es-monitoring-elasticsearch-data-1}{mqfXo0yqTaCcEc956tVmpA}{NQehgvsvQq2Kh1K6tKZaxA}{100.96.161.175}{100.96.161.175:9300},{es-monitoring-elasticsearch-data-0}{rn-v-yB8RbeoHXovkC4UYQ}{vF1s-vC7TheNqloIyEJg4A}{100.96.165.88}{100.96.165.88:9300},{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300},{es-monitoring-elasticsearch-client-57654b8f98-dgvxm}{462pBrdyScC9WgmlkJr8ug}{vv_jrSxbTHi3wo-r03k0fQ}{100.96.166.205}{100.96.166.205:9300},}, reason: apply cluster state (from master [master {es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300} committed version [367]])
[2018-12-19T09:12:43,331][INFO ][o.e.d.z.ZenDiscovery     ] [es-monitoring-elasticsearch-master-0] master_left [{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}], reason [failed to ping, tried [3] times, each with  maximum [3s] timeout]
[2018-12-19T09:12:43,332][WARN ][o.e.d.z.ZenDiscovery     ] [es-monitoring-elasticsearch-master-0] master left (reason = failed to ping, tried [3] times, each with  maximum [3s] timeout), current nodes: nodes:
   {es-monitoring-elasticsearch-client-57654b8f98-p47cm}{HJlePFqgQxq_wmFDEDNQEw}{Thx48_UDSL2CwzLrg0NL2w}{100.96.161.172}{100.96.161.172:9300}
   {es-monitoring-elasticsearch-master-2}{v3FjSTfcQ4OHAzXCzDcKFQ}{e1X2hVV8SIOkDk1wvE3LKw}{100.96.162.240}{100.96.162.240:9300}
   {es-monitoring-elasticsearch-data-2}{V2meIqpNTQOH8zY4PCtQ7g}{Pr9uoG03Qc6Xx2h4x-o62A}{100.96.162.225}{100.96.162.225:9300}
   {es-monitoring-elasticsearch-master-0}{K6kMktL9QJC2sc7K-35McA}{srwO3u3SS9GYAWYeLyUn-g}{100.96.165.141}{100.96.165.141:9300}, local
   {es-monitoring-elasticsearch-data-1}{mqfXo0yqTaCcEc956tVmpA}{NQehgvsvQq2Kh1K6tKZaxA}{100.96.161.175}{100.96.161.175:9300}
   {es-monitoring-elasticsearch-data-0}{rn-v-yB8RbeoHXovkC4UYQ}{vF1s-vC7TheNqloIyEJg4A}{100.96.165.88}{100.96.165.88:9300}
   {es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}, master
   {es-monitoring-elasticsearch-client-57654b8f98-dgvxm}{462pBrdyScC9WgmlkJr8ug}{vv_jrSxbTHi3wo-r03k0fQ}{100.96.166.205}{100.96.166.205:9300}

[2018-12-19T09:12:57,612][INFO ][o.e.d.z.ZenDiscovery     ] [es-monitoring-elasticsearch-master-0] failed to send join request to master [{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-12-19T09:13:01,851][INFO ][o.e.c.s.ClusterApplierService] [es-monitoring-elasticsearch-master-0] detected_master {es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}, reason: apply cluster state (from master [master {es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300} committed version [368]])
[2018-12-19T09:13:01,857][WARN ][o.e.t.TransportService   ] [es-monitoring-elasticsearch-master-0] Received response for a request that has timed out, sent [27532ms] ago, timed out [24531ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}], id [29]
[2018-12-19T09:13:01,857][WARN ][o.e.t.TransportService   ] [es-monitoring-elasticsearch-master-0] Received response for a request that has timed out, sent [24529ms] ago, timed out [21529ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}], id [30]
[2018-12-19T09:13:01,858][WARN ][o.e.t.TransportService   ] [es-monitoring-elasticsearch-master-0] Received response for a request that has timed out, sent [21530ms] ago, timed out [18530ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300}], id [31]
[2018-12-19T09:15:54,284][INFO ][o.e.c.s.ClusterApplierService] [es-monitoring-elasticsearch-master-0] removed {{es-monitoring-elasticsearch-client-57654b8f98-dgvxm}{462pBrdyScC9WgmlkJr8ug}{vv_jrSxbTHi3wo-r03k0fQ}{100.96.166.205}{100.96.166.205:9300},}, reason: apply cluster state (from master [master {es-monitoring-elasticsearch-master-1}{IYHqXZysTTeNLaIGIs3Ggw}{SAoOYOl1T0W-XdV2kEzoYA}{100.96.166.11}{100.96.166.11:9300} committed version [369]])
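
To read the timing out of logs like these, the interval between a `master_left` event and the next `detected_master` event on the same node can be computed from the timestamps. This is a rough sketch: `master_gaps` is a hypothetical helper, the sample lines are abbreviated from the log above (node details replaced by `...`), and the result is the gap seen by this one node, which is not necessarily the full HTTP outage.

```python
import re
from datetime import datetime
from typing import Iterable, List

# Matches the leading "[2018-12-19T09:12:43,331]" timestamp of a log line.
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]")


def master_gaps(lines: Iterable[str]) -> List[float]:
    """Return seconds between each master_left and the next detected_master."""
    gaps: List[float] = []
    left_at = None
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue  # continuation lines (node lists) carry no timestamp
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S,%f")
        if "] master_left [" in line:
            left_at = ts
        elif "] detected_master " in line and left_at is not None:
            gaps.append((ts - left_at).total_seconds())
            left_at = None
    return gaps


# Abbreviated versions of lines from the log above.
sample = [
    "[2018-12-19T09:12:33,326][INFO ][o.e.c.s.ClusterApplierService] [es-monitoring-elasticsearch-master-0] detected_master ...",
    "[2018-12-19T09:12:43,331][INFO ][o.e.d.z.ZenDiscovery     ] [es-monitoring-elasticsearch-master-0] master_left [...], reason [failed to ping, ...]",
    "[2018-12-19T09:13:01,851][INFO ][o.e.c.s.ClusterApplierService] [es-monitoring-elasticsearch-master-0] detected_master ...",
]
print(master_gaps(sample))
```

For the log above this gives a gap of about 18.5 seconds between `master_left` (09:12:43,331) and the following `detected_master` (09:13:01,851).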
