As part of the ping process, a master for the cluster is either elected or joined. This is done automatically. The discovery.zen.ping_timeout setting (which defaults to 3s) allows the election time to be tuned for slow or congested networks (higher values reduce the chance of failure). Once a node joins, it sends a join request to the master with a timeout (discovery.zen.join_timeout) that defaults to 20 times the ping timeout.
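As a small worked example of the relationship between the two timeouts, the sketch below derives the default join timeout from the ping timeout. The helper names `parse_seconds` and `default_join_timeout` are hypothetical and exist only for this illustration; they are not part of Elasticsearch.

```python
# Illustration only: derive the default join timeout from the ping timeout.
# The function names are hypothetical, not an Elasticsearch API.

PING_TIMEOUT_DEFAULT = "3s"   # default of discovery.zen.ping_timeout
JOIN_TIMEOUT_FACTOR = 20      # join timeout defaults to 20x the ping timeout


def parse_seconds(value: str) -> float:
    """Parse a time value such as '3s' or '500ms' into seconds."""
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    raise ValueError(f"unsupported time value: {value!r}")


def default_join_timeout(ping_timeout: str = PING_TIMEOUT_DEFAULT) -> float:
    """Return the default join timeout in seconds for a given ping timeout."""
    return JOIN_TIMEOUT_FACTOR * parse_seconds(ping_timeout)
```

With the default ping timeout of 3s, this gives a join timeout of 60 seconds.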
When the master node stops or has encountered a problem, the cluster nodes start pinging again and will elect a new master. This pinging round also serves as a protection against (partial) network failures where a node may unjustly think that the master has failed. In this case the node will simply hear from other nodes about the currently active master.
If discovery.zen.master_election.filter_client is true, pings from client nodes (nodes where node.client is true, or both node.data and node.master are false) are ignored during master election; the default value is true. If discovery.zen.master_election.filter_data is true, pings from non-master-eligible data nodes (nodes where node.data is true and node.master is false) are ignored during master election; the default value is false. Pings from master-eligible nodes are always observed during master election.
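The filtering rules above can be sketched as a small function. This is a simplified model of the described behaviour, not the Elasticsearch implementation; the `PingResponse` type and `filter_pings` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PingResponse:
    """Hypothetical ping response carrying the sender's node settings."""
    node_name: str
    master: bool          # node.master
    data: bool            # node.data
    client: bool = False  # node.client


def filter_pings(pings: List[PingResponse],
                 filter_client: bool = True,
                 filter_data: bool = False) -> List[PingResponse]:
    """Return the pings that are observed during master election."""
    kept = []
    for p in pings:
        master_eligible = p.master and not p.client
        is_client = p.client or (not p.data and not p.master)
        is_data_only = p.data and not p.master and not p.client
        if master_eligible:
            kept.append(p)   # master-eligible nodes are always observed
        elif is_client and filter_client:
            continue         # client pings ignored (default behaviour)
        elif is_data_only and filter_data:
            continue         # data-only pings ignored only when enabled
        else:
            kept.append(p)
    return kept
```

With the default flags, a ping from a data-only node is still observed, while a ping from a client node is dropped.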
Nodes can be excluded from becoming a master by setting node.master to false. Note, once a node is a client node (node.client set to true), it will not be allowed to become a master (node.master is automatically set to false).
The discovery.zen.minimum_master_nodes setting sets the minimum number of master-eligible nodes that need to join a newly elected master in order for an election to complete and for the elected node to accept its mastership. The same setting controls the minimum number of active master-eligible nodes that should be a part of any active cluster. If this requirement is not met, the active master node will step down and a new master election will begin.
This setting must be set to a quorum of your master eligible nodes. It is recommended to avoid having only two master eligible nodes, since a quorum of two is two. Therefore, a loss of either master node will result in an inoperable cluster.
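The quorum arithmetic behind this recommendation can be checked with a few lines. The functions `quorum` and `tolerated_losses` are hypothetical helpers for illustration, using the usual majority formula of n/2 + 1 with integer division.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Majority of the master-eligible nodes: n // 2 + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def tolerated_losses(master_eligible_nodes: int) -> int:
    """How many master-eligible nodes can be lost while a quorum remains."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, quorum(n), tolerated_losses(n))
```

For two master-eligible nodes the quorum is two, so no node can be lost; with three nodes the quorum is still two, so one loss is tolerated.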
Fault Detection
Pings are used to confirm whether other nodes are still in the cluster. Detection is split into two parts:
MasterFaultDetection runs on all nodes.
NodesFaultDetection runs only on the master.
There are two fault detection processes running. The first is run by the master, which pings all the other nodes in the cluster to verify that they are alive. On the other end, each node pings the master to verify that it is still alive or whether an election process needs to be initiated.
The following settings control the fault detection process using the discovery.zen.fd prefix:
ping_interval How often a node gets pinged. Defaults to 1s.
ping_timeout How long to wait for a ping response, defaults to 30s.
ping_retries How many ping failures / timeouts cause a node to be considered failed. Defaults to 3.
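How ping_retries turns individual ping results into a failure decision can be sketched as below. This is a minimal model under one explicit assumption: failures are counted consecutively and a successful ping resets the count. The function name `first_failure_round` is hypothetical and this is not the Elasticsearch implementation.

```python
from typing import Iterable, Optional


def first_failure_round(ping_results: Iterable[bool],
                        ping_retries: int = 3) -> Optional[int]:
    """Return the 1-based ping round at which a node is considered failed.

    Each element of ping_results is True for a successful ping and False
    for a failure or timeout. Assumption: only consecutive failures count,
    and a success resets the counter. Returns None if the node never fails.
    """
    consecutive_failures = 0
    for round_number, ok in enumerate(ping_results, start=1):
        if ok:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= ping_retries:
                return round_number
    return None
```

Under this assumption, a single successful ping between failures keeps the node in the cluster, which is why an occasional slow response does not immediately trigger a new election.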
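An earlier post roughly translated an article on why split brain occurs in ES and how to prevent it by setting the minimum_master_nodes parameter. The end of that post mentioned an issue pointing out that split brain can still occur even when this parameter is set. This post tries to explain the problem raised in that issue, along with some related questions.
Why Still Split Brain?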
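Suppose three nodes form a cluster, with Node2 as the master, as shown in the figure below.
Later, the network connection between Node2 and Node3 breaks, but both can still ping Node1. At this point Node2 believes that only Node1 and Node2 remain in the cluster, while Node3 believes that only Node1 and Node3 remain. Node2 is still the master of its small cluster. If Node3 now also elects itself master, the cluster has two masters and split brain occurs.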
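The scenario above can be sketched as a simple connectivity check. This models only which nodes each node can see and compares that count against a minimum_master_nodes value of 2; it assumes all three nodes are master-eligible and it is not the actual Elasticsearch election logic. The names `visible_nodes` and `meets_quorum` are hypothetical.

```python
from typing import FrozenSet, Set

NODES: Set[str] = {"Node1", "Node2", "Node3"}
# Only the link between Node2 and Node3 is broken (a partial partition).
BROKEN_LINKS: Set[FrozenSet[str]] = {frozenset(("Node2", "Node3"))}


def visible_nodes(node: str, nodes: Set[str],
                  broken_links: Set[FrozenSet[str]]) -> Set[str]:
    """Nodes that `node` can reach, including itself."""
    return {other for other in nodes
            if other == node or frozenset((node, other)) not in broken_links}


def meets_quorum(node: str, nodes: Set[str],
                 broken_links: Set[FrozenSet[str]],
                 minimum_master_nodes: int) -> bool:
    """True if `node` sees at least minimum_master_nodes nodes."""
    return len(visible_nodes(node, nodes, broken_links)) >= minimum_master_nodes


if __name__ == "__main__":
    for name in sorted(NODES):
        seen = sorted(visible_nodes(name, NODES, BROKEN_LINKS))
        ok = meets_quorum(name, NODES, BROKEN_LINKS, minimum_master_nodes=2)
        print(name, seen, ok)
```

Both Node2 and Node3 see two nodes, so a check based purely on the number of visible nodes passes on both sides of the broken link.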
This split brain problem can occur even when minimum_master_nodes is set to 2, that is, 3/2 + 1 with integer division.
Follow-up
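The issue has since been closed, which means the problem has been resolved.
A PR that reworked Zen Discovery fixed it. I have not yet worked out exactly how the fix works, so let's first introduce the Zen Discovery that ES uses.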
Zen Discovery
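(Understanding this part really requires reading the relevant code, but I got stuck doing so, so the content here borrows from this article.)
Inside the cluster, ES uses a method called Zen Discovery. It provides a mechanism called Unicast Discovery to maintain the state of the cluster. Zen Discovery mainly consists of the following modules: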
Ping
An ES instance finds other nodes through pings. What exactly happens inside a ping requires reading the code.
Unicast
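Unicast is point-to-point communication. To improve efficiency, certain hosts can be designated as gossip routers, meaning that unicast is carried out in the style of a gossip protocol.
Unicast uses the Transport module to perform discovery.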
Master Election
Master Election is what elects the master. This part also requires reading the code first.