Elasticsearch remains unhealthy because master node doesn't think it's master #19842
Comments
How many master-eligible nodes do you have, and what is the value of discovery.zen.minimum_master_nodes?
We have 3 master-eligible nodes, and the value of discovery.zen.minimum_master_nodes is 2.
I don't understand this paragraph. You're speaking as if there is a single master-eligible node, but you say that you have three. Either way, if there are not at least as many master-eligible nodes available as the value of discovery.zen.minimum_master_nodes, then no master will be elected and the cluster will not form.
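For context, the usual quorum rule is minimum_master_nodes = (number of master-eligible nodes / 2) + 1, so three master-eligible nodes give a quorum of 2. A minimal sketch of applying it on the 1.x command line (the node name and flags other than the quorum setting are illustrative):

# quorum for 3 master-eligible nodes: floor(3 / 2) + 1 = 2
# start every master-eligible node with the same value
./bin/elasticsearch --node.name=master-1 --node.master=true --discovery.zen.minimum_master_nodes=2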
OK, let me clarify. There are 3 master-eligible nodes in our cluster. If for some reason the elected master node does not find the other master-eligible nodes (because they are yet to bootstrap), then it stops acting as master, which is expected. But after all the master-eligible nodes are bootstrapped, I would expect one of them to become master and Elasticsearch to become healthy automatically. In our case this doesn't happen unless we do a full cluster restart. Do you see the problem here?
Correct, that is what is supposed to happen.
This does not reproduce for me; we will need more detail. Can you provide your discovery/network settings and the relevant log messages?
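For anyone following along, the usual way to capture that output is via the HTTP API; a sketch, assuming a default local install listening on port 9200:

# cluster health: shows status and whether a master has been discovered
curl -s 'http://localhost:9200/_cluster/health?pretty'

# per-node settings, including the effective discovery.zen.* values
curl -s 'http://localhost:9200/_nodes/settings?pretty'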
Here are our discovery/network settings:

discovery.zen.ping.timeout: 2m
discovery.zen.ping.multicast.enabled: false
discovery.zen.fd.ping_interval: 1m
discovery.type: ec2
discovery.ec2.host_type: private_ip
discovery.ec2.tag.Env: bazaar
cloud.aws.region: us-east-1

Here is a log message from the master node that has info on the current nodes when this happened:
That log message lists one master-eligible node only: itself.
@clintongormley right...
The log message I pasted was from step 1, where only one master-eligible node is up; the rest of the logs are also in the issue. After this, the cluster remains unhealthy until manual intervention. I would expect Elasticsearch to recover on its own after all master-eligible nodes are bootstrapped, but that does not happen: node1 does not re-evaluate its master status, no other master is chosen, and the cluster continues to remain unhealthy because no master node has been discovered. Do you see the problem here? We have had this occur from time to time. To get the cluster out of the unhealthy state, we do this:
In summary, the real problem is that Elasticsearch does not recover even after all master-eligible nodes become available.
This is incorrect. Node1 won't elect itself as master until there are enough master-eligible nodes available, e.g.:

./bin/elasticsearch --discovery.zen.minimum_master_nodes=2

The first node only becomes master once there are enough master-eligible nodes to satisfy the requirement:
The log message you provided shows there are not enough master nodes. So something is going wrong, but we're not seeing enough logging to understand what.
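One way to get more discovery logging without a restart is the dynamic logger setting; a sketch, assuming a default local endpoint (transient settings revert when the cluster restarts):

# raise discovery logging to DEBUG at runtime
curl -s -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "logger.discovery": "DEBUG" }
}'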
@clintongormley I do see different behavior in 1.7.3. This is what we see in the logs for Node1 -
And then -
Note how there are no master-eligible nodes currently seen...
Ah 1.7.3! OK - you need to upgrade. SO much has changed since then. We're not going to do major rewrites to the 1.x series. |
Elasticsearch version: 1.7.3
JVM version: openjdk version "1.8.0_65"
OpenJDK Runtime Environment (build 1.8.0_65-b17)
OpenJDK 64-Bit Server VM (build 25.65-b01, mixed mode)
OS version: Amazon linux 3.14.35-28.38.amzn1.x86_64
Description of the problem including expected versus actual behavior:
Expected behavior: Elasticsearch becomes healthy.
Actual behavior: Elasticsearch remains unhealthy.
If a master-eligible node does not find the minimum number of master-eligible nodes in the cluster, it decides that it is not the master node. Even when the minimum number of master-eligible nodes becomes available later, it continues to think it is not master and rejects join requests; it never re-evaluates its master status.
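A quick way to check which node, if any, the cluster currently considers master; a sketch against a default local endpoint (host and port are assumptions):

# prints the elected master, or an empty/error response if none is elected
curl -s 'http://localhost:9200/_cat/master?v'

# ask an individual node for its own view of the cluster state
# (local=true avoids routing the request through the missing master)
curl -s 'http://localhost:9200/_cluster/state/nodes?local=true&pretty'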
Sequence of events:

Restarting the chosen master node does not resolve the problem; however, a full cluster restart fixes it.
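For reference, a full cluster restart typically follows this pattern; a sketch, assuming a default endpoint and a service-managed install (the service name is an assumption, adjust for your init system):

# 1. disable shard allocation so shards don't shuffle during the restart
curl -s -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# 2. restart elasticsearch on every node
sudo service elasticsearch restart

# 3. once the cluster re-forms, re-enable allocation
curl -s -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'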
Steps to reproduce:
Provide logs (if relevant):
Here are some logs that we see when this happens -
From the chosen master node -
[2016-07-19 15:51:12,695][WARN ][discovery.ec2] [10-100-52-166] not enough master nodes, current nodes: {{client_bv=false, availability_zone=a, master=false},{client_bv=false, availability_zone=a, data=false, master=true},{data=false, client=true},{data=false, client=true}}
From the other master eligible nodes -
[2016-07-19 15:49:44,168][DEBUG][action.admin.cluster.health] [10-100-61-115] no known master node, scheduling a retry

[2016-07-19 15:50:07,936][INFO ][discovery.ec2] failed to send join request to master [10-100-52-166][Be-E2WoeQ02g9APPaTP_AQ]{client_bv=false, availability_zone=a, data=false, master=true}], reason [RemoteTransportException[[10-100-52-166]][internal:discovery/zen/join]]; nested: ElasticsearchIllegalStateException[Node [[10-100-52-166][Be-E2WoeQ02g9APPaTP_AQ][10-100-52-166][inet[/10.100.52.166:9300]]{client_bv=false, availability_zone=a, data=false, master=true}] not master for join request from [[10-100-61-115][EMVPvC_5TACp44kIpzpxWg][10-100-61-115][inet[/10.100.61.115:9300]]{client_bv=false, availability_zone=c, data=false, master=true}]]; ], tried [3] times
[2016-07-19 15:51:59,473][DEBUG][action.admin.cluster.health] [10-100-61-115] observer: timeout notification from cluster service. timeout setting [30s], time since start [30s]

[2016-07-19 15:52:08,885][INFO ][discovery.ec2 ] [10-100-61-115] failed to send join request to master [[10-100-61-166][Be-E2WoeQ02g9APPaTP_AQ][10-100-52-166][inet[/10.100.52.166:9300]]{client_bv=false, availability_zone=a, data=false, master=true}], reason [RemoteTransportException[[10-100-52-166][inet[/10.100.52.166:9300]][internal:discovery/zen/join]]; nested: ElasticsearchIllegalStateException[Node [[10-100-52-166][Be-E2WoeQ02g9APPaTP_AQ][10-100-52-166][inet[/10.100.52.166:9300]]{client_bv=false, availability_zone=a, data=false, master=true}] not master for join request from [[10-100-61-115][EMVPvC_5TACp44kIpzpxWg][10-100-61-115][inet[/10.100.61.115:9300]]{client_bv=false, availability_zone=c, data=false, master=true}]]; ], tried [3] times