
NoSuchNodeException during startup #11923

Closed
clintongormley opened this issue Jun 29, 2015 · 3 comments
Assignees
Labels
:Distributed/Distributed  >enhancement  help wanted  adoptme  v1.7.0  v2.0.0-beta1

Comments

@clintongormley

When adding a new node to the cluster, the master throws a series of NoSuchNodeException exceptions until the new node is ready:

[2015-06-29 20:14:27,958][WARN ][gateway                  ] [foo] [t][3]: failed to list shard for shard_store on node [g51KTc9wQZWlLiwiYFkcgg]
FailedNodeException[Failed node [g51KTc9wQZWlLiwiYFkcgg]]; nested: NoSuchNodeException[No such node [g51KTc9wQZWlLiwiYFkcgg]];
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:179)
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:131)
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$100(TransportNodesAction.java:91)
    at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:65)
    at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:42)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.list(TransportNodesListShardStoreMetaData.java:82)
    at org.elasticsearch.gateway.AsyncShardFetch.asyncFetch(AsyncShardFetch.java:267)
    at org.elasticsearch.gateway.AsyncShardFetch.fetchData(AsyncShardFetch.java:117)
    at org.elasticsearch.gateway.GatewayAllocator.allocateUnassigned(GatewayAllocator.java:406)
    at org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.allocateUnassigned(ShardsAllocators.java:72)
    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:179)
    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:159)
    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:145)
    at org.elasticsearch.discovery.zen.ZenDiscovery$11.execute(ZenDiscovery.java:937)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:378)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:209)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:179)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: NoSuchNodeException[No such node [g51KTc9wQZWlLiwiYFkcgg]]
    ... 20 more
s1monw (Contributor) commented Jun 29, 2015

@kimchy can you take a look at this?

kimchy (Member) commented Jun 29, 2015

I think I know what happens: now that we reroute within the same cluster state update that adds nodes, the new nodes are part of the cluster state still being built. When we then go and list the started shards, we use the existing cluster state, which hasn't yet been updated, to find the relevant nodes, and the new nodes will not be there since they are just being added.
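The race described above can be sketched as follows. This is a minimal illustration with hypothetical names (it does not use real Elasticsearch APIs): the allocator iterates the nodes of the state being built, while the node-listing action, standing in for TransportNodesAction, resolves node ids against the current, not-yet-updated state.

```java
import java.util.Set;

// Hypothetical sketch of the race: the joining node is in the state being
// built, but node-id resolution happens against the published current state.
public class NoSuchNodeSketch {
    // Published cluster state, before the join has been applied.
    static final Set<String> currentStateNodes = Set.of("masterNode");
    // Cluster state being built, which already includes the joining node.
    static final Set<String> stateBeingBuiltNodes =
            Set.of("masterNode", "g51KTc9wQZWlLiwiYFkcgg");

    // Stands in for the shard-listing transport action: node ids are resolved
    // against the current state before any request is sent.
    static String listShardStores(String nodeId) {
        if (!currentStateNodes.contains(nodeId)) {
            return "FailedNodeException: No such node [" + nodeId + "]";
        }
        return "listed shards on [" + nodeId + "]";
    }

    public static void main(String[] args) {
        // The allocator picks nodes out of the state being built, so the
        // joining node is attempted and fails resolution.
        for (String nodeId : stateBeingBuiltNodes) {
            System.out.println(listShardStores(nodeId));
        }
    }
}
```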

bleskes added a commit to bleskes/elasticsearch that referenced this issue Jun 30, 2015
elastic#11776 simplified our rerouting logic by removing a scheduled background reroute in favor of an explicit reroute during the cluster state processing of a node join (the only place where we didn't do it explicitly). While that change is conceptually good, it changed the semantics in two ways:

 - shard listing actions underpinning shard allocation do not have access to the new node yet, causing errors during shard allocation (see elastic#11923)
 - the very first cluster state published to a node already contains shard assignments to it; this surfaced other issues that we are working to fix separately

 This commit changes the reroute to be done after processing the initial join cluster state, to side-step these issues while we work on a longer-term solution.
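The fix described in the commit message can be sketched roughly as follows, again with hypothetical names rather than real Elasticsearch APIs: instead of rerouting inside the same cluster-state update that processes the join, a separate reroute task is queued to run after the join state has been applied, so by the time listing happens the node is visible.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of the fix: defer the reroute to a follow-up
// cluster-state task that runs once the join state is current.
public class PostJoinRerouteSketch {
    static final Set<String> currentNodes = new HashSet<>(Set.of("masterNode"));
    static final Queue<Runnable> clusterStateTaskQueue = new ArrayDeque<>();

    static String listShardStores(String nodeId) {
        return currentNodes.contains(nodeId)
                ? "listed shards on [" + nodeId + "]"
                : "FailedNodeException: No such node [" + nodeId + "]";
    }

    public static void main(String[] args) {
        String joiningNode = "g51KTc9wQZWlLiwiYFkcgg";
        // Task 1: process the node join; no inline reroute any more.
        clusterStateTaskQueue.add(() -> currentNodes.add(joiningNode));
        // Task 2: the explicit reroute, queued to run after join processing.
        clusterStateTaskQueue.add(() -> System.out.println(listShardStores(joiningNode)));
        // Cluster-state tasks run in order on a single update thread, so the
        // reroute only ever sees a state that already contains the new node.
        while (!clusterStateTaskQueue.isEmpty()) {
            clusterStateTaskQueue.poll().run();
        }
    }
}
```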
bleskes added a commit that referenced this issue Jun 30, 2015
Closes #11960
bleskes (Contributor) commented Jun 30, 2015

closed with #11960

@bleskes bleskes closed this as completed Jun 30, 2015
@bleskes bleskes added the v1.7.0 label Jun 30, 2015
bleskes added a commit that referenced this issue Jun 30, 2015
szroland pushed a commit to szroland/elasticsearch that referenced this issue Jun 30, 2015
@clintongormley clintongormley added the :Distributed/Distributed label and removed the :Cluster label Feb 13, 2018