
Node disconnection during index creation leaves behind unassigned shards #9323

Closed
ppf2 opened this issue Jan 15, 2015 · 11 comments
Labels
>bug :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source.

Comments

@ppf2
Member

ppf2 commented Jan 15, 2015

[2015-01-15 02:02:53,614][DEBUG][cluster.service          ] [master_node] processing [create-index [twitter20150116], cause [api]]: execute
[2015-01-15 02:02:53,614][DEBUG][indices                  ] [master_node] creating Index [twitter20150116], shards [10]/[1]
[2015-01-15 02:02:53,645][INFO ][cluster.metadata         ] [master_node] [twitter20150116] creating index, cause [api], shards [10]/[1], mappings [tweets]
[2015-01-15 02:02:53,832][DEBUG][cluster.service          ] [master_node] cluster state updated, version [4655], source [create-index [twitter20150116], cause [api]]
[2015-01-15 02:02:53,832][DEBUG][cluster.service          ] [master_node] publishing cluster state version 4655
[2015-01-15 02:03:23,642][DEBUG][action.admin.indices.create] [master_node] [twitter20150116] failed to create
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (acquire index lock) within 30s
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$1.run(MetaDataCreateIndexService.java:148)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
[2015-01-15 02:03:23,846][DEBUG][discovery.zen.publish    ] [master_node] timed out waiting for all nodes to process published state [4655] (timeout [30s])
[2015-01-15 02:03:23,846][DEBUG][cluster.service          ] [master_node] set local cluster state to version 4655
[2015-01-15 02:03:23,846][DEBUG][river.cluster            ] [master_node] processing [reroute_rivers_node_changed]: execute
[2015-01-15 02:03:23,846][DEBUG][river.cluster            ] [master_node] processing [reroute_rivers_node_changed]: no change in cluster_state
[2015-01-15 02:03:23,877][DEBUG][cluster.service          ] [master_node] processing [create-index [twitter20150116], cause [api]]: done applying updated cluster_state (version: 4655)
[2015-01-15 02:03:45,229][DEBUG][action.admin.indices.create] [master_node] [twitter20150116] failed to create
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (acquire index lock) within 30s
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$1.run(MetaDataCreateIndexService.java:148)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
[2015-01-15 02:03:45,260][DEBUG][action.admin.indices.create] [master_node] [twitter20150116] failed to create
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (acquire index lock) within 30s
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$1.run(MetaDataCreateIndexService.java:148)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
[2015-01-15 02:03:45,260][DEBUG][action.admin.indices.create] [master_node] [twitter20150116] failed to create
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (acquire index lock) within 30s
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$1.run(MetaDataCreateIndexService.java:148)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
[2015-01-15 02:03:53,862][DEBUG][action.admin.indices.create] [master_node] [twitter20150116] failed to create
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index [twitter20150116], cause [api]) within 30s
    at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:263)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
[2015-01-15 02:03:54,097][DEBUG][discovery.zen.publish    ] [master_node] timed out waiting for all nodes to process published state [4656] (timeout [30s])
[2015-01-15 02:03:54,097][DEBUG][cluster.service          ] [master_node] set local cluster state to version 4656
[2015-01-15 02:03:54,097][DEBUG][cluster.service          ] [master_node] processing [shard-started ([twitter20150116][0], node[HtIqBAZEQXqLvGt7QRNKHw], [P], s[INITIALIZING]), reason [after recovery from gateway]]: execute
[2015-01-15 02:03:54,097][DEBUG][cluster.service          ] [master_node] processing [shard-started ([twitter20150116][0], node[HtIqBAZEQXqLvGt7QRNKHw], [P], s[INITIALIZING]), reason [after recovery from gateway]]: no change in cluster_state
.... processing messages for all other shards
[2015-01-15 02:03:54,409][DEBUG][transport.netty          ] [master_node] disconnecting from [[data_node][HtIqBAZEQXqLvGt7QRNKHw][data_node][inet[/IP:9300]]{http=false, tag=data_node, upgradedomain=data4-1, faultdomain=data4-1, master=false}], channel closed event
[2015-01-15 02:03:54,425][DEBUG][discovery.zen.publish    ] [master_node] failed to send cluster state to [data_node][HtIqBAZEQXqLvGt7QRNKHw][data_node][inet[/IP:9300]]{http=false, tag=data_node, upgradedomain=data4-1, faultdomain=data4-1, master=false}
org.elasticsearch.transport.NodeDisconnectedException: [data_node][inet[/IP:9300]][internal:discovery/zen/publish] disconnected
[2015-01-15 02:03:54,425][DEBUG][discovery.zen.publish    ] [master_node] failed to send cluster state to [data_node][HtIqBAZEQXqLvGt7QRNKHw][data_node][inet[/IP:9300]]{http=false, tag=data_node, upgradedomain=data4-1, faultdomain=data4-1, master=false}
org.elasticsearch.transport.NodeDisconnectedException: [data_node][inet[/IP:9300]][internal:discovery/zen/publish] disconnected
[2015-01-15 02:03:54,425][DEBUG][discovery.zen.publish    ] [master_node] failed to send cluster state to [data_node][HtIqBAZEQXqLvGt7QRNKHw][data_node][inet[/IP:9300]]{http=false, tag=data_node, upgradedomain=data4-1, faultdomain=data4-1, master=false}
org.elasticsearch.transport.NodeDisconnectedException: [data_node][inet[/IP:9300]][internal:discovery/zen/publish] disconnected
[2015-01-15 02:03:54,425][DEBUG][cluster.service          ] ack received from node [[data_node][HtIqBAZEQXqLvGt7QRNKHw][data_node][inet[/IP:9300]]{http=false, tag=data_node, upgradedomain=data4-1, faultdomain=data4-1, master=false}], cluster_state update (version: 4655)
org.elasticsearch.transport.NodeDisconnectedException: [data_node][inet[/IP:9300]][internal:discovery/zen/publish] disconnected
[2015-01-15 02:04:03,804][DEBUG][action.admin.cluster.node.stats] [master_node] failed to execute on node [HtIqBAZEQXqLvGt7QRNKHw]
org.elasticsearch.transport.SendRequestTransportException: [data_node][inet[/IP:9300]][cluster:monitor/nodes/stats[n]]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:213)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.start(TransportNodesOperationAction.java:165)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$300(TransportNodesOperationAction.java:97)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:70)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:43)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:75)
    at org.elasticsearch.cluster.InternalClusterInfoService.updateNodeStats(InternalClusterInfoService.java:232)
    at org.elasticsearch.cluster.InternalClusterInfoService$ClusterInfoUpdateJob.run(InternalClusterInfoService.java:291)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [data_node][inet[/IP:9300]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:946)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:640)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
    ... 10 more
[2015-01-15 02:04:24,325][DEBUG][gateway.local            ] [master_node] [twitter20150116][0]: not allocating, number_of_allocated_shards_found [0], required_number [1]

It looks like, if a connection issue occurs during index creation between the master node and the data node on which it is trying to initialize a shard, the primary and replica of that shard are left in an unassigned state, and the cluster neither retries nor attempts to allocate them elsewhere.

twitter20150116 0 p UNASSIGNED 
twitter20150116 0 r UNASSIGNED
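For reference, output like the above comes from the `_cat/shards` API, and the stuck copies can be filtered out of it. A minimal sketch, using the two lines captured in this issue as sample input (in practice you would pipe `curl -s localhost:9200/_cat/shards` into the same filter; host and port are assumptions):

```shell
# Sample _cat/shards output captured from this issue; in practice pipe
# `curl -s 'http://localhost:9200/_cat/shards'` into the same filter.
CAT_SHARDS='twitter20150116 0 p UNASSIGNED
twitter20150116 0 r UNASSIGNED'

# Keep only the unassigned copies (column 4 is the shard state)
echo "$CAT_SHARDS" | awk '$4 == "UNASSIGNED" { print $1, $2, $3 }'
```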
@clintongormley clintongormley added :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >bug labels Jan 16, 2015
@ppf2
Member Author

ppf2 commented Jan 26, 2015

Looks like this was caused by a socket connection issue, which caused the master to disconnect the data node while it was trying to allocate replicas. As a result, the master lost the single copy it had of shard 0 and therefore also cannot allocate the replica. We opted to leave these shards unassigned to prevent any potential loss of data.

@ppf2 ppf2 closed this as completed Jan 26, 2015
@emwalker

emwalker commented Feb 3, 2015

What do you do to recover from this situation? I'm having trouble restarting the service on the node on which it failed.

@bleskes
Contributor

bleskes commented Feb 3, 2015

@emwalker once a primary shard is started, it will accept indexing requests, which is why ES can't automatically re-assign it once it is lost. Instead, it waits for the node to re-join the cluster (and re-assigns the shard there). If you know nothing was indexed, or you don't mind the data loss, you can use the reroute API to force-assign the primary somewhere else; obviously it will start empty. See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-reroute.html#cluster-reroute (the allow_primary flag).
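A sketch of what that forced allocation looks like against the 1.x reroute API (the index and node names are taken from the logs in this issue, the host and port are assumptions, and `allow_primary: true` discards whatever data the lost primary held, so treat it as a last resort):

```shell
# Sketch of a forced primary allocation via the 1.x cluster reroute API.
# Index/node names come from this issue; host and port are assumptions.
# WARNING: allow_primary=true discards whatever data the lost primary held.
REROUTE_BODY='{
  "commands": [
    {
      "allocate": {
        "index": "twitter20150116",
        "shard": 0,
        "node": "data_node",
        "allow_primary": true
      }
    }
  ]
}'
echo "$REROUTE_BODY"
# To apply against a live cluster:
# curl -XPOST 'http://localhost:9200/_cluster/reroute' -d "$REROUTE_BODY"
```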

@jlmoody

jlmoody commented Sep 16, 2015

@ppf2 What version of Elasticsearch was running when you experienced this issue?

@ppf2
Member Author

ppf2 commented Sep 27, 2015

I believe it was 1.4.1 @jlmoody

@ankireddy501

I am also facing the same problem, during a mapping (type) update:

org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping [Activity]) within 30s
    at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:263)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
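The 30s in traces like the one above is the default master-node timeout. While the root cause (an overloaded or disconnected master) is being investigated, one stopgap is raising the `master_timeout` URL parameter that the put-mapping API accepts. A hedged sketch only: the index name below is invented, the type name is taken from the stack trace, and host/port are assumptions:

```shell
# Hypothetical sketch: raise the default 30s master timeout on a put-mapping
# request. Index name is invented; type name is from the trace above;
# host and port are assumptions.
INDEX='activity_index'   # hypothetical index name
TYPE='Activity'
URL="http://localhost:9200/${INDEX}/_mapping/${TYPE}?master_timeout=60s"
echo "$URL"
# To apply against a live cluster:
# curl -XPUT "$URL" -d '{"Activity": {"properties": {"status": {"type": "string"}}}}'
```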

@vineet85

We are also facing this during PUT requests to update the mapping. We notice it happens particularly when there are relocating shards on the cluster.

Elasticsearch.Net.ElasticsearchServerException: ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [documenthighlight]) within 30s

Is this going to be fixed in a later version?

@bleskes
Contributor

bleskes commented Mar 16, 2016

@vineet85 what you describe sounds like a different issue. Can you open one with all the details filled in?

@akg-dotnet-dev

We are also getting the same error:

[2017-02-05 02:00:47,371][DEBUG][action.admin.indices.create] [Nuke - Squadron Supreme Member] [flowtrader-middletier-2017.02.05] failed to create
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (acquire index lock) within 1m
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$1.doRun(MetaDataCreateIndexService.java:150)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I'm not able to see a solution provided here, or am I missing something?

ES version: 1.7.5

@clintongormley

@ImdotnetJunkie 1.7.5 is no longer supported. Time to upgrade: so much has changed and been improved since then.

@akg-dotnet-dev

@clintongormley: Yeah, we are already in the process of upgrading, and now we have another good reason to do so.

Thanks
