Relocation of shards causes bulk indexing client to hang #1839
Comments
Verified with elasticsearch 0.18.7 and 0.19.1.
Hey, can you help write a standalone test case, and the scenario (i.e. start 4 nodes, restart one node while the test case is bulk indexing data), that recreates it? It will help speed things up to see where the problem is.
OK
OK - here it is. Just edit and execute the class (with a dependency on the elasticsearch 0.19.1 jar).
Here's the code: package org.elasticsearch.issue1839; import org.elasticsearch.action.bulk.BulkRequestBuilder; import java.util.Date; public final class Main {
} |
Hi, thanks for the recreation; I managed to recreate it locally as well. I found the problem: a relocation of a primary shard is not handled properly when the shard we relocated from gets closed at just that moment. I will post a fix in both the 0.19 and master branches (closing this issue in the commit, so we can keep track of the change). If you can check it yourself as well, that would be great.
Cool - that was fast :-) I'll try it when 0.19.3 is released, so I can roll back my "timeout loop".
I have set up 4 big servers (lots of cores, lots of disks, lots of ram) - each running an elasticsearch node.
One client reads rows from a database and continuously submits indexing requests to the cluster. Indexing requests are bundled into bulk requests of 2500 indexing requests each.
The index has 32 shards.
My client is using the Java client API.
So far so good.
I just wanted to know what happens if I shut down a node and restart it again.
Shutdown works fine (except: see below).
Restart works fine...
Until the cluster starts to relocate shards.
When a bulk request "hits" a shard being relocated, the client hangs forever.
I have tried several networking settings, transport client vs. node client - nothing helped.
One thing fixed the issue for me:
Previously, the code was:
When I use actionGet(timeout) with a timeout, the method throws an ElasticSearchTimeoutException in such a situation and I can submit the bulk request again.
In such a situation I see no activity in the elasticsearch threads and no activity in "my" calling thread - it just waits in org.elasticsearch.common.util.concurrent.BaseFuture.Sync#acquireSharedInterruptibly forever.
None of the cluster log files indicate an error.
I do not know if this behaviour affects searches.
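The "timeout loop" workaround described above (wait with a timeout, resubmit on timeout) can be sketched roughly as follows. This is only an illustration of the pattern using plain java.util.concurrent, not the Elasticsearch client API: the class TimeoutRetry and the method callWithRetry are hypothetical names, and Future.get(timeout) stands in for actionGet(timeout). With the real client, the exception to catch would be the ElasticSearchTimeoutException mentioned in the report.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper illustrating a bounded wait with resubmission.
final class TimeoutRetry {

    /**
     * Submits the task, waits at most timeoutMillis for its result and,
     * if the wait times out, cancels the attempt and submits the task again.
     * Gives up with a TimeoutException after maxAttempts attempts.
     */
    static <T> T callWithRetry(ExecutorService pool, Callable<T> task,
                               long timeoutMillis, int maxAttempts) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<T> future = pool.submit(task);
            try {
                // Bounded wait instead of blocking forever.
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Abandon the hung attempt (interrupts the worker) and retry.
                future.cancel(true);
            }
        }
        throw new TimeoutException("no response after " + maxAttempts + " attempts");
    }
}
```

Note that resubmitting is only safe when repeating the request does no harm, because a request that timed out on the client may still have been applied in part.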