Shard stuck in relocating state with recovery stage=translog #9226

Closed
ppf2 opened this issue Jan 9, 2015 · 151 comments

@ppf2
Member

ppf2 commented Jan 9, 2015

A primary shard was stuck in the RELOCATING state. The recovery API shows that all files have completed at 100% and that the stage is TRANSLOG, but it has been sitting there for more than 15 hours.

  {
        "id": 0,
        "type": "RELOCATION",
        "stage": "TRANSLOG",
        "primary": true,
        "start_time": "16443.3d",
        "start_time_in_millis": 1420704589802,
        "stop_time": "0s",
        "stop_time_in_millis": 0,
        "total_time": "15.1h",
        "total_time_in_millis": 54706836,
        "source": {
          "id": "0eltfic0RdmC9v-Yb17Wzw",
          "host": "elasticsearch1.localdomain",
          "transport_address": "inet[/IP:9300]",
          "ip": "127.0.1.1",
          "name": "elasticsearch1.localdomain"
        },
        "target": {
          "id": "ctCRm_huQsSBoTobhmqJdg",
          "host": "elasticsearch8.localdomain",
          "transport_address": "inet[/IP:9300]",
          "ip": "127.0.1.1",
          "name": "elasticsearch8.localdomain"
        },
        "index": {
          "files": {
            "total": 318,
            "reused": 0,
            "recovered": 318,
            "percent": "100.0%",
            "details": [
             ... list of files
            ]
          },
          "bytes": {
            "total": 58111147927,
            "reused": 0,
            "recovered": 58111147927,
            "percent": "100.0%"
          },
          "total_time": "0s",
          "total_time_in_millis": 0
        },
        "translog": {
          "recovered": 24685532,
          "total_time": "0s",
          "total_time_in_millis": 0
        },
        "start": {
          "check_index_time": "0s",
          "check_index_time_in_millis": 0,
          "total_time": "0s",
          "total_time_in_millis": 0
        }
      }

Running a reroute command with no POST body got it unstuck:

curl -XPOST 'localhost:9200/_cluster/reroute?pretty&explain=true'

I'm not sure how it got into this state (there are no log entries related to recovery of this shard in the data or master node logs) or why we had to run an empty reroute request to get it unstuck.

@bleskes
Contributor

bleskes commented Jan 10, 2015

When the shard got unstuck, did it become green on the same node or did it move to another node? Also, which version of ES was it?


@ppf2
Member Author

ppf2 commented Jan 10, 2015

Once it got unstuck, it became green on the expected target node, thx. This is 1.4.2. The following is from the output of running the empty reroute command (which got it unstuck):

          "shards" : {
            "0" : [ {
              "state" : "STARTED",
              "primary" : true,
              "node" : "ctCRm_huQsSBoTobhmqJdg",
              "relocating_node" : null,
              "shard" : 0,
              "index" : "index_name"
            } ],

@kstaken

kstaken commented Jan 13, 2015

I've been seeing what I think is a similar issue. We upgraded from 1.3.5 to 1.4.2 last week, but now from time to time we see a shard get stuck in RELOCATING or INITIALIZING. The most recent example was overnight, after a node was restarted late yesterday: it moved a shard on restart, and this morning the shard was still sitting in the translog stage well beyond normal recovery time. I canceled the allocation and the shard is now allocating on another node, currently in the translog phase. It hasn't been running long enough to say whether there is a problem with this allocation. There seems to be nothing in the logs indicating a problem.

@bleskes
Contributor

bleskes commented Jan 13, 2015

@kstaken it's a bit hard to debug these things. It might be an instance of #8720. Did you see any networking-related issues in the logs? Potentially this message: "failed to send shard started"?

@kstaken

kstaken commented Jan 13, 2015

Nothing in the logs that I can see. I've seen this happen maybe 4-5 times since we upgraded to 1.4.2. Don't remember seeing this particular issue on 1.3.0 or 1.3.5. The new allocation of this same shard is also taking a long time. Is there any way to see deeper into the translog activity?

If I look at the initializing shard on disk I see the translog files grow and then disappear to be replaced by a new file that repeats the cycle. There's also a second translog file sitting in the same directory that doesn't seem to change.

@bleskes
Contributor

bleskes commented Jan 13, 2015

@kstaken can you enable trace logging on the indices.recovery logger and post that information here?
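
For anyone following along, dynamically raising that logger on a 1.x cluster should look roughly like the following (a sketch; the same level can also be set in logging.yml):

  # Sketch: raise the indices.recovery logger to TRACE via a transient cluster setting
  curl -XPUT 'localhost:9200/_cluster/settings' -d '{
    "transient": { "logger.indices.recovery": "TRACE" }
  }'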

@pickypg
Member

pickypg commented Jan 15, 2015

This recently happened with a user who, a day in, was about to delete the transaction log, but right before they did, the recovery finished.

Ready to apply the logger for next time though.

@kstaken

kstaken commented Jan 15, 2015

@bleskes I figured out the issue with the second allocation of the problem shard. When the shard got stuck, the translog had grown so large that it couldn't be replayed during recovery. It appeared to be replaying the translog at about 50% of the rate at which new records were being added to it. It's conceivable that the shard appearing to be stuck originally was the result of the same thing; now that I know what to look for, I can confirm that the next time this happens. The normal replicas seem to have no issue keeping up; is there throttling on the replay of translogs during recovery?

BTW, to recover this shard I canceled the allocation and forced a fresh allocation on a new node. That seemed to allow the primary to clear the translog and then it was able to get back in sync.
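
For reference, canceling a stuck initializing copy via the reroute API looks roughly like this (a sketch only; index_name and node_with_stuck_copy are placeholders, and after the cancel the master retries allocation, typically on another node if allocation settings allow):

  # Sketch: cancel the in-progress recovery of one shard copy (placeholder index/node names)
  curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
    "commands": [
      { "cancel": { "index": "index_name", "shard": 0, "node": "node_with_stuck_copy" } }
    ]
  }'

An explicit "allocate" command also exists for placing an unassigned replica on a specific node.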

@clintongormley clintongormley added the :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. label Jan 16, 2015
@bleskes
Contributor

bleskes commented Jan 22, 2015

@kstaken sorry for the late response. I was travelling.

The normal replicas seem to have no issue keeping up, is there throttling on the replay of translogs during recovery?

There is no throttling, but replay is handled by a single thread. That's fewer than the number of threads available for indexing. That said, the recovery works in two stages - it first acquires a snapshot of the current translog and replays it on the replica (while indexing is ongoing on the primary). Once that is done, it blocks indexing on the primary and replicates the last operations. So in theory it should always be able to catch up.

I do share the feeling that this is due to a very large transaction log. If this happens again, do you mind sharing this information, issued once and then again 5m later: GET _recovery, GET {index_name}/_stats?level=shards, and GET /_nodes/_local/hot_threads on the target node of the recovery. This will tell us whether the recovery is really stuck or just making progress really slowly. The last hot_threads call will help shed light on why it's slow (if that's the case).
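
Concretely, the calls being asked for look something like this (a sketch; index_name is a placeholder, the hot_threads call is run against the target node of the recovery, and the whole set is repeated about five minutes later):

  curl 'localhost:9200/_recovery?pretty'
  curl 'localhost:9200/index_name/_stats?level=shards&pretty'
  curl 'localhost:9200/_nodes/_local/hot_threads'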

@kstaken

kstaken commented Jan 22, 2015

@bleskes Thanks. I've had a couple more instances of the issue and all have been on the same shard and I've been able to confirm on that shard it's a case of translog replay not being able to keep up with the rate of indexing. If I stop indexing, recovery will complete.

Turns out we have a hotspot on that shard due to routing; it had grown much too large, at over 1 billion docs and 220GB in size. We're in the process of re-indexing to deal with this, but at this size the translog on the primary seemed to be growing at about 2x the rate of consumption on the recovering node, and I'm pretty sure it would never catch up. I'm not sure if this is an actual issue or just an example of why a 220GB shard is not a good thing to have.

@bleskes
Contributor

bleskes commented Jan 23, 2015

@kstaken thx. Replying in reverse order.

I'm not sure if this is an actual issue now or just an example of why a 220GB shard is not a good thing to have.

A 220GB shard is indeed unusually big, but it should work.

this size the translog on the primary seemed to be growing at about 2x the rate of consumption on the recovering node.

I see. For what it's worth, as long as there is an ongoing recovery the translog will not be trimmed, as we need to make sure everything is replicated. Once replication is done, the translog will be dropped all at once.

2x the rate of consumption on the recovering node and I'm pretty sure it would never catch up.

The replica is guaranteed to catch up because, at the end, indexing on the primary is paused in order to complete the remaining operations. We need to figure out why it's so slow to deal with the initial snapshot, which allows concurrent indexing. When you say 2x the rate of consumption, are you measuring the operation count in the `GET {index}/_recovery` output?

If I stop indexing, recovery will complete.

That's interesting. Maybe indexing just put load on the machine. I've been chasing this for the last couple of days, and if you can share the following, it would be very helpful:

  1. Did you see something like this log message from the target shard while the recovery was going on: "now throttling indexing: numMergesInFlight={}, maxNumMerges={}"?
  2. Can you run GET _nodes/_local/hot_threads a couple of times on the target node and on the source node? I want to see if there are any clear hot spots (see the sketch below).
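
A rough way to capture those snapshots for comparison (a sketch; run it on both the source and the target node, adjusting the host and interval as needed):

  # Sketch: grab three hot_threads snapshots, 30 seconds apart
  for i in 1 2 3; do
    curl -s 'localhost:9200/_nodes/_local/hot_threads' > hot_threads_$i.txt
    sleep 30
  done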

@bleskes
Contributor

bleskes commented Jan 23, 2015

@kstaken one more question - what environment are you running on? which OS?

@kstaken

kstaken commented Jan 23, 2015

@bleskes I've been measuring the output by looking at {index}/_status?recovery=true and looking at the translog.recovered on the replica vs. translog.operations on the primary. Is that a valid thing to do?

When this has occurred there has been nothing in the logs indicating any kind of issue, not even the throttling message.

If I see the problem again I'll try to come up with the other things you asked for; however, I'll be retiring this index in the next couple of days once the re-indexing process completes. I thought I had seen the issue on other shards, but that was before I was paying specific attention to this shard; the last four occurrences have all been on the same shard, so I now doubt I've seen it elsewhere.

Our environment is Ubuntu 14.04 on physical hardware. ES 1.4.2 on Java 8. We have 72 ES nodes of two types, one with SSDs and one with HDs. The impacted index is on nodes with SSDs.

@kstaken

kstaken commented Jan 23, 2015

@bleskes I should also add that the node holding the primary here is indeed under very heavy load due to indexing. It will bounce around 60-90 in load average; however, the load on the replicas is not elevated and is consistent with other nodes in the cluster. Stopping indexing returns the load to normal. Since it's a major hot spot, it's conceivable that this shard could be receiving well over 10,000 indexing requests per second. The index overall is receiving 50K-100K/sec, sometimes more. Those requests are also 'create' requests with unique IDs, and most will be dropped as duplicates. The actual write volume on the index overall is less than 1000/sec.

@mikemccand
Contributor

One change in 1.4.0 was to disable loading bloom filters by default since we made other performance improvements that should have made them unnecessary in most cases: #6959

I wonder if that is causing the performance issues here? Can you try setting index.codec.bloom.load to true and see if it makes a difference?
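
For anyone trying this, updating the setting on a live index should look roughly like the following (a sketch; index_name is a placeholder, and this assumes the setting can be updated dynamically on a live 1.4 index):

  # Sketch: re-enable loading of bloom filters for one index
  curl -XPUT 'localhost:9200/index_name/_settings' -d '{
    "index.codec.bloom.load": true
  }'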

@bleskes
Contributor

bleskes commented Jan 24, 2015

@kstaken thx. The index _status API is a good way, but it's deprecated and replaced by the _recovery API. Can you let us know how it goes with @mikemccand's suggestion?
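
The equivalent check against the newer API would be something like the following (a sketch; index_name is a placeholder), watching translog.recovered for the recovering shard copy grow between calls:

  curl 'localhost:9200/index_name/_recovery?pretty'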

@kstaken

kstaken commented Jan 26, 2015

Unfortunately I'm not able to test re-enabling bloom filters as our re-indexing process completed over the weekend and the problem index was removed from usage. I'll have the index around for a couple more days and then it will be deleted completely as the variation in shard sizes is also causing disk allocation problems.

@bleskes
Contributor

bleskes commented Jan 26, 2015

@kstaken OK. Let us know if this happens again.

@kstaken

kstaken commented Jan 26, 2015

@bleskes sorry I can't test enabling bloom filters to confirm, but I did some more digging, and the historical CPU usage of the node holding the problem primary does show a dramatic increase in load right after it was upgraded to 1.4.2. Prior to 1.4.2 we didn't have any nodes that showed consistently high load; after the upgrade, whichever node this shard resided on pushed a load of 80-90 non-stop.

Does this imply we should enable bloom filters on the new index for the use case where we're generating the IDs?

@bleskes
Contributor

bleskes commented Jan 26, 2015

@kstaken it would be great if you can do a "before" and "after" check regarding the effect bloom filters have on your index.

@drax68

drax68 commented Feb 6, 2015

Bumped into the same issue during shard initialization on ES 1.4.2. I tested with bloom filters enabled - that didn't help. Only deleting the transaction logs allowed the initialization to finish.

@bleskes
Contributor

bleskes commented Feb 6, 2015

@drax68 do you know how big the translog was before you deleted it? Did you see any index throttling messages in the log? Also, how long was it "stuck" in the TRANSLOG phase?

@drax68

drax68 commented Feb 6, 2015

A 4-6GB translog; on the replica its size was hovering around 200MB. The shard size is about 40GB.

It was stuck for hours, then I had to delete the translogs to complete the initialization.

@avleen

avleen commented Feb 9, 2015

This happened to us over the weekend too.
Friday night we upgraded from 1.3.2 to 1.4.2.
We've seen multiple replica recoveries since then (either due to a cluster restart or a node failing and replicas needing to be rebuilt).

In most cases it's fine. But sometimes, just sometimes, the target host doesn't finish catching up the translog and it gets pretty big.
Right now I have this happening, and the translog on the source host is 22GB.
On the target host, the translog grows to ~200MB and then rotates to a new file.
The source host was sending to the target at ~750kbit/sec, which seems really, really slow. The servers have 10Gbit links, are on the same switch, and all have SSDs. This made no sense.

Unfortunately I wasn't able to get hot_threads output this time, but when it happens again I will.
In the time it took me to finish fixing the issue and write this note, the translog on the source host grew to 24Gb.

@bleskes
Contributor

bleskes commented Feb 9, 2015

@avleen any chance you can check the logs of the node hosting the primary of the stuck shard for "now throttling indexing" (which would mean it's #9394)? If you happen to have your logs in debug mode, also check for lines with "recovery_mapping_check"; if they are too far apart, that indicates #9575.
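
A quick way to check for both markers on the node hosting the primary (a sketch; the log path assumes a default package install and may differ):

  grep "now throttling indexing" /var/log/elasticsearch/*.log
  grep "recovery_mapping_check" /var/log/elasticsearch/*.log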

@bleskes
Contributor

bleskes commented Feb 9, 2015

@drax68 see my previous comment - I don't know if you still have the logs of the node with the primary shard, but if you do it would be great if you can check them.

@avleen

avleen commented Feb 9, 2015

And it happened again.

hot_threads from the target host:

::: [logdb50][xne7qsKqQnSeacIQnM_uyQ][logdb5][inet[logdb50]]{local=false, master=false}

   92.8% (464ms out of 500ms) cpu usage by thread 'elasticsearch[logdb50][generic][T#353]'
     7/10 snapshots sharing following 24 elements
       sun.nio.ch.NativeThread.current(Native Method)
       sun.nio.ch.NativeThreadSet.add(NativeThreadSet.java:46)
       sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:695)
       sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:684)
       org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
       org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
       org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
       org.apache.lucene.store.DataInput.readVInt(DataInput.java:122)
       org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)
       org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock(SegmentTermsEnumFrame.java:152)
       org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum.java:506)
       org.elasticsearch.common.lucene.uid.PerThreadIDAndVersionLookup.lookup(PerThreadIDAndVersionLookup.java:104)
       org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:150)
       org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:161)
       org.elasticsearch.index.engine.internal.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1381)
       org.elasticsearch.index.engine.internal.InternalEngine.innerCreate(InternalEngine.java:427)
       org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:404)
       org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryOperation(InternalIndexShard.java:779)
       org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:433)
       org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:412)
       org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:744)
     3/10 snapshots sharing following 6 elements
       org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:433)
       org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:412)
       org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:744)

    1.3% (6.7ms out of 500ms) cpu usage by thread 'elasticsearch[logdb50][[transport_server_worker.default]][T#7]{New I/O worker #34}'
     10/10 snapshots sharing following 15 elements
       sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
       sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
       sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
       sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
       sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
       org.elasticsearch.common.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
       org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:415)
       org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
       org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
       org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
       org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
       org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:744)

    0.8% (3.8ms out of 500ms) cpu usage by thread 'elasticsearch[logdb50][[transport_server_worker.default]][T#6]{New I/O worker #33}'
     10/10 snapshots sharing following 15 elements
       sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
       sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
       sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
       sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
       sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
       org.elasticsearch.common.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
       org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:415)
       org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
       org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
       org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
       org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
       org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:744)

Source host:

::: [logdb40][aJwrteLzRRedjpeRH-XTuA][logdb40][inet[/:9300]]{local=false, master=false}

    4.2% (20.7ms out of 500ms) cpu usage by thread 'elasticsearch[logdb40][generic][T#562]'
     10/10 snapshots sharing following 21 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
       org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:274)
       org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:113)
       org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:45)
       org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:34)
       org.elasticsearch.indices.recovery.RecoverySource$1.sendSnapshot(RecoverySource.java:432)
       org.elasticsearch.indices.recovery.RecoverySource$1.phase2(RecoverySource.java:300)
       org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1132)
       org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:654)
       org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:137)
       org.elasticsearch.indices.recovery.RecoverySource.access$2600(RecoverySource.java:74)
       org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:464)
       org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:450)
       org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:744)

    0.1% (611micros out of 500ms) cpu usage by thread 'elasticsearch[logdb40][transport_client_worker][T#9]{New I/O worker #9}'
     10/10 snapshots sharing following 15 elements
       sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
       sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
       sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
       sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
       sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
       org.elasticsearch.common.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
       org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:415)
       org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
       org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
       org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
       org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
       org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:744)

    0.1% (410.3micros out of 500ms) cpu usage by thread 'elasticsearch[logdb40][transport_client_worker][T#10]{New I/O worker #10}'
     10/10 snapshots sharing following 15 elements
       sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
       sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
       sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
       sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
       sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
       org.elasticsearch.common.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
       org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:415)
       org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
       org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
       org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
       org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
       org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
       java.lang.Thread.run(Thread.java:744)

We can't keep stopping nodes to delete their translog, as it just results in the problem jumping to a different host.

Nothing mentioning throttling at all :(
On these hosts, there is a lot more traffic between them right now, about 15mbit/sec. Indexing from logstash is temporarily off. I'm hoping that lets the translog catch up.

@avleen

avleen commented Feb 9, 2015

Typical, computers making me look bad.
Just after I sent that last message, this popped into the log for the target host:

[2015-02-09 07:50:32,376][INFO ][index.engine.internal    ] [logdb50] [logstash-2015.02.09][25] now throttling indexing: numMergesInFlight=9, maxNumMerges=8
[2015-02-09 07:50:47,503][INFO ][index.engine.internal    ] [logdb50] [logstash-2015.02.09][25] stop throttling indexing: numMergesInFlight=7, maxNumMerges=8

It looks like this came in right around the time the replica's translog caught up. It only happened once on this node, and possibly once on another node around the time it got over this problem.

@mikemccand
Contributor

To be clear, we never saw this behavior in 1.3. Only 1.4. Is there some change between the two which we can point to?

@avleen as far as I can tell, this code is the same in 1.3 vs 1.4, so I'm not sure this theory can explain what you're seeing ...

@mikemccand
Contributor

@avleen actually bloom filters (enabled in 1.3 and disabled in 1.4) could have masked this issue, since they would have somewhat hidden the cost of segment explosion.

@avleen

avleen commented Apr 6, 2015

That makes a lot of sense. Thanks!


@sylvae

sylvae commented Apr 7, 2015

Hmm I think I see one possible issue: in 1.x (fixed in 2.0 since we let Lucene do the merging again) we apparently do not start the async merging until after recovery has finished (performRecoveryFinalization -> startScheduledTasksIfNeeded)? @bleskes is this correct? Or do we somehow start the async merging on engine init?

@mikemccand @bleskes I have some questions: where and when does ES normally start the async merging, e.g. when creating a new index? And if we start the async merging on engine init, would that fix this issue? Could it have any negative effects?

@bleskes
Contributor

bleskes commented Apr 7, 2015

@sylvae it is started when recovery is finished (which is no good). @mikemccand is already working on a fix.

mikemccand added a commit that referenced this issue Apr 7, 2015
This does not affect 2.0, where we let Lucene launch merges normally
(#8643).

In 1.x, every 1 sec (default), we ask Lucene to kick off any new
merges, but we unfortunately don't turn that logic on in the target
shard until after recovery has finished.

This means if you have a large translog, and/or a smallish index
buffer, way too many segments can accumulate in the target shard
during recovery, making version lookups slower and slower (O(N^2))
and possibly causing slow recovery issues like #9226.

This fix changes IndexShard to launch merges as soon as the shard is
created, so merging runs during recovery.

Closes #10463
mikemccand added a commit that referenced this issue Apr 7, 2015
This does not affect 2.0, where we let Lucene launch merges normally
(#8643).

In 1.x, every 1 sec (default), we ask Lucene to kick off any new
merges, but we unfortunately don't turn that logic on in the target
shard until after recovery has finished.

This means if you have a large translog, and/or a smallish index
buffer, way too many segments can accumulate in the target shard
during recovery, making version lookups slower and slower (O(N^2))
and possibly causing slow recovery issues like #9226.

This fix changes IndexShard to launch merges as soon as the shard is
created, so merging runs during recovery.

Closes #10463
@sylvae

sylvae commented Apr 9, 2015

@mikemccand @bleskes I modified the code similarly to #10463 and did some tests; the issue no longer reproduces and shards recover normally. I think this issue has been fixed. Thank you.

@bleskes
Contributor

bleskes commented Apr 9, 2015

That's great news. Thank you for pointing us at the segment count (i.e., many small files). Let's wait until 1.5.1 is out (soon) and see how it works for the others.


@bleskes
Contributor

bleskes commented Apr 10, 2015

Hi all, a short note to say 1.5.1 is out, including #10463, which seems to have helped @sylvae. It would be great if others can give it a spin and see whether it helps them as well. Hopefully we can drive this to closure.

@twitchjorge

@bleskes thanks! I'll upgrade sometime next week and let you know how it goes

mikemccand added a commit to mikemccand/elasticsearch that referenced this issue Apr 11, 2015
This does not affect 2.0, where we let Lucene launch merges normally
(elastic#8643).

In 1.x, every 1 sec (default), we ask Lucene to kick off any new
merges, but we unfortunately don't turn that logic on in the target
shard until after recovery has finished.

This means if you have a large translog, and/or a smallish index
buffer, way too many segments can accumulate in the target shard
during recovery, making version lookups slower and slower (O(N^2))
and possibly causing slow recovery issues like elastic#9226.

This fix changes IndexShard to launch merges as soon as the shard is
created, so merging runs during recovery.

Closes elastic#10463
mikemccand added a commit that referenced this issue Apr 11, 2015
This does not affect 2.0, where we let Lucene launch merges normally
(#8643).

In 1.x, every 1 sec (default), we ask Lucene to kick off any new
merges, but we unfortunately don't turn that logic on in the target
shard until after recovery has finished.

This means if you have a large translog, and/or a smallish index
buffer, way too many segments can accumulate in the target shard
during recovery, making version lookups slower and slower (O(N^2))
and possibly causing slow recovery issues like #9226.

This fix changes IndexShard to launch merges as soon as the shard is
created, so merging runs during recovery.

Closes #10463
@bleskes
Contributor

bleskes commented Apr 23, 2015

@twitchjorge and others on this thread - any news you can share?

@twitchjorge

Sorry @bleskes, priorities have somewhat pulled me away from this. Reviewing the fix is still on my todo list though!

@avleen

avleen commented Apr 24, 2015

Unfortunately I haven't had a chance to upgrade yet either, although I am running into the bug as we speak :-)
I don't think I'll realistically be able to upgrade before the end of May at the moment.
I hope others can provide some more solid feedback!


@xgwu

xgwu commented Apr 24, 2015

@bleskes We upgraded our production cluster to 1.5.1 last weekend and have not spotted any long-running translog recovery. I feel the issue is fixed.

@bleskes
Contributor

bleskes commented Apr 24, 2015

@xgwu thanks! I'm going to leave this open for a little longer, giving others some time as well. If we don't hear anything in a week or two, I'll close this.

@bleskes
Contributor

bleskes commented Jun 3, 2015

Hey all, it's been almost 2 months since 1.5.1 was released. Since no one has reported that this still happens, I'm going to close it, as we seem to have found the root cause (#10463). Thank you all for the information and help.

@bleskes bleskes closed this as completed Jun 3, 2015
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
This does not affect 2.0, where we let Lucene launch merges normally
(elastic#8643).

In 1.x, every 1 sec (default), we ask Lucene to kick off any new
merges, but we unfortunately don't turn that logic on in the target
shard until after recovery has finished.

This means if you have a large translog, and/or a smallish index
buffer, way too many segments can accumulate in the target shard
during recovery, making version lookups slower and slower (O(N^2))
and possibly causing slow recovery issues like elastic#9226.

This fix changes IndexShard to launch merges as soon as the shard is
created, so merging runs during recovery.

Closes elastic#10463
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
This does not affect 2.0, where we let Lucene launch merges normally
(elastic#8643).

In 1.x, every 1 sec (default), we ask Lucene to kick off any new
merges, but we unfortunately don't turn that logic on in the target
shard until after recovery has finished.

This means if you have a large translog, and/or a smallish index
buffer, way too many segments can accumulate in the target shard
during recovery, making version lookups slower and slower (O(N^2))
and possibly causing slow recovery issues like elastic#9226.

This fix changes IndexShard to launch merges as soon as the shard is
created, so merging runs during recovery.

Closes elastic#10463