
An inactive shard is activated by triggered synced flush #13802

Closed
wants to merge 3 commits

Conversation

bleskes
Contributor

@bleskes bleskes commented Sep 25, 2015

When a shard becomes inactive we trigger a sync flush in order to speed up future recoveries. The sync flush causes a new translog generation to be made, which in turn confuses the IndexingMemoryController, making it think that the shard is active. If no documents come along in the next 5m, the shard is made inactive again, triggering a sync flush and so forth.

To avoid this, the IndexingMemoryController is changed to ignore empty translogs when checking if a shard became active. This comes at the price of potentially missing indexing operations that are followed by a flush. This is acceptable: if no more indexing operations come in, it's OK to leave the shard inactive.

A new unit test is introduced and comparable integration tests are removed.

I think we need to push this all the way to 1.7.3 ...
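
To make the mechanics concrete, here is a minimal, self-contained sketch of the idea (all names are hypothetical; this is not the actual IndexingMemoryController code): a shard only counts as newly active when its current translog actually contains operations, so the empty translog generation created by a synced flush cannot re-activate an idle shard.

```java
// Hypothetical sketch of the activity check; not the real Elasticsearch code.
public class ShardActivitySketch {

    // last translog state the controller has seen for this shard
    private long lastTranslogId = -1;
    private long lastTranslogOps = 0;

    /**
     * Returns true if real indexing happened since the last check.
     * A brand new but *empty* translog generation (as created by a
     * synced flush on an idle shard) is ignored.
     */
    public boolean sawIndexingActivity(long translogId, long translogOps) {
        boolean changed = translogId != lastTranslogId || translogOps != lastTranslogOps;
        boolean active = changed && translogOps > 0; // empty translog => no activity
        lastTranslogId = translogId;
        lastTranslogOps = translogOps;
        return active;
    }

    public static void main(String[] args) {
        ShardActivitySketch shard = new ShardActivitySketch();
        System.out.println(shard.sawIndexingActivity(1, 3)); // true: documents were indexed
        System.out.println(shard.sawIndexingActivity(2, 0)); // false: synced flush created a new, empty translog
    }
}
```

The trade-off mentioned above falls out of this rule: operations that are indexed and then flushed away before the next check leave the current translog empty, so they are not seen as activity.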

@bleskes bleskes changed the title from "An inactive shard is temporarily activated by triggered synced flush" to "An inactive shard is activated by triggered synced flush" Sep 25, 2015
@bleskes bleskes added the review label Sep 26, 2015
@@ -144,7 +151,7 @@ public IndexingMemoryController(Settings settings, ThreadPool threadPool, Indices
             translogBuffer = maxTranslogBuffer;
         }
     } else {
-        translogBuffer = ByteSizeValue.parseBytesSizeValue(translogBufferSetting, null);
+        translogBuffer = ByteSizeValue.parseBytesSizeValue(translogBufferSetting, TRANSLOG_BUFFER_SIZE_SETTING);
Contributor
Nice catch.
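
For context on the line above: passing the setting's name instead of null means that a malformed value fails with an error that identifies the offending setting. A rough usage sketch (the string value of the constant is assumed here for illustration):

```java
import org.elasticsearch.common.unit.ByteSizeValue;

public class TranslogBufferParseSketch {
    // constant name taken from the diff above; the string value is an assumption
    public static final String TRANSLOG_BUFFER_SIZE_SETTING = "indices.memory.translog_buffer_size";

    public static void main(String[] args) {
        // parses "64kb" into a ByteSizeValue; if the value were malformed, the
        // resulting parse exception can now name TRANSLOG_BUFFER_SIZE_SETTING
        // instead of a null setting
        ByteSizeValue translogBuffer =
                ByteSizeValue.parseBytesSizeValue("64kb", TRANSLOG_BUFFER_SIZE_SETTING);
        System.out.println(translogBuffer);
    }
}
```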

@mikemccand
Contributor

This was a great catch @bleskes, and the cutover from IT to unit test is wonderful.

I left a bunch of small comments, but otherwise LGTM.

I still find that massive if statement in updateShardStatuses confusing but I can't quickly see an easy way to improve it...

@bleskes
Contributor Author

bleskes commented Sep 26, 2015

@mikemccand thx for the review. I pushed another round. Tried to simplify the 'big if'

// index into both shards, move the clock and see that they are still active
controller.setTranslog(shard1, randomInt(2), randomInt(2) + 1);
controller.setTranslog(shard2, randomInt(2) + 1, randomInt(2));
// nthe controller doesn't know when the ops happened, so even if this is more
Contributor

nthe -> the
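
The snippet above comes from the new unit test: the controller's per-shard view of the translog and the clock are both injected, so the inactivity behaviour can be tested without sleeping or starting a node. A stripped-down illustration of that pattern (hypothetical harness, not the actual test class; the 5m window mirrors the default mentioned in the description):

```java
import java.util.concurrent.TimeUnit;

// Hypothetical harness showing the "set translog state, move the clock,
// check activity" pattern; names are invented for illustration.
public class ShardInactivityClockDemo {

    private final long inactiveAfterMillis = TimeUnit.MINUTES.toMillis(5);

    private long nowMillis = 0;             // fake clock under test control
    private long lastTranslogId = -1;
    private long lastTranslogOps = 0;
    private long lastActivityMillis = 0;

    void advanceClock(long millis) { nowMillis += millis; }

    /** Mirrors controller.setTranslog(shard, id, ops): record what the controller sees. */
    void setTranslog(long translogId, long ops) {
        boolean changed = translogId != lastTranslogId || ops != lastTranslogOps;
        if (changed && ops > 0) {           // empty translogs do not count as activity
            lastActivityMillis = nowMillis;
        }
        lastTranslogId = translogId;
        lastTranslogOps = ops;
    }

    boolean isActive() {
        return nowMillis - lastActivityMillis < inactiveAfterMillis;
    }

    public static void main(String[] args) {
        ShardInactivityClockDemo shard = new ShardInactivityClockDemo();
        shard.setTranslog(1, 3);                              // index some docs
        shard.advanceClock(TimeUnit.MINUTES.toMillis(4));
        System.out.println(shard.isActive());                 // true: indexed within the last 5m
        shard.advanceClock(TimeUnit.MINUTES.toMillis(2));
        System.out.println(shard.isActive());                 // false: idle for more than 5m
    }
}
```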

@mikemccand
Contributor

LGTM, thanks @bleskes!

@bleskes bleskes closed this in 148265b Sep 27, 2015
@bleskes
Contributor Author

bleskes commented Sep 27, 2015

Pushed this to master. Will give CI some time before pushing it all the way back to 1.7...

@bleskes bleskes deleted the indexing_memory_controler branch September 27, 2015 05:35
@bleskes
Contributor Author

bleskes commented Sep 27, 2015

Pre-existing issue: I think 30s default is too large, because an index that was inactive and suddenly becomes active with a 512 KB indexing [...]

I was thinking about this some more (morning run, he he) - I think we should fold all the decisions and state about being active or not into the IndexShard. Then it will be easy to reset on every index operation (immediately). We also already have the notion of an inactivity listener - we can extend it to have an active event as well. We can also add a checkIfInactive method on IndexShard which checks whether there were any indexing ops (we already have stats) since the last time it was checked. To avoid making IndexShard more complex, we can push this functionality into ShardIndexingService - which is already in charge of stats.
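
Roughly, that proposal could look like the sketch below (all names are hypothetical; this is a follow-up idea, not code in this PR): the shard flips itself to active on every indexing operation, and a periodic check flips it back to inactive when nothing has been indexed since the last check.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the proposed follow-up design; not the actual IndexShard API.
public class ShardActivityTracker {

    private final AtomicLong opsSinceLastCheck = new AtomicLong();
    private final AtomicBoolean active = new AtomicBoolean(false);

    /** Called on the indexing path: the shard becomes active immediately. */
    public void onIndexOperation() {
        opsSinceLastCheck.incrementAndGet();
        if (active.compareAndSet(false, true)) {
            // fire an "activated" listener here
        }
    }

    /** Called periodically: if nothing was indexed since the last check, go inactive. */
    public boolean checkIfInactive() {
        if (opsSinceLastCheck.getAndSet(0) == 0 && active.compareAndSet(true, false)) {
            // fire an "inactivated" listener here, e.g. trigger a synced flush
            return true;
        }
        return false;
    }

    public boolean isActive() {
        return active.get();
    }

    public static void main(String[] args) {
        ShardActivityTracker tracker = new ShardActivityTracker();
        tracker.onIndexOperation();
        System.out.println(tracker.isActive());        // true: an op just came in
        System.out.println(tracker.checkIfInactive()); // false: ops happened since the last check
        System.out.println(tracker.checkIfInactive()); // true: nothing indexed since, shard goes inactive
    }
}
```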

@mikemccand
Contributor

I think we should fold all the decisions and state about being active or not into the indexshard.

+1, this would be cleaner. E.g. I don't like how IndexShard.updateBufferSize "infers" that it's going inactive by checking whether the new size is 512 KB ... it would be better if it were more directly aware that it's going inactive.

I can try to tackle this.

To avoid making IndexShard more complex, we can push this functionality ShardIndexingService - which is already in charge of stats.

Or, why not absorb ShardIndexingService into IndexShard? All it does is stats gathering and calling listeners (which I think is only used by percolator?). It's also confusing how it has postCreate and postCreateUnderLock. Seems like maybe an excessive abstraction at this point...

@bleskes
Contributor Author

bleskes commented Sep 27, 2015

I personally don't mind folding it into IndexShard, though having a dedicated stats/active component has benefits (a clearer test suite). I know Simon has an opinion on this.


bleskes added a commit that referenced this pull request Sep 28, 2015
When a shard becomes inactive we trigger a sync flush in order to speed up future recoveries. The sync flush causes a new translog generation to be made, which in turn confuses the IndexingMemoryController, making it think that the shard is active. If no documents come along in the next 5m, the shard is made inactive again, triggering a sync flush and so forth.

To avoid this, the IndexingMemoryController is changed to ignore empty translogs when checking if a shard became active. This comes at the price of potentially missing indexing operations that are followed by a flush. This is acceptable: if no more indexing operations come in, it's OK to leave the shard inactive.

A new unit test is introduced and comparable integration tests are removed.

Closes #13802
bleskes added a commit that referenced this pull request Sep 28, 2015
When a shard becomes inactive we trigger a sync flush in order to speed up future recoveries. The sync flush causes a new translog generation to be made, which in turn confuses the IndexingMemoryController, making it think that the shard is active. If no documents come along in the next 5m, the shard is made inactive again, triggering a sync flush and so forth.

To avoid this, the IndexingMemoryController is changed to ignore empty translogs when checking if a shard became active. This comes at the price of potentially missing indexing operations that are followed by a flush. This is acceptable: if no more indexing operations come in, it's OK to leave the shard inactive.

A new unit test is introduced and comparable integration tests are removed.

Closes #13802
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 28, 2015
When a shard becomes inactive we trigger a sync flush in order to speed up future recoveries. The sync flush causes a new translog generation to be made, which in turn confuses the IndexingMemoryController, making it think that the shard is active. If no documents come along in the next 5m, the shard is made inactive again, triggering a sync flush and so forth.

To avoid this, the IndexingMemoryController is changed to ignore empty translogs when checking if a shard became active. This comes at the price of potentially missing indexing operations that are followed by a flush. This is acceptable: if no more indexing operations come in, it's OK to leave the shard inactive.

A new unit test is introduced and comparable integration tests are removed.

Relates elastic#13802
Includes a backport of elastic#13784
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 28, 2015
When a shard becomes inactive we trigger a sync flush in order to speed up future recoveries. The sync flush causes a new translog generation to be made, which in turn confuses the IndexingMemoryController, making it think that the shard is active. If no documents come along in the next 5m, the shard is made inactive again, triggering a sync flush and so forth.

To avoid this, the IndexingMemoryController is changed to ignore empty translogs when checking if a shard became active. This comes at the price of potentially missing indexing operations that are followed by a flush. This is acceptable: if no more indexing operations come in, it's OK to leave the shard inactive.

A new unit test is introduced and comparable integration tests are removed.

Relates elastic#13802
Includes a backport of elastic#13784

Closes elastic#13824