
An inactive shard is activated by triggered synced flush #13802

Closed
wants to merge 3 commits

Conversation

bleskes
Contributor

@bleskes bleskes commented Sep 25, 2015

When a shard becomes inactive we trigger a sync flush in order to speed up future recoveries. The sync flush causes a new translog generation to be made, which in turn confuses the IndexingMemoryController, making it think that the shard is active. If no documents come along in the next 5m, the shard is made inactive again, triggering a sync flush and so forth.

To avoid this, the IndexingMemoryController is changed to ignore empty translogs when checking if a shard became active. This comes at the price of potentially missing indexing operations that are followed by a flush. This is acceptable: if no more indexing operations come in, it's OK to leave the shard inactive.

A new unit test is introduced and comparable integration tests are removed.

I think we need to push this all the way to 1.7.3 ...
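
To make the mechanics concrete, here is a minimal, self-contained sketch of the idea (all names are hypothetical; this is not the actual IndexingMemoryController code): a shard only counts as newly active when its current translog actually contains operations, so the empty translog generation created by a synced flush cannot re-activate an idle shard.

```java
// Hypothetical sketch of the activity check; not the real Elasticsearch code.
public class ShardActivitySketch {

    // last translog state the controller has seen for this shard
    private long lastTranslogId = -1;
    private long lastTranslogOps = 0;

    /**
     * Returns true if real indexing happened since the last check.
     * A brand new but *empty* translog generation (as created by a
     * synced flush on an idle shard) is ignored.
     */
    public boolean sawIndexingActivity(long translogId, long translogOps) {
        boolean changed = translogId != lastTranslogId || translogOps != lastTranslogOps;
        boolean active = changed && translogOps > 0; // empty translog => no activity
        lastTranslogId = translogId;
        lastTranslogOps = translogOps;
        return active;
    }

    public static void main(String[] args) {
        ShardActivitySketch shard = new ShardActivitySketch();
        System.out.println(shard.sawIndexingActivity(1, 3)); // true: documents were indexed
        System.out.println(shard.sawIndexingActivity(2, 0)); // false: synced flush created a new, empty translog
    }
}
```

The trade-off mentioned above falls out of this rule: operations that are indexed and then flushed away before the next check leave the current translog empty, so they are not seen as activity.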

@bleskes bleskes changed the title from "An inactive shard is temporarily activated by triggered synced flush" to "An inactive shard is activated by triggered synced flush" Sep 25, 2015
@bleskes bleskes added the review label Sep 26, 2015
@@ -144,7 +151,7 @@ public IndexingMemoryController(Settings settings, ThreadPool threadPool, Indices
             translogBuffer = maxTranslogBuffer;
         }
     } else {
-        translogBuffer = ByteSizeValue.parseBytesSizeValue(translogBufferSetting, null);
+        translogBuffer = ByteSizeValue.parseBytesSizeValue(translogBufferSetting, TRANSLOG_BUFFER_SIZE_SETTING);
Contributor
Nice catch.
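
For context on the line above: passing the setting's name instead of null means that a malformed value fails with an error that identifies the offending setting. A rough usage sketch (the string value of the constant is assumed here for illustration):

```java
import org.elasticsearch.common.unit.ByteSizeValue;

public class TranslogBufferParseSketch {
    // constant name taken from the diff above; the string value is an assumption
    public static final String TRANSLOG_BUFFER_SIZE_SETTING = "indices.memory.translog_buffer_size";

    public static void main(String[] args) {
        // parses "64kb" into a ByteSizeValue; if the value were malformed, the
        // resulting parse exception can now name TRANSLOG_BUFFER_SIZE_SETTING
        // instead of a null setting
        ByteSizeValue translogBuffer =
                ByteSizeValue.parseBytesSizeValue("64kb", TRANSLOG_BUFFER_SIZE_SETTING);
        System.out.println(translogBuffer);
    }
}
```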

@mikemccand
Contributor

This was a great catch @bleskes, and the cutover from IT to unit test is wonderful.

I left a bunch of small comments, but otherwise LGTM.

I still find that massive if statement in updateShardStatuses confusing but I can't quickly see an easy way to improve it...

@bleskes
Contributor Author

bleskes commented Sep 26, 2015

@mikemccand thx for the review. I pushed another round. Tried to simplify the 'big if'

// index into both shards, move the clock and see that they are still active
controller.setTranslog(shard1, randomInt(2), randomInt(2) + 1);
controller.setTranslog(shard2, randomInt(2) + 1, randomInt(2));
// nthe controller doesn't know when the ops happened, so even if this is more
Contributor

nthe -> the
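
The snippet above comes from the new unit test: the controller's per-shard view of the translog and the clock are both injected, so the inactivity behaviour can be tested without sleeping or starting a node. A stripped-down illustration of that pattern (hypothetical harness, not the actual test class; the 5m window mirrors the default mentioned in the description):

```java
import java.util.concurrent.TimeUnit;

// Hypothetical harness showing the "set translog state, move the clock,
// check activity" pattern; names are invented for illustration.
public class ShardInactivityClockDemo {

    private final long inactiveAfterMillis = TimeUnit.MINUTES.toMillis(5);

    private long nowMillis = 0;             // fake clock under test control
    private long lastTranslogId = -1;
    private long lastTranslogOps = 0;
    private long lastActivityMillis = 0;

    void advanceClock(long millis) { nowMillis += millis; }

    /** Mirrors controller.setTranslog(shard, id, ops): record what the controller sees. */
    void setTranslog(long translogId, long ops) {
        boolean changed = translogId != lastTranslogId || ops != lastTranslogOps;
        if (changed && ops > 0) {           // empty translogs do not count as activity
            lastActivityMillis = nowMillis;
        }
        lastTranslogId = translogId;
        lastTranslogOps = ops;
    }

    boolean isActive() {
        return nowMillis - lastActivityMillis < inactiveAfterMillis;
    }

    public static void main(String[] args) {
        ShardInactivityClockDemo shard = new ShardInactivityClockDemo();
        shard.setTranslog(1, 3);                              // index some docs
        shard.advanceClock(TimeUnit.MINUTES.toMillis(4));
        System.out.println(shard.isActive());                 // true: indexed within the last 5m
        shard.advanceClock(TimeUnit.MINUTES.toMillis(2));
        System.out.println(shard.isActive());                 // false: idle for more than 5m
    }
}
```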

@mikemccand
Contributor

LGTM, thanks @bleskes!

@bleskes bleskes closed this in 148265b Sep 27, 2015
@bleskes
Contributor Author

bleskes commented Sep 27, 2015

Pushed this to master. Will give CI some time before pushing it all the way back to 1.7...

@bleskes bleskes deleted the indexing_memory_controler branch September 27, 2015 05:35
@bleskes
Contributor Author

bleskes commented Sep 27, 2015

Pre-existing issue: I think 30s default is too large, because an index that was inactive and suddenly becomes active with a 512 KB indexing [...]

I was thinking about this some more (morning run, he he) - I think we should fold all the decisions and state about being active or not into the IndexShard. Then it will be easy to reset on every index operation (immediately). We also already have the notion of an inactivity listener - we can extend it to have an active event as well. We can also add a checkIfInactive method on IndexShard which checks whether there were any indexing ops (we already have stats) since the last time it was checked. To avoid making IndexShard more complex, we can push this functionality into ShardIndexingService - which is already in charge of stats.
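
Roughly, that proposal could look like the sketch below (all names are hypothetical; this is a follow-up idea, not code in this PR): the shard flips itself to active on every indexing operation, and a periodic check flips it back to inactive when nothing has been indexed since the last check.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the proposed follow-up design; not the actual IndexShard API.
public class ShardActivityTracker {

    private final AtomicLong opsSinceLastCheck = new AtomicLong();
    private final AtomicBoolean active = new AtomicBoolean(false);

    /** Called on the indexing path: the shard becomes active immediately. */
    public void onIndexOperation() {
        opsSinceLastCheck.incrementAndGet();
        if (active.compareAndSet(false, true)) {
            // fire an "activated" listener here
        }
    }

    /** Called periodically: if nothing was indexed since the last check, go inactive. */
    public boolean checkIfInactive() {
        if (opsSinceLastCheck.getAndSet(0) == 0 && active.compareAndSet(true, false)) {
            // fire an "inactivated" listener here, e.g. trigger a synced flush
            return true;
        }
        return false;
    }

    public boolean isActive() {
        return active.get();
    }

    public static void main(String[] args) {
        ShardActivityTracker tracker = new ShardActivityTracker();
        tracker.onIndexOperation();
        System.out.println(tracker.isActive());        // true: an op just came in
        System.out.println(tracker.checkIfInactive()); // false: ops happened since the last check
        System.out.println(tracker.checkIfInactive()); // true: nothing indexed since, shard goes inactive
    }
}
```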

@mikemccand
Contributor

I think we should fold all the decisions and state about being active or not into the indexshard.

+1, this would be cleaner. E.g. I don't like how IndexShard.updateBufferSize "infers" that it's going inactive by checking whether the new size is 512 KB ... it would be better if it were more directly aware that it's going inactive.

I can try to tackle this.

To avoid making IndexShard more complex, we can push this functionality ShardIndexingService - which is already in charge of stats.

Or, why not absorb ShardIndexingService into IndexShard? All it does is stats gathering and calling listeners (which I think is only used by percolator?). It's also confusing how it has postCreate and postCreateUnderLock. Seems like maybe an excessive abstraction at this point...

@bleskes
Contributor Author

bleskes commented Sep 27, 2015

I personally don't mind folding it into IndexShard, though having a dedicated stats/active component has benefits (a clearer test suite). I know Simon has an opinion on this.


bleskes added a commit that referenced this pull request Sep 28, 2015
When a shard becomes inactive we trigger a sync flush in order to speed up future recoveries. The sync flush causes a new translog generation to be made, which in turn confuses the IndexingMemoryController, making it think that the shard is active. If no documents come along in the next 5m, the shard is made inactive again, triggering a sync flush and so forth.

To avoid this, the IndexingMemoryController is changed to ignore empty translogs when checking if a shard became active. This comes at the price of potentially missing indexing operations that are followed by a flush. This is acceptable: if no more indexing operations come in, it's OK to leave the shard inactive.

A new unit test is introduced and comparable integration tests are removed.

Closes #13802
bleskes added a commit that referenced this pull request Sep 28, 2015
When a shard becomes inactive we trigger a sync flush in order to speed up future recoveries. The sync flush causes a new translog generation to be made, which in turn confuses the IndexingMemoryController, making it think that the shard is active. If no documents come along in the next 5m, the shard is made inactive again, triggering a sync flush and so forth.

To avoid this, the IndexingMemoryController is changed to ignore empty translogs when checking if a shard became active. This comes at the price of potentially missing indexing operations that are followed by a flush. This is acceptable: if no more indexing operations come in, it's OK to leave the shard inactive.

A new unit test is introduced and comparable integration tests are removed.

Closes #13802
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 28, 2015
When a shard becomes inactive we trigger a sync flush in order to speed up future recoveries. The sync flush causes a new translog generation to be made, which in turn confuses the IndexingMemoryController, making it think that the shard is active. If no documents come along in the next 5m, the shard is made inactive again, triggering a sync flush and so forth.

To avoid this, the IndexingMemoryController is changed to ignore empty translogs when checking if a shard became active. This comes at the price of potentially missing indexing operations that are followed by a flush. This is acceptable: if no more indexing operations come in, it's OK to leave the shard inactive.

A new unit test is introduced and comparable integration tests are removed.

Relates elastic#13802
Includes a backport of elastic#13784
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Sep 28, 2015
When a shard becomes inactive we trigger a sync flush in order to speed up future recoveries. The sync flush causes a new translog generation to be made, which in turn confuses the IndexingMemoryController, making it think that the shard is active. If no documents come along in the next 5m, the shard is made inactive again, triggering a sync flush and so forth.

To avoid this, the IndexingMemoryController is changed to ignore empty translogs when checking if a shard became active. This comes at the price of potentially missing indexing operations that are followed by a flush. This is acceptable: if no more indexing operations come in, it's OK to leave the shard inactive.

A new unit test is introduced and comparable integration tests are removed.

Relates elastic#13802
Includes a backport of elastic#13784

Closes elastic#13824