
Throttling incoming indexing when Lucene merges fall behind #6066

Closed
mikemccand opened this issue May 6, 2014 · 15 comments

@mikemccand
Contributor

Lucene has low-level protection that blocks incoming segment-producing threads (indexing threads, NRT reopen threads, commit, etc.) when there are too many merges running.

But this is too harsh for Elasticsearch, so it's entirely disabled. That means merges can fall far behind under heavy indexing, resulting in too many segments in the index, which causes all sorts of problems (slow version lookups, too much RAM used, etc.).

So we need to do something "softer". Simon has a good starting patch, which I tested and confirmed (once https://issues.apache.org/jira/browse/LUCENE-5644 is fixed) prevents too many segments in the index, at least in one use case:

Before Simon's + Lucene's fix: http://people.apache.org/~mikemccand/lucenebench/base.html

Same test with the fix: http://people.apache.org/~mikemccand/lucenebench/throttled.html

Segment counts stay essentially flat.

Here's Simon's prototype patch: s1monw@2de96f9
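For illustration only (this is not Simon's patch, and the names below are hypothetical): a minimal sketch of the "softer" idea, where the merge scheduler keeps a count of in-flight merges and exposes a flag once they exceed a configured maximum, so the engine can decide to throttle incoming indexing instead of letting Lucene block segment-producing threads outright.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch: track in-flight merges and report when they have
 * fallen behind, so the engine can apply "soft" backpressure to indexing
 * rather than relying on Lucene's hard blocking of segment producers.
 */
class MergeBacklogTracker {
    private final int maxNumMerges;                   // e.g. the scheduler's max_merge_count
    private final AtomicInteger mergesInFlight = new AtomicInteger();

    MergeBacklogTracker(int maxNumMerges) {
        this.maxNumMerges = maxNumMerges;
    }

    void onMergeStart()  { mergesInFlight.incrementAndGet(); }   // called when a merge begins
    void onMergeFinish() { mergesInFlight.decrementAndGet(); }   // called when a merge completes

    /** True while merges are falling behind and indexing should be throttled. */
    boolean fallingBehind() {
        return mergesInFlight.get() > maxNumMerges;
    }
}
```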

@nik9000
Member

nik9000 commented May 6, 2014

It looks like Simon's prototype pauses the indexing thread if too many merges are in flight. I'm not 100% clear on the code path that gets here. Will that pause indexing or pause refreshing or both? It'd be neat to slow down just the refreshing and let indexing be slowed down by the refresh backlog logic. Or am I crazy?

@s1monw
Contributor

s1monw commented May 6, 2014

@nik9000 internally the IndexWriter has several thread states (8 by default) that we index into. If we limit indexing to a single thread, we only use one of the states, max out the RAM buffer, and write the smallest number of segments. This means we 1. reduce the number of segments to merge and 2. make sure flushes only happen when really needed. I don't think we can slow down refreshes, otherwise folks will see odd results since they won't get new documents. You also want refresh to publish merged segments, to further reduce the number of segments. We will do the right thing and provide backpressure on indexing, not on refresh. Hope that makes sense?
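A rough sketch of what "backpressure on indexing, not on refresh" could look like (illustrative only, not the code that was merged; the class below is hypothetical): while the throttle is active, indexing threads are funneled through a single lock, so effectively only one IndexWriter thread state is used, while refresh is never held up.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical sketch: while throttling is active, at most one indexing
 * thread runs at a time (so the RAM buffer fills in one thread state and
 * fewer, larger segments are flushed), while refresh is never throttled
 * so readers keep seeing new documents and freshly merged segments.
 */
class SoftIndexThrottle {
    private final Lock single = new ReentrantLock();
    private volatile boolean active = false;

    void activate()   { active = true; }    // merges fell behind
    void deactivate() { active = false; }   // merges caught up again

    void runIndexOp(Runnable indexOp) {
        if (active) {
            single.lock();                  // funnel indexing through one thread
            try {
                indexOp.run();
            } finally {
                single.unlock();
            }
        } else {
            indexOp.run();                  // normal concurrent indexing
        }
    }

    void runRefresh(Runnable refreshOp) {
        refreshOp.run();                    // backpressure applies to indexing only
    }
}
```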

@nik9000
Member

nik9000 commented May 6, 2014

> We will do the right thing and provide backpressure on indexing, not on refresh. Hope that makes sense?

I'd honestly forgotten about flushes. It's what I get for only playing on the other side. Anyway, I'm happy so long as backpressure is provided on indexing.

@mikemccand
Contributor Author

I tested the current throttling branch with the refresh=-1 case, and we have problems because the "abandoned" thread states will never flush until a full flush ... the workaround is that you must issue a refresh to get them flushed.

@mikemccand
Contributor Author

I'm inclined to simply document that index throttling won't kick in if you use SerialMergeScheduler.

SMS only allows one merge to run at a time, so apps that are doing heavy bulk indexing really should not be using it.

@mikemccand
Contributor Author

OK I reviewed these changes with Simon. We decided we don't need to add a separate "kill switch" for this because you can just set max_merge_count higher to avoid throttling. But we also decided not to document this new setting on the index-modules-merges docs: it's a very advanced setting, and playing with it could easily mess up merges.
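For reference, the setting in question is index.merge.scheduler.max_merge_count, and raising it is what effectively disables the throttle. A hedged sketch with the 1.x-era Java client (the helper class is made up for illustration; on versions where the setting is not dynamically updatable it has to go into elasticsearch.yml instead), keeping in mind the caveat above that playing with it can easily mess up merges:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

// Illustrative helper (hypothetical class): raise max_merge_count so the
// index throttle kicks in later, or effectively never, for a given index.
public class MergeThrottleEscapeHatch {
    public static void raiseMaxMergeCount(Client client, String index, int maxMergeCount) {
        client.admin().indices().prepareUpdateSettings(index)
              .setSettings(ImmutableSettings.settingsBuilder()
                      .put("index.merge.scheduler.max_merge_count", maxMergeCount)
                      .build())
              .execute().actionGet();
    }
}
```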

s1monw added a commit that referenced this issue May 17, 2014
This commit upgrades to the latest Lucene 4.8.1 release including the
following bugfixes:

 * An IndexThrottle now kicks in when merges start falling behind
   limiting index threads to 1 until merges caught up. Closes #6066
 * RateLimiter now kicks in at the configured rate where previously
   the limiter was limiting at ~8MB/sec almost all the time. Closes #6018
s1monw closed this as completed in 85a0b76 on May 19, 2014
clintongormley added the :Core/Infra/Core label (Core issues without another label) on Jun 7, 2015
@l15k4

l15k4 commented Oct 21, 2015

Guys, I don't think this works as expected. I'm getting:

now throttling indexing: numMergesInFlight=4, maxNumMerges=3
stop throttling indexing: numMergesInFlight=2, maxNumMerges=3

5 times a second, right at the beginning of bulk indexing. I disable throttling and the refresh interval, I start with an optimized index and wait until segment merging finishes, but segment merging still falls behind... Why does index throttling start and stop so frequently?

@clintongormley

@l15k4

l15k4 commented Oct 23, 2015

@clintongormley but they are not keeping up right from the moment indexing starts into a small (1M records) optimized index... increasing the merge thread pool doesn't help...

I have 4 ec2.xlarge instances clustered, with 1B records in 30 indices (5 shards each). If I create a new index and start bulk indexing into it, throttling happens right away. All fields are doc_values, and I think it started right after I reindexed everything to doc_values around 1.6.0 ... I haven't been able to shake it off since... I've tried everything...

IMHO I'd need to scale up just because of segment merging, but then there would be plenty of unused resources...

I've been trying to solve this issue for months now...

@mikemccand
Contributor Author

@l15k4 did you disable store IO throttling? (It defaults to 20 MB/sec, which is too low for heavy indexing cases.)

Where are you storing the shards (what IO devices), EBS or local instance storage?

Also try the ideas here: https://www.elastic.co/blog/performance-considerations-elasticsearch-indexing
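For concreteness, the store IO throttle mentioned above is controlled by the indices.store.throttle.* settings (20 MB/sec by default at the time). A hedged sketch of switching it off cluster-wide with the 1.x-era Java client (the helper class is made up for illustration; the same change can be made through the cluster update-settings REST API or elasticsearch.yml):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

// Illustrative helper (hypothetical class): disable store-level IO throttling
// so merges are not capped at the default 20 MB/sec on fast disks.
public class StoreThrottleTuning {
    public static void disableStoreThrottle(Client client) {
        client.admin().cluster().prepareUpdateSettings()
              .setTransientSettings(ImmutableSettings.settingsBuilder()
                      .put("indices.store.throttle.type", "none")
                      .build())
              .execute().actionGet();
    }
}
```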

@l15k4

l15k4 commented Oct 25, 2015

@mikemccand I set it to 30, 40, 80, 100 MB/s ... it had no effect. I also tried setting index.merge.scheduler.max_thread_count: 6 but it led to throttling (now throttling indexing: numMergesInFlight=9), so it didn't help either...

We use EBS (General Purpose (SSD)) on c4.xlarge instances. 2 volumes, one for system and one dedicated for ES...

It seems that if you are doing bulk indexing and have all fields as doc_values, then you need either a quad-core machine or a physically attached SSD, or both; otherwise segment merging will always fall behind no matter what optimizations you make...

It always looks this way: it throttles for some period of time, like 15-20 minutes, and then it stops: http://i.imgur.com/UyDTlHi.png

I also tried shrinking the index and bulk threadpools so segment merging could keep up with bulk indexing, but it didn't help either ... it can't keep up right from the first few bulk index requests...

@l15k4

l15k4 commented Oct 25, 2015

The best bulk indexing performance I can get on a machine with 4 hyperthreads and EBS (750 Mbps), with all fields being doc_values, is by increasing index.merge.scheduler.max_thread_count to 6 and decreasing threadpool.bulk.size to 2. This way it only logs now throttling indexing about every 6 seconds, but it is still throttling, so the throughput is now http://i.imgur.com/wXCNZh7.png

I think that with doc_values people don't have much of a choice; they'll need a physically attached SSD...

@mikemccand
Contributor Author

Hmm enabling doc values is typically a minor indexing performance hit in my experience, e.g. see the nightly benchmarks at https://benchmarks.elastic.co (annotation R on the first chart).

Do you have provisioned IOPS for your EBS mounts? Are you sure you're not running into that limit?

Can you try the local instance SSD, just for comparison? Your EBS is backed by SSD as well, so this would let us remove EBS from the equation. (You'd need to switch to an i2.4xlarge instance for this test).

@l15k4

l15k4 commented Oct 26, 2015

General Purpose, unfortunately; the price of Provisioned IOPS SSDs surprised us. If you want to go beyond 160 MiB/s to 320 MiB/s, it costs double the price of the volume itself.

I guess it wouldn't throttle with a Provisioned IOPS SSD at 9000 IOPS to reach those 320 MiB/s ... but those machines cost a fortune :-)

@mikemccand
Contributor Author

> I guess it wouldn't throttle with a Provisioned IOPS SSD at 9000 IOPS to reach those 320 MiB/s ... but those machines cost a fortune :-)

Or just use the local instance attached SSDs on the i2.* instance types ...
