
Throttling incoming indexing when Lucene merges fall behind #6066

Closed
mikemccand opened this issue May 6, 2014 · 15 comments

@mikemccand
Contributor

Lucene has low-level protection that blocks incoming segment-producing threads (indexing threads, NRT reopen threads, commit, etc.) when there are too many merges running.

But this is too harsh for Elasticsearch, so it's entirely disabled. That means merges can fall far behind under heavy indexing, resulting in too many segments in the index, which causes all sorts of problems (slow version lookups, too much RAM used, etc.).

So we need to do something "softer". Simon has a good starting patch, which I tested and confirmed (once https://issues.apache.org/jira/browse/LUCENE-5644 is fixed) prevents too many segments in the index, at least in one use case:

Before Simon's + Lucene's fix: http://people.apache.org/~mikemccand/lucenebench/base.html

Same test with the fix: http://people.apache.org/~mikemccand/lucenebench/throttled.html

Segment counts stay essentially flat.

Here's Simon's prototype patch: s1monw@2de96f9
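For illustration only (this is not Simon's patch, and the names below are hypothetical): a minimal sketch of the "softer" idea, where the merge scheduler keeps a count of in-flight merges and exposes a flag once they exceed a configured maximum, so the engine can decide to throttle incoming indexing instead of letting Lucene block segment-producing threads outright.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch: track in-flight merges and report when they have
 * fallen behind, so the engine can apply "soft" backpressure to indexing
 * rather than relying on Lucene's hard blocking of segment producers.
 */
class MergeBacklogTracker {
    private final int maxNumMerges;                   // e.g. the scheduler's max_merge_count
    private final AtomicInteger mergesInFlight = new AtomicInteger();

    MergeBacklogTracker(int maxNumMerges) {
        this.maxNumMerges = maxNumMerges;
    }

    void onMergeStart()  { mergesInFlight.incrementAndGet(); }   // called when a merge begins
    void onMergeFinish() { mergesInFlight.decrementAndGet(); }   // called when a merge completes

    /** True while merges are falling behind and indexing should be throttled. */
    boolean fallingBehind() {
        return mergesInFlight.get() > maxNumMerges;
    }
}
```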

@nik9000
Member

nik9000 commented May 6, 2014

It looks like Simon's prototype pauses the indexing thread if too many merges are in flight. I'm not 100% clear on the code path that gets here. Will that pause indexing or pause refreshing or both? It'd be neat to slow down just the refreshing and let indexing be slowed down by the refresh backlog logic. Or am I crazy?

@s1monw
Contributor

s1monw commented May 6, 2014

@nik9000 internally the IndexWriter has several thread states (8 by default) that we index into. If we limit indexing to a single thread, we only use one of the states, max out the RAM buffer, and write the smallest number of segments. This means we 1. reduce the number of segments to merge and 2. make sure flushes only happen when really needed. I don't think we can slow down refreshes, otherwise folks will see odd results since they won't get new documents. You also want refresh to publish merged segments, to further reduce the number of segments. We will do the right thing and provide backpressure on indexing, not on refresh. Hope that makes sense?
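A rough sketch of what "backpressure on indexing, not on refresh" could look like (illustrative only, not the code that was merged; the class below is hypothetical): while the throttle is active, indexing threads are funneled through a single lock, so effectively only one IndexWriter thread state is used, while refresh is never held up.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical sketch: while throttling is active, at most one indexing
 * thread runs at a time (so the RAM buffer fills in one thread state and
 * fewer, larger segments are flushed), while refresh is never throttled
 * so readers keep seeing new documents and freshly merged segments.
 */
class SoftIndexThrottle {
    private final Lock single = new ReentrantLock();
    private volatile boolean active = false;

    void activate()   { active = true; }    // merges fell behind
    void deactivate() { active = false; }   // merges caught up again

    void runIndexOp(Runnable indexOp) {
        if (active) {
            single.lock();                  // funnel indexing through one thread
            try {
                indexOp.run();
            } finally {
                single.unlock();
            }
        } else {
            indexOp.run();                  // normal concurrent indexing
        }
    }

    void runRefresh(Runnable refreshOp) {
        refreshOp.run();                    // backpressure applies to indexing only
    }
}
```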

@nik9000
Member

nik9000 commented May 6, 2014

> We will do the right thing and provide backpressure on indexing, not on refresh. Hope that makes sense?

I'd honestly forgotten about flushes. It's what I get for only playing on the other side. Anyway, I'm happy so long as backpressure is provided on indexing.

@mikemccand
Contributor Author

I tested the current throttling branch with the refresh=-1 case, and we have problems because the "abandoned" thread states will never flush until a full flush ... the workaround is that you must issue a refresh to get them flushed.

@mikemccand
Contributor Author

I'm inclined to simply document that index throttling won't kick in if you use SerialMergeScheduler.

SMS only allows one merge to run at a time, so apps that are doing heavy bulk indexing really should not be using it.

@mikemccand
Contributor Author

OK I reviewed these changes with Simon. We decided we don't need to add a separate "kill switch" for this because you can just set max_merge_count higher to avoid throttling. But we also decided not to document this new setting on the index-modules-merges docs: it's a very advanced setting, and playing with it could easily mess up merges.
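For reference, the setting in question is index.merge.scheduler.max_merge_count, and raising it is what effectively disables the throttle. A hedged sketch with the 1.x-era Java client (the helper class is made up for illustration; on versions where the setting is not dynamically updatable it has to go into elasticsearch.yml instead), keeping in mind the caveat above that playing with it can easily mess up merges:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

// Illustrative helper (hypothetical class): raise max_merge_count so the
// index throttle kicks in later, or effectively never, for a given index.
public class MergeThrottleEscapeHatch {
    public static void raiseMaxMergeCount(Client client, String index, int maxMergeCount) {
        client.admin().indices().prepareUpdateSettings(index)
              .setSettings(ImmutableSettings.settingsBuilder()
                      .put("index.merge.scheduler.max_merge_count", maxMergeCount)
                      .build())
              .execute().actionGet();
    }
}
```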

s1monw added a commit that referenced this issue May 17, 2014
This commit upgrades to the latest Lucene 4.8.1 release including the
following bugfixes:

 * An IndexThrottle now kicks in when merges start falling behind
   limiting index threads to 1 until merges caught up. Closes #6066
 * RateLimiter now kicks in at the configured rate where previously
   the limiter was limiting at ~8MB/sec almost all the time. Closes #6018
s1monw closed this as completed in 85a0b76 on May 19, 2014
clintongormley added the :Core/Infra/Core label (Core issues without another label) on Jun 7, 2015
@l15k4

l15k4 commented Oct 21, 2015

Guys, I don't think this works as expected. I'm getting:

now throttling indexing: numMergesInFlight=4, maxNumMerges=3
stop throttling indexing: numMergesInFlight=2, maxNumMerges=3

5 times a second, right at the beginning of bulk indexing. I disable throttling and the refresh interval, I start with an optimized index and wait until segment merging finishes, but segment merging still falls behind... Why does index throttling start and stop so frequently?

@clintongormley

@l15k4

l15k4 commented Oct 23, 2015

@clintongormley but they are not keeping up right from the moment indexing starts into a small (1M records) optimized index... increasing the merge thread pool doesn't help...

I have 4 ec2.xlarge instances clustered, with 1B records in 30 indices (5 shards each). If I create a new index and start bulk indexing into it, throttling happens right away. All fields are doc_values, and I think it started right after I reindexed everything to doc_values around 1.6.0 ... I haven't been able to shake it off since... I've tried everything...

IMHO I'd need to scale up just because of segment merging, but then there would be plenty of unused resources...

I've been trying to solve this issue for months now...

@mikemccand
Contributor Author

@l15k4 did you disable store IO throttling? (It defaults to 20 MB/sec, which is too low for heavy indexing cases.)

Where are you storing the shards (what IO devices), EBS or local instance storage?

Also try the ideas here: https://www.elastic.co/blog/performance-considerations-elasticsearch-indexing
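For concreteness, the store IO throttle mentioned above is controlled by the indices.store.throttle.* settings (20 MB/sec by default at the time). A hedged sketch of switching it off cluster-wide with the 1.x-era Java client (the helper class is made up for illustration; the same change can be made through the cluster update-settings REST API or elasticsearch.yml):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

// Illustrative helper (hypothetical class): disable store-level IO throttling
// so merges are not capped at the default 20 MB/sec on fast disks.
public class StoreThrottleTuning {
    public static void disableStoreThrottle(Client client) {
        client.admin().cluster().prepareUpdateSettings()
              .setTransientSettings(ImmutableSettings.settingsBuilder()
                      .put("indices.store.throttle.type", "none")
                      .build())
              .execute().actionGet();
    }
}
```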

@l15k4

l15k4 commented Oct 25, 2015

@mikemccand I set it to 30, 40, 80, 100 MB/s ... it had no effect. I also tried setting index.merge.scheduler.max_thread_count: 6 but it led to throttling (now throttling indexing: numMergesInFlight=9), so it didn't help either...

We use EBS (General Purpose (SSD)) on c4.xlarge instances. 2 volumes, one for system and one dedicated for ES...

It seems that if you are doing bulk indexing and have all fields as doc_values, then you need either a quad-core machine or a physically attached SSD, or both; otherwise segment merging will always fall behind no matter what optimizations you make...

It always looks this way: it throttles for some period of time, like 15-20 minutes, and then it stops: http://i.imgur.com/UyDTlHi.png

I also tried shrinking the index and bulk threadpools so segment merging could keep up with bulk indexing, but it didn't help either ... it can't keep up right from the first few bulk index requests...

@l15k4

l15k4 commented Oct 25, 2015

The best bulk indexing performance I can get on a machine with 4 hyperthreads and EBS (750 Mbps), with all fields being doc_values, is by increasing index.merge.scheduler.max_thread_count to 6 and decreasing threadpool.bulk.size to 2. This way it only logs now throttling indexing about every 6 seconds, but it is still throttling, so the throughput is now http://i.imgur.com/wXCNZh7.png

I think that with doc_values people don't have much of a choice; they'll need a physically attached SSD...

@mikemccand
Contributor Author

Hmm enabling doc values is typically a minor indexing performance hit in my experience, e.g. see the nightly benchmarks at https://benchmarks.elastic.co (annotation R on the first chart).

Do you have provisioned IOPS for your EBS mounts? Are you sure you're not running into that limit?

Can you try the local instance SSD, just for comparison? Your EBS is backed by SSD as well, so this would let us remove EBS from the equation. (You'd need to switch to an i2.4xlarge instance for this test).

@l15k4

l15k4 commented Oct 26, 2015

General Purpose, unfortunately; the price of Provisioned IOPS SSDs surprised us. If you want to go beyond 160 MiB/s to 320 MiB/s, it costs double the price of the volume itself.

I guess it wouldn't throttle with a Provisioned IOPS SSD at 9000 IOPS to reach those 320 MiB/s ... but those machines cost a fortune :-)

@mikemccand
Contributor Author

> I guess it wouldn't throttle with a Provisioned IOPS SSD at 9000 IOPS to reach those 320 MiB/s ... but those machines cost a fortune :-)

Or just use the local instance attached SSDs on the i2.* instance types ...
