More resource efficient analysis wrapping usage #6714
Conversation
Today, we take great care to share the same analyzer instances across shards and indices (a global analyzer). The idea is that the thread-local resources an analyzer holds are not allocated per analyzer instance per thread. The problem is that AnalyzerWrapper keeps its resources in its own per-thread storage, and with the per-field reuse strategy this causes per-field, per-thread token stream components to be created. This is very evident with StandardTokenizer, which uses a buffer... This came out of a test with "many fields", where the majority of a 1GB heap was consumed by StandardTokenizer instances... closes elastic#6714
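The memory blow-up described above can be sketched without any Lucene dependency. This is a hypothetical, simplified model (the class and field names are illustrative, not Lucene's): a per-field reuse strategy keeps a per-thread map from field name to cached components, so a heavy buffer is retained once per thread *per field* rather than once per thread.

```java
// Hypothetical sketch (not the actual Lucene/Elasticsearch code) of why a
// per-field reuse strategy on a shared analyzer multiplies memory use:
// memory grows as threads x fields x buffer size, not threads x buffer size.
import java.util.HashMap;
import java.util.Map;

public class PerFieldReuseSketch {
    // Stand-in for TokenStreamComponents holding a StandardTokenizer buffer.
    static class HeavyComponents {
        final char[] buffer = new char[4096]; // ~8 KB of chars per instance
    }

    // Per-thread, per-field cache, as a per-field reuse strategy keeps it.
    static final ThreadLocal<Map<String, HeavyComponents>> perFieldCache =
            ThreadLocal.withInitial(HashMap::new);

    static HeavyComponents componentsFor(String field) {
        return perFieldCache.get().computeIfAbsent(field, f -> new HeavyComponents());
    }

    public static void main(String[] args) {
        // Simulate analyzing 1000 distinct fields on one thread:
        // 1000 buffers stay cached for that thread alone.
        for (int i = 0; i < 1000; i++) {
            componentsFor("field_" + i);
        }
        int cached = perFieldCache.get().size();
        long bytes = cached * 4096L * 2; // char = 2 bytes
        System.out.println(cached + " cached components, ~" + bytes + " bytes per thread");
    }
}
```

With many threads and many fields, this is exactly how most of a 1GB heap can end up in tokenizer buffers.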
@uschindler hey, I would love a review of this as well. I think that if it's good, this might be a good addition to Lucene too?
+1 from me. I think the change is good. Afterwards, we should fix this in Lucene too: the 'restrictions' here are not harsh and IMO should somehow be the "default" behavior in Lucene. The issue is that there are two use cases baked into AnalyzerWrapper.java: 1. the delegating use case (by field name); 2. the actual wrapping use case, where you take an existing analyzer and tweak its functionality. So long term, I am thinking we should separate the two in Lucene. The delegating use case (which is far more typical) would then be more efficient and serve as the base class for PerFieldAnalyzerWrapper. The other, wrapping use case can be a separate class which is the base for ShingleAnalyzerWrapper and those things.
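The two use cases Robert distinguishes can be illustrated with stand-in types (this is a hedged sketch, not the real Lucene Analyzer API — `SimpleAnalyzer`, `delegating`, and `lowercasing` are invented names): delegating selects a whole analyzer per field, while wrapping keeps one analyzer and post-processes its output.

```java
// Sketch of the two wrapper use cases, using mock types instead of Lucene's:
// 1. delegating: pick an entire analyzer by field name (PerFieldAnalyzerWrapper-style)
// 2. wrapping: keep one analyzer, tweak its token output (ShingleAnalyzerWrapper-style)
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WrapperKindsSketch {
    interface SimpleAnalyzer {
        List<String> analyze(String text);
    }

    // Use case 1: delegate by field name, falling back to a default analyzer.
    static SimpleAnalyzer delegating(Map<String, SimpleAnalyzer> perField,
                                     SimpleAnalyzer fallback, String field) {
        return perField.getOrDefault(field, fallback);
    }

    // Use case 2: wrap one analyzer and post-process its tokens.
    static SimpleAnalyzer lowercasing(SimpleAnalyzer delegate) {
        return text -> delegate.analyze(text).stream()
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SimpleAnalyzer whitespace = text -> List.of(text.split("\\s+"));
        SimpleAnalyzer keyword = text -> List.of(text);

        SimpleAnalyzer forBody = delegating(Map.of("id", keyword), whitespace, "body");
        System.out.println(forBody.analyze("Hello World"));
        System.out.println(lowercasing(whitespace).analyze("Hello World"));
    }
}
```

The delegating case never needs its own token-stream state, which is why it can be made cheap; the wrapping case genuinely transforms streams and needs its own components.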
Hi, I will look into this in a moment. Thanks for the ping on Twitter!
```java
        return super.wrapReader(fieldName, reader);
    }

    private static class DelegatingReuseStrategy extends ReuseStrategy {
```
I would make this a non-static inner class. That way you don't need the "wrapper" field (it is passed down automatically).
I am not 100% sure, though; maybe the javac compiler disallows creating a non-static inner class while "this" is not yet usable... If it does not work, ignore this comment :-)
Indeed, the compiler barfs when it's not static :)
And the compiler is right: before the super constructor is called, `this` is not yet usable, not even defined according to the JLS. And we would be creating the inner class before the super constructor is called (the argument is evaluated first)!
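A minimal, self-contained example of the JLS point above (class names here are invented for illustration): a non-static inner class needs an enclosing `this`, but `this` does not exist yet while the super constructor's arguments are being evaluated, so a static nested class must be used instead.

```java
// Sketch: why a non-static inner class cannot be passed to super(...).
// 'this' is undefined while the super constructor argument is evaluated.
public class SuperArgSketch {
    static class Base {
        final Object strategy;
        Base(Object strategy) { this.strategy = strategy; }
    }

    static class Wrapper extends Base {
        // A non-static inner class captures the enclosing Wrapper instance.
        class InnerStrategy { }

        // A static nested class does not, so it can be built before 'this' exists.
        static class StaticStrategy { }

        Wrapper() {
            // super(new InnerStrategy()); // does NOT compile: cannot reference
            //                             // 'this' before the supertype
            //                             // constructor has been called
            super(new StaticStrategy());   // fine: no enclosing instance needed
        }
    }

    public static void main(String[] args) {
        Wrapper w = new Wrapper();
        System.out.println(w.strategy.getClass().getSimpleName());
    }
}
```

This is exactly why the `DelegatingReuseStrategy` in the patch has to stay static and carry an explicit reference to its wrapper.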
Hi,
In any case we should put this thing into Lucene, too, and make PerFieldAnalyzerWrapper extend it! Strong +1, and fighting against memory waste :-)
In addition, I have the feeling we should reorder the code extract above from Lucene so that the wrapping of the reader is done after the components are created, or at the beginning of the method. The current code is hard to understand because the initReader() call sits in the middle of the other logic! This is a relic from the time when Tokenizers got the reader in the constructor (oh my god, thanks @rmuir for fixing this!)
In Lucene 4.x we still have the reader in the Tokenizer's constructor. But we still don't need to make the wrapReader() method final. If the delegating AnalyzerWrapper wraps the reader and stores the TokenStream with the wrapped reader in the delegate, it's still no problem: when the components are reused, both the delegate and the delegator can set a new reader, wrapped or not. In addition, when the Tokenizer is closed, it unsets the reader, so the cache no longer holds a reader.
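The reuse behaviour Uwe describes can be modelled with a mock (this is a hedged stand-in, not the Lucene Tokenizer API): a cached tokenizer simply receives a fresh reader on reuse, and closing it drops the reader so the per-thread cache never pins one.

```java
// Mock of the reuse lifecycle described above (illustrative, not Lucene's API):
// setReader() attaches input for the next document; close() unsets it so the
// reuse cache holds no reader between uses.
import java.io.Reader;
import java.io.StringReader;

public class ReaderReuseSketch {
    static class MockTokenizer {
        private Reader input; // null when closed: the cache pins no reader

        void setReader(Reader reader) { this.input = reader; }
        boolean hasReader() { return input != null; }
        void close() { this.input = null; } // unset on close, as described
    }

    public static void main(String[] args) {
        MockTokenizer cached = new MockTokenizer(); // lives in a reuse cache

        cached.setReader(new StringReader("first document"));
        System.out.println("in use: " + cached.hasReader());

        cached.close();
        System.out.println("after close: " + cached.hasReader());

        // On reuse, a new reader (wrapped or not) can simply be set again.
        cached.setReader(new StringReader("second document"));
        System.out.println("reused: " + cached.hasReader());
    }
}
```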
Cool, great. I will push this for now, and we will move to whatever improvement happens in Lucene.
@uschindler cool, yeah, both additional cases are not relevant in the ES case. I added an assert that will trip once we update to 4.10, so we remove this class and use the one from the issue. We can't use per-field analyzer wrapper (I assume you mean the FieldNamesAnalyzer class); this is going through additional changes that will cause it to diverge from it even more, which I think is fine.