Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More resource efficient analysis wrapping usage #6714

Closed
wants to merge 1 commit into from

Conversation

kimchy
Copy link
Member

@kimchy kimchy commented Jul 3, 2014

Today, we take great care to try and share the same analyzer instances across shards and indices (global analyzer). The idea is to share the same analyzer so the thread local resource it has will not be allocated per analyzer instance per thread.
The problem is that AnalyzerWrapper keeps its resources on its own per thread storage, and with per field reuse strategy, it causes for per field per thread token stream components to be used. This is very evident with the StandardTokenizer that uses a buffer...
This came out of test with "many fields", where the majority of 1GB heap was consumed by StandardTokenizer instances...

Today, we take great care to try and share the same analyzer instances across shards and indices (global analyzer). The idea is to share the same analyzer so the thread local resource it has will not be allocated per analyzer instance per thread.
The problem is that AnalyzerWrapper keeps its resources on its own per thread storage, and with per field reuse strategy, it causes for per field per thread token stream components to be used. This is very evident with the StandardTokenizer that uses a buffer...
This came out of test with "many fields", where the majority of 1GB heap was consumed by StandardTokenizer instances...
closes elastic#6714
@kimchy
Copy link
Member Author

kimchy commented Jul 3, 2014

@uschindler hey, I would love for a review of this as well? I think that if its good, this might be a good addition to Lucene?

@rmuir
Copy link
Contributor

rmuir commented Jul 3, 2014

+1 from me. I think the change is good. afterwards, we should fix this in Lucene too:

the 'restrictions' here are not harsh and IMO should somehow be the "default" behavior in lucene. The issue is there are two use cases baked into AnalyzerWrapper.java: 1. delegating use case (by field name). 2. actual wrapping use case, where you take an existing analyzer and tweak functionality. So long term, I am thinking we should separate the two in lucene. And this delegating use-case (which is way more typical) will be more efficient, the base class for PerFieldAnalyzerWrapper. The other wrapping use case, can be separate class which is base for ShingleANalyzerWrapper and those things.

@uschindler
Copy link
Contributor

Hi, I will look into this in a moment. Thanks for the ping on twitter!

@s1monw s1monw added the review label Jul 3, 2014
return super.wrapReader(fieldName, reader);
}

private static class DelegatingReuseStrategy extends ReuseStrategy {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this a non-static inner class. By this you don't need the "wrapper" field (its automatically passed down).

I am not 100% sure, maybe the javac compiler disallows to create inner non-static classes if "this" is not yet useable... If it does not work, ignore this comment :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, compiler barfs when its not static :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the compiler is right: Before the super constructor is called, this is not yet useable, not even defined according to JLS. And we create the inner class before the super constructor is called (argument is evaluated first)!

@uschindler
Copy link
Contributor

Hi,
I know the problem and was not happy with AnalyzerWrapper all the time, especially for PerFieldAnalyzerWrapper in Lucene (what is actually the difference of Elasticsearch's FieldNameAnalyzer)?
I agree, if you don't wrap the tokenstream, there is no problem at all.
Actually, you don't need to make wrapReader final, because the Reader is always recreated if the tokenstream is reused. getReusableComponents() would then return non-null and Analyzer base class would wrap the reader for you:

  public final TokenStream tokenStream(final String fieldName,
                                       final Reader reader) throws IOException {
    TokenStreamComponents components = reuseStrategy.getReusableComponents(this, fieldName);
    final Reader r = initReader(fieldName, reader);
    if (components == null) {
      components = createComponents(fieldName);
      reuseStrategy.setReusableComponents(this, fieldName, components);
    }
    components.setReader(r);
    return components.getTokenStream();
  }

In any case we should put this thing into Lucene, too - and make PerFieldAnalyzerWrapper extend it! Strong +1 and fighting against memory waste :-)

@uschindler
Copy link
Contributor

In addition, I have the feeling, we should reorder the above code extract from Lucene, so the wrapping of reader is done after the components are created or at the beginning of the method. The current code is hard to understand because the initReader() method is somewhere in the middle of the other logic! This is a relict from the time when Tokenizers got the reader in the constructor (oh my god, thanks @rmuir for fixing this!)

@uschindler
Copy link
Contributor

In Lucene 4.x we still have the reader in Tokenizers constructor. But we still don't need to make the wrapReader() method final. If the delegating AnalyzerWrapper wraps the reader and stores the TokenStream with the wrapped reader in the delegate, its still no problem, because if the component is reused, both - the delegate and the delegator - can set a new reader - wrapped or not. In addition, when the Tokenizer is closed, it unsets the reader, so the cache does not contain a reader anymore.

@kimchy
Copy link
Member Author

kimchy commented Jul 3, 2014

cool, great, I will push this then for now, and we will move to whatever improvement happens in Lucene

@kimchy kimchy added enhancement and removed review labels Jul 3, 2014
@kimchy kimchy closed this in 5249005 Jul 3, 2014
kimchy added a commit that referenced this pull request Jul 3, 2014
Today, we take great care to try and share the same analyzer instances across shards and indices (global analyzer). The idea is to share the same analyzer so the thread local resource it has will not be allocated per analyzer instance per thread.
The problem is that AnalyzerWrapper keeps its resources on its own per thread storage, and with per field reuse strategy, it causes for per field per thread token stream components to be used. This is very evident with the StandardTokenizer that uses a buffer...
This came out of test with "many fields", where the majority of 1GB heap was consumed by StandardTokenizer instances...
closes #6714
@kimchy kimchy deleted the better_analyzer_wrapper branch July 3, 2014 19:07
@uschindler
Copy link
Contributor

@uschindler
Copy link
Contributor

Hi Shay,
I resolved the issue in Lucene. The new DelegatingAnalyzerWrapper will be available in Lucene 4.10! Maybe you want to add a note into ES code base, or use XDelegatingAnalyzerWrapper as new class name.
The Lucene class has additional tests and also supports one use-case which does not happen in ES: If you wrap the delegating analyzer with another non-delegating analyzer and use "new OtherAnalyzer(delegatingWrapper.getReuseStrategy())" (the common pattern), this would break reuse. To prevent this, DelegatingAnalyzerWrapper now gets a ReuseStrategy like all other analyzers as ctor param and delegates to this "fallback", if delegating to the underlying analyzer is not possible (if the Analyzer is not top-level).
I would also use PerFieldAnalyzerWrapper in Elasticsearch now, as your class is more or less a clone.

@kimchy
Copy link
Member Author

kimchy commented Jul 5, 2014

@uschindler cool, yea, both additional cases are not relevant in ES case, I added an assert that will trip once we update to 4.10 to remove this class, and use the one from the issue.

we can't use per field analyzer wrapper (I assume you mean FieldNamesAnalyzer class), this is going through additional changes that will cause it to diverge from it even more, which I think is fine.

@clintongormley clintongormley changed the title More resource efficient analysis wrapping usage Analysis: More resource efficient analysis wrapping usage Jul 16, 2014
@clintongormley clintongormley added the :Search/Analysis How text is split into tokens label Jun 7, 2015
@clintongormley clintongormley changed the title Analysis: More resource efficient analysis wrapping usage More resource efficient analysis wrapping usage Jun 7, 2015
@achow achow mentioned this pull request Sep 22, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants