More resource efficient analysis wrapping usage #6714
Conversation
Today, we take great care to share the same analyzer instances across shards and indices (a global analyzer). The idea is that the thread-local resources an analyzer holds are not allocated per analyzer instance per thread. The problem is that AnalyzerWrapper keeps its resources in its own per-thread storage, and with the per-field reuse strategy this causes per-field, per-thread token stream components to be created. This is very evident with StandardTokenizer, which uses a buffer... This came out of a test with "many fields", where the majority of a 1GB heap was consumed by StandardTokenizer instances... closes elastic#6714
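The memory blow-up described above can be sketched without any Lucene dependency. This is a hypothetical, simplified model (the class and field names are illustrative, not Lucene's): a per-field reuse strategy keeps a per-thread map from field name to cached components, so a heavy buffer is retained once per thread *per field* rather than once per thread.

```java
// Hypothetical sketch (not the actual Lucene/Elasticsearch code) of why a
// per-field reuse strategy on a shared analyzer multiplies memory use:
// memory grows as threads x fields x buffer size, not threads x buffer size.
import java.util.HashMap;
import java.util.Map;

public class PerFieldReuseSketch {
    // Stand-in for TokenStreamComponents holding a StandardTokenizer buffer.
    static class HeavyComponents {
        final char[] buffer = new char[4096]; // ~8 KB of chars per instance
    }

    // Per-thread, per-field cache, as a per-field reuse strategy keeps it.
    static final ThreadLocal<Map<String, HeavyComponents>> perFieldCache =
            ThreadLocal.withInitial(HashMap::new);

    static HeavyComponents componentsFor(String field) {
        return perFieldCache.get().computeIfAbsent(field, f -> new HeavyComponents());
    }

    public static void main(String[] args) {
        // Simulate analyzing 1000 distinct fields on one thread:
        // 1000 buffers stay cached for that thread alone.
        for (int i = 0; i < 1000; i++) {
            componentsFor("field_" + i);
        }
        int cached = perFieldCache.get().size();
        long bytes = cached * 4096L * 2; // char = 2 bytes
        System.out.println(cached + " cached components, ~" + bytes + " bytes per thread");
    }
}
```

With many threads and many fields, this is exactly how most of a 1GB heap can end up in tokenizer buffers.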
@uschindler hey, I would love a review of this as well. I think that if it's good, this might be a good addition to Lucene too?
+1 from me. I think the change is good. Afterwards, we should fix this in Lucene too: the 'restrictions' here are not harsh and IMO should somehow be the "default" behavior in Lucene. The issue is that there are two use cases baked into AnalyzerWrapper.java: 1. the delegating use case (by field name); 2. the actual wrapping use case, where you take an existing analyzer and tweak its functionality. So long term, I am thinking we should separate the two in Lucene. The delegating use case (which is far more typical) would then be more efficient and serve as the base class for PerFieldAnalyzerWrapper. The other, wrapping use case can be a separate class which is the base for ShingleAnalyzerWrapper and those things.
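The two use cases Robert distinguishes can be illustrated with stand-in types (this is a hedged sketch, not the real Lucene Analyzer API — `SimpleAnalyzer`, `delegating`, and `lowercasing` are invented names): delegating selects a whole analyzer per field, while wrapping keeps one analyzer and post-processes its output.

```java
// Sketch of the two wrapper use cases, using mock types instead of Lucene's:
// 1. delegating: pick an entire analyzer by field name (PerFieldAnalyzerWrapper-style)
// 2. wrapping: keep one analyzer, tweak its token output (ShingleAnalyzerWrapper-style)
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WrapperKindsSketch {
    interface SimpleAnalyzer {
        List<String> analyze(String text);
    }

    // Use case 1: delegate by field name, falling back to a default analyzer.
    static SimpleAnalyzer delegating(Map<String, SimpleAnalyzer> perField,
                                     SimpleAnalyzer fallback, String field) {
        return perField.getOrDefault(field, fallback);
    }

    // Use case 2: wrap one analyzer and post-process its tokens.
    static SimpleAnalyzer lowercasing(SimpleAnalyzer delegate) {
        return text -> delegate.analyze(text).stream()
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SimpleAnalyzer whitespace = text -> List.of(text.split("\\s+"));
        SimpleAnalyzer keyword = text -> List.of(text);

        SimpleAnalyzer forBody = delegating(Map.of("id", keyword), whitespace, "body");
        System.out.println(forBody.analyze("Hello World"));
        System.out.println(lowercasing(whitespace).analyze("Hello World"));
    }
}
```

The delegating case never needs its own token-stream state, which is why it can be made cheap; the wrapping case genuinely transforms streams and needs its own components.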
Hi, I will look into this in a moment. Thanks for the ping on Twitter!
```java
        return super.wrapReader(fieldName, reader);
    }

    private static class DelegatingReuseStrategy extends ReuseStrategy {
```
I would make this a non-static inner class. That way you don't need the "wrapper" field (it is passed down automatically).
I am not 100% sure, though; maybe the javac compiler disallows creating a non-static inner class while "this" is not yet usable... If it does not work, ignore this comment :-)
Indeed, the compiler barfs when it's not static :)
And the compiler is right: before the super constructor is called, `this` is not yet usable, not even defined according to the JLS. And we would be creating the inner class before the super constructor is called (the argument is evaluated first)!
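A minimal, self-contained example of the JLS point above (class names here are invented for illustration): a non-static inner class needs an enclosing `this`, but `this` does not exist yet while the super constructor's arguments are being evaluated, so a static nested class must be used instead.

```java
// Sketch: why a non-static inner class cannot be passed to super(...).
// 'this' is undefined while the super constructor argument is evaluated.
public class SuperArgSketch {
    static class Base {
        final Object strategy;
        Base(Object strategy) { this.strategy = strategy; }
    }

    static class Wrapper extends Base {
        // A non-static inner class captures the enclosing Wrapper instance.
        class InnerStrategy { }

        // A static nested class does not, so it can be built before 'this' exists.
        static class StaticStrategy { }

        Wrapper() {
            // super(new InnerStrategy()); // does NOT compile: cannot reference
            //                             // 'this' before the supertype
            //                             // constructor has been called
            super(new StaticStrategy());   // fine: no enclosing instance needed
        }
    }

    public static void main(String[] args) {
        Wrapper w = new Wrapper();
        System.out.println(w.strategy.getClass().getSimpleName());
    }
}
```

This is exactly why the `DelegatingReuseStrategy` in the patch has to stay static and carry an explicit reference to its wrapper.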
Hi,
In any case we should put this thing into Lucene, too, and make PerFieldAnalyzerWrapper extend it! Strong +1, and fighting against memory waste :-)
In addition, I have the feeling we should reorder the code extract above from Lucene so that the wrapping of the reader is done after the components are created, or at the beginning of the method. The current code is hard to understand because the initReader() call sits in the middle of the other logic! This is a relic from the time when Tokenizers got the reader in the constructor (oh my god, thanks @rmuir for fixing this!)
In Lucene 4.x we still have the reader in the Tokenizer's constructor. But we still don't need to make the wrapReader() method final. If the delegating AnalyzerWrapper wraps the reader and stores the TokenStream with the wrapped reader in the delegate, it's still no problem: when the components are reused, both the delegate and the delegator can set a new reader, wrapped or not. In addition, when the Tokenizer is closed, it unsets the reader, so the cache no longer holds a reader.
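The reuse behaviour Uwe describes can be modelled with a mock (this is a hedged stand-in, not the Lucene Tokenizer API): a cached tokenizer simply receives a fresh reader on reuse, and closing it drops the reader so the per-thread cache never pins one.

```java
// Mock of the reuse lifecycle described above (illustrative, not Lucene's API):
// setReader() attaches input for the next document; close() unsets it so the
// reuse cache holds no reader between uses.
import java.io.Reader;
import java.io.StringReader;

public class ReaderReuseSketch {
    static class MockTokenizer {
        private Reader input; // null when closed: the cache pins no reader

        void setReader(Reader reader) { this.input = reader; }
        boolean hasReader() { return input != null; }
        void close() { this.input = null; } // unset on close, as described
    }

    public static void main(String[] args) {
        MockTokenizer cached = new MockTokenizer(); // lives in a reuse cache

        cached.setReader(new StringReader("first document"));
        System.out.println("in use: " + cached.hasReader());

        cached.close();
        System.out.println("after close: " + cached.hasReader());

        // On reuse, a new reader (wrapped or not) can simply be set again.
        cached.setReader(new StringReader("second document"));
        System.out.println("reused: " + cached.hasReader());
    }
}
```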
Cool, great. I will push this for now, and we will move to whatever improvement happens in Lucene.
@uschindler cool, yeah, both additional cases are not relevant in the ES case. I added an assert that will trip once we update to 4.10, so we remove this class and use the one from the issue. We can't use per-field analyzer wrapper (I assume you mean the FieldNamesAnalyzer class); this is going through additional changes that will cause it to diverge from it even more, which I think is fine.