Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use non analyzed token stream optimization everywhere #6001

Closed
wants to merge 2 commits into from

Conversation

kimchy
Copy link
Member

@kimchy kimchy commented Apr 30, 2014

In the string type, we have an optimization to reuse the StringTokenStream on a thread local when a non analyzed field is used (instead of creating it each time). We should use this across the board on all places where we create a field with a String.
Also, move to a specific XStringField, that we can reuse StringTokenStream instead of copying it.

In the string type, we have an optimization to reuse the StringTokenStream on a thread local when a non analyzed field is used (instead of creating it each time). We should use this across the board on all places where we create a field with a String.
Also, move to a specific XStringField, that we can reuse StringTokenStream instead of copying it.
@rmuir
Copy link
Contributor

rmuir commented Apr 30, 2014

I don't know much about Elasticsearch's thread churn, but maybe CloseableThreadLocal is interesting here? It is annoying as it seems to produce ghosts in profilers (https://issues.apache.org/jira/browse/LUCENE-4474) as it calls a native method when it purges, but i played some more and this does seem to be a ghost.

@mikemccand
Copy link
Contributor

Patch looks good! Maybe add a comment that StringTokenStream comes from Lucene?

We are not nervous about StringTokenStream always hanging onto the last value it indexed? I imagine these values are typically small ...

@kimchy
Copy link
Member Author

kimchy commented Apr 30, 2014

@rmuir the problem with CloseableThreadLocal is that we use a static thread local, and there isn't a good place to close it. Even when closing a node, there might be other nodes running in the JVM (like in our test infra). We could go with non static thread local, but then there will be just too many of those... . I think its ok for this case. Also, because we have a fully defined thread pool model for all the places that might use it (index/bulk thread pool), I am good with it.

@mikemccand I think its evident that StringTokenStream, if you think docs are needed, I will add it. Regarding the value, yea, I don't think it will be a big problem, but it would be nice if in Lucene, maybe the value can be explicitly nullified post usage? We can create a temp class that mirrors StringTokenStream on ES side till its done, not sure its needed though.

@mikemccand
Copy link
Contributor

@kimchy OK it's fine with me to skip the comment ...

@rmuir
Copy link
Contributor

rmuir commented Apr 30, 2014

@kimchy I think CloseableThreadLocal has a confusing name actually (and documentation).

I am not speaking of its close() properties here, instead the fact that it has a built-in garbage collection mechanism: it periodically purges (this is amortized over the get()s, as a multiplier * number of threads enrolled)

@kimchy
Copy link
Member Author

kimchy commented Apr 30, 2014

@rmuir ahh, yea, agreed, it is still not requires in ES because of the bounded threads that will be allowed to use it? I can easily change it, but the main merit of using it in the context of ES is actually using close where we can (because threads don't come and go)

@rmuir
Copy link
Contributor

rmuir commented Apr 30, 2014

Yeah i am unsure if its appropriate here or not. The only benefit really would be that the "GC" of threadlocal here would be consistent with what is happening with Analyzed fields too (since it uses CTL)

@kimchy
Copy link
Member Author

kimchy commented Apr 30, 2014

@rmuir kk, will add it as an additional safety measure

not for the close method (since its static), but more for its amortized GC of thread locals
@kimchy
Copy link
Member Author

kimchy commented Apr 30, 2014

added, if all is good, will push it soonish

@rmuir
Copy link
Contributor

rmuir commented Apr 30, 2014

+1

I will work with mike to fix this guy in lucene too.

@kimchy kimchy closed this in 23f200b Apr 30, 2014
kimchy added a commit that referenced this pull request Apr 30, 2014
In the string type, we have an optimization to reuse the StringTokenStream on a thread local when a non analyzed field is used (instead of creating it each time). We should use this across the board on all places where we create a field with a String.
Also, move to a specific XStringField, that we can reuse StringTokenStream instead of copying it.
closes #6001
@kimchy kimchy deleted the improve_non_analyzed_field branch April 30, 2014 21:19
@mikemccand
Copy link
Contributor

@kimchy Turns out StringTokenStream already sets value=null in its close method ...

@kimchy
Copy link
Member Author

kimchy commented Apr 30, 2014

ahh, cool!

@mikemccand
Copy link
Contributor

I opened https://issues.apache.org/jira/browse/LUCENE-5634 to get this into Lucene.

@clintongormley clintongormley added the :Search/Analysis How text is split into tokens label Jun 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants