Allow using the FVH and Postings Highlighter without storing extra data #5184
Speeds things up, especially when you have lots of matched fields that share the same indexed field.
This code lies more than MemoryIndex does, but it is faster and doesn't lie enough to break the FVH. It also adds term filtering to the analysis, which saves a ton of time.
Nothing fancy yet for skipping the offsets on short documents like the FVH does.
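To make the term-filtering idea concrete, here is a toy plain-Java sketch (not the PR's actual code; the `FilteredOffsets` name and the whitespace tokenizer are made up for illustration, where the real code runs the field's Lucene Analyzer). The point is that only terms that actually appear in the query get their offsets recorded, so non-matching terms are skipped almost for free:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: re-analyze a stored field value, but keep offsets
// only for terms that appear in the query. Skipping the rest early is
// where the analysis-time savings come from.
public class FilteredOffsets {
    public record TermOffset(String term, int start, int end) {}

    // Toy whitespace "analyzer"; a real implementation runs the field's Analyzer.
    public static List<TermOffset> collect(String text, Set<String> queryTerms) {
        List<TermOffset> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') i++;  // skip spaces
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++;  // consume token
            if (i > start) {
                String term = text.substring(start, i).toLowerCase();
                if (queryTerms.contains(term)) {  // the filter step
                    out.add(new TermOffset(term, start, i));
                }
            }
        }
        return out;
    }
}
```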
What I think I have left: more stuff?
I have no idea if this is faster yet, but it ought to be less gnarly from a memory standpoint.
This helps a bunch with memory but isn't perfect. A few things we could still improve, though I don't really feel like doing them now:
1. Rebuild SliceWriter and SliceReader against Elasticsearch's IntArray so that we can build a smaller one and throw out XIntBlockPool.
2. Cache the BytesRefHash between analyses - maybe use it instead of a Set<String> for filtering, using the term attribute that spits out the BytesRef and its hash code at the same time.
It wasn't helping
I'm going to set this aside for a few days until someone has time to review it. This is what I (personally) think I could get out of this:
In the instance I'm able to easily test, I'm seeing 5% space savings with a performance improvement, and 25-30% savings for a performance decrease of under a tenth of a millisecond per document. The index is only a gig with replication, so none of this takes disk seek time into account, but I imagine that'll tip things further towards storing fewer term vectors.
Oh, I'll also see some decreased load from indexing. I'm not sure how much. If indexing is spread evenly across all the documents and everything behaves like my test index (ha!) then it'll be 15%.
Is this too nasty a hack to be worth it? I'd love to keep working on this, or on something with similar goals, but I don't think it's worth doing anything else without a review from someone closer to the project.
I like this feature! You are exploring it in the context of highlighting, but I think computing term vectors dynamically could be very useful to the term vectors API as well. Maybe this pull request should be split into smaller pull requests in order to help get the change in. I think there are at least 3 different changes:
I think 2. is trickier than it sounds, because if a single field has two values, one large and one small, then either both values or neither should have term vectors. Regarding 3., I'm still wondering whether it should be done at all. I like the postings highlighter because it is fast, and I'm not sure I want to let users have a bad (slow) experience with it because of dynamically-generated offsets. :-) (I'm less concerned about on-the-fly term vectors since term vectors are already slow.) Implementation-wise, I see that you reimplemented some indexing logic. Maybe a cleaner way to do that would be to contribute term vectors support to MemoryIndex?
Fine by me. I'll do that soonish.
Yeah. What I've got works for my use case because my multi-valued fields are all short strings. I imagine that is pretty common. But it works accidentally, which isn't good.
The value in computing offsets on the fly is to see what the postings highlighter would do with your document. It's not something you'd want to do all the time, like you might with term vectors. It's "easy" to jam it into the term-vector-computing logic, but it adds another hack to maintain. If I can convince you this is a good idea, we should combine proposed PRs number 3 and number 1. It'd make my life easier.
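To make the on-the-fly idea concrete, here is a toy plain-Java sketch of what "term vectors computed dynamically" boils down to (the `OnTheFlyVectors` name and the whitespace tokenizer are invented for illustration; the real code uses Lucene's analysis chain and pooled buffers rather than boxed lists): one pass over the tokens builds a term-to-postings map with positions and offsets, with nothing stored in the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: build term vectors (term -> positions + offsets)
// from a stored field value at request time instead of index time.
public class OnTheFlyVectors {
    public record Posting(int position, int startOffset, int endOffset) {}

    public static Map<String, List<Posting>> build(String text) {
        Map<String, List<Posting>> vectors = new TreeMap<>();
        int pos = 0, i = 0;
        while (i < text.length()) {
            while (i < text.length() && text.charAt(i) == ' ') i++;  // skip spaces
            int start = i;
            while (i < text.length() && text.charAt(i) != ' ') i++;  // consume token
            if (i > start) {
                String term = text.substring(start, i).toLowerCase();
                vectors.computeIfAbsent(term, t -> new ArrayList<>())
                       .add(new Posting(pos++, start, i));
            }
        }
        return vectors;
    }
}
```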
I started with MemoryIndex! It spits out term vectors no problem. I dropped it in favor of that hand-written storage (which I cribbed from MemoryIndex anyway) for a few reasons:
I was wondering if it'd be useful to port IntBlockPool to BigArrays, maybe in such a way that I could contribute it back to Lucene. I know it's used in a bunch of places, but I'm not sure how big a hot spot they are. It is involved in term vector writing... Did I mention that only storing term vectors for large fields speeds up indexing quite a bit compared to blindly storing them for everything? It's pretty cool.
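For readers unfamiliar with the structure being discussed, here is a minimal plain-Java sketch of the paged layout behind IntBlockPool and BigArrays (the `PagedIntArray` name and page size are assumptions for illustration, not either library's API): ints live in fixed-size pages, so growing never copies existing data and no huge contiguous array is ever allocated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a paged, append-only int store in the spirit of
// IntBlockPool / BigArrays: growth adds a page instead of reallocating.
public class PagedIntArray {
    private static final int PAGE_SIZE = 1024;  // assumption; real pools tune this
    private final List<int[]> pages = new ArrayList<>();
    private int size = 0;

    public void append(int value) {
        if (size % PAGE_SIZE == 0) {
            pages.add(new int[PAGE_SIZE]);  // grow one fixed-size page at a time
        }
        pages.get(size / PAGE_SIZE)[size % PAGE_SIZE] = value;
        size++;
    }

    public int get(int index) {
        return pages.get(index / PAGE_SIZE)[index % PAGE_SIZE];
    }

    public int size() {
        return size;
    }
}
```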
Good points, I can definitely imagine how this helps performance. On the other hand, using MemoryIndex is appealing since it would make this feature very easily maintainable. @s1monw what do you think?
Maybe we could just change the page size of
I'm not surprised, term vectors are very expensive! (It's a bit crazy that highlighters use them.)
We'll add this back in another pull request.
I'm wondering where to go from here. I'll do more work on this, but I'd like some advice on what's next. Should I concentrate on making the lying readers less deceitful? Should I look at plugging this into the term vectors API?
I made an XIntBlockPool which works OK. Certainly not perfect.
Abandoning in favor of getting this Elasticsearch plugin released, which has this: https://github.com/nik9000/expiremental-highlighter
…he index. Adds the ability to the Term Vector API to generate term vectors for some chosen fields, even though they haven't been explicitly stored in the index. Relates to elastic#5184 Closes elastic#6567
Right now this is a work in progress, more for review than anything. I'll squash it and make it better later.
I'm sure there are tons of things wrong with this but it could save me a ton of disk space and speed up highlighting. But, yeah, it is a huge hack.
Closes #5183