
Allow using the FVH and Postings Highlighter without storing extra data #5184

Closed
wants to merge 14 commits into from

Conversation

@nik9000 nik9000 commented Feb 19, 2014

Right now this is a work in progress, more for review than anything. I'll squash it and make it better later.

I'm sure there are tons of things wrong with this, but it could save me a ton of disk space and speed up highlighting. But, yeah, it is a huge hack.

Closes #5183

Speeds things up especially when you have lots of matched fields that
share the same indexed field.
This code lies more than MemoryIndex does, but it is faster and doesn't lie
enough to break the FVH.

Also adds term filtering to the analysis, which saves a ton of time.
Nothing fancy yet for skipping the offsets on short documents like the
FVH does.
nik9000 commented Feb 21, 2014

What I think I have left:
1. termSet support for the Postings highlighter
2. Decide if I actually should sort the terms. If we filter them there aren't too many to sort, and it might speed up the FVH's frequency lookups because it'd change random reads into more linear ones. Maybe. The Postings highlighter doesn't seem to care.
3. Double and triple check tests
4. More tests not part of the HighlighterSearchTests? Lower level, probably
5. Cleanup whitespace and formatting issues

More stuff?
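The term filtering in point 1 might conceptually look like this (a simplified plain-Java sketch, not the actual patch; `Token` and `keepQueryTerms` are made-up names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical token: term text plus character offsets.
record Token(String term, int startOffset, int endOffset) {}

class TermFilter {
    /**
     * Keep only tokens whose term appears in the query. When building
     * term vectors on the fly purely for highlighting, tokens that can
     * never match are dead weight, so dropping them early shrinks the
     * in-memory vectors and speeds up the highlighter.
     */
    static List<Token> keepQueryTerms(List<Token> tokens, Set<String> queryTerms) {
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (queryTerms.contains(t.term())) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```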

nik9000 commented Feb 21, 2014

6. See about using the recycler
7. Recheck value lookup caching. I have a suspicion it might not be working/worth the trouble.

I have no idea if this is faster yet, but it ought to be less gnarly from
a memory standpoint.
This helps a bunch with memory but isn't perfect.

A few things we could really improve, but I don't really feel like it now:
1.  Rebuild SliceWriter and SliceReader against Elasticsearch's IntArray
so that we can build something smaller and throw out XIntBlockPool.
2.  Cache the BytesRefHash between analyses - maybe use it instead of a
Set<String> for filtering, and use the attribute that spits out the BytesRef
and the hashcode of the term at the same time.
It wasn't helping
nik9000 commented Feb 21, 2014

8. Recheck performance against master.

nik9000 commented Feb 21, 2014

  1. If the user sets term_vector_over_x then use the FVH by default.

nik9000 commented Feb 24, 2014

I'm going to set this aside for a few days until someone has time to review it. This is what I (personally) think I could get out of this:

  1. Ability to try the postings highlighter in production without a full reindex cycle.
  2. A modest space savings on my indexes and a decent highlighting performance improvement.
  3. The chance to sacrifice some of this improvement to save more space for data that is searched less frequently.

In the one instance I'm able to test easily, I'm seeing 5% space savings with a performance improvement, and 25-30% space savings for a performance decrease of under a tenth of a millisecond per document. The index is only a gig with replication, so none of this takes disk seek time into account, but I imagine that'll tip things further toward storing fewer term vectors.

nik9000 commented Feb 24, 2014

Oh, I'll also see some decreased load from indexing. I'm not sure how much. If indexing is spread evenly across all the documents and everything behaves like my test index (ha!), then it'll be 15%.

nik9000 commented Mar 3, 2014

Is this too nasty a hack to be worth it? I'd love to work more on this or something with similar goals, but I don't think it is worth doing anything else without a review from someone closer to the project.

jpountz commented Mar 4, 2014

I like this feature! You are exploring it in the context of highlighting but I think computing term vectors dynamically could be very useful to the term vectors API as well.

Maybe this pull request should be split into smaller pull requests to help get the change in. I think there are at least 3 different changes:

  1. add the ability to compute term vectors on the fly (useful for both highlighting and the term vectors API)
  2. add an option to the string field mapper to only store term vectors on large fields to save index space
  3. add the ability to compute offsets on the fly

I think 2. is more tricky than it sounds because if a single field has two values, one large and one small, then either none of the values or both of them should have term vectors.

Regarding 3. I'm still wondering whether it should be done. I like the postings highlighter because it is fast and I'm not sure I want to enable users to have bad (slow) experience with it because of dynamically-generated offsets. :-) (I'm less concerned about on-the-fly term vectors since term vectors are already slow).

Implementation-wise, I see that you reimplemented some indexing logic. Maybe a clean way to do that would be to contribute term vectors support to MemoryIndex?
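Conceptually, computing term vectors on the fly just means re-analyzing the stored field value at highlight (or term-vectors-API) time instead of reading stored data. A stripped-down illustration, using whitespace-and-lowercase "analysis" where the real change would run the field's configured analyzer (all names here are made up):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OnTheFlyVectors {
    // One posting: position plus character offsets, as a highlighter needs.
    record Posting(int position, int startOffset, int endOffset) {}

    /**
     * Re-"analyze" a stored field value and collect, per term, the
     * positions and offsets a stored term vector would have contained.
     * Analysis here is just lowercased whitespace splitting.
     */
    static Map<String, List<Posting>> compute(String fieldValue) {
        Map<String, List<Posting>> vectors = new HashMap<>();
        int position = 0;
        int offset = 0;
        for (String raw : fieldValue.split("\\s+")) {
            if (raw.isEmpty()) { offset += 1; continue; }
            int start = fieldValue.indexOf(raw, offset);
            int end = start + raw.length();
            vectors.computeIfAbsent(raw.toLowerCase(), k -> new ArrayList<>())
                   .add(new Posting(position, start, end));
            position++;
            offset = end;
        }
        return vectors;
    }
}
```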

nik9000 commented Mar 4, 2014

Maybe this pull request should be split into smaller pull requests to help get the change in. I think there are at least 3 different changes:

Fine by me. I'll do that soonish.

I think 2. is more tricky than it sounds because if a single field has two values, one large and one small, then either none of the values or both of them should have term vectors.

Yeah. What I've got works for my use case because my multi-valued fields are all short strings. I imagine that is pretty common. But it works accidentally, which isn't good.

Regarding 3. I'm still wondering whether it should be done. I like the postings highlighter because it is fast and I'm not sure I want to enable users to have bad (slow) experience with it because of dynamically-generated offsets. :-) (I'm less concerned about on-the-fly term vectors since term vectors are already slow).

The value in computing offsets on the fly is to see what the postings highlighter would do with your document. It's not something you'd want to do all the time like you might with term vectors. It's "easy" to jam it into the term-vector-computing logic, but it adds another hack to maintain.

I think if I can convince you this is a good idea, we should combine the proposed PR number 3 and number 1. It'd make my life easier.

Implementation-wise, I see that you reimplemented some indexing logic. Maybe a clean way to do that would be to contribute term vectors support to MemoryIndex?

I started with MemoryIndex! It spits out term vectors no problem. I dropped it in favor of the hand-written storage (which I cribbed from MemoryIndex anyway) for a few reasons:

  1. MemoryIndex doesn't hook into Elasticsearch's page caching infrastructure and it does make big arrays of bytes and ints. Probably not insurmountable.
  2. Limiting the terms that you store vectors for is crucial from a performance perspective. It felt cleaner to do it in the hand-rolled code than by adding a term filter. Maybe that is just craziness. I think I could squeeze more performance out of it this way, especially if I mess around with reusing the BytesRefHash across separate highlighted fields and use the attribute that spits out the term with the hashcode.
  3. MemoryIndex wants to sort terms, which is technically correct but not required for either highlighter. This isn't a big deal for documents with few terms, or if you are limiting the vectors to just a few terms. I have decent-sized documents and hadn't implemented the filter yet, so MemoryIndex looked like it was really getting in the way. Maybe it wasn't, and filtering the terms would have been better.

I was wondering if it'd be useful to port IntBlockPool to BigArrays, maybe in such a way that I could submit it back to Lucene. I know it is used in a bunch of places but I'm not sure how big a hot spot they are. It is involved in term vector writing....
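For anyone unfamiliar with the structure under discussion: IntBlockPool-style storage appends ints into fixed-size pages so growth never reallocates or copies existing data, which is what a BigArrays-backed port would need to preserve. A toy sketch of just the paging logic (not Lucene's or Elasticsearch's actual classes):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy paged int buffer: appends go into fixed-size pages, so growth
 * adds a page instead of copying. Lucene's IntBlockPool uses 8192-int
 * (32KB) pages; BigArrays pages were 16KB at the time, which is why
 * the sizes don't line up.
 */
class PagedIntBuffer {
    static final int PAGE_SIZE = 8192;          // ints per page (32KB)
    private final List<int[]> pages = new ArrayList<>();
    private int count = 0;

    void append(int value) {
        if (count % PAGE_SIZE == 0) {
            pages.add(new int[PAGE_SIZE]);      // grow by one page, no copying
        }
        pages.get(count / PAGE_SIZE)[count % PAGE_SIZE] = value;
        count++;
    }

    int get(int index) {
        return pages.get(index / PAGE_SIZE)[index % PAGE_SIZE];
    }

    int size() { return count; }
}
```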

Did I mention that only storing term vectors for large fields speeds up indexing quite a bit compared to blindly storing them for everything? It's pretty cool.

@jpountz jpountz self-assigned this Mar 4, 2014
jpountz commented Mar 4, 2014

Limiting the terms that you store vectors for is crucial from a performance perspective.
MemoryIndex wants to sort terms

Good points, I can definitely imagine how this helps performance. On the other hand, using MemoryIndex is appealing since it would make this feature very easily maintainable. @s1monw what do you think?

I was wondering if it'd be useful to port IntBlockPool to BigArrays, maybe in such a way that I could submit it back to Lucene. I know it is used in a bunch of places but I'm not sure how big a hot spot they are. It is involved in term vector writing....

Maybe we could just change the page size of BigArrays from 16KB to 32KB so that it can be used to write allocators for IntBlockPool?
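The arithmetic behind that suggestion: Lucene's IntBlockPool blocks are 8192 ints, i.e. 8192 × 4 = 32KB per block, so 16KB pages can't back a whole block while 32KB pages would map one-to-one. A quick sanity check of that math (illustrative helper, not a real API):

```java
class PageMath {
    static final int INT_BLOCK_SIZE = 8192;  // ints per Lucene IntBlockPool block

    /** Bytes in one block of ints: 8192 * 4 = 32768 bytes = 32KB. */
    static int blockBytes() {
        return INT_BLOCK_SIZE * Integer.BYTES;
    }

    /** Can a page of the given byte size hold a whole number of blocks? */
    static boolean pageFitsBlocks(int pageBytes) {
        return pageBytes >= blockBytes() && pageBytes % blockBytes() == 0;
    }
}
```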

Did I mention that only storing term vectors for large fields speeds up indexing quite a bit compared to blindly storing them for everything? It's pretty cool.

I'm not surprised, term vectors are very expensive! (it's a bit crazy that highlighters use them)

We'll add this back in another pull request.
nik9000 commented Mar 14, 2014

Good points, I can definitely imagine how this helps performance. On the other hand, using MemoryIndex is appealing since it would make this feature very easily maintainable. @s1monw what do you think?

I'm wondering where to go from here. I'll do more work on this, but I'd like some advice on what is next. Should I concentrate on making the lying readers less deceitful? Should I look at plugging this into the term vectors API?

Maybe we could just change the page size of BigArrays from 16KB to 32KB so that it can be used to write allocators for IntBlockPool?

I made an XIntBlockPool which works OK. Certainly not perfect.

nik9000 commented Mar 31, 2014

Abandoning this in favor of getting the Elasticsearch plugin that contains this work released: https://github.com/nik9000/expiremental-highlighter

@nik9000 nik9000 closed this Mar 31, 2014
alexksikes added a commit to alexksikes/elasticsearch that referenced this pull request Jul 17, 2014
…he index.

Adds the ability to the Term Vector API to generate term vectors for some
chosen fields, even though they haven't been explicitly stored in the index.

Relates to elastic#5184
Closes elastic#6567