
Allow using the FVH and Postings Highlighter without storing extra data #5184

Closed
wants to merge 14 commits into from

Conversation

@nik9000 nik9000 commented Feb 19, 2014

Right now this is a work in progress, more for review than anything. I'll squash it and make it better later.

I'm sure there are tons of things wrong with this, but it could save me a ton of disk space and speed up highlighting. But, yeah, it is a huge hack.

Closes #5183

Speeds things up especially when you have lots of matched fields that
share the same indexed field.
This code lies more than MemoryIndex does, but it is faster and doesn't lie
enough to break the FVH.

Also adds term filtering to the analysis, which saves a ton of time.
Nothing fancy yet for skipping the offsets on short documents like the
FVH does.
nik9000 commented Feb 21, 2014

What I think I have left:
1. termSet support for the Postings highlighter
2. Decide if I actually should sort the terms. If we filter them there aren't too many to sort, and it might speed up the FVH's frequency lookups because it'd change random reads into more linear ones. Maybe. The Postings highlighter doesn't seem to care.
3. Double and triple check tests
4. More tests not part of the HighlighterSearchTests? Lower level, probably
5. Cleanup whitespace and formatting issues

More stuff?
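The term filtering in point 1 might conceptually look like this (a simplified plain-Java sketch, not the actual patch; `Token` and `keepQueryTerms` are made-up names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical token: term text plus character offsets.
record Token(String term, int startOffset, int endOffset) {}

class TermFilter {
    /**
     * Keep only tokens whose term appears in the query. When building
     * term vectors on the fly purely for highlighting, tokens that can
     * never match are dead weight, so dropping them early shrinks the
     * in-memory vectors and speeds up the highlighter.
     */
    static List<Token> keepQueryTerms(List<Token> tokens, Set<String> queryTerms) {
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (queryTerms.contains(t.term())) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```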

nik9000 commented Feb 21, 2014

6. See about using the recycler
7. Recheck value lookup caching. I have a suspicion it might not be working/worth the trouble.

I have no idea if this is faster yet, but it ought to be less gnarly from
a memory standpoint.
This helps a bunch with memory but isn't perfect.

A few things we could really improve, but I don't really feel like it now:
1.  Rebuild SliceWriter and SliceReader against Elasticsearch's IntArray
so that we can build something smaller and throw out XIntBlockPool.
2.  Cache the BytesRefHash between analyses - maybe use it instead of a
Set<String> for filtering, and use the attribute that spits out the BytesRef
and the hashcode of the term at the same time.
It wasn't helping
nik9000 commented Feb 21, 2014

8. Recheck performance against master.

nik9000 commented Feb 21, 2014

  1. If the user sets term_vector_over_x then use the FVH by default.

nik9000 commented Feb 24, 2014

I'm going to set this aside for a few days until someone has time to review it. This is what I (personally) think I could get out of this:

  1. Ability to try the postings highlighter in production without a full reindex cycle.
  2. A modest space savings on my indexes and a decent highlighting performance improvement.
  3. The chance to sacrifice some of this improvement to save more space for data that is searched less frequently.

In the one instance I'm able to test easily, I'm seeing 5% space savings with a performance improvement, and 25-30% space savings for a performance decrease of under a tenth of a millisecond per document. The index is only a gig with replication, so none of this takes disk seek time into account, but I imagine that'll tip things further toward storing fewer term vectors.

nik9000 commented Feb 24, 2014

Oh, I'll also see some decreased load from indexing. I'm not sure how much. If indexing is spread evenly across all the documents and everything behaves like my test index (ha!), then it'll be 15%.

nik9000 commented Mar 3, 2014

Is this too nasty a hack to be worth it? I'd love to work more on this or something with similar goals, but I don't think it is worth doing anything else without a review from someone closer to the project.

jpountz commented Mar 4, 2014

I like this feature! You are exploring it in the context of highlighting but I think computing term vectors dynamically could be very useful to the term vectors API as well.

Maybe this pull request should be split into smaller pull requests to help get the change in. I think there are at least 3 different changes:

  1. add the ability to compute term vectors on the fly (useful for both highlighting and the term vectors API)
  2. add an option to the string field mapper to only store term vectors on large fields to save index space
  3. add the ability to compute offsets on the fly

I think 2. is more tricky than it sounds because if a single field has two values, one large and one small, then either none of the values or both of them should have term vectors.

Regarding 3. I'm still wondering whether it should be done. I like the postings highlighter because it is fast and I'm not sure I want to enable users to have bad (slow) experience with it because of dynamically-generated offsets. :-) (I'm less concerned about on-the-fly term vectors since term vectors are already slow).

Implementation-wise, I see that you reimplemented some indexing logic. Maybe a clean way to do that would be to contribute term vectors support to MemoryIndex?
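Conceptually, computing term vectors on the fly just means re-analyzing the stored field value at highlight (or term-vectors-API) time instead of reading stored data. A stripped-down illustration, using whitespace-and-lowercase "analysis" where the real change would run the field's configured analyzer (all names here are made up):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OnTheFlyVectors {
    // One posting: position plus character offsets, as a highlighter needs.
    record Posting(int position, int startOffset, int endOffset) {}

    /**
     * Re-"analyze" a stored field value and collect, per term, the
     * positions and offsets a stored term vector would have contained.
     * Analysis here is just lowercased whitespace splitting.
     */
    static Map<String, List<Posting>> compute(String fieldValue) {
        Map<String, List<Posting>> vectors = new HashMap<>();
        int position = 0;
        int offset = 0;
        for (String raw : fieldValue.split("\\s+")) {
            if (raw.isEmpty()) { offset += 1; continue; }
            int start = fieldValue.indexOf(raw, offset);
            int end = start + raw.length();
            vectors.computeIfAbsent(raw.toLowerCase(), k -> new ArrayList<>())
                   .add(new Posting(position, start, end));
            position++;
            offset = end;
        }
        return vectors;
    }
}
```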

nik9000 commented Mar 4, 2014

Maybe this pull request should be split into smaller pull requests to help get the change in. I think there are at least 3 different changes:

Fine by me. I'll do that soonish.

I think 2. is more tricky than it sounds because if a single field has two values, one large and one small, then either none of the values or both of them should have term vectors.

Yeah. What I've got works for my use case because my multi-valued fields are all short strings. I imagine that is pretty common. But it works accidentally, which isn't good.

Regarding 3. I'm still wondering whether it should be done. I like the postings highlighter because it is fast and I'm not sure I want to enable users to have bad (slow) experience with it because of dynamically-generated offsets. :-) (I'm less concerned about on-the-fly term vectors since term vectors are already slow).

The value in computing offsets on the fly is to see what the postings highlighter would do with your document. It's not something you'd want to do all the time like you might with term vectors. It's "easy" to jam it into the term-vector-computing logic, but it adds another hack to maintain.

I think if I can convince you this is a good idea, we should combine the proposed PR number 3 and number 1. It'd make my life easier.

Implementation-wise, I see that you reimplemented some indexing logic. Maybe a clean way to do that would be to contribute term vectors support to MemoryIndex?

I started with MemoryIndex! It spits out term vectors no problem. I dropped it in favor of the hand-written storage (which I cribbed from MemoryIndex anyway) for a few reasons:

  1. MemoryIndex doesn't hook into Elasticsearch's page caching infrastructure and it does make big arrays of bytes and ints. Probably not insurmountable.
  2. Limiting the terms that you store vectors for is crucial from a performance perspective. It felt cleaner to do it in the hand-rolled code than by adding a term filter. Maybe that is just craziness. I think I could squeeze more performance out of it this way, especially if I mess around with reusing the BytesRefHash across separate highlighted fields and use the attribute that spits out the term with the hashcode.
  3. MemoryIndex wants to sort terms, which is technically correct but not required for either highlighter. This isn't a big deal for documents with few terms, or if you are limiting the vectors to just a few terms. I have decent-sized documents and hadn't implemented the filter yet, so MemoryIndex looked like it was really getting in the way. Maybe it wasn't, and filtering the terms would have been better.

I was wondering if it'd be useful to port IntBlockPool to BigArrays, maybe in such a way that I could submit it back to Lucene. I know it is used in a bunch of places but I'm not sure how big a hot spot they are. It is involved in term vector writing....
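For anyone unfamiliar with the structure under discussion: IntBlockPool-style storage appends ints into fixed-size pages so growth never reallocates or copies existing data, which is what a BigArrays-backed port would need to preserve. A toy sketch of just the paging logic (not Lucene's or Elasticsearch's actual classes):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy paged int buffer: appends go into fixed-size pages, so growth
 * adds a page instead of copying. Lucene's IntBlockPool uses 8192-int
 * (32KB) pages; BigArrays pages were 16KB at the time, which is why
 * the sizes don't line up.
 */
class PagedIntBuffer {
    static final int PAGE_SIZE = 8192;          // ints per page (32KB)
    private final List<int[]> pages = new ArrayList<>();
    private int count = 0;

    void append(int value) {
        if (count % PAGE_SIZE == 0) {
            pages.add(new int[PAGE_SIZE]);      // grow by one page, no copying
        }
        pages.get(count / PAGE_SIZE)[count % PAGE_SIZE] = value;
        count++;
    }

    int get(int index) {
        return pages.get(index / PAGE_SIZE)[index % PAGE_SIZE];
    }

    int size() { return count; }
}
```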

Did I mention that only storing term vectors for large fields speeds up indexing quite a bit compared to blindly storing them for everything? It's pretty cool.

@jpountz jpountz self-assigned this Mar 4, 2014
jpountz commented Mar 4, 2014

Limiting the terms that you store vectors for is crucial from a performance perspective.
MemoryIndex wants to sort terms

Good points, I can definitely imagine how this helps performance. On the other hand, using MemoryIndex is appealing since it would make this feature very easily maintainable. @s1monw what do you think?

I was wondering if it'd be useful to port IntBlockPool to BigArrays, maybe in such a way that I could submit it back to Lucene. I know it is used in a bunch of places but I'm not sure how big a hot spot they are. It is involved in term vector writing....

Maybe we could just change the page size of BigArrays from 16KB to 32KB so that it can be used to write allocators for IntBlockPool?
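The arithmetic behind that suggestion: Lucene's IntBlockPool blocks are 8192 ints, i.e. 8192 × 4 = 32KB per block, so 16KB pages can't back a whole block while 32KB pages would map one-to-one. A quick sanity check of that math (illustrative helper, not a real API):

```java
class PageMath {
    static final int INT_BLOCK_SIZE = 8192;  // ints per Lucene IntBlockPool block

    /** Bytes in one block of ints: 8192 * 4 = 32768 bytes = 32KB. */
    static int blockBytes() {
        return INT_BLOCK_SIZE * Integer.BYTES;
    }

    /** Can a page of the given byte size hold a whole number of blocks? */
    static boolean pageFitsBlocks(int pageBytes) {
        return pageBytes >= blockBytes() && pageBytes % blockBytes() == 0;
    }
}
```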

Did I mention that only storing term vectors for large fields speeds up indexing quite a bit compared to blindly storing them for everything? It's pretty cool.

I'm not surprised, term vectors are very expensive! (it's a bit crazy that highlighters use them)

We'll add this back in another pull request.
nik9000 commented Mar 14, 2014

Good points, I can definitely imagine how this helps performance. On the other hand, using MemoryIndex is appealing since it would make this feature very easily maintainable. @s1monw what do you think?

I'm wondering where to go from here. I'll do more work on this, but I'd like some advice on what is next. Should I concentrate on making the lying readers less deceitful? Should I look at plugging this into the term vectors API?

Maybe we could just change the page size of BigArrays from 16KB to 32KB so that it can be used to write allocators for IntBlockPool?

I made an XIntBlockPool which works OK. Certainly not perfect.

nik9000 commented Mar 31, 2014

Abandoning this in favor of getting the Elasticsearch plugin that contains this work released: https://github.com/nik9000/expiremental-highlighter

@nik9000 nik9000 closed this Mar 31, 2014
alexksikes added a commit to alexksikes/elasticsearch that referenced this pull request Jul 17, 2014
…he index.

Adds the ability to the Term Vector API to generate term vectors for some
chosen fields, even though they haven't been explicitly stored in the index.

Relates to elastic#5184
Closes elastic#6567