Per-field boosting of the _all field is broken unless very specific conditions are met #4315

Closed
jpountz opened this issue Dec 2, 2013 · 7 comments

Comments

@jpountz
Contributor

jpountz commented Dec 2, 2013

The _all field uses payloads to store per-field boosts in a single index field. However, the implementation relies on the token stream not eagerly consuming the input java.io.Reader (see AllEntries.read). In practice, boosting on the _all field doesn't work under any of these circumstances:

  • there is a char filter,
  • the tokenizer is not the standard tokenizer,
  • any token filter has read-ahead logic.
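To illustrate the failure mode described above, here is a minimal sketch. It is not the actual AllEntries code: `BoostedReader`, `currentBoost` and `tokenBoosts` are hypothetical names, and the whitespace splitting stands in for a real tokenizer. The reader reports the boost of the entry it last read from, so a consumer that buffers the whole input before tokenizing sees only the last entry's boost.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class EagerReadDemo {
    /** Hypothetical stand-in for AllEntries: concatenates entries and tracks the entry being read. */
    static final class BoostedReader extends Reader {
        private final String[] texts;
        private final float[] boosts;
        private int entry = 0; // index of the entry currently being read
        private int pos = 0;   // position inside that entry

        BoostedReader(String[] texts, float[] boosts) {
            this.texts = texts;
            this.boosts = boosts;
        }

        /** Boost of the entry the reader is positioned on, not of any particular token. */
        float currentBoost() {
            return boosts[Math.min(entry, boosts.length - 1)];
        }

        @Override
        public int read(char[] buf, int off, int len) {
            // Move to the next entry only once the current one is exhausted.
            while (entry < texts.length && pos == texts[entry].length()) {
                entry++;
                pos = 0;
            }
            if (entry >= texts.length) {
                return -1;
            }
            int n = Math.min(len, texts[entry].length() - pos);
            texts[entry].getChars(pos, pos + n, buf, off);
            pos += n;
            return n;
        }

        @Override
        public void close() {
        }
    }

    /** Splits on whitespace and records reader.currentBoost() at the end of each token. */
    static List<Float> tokenBoosts(BoostedReader reader, boolean eager) {
        try {
            Reader in = reader;
            if (eager) {
                // Simulates a char filter or read-ahead: drain the reader before tokenizing.
                StringBuilder all = new StringBuilder();
                char[] buf = new char[1024];
                int n;
                while ((n = reader.read(buf, 0, buf.length)) != -1) {
                    all.append(buf, 0, n);
                }
                in = new StringReader(all.toString());
            }
            List<Float> out = new ArrayList<>();
            boolean inToken = false;
            int c;
            do {
                c = in.read();
                if (c == -1 || Character.isWhitespace(c)) {
                    if (inToken) {
                        out.add(reader.currentBoost());
                        inToken = false;
                    }
                } else {
                    inToken = true;
                }
            } while (c != -1);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] texts = {"quick fox ", "lazy dog"};
        float[] boosts = {2.0f, 1.0f};
        System.out.println(tokenBoosts(new BoostedReader(texts, boosts), false)); // [2.0, 2.0, 1.0, 1.0]
        System.out.println(tokenBoosts(new BoostedReader(texts, boosts), true));  // [1.0, 1.0, 1.0, 1.0]
    }
}
```

In the lazy case each token picks up the boost of the field it came from; in the eager case the tokens from the first field lose their boost of 2.0.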
@roytmana

roytmana commented Dec 2, 2013

Could you also consider a wider scope of

  1. Per-field boost in multi_field; see "Index time boost in multi_field ignored?" #4108
  2. Infrastructure for boosting fragments of input text at index time. This would allow some sort of markup in the indexed JSON to supply a boost for fragments of text. A common use case is finding and boosting important fragments as part of indexing.

@ghost ghost assigned jpountz Dec 3, 2013
@jpountz
Contributor Author

jpountz commented Dec 3, 2013

@roytmana The two issues you mention are actually quite tough to implement, so I would like to concentrate on just fixing boosting on the _all field for now.

@roytmana

roytmana commented Dec 3, 2013

@jpountz isn't #1 quite similar to _all?
I understand _all is searched in a special way that takes the per-field boosts stored as postings into account. Couldn't the same be done for multifields?

@jpountz
Contributor Author

jpountz commented Dec 3, 2013

@roytmana a similar method could indeed be applied. But I'm not fully happy with the way per-field boosting works for the _all field, so I would like us to consider improving it before applying the same logic elsewhere. In particular, it doesn't work with all queries (e.g. phrase queries) and is quite wasteful storage-wise: 4 bytes per occurrence of a term whose field has a boost other than 1. I wouldn't be surprised to see that it sometimes almost doubles the size of the inverted index for the _all field.
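To make the storage cost concrete, here is a rough sketch of a boost stored as a 4-byte float per boosted term occurrence. The class and method names are hypothetical, and the encoding shown is an assumption for illustration, not necessarily the byte layout Elasticsearch uses.

```java
import java.nio.ByteBuffer;

final class BoostPayloadSketch {
    /** Encodes a boost as a 4-byte big-endian float (assumed layout, for illustration only). */
    static byte[] encode(float boost) {
        return ByteBuffer.allocate(Float.BYTES).putFloat(boost).array();
    }

    /** Decodes a boost written by encode(). */
    static float decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    /** Extra bytes spent on payloads when every boosted occurrence carries one. */
    static long payloadBytes(long boostedOccurrences) {
        return boostedOccurrences * Float.BYTES;
    }

    public static void main(String[] args) {
        byte[] payload = encode(2.5f);
        System.out.println(payload.length + " bytes, boost " + decode(payload));
        System.out.println(payloadBytes(1_000_000L) + " payload bytes for one million boosted occurrences");
    }
}
```

Since a posting for a term occurrence is often only a few bytes itself, a fixed 4-byte payload per occurrence can plausibly approach the size of the postings themselves.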

@roytmana

roytmana commented Dec 3, 2013

@jpountz Great, thank you for the info. I just wanted to bring these two cases up so you could consider them as you work on the _all implementation. Hopefully multifield will follow soon :-) and arbitrary snippet boosting after that.

jpountz added a commit to jpountz/elasticsearch that referenced this issue Dec 3, 2013
_all boosting used to rely on the fact that the TokenStream doesn't eagerly
consume the input java.io.Reader. This fixes the issue by using binary search
in order to find the right boost given a token's start offset.

Close elastic#4315
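
The idea in the commit message can be sketched as follows. This is an illustration under assumptions, not the code from the commit: `OffsetBoostLookup` and `boostAt` are hypothetical names, and the sketch assumes the start offset of each field entry within the concatenated _all text is recorded in ascending order. Because the lookup depends only on a token's start offset, it no longer matters how eagerly the Reader was consumed.

```java
import java.util.Arrays;

final class OffsetBoostLookup {
    private final int[] startOffsets; // start offset of each entry in the concatenated text, ascending
    private final float[] boosts;     // boost of each entry, parallel to startOffsets

    OffsetBoostLookup(int[] startOffsets, float[] boosts) {
        this.startOffsets = startOffsets;
        this.boosts = boosts;
    }

    /** Returns the boost of the entry that contains the given token start offset. */
    float boostAt(int tokenStartOffset) {
        int idx = Arrays.binarySearch(startOffsets, tokenStartOffset);
        if (idx < 0) {
            // Not an exact match: binarySearch returns -(insertionPoint) - 1,
            // and the containing entry is the one just before the insertion point.
            idx = -idx - 2;
        }
        // An offset before the first entry has no boost information; fall back to 1.
        return idx < 0 ? 1.0f : boosts[idx];
    }

    public static void main(String[] args) {
        // Two entries: "quick fox " starts at offset 0 (boost 2), "lazy dog" at offset 10 (boost 1).
        OffsetBoostLookup lookup = new OffsetBoostLookup(new int[] {0, 10}, new float[] {2.0f, 1.0f});
        System.out.println(lookup.boostAt(6));  // "fox" starts at 6  -> 2.0
        System.out.println(lookup.boostAt(15)); // "dog" starts at 15 -> 1.0
    }
}
```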
@jpountz jpountz closed this as completed in 309ee7d Dec 5, 2013
@roytmana

roytmana commented Dec 5, 2013

@jpountz do you mind if I create another ticket with the expanded scope discussed in my first reply to your post? I feel the ability to boost individual text fragments, and particularly multifields, is a very powerful feature.
Or maybe you would rather write it up yourself?

jpountz added a commit that referenced this issue Dec 5, 2013
@jpountz
Contributor Author

jpountz commented Dec 5, 2013

@roytmana please open a ticket. I do think the ability to boost individual text fragments is very interesting!

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015