Custom _all fields #4520

Closed
clintongormley opened this issue Dec 19, 2013 · 12 comments · Fixed by #4796

Comments

@clintongormley

In the quest for a cleaner way of setting up custom _all fields, there are two questions that need to be answered:

  1. Does it ever make sense to index tokens from different analyzer chains into a single field?
  2. Can we support per-field boosting on the custom _all field (like we do with the _all field), and can we only pay the query-time price of per-field boosting if it is used?

Different analyzer chains

I can't think of a good use case where it makes sense to combine the output from different analyzer chains into a single field. The field can only ever be searched via a single analyzer, tokens from multiple analyzers can interfere with each other (and so produce wrong results), and the term frequencies for overlapping tokens will be badly messed up. Also, a clean token stream should never have offsets move "backwards".

So I think we can discount multiple analyzers outputting to a single field.
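
To make the term-frequency concern concrete, here is a minimal sketch in Python. Both analyzers are toy stand-ins: `analyze_stemmed` is a hypothetical suffix stripper, not a real stemmer or any Elasticsearch API, and the sample text is invented for illustration.

```python
from collections import Counter


def analyze_plain(text):
    """Lowercase and split on whitespace (stand-in for an unstemmed chain)."""
    return text.lower().split()


def analyze_stemmed(text):
    """Toy chain: strips a trailing 'ing' or 's'. Not a real stemming algorithm."""
    out = []
    for tok in text.lower().split():
        if tok.endswith("ing") and len(tok) > 5:
            tok = tok[:-3]
        elif tok.endswith("s") and len(tok) > 3:
            tok = tok[:-1]
        out.append(tok)
    return out


# Two source fields feed one combined field through different chains.
combined = analyze_plain("Turner turns") + analyze_stemmed("turning turns")
term_freqs = Counter(combined)


def tf_seen_by(query, analyzer):
    """Term frequencies a query sees when analyzed with a single chain."""
    return {tok: term_freqs[tok] for tok in analyzer(query)}
```

In this sketch, `tf_seen_by("turns", analyze_stemmed)` counts only the stemmed occurrences and `tf_seen_by("turns", analyze_plain)` counts only the unstemmed one, so neither query-time chain sees the whole field.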

Per-field boosting

When combining multiple fields into a single field, you lose the effect of field norms (i.e. title is shorter and thus more important than body). Field-level boosting at index time is the only way to maintain this distinction.

The _all field takes field-level boosts into account by storing any boost that is not 1.0 as a payload with each term. Retrieving these payloads has an impact on query performance, but the _all field has an optimization called "auto_boost" which allows you to only pay the price of payloads if any included field has a boost other than 1.0.
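
The payload idea can be sketched roughly as follows. This is a simplified model in Python, not the actual Lucene or Elasticsearch implementation: `build_all_field` and its return shape are hypothetical, and a payload is modelled as an optional float attached to each term.

```python
def build_all_field(sources):
    """Combine (tokens, boost) pairs into one list of (term, payload) postings.

    A payload is stored only when the source boost is not 1.0. The second
    return value says whether a query needs to read payloads at all.
    """
    postings = []
    for tokens, boost in sources:
        for term in tokens:
            payload = boost if boost != 1.0 else None
            postings.append((term, payload))
    needs_payloads = any(payload is not None for _, payload in postings)
    return postings, needs_payloads


# No boosted field: the query-time payload cost can be skipped entirely.
plain_postings, plain_needs = build_all_field([(["quick", "fox"], 1.0)])

# One boosted field: its terms carry a payload, the others do not.
boosted_postings, boosted_needs = build_all_field(
    [(["quick", "fox"], 1.0), (["fox"], 2.0)]
)
```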

I think field-level boosts should be supported with custom _all fields too.

Proposed syntax

Given that we're not going to support separate analyzer chains, the current way of implementing custom _all fields with multi-fields is verbose and misleading, as it implies that each source field can apply its own analyzer.

Instead, we suggest the following:

{ "title": {
    "type": "string",
    "copy_to": "my_all_field"
}}

The copy_to parameter can also support an array of fieldnames:

"copy_to": [ "my_all_field_1", "my_all_field_2" ]

Per-field boosting could be specified in two ways:

  1. With the caret ^ syntax:

    "copy_to": "my_all_field^2"

  2. As an object:

    "copy_to": { "field": "my_all_field", "boost": 2 }

The destination custom _all field can be defined in the mapping:

"my_all_field": {
    "type": "string",
    "analyzer": "my_analyzer"
}

If it is not defined in the mapping, then it should be added using dynamic mapping (or fail if dynamic mapping is disabled).
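
The proposed forms of copy_to and the destination rule can be sketched together in Python. This models the syntax as proposed in this issue, not necessarily what shipped. The helpers `parse_copy_to` and `apply_copy_to` are hypothetical, the document is assumed to be flat, mixing forms inside an array is an assumption, and the `{"type": "string"}` dynamic default is a guess.

```python
def parse_copy_to(spec):
    """Normalise a copy_to value into a list of (field, boost) pairs."""
    if isinstance(spec, list):
        targets = []
        for item in spec:
            targets.extend(parse_copy_to(item))
        return targets
    if isinstance(spec, dict):
        return [(spec["field"], float(spec.get("boost", 1.0)))]
    if isinstance(spec, str):
        name, caret, boost = spec.rpartition("^")
        if caret:
            return [(name, float(boost))]
        return [(spec, 1.0)]
    raise TypeError(f"unsupported copy_to value: {spec!r}")


def apply_copy_to(mapping, doc, dynamic=True):
    """Return {field: [(value, boost), ...]} for one flat document."""
    indexed = {}
    for field, value in doc.items():
        indexed.setdefault(field, []).append((value, 1.0))
        spec = mapping.get(field, {}).get("copy_to", [])
        for target, boost in parse_copy_to(spec):
            if target not in mapping:
                if not dynamic:
                    raise ValueError(f"copy_to target {target!r} is not mapped")
                # Assumed dynamic default; the thread does not specify one.
                mapping[target] = {"type": "string"}
            indexed.setdefault(target, []).append((value, boost))
    return indexed


mapping = {
    "title": {"type": "string", "copy_to": "my_all_field^2"},
    "body": {"type": "string", "copy_to": ["my_all_field"]},
}
indexed = apply_copy_to(mapping, {"title": "Quick fox", "body": "jumps over"})
```

After this call `indexed["my_all_field"]` holds both values with their boosts, and `mapping` has gained a `my_all_field` entry because dynamic mapping was left enabled.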

@roytmana

@clintongormley

On Different analyzer chains:
I would not discount the value of different analyzer chains. Any chain that creates multiple tokens at the same position (synonyms, stemmers) is currently handled gracefully in various AND queries by treating the tokens at the same position as an OR fragment. It works very well for cases where I want an _all-like field but want to decide which contributing fields should be stemmed and which should be precise (I would not want to stem people's names, for example, while I would want to stem their comments). With a potential implementation of text fragment boosting (#4364) we could even have logic to boost originals higher than stemmed/synonym tokens.
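
The same-position handling described above can be sketched as follows. This is a toy model in Python, not the Elasticsearch query parser: `toy_stem`, `analyze_keep_original` and `build_and_query` are hypothetical names, and the stemmer is a naive suffix stripper.

```python
def toy_stem(token):
    """Hypothetical suffix stripper, for illustration only."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze_keep_original(text):
    """Emit (position, token) pairs; a stem shares its original's position."""
    stream = []
    for pos, tok in enumerate(text.lower().split()):
        stream.append((pos, tok))
        stem = toy_stem(tok)
        if stem != tok:
            stream.append((pos, stem))
    return stream


def build_and_query(text):
    """AND across positions, OR among the tokens stacked at one position."""
    groups = {}
    for pos, tok in analyze_keep_original(text):
        groups.setdefault(pos, []).append(tok)
    return [groups[pos] for pos in sorted(groups)]


# Each inner list is an OR group; the outer list is ANDed together.
query = build_and_query("turning pages")
```

Read as a boolean expression, `query` is (turning OR turn) AND (pages OR page), which is why the same stem-plus-original analyzer can be used at query time in this scheme.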

On Proposed syntax:

  1. It does not support different analyzer chains :-)
  2. It does not allow inheriting the boosts of the contributing fields, forcing us to repeat them. It would be nice if it inherited them unless overridden.

I would consider it a shorthand form, but would like to retain the complete verbose form.

@clintongormley
Author

@roytmana stemming some words and not others is pretty meaningless - you have to choose at query time whether you want to query the stemmed form or the unstemmed form. At that stage better to have it in two different fields.

Putting tokens from multiple analysis chains results in a mess - it really doesn't work well.

Second, for field-level index time boosting: I don't recommend using it for a single field. You lose precision in field norms and you have to reindex if you want to change it. Much better to use query-time boosting on a field instead.

For the custom _all field you can't do it at query time, which is why I would like to support it there. So you should only end up specifying it once: in the copy_to parameter.

@roytmana

It is not meaningless. Yes, you have the freedom to choose an analyzer at query time, but you do not have to. As I said, the latest ES versions handle AND queries for tokens at the same position gracefully, removing the issue of not being able to use the same (stem + no-stem) analyzer at query time.

In some cases it will create a mess and in some it will not. In the case I outlined above it works better for me than trying to combine several flavors of _all-like field (stemmed and unstemmed), and it is the only way to have an _all-like field combining stemmed and unstemmed input, which is very important for cases where stemming certain contributing fields can screw up the data (like stemming people's names).

In cases where you have hundreds of fields contributing to an _all-like field, I would like to have as much control as possible over how it is put together (boosts, position gaps and analyzer chain). It would be up to me to make sure it is not a mess in the end. I would not want ES to keep me from getting burned by denying me such functionality. Not to mention that there could be many people who use it already, and removing it would break their code.

I do not dispute that an analyzer chain defined on the _all-like field, without per-contributing-field chains, is the most common use case. But why not make that the shorthand default: the absence of analyzers on a contributing field definition (which will be the case when using your shorthand version) would indicate that the _all-like field's analyzers should be used.

@jpountz
Contributor

jpountz commented Dec 19, 2013

I'm wondering whether it may actually be a better idea to stem your family names at indexing time. For example, let's imagine that one of the family names is Y, which is also a common name whose stem is X. I assume that you would apply stemming at query time, so a query on Y would be translated into (X or Y). And then if you didn't apply stemming at indexing time, X is going to have a lower frequency, so matches on X are going to get better scores?
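
The frequency argument can be illustrated with a small calculation. This is only a sketch: it assumes a classic TF-IDF style inverse document frequency of the form 1 + ln(N / (df + 1)), and the document counts are invented. It also assumes the documents containing the word X and those containing the name Y do not overlap.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Assumed TF-IDF style idf: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 1000          # invented corpus size
DOCS_WITH_WORD_X = 50    # documents where X occurs as an ordinary word
DOCS_WITH_NAME_Y = 150   # documents where the family name Y occurs

# Names left unstemmed: only ordinary-word occurrences count towards X.
idf_names_unstemmed = classic_idf(DOCS_WITH_WORD_X, NUM_DOCS)

# Names stemmed at index time: every Y is also indexed as X.
idf_names_stemmed = classic_idf(DOCS_WITH_WORD_X + DOCS_WITH_NAME_Y, NUM_DOCS)
```

Under these assumptions `idf_names_unstemmed` is larger than `idf_names_stemmed`, which is the effect the comment describes: leaving names unstemmed keeps X rarer, so matches on X score higher.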

@roytmana

From that perspective, yes (ideally I would want to give the stemmed form a slight negative boost). I did not test it enough with real data, as I had to switch from the _all-like field back to the _all field because field-based boosts were not supported.

But here is another scenario: I am most interested in real words, not people's names. I search for "turn" but also get Turner because names were stemmed.

Also, in the case of synonyms it is not as obvious.

I guess it is never perfect for all scenarios.
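
The "turn" versus Turner scenario can be reproduced with a toy model in Python. The stemmer here is a hypothetical suffix stripper chosen so that "turner" reduces to "turn"; a real stemmer may behave differently, and the sample document is invented.

```python
def toy_stem(token):
    """Hypothetical suffix stripper, for illustration only."""
    for suffix in ("ing", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(comment, name, stem_names):
    """Terms in the combined field; comments are always stemmed."""
    terms = {toy_stem(t) for t in comment.lower().split()}
    name_tokens = name.lower().split()
    if stem_names:
        terms |= {toy_stem(t) for t in name_tokens}
    else:
        terms |= set(name_tokens)
    return terms


def matches(query, terms):
    """AND match with the query stemmed at search time."""
    return all(toy_stem(t) in terms for t in query.lower().split())


doc = {"comment": "please review the draft", "name": "Turner"}
hit_when_names_stemmed = matches(
    "turn", index_terms(doc["comment"], doc["name"], stem_names=True)
)
hit_when_names_exact = matches(
    "turn", index_terms(doc["comment"], doc["name"], stem_names=False)
)
```

With names stemmed, the query for "turn" hits a document that only mentions the person; with names kept exact, it does not.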

@roytmana

@clintongormley if we use the copy_to syntax, it would be great if we could copy multi-fields recursively into other multi-fields.

For example, I may have a my_all field which includes 100 fields, and I want a stemmed version of it and a shingled one. Being able to create my_all_stemmed by copying my_all would be a huge benefit.

@clintongormley
Author

@roytmana I don't think that would work with how ES uses stream parsing. We would have to hold on to a bunch of information to support this, plus would have to handle circular dependencies. Sounds more complex than we want to make this.

Instead, you'll just be able to specify:

"copy_to": [ "my_all", "my_all_stemmed"]

(yes I realise you'll have to do it on all 100 fields, but I think the advantages of being explicit outweigh the complications of recursion here)
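
For large mappings, the explicit form can be generated rather than typed by hand. The following Python sketch is one way to do that, assuming the mapping's properties are held as a flat dict. The helper `add_copy_to` is hypothetical and does not descend into nested object fields.

```python
def add_copy_to(properties, targets, skip=()):
    """Return a copy of `properties` with `targets` added to every field's copy_to.

    Destination fields themselves and any names in `skip` are left untouched.
    """
    result = {}
    for name, definition in properties.items():
        definition = dict(definition)  # shallow copy, so the input is not mutated
        if name not in skip and name not in targets:
            existing = definition.get("copy_to", [])
            if isinstance(existing, str):
                existing = [existing]
            merged = list(existing) + [t for t in targets if t not in existing]
            definition["copy_to"] = merged
        result[name] = definition
    return result


properties = {
    "title": {"type": "string"},
    "body": {"type": "string", "copy_to": "my_all"},
    "my_all": {"type": "string"},
    "my_all_stemmed": {"type": "string"},
}
expanded = add_copy_to(properties, ["my_all", "my_all_stemmed"])
```

Every source field in `expanded` now lists both destinations, while the two destination fields carry no copy_to of their own, so no recursion or cycle handling is needed.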

@roytmana

@clintongormley fair enough, it's not too hard.

What about reversing it:

"my_all":{"copy-from":["message.title", "message.body", "message.sender.email"....]}

It makes it easy to maintain everything in one place; the big disadvantage is the need to use full property names.

I still have some concerns about using just the copy_to form:

I would like to be able to inherit boosts from contributing fields if no boost is specified in the copy_to statement, and I would like to be able to specify a position gap offset for each contributing field even if you decide not to support different analysis pipelines.

Will copy_to support both strings (field names to copy to) and objects with a field name and options such as boost and gap offset, and anything else we may need in the future? The string form would be a shorthand for the default copy logic.

@clintongormley
Author

We did consider copy_from, but it suffers from similar issues with stream parsing. You essentially need to reparse the document in order to get all of the values from the other fields.

As far as position_offset_gap, that would be configured (like analyzer, type, etc) in the mapping for the destination field, as it is a single setting per analyzer (and we only have one analyzer -- the analyzer associated with the destination field).
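
The effect of a single gap setting on the destination field can be sketched as follows. This is a simplified Python model of position assignment, not the Lucene implementation: `assign_positions` and `phrase_matches` are hypothetical helpers and tokenisation is plain whitespace splitting.

```python
def assign_positions(values, position_offset_gap=0):
    """Assign token positions across all values indexed into one field.

    The gap is added between consecutive values, so tokens from different
    source fields end up far apart when the gap is large.
    """
    positions = []
    next_pos = 0
    for i, value in enumerate(values):
        if i > 0:
            next_pos += position_offset_gap
        for tok in value.lower().split():
            positions.append((next_pos, tok))
            next_pos += 1
    return positions


def phrase_matches(positions, first, second):
    """True if `second` occurs at the position right after `first`."""
    firsts = {pos for pos, tok in positions if tok == first}
    return any(tok == second and pos - 1 in firsts for pos, tok in positions)


copied_values = ["quick brown", "fox jumps"]  # values copied from two fields
no_gap = assign_positions(copied_values, position_offset_gap=0)
with_gap = assign_positions(copied_values, position_offset_gap=100)
```

Without a gap, the phrase "brown fox" matches across the boundary between the two copied values; with a gap of 100 it does not.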

Re inheriting boosts... hmmm, I suppose we could do that. However, I repeat, using field-level index-time boosting is a bad idea, with the exception of when you use a custom _all field and are left with no other option.

> and anything else we may need in the future

There shouldn't be anything other than boost. All we're doing is taking the value from one place and indexing as a different field, which has all the settings you need. The only exception being per-field boost.

@roytmana

Thanks for the explanation @clintongormley. I still feel that providing flexibility in how all-like fields are put together (multiple pipelines) would be very valuable, but it is your call of course :-)

Will traditional field-scoped concept of multifield remain (say for not analyzed version of a field no copying from multiple sources involved) or will we have to declare them separately and then use copy_to?

is this slotted for near future 0.9.x or 1.0.x? I just want to plan better as I have a rather big mapping file to rework. Thankfully it is all defined in javascript code and generates itself including proper naming (full name) of multifields where both all-like and field scoped multifields are needed but still it is fair amount of work.

@clintongormley
Author

> Will traditional field-scoped concept of multifield remain (say for not analyzed version of a field no copying from multiple sources involved) or will we have to declare them separately and then use copy_to?

Multi-fields will remain, although I'd like to see their syntax improved as per #4521

> is this slotted for near future 0.9.x or 1.0.x?

It won't be in 0.90 but hoping to get it in for 1.0

@roytmana

@clintongormley many thanks!
#4521 would be very nice to have as well.

@ghost ghost assigned imotov Jan 13, 2014
@imotov imotov closed this as completed in 649f1b1 Jan 20, 2014
imotov added a commit to imotov/elasticsearch that referenced this issue Jan 31, 2014
Currently, boosting on `copy_to` is misleading and does not work as originally specified in elastic#4520. Instead of boosting just the terms from the origin field, it boosts the whole destination field.  If two fields copy_to a third field, one with a boost of 2 and another with a boost of 3, all the terms in the third field end up with a boost of 6.  This was not the intention.

  The alternative: to store the boost in a payload for every term, results in poor performance and inflexibility. Instead, users should either (1) query the common field AND the field that requires boosting, or (2) the multi_match query will soon be able to perform term-centric cross-field matching that will allow per-field boosting at query time (coming in 1.1).
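
The arithmetic in this commit message, and the first suggested workaround, can be sketched in Python. The function names are hypothetical. The query body is one possible reading of "query the common field AND the field that requires boosting", expressed as a bool query with match clauses, and the field names and search text are invented.

```python
import json
from math import prod


def destination_boost(copy_boosts):
    """Boost applied to the whole destination field when every copy_to
    boost multiplies the field rather than just its own terms."""
    return prod(copy_boosts)


def boosted_query(text, common_field, boosted_field, boost):
    """Require a match on the common field and reward a match on the
    boosted source field at query time."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {common_field: text}}],
                "should": [
                    {"match": {boosted_field: {"query": text, "boost": boost}}}
                ],
            }
        }
    }


# Two fields copy into a third with boosts 2 and 3: everything gets 6.
combined_boost = destination_boost([2, 3])

body = boosted_query("quick fox", "my_all_field", "title", 2)
body_json = json.dumps(body, indent=2)
```

Because the boost lives in the query, it can be changed without reindexing, which is the flexibility the commit message points to.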
imotov added a commit that referenced this issue Jan 31, 2014
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015