New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Default position_offset_gap to 100 #12544
Conversation
WIP because it needs tests and more testing. This is way way harder then it ought to be. |
Oh and it needs documentation changes and breaking changes documentation. |
Which one is it? the position gap or the offset gap? Lucene has both, and they both have different meanings. Changing the former to this value makes sense, the latter will break many things. |
And please, please, we have to rename this, because its totally meaningless. positions and offsets are separate things... |
I understand the rename. It's the positions.
|
Does anyone have an opinion on what would be ok behavior from a backwards compatibility standpoint? |
Talked with @dakrone and we decided that the most correct thing as to make |
I believe this is ready for review. @jpountz - its more fiddly than I expected at first so it might be worth a review from you or someone who's pretty experienced with mapping. @mute - I think this is the right way to do #12538. Its much more complicated than it looked and certainly wasn't really low hanging fruit. |
I'd love this in 2.0.0 at some point. It'll need a review soon if it is going to. |
@@ -150,7 +150,7 @@ limit to `32766 / 3 = 10922` since UTF-8 characters may occupy at most 3 | |||
bytes. | |||
|
|||
|`position_offset_gap` |Position increment gap between field instances | |||
with the same field name. Defaults to 0. | |||
with the same field name. Defaults to the default of the analyzer which is 10. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10 seems so arbitrary, but I'm OK with it. I assume you didn't pick 1 because then a sloppy phrase query with slop=1 would incorrectly match...
But, can we explain why it's non-zero here? E.g. just add "... so that proximity queries like PhraseQuery do not incorrectly match across 2 values of a multi-valued field"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah! I'll do that.
10 is totally arbitrary. Its much more than 1 which is too low. I'd be happy to change it to 1024 or some other magic number. I just want it much much more than 1.
I agree that 10 is small, and could quite easily overlap with a typical slop value. Personally i'd go for eg 100 |
@@ -155,6 +173,9 @@ public StringFieldMapper build(BuilderContext context) { | |||
iterator.remove(); | |||
} else if (propName.equals("position_offset_gap")) { | |||
builder.positionOffsetGap(XContentMapValues.nodeIntegerValue(propNode, -1)); | |||
if (builder.positionOffsetGap < 0) { | |||
throw new MapperParsingException("positions_offset_gap less than 0 aren't allowed."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the exc we throw here effectively discards builder
but I think it's less smelly to first assign to a local "untrusted" var first, e.g. int newPositionOffsetGap
, and check whether that's < 0, before calling builder.positionOffsetGap(newPositionOffsetGap)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - I thought about that. I'll make that change.
Ok - I'll make these changes. I don't think I'll have time today because I'm in class and can't really concentrate. But tonight or tomorrow "morning". |
analyzer = (NamedAnalyzer) analyzerF; | ||
if (overridePositionOffsetGap >= 0 && analyzer.getPositionIncrementGap(analyzer.name()) != overridePositionOffsetGap) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add some comments here explaining? I'm confused...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - this was confusing when I wrote it. Its funky and deserves more comments....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok - added some comments above.
Thanks @nik9000 this is a great change (so prox aware queries never match across 2 values of a multi-valued field), I just left some minor comments. Can you open a follow-on issue to rename |
Swapped 2.0 for 2.1 because we're too late in the 2.0 release cycle to get this merged there. I'm going to pick this one back up in a few minutes and have another read through. I'll see if I can address that last open comment of @mikemccand and rebase. And I'll change all the 2.0s into 2.1s in the code. |
d3378af
to
41d0199
Compare
This is much more fiddly than you'd expect it to be because of the way position_offset_gap is applied in StringFieldMapper. Instead of setting the default to 100 its simpler to make sure that all the analyzers default to 100 and that StringFieldMapper doesn't override the default unless the user specifies something different. Unless the index was created before 2.1, in which case the old default of 0 has to take. Also postition_offset_gaps less than 0 aren't allowed at all. New tests test that: 1. the new default doesn't match phrases across values with reasonably low slop (5) 2. the new default doest match phrases across values with reasonably high slop (50) 3. you can override the value and phrases work as you'd expect 4. if you leave the value undefined in the mapping and define it on a custom analyzer the the value from the custom analyzer shines through Closes elastic#7268
41d0199
to
e49b98c
Compare
@@ -0,0 +1,651 @@ | |||
[[mapping-core-types]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grrr - how did I add a whole new file....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. I figured it out - rebase woes. Fixing.
My rebased added core-types back which was removed in the mean time. This properly re-removes the file and moves the updated documentation to the string field mapper docs.
@mikemccand - would you mind having another look at this? Its now ready for review again based on 2.1. |
LGTM, thanks @nik9000! |
@nik9000 this looks good I think this should go into 2.0 though |
Ok - I'll merge to 2.1 now and backport it. I'm not super comfortable sticking it in 2.0 but I'll try because you want it in there. |
I merged this into master but github doesn't recognize it - probably because I merged it locally and pushed. That failed so I rebased onto elastic's master and squashed. At this point github's must have lost any connection to the patch. I'll leave this open to work the backport to 2.0. I'll close it once I've merged there. |
* values. | ||
*/ | ||
public static final int POSITION_OFFSET_GAP = 100; | ||
public static final int POSITION_OFFSET_GAP_PRE_2_1 = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is going to 2.0 branch right? Do we need to change this to 2_0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, yup. I had originally targeted 2.0, then gave up on making it in, then @s1monw said we should jam it in there anyway. So I got what I had reviewed merged as quickly as I could. It's looking like it'll be reasonably quick to merge to 2.0 so I'll do that and then go fix master so it makes sense.
=== Mapping changes | ||
|
||
==== position_offset_gap | ||
The default `position_offset_grap` is now 100. Indexes created in Elasticsearch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
position_offset_grap -> position_offset_gap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok - Will fix this one right quick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And merged to 2.0. For my last trick on this pull request I'll fix master so that it has the same version range checks as 2.0. And then finally, finally, this is done. |
Scratch that, I'll send that as another pull request. |
This is much more fiddly than it looks because of the way position_offset_gap
is applied in StringFieldMapper. Instead of setting the default to 10 its
simpler to make sure that all the analyzers default to 10 and that
StringFieldMapper doesn't override the default unless the user specifies
something different.
Closes #7268