Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor parent/child to not be Lucene queries #8134

Closed
jpountz opened this issue Oct 17, 2014 · 12 comments
Closed

Refactor parent/child to not be Lucene queries #8134

jpountz opened this issue Oct 17, 2014 · 12 comments

Comments

@jpountz
Copy link
Contributor

jpountz commented Oct 17, 2014

Parent/child queries have a non-desirable property: given a parent/child query Q, updating a document in segment A might change the set of matching document in another segment B.

This is an issue because it means that parent/child queries and filters cannot be cached per segment, so we had to add logic to make sure these queries don't get cached, either directly or as part of a cached parent filter (eg. under a cached bool filter). The propagation logic can be a bit fragile so I think we should work on a better fix.

One idea could be to change the abstraction we have to match document from a single Lucene query to something that could perform several Lucene queries. For instance in the case of has_child, we could have a first query that would collect parent ids and then build a new query based on these ids. This is the same execution logic, but each query on its own would solely depend on data that is stored in the current segment, so they would be cacheable (even though it might not a good idea to cache them).

@martijnvg
Copy link
Member

+1 to make the parent/child logic less fragile.

Just wondering how these two separate searches are executed:

  1. We could run the first search in the query parser and pass the matching parent ordinals (optionally with scores) as argument to the query/filter that is returned by the parser.
  2. We could back to how has_child was executed around version 0.19 where there was scoped queries infrastructure. If a query implemented ScopePhase interface it ran twice as separate queries. First just the inner query ran and it collected parent ids, then the actual has_child query ran and only emitted parent docs that had a parent id that was collected before. Running with two distinct queries would mean that we wouldn't run into this filter cache issue. I believe this mechanism was removed because it added additional complexity in the query phase that was unwanted at that time. Maybe was can bring it back to live and simplify it.

@clintongormley
Copy link

Wondering if it would be possible to take parent-level filters into account within the child queries. For instance, imagine you have a filter on the parent which matches only 1% of your parents, but the has_child query/filter matches 90% of your docs. Currently, the has_child clause can't take advantage of the parent-level filter and so has to match all 90% of parents regardless.

@gmenegatti
Copy link

+1 one million times @martijnvg @clintongormley
For a company like ours that currently has more than 500 million parents and in few months around 60 billion children, and where the goal is to count and filter the number of parent based on some data that is stored on the child, this is definitely important.

@iddoav
Copy link

iddoav commented Oct 25, 2014

The optimization @clintongormley suggests would be very beneficial for our use-cases in Totango. Probably also @martijnvg's. Currently, the latency for such queries is not acceptable product-wise, and we'll have to work around with ugly denormalization. What we really want, is to have the parent-child queries to work faster. So, I am happy that you're pushing for it.

@RossLieberman
Copy link

I too ran into the issue described in #5116 using rescore with has_child. Performing the query X times for each set of rescore works and is the difference between 60ms (each rescore) versus 10,000ms for my data, but it would be more ideal to just use the parent's data set during rescore or even better if function_score in the original query supported summing child document values to generate the score on each node before combining on the cluster.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 13, 2015

I tried to think more about it and essentially:

  1. the current design has the issue that queries are not cacheable per segment because whether a document matches a document depends on what happens in other segments as well
  2. as a work-around, we could run the first phase (collecting ids) in the query parser and then return a query for the second phase (which would essentially be a fielddata terms filter with scores for each term). However this query would be specific to a given index searcher and it would not work for things that want to keep queries alive for a long time like filtered aliases or the percolator
  3. using a different abstraction is tempting, but then you could only use parent/child queries as top-level queries (ie. no more nesting in boolean queries)

@martijnvg
Copy link
Member

@jpountz I think that option 2 is fair. The terms collected in phase1 are always tied to a given IndexSearch/DirectoryReader, so if that changes the collected terms are invalid. I think for the percolator we should forbid the usage of parent/child queries.

What I also wanted to tackle with this issue is making parent/child queries smarter in terms of execution. Currently most of the times parent/child queries need either evaluate all parent or children documents (depending on whether has_child or has_parent is used) if there is a match with the inner query, which is unneeded because usually parent/child queries are part of a bigger query and therefor not all parents/children will ever match.

The best thing I can come up with is to make the parent/child queries execute lazily. So for example if a parent document matches with all other queries we push it to has_child and check it has a child document and if it matches with the inner query. Maybe we need to buffer all matching parent docs in order to make this efficient. The easiest place this approach could be applied is if has_child was defined as a post filter. If has_child was defined in a bool query in the main query applying this approach is trickier, because of how the query execution works at the moment.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 20, 2015

The best thing I can come up with is to make the parent/child queries execute lazily. So for example if a parent document matches with all other queries we push it to has_child and check it has a child document and if it matches with the inner query.

I believe this could be done with two-phase iteration that has been introduced in Lucene 5.1: https://issues.apache.org/jira/browse/LUCENE-6198

We just discussed this issue a bit more and agreed on the following points:

  • the limitations of option 2 are fair (no use in filtered aliases or the percolator)
  • we need to move the infrastructure to Lucene

@martijnvg
Copy link
Member

I believe this could be done with two-phase iteration that has been introduced in Lucene 5.1

Cool, with that the second phase matching logic in has_child and has_parent then only needs to be executed for docs we know that are going to match. This should speed things up significantly.

we need to move the infrastructure to Lucene

+1 I think we should rewrite the JoinUtil in the join module to do what has_child and has_parent do.

@rmuir
Copy link
Contributor

rmuir commented Feb 25, 2015

+1 this sounds great.

@martijnvg martijnvg self-assigned this Mar 25, 2015
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 25, 2015
This a breaking change:
1) A parent type needs be marked as parent in the _parent field mapping of the parent type.
2) top_children query will be removed. The top_children query was somewhat an alternative to has_child when it came to speed, but it isn't accurate and wasn't always faster.

Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join.

Closes elastic#6107
Closes elastic#6511
Closes elastic#8134
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Apr 19, 2015
This a breaking change:
1) A parent type needs be marked as parent in the _parent field mapping of the parent type.
2) top_children query will be removed. The top_children query was somewhat an alternative to has_child when it came to speed, but it isn't accurate and wasn't always faster.

Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join.

Closes elastic#6107
Closes elastic#6511
Closes elastic#8134
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Apr 19, 2015
This a breaking change:
1) A parent type needs be marked as parent in the _parent field mapping of the parent type.
2) top_children query will be removed. The top_children query was somewhat an alternative to has_child when it came to speed, but it isn't accurate and wasn't always faster.

Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join.

Closes elastic#6107
Closes elastic#6511
Closes elastic#8134
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 2, 2015
This a breaking change:
1) A parent type needs be marked as parent in the _parent field mapping of the parent type.
2) top_children query will be removed. The top_children query was somewhat an alternative to has_child when it came to speed, but it isn't accurate and wasn't always faster.

Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join.

Closes elastic#6107
Closes elastic#6511
Closes elastic#8134
@pauleil
Copy link

pauleil commented May 6, 2015

What are your plans on this task? Asking because it blocks #2917, which (at least for us) would save an enormous amount of effort and maintenance involved in roundabout solutions.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 10, 2015
This a breaking change:
1) A parent type needs be marked as parent in the _parent field mapping of the parent type.
2) top_children query will be removed. The top_children query was somewhat an alternative to has_child when it came to speed, but it isn't accurate and wasn't always faster.

Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join.

Closes elastic#6107
Closes elastic#6511
Closes elastic#8134
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 10, 2015
This a breaking change:
1) A parent type needs be marked as parent in the _parent field mapping of the parent type.
2) The has_child and has_parent queries can't be used in index aliases any more, because during query parse time it requires the search context to be set. During normal _search api usage this is the case, but not when adding an index alias.

Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join and encode the parent/child relation at index time in a special join doc values field.

Closes elastic#6107
Closes elastic#6511
Closes elastic#8134
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 11, 2015
This a breaking change:
1) A parent type needs be marked as parent in the _parent field mapping of the parent type.
2) The has_child and has_parent queries can't be used in index aliases any more, because during query parse time it requires the search context to be set. During normal _search api usage this is the case, but not when adding an index alias.

Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join and encode the parent/child relation at index time in a special join doc values field.

Closes elastic#6107
Closes elastic#6511
Closes elastic#8134
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 15, 2015
This a breaking change:
1) A parent type needs be marked as parent in the _parent field mapping of the parent type.
2) The has_child and has_parent queries can't be used in index aliases any more, because during query parse time it requires the search context to be set. During normal _search api usage this is the case, but not when adding an index alias.

Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join and encode the parent/child relation at index time in a special join doc values field.

Closes elastic#6107
Closes elastic#6511
Closes elastic#8134
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 19, 2015
This a breaking change:
1) A parent type needs be marked as parent in the _parent field mapping of the parent type.
2) The has_child and has_parent queries can't be used in index aliases any more, because during query parse time it requires the search context to be set. During normal _search api usage this is the case, but not when adding an index alias.

Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join and encode the parent/child relation at index time in a special join doc values field.

Closes elastic#6107
Closes elastic#6511
Closes elastic#8134
@pauleil
Copy link

pauleil commented May 28, 2015

Does your commit mean that you've solved this? If so, that's really great.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 29, 2015
* Cut the `has_child` and `has_parent` queries over to use Lucene's query time global ordinal join. The main benefit of this change is that parent/child queries can now efficiently execute if parent/child queries are wrapped in a bigger boolean query. If the rest of the query only hit a few documents both has_child and has_parent queries don't need to evaluate all parent or child documents any more.
* Cut the `_parent` field over to use doc values. This significantly reduces the on heap memory footprint of parent/child, because the parent id values are never loaded into memory.

Breaking changes:
* The `type` option on the `_parent` field can only point to a parent type that doesn't exist yet, so this means that an existing type/mapping can't become a parent type any longer.
* The `has_child` and `has_parent` queries can no longer be use in alias filters.

All these changes, improvements and breaks in compatibility only apply for indices created with ES version 2.0 or higher. For indices creates with ES <= 2.0 the older implementation is used.

It is highly recommended to re-index all your indices with parent and child documents to benefit from all the improvements that come with this refactoring. The easiest way to achieve this is by using the scan and bulk apis using a simple script.

Closes elastic#6107
Closes elastic#8134
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants