Return matching nested inner objects per hit #3022

Closed
martijnvg opened this issue May 10, 2013 · 80 comments · Fixed by #8153
Labels
>feature :Search/Search Search-related issues that do not fall into other categories

Comments

@martijnvg
Member

Add support for including the matching nested inner objects per hit element.

@eranid

eranid commented May 28, 2013

+1

@roeena

roeena commented May 28, 2013

+1

@btiernay

I'm curious about the intended behaviour of this feature:

  • Will it be possible to do a global sort, offset, limit based on properties of the child?
  • Will it be possible to return the matching child AND parent?

The answers to these questions will have implications in how we proceed in implementing our current application.
Thanks!

@brusic
Contributor

brusic commented May 28, 2013

Sorting on nested documents has been supported since the 0.90 release: #2662

Nested queries always return the parent, so I am assuming the behavior will remain the same. Hopefully this feature will have many settings, similar to most other elasticsearch features.

And I hate sounding like a broken record, but can we please stop with the +1s? The elasticsearch team is not influenced by them and they only create noise.

@btiernay

Sorting on nested documents has been supported since the 0.90 release: #2662

By "global sort", I mean without regard to the parent-nested relationship. That is, it is possible to return sorted children which may not be contiguous with respect to their parent. For example:

Hit 1. nested1,1 -> parent1
Hit 2. nested2,1 -> parent2
Hit 3. nested1,2 -> parent1

Notice how different parents are interleaved.
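
A minimal Python sketch of the ordering described above, assuming hypothetical documents with a `children` array and a made-up `rank` field. It flattens and sorts on the client side purely to illustrate the desired result; it is not an Elasticsearch feature.

```python
# Hypothetical root documents, each carrying a nested "children" array.
parents = [
    {"id": "parent1", "children": [{"name": "nested1,1", "rank": 1},
                                   {"name": "nested1,2", "rank": 3}]},
    {"id": "parent2", "children": [{"name": "nested2,1", "rank": 2}]},
]


def global_nested_sort(docs, key):
    """Flatten all nested objects and sort them, ignoring parent grouping."""
    flat = [(child, doc["id"]) for doc in docs for child in doc["children"]]
    return sorted(flat, key=lambda pair: pair[0][key])


hits = global_nested_sort(parents, "rank")
for i, (child, parent_id) in enumerate(hits, 1):
    print(f"Hit {i}. {child['name']} -> {parent_id}")
```

Each hit keeps a reference to its parent, so children of different parents can interleave in the sorted list.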

Nested queries always return the parent, so I am assuming the behavior will remain the same. Hopefully this feature will have many settings, similar to most other elasticsearch features.

It would be nice to have flexibility here as you describe.

And I hate sounding like a broken record, but can we please stop with the +1s? The elasticsearch team is not influenced by them and they only create noise.

Message received, sorry about that.

@brusic
Contributor

brusic commented May 28, 2013

IMHO, your use case is better suited for parent/child documents and not nested documents. The way I see things is that inner/nested documents always form a single document with the outer/parent document. The inner/nested documents never appear separately. This feature breaks that model slightly by not returning certain nested documents, but the parent is always the same. Of course, I do not work for elasticsearch so my views and thoughts have no bearing on the issue. :)

BTW, there was nothing wrong with your comment. Adding discussion to an issue via a concrete use case provides value and is the type of comment we should be seeing. A comment with nothing but +1 does not provide value. Perhaps I should just create an email filter that ignores github messages with only +1.

@eranid

eranid commented May 28, 2013

Parent-Child has the problem of using a LOT of memory for the joins.
I was using it at first, but as the index grew to hundreds of GB, it became a memory and CPU monster.
When most of my queries are "get me the photos that were tagged with certain tags with some value in a range of dates" (the nested document is the tag)
I have to use either parent-child or nested.

Since there might be lots of tags per photo, I want to get just the relevant tags (don't care about getting the parent really, though I'd rather not).

Parent-Child just can't handle this. With 7GB of memory, the machine takes forever to do the joins, and sometimes crashes.

Also, I did not know the +1 was a bother. I thought it helped you guys prioritize features.
My apologies. Will spread the word.

@brusic
Contributor

brusic commented May 28, 2013

I never said parent-child was efficient, just that its functionality is better suited to your use case. :) Even if nested documents eventually supported your use case, the overhead of sorting would also make it grossly inefficient. Each parent document would need to be scored several times.

As far as +1 goes, there has been some discussion about them. There are a few issues that are 2-3 years old that have hundreds of +1s. You can make the judgement if they are effective or not. I am not on the elasticsearch team so everyone should follow their advice on proper github etiquette and not mine. :)

@btiernay

Even if nested documents eventually supported your use case, the overhead of sorting would also make it grossly inefficient. Each parent document would need to be scored several times.

This may be true given what lucene currently supports for BlockJoinQuery and BlockJoinCollector. This is a good article describing the basics: http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html?m=1

The join can currently only go in one direction (mapping child docIDs to parent docIDs), but in some cases you need to map parent docIDs to child docIDs. For example, when searching songs, perhaps you want all matching songs sorted by their title. You can't easily do this today because the only way to get song hits is to group by album or band/artist.

@martijnvg
Member Author

@btiernay @brusic The idea is that the nested inner object hits are included in the root doc hit. Something like this:

"hits" : [ {
  "_index" : "test",
  "_type" : "type1",
  "_id" : "1",
  "_score" : 1.584377,
  "_source" : ....,
  "nested_hits" : {
    "total" : 2,
    "max_score" : 1.6391755,
    "hits" : [ {
      "_offset" : 1,
      "_score" : 1.6391755,
      "_source" : ...
    }, {
      "_offset" : 0,
      "_score" : 1.5295786,
      "_source" : ...
    } ]
  }
} ]

In the above case, _offset is the nested field's array offset in the _source.
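
As a rough illustration of how a client might consume such a response, the Python sketch below uses `_offset` to look up each nested hit in the root document's `_source`. The `nested_hits` shape follows the proposal above; the `tags` field, the sample data, and `resolve_nested_hits` are hypothetical.

```python
def resolve_nested_hits(hit, nested_field):
    """Pair each nested hit with the inner object its _offset points at."""
    inner_objects = hit["_source"][nested_field]
    return [
        {"_score": nested["_score"], "object": inner_objects[nested["_offset"]]}
        for nested in hit["nested_hits"]["hits"]
    ]


# Hand-written sample hit in the proposed format (not real server output).
hit = {
    "_id": "1",
    "_source": {"tags": [{"name": "beach"}, {"name": "sunset"}]},
    "nested_hits": {
        "total": 2,
        "max_score": 1.6391755,
        "hits": [
            {"_offset": 1, "_score": 1.6391755},
            {"_offset": 0, "_score": 1.5295786},
        ],
    },
}

resolved = resolve_nested_hits(hit, "tags")
for entry in resolved:
    print(entry["_score"], entry["object"])
```

Because the nested hits are ordered by score rather than by array position, the offset is what ties each one back to its place in the original array.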

It should be possible to specify a global sort and a sort inside the root document and what to show per nested hit (the complete inner object based on the source or just some fields). In addition supporting highlighting and other per hit features makes a lot of sense as well.

@eranid The memory usage of parent/child has been reduced in the new 0.90.1 version. Hopefully parent/child queries can now work better in your environment.

@brusic
Contributor

brusic commented May 30, 2013

@martijnvg, so the full source will still be returned? The nested hits is a great idea in terms of flexibility and makes more sense than editing the source (which I referred to above in "breaking the model"), I just hope that it is efficient. I have some convoluted logic to deal with filtering nested documents on the client side, and the serialization/deserialization using Jackson is a bit of a performance hit.

Can scoring be avoided on the nested hits results? My use case calls for scoring using the fields in the parent document, but only filtering the nested documents. Not sure if you thought of this scenario, but a flexible scoring model would be a great feature.

@martijnvg
Member Author

@brusic The full source can optionally be returned if that is requested, but it isn't necessary. The source of the nested inner object will be returned separately, but is based on the source in the root document. The source can also be disabled and individual fields can be set to stored in the mapping; these individual fields can then be requested instead of the source.

The overhead of fetching inner nested objects should be small. This should be done in the fetch phase (so only for the competitive root docs) by re-executing the inner query of the nested query only on the nested docs of the root docs to be retrieved (a big filter).
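
The two-phase idea can be simulated with a toy Python sketch. This is not Elasticsearch's implementation: the scoring rule (count of matching nested objects), the document layout, and the function names are assumptions chosen only to show that the inner query is re-run for the top root docs alone.

```python
import heapq


def query_phase(docs, matches, size):
    """Score every root doc (toy score: number of matching nested objects)
    and keep only the `size` most competitive ones."""
    scored = []
    for doc in docs:
        score = sum(1 for obj in doc["nested"] if matches(obj))
        if score > 0:
            scored.append((score, doc["id"]))
    return heapq.nlargest(size, scored)


def fetch_phase(docs_by_id, top, matches):
    """Re-run the inner predicate, but only on the nested objects that
    belong to the root docs selected by the query phase."""
    results = []
    for score, doc_id in top:
        nested = docs_by_id[doc_id]["nested"]
        inner = [{"_offset": i, "_source": obj}
                 for i, obj in enumerate(nested) if matches(obj)]
        results.append({"_id": doc_id, "_score": score, "nested_hits": inner})
    return results


def is_x(obj):
    return obj["tag"] == "x"


docs = [
    {"id": "a", "nested": [{"tag": "x"}, {"tag": "y"}, {"tag": "x"}]},
    {"id": "b", "nested": [{"tag": "y"}]},
    {"id": "c", "nested": [{"tag": "x"}]},
]
top = query_phase(docs, is_x, size=1)
results = fetch_phase({d["id"]: d for d in docs}, top, is_x)
print(results)
```

With `size=1`, the fetch phase touches only the nested objects of document `a`, which is why the extra cost stays proportional to the number of returned hits rather than to the index size.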

Not sure what you mean by avoiding the scoring on nested hits. Just use a field from the parent for scoring via sorting by script?

@btiernay

@martijnvg: Very nice proposal. A couple of clarifications:

It should be possible to specify a global sort and a sort inside the root document

When you say "global sort" do you mean global with respect to the root document, or with respect to nested documents? I could see how you might be implying the ability to do either.

...based on the sort or just some fields

I assume you mean "source" not "sort"?

@btiernay

@brusic: With respect to:

The nested hits is a great idea in terms of flexibility and makes more sense than editing the source (which I referred to above in "breaking the model"), I just hope that it is efficient.

I think this really depends on the size and structure of your documents. We have some very large documents (deep and wide) for which the ability to return the nested documents without "editing" the source would be much more efficient.

@clintongormley

@eranid to add to what @martijnvg said: up until 0.90.1, parent-child relationships required the parent IDs and child IDs to be held in memory. From 0.90.1 onwards, only the parent IDs need to be held in memory. This is a massive saving and should make parent-child much lighter.

@martijnvg
Member Author

@btiernay The global sort is with respect to the root document. You could use nested sorting as the global sort, which orders root docs based on aggregate sort values from the nested inner objects.
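
To illustrate what ordering root docs by an aggregate of nested values means, here is a small client-side Python simulation. The `offers`/`price` fields, the mode names, and `sort_roots_by_nested` are assumptions made for the example; it sketches the concept and does not call Elasticsearch.

```python
from statistics import mean

# Aggregation modes used to collapse nested values into one sort value.
MODES = {"min": min, "max": max, "avg": mean, "sum": sum}


def sort_roots_by_nested(docs, path, field, mode="min", reverse=False):
    """Order root docs by an aggregate over one field of their nested objects.
    Assumes every doc has at least one nested object under `path`."""
    aggregate = MODES[mode]

    def sort_value(doc):
        return aggregate(obj[field] for obj in doc[path])

    return sorted(docs, key=sort_value, reverse=reverse)


docs = [
    {"id": "p1", "offers": [{"price": 30}, {"price": 5}]},
    {"id": "p2", "offers": [{"price": 10}, {"price": 12}]},
]

by_min = [d["id"] for d in sort_roots_by_nested(docs, "offers", "price", "min")]
by_avg = [d["id"] for d in sort_roots_by_nested(docs, "offers", "price", "avg")]
print(by_min, by_avg)
```

The root documents stay whole in both orderings; only the value they are ranked by changes with the chosen mode.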

I assume you mean "source" not "sort"?

Yes, I meant source.

@martijnvg
Member Author

We definitely want to get this feature in, but in order to get it in right, a refactoring is needed in the fetch phase.
The fetch phase needs to have "a hit in a hit" concept (inner hits), that should cover both nested hits and getting child hits as part of the parent hit. All features that currently work on normal hits like for example explain, highlighting, fields and partial fields should also work for inner hits (if applicable).

@btiernay

btiernay commented Jun 1, 2013

@martijnvg To be clear, I suppose there would be no way of inverting the relationship to sort globally based on nested docs (effectively ignoring the root-nested grouping)? If so, is this due to a Lucene-imposed limitation?

@martijnvg
Member Author

@btiernay You can sort globally based on the nested docs with the current nested sorting support. Global nested sorting won't change when inner hits are added; inner hits will additionally allow sorting nested hits per root / main document hit. Makes sense?

@btiernay

@martijnvg: Sorry for being so dense here, but it is still unclear if I can return nested docs as the root document using this approach. Then, I would be able to sort by the nested doc, without regard to parents, very similar to how parent-child relationships work.

@martijnvg
Member Author

@btiernay No, with this approach the nested inner objects can't be root documents on their own. Nested inner objects are always part of the root document.

@btiernay

@martijnvg: Thanks again for the clarification. Much appreciated. I realize your answer / solution is consistent with the other aspects of nested docs (e.g whole part relationships). However, I'm very curious if my proposal is technically feasible since I think it could be very powerful and more performant than the alternative parent-child approach.

@martijnvg
Member Author

@btiernay I think your idea is technically possible. Right now the inner nested objects don't have a unique identifier like regular root documents have. In theory we could use the path + the offset in the nested array as additional data to the root document's unique identifier for the inner nested object's unique key.
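
A small Python sketch of what such a composite key could look like. The `NestedKey` type, the `root#path[offset]` string format, and `nested_keys` are hypothetical; nothing here reflects an actual Elasticsearch identifier scheme.

```python
from typing import NamedTuple


class NestedKey(NamedTuple):
    """Hypothetical key: root document id + nested path + array offset."""
    root_id: str
    path: str
    offset: int

    def __str__(self):
        return f"{self.root_id}#{self.path}[{self.offset}]"


def nested_keys(root_id, source, path):
    """Derive one key per inner object stored under `path` in the source."""
    return [NestedKey(root_id, path, i)
            for i, _ in enumerate(source.get(path, []))]


keys = nested_keys("1", {"tags": [{"name": "a"}, {"name": "b"}]}, "tags")
print([str(k) for k in keys])
```

Note that a key built this way is only stable while the array order is unchanged: reindexing the root document with reordered inner objects would shift the offsets.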

Also inner objects are tightly coupled to the lifecycle of the root document. If a root document is removed all the nested inner objects (which are stored as separate Lucene documents) are removed as well. Updating or adding individual nested inner objects isn't possible without reindexing the root document and all other nested inner objects (Lucene document block). If nested inner objects were exposed as independent hits in the search result, I guess the fact that these hits have limitations would be confusing.

@btiernay

That gives me hope then :)

we could use the path + the offset in the nested array as additional data to the root document's unique identifier for the inner nested object's unique key

That's an interesting idea. I hadn't thought about the id field. I like it :)

If nested inner objects were exposed as independent hits in the search result, I guess the fact that these hits have limitations would be confusing.

Perhaps, but consider "write once" applications in which the documents rarely, (if ever) change. Given the potential speedup / memory improvements that can be achieved using block documents (especially for deeply nested or wide documents), it would be a shame to not expose this functionality.

@ghost ghost assigned martijnvg Aug 27, 2013
@julianhille

Any progress on this one? I'd love to see this.
Otherwise, is there an estimated time, or any way to help out?

@GabrielKast

I would also like to know if there is any progress on that feature. Any way we could help out?
I have more or less the same use case as described. I would like to select some children in a tree of data where the children only make sense when they are included in their parent. (The use case is: I have a company, with many establishments linked to that company, and I would like to query/retrieve the establishments based on their geographical position. The position belongs to the children, but all the "good data" is linked to the parent document, i.e. the Company.)
I can manage to do something with parent/children, but I need to duplicate some data from the parent to the children and vice-versa.
Another way to avoid issues would be to be able to embed the parent in a query with a "has_parent", or embed the child in a query with a "has_child". I know it's outside the scope of this issue, but maybe it's a simpler idea?
I have the intuition that nested_hits would be a faster solution.
Something else might also be difficult (I am not familiar with ES internals): how do you compute the "nested_count", i.e. how many nested hits there are? Maybe it's more of a parent/children feature.
(please be kind if I'm a little clumsy I don't usually post comment on github :) )

@voleg

voleg commented Oct 3, 2014

+1

@brusic
Contributor

brusic commented Oct 14, 2014

Since #7164 has been merged, where does that leave this issue?

@martijnvg
Member Author

@brusic It is getting close. Work is being done on a PR that adds inner_hits for including nested inner objects / children hits in regular search hits.

@andrerom

+1

@pspanja

pspanja commented Oct 21, 2014

+1

@cphoover

@martijnvg this feature would be super useful for my use case. We have products that contain an array of material subdocuments with attributes attached to those materials (price, title, color... etc). We need the ability to be able to see results on both the product and material level.

Any word on a timeline for this "inner_hits" feature?

For now I am contemplating having two types: a rolled_up product type and a material type. Search then entails two queries: one for the matching style, then one for the material that has a style code matching the first query.

@andrerom

In our case we have a CMS with (like most other such systems) a model of Content -< Location, and we would like to be able to search on content as well as locations without having to index twice.

Potentially tricky thing is how this feature would work when searching for the nested documents (Locations) and getting hits for several of them. Ideally in our case we would prefer several search hits (Content) with corresponding inner object hits (Location), so sorting is correct from elastic search side.

@brusic
Contributor

brusic commented Oct 21, 2014

@martijnvg Thanks for all the hard work. What is the current PR that is being worked on? I would like to try out some development branches. I'm hoping to see a 1.5 tag someday. :)

@martijnvg
Member Author

@cphoover @andrerom @brusic I opened a PR for this feature: #8153

I think the PR is in a good state and it is currently in the review state.

@cphoover

Thank you @martijnvg would love to see this PR land, as it would be perfect for our use case, and I'm sure, many others' as well.

@s1monw
Contributor

s1monw commented Oct 23, 2014

+1

@brusic
Contributor

brusic commented Oct 23, 2014

Well if @s1monw +1ed the issue, then it must be important. Nevermind my constant pestering. :)

@onuralp

onuralp commented Nov 11, 2014

+1

@spiffistan

+1

@rogerwaldvogel

+1

@clintongormley clintongormley added the :Search/Search Search-related issues that do not fall into other categories label Nov 29, 2014
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Dec 2, 2014
Inner hits allows to embed nested inner objects, children documents or the parent document that contributed to the matching of the returned search hit as inner hits, which would otherwise be hidden.

Closes elastic#8153
Closes elastic#3022
Closes elastic#3152
martijnvg added a commit that referenced this issue Dec 2, 2014
Inner hits allows to embed nested inner objects, children documents or the parent document that contributed to the matching of the returned search hit as inner hits, which would otherwise be hidden.

Closes #8153
Closes #3022
Closes #3152
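
For readers arriving at this point in the thread, a minimal Python sketch of a nested query that asks for inner hits, plus a helper that pulls them out of a response. The request shape follows the `inner_hits` option on the nested query; the `tags` and `tags.name` fields and both helper names are hypothetical, and the sample response is hand-written for illustration rather than captured from a cluster.

```python
import json


def nested_query_with_inner_hits(path, inner_query):
    """Build a search body whose nested query also requests inner hits."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": inner_query,
                "inner_hits": {},  # default options
            }
        }
    }


def extract_inner_hits(response, name):
    """Map each root hit id to the sources of its inner hits under `name`."""
    out = {}
    for hit in response["hits"]["hits"]:
        inner = (hit.get("inner_hits", {})
                    .get(name, {})
                    .get("hits", {})
                    .get("hits", []))
        out[hit["_id"]] = [h["_source"] for h in inner]
    return out


body = nested_query_with_inner_hits("tags", {"match": {"tags.name": "beach"}})
print(json.dumps(body, indent=2))

# Hand-written sample response, assumed shape for illustration only.
sample_response = {
    "hits": {"hits": [{
        "_id": "1",
        "inner_hits": {"tags": {"hits": {"total": 1, "hits": [
            {"_nested": {"field": "tags", "offset": 0},
             "_source": {"name": "beach"}},
        ]}}},
    }]}
}
print(extract_inner_hits(sample_response, "tags"))
```

The body can be sent with any HTTP or Elasticsearch client; the sketch deliberately stops short of a network call so it runs without a cluster.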
@ricardo-silveira

+1

@brusic
Contributor

brusic commented Apr 23, 2015

Ricardo, this feature is already live. I believe in version 1.5

@ricardo-silveira

Sorry, the topic was huge and I couldn't read it all. You mean that I can make a query and return the nested documents instead of the main doc? So far I have a workaround: I use _source and add some Python to the mix....

@ricardo-silveira

Are you sure?

I get the following error:

nested: QueryParsingException[[crawler_2015-04-14] [nested] filter does not support [inner_hits]]; }]",
"status": 400

@brusic
Contributor

brusic commented Apr 23, 2015

Which version are you using? The feature was added in 1.5

@ricardo-silveira

In my case I am using the version 1.4.4...

We were using 0.9, now we have just migrated to 1.4, and then you tell me that this feature is available in a new release? :(

@y0j0

y0j0 commented Aug 10, 2016

+1
