Return term vectors as part of the search response #10729
Adds a new parameter to the search API called `term_vectors` which takes as input `true`, `false` or an `object` of parameters. The parameters are exactly the same as the ones specified in the Term Vectors API, with the exception of `_index`, `_type`, `_id`, `doc`, `_routing`, `_version` and `_version_type`.
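To make the proposed request shape concrete, here is a minimal sketch of a client-side helper that attaches `term_vectors` to a search body. The helper name `build_search_body` is hypothetical and not part of this PR. The option keys used in the usage lines (`fields`, `term_statistics`) are assumed to carry over from the Term Vectors API, as the description states.

```python
# Keys the PR description lists as NOT accepted inside `term_vectors`.
EXCLUDED_KEYS = {
    "_index", "_type", "_id", "doc", "_routing", "_version", "_version_type",
}


def build_search_body(query, term_vectors=False):
    """Hypothetical helper: build a search body with the proposed option.

    `term_vectors` may be True, False, or a dict of Term Vectors API
    parameters, mirroring the three input forms in the PR description.
    """
    if isinstance(term_vectors, dict):
        bad = EXCLUDED_KEYS & term_vectors.keys()
        if bad:
            raise ValueError(f"not allowed in term_vectors: {sorted(bad)}")
    elif not isinstance(term_vectors, bool):
        raise TypeError("term_vectors must be true, false or an object")

    body = {"query": query}
    # Term vectors are never returned by default, so omit the key for False.
    if term_vectors is not False:
        body["term_vectors"] = term_vectors
    return body


# Stored term vectors only.
stored_only = build_search_body({"match_all": {}}, True)

# Object form with Term Vectors API parameters (assumed to be supported).
with_params = build_search_body(
    {"match": {"text": "search"}},
    {"fields": ["text"], "term_statistics": True},
)
```

Rejecting the excluded keys on the client is only a convenience in this sketch; the server-side parsing in the PR remains the source of truth.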
Test suite does not seem to pass. Can you fix that before review?
The parameters are the same as for the <<docs-termvectors,Term Vectors API>>.
Use `"term_vectors": true` with no parameters, to only return the term vectors
stored for each document hit. This will ensure that if the term vectors are
not stored, they will not be computed on the fly.
so, when I do not want term vectors to be generated on the fly if they are not there, then I cannot configure any options?
you'd do `"term_vectors": true`, which will return the stored term vectors only.
The code looks good to me but I have concerns about the API: the ability to get term vectors for every search hit sounds a bit esoteric to me, and we have been leaning towards removing esoteric features recently. I will let others chime in, but at the minimum I think this should be marked experimental.
Thanks for the review. I think this will be useful especially for NLP tasks. Note that term vectors are never returned by default.
I agree that the vectorize API should be moved to a plugin. However, I think that this would also be useful for MLT to find documents similar to a group of documents provided by a given query. So I'm not sure anymore whether this PR should be moved to the vectorize plugin. Any thoughts?
What is the status of this PR? We can't remove the TVs API as it is being used by MLT. We agreed that the Vectorize API should be a plugin. However, to support it, this particular integration is useful:
Also, this integration does not add much complexity and its usage is purely optional. On the other hand, it is not too difficult to leave this operation to the application client. It would just mean that the user would have to perform a lot of TVs requests for each document returned. @brwe @jpountz @clintongormley WDYT?
I think any new parameter adds complexity by increasing the surface area of an API.
+1 on this option
You could still avoid the round trips with a multi-term-vectors request?
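As a rough illustration of that client-side alternative, the sketch below turns the hits of a search response into a single multi-term-vectors request body using the `docs` array form. The helper name `build_mtermvectors_body` and the sample response are hypothetical. The `_type` key is copied only when the hit carries one, since that depends on the Elasticsearch version.

```python
def build_mtermvectors_body(search_response, fields=None):
    """Hypothetical helper: one multi-term-vectors body for all search hits."""
    docs = []
    for hit in search_response["hits"]["hits"]:
        doc = {"_index": hit["_index"], "_id": hit["_id"]}
        if "_type" in hit:
            doc["_type"] = hit["_type"]
        if fields:
            # Restrict the term vectors to the requested fields.
            doc["fields"] = list(fields)
        docs.append(doc)
    return {"docs": docs}


# Made-up search response, trimmed to the keys the helper reads.
sample_response = {
    "hits": {
        "hits": [
            {"_index": "articles", "_type": "doc", "_id": "1"},
            {"_index": "articles", "_type": "doc", "_id": "2"},
        ]
    }
}

mtv_body = build_mtermvectors_body(sample_response, fields=["text"])
```

This costs one extra request per page of hits instead of one per document, at the price of a second round trip after the search itself.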
I just talked with @brwe and one option I had not considered was to add this option to the search API as a plugin. If I'm not mistaken, you can't plug custom phases in the search API today, so this is infrastructure we would need to add. But then I'm worried about exposing even more internals of elasticsearch than today, since it would make it harder to perform internal refactorings without impacting plugins and would potentially allow plugins to "poison" internal workings of elasticsearch.
Another option we explored with @brwe was including a native script in a plugin; the script can then be called (with parameters) as a script_field in the search API.
just for reference, here is my workaround script for the vectorizer right now, term_vectors would look similar: https://github.com/brwe/es-token-plugin/blob/master/src/main/java/org/elasticsearch/script/SparseVectorizerScript.java
If I understand correctly, in order to have TVs returned as part of a scan and scroll request, I would have to:
I'm sorry to insist but I think that support for this option is a natural one. We just ask for more information than just the source and it would make implementing certain features (vectorizer for example as a plugin) a lot easier.
Since one of the arguments against pluggable sub-phases was that it might make it harder to maintain the code, I took a closer look at what would need to be done. I made a PR with a proof of concept here to show what would happen: #12400
Relates to #10823