Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return term vectors as part of the search response #10729

Closed

Conversation

alexksikes
Copy link
Contributor

Adds a new parameter to the search API called term_vectors which takes as
input true, false or an object of parameters. The parameters are exactly
the same as the ones specified in the Term Vectors API, with the exception of
_index, _type, _id, doc, _routing, _version and _version_type.

Relates to #10823

Adds a new parameter to the search API called `term_vectors` which takes as
input `true`, `false` or an `object` of parameters. The parameters are exactly
the same as the ones specified in the Term Vectors API, with the exception of
`_index`, `_type`, `_id`, `doc`, `_routing`, `_version` and `_version_type`.
@alexksikes alexksikes added >feature v2.0.0-beta1 review :Search/Search Search-related issues that do not fall into other categories v1.6.0 labels Apr 22, 2015
@brwe
Copy link
Contributor

brwe commented Apr 22, 2015

Test suite does not seem to pass. Can you fix that before review?

The parameters are the same as for the <<docs-termvectors,Term Vectors API>>.
Use `"term_vectors": true` with no parameters, to only return the term vectors
stored for each document hit. This will ensure that if the term vectors are
not stored, they will not be computed on the fly.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, when I do not want term vectors to be generated on the fly if they are not there, then I cannot configure any options?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'd do "term_vectors" : true, which will return the stored term vectors only.

@jpountz
Copy link
Contributor

jpountz commented Apr 28, 2015

The code looks good to me but I have concerns about the API: the ability to get term vectors for every search hit sounds a bit esoteric to me and we have been leaning towards removing esoteric features recently. I will let other chime in but at the minimum I think this should be marked experimental.

@alexksikes
Copy link
Contributor Author

Thanks for the review. I think this will be useful especially for NLP tasks. Note that term vectors are never returned by default.

@clintongormley
Copy link

I agree with @jpountz here - I think that the existing term vectors APIs, plus this integration, plus vectorizers (#10823) should be moved into a plugin to reduce the complexity in core. See this issue about reducing complexity in core: #10368

@alexksikes
Copy link
Contributor Author

I agree that the vectorize API should be moved to a plugin. However, I think that this would also be useful for MLT to find documents similar to a group of documents provided by a given query. So I'm not sure anymore whether this PR should be moved to the vectorize plugin. Any thoughts?

@clintongormley clintongormley changed the title Search API: returns term vectors as part of the search response Return term vectors as part of the search response Jun 6, 2015
@clintongormley clintongormley removed the :Search/Search Search-related issues that do not fall into other categories label Jun 6, 2015
@alexksikes
Copy link
Contributor Author

What is the status quo concerning this PR. We can't remove the TVs API as it is being used by MLT. We agreed that the Vectorize API should be a plugin. However to its support this particular integration is useful:

  1. for the Vectorize plugin
  2. in MLT search by query scenario
  3. in NLP tasks it could be useful to return the result of analysis for a set of documents.

Also this integration does not add much complexity and its usage is purely optional. On the other hand, it is not too difficult to leave this operation to the application client. It would just mean that the user would have to perform a lot of TVs requests for each document returned.

@brwe @jpountz @clintongormley WDYT?

@jpountz
Copy link
Contributor

jpountz commented Jul 1, 2015

Also this integration does not add much complexity and its usage is purely optional.

I think any new parameter adds complexity by increasing the surface area of an API.

it is not too difficult to leave this operation to the application client

+1 on this option

It would just mean that the user would have to perform a lot of TVs requests for each document returned

You could still avoid the round trips with a multi-term-vectors request?

@jpountz
Copy link
Contributor

jpountz commented Jul 1, 2015

I just talked with @brwe and one option I had not considered was to add this option to the search API as a plugin. If I'm not mistaken, you can't plug custom phases in the search API today, so this is infrastructure we would need to add. But then I'm worried about exposing even more internals of elasticsearch than today, since it would make it harder to perform internal refactorings without impacting plugins and would potentially allow plugins to "poison" internal workings of elasticsearch.

@jpountz jpountz closed this Jul 1, 2015
@clintongormley
Copy link

Another option we explored with @brwe was including a native script in a plugin, then the script can be called (with parameters) as a script_field in the search API.

@brwe
Copy link
Contributor

brwe commented Jul 1, 2015

just for reference, here is my workaround script for the vectorizer right now, term_vectors would look similar: https://github.com/brwe/es-token-plugin/blob/master/src/main/java/org/elasticsearch/script/SparseVectorizerScript.java
just don't look too closely, just a hack...

@alexksikes
Copy link
Contributor Author

If understand correctly in order to have TVs returned as part of a scan and scroll request, I would have to:

  1. either build the infrastructure to allow for fetch phase extensions

  2. or perform a mTVs request on the client side

  3. use a script_field

  4. adds complexity for so far only one use case

  5. is not doable on thousands of documents. This would be a huge request without the session mechanism of scan / scroll, and there may also be version conflicts.

  6. seems to be more of hack

I'm sorry to insist but I think that support for this option is a natural one. We just ask for more information than just the source and it would make implementing certain features (vectorizer for example as a plugin) a lot easier.

@brwe
Copy link
Contributor

brwe commented Jul 22, 2015

Since one of the arguments against pluggable sub-phases was that it might make it harder to maintain the code I took a closer look at what would need to be done. I made a pr with a prof of concept here to show what would happen: #12400
I now actually think we would remove a little complexity by this. Let me know what you think!

@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Term Vectors labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search/Search Search-related issues that do not fall into other categories stalled v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants