
Item Query #11814

Closed

alexksikes opened this issue Jun 22, 2015 · 1 comment
Labels
>feature, Meta, :Search/Search (Search-related issues that do not fall into other categories)

Comments

@alexksikes (Contributor)

Introduction

This is a proposal for a different implementation of More Like This (MLT), based purely on the Term Vectors API. The goal of MLT is to find documents that are similar to a given input document (or set of input documents). To do so, MLT simply performs a nearest neighbor search in the vector space model induced by the query: it forms a disjunctive query of all the terms in the input document, executes that query, and returns the results. However, in order to speed things up, MLT first performs a greedy selection of the best terms, that is, the terms that would contribute most to the score of each matching document. In this respect, MLT assumes that the default Lucene similarity is being used, and therefore selects the terms with the highest tf-idf. Furthermore, to limit expensive document frequency lookups, MLT prunes the candidate set by only considering terms that have a certain minimum term frequency or length.
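
To make the term selection concrete, here is a minimal, self-contained Python sketch of the kind of greedy tf-idf selection described above; it is not the Lucene MLT code, and the idf formula, thresholds, and sample data are purely illustrative.

```python
import math
from collections import Counter

def select_best_terms(doc_terms, doc_freqs, num_docs,
                      max_query_terms=25, min_term_freq=2, min_word_len=3):
    """Greedy MLT-style selection: keep the terms with the highest tf-idf.

    doc_terms -- analyzed tokens of the input document
    doc_freqs -- term -> document frequency in the index
    num_docs  -- total number of documents in the index
    """
    tf = Counter(doc_terms)
    scored = []
    for term, freq in tf.items():
        # Prune cheap-to-check candidates before the expensive df lookup.
        if freq < min_term_freq or len(term) < min_word_len:
            continue
        df = doc_freqs.get(term, 1)
        idf = math.log(num_docs / (1 + df)) + 1.0  # illustrative idf formula
        scored.append((freq * idf, term))
    # Keep only the terms that would contribute most to a matching doc's score.
    return [term for _, term in sorted(scored, reverse=True)[:max_query_terms]]

# The selected terms are then combined into one big disjunctive (OR) query.
print(select_best_terms(
    doc_terms=["lucene", "search", "search", "the", "vector", "vector", "vector"],
    doc_freqs={"lucene": 10, "search": 5000, "vector": 50, "the": 100000},
    num_docs=100000,
))
```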

Item Query decouples the process performed by MLT into three distinct steps. First, a characteristic vector of the input document is built using the Term Vectors API with terms filtering (#9561). Second, the query is formed from these terms. Third, that same query is sent to and executed on every shard. This approach has several advantages, listed below.
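
As a rough sketch of these three steps using the elasticsearch-py client: the index, field, and document id are placeholders, the filter thresholds are arbitrary, and exact parameter names can differ between client and server versions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Step 1: characteristic vector of the input document. Terms filtering runs
# only on the shard holding that document (placeholder index/field/id).
tv = es.termvectors(
    index="items", id="42", fields=["body"], term_statistics=True,
    filter={"max_num_terms": 25, "min_term_freq": 2, "min_doc_freq": 5},
)
selected = tv["term_vectors"]["body"]["terms"]  # best terms with their statistics

# Step 2: form a single disjunctive query from the selected terms.
query = {"bool": {"should": [{"term": {"body": t}} for t in selected]}}

# Step 3: the exact same query is sent to, and executed on, every shard.
results = es.search(index="items", query=query)
print(results["hits"]["total"])
```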

Advantages

  • This approach is a cleaner and more faithful implementation of nearest neighbor search with feature selection. It is simpler to implement and easier to understand. For example, we can easily explain the query without even having executed it. We could also return the formed query together with the results, at no additional cost.
  • The same query is executed on every shard, which has some nice performance implications for the file system cache. Furthermore, this approach lowers the load on the cluster, as terms filtering is only performed on the single shard holding the requested document.
  • Together with the ongoing work on query refactoring (Refactor parsing of queries/filters, aggs, suggester APIs #10217), we would no longer need to serialize all candidate terms, but only the selected ones used to form the query. Implementation-wise, we would simply re-use the serialization of the terms query.
  • It allows for more complex terms filtering with scripting, and therefore for different feature selection procedures. This would let us perform feature selection in a different feature metric space, not necessarily limited to the default Lucene similarity. See the example below on Bayesian Sets.

Disadvantages

  • We could now have a speed bottleneck if the shard holding the document and performing the filtering is busy or slow.
  • We would no longer be based on the Lucene MLT code.
  • We lose some of the terms that increase recall, as the selection is performed only on the shard holding the input document, whereas MLT performs the term selection on every shard. However, with enough data, document frequencies should be well distributed. With small indices, it is also always possible to use the dfs option of the Term Vectors API.

Example of Usage: Bayesian Sets

Bayesian Sets (BSets) is an item-based search algorithm that has been shown to return good results for multimedia documents such as images, though not only those. There are two important ideas behind item-based search. First, the query is made of items (not only textual documents), and the results are items. Second, the similarity measure is defined solely implicitly by the choice of features, requiring very little tweaking of the underlying matching algorithm. The features of an item can be thought of as summary statistics characterizing that item, usually taking the form of a vector of characteristics. For example, an image could be turned into a feature vector by using the histogram of its most common colors.
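
As a purely illustrative Python sketch, not part of the proposal: an image's color histogram can be turned into such feature "terms" by quantizing pixel colors and keeping only the most common buckets.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3))  # stand-in for a real RGB image

# Quantize each channel into 4 buckets, so every pixel maps to one of 64 colors.
quantized = (image // 64).reshape(-1, 3)
colors = ["c_%d_%d_%d" % tuple(px) for px in quantized]

# The most common quantized colors become the item's "terms" (its feature vector).
histogram = Counter(colors)
print([color for color, _ in histogram.most_common(8)])
```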

The formula employed by BSets to compute the score of each item with respect to the input item(s) is as follows:

s = c + X · q

where s is the vector of item scores, c is a pre-computed constant, X is the item-feature matrix, and q is a query vector computed from the input item(s).

This formula is in fact very similar to the vector space model, but under the new similarity measure given by q. In essence, each feature j is weighted according to its q_j. Therefore, we can readily approximate BSets by changing the similarity measure to q, and by using an Item Query where the best features (represented as unique terms) are selected according to their q_j. It is true that we do not have all the q_j, as this would be too expensive, but we do have the ones that would contribute most to the final score of each matching document under this new similarity measure.
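
To illustrate this approximation numerically, here is a small sketch with made-up data: it does not compute a real BSets q, but scores items with the full s = c + X · q and with a query restricted to the k features having the largest q_j, which is the kind of feature selection Item Query would perform.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary item-feature matrix X (items x features) and a stand-in weight
# vector q; in practice q would be derived from the input item(s).
num_items, num_features = 1000, 500
X = (rng.random((num_items, num_features)) < 0.05).astype(float)
q = rng.exponential(scale=1.0, size=num_features)
c = 0.0  # constant offset, does not affect the ranking

# Exact scores under the new similarity measure.
s_full = c + X @ q

# Item Query approximation: keep only the k features with the largest q_j
# (the "best terms" of the characteristic vector) and drop all others.
k = 50
top = np.argsort(q)[-k:]
q_approx = np.zeros_like(q)
q_approx[top] = q[top]
s_approx = c + X @ q_approx

# Compare the top-10 items returned by the exact and approximate scores.
print(np.argsort(s_full)[-10:][::-1])
print(np.argsort(s_approx)[-10:][::-1])
```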

Implementation Notes

Item Query would probably be implemented as an experimental feature or as a plugin. We already have most of the elements required for Item Query, but here is a list of remaining tasks and additional features:

  • add scripting for term vectors filtering.
  • add an option to dfs only the best n + k terms, and return the best n terms (see the sketch after this list).
  • implement like_text as an artificial document with cross fields.
  • benchmark traditional MLT vs. Item Query.
  • provide an example of using BSets on an image dataset such as MIRFLICKR.
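
A minimal sketch of the n + k dfs option above, assuming hypothetical local_df and dfs_lookup callables that stand in for the shard-local statistics and the distributed-frequencies round trip.

```python
import math
from collections import Counter

def select_with_dfs(doc_terms, local_df, dfs_lookup, num_docs, n, k):
    """Two-stage selection: rank n + k candidates with cheap shard-local
    document frequencies, fetch accurate distributed frequencies for those
    candidates only, and return the best n terms."""
    tf = Counter(doc_terms)
    # Stage 1: shard-local tf-idf, keep the top n + k candidates.
    candidates = sorted(
        tf, key=lambda t: tf[t] * math.log(num_docs / (1 + local_df(t))),
        reverse=True,
    )[:n + k]
    # Stage 2: one distributed lookup for n + k terms instead of all terms.
    df = dfs_lookup(candidates)
    return sorted(
        candidates, key=lambda t: tf[t] * math.log(num_docs / (1 + df[t])),
        reverse=True,
    )[:n]
```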
@clintongormley

Nothing further here - closing

@clintongormley added the :Search/Search label and removed the :Term Vectors label on Feb 14, 2018