Introduction
This is a proposal for a different implementation of More Like This (MLT), based purely on the Term Vectors API. The goal of MLT is to find documents that are similar to a given input document (or set of input documents). To do so, MLT simply performs a nearest neighbor search in the vector space model induced by the query. This consists of forming a disjunctive query over all the terms of the input document, executing that query, and finally returning the results. However, in order to speed things up, MLT first performs a greedy selection of the best terms, that is, the terms that would contribute most to the score of each matching document. In this respect, MLT assumes that the default Lucene similarity measure is being used, and therefore selects the terms with the highest tf-idf. Furthermore, to limit the expensive document frequency lookup for each term, MLT prunes the selection space by considering only terms that have a certain minimum term frequency or length.
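The greedy selection step can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the actual Lucene/Elasticsearch code: the function name, thresholds, and the simplified idf formula are all assumptions made for the example.

```python
import math

def select_best_terms(term_freqs, doc_freqs, num_docs,
                      max_terms=25, min_term_freq=2, min_word_len=3):
    """Greedily pick the terms that would contribute most to the score,
    assuming a classic tf-idf weighting (a stand-in for the default
    Lucene similarity). Thresholds here are illustrative."""
    scored = []
    for term, tf in term_freqs.items():
        # Prune cheaply on term frequency and length before paying for
        # the document frequency lookup.
        if tf < min_term_freq or len(term) < min_word_len:
            continue
        df = doc_freqs.get(term, 1)
        idf = math.log(num_docs / (df + 1)) + 1.0
        scored.append((tf * idf, term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:max_terms]]
```

Note that in practice the pruning also relies on stop words and a maximum document frequency; the sketch above only shows the tf/length cutoffs mentioned here.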
Item Query better decouples the process performed by MLT into three distinct steps. First, a characteristic vector of the input document is obtained using the Term Vectors API with terms filtering (#9561). Second, a query is formed from these terms. Third, that same query is sent to and executed on every shard. This approach has several advantages, which we list below.
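The second step, forming a disjunctive query from the selected terms, could look roughly as follows. The JSON shape mirrors the standard bool/should query DSL; the field name and boost values are illustrative, and this is only a sketch of the idea, not a committed request format.

```python
def build_item_query(selected_terms, field="body", boosts=None):
    """Assemble a disjunctive (bool/should) query from the selected
    terms, optionally boosting each term by its selection weight."""
    boosts = boosts or {}
    return {
        "bool": {
            "should": [
                {"term": {field: {"value": t, "boost": boosts.get(t, 1.0)}}}
                for t in selected_terms
            ],
            "minimum_should_match": 1,
        }
    }
```

Since the query is fully formed before execution, it can be returned to the user or explained without ever being run, which is one of the advantages listed below.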
Advantages
This approach is a cleaner and more faithful implementation of nearest neighbor search with feature selection. It is simpler to implement and easier to understand. For example, we can explain the query without even having executed it. We could also always return the formed query together with the results, at no additional cost.
The same query is executed on every shard, which has some nice performance implications for the file system cache. Furthermore, this approach lowers the load on the cluster, as terms filtering is performed only on the single shard holding the requested document.
Together with the ongoing work on query refactoring (Refactor parsing of queries/filters, aggs, suggester APIs #10217), we would no longer need to serialize all candidate terms, but only the selected ones used to form the query. Implementation-wise, we would simply reuse the serialization of the terms query.
It allows for more complex terms filtering with scripting, and therefore for different feature selection procedures. This would let us perform feature selection in a different feature metric space, not necessarily limited to the default Lucene similarity measure. See the example below on Bayesian Sets.
Disadvantages
We could now have a speed bottleneck if the single shard holding the input document, and therefore performing the filtering, is busy or slow.
We would no longer be based on the Lucene MLT code.
We are losing the terms that increase recall, as the selection is performed only on the shard holding the input document, whereas MLT performs the term selection on every shard. However, with enough data, document frequencies should be well distributed across shards. With small indices, it is also always possible to use the dfs option of the Term Vectors API.
Example of Usage: Bayesian Sets
Bayesian Sets (BSets) is an item-based search algorithm that has been shown to return good results for multimedia documents such as images, among others. There are two important ideas behind item-based search. First, the query is made of items (not only textual documents), and the results are items. Second, the similarity measure is defined solely and implicitly by the choice of features, requiring very little tweaking of the underlying matching algorithm. The features of an item can be thought of as summary statistics characterizing that item, usually taking the form of a vector of characteristics. For example, an image could be turned into a feature vector by using the histogram of its most common colors.
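As a toy illustration of the color-histogram example, here is one way such a feature vector could be computed. The function, its bucketing scheme, and the pixel representation are all assumptions made for this sketch, not part of BSets itself.

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Turn an image, given as a list of (r, g, b) tuples in 0-255,
    into a sparse feature vector by quantizing each channel into
    coarse buckets and counting how many pixels fall in each."""
    step = 256 // bins_per_channel
    counts = Counter(
        (r // step, g // step, b // step) for r, g, b in pixels
    )
    total = len(pixels)
    # Normalize so images of different sizes remain comparable.
    return {bucket: n / total for bucket, n in counts.items()}
```

Each non-empty bucket can then be treated as a unique term of the item, which is exactly the representation an Item Query would consume.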
The formula employed by BSets to compute the score of each item with respect to the input item(s) is given as follows:
s = c + X · q
where s is the score vector (s_i is the score of item i), c is a pre-computed constant, X is the item-feature matrix, and q is a query vector computed from the input item(s).
This formula is in fact very similar to the vector space model, but under the new similarity measure given by q. In essence, each feature j is weighted according to some q_j. Therefore, we can readily approximate BSets by changing the similarity measure to q, and by using an Item Query where the best features (represented as unique terms) are selected according to their q_j. Admittedly, we do not have all the q_j, as computing them all would be too expensive, but we do have the ones that would contribute most to the final score of each matching document under this new similarity measure.
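The scoring formula above can be worked through on toy values. In real BSets, c and q are derived from the Beta-Bernoulli hyperparameters and the query items; here they are simply made-up numbers, and the dense matrix is only for illustration.

```python
def bsets_scores(c, X, q):
    """Score every item via s = c + X . q, where X is a binary
    item-feature matrix (one row per item) and q weights each
    feature j by its q_j."""
    return [c + sum(x_j * q_j for x_j, q_j in zip(row, q)) for row in X]
```

Keeping only the largest q_j, as the Item Query approximation does, amounts to zeroing out the low-weight columns of X before this dot product, which preserves the bulk of each item's score.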
Implementation Notes
Item Query would probably be implemented as an experimental feature or as a plugin. We pretty much have all the elements required for Item Query already; here is a list of remaining tasks and additional features:
add scripting for term vectors filtering.
add an option to dfs only the best n + k terms, and return the best n terms.
implement like_text as an artificial document with cross fields.
benchmark traditional MLT vs. Item Query.
provide an example of using BSets on an image dataset such as MIRFLICKR.