Introduction
This is a proposal for a different implementation of More Like This (MLT), based purely on the Term Vectors API. The goal of MLT is to find documents that are similar to a given input document (or set of input documents). To do so, MLT simply performs a nearest neighbor search in the vector space model induced by the query. This consists of forming a disjunctive query over all the terms of the input document, executing that query, and finally returning the results. However, in order to speed things up, MLT first performs a greedy selection of the best terms, that is, the terms that would contribute most to the score of each matching document. In this respect, MLT assumes that the default Lucene similarity measure is being used, and therefore selects the terms with the highest tf-idf. Furthermore, to limit the expensive document frequency lookup for each term, MLT prunes the selection space by considering only terms that have a certain minimum term frequency or length.
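The greedy selection step can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the actual Lucene/Elasticsearch code: the function name, thresholds, and the simplified idf formula are all assumptions made for the example.

```python
import math

def select_best_terms(term_freqs, doc_freqs, num_docs,
                      max_terms=25, min_term_freq=2, min_word_len=3):
    """Greedily pick the terms that would contribute most to the score,
    assuming a classic tf-idf weighting (a stand-in for the default
    Lucene similarity). Thresholds here are illustrative."""
    scored = []
    for term, tf in term_freqs.items():
        # Prune cheaply on term frequency and length before paying for
        # the document frequency lookup.
        if tf < min_term_freq or len(term) < min_word_len:
            continue
        df = doc_freqs.get(term, 1)
        idf = math.log(num_docs / (df + 1)) + 1.0
        scored.append((tf * idf, term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:max_terms]]
```

Note that in practice the pruning also relies on stop words and a maximum document frequency; the sketch above only shows the tf/length cutoffs mentioned here.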
Item Query better decouples the process performed by MLT into three distinct steps. First, a characteristic vector of the input document is obtained using the Term Vectors API with terms filtering (#9561). Second, a query is formed from these terms. Third, that same query is sent to and executed on every shard. This approach has several advantages, which we list below.
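The second step, forming a disjunctive query from the selected terms, could look roughly as follows. The JSON shape mirrors the standard bool/should query DSL; the field name and boost values are illustrative, and this is only a sketch of the idea, not a committed request format.

```python
def build_item_query(selected_terms, field="body", boosts=None):
    """Assemble a disjunctive (bool/should) query from the selected
    terms, optionally boosting each term by its selection weight."""
    boosts = boosts or {}
    return {
        "bool": {
            "should": [
                {"term": {field: {"value": t, "boost": boosts.get(t, 1.0)}}}
                for t in selected_terms
            ],
            "minimum_should_match": 1,
        }
    }
```

Since the query is fully formed before execution, it can be returned to the user or explained without ever being run, which is one of the advantages listed below.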
Advantages
This approach is a cleaner and more faithful implementation of nearest neighbor search with feature selection. It is simpler to implement and easier to understand. For example, we can explain the query without even having executed it. We could also always return the formed query together with the results, at no additional cost.
The same query is executed on every shard, which has some nice performance implications for the file system cache. Furthermore, this approach lowers the load on the cluster, as terms filtering is performed only on the single shard holding the requested document.
Together with the ongoing work on query refactoring (Refactor parsing of queries/filters, aggs, suggester APIs #10217), we would no longer need to serialize all candidate terms, but only the selected ones used to form the query. Implementation-wise, we would simply reuse the serialization of the terms query.
It allows for more complex terms filtering with scripting, and therefore for different feature selection procedures. This would let us perform feature selection in a different feature metric space, not necessarily limited to the default Lucene similarity measure. See the example below on Bayesian Sets.
Disadvantages
We could now have a speed bottleneck if the single shard holding the input document, and therefore performing the filtering, is busy or slow.
We would no longer be based on the Lucene MLT code.
We are losing the terms that increase recall, as the selection is performed only on the shard holding the input document, whereas MLT performs the term selection on every shard. However, with enough data, document frequencies should be well distributed across shards. With small indices, it is also always possible to use the dfs option of the Term Vectors API.
Example of Usage: Bayesian Sets
Bayesian Sets (BSets) is an item-based search algorithm that has been shown to return good results for multimedia documents such as images, among others. There are two important ideas behind item-based search. First, the query is made of items (not only textual documents), and the results are items. Second, the similarity measure is defined solely and implicitly by the choice of features, requiring very little tweaking of the underlying matching algorithm. The features of an item can be thought of as summary statistics characterizing that item, usually taking the form of a vector of characteristics. For example, an image could be turned into a feature vector by using the histogram of its most common colors.
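As a toy illustration of the color-histogram example, here is one way such a feature vector could be computed. The function, its bucketing scheme, and the pixel representation are all assumptions made for this sketch, not part of BSets itself.

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Turn an image, given as a list of (r, g, b) tuples in 0-255,
    into a sparse feature vector by quantizing each channel into
    coarse buckets and counting how many pixels fall in each."""
    step = 256 // bins_per_channel
    counts = Counter(
        (r // step, g // step, b // step) for r, g, b in pixels
    )
    total = len(pixels)
    # Normalize so images of different sizes remain comparable.
    return {bucket: n / total for bucket, n in counts.items()}
```

Each non-empty bucket can then be treated as a unique term of the item, which is exactly the representation an Item Query would consume.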
The formula employed by BSets to compute the score of each item with respect to the input item(s) is given as follows:
s = c + X · q
where s is the score vector (s_i is the score of item i), c is a pre-computed constant, X is the item-feature matrix, and q is a query vector computed from the input item(s).
This formula is in fact very similar to the vector space model, but under the new similarity measure given by q. In essence, each feature j is weighted according to some q_j. Therefore, we can readily approximate BSets by changing the similarity measure to q, and by using an Item Query where the best features (represented as unique terms) are selected according to their q_j. Admittedly, we do not have all the q_j, as computing them all would be too expensive, but we do have the ones that would contribute most to the final score of each matching document under this new similarity measure.
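The scoring formula above can be worked through on toy values. In real BSets, c and q are derived from the Beta-Bernoulli hyperparameters and the query items; here they are simply made-up numbers, and the dense matrix is only for illustration.

```python
def bsets_scores(c, X, q):
    """Score every item via s = c + X . q, where X is a binary
    item-feature matrix (one row per item) and q weights each
    feature j by its q_j."""
    return [c + sum(x_j * q_j for x_j, q_j in zip(row, q)) for row in X]
```

Keeping only the largest q_j, as the Item Query approximation does, amounts to zeroing out the low-weight columns of X before this dot product, which preserves the bulk of each item's score.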
Implementation Notes
Item Query would probably be implemented as an experimental feature or as a plugin. We pretty much have all the elements required for Item Query already; here is a list of remaining tasks and additional features:
add scripting for term vectors filtering.
add an option to dfs only the best n + k terms, and return the best n terms.
implement like_text as an artificial document with cross fields.
benchmark traditional MLT vs. Item Query.
provide an example of using BSets on an image dataset such as MIRFLICKR.