Support terms filtering #9561

Closed
106 changes: 106 additions & 0 deletions docs/reference/docs/termvectors.asciidoc
@@ -85,6 +85,40 @@ Setting `dfs` to `true` (default is `false`) will return the term statistics
or field statistics of the entire index, rather than just those of the shard. Use it
with caution, as distributed frequencies can have a serious performance impact.

[float]
==== Terms Filtering coming[2.0]

With the parameter `filter`, the terms returned can also be filtered based
on their tf-idf scores. This can be useful for finding a good
characteristic vector of a document. This feature works in a similar manner to
the <<mlt-query-term-selection,second phase>> of the
<<query-dsl-mlt-query,More Like This Query>>. See <<docs-termvectors-terms-filtering,example 5>>
for usage.

Review comment:

I'm nervous about using the word filter here, as we try to reserve that purely for query DSL filters (although we do use it for fielddata filters as well http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/fielddata-formats.html#field-data-filtering). I'm struggling to come up with a better name though.

I'm wondering if we need this sub-level at all, or if we can just move the `max_num_terms` etc. parameters up to the top level? There doesn't seem to be any name clash (although I do like the fact that having them as sub-params indicates their action more clearly).

Contributor Author:

One nice thing about using filter as a sub-level parameter is that `filter: {}` with no parameters will perform filtering with the default sub-parameters. Another reason is to hide the complexity of a feature which might only be used in some very specific cases. It also allows us to expand on this sub-feature more cleanly, for example to allow filtering based on a script. But I do agree, I am not sure that filter is the proper name for this feature.

[NOTE]
====
In order to compute the score of each term, `term_statistics` and
`field_statistics` must be set to `true`.
====

The following sub-parameters are supported:

[horizontal]
`max_num_terms`::
Maximum number of terms to be returned per field. Defaults to `25`.
`min_term_freq`::
Ignore terms with less than this frequency in the source doc. Defaults to `1`.
`max_term_freq`::
Ignore terms with more than this frequency in the source doc. Defaults to unbounded.
`min_doc_freq`::
Ignore terms which do not occur in at least this many docs. Defaults to `1`.
`max_doc_freq`::
Ignore terms which occur in more than this many docs. Defaults to unbounded.
`min_word_length`::
Minimum word length below which terms will be ignored. Defaults to `0`.
`max_word_length`::
Maximum word length above which terms will be ignored. Defaults to unbounded (`0`).
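To make the interaction of these sub-parameters concrete, here is a minimal standalone Java sketch (not the actual Elasticsearch implementation; the class name, the exact tf-idf formula, and the sample statistics are all illustrative assumptions): candidate terms are pruned by the frequency and length bounds, scored, and only the top `max_num_terms` are kept.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of the filtering described above; the real code lives
// in Elasticsearch and may use a different scoring formula.
public class TermFilterSketch {

    public static class TermStats {
        public final String term;
        public final int termFreq; // frequency in the source doc
        public final int docFreq;  // number of docs containing the term
        public TermStats(String term, int termFreq, int docFreq) {
            this.term = term;
            this.termFreq = termFreq;
            this.docFreq = docFreq;
        }
    }

    // A simple tf-idf style score (assumed formula, for illustration only).
    public static double score(TermStats t, int docCount) {
        return t.termFreq * Math.log((double) docCount / (t.docFreq + 1));
    }

    public static List<String> filterTerms(List<TermStats> terms, int docCount,
                                           int maxNumTerms, int minTermFreq,
                                           int minDocFreq, int minWordLength) {
        return terms.stream()
                // prune by the min_* bounds (max_* bounds omitted for brevity)
                .filter(t -> t.termFreq >= minTermFreq)
                .filter(t -> t.docFreq >= minDocFreq)
                .filter(t -> t.term.length() >= minWordLength)
                // rank by score, highest first, and keep the top max_num_terms
                .sorted(Comparator.comparingDouble(
                        (TermStats t) -> score(t, docCount)).reversed())
                .limit(maxNumTerms)
                .map(t -> t.term)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A common stop word has a huge doc_freq and therefore a tiny score,
        // so it never survives the top-N cut.
        List<TermStats> terms = Arrays.asList(
                new TermStats("the", 3, 170000),
                new TermStats("stark", 1, 44),
                new TermStats("armored", 1, 27),
                new TermStats("industrialist", 1, 88));
        System.out.println(filterTerms(terms, 176214, 3, 1, 1, 0));
    }
}
```

This also illustrates why stop words drop out of the response in example 5 below: a high `doc_freq` drives the idf component toward zero.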

[float]
=== Behaviour

@@ -337,3 +371,75 @@ Response:
}
}
--------------------------------------------------

[float]
[[docs-termvectors-terms-filtering]]
=== Example 5

Finally, the terms returned can be filtered based on their tf-idf scores. In
the example below we obtain the three most "interesting" keywords from the
artificial document having the given "plot" field value. We additionally ask
for distributed frequencies (`dfs`) to obtain more accurate results. Notice
that the keyword "Tony" and the stop words are not part of the response, as
their tf-idf scores are presumably too low.

[source,js]
--------------------------------------------------
GET /imdb/movies/_termvectors
{
"doc": {
"plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
},
"term_statistics" : true,
"field_statistics" : true,
"dfs": true,
"positions": false,
"offsets": false,
"filter" : {
"max_num_terms" : 3,
"min_term_freq" : 1,
"min_doc_freq" : 1
}
}
--------------------------------------------------

Response:

[source,js]
--------------------------------------------------
{
"_index": "imdb",
"_type": "movies",
"_version": 0,
"found": true,
"term_vectors": {
"plot": {
"field_statistics": {
"sum_doc_freq": 3384269,
"doc_count": 176214,
"sum_ttf": 3753460
},
"terms": {
"armored": {
"doc_freq": 27,
"ttf": 27,
"term_freq": 1,
"score": 9.74725
},
"industrialist": {
"doc_freq": 88,
"ttf": 88,
"term_freq": 1,
"score": 8.590818
},
"stark": {
"doc_freq": 44,
"ttf": 47,
"term_freq": 1,
"score": 9.272792
}
}
}
}
}
--------------------------------------------------
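The `score` values in the response above can be consumed directly on the client side. A minimal sketch in plain Java (the class name is hypothetical, and the response is assumed to have already been parsed into a term-to-score map by whatever JSON library the client uses) that picks out the single highest-scoring keyword:

```java
import java.util.*;

// Hypothetical post-processing of the term vectors response: given the
// per-term scores, return the most "interesting" keyword.
public class TopTerm {

    public static String topTerm(Map<String, Double> termScores) {
        return termScores.entrySet().stream()
                .max(Map.Entry.comparingByValue()) // highest score wins
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        // Scores taken from the example response above.
        Map<String, Double> scores = new HashMap<>();
        scores.put("armored", 9.74725);
        scores.put("industrialist", 8.590818);
        scores.put("stark", 9.272792);
        System.out.println(topTerm(scores)); // armored
    }
}
```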
1 change: 1 addition & 0 deletions docs/reference/query-dsl/queries/mlt-query.asciidoc
@@ -178,6 +178,7 @@ The text to find documents like it.
A list of documents following the same syntax as the <<docs-multi-get,Multi GET API>>.

[float]
[[mlt-query-term-selection]]
==== Term Selection Parameters

[horizontal]
@@ -22,6 +22,7 @@
import com.carrotsearch.hppc.ObjectLongOpenHashMap;
import com.carrotsearch.hppc.cursors.ObjectLongCursor;
import org.apache.lucene.index.*;
import org.apache.lucene.search.BoostAttribute;
import org.apache.lucene.util.*;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.io.stream.BytesStreamInput;
@@ -40,7 +41,7 @@
* <tt>-1,</tt>, if no positions were returned by the {@link TermVectorsRequest}.
* <p/>
* The data is stored in two byte arrays ({@code headerRef} and
* {@code termVectors}, both {@link ByteRef}) that have the following format:
* {@code termVectors}, both {@link BytesRef}) that have the following format:
* <p/>
* {@code headerRef}: Stores offsets per field in the {@code termVectors} array
* and some header information as {@link BytesRef}. Format is
@@ -113,6 +114,7 @@ public final class TermVectorsFields extends Fields {
private final BytesReference termVectors;
final boolean hasTermStatistic;
final boolean hasFieldStatistic;
public final boolean hasScores;

/**
* @param headerRef Stores offsets per field in the {@code termVectors} and some
@@ -130,6 +132,7 @@ public TermVectorsFields(BytesReference headerRef, BytesReference termVectors) t
assert version == -1;
hasTermStatistic = header.readBoolean();
hasFieldStatistic = header.readBoolean();
hasScores = header.readBoolean();
final int numFields = header.readVInt();
for (int i = 0; i < numFields; i++) {
fieldMap.put((header.readString()), header.readVLong());
@@ -226,6 +229,7 @@ public TermsEnum iterator(TermsEnum reuse) throws IOException {
int[] endOffsets = new int[1];
BytesRefBuilder[] payloads = new BytesRefBuilder[1];
final BytesRefBuilder spare = new BytesRefBuilder();
BoostAttribute boostAtt = this.attributes().addAttribute(BoostAttribute.class);

@Override
public BytesRef next() throws IOException {
@@ -250,6 +254,11 @@ public BytesRef next() throws IOException {
// currentPosition etc. so that we can just iterate
// later
writeInfos(perFieldTermVectorInput);

// read the score if available
if (hasScores) {
boostAtt.setBoost(perFieldTermVectorInput.readFloat());
}
return spare.get();

} else {
@@ -487,5 +496,4 @@ int readPotentiallyNegativeVInt(BytesStreamInput stream) throws IOException {
long readPotentiallyNegativeVLong(BytesStreamInput stream) throws IOException {
return stream.readVLong() - 1;
}

}