Term Vectors: Support for artificial documents
This adds the ability for the Term Vector API to generate term vectors for
artificial documents, that is, for documents not present in the index. Following
a syntax similar to the Percolator API, a new 'doc' parameter, used instead of
'_id', specifies the document of interest. The parameters '_index' and '_type'
determine the mapping, and therefore the analyzers, to apply to each field
value.

Closes #7530
alexksikes committed Sep 5, 2014
1 parent ebd4007 commit c2ec379
Showing 13 changed files with 454 additions and 67 deletions.
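
A rough, editor-added sketch of how the new artificial-document support could be driven from the Java client, using the builder methods introduced in this commit; the `org.elasticsearch.action.termvector` package, the `TermVectorResponse` type, and passing `null` for the id are assumptions inferred from the diff rather than verified API.

[source,java]
--------------------------------------------------
import org.elasticsearch.action.termvector.TermVectorRequestBuilder;
import org.elasticsearch.action.termvector.TermVectorResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.xcontent.XContentBuilder;

import java.io.IOException;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public class ArtificialDocTermVectors {

    // Requests term vectors for a document that is not stored in the index.
    // "twitter"/"tweet" only select the mapping (and therefore the analyzers);
    // the id is null because validate() now accepts either an id or a doc.
    public static TermVectorResponse termVectors(Client client) throws IOException {
        XContentBuilder doc = jsonBuilder()
                .startObject()
                    .field("fullname", "John Doe")
                    .field("text", "twitter test test test")
                .endObject();

        return new TermVectorRequestBuilder(client, "twitter", "tweet", null)
                .setDoc(doc)            // new in this commit
                .execute()
                .actionGet();
    }
}
--------------------------------------------------
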
34 changes: 32 additions & 2 deletions docs/reference/docs/multi-termvectors.asciidoc
@@ -1,7 +1,10 @@
[[docs-multi-termvectors]]
== Multi termvectors API

Multi termvectors API allows to get multiple termvectors based on an index, type and id. The response includes a `docs`
Multi termvectors API allows you to get multiple termvectors at once. The
documents from which to retrieve the term vectors are specified by an index,
type and id. The documents can also be artificially provided coming[1.4.0].
The response includes a `docs`
array with all the fetched termvectors, each element having the structure
provided by the <<docs-termvectors,termvectors>>
API. Here is an example:
@@ -89,4 +92,31 @@ curl 'localhost:9200/testidx/test/_mtermvectors' -d '{
}'
--------------------------------------------------

Parameters can also be set by passing them as URI parameters (see <<docs-termvectors,termvectors>>). URI parameters are the default parameters and are overwritten by any parameter setting defined in the body.
Additionally coming[1.4.0], just like for the <<docs-termvectors,termvectors>>
API, term vectors can be generated for user-provided documents. The syntax
is similar to the <<search-percolate,percolator>> API. The mapping used is
determined by `_index` and `_type`.

[source,js]
--------------------------------------------------
curl 'localhost:9200/_mtermvectors' -d '{
"docs": [
{
"_index": "testidx",
"_type": "test",
"doc" : {
"fullname" : "John Doe",
"text" : "twitter test test test"
}
},
{
"_index": "testidx",
"_type": "test",
"doc" : {
"fullname" : "Jane Doe",
"text" : "Another twitter test ..."
}
}
]
}'
--------------------------------------------------
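
As an editor-added aside, the same multi-request could plausibly be issued from the Java client by adding one `TermVectorRequest` per artificial document; `MultiTermVectorsRequest#add(TermVectorRequest)`, `Client#multiTermVectors`, the public `TermVectorRequest(index, type, id)` constructor, and the `null` id are assumptions about the 1.x Java API rather than anything this commit spells out.

[source,java]
--------------------------------------------------
import org.elasticsearch.action.termvector.MultiTermVectorsRequest;
import org.elasticsearch.action.termvector.MultiTermVectorsResponse;
import org.elasticsearch.action.termvector.TermVectorRequest;
import org.elasticsearch.client.Client;

import java.io.IOException;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public class ArtificialDocsMultiTermVectors {

    // One TermVectorRequest per artificial document: _index/_type select the
    // mapping; no id is given because the documents are not stored in the index.
    public static MultiTermVectorsResponse forArtificialDocs(Client client) throws IOException {
        TermVectorRequest john = new TermVectorRequest("testidx", "test", null)
                .doc(jsonBuilder().startObject()
                        .field("fullname", "John Doe")
                        .field("text", "twitter test test test")
                        .endObject());

        TermVectorRequest jane = new TermVectorRequest("testidx", "test", null)
                .doc(jsonBuilder().startObject()
                        .field("fullname", "Jane Doe")
                        .field("text", "Another twitter test ...")
                        .endObject());

        MultiTermVectorsRequest request = new MultiTermVectorsRequest();
        request.add(john);
        request.add(jane);
        return client.multiTermVectors(request).actionGet();
    }
}
--------------------------------------------------
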
47 changes: 38 additions & 9 deletions docs/reference/docs/termvectors.asciidoc
@@ -3,10 +3,11 @@

added[1.0.0.Beta1]

Returns information and statistics on terms in the fields of a
particular document as stored in the index. Note that this is a
near realtime API as the term vectors are not available until the
next refresh.
Returns information and statistics on terms in the fields of a particular
document. The document can either be stored in the index or artificially
provided by the user coming[1.4.0]. Note that for documents stored in the
index, this is a near realtime API as the term vectors are not available until
the next refresh.

[source,js]
--------------------------------------------------
@@ -41,10 +42,10 @@ statistics are returned for all fields but no term statistics.
* term payloads (`payloads` : true), as base64 encoded bytes

If the requested information wasn't stored in the index, it will be
computed on the fly if possible. See <<mapping-types,type mapping>>
for how to configure your index to store term vectors.
computed on the fly if possible. Additionally, term vectors can also be
computed for documents that do not exist in the index, but are instead
provided by the user.

coming[1.4.0,The ability to compute term vectors on the fly is only available from 1.4.0 onwards (see below)]
coming[1.4.0,The ability to compute term vectors on the fly as well as support for artificial documents is only available from 1.4.0 onwards (see examples 2 and 3 below, respectively)]

[WARNING]
======
@@ -86,7 +87,9 @@ The term and field statistics
are not taken into account. The information is only retrieved for the
shard the requested document resides in. The term and field statistics
are therefore only useful as relative measures whereas the absolute
numbers have no meaning in this context.
numbers have no meaning in this context. By default, when requesting
term vectors of artificial documents, the shard from which to fetch the
statistics is selected at random. Use `routing` only to hit a particular shard.

[float]
=== Example 1
@@ -231,7 +234,7 @@ Response:
[float]
=== Example 2 coming[1.4.0]

Additionally, term vectors which are not explicitly stored in the index are automatically
Term vectors which are not explicitly stored in the index are automatically
computed on the fly. The following request returns all information and statistics for the
fields in document `1`, even though the terms haven't been explicitly stored in the index.
Note that for the field `text`, the terms are not re-generated.
@@ -246,3 +249,29 @@ curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvector?pretty=true' -d '{
"field_statistics" : true
}'
--------------------------------------------------
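
For comparison, a hedged Java-client equivalent of the request above; `prepareTermVector` and the individual `set*` flags are assumed to match the existing 1.x builder and are not part of this commit.

[source,java]
--------------------------------------------------
import org.elasticsearch.action.termvector.TermVectorResponse;
import org.elasticsearch.client.Client;

public class OnTheFlyTermVectors {

    // Asks for all information and statistics on document 1; term vectors for
    // fields that are not stored get computed on the fly.
    public static TermVectorResponse allStats(Client client) {
        return client.prepareTermVector("twitter", "tweet", "1")
                .setOffsets(true)
                .setPositions(true)
                .setPayloads(true)
                .setTermStatistics(true)
                .setFieldStatistics(true)
                .execute()
                .actionGet();
    }
}
--------------------------------------------------
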

[float]
=== Example 3 coming[1.4.0]

Term vectors can also be generated for artificial documents, that is, for
documents not present in the index. The syntax is similar to the
<<search-percolate,percolator>> API. For example, the following request would
return the same results as in example 1. The mapping used is determined by
`index` and `type`.

[WARNING]
======
If dynamic mapping is turned on (default), document fields that are not in the
original mapping will be dynamically created.
======

[source,js]
--------------------------------------------------
curl -XGET 'http://localhost:9200/twitter/tweet/_termvector' -d '{
"doc" : {
"fullname" : "John Doe",
"text" : "twitter test test test"
}
}'
--------------------------------------------------
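
Editor's note: as the earlier warning explains, artificial documents pull their statistics from a randomly selected shard unless routing is supplied. A minimal sketch of pinning the shard from the Java client follows; the `setRouting` call assumes the existing builder setter, and the routing value `kimchy` is purely illustrative.

[source,java]
--------------------------------------------------
import org.elasticsearch.action.termvector.TermVectorRequestBuilder;
import org.elasticsearch.action.termvector.TermVectorResponse;
import org.elasticsearch.client.Client;

import java.io.IOException;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public class RoutedArtificialDocTermVectors {

    // An explicit routing value keeps the field/term statistics coming from
    // one particular shard instead of a randomly picked one.
    public static TermVectorResponse routed(Client client) throws IOException {
        return new TermVectorRequestBuilder(client, "twitter", "tweet", null)
                .setDoc(jsonBuilder().startObject()
                        .field("fullname", "John Doe")
                        .field("text", "twitter test test test")
                        .endObject())
                .setRouting("kimchy")   // hypothetical routing value
                .execute()
                .actionGet();
    }
}
--------------------------------------------------
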

@@ -90,7 +90,6 @@ public void add(TermVectorRequest template, BytesReference data) throws Exceptio
if (token == XContentParser.Token.FIELD_NAME) {
currentFieldName = parser.currentName();
} else if (token == XContentParser.Token.START_ARRAY) {

if ("docs".equals(currentFieldName)) {
while ((token = parser.nextToken()) != XContentParser.Token.END_ARRAY) {
if (token != XContentParser.Token.START_OBJECT) {
@@ -26,12 +26,17 @@
import org.elasticsearch.action.ValidateActions;
import org.elasticsearch.action.get.MultiGetRequest;
import org.elasticsearch.action.support.single.shard.SingleShardOperationRequest;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentParser;

import java.io.IOException;
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

/**
* Request returning the term vector (doc frequency, positions, offsets) for a
@@ -46,10 +51,14 @@ public class TermVectorRequest extends SingleShardOperationRequest<TermVectorReq

private String id;

private BytesReference doc;

private String routing;

protected String preference;

private static final AtomicInteger randomInt = new AtomicInteger(0);

// TODO: change to String[]
private Set<String> selectedFields;

Expand Down Expand Up @@ -129,6 +138,23 @@ public TermVectorRequest id(String id) {
return this;
}

/**
* Returns the artificial document for which term vectors are requested.
*/
public BytesReference doc() {
return doc;
}

/**
* Sets the artificial document for which term vectors are requested.
*/
public TermVectorRequest doc(XContentBuilder documentBuilder) {
// assign a random id to this artificial document, for routing
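// the generated id is never used to look anything up; without explicit routing
// it only determines which shard serves the request, which is why statistics
// for artificial documents effectively come from a randomly selected shard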
this.id(String.valueOf(randomInt.getAndAdd(1)));
this.doc = documentBuilder.bytes();
return this;
}

/**
* @return The routing for this request.
*/
@@ -281,8 +307,8 @@ public ActionRequestValidationException validate() {
if (type == null) {
validationException = ValidateActions.addValidationError("type is missing", validationException);
}
if (id == null) {
validationException = ValidateActions.addValidationError("id is missing", validationException);
if (id == null && doc == null) {
validationException = ValidateActions.addValidationError("id or doc is missing", validationException);
}
return validationException;
}
@@ -303,6 +329,12 @@ public void readFrom(StreamInput in) throws IOException {
}
type = in.readString();
id = in.readString();

if (in.getVersion().onOrAfter(Version.V_1_4_0)) {
if (in.readBoolean()) {
doc = in.readBytesReference();
}
}
routing = in.readOptionalString();
preference = in.readOptionalString();
long flags = in.readVLong();
@@ -331,6 +363,13 @@ public void writeTo(StreamOutput out) throws IOException {
}
out.writeString(type);
out.writeString(id);

if (out.getVersion().onOrAfter(Version.V_1_4_0)) {
out.writeBoolean(doc != null);
if (doc != null) {
out.writeBytesReference(doc);
}
}
out.writeOptionalString(routing);
out.writeOptionalString(preference);
long longFlags = 0;
@@ -389,7 +428,15 @@ public static void parseRequest(TermVectorRequest termVectorRequest, XContentPar
} else if ("_type".equals(currentFieldName)) {
termVectorRequest.type = parser.text();
} else if ("_id".equals(currentFieldName)) {
if (termVectorRequest.doc != null) {
throw new ElasticsearchParseException("Either \"id\" or \"doc\" can be specified, but not both!");
}
termVectorRequest.id = parser.text();
} else if ("doc".equals(currentFieldName)) {
if (termVectorRequest.id != null) {
throw new ElasticsearchParseException("Either \"id\" or \"doc\" can be specified, but not both!");
}
termVectorRequest.doc(jsonBuilder().copyCurrentStructure(parser));
} else if ("_routing".equals(currentFieldName) || "routing".equals(currentFieldName)) {
termVectorRequest.routing = parser.text();
} else {
@@ -398,7 +445,6 @@ public static void parseRequest(TermVectorRequest termVectorRequest, XContentPar
}
}
}

if (fields.size() > 0) {
String[] fieldsAsArray = new String[fields.size()];
termVectorRequest.selectedFields(fields.toArray(fieldsAsArray));
@@ -22,6 +22,7 @@
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.ActionRequestBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.xcontent.XContentBuilder;

/**
*/
@@ -35,6 +36,38 @@ public TermVectorRequestBuilder(Client client, String index, String type, String
super(client, new TermVectorRequest(index, type, id));
}

/**
* Sets the index where the document is located.
*/
public TermVectorRequestBuilder setIndex(String index) {
request.index(index);
return this;
}

/**
* Sets the type of the document.
*/
public TermVectorRequestBuilder setType(String type) {
request.type(type);
return this;
}

/**
* Sets the id of the document.
*/
public TermVectorRequestBuilder setId(String id) {
request.id(id);
return this;
}

/**
* Sets the artificial document from which to generate term vectors.
*/
public TermVectorRequestBuilder setDoc(XContentBuilder xContent) {
request.doc(xContent);
return this;
}

/**
* Sets the routing. Required if routing isn't id based.
*/
