Support for shard level caching of term vectors #8395
Conversation
This commit adds caching support to the Term Vectors API. A new `_cache` body parameter is introduced in the term vector request. When set to `true`, the shard query cache is solicited so as to keep the same near real-time promise as uncached requests. This caching mechanism makes sense in an MLT scenario in which the same term vector request is performed multiple times, once per shard, or when the request is especially expensive, for example when asking for term statistics or distributed frequencies. In order to keep the real-time promise of the Term Vectors API, caching is set to `false` by default. Additionally, term vector requests are now timed.
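A hypothetical request showing how the proposed `_cache` flag would be used (the index, type, and id are made up; the endpoint and the other fields follow the existing Term Vectors API of the time, only `_cache` is new in this PR):

```json
GET /twitter/tweet/1/_termvector
{
  "fields": ["text"],
  "term_statistics": true,
  "field_statistics": true,
  "_cache": true
}
```

On the first such request the term vectors are computed and stored in the shard query cache; an identical subsequent request against the same reader version would be served from the cache.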
I would be OK with pushing such a work-around temporarily if it is needed for performance reasons (is it, or does the FS cache already do a good job?), but I think the right fix would be to parse the MLT query and fetch term vectors on the coordinating node only, and then send to the shards all the information that they need to find similar documents. I know this is a high-hanging fruit given the current design, but it would also be useful to other areas of Elasticsearch...
Yes, I agree, and this would provide a first-hand solution to deprecating the MLT API. Also note that caching is set to `false` by default. Should we use the same configuration parameters as the query cache, or should this have its own independent set of parameters?
@@ -421,10 +442,18 @@ public void readFrom(StreamInput in) throws IOException {
        }
        this.realtime = in.readBoolean();
    }
    if (in.getVersion().onOrAfter(Version.V_2_0_0)) {
No need for these version checks, a full cluster restart will be required for 2.0 anyway
Something that worries me a bit here is that this is using the same cache as the query cache. And since cache keys consist of the JSON request and the reader version, you could have collisions on requests that would be valid for both the
I agree with @jpountz. I think we should wrap the TV request with a special cache key prefix to make sure we don't collide with search requests. Would that fix your concerns?
Not sure about prefixing; what about adding a new property to IndicesQueryCache.Key to describe what the cache value is? This could e.g. be
That seems like a cleaner solution; I'm OK either way.
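A minimal sketch of the idea discussed above: extend the cache key with an origin marker so that a term vectors request and a search request with an identical serialized body and reader version can never collide. The class and field names here (`CacheKey`, `Origin`, `requestBytes`) are invented for illustration; the actual Elasticsearch class under discussion is `IndicesQueryCache.Key`.

```java
import java.util.Objects;

public class CacheKey {
    // Hypothetical discriminator distinguishing what kind of request produced the entry.
    enum Origin { SEARCH, TERM_VECTORS }

    final Origin origin;
    final long readerVersion;       // versions the cached entry against index changes
    final String requestBytes;      // stands in for the serialized request body

    CacheKey(Origin origin, long readerVersion, String requestBytes) {
        this.origin = origin;
        this.readerVersion = readerVersion;
        this.requestBytes = requestBytes;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CacheKey)) {
            return false;
        }
        CacheKey other = (CacheKey) o;
        // Including `origin` in equality is what prevents cross-API collisions.
        return origin == other.origin
                && readerVersion == other.readerVersion
                && requestBytes.equals(other.requestBytes);
    }

    @Override
    public int hashCode() {
        return Objects.hash(origin, readerVersion, requestBytes);
    }

    public static void main(String[] args) {
        CacheKey search = new CacheKey(Origin.SEARCH, 7L, "{\"fields\":[\"text\"]}");
        CacheKey termVectors = new CacheKey(Origin.TERM_VECTORS, 7L, "{\"fields\":[\"text\"]}");
        // Same bytes and reader version, but distinct origins, so no collision.
        System.out.println(search.equals(termVectors)); // prints "false"
    }
}
```

The alternative floated earlier, prefixing the serialized request, achieves the same separation but bakes the discriminator into opaque bytes; a typed property keeps it visible and cheap to compare.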
super.writeTo(out);
if (out.getVersion().before(Version.V_1_4_0_Beta1)) {
    // term vector used to read & write the index twice, here and in the parent class
    out.writeString(index);
}
out.writeString(type);
out.writeString(id);
if (!asKey || (asKey && this.doc() == null)) {
Can this be simplified to `if (!asKey || this.doc() == null)`?
@alexksikes Left minor comments. Can you also add tests to this PR (preferably unit tests)?
I also think we should raise an error when both "realtime" and "cache" are true, since they are mutually exclusive?
Actually, they are not quite mutually exclusive: having both "realtime" and "cache" set to true will generate the term vectors from the document in the transaction log on the first request, if no other request like it has been performed. So it is real-time on the first unseen request, and then NRT on subsequent requests because of the cache.
If you specify
Sounds good, but one issue I'm not completely satisfied about is that since
To be honest, I'm worried about the complexity that we are adding here compared to the value, since the option cannot be on by default. I think we already have too much complexity in Elasticsearch and should not add more unless necessary. So I'd vote to close this PR and resurrect it if there are performance issues, due to multiple calls to the Term Vectors API, that would be solved with the cache.
That's a possibility, but I don't think it adds that much complexity. Note that we go from 30ms (file system cache) to 1ms on most requests when this caching mechanism is on, which it will be by default when performing an MLT query. Also, one of the raisons d'être of this PR is to deprecate the MLT API. So either we push this PR, or we fix the problems with the current MLT API in #8028, or we do both. I'll let @s1monw cast the final vote on this one.
Closing in favor of #10217 |