
Support for shard level caching of term vectors #8395

Closed
alexksikes wants to merge 4 commits

Conversation

@alexksikes (Contributor)

This commit adds caching support to the Term Vectors API. A new `_cache` body
parameter is introduced in the term vector request. When set to `true`, the
shard query cache is used, so as to keep the same near-real-time promise as
uncached requests. This caching mechanism makes sense in an MLT scenario in
which the same term vector request is performed multiple times, once per
shard, or when the request is especially expensive, for example when asking
for term statistics or distributed frequencies.

To keep the real-time promise of the Term Vectors API, caching is set to
`false` by default. Additionally, term vector requests are now timed.
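As an illustrative sketch of the request body described above (the `_cache` flag is the parameter this PR introduces; the other fields are standard term vector options, and the exact syntax may differ between versions):

```json
{
  "fields": ["body"],
  "term_statistics": true,
  "field_statistics": true,
  "realtime": false,
  "_cache": true
}
```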
@jpountz (Contributor) commented Nov 17, 2014

I would be OK to push such a work-around temporarily if it is needed for performance reasons (is it, or does the FS cache already do a good job?), but I think the right fix would be to parse the MLT query and fetch term vectors on the coordinating node only, and then send to the shards all the information they need to find similar documents. I know this is a high-hanging fruit given the current design, but this would also be useful in other areas of Elasticsearch...

@alexksikes (Contributor, Author)

Yes, I agree, and this would provide a first-hand solution to deprecating the MLT API. Also note that caching is set to `false` by default. There are also some plans of reducing MLT to TVs with dfs only. Beyond MLT this caching mechanism might be useful; think of a blog that generates a word cloud of its top entries each time the user hits the front page. @s1monw WDYT?

Should we use the same configuration parameters as the Query Cache, or should this have its own independent set of parameters?

@@ -421,10 +442,18 @@ public void readFrom(StreamInput in) throws IOException {
}
this.realtime = in.readBoolean();
}
if (in.getVersion().onOrAfter(Version.V_2_0_0)) {

No need for these version checks; a full cluster restart will be required for 2.0 anyway.

@jpountz (Contributor) commented Nov 27, 2014

Something that worries me a bit here is that this is using the same cache as the query cache. And since cache keys consist of the JSON request and the reader version, you could have collisions between requests that would be valid for both the `_search` and `_termvectors` APIs.

@s1monw (Contributor) commented Nov 28, 2014

I agree with @jpountz. I think we should wrap the TV request with a special cache key prefix to make sure we don't collide with the search request. Would that fix your concerns?

@jpountz (Contributor) commented Nov 28, 2014

Not sure about prefixing; what about adding a new property to `IndicesQueryCache.Key` to describe what the cache value is? This could e.g. be `_search` for search requests and `_termvectors` for term vector requests.

@alexksikes (Contributor, Author)

That seems to be a cleaner solution; I'm OK either way.
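A minimal standalone sketch of the idea discussed here, with hypothetical names (the real `IndicesQueryCache.Key` also carries the shard and is not shown): a cache key that includes the API type, so identical JSON bodies sent to `_search` and `_termvectors` map to distinct cache entries.

```java
import java.util.Objects;

// Hypothetical simplified cache key: the apiType field keeps
// otherwise-identical _search and _termvectors requests from colliding.
final class ApiCacheKey {
    final String apiType;     // e.g. "_search" or "_termvectors"
    final long readerVersion; // reader version the entry was computed against
    final String requestJson; // serialized request body

    ApiCacheKey(String apiType, long readerVersion, String requestJson) {
        this.apiType = apiType;
        this.readerVersion = readerVersion;
        this.requestJson = requestJson;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ApiCacheKey)) return false;
        ApiCacheKey k = (ApiCacheKey) o;
        return readerVersion == k.readerVersion
                && apiType.equals(k.apiType)
                && requestJson.equals(k.requestJson);
    }

    @Override
    public int hashCode() {
        return Objects.hash(apiType, readerVersion, requestJson);
    }
}
```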

super.writeTo(out);
if (out.getVersion().before(Version.V_1_4_0_Beta1)) {
//term vector used to read & write the index twice, here and in the parent class
out.writeString(index);
}
out.writeString(type);
out.writeString(id);
if (!asKey || (asKey && this.doc() == null)) {

This can be simplified to `if (!asKey || this.doc() == null)`?
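The suggested simplification is just the boolean absorption law: `!a || (a && b)` is equivalent to `!a || b`. A quick exhaustive check (standalone sketch, not the PR's code):

```java
// Both forms of the condition from the review comment above;
// absorption guarantees they agree on all four input combinations.
final class AbsorptionCheck {
    static boolean original(boolean asKey, boolean docIsNull) {
        return !asKey || (asKey && docIsNull);
    }

    static boolean simplified(boolean asKey, boolean docIsNull) {
        return !asKey || docIsNull;
    }
}
```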

@jpountz (Contributor) commented Dec 4, 2014

@alexksikes Left minor comments. Can you also add tests to this PR (preferably unit tests)?

@jpountz (Contributor) commented Dec 4, 2014

I also think we should raise an error when both `realtime` and `cache` are true, since they are mutually exclusive.

@alexksikes (Contributor, Author)

Actually, they are not quite mutually exclusive: having both `realtime` and `cache` set to true will generate the term vectors from the document in the transaction log on the first request, if no other request like it has been performed. So it is realtime on the first unseen request, and then NRT on subsequent requests because of the cache.

@jpountz (Contributor) commented Dec 5, 2014

If you specify `realtime: true` and get results for old data, I think it's a bug. I would rather fail requests that ask for both realtime results and caching.

@alexksikes (Contributor, Author)

Sounds good, but one issue I'm not completely satisfied with is that since `realtime: true` is the default, to use the cache the user would have to set two options: `realtime: false` and `_cache: true`. But I think it is fine this way. Thanks for the comments!
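The validation agreed on above could be sketched as follows (hypothetical standalone code, not the PR's actual implementation): reject requests that set both options, since a cache hit could serve stale data and silently break the realtime promise.

```java
// Hypothetical sketch of validating a term vectors request:
// realtime and _cache are mutually exclusive, so asking for both
// is reported as an error rather than silently ignored.
final class TermVectorsRequestCheck {
    static String validate(boolean realtime, boolean cache) {
        if (realtime && cache) {
            return "realtime and _cache cannot both be true; "
                    + "set realtime=false to use the shard cache";
        }
        return null; // request is valid
    }
}
```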

@jpountz (Contributor) commented Dec 17, 2014

To be honest, I'm worried about the complexity we are adding here compared to the value, since the option cannot be on by default. I think we already have too much complexity in Elasticsearch and should not add more unless necessary. So I'd vote to close this PR and resurrect it if there are performance issues, caused by multiple calls to the term vectors API, that the cache would solve.

@alexksikes (Contributor, Author)

That's a possibility, but I don't think it adds that much complexity. Note that we go from 30ms (file system cache) to 1ms on most requests when this caching mechanism is on, which it will be by default when performing an MLT query.

Also, one of the raisons d'être of this PR is to deprecate the MLT API. So either we push this PR, or we fix the problems with the current MLT API with this other PR #8028, or we do both. I'll let @s1monw make the final call on this one.

@alexksikes alexksikes added stalled and removed review labels Jan 26, 2015
@clintongormley clintongormley changed the title Term Vectors: Support for shard level caching Support for shard level caching of term vectors Jun 7, 2015
@alexksikes (Contributor, Author)

Closing in favor of #10217

@alexksikes alexksikes closed this Jul 6, 2015
@alexksikes alexksikes deleted the feature/tv-caching branch July 6, 2015 22:05
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Term Vectors labels Feb 14, 2018
Labels: >enhancement, :Search/Search, stalled, v2.0.0-beta1