Support for shard level caching of term vectors #8395
Conversation
This commit adds caching support to the Term Vectors API. A new `_cache` body parameter is introduced in the term vector request. When set to `true`, the shard query cache is solicited so as to keep the same near real-time promise as uncached requests. This caching mechanism makes sense in an MLT scenario in which the same term vector request is performed multiple times, once per shard, or when the request is especially expensive, for example when asking for term statistics or distributed frequencies. In order to keep the real-time promise of the Term Vectors API, caching is set to `false` by default. Additionally, term vector requests are now timed.
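A hypothetical request showing how the proposed `_cache` flag would be used (the index, type, and id are made up; the endpoint and the other fields follow the existing Term Vectors API of the time, only `_cache` is new in this PR):

```json
GET /twitter/tweet/1/_termvector
{
  "fields": ["text"],
  "term_statistics": true,
  "field_statistics": true,
  "_cache": true
}
```

On the first such request the term vectors are computed and stored in the shard query cache; an identical subsequent request against the same reader version would be served from the cache.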
I would be OK with pushing such a work-around temporarily if it is needed for performance reasons (is it, or does the FS cache already do a good job?), but I think the right fix would be to parse the MLT query and fetch term vectors on the coordinating node only, and then send to the shards all the information that they need to find similar documents. I know this is a high-hanging fruit given the current design, but it would also be useful to other areas of Elasticsearch...
Yes, I agree, and this would provide a first-hand solution to deprecating the MLT API. Also note that caching is set to `false` by default. Should we use the same configuration parameters as the query cache, or should this have its own independent set of parameters?
@@ -421,10 +442,18 @@ public void readFrom(StreamInput in) throws IOException {
        }
        this.realtime = in.readBoolean();
    }
    if (in.getVersion().onOrAfter(Version.V_2_0_0)) {
No need for these version checks, a full cluster restart will be required for 2.0 anyway
Something that worries me a bit here is that this is using the same cache as the query cache. And since cache keys consist of the JSON request and the reader version, you could have collisions on requests that would be valid for both the
I agree with @jpountz. I think we should wrap the TV request with a special cache key prefix to make sure we don't collide with search requests. Would that fix your concerns?
Not sure about prefixing; what about adding a new property to IndicesQueryCache.Key to describe what the cache value is? This could e.g. be
That seems like a cleaner solution; I'm OK either way.
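A minimal sketch of the idea discussed above: extend the cache key with an origin marker so that a term vectors request and a search request with an identical serialized body and reader version can never collide. The class and field names here (`CacheKey`, `Origin`, `requestBytes`) are invented for illustration; the actual Elasticsearch class under discussion is `IndicesQueryCache.Key`.

```java
import java.util.Objects;

public class CacheKey {
    // Hypothetical discriminator distinguishing what kind of request produced the entry.
    enum Origin { SEARCH, TERM_VECTORS }

    final Origin origin;
    final long readerVersion;       // versions the cached entry against index changes
    final String requestBytes;      // stands in for the serialized request body

    CacheKey(Origin origin, long readerVersion, String requestBytes) {
        this.origin = origin;
        this.readerVersion = readerVersion;
        this.requestBytes = requestBytes;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CacheKey)) {
            return false;
        }
        CacheKey other = (CacheKey) o;
        // Including `origin` in equality is what prevents cross-API collisions.
        return origin == other.origin
                && readerVersion == other.readerVersion
                && requestBytes.equals(other.requestBytes);
    }

    @Override
    public int hashCode() {
        return Objects.hash(origin, readerVersion, requestBytes);
    }

    public static void main(String[] args) {
        CacheKey search = new CacheKey(Origin.SEARCH, 7L, "{\"fields\":[\"text\"]}");
        CacheKey termVectors = new CacheKey(Origin.TERM_VECTORS, 7L, "{\"fields\":[\"text\"]}");
        // Same bytes and reader version, but distinct origins, so no collision.
        System.out.println(search.equals(termVectors)); // prints "false"
    }
}
```

The alternative floated earlier, prefixing the serialized request, achieves the same separation but bakes the discriminator into opaque bytes; a typed property keeps it visible and cheap to compare.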
super.writeTo(out);
if (out.getVersion().before(Version.V_1_4_0_Beta1)) {
    // term vector used to read & write the index twice, here and in the parent class
    out.writeString(index);
}
out.writeString(type);
out.writeString(id);
if (!asKey || (asKey && this.doc() == null)) {
Can this be simplified to `if (!asKey || this.doc() == null)`?
@alexksikes Left minor comments. Can you also add tests to this PR (preferably unit tests)?
I also think we should raise an error when both "realtime" and "cache" are true, since they are mutually exclusive?
Actually, they are not quite mutually exclusive: having both "realtime" and "cache" set to true will generate the term vectors from the document in the transaction log on the first request, if no other request like it has been performed. So it is real-time on the first unseen request, and then NRT on subsequent requests because of the cache.
If you specify
Sounds good, but one issue I'm not completely satisfied about is that since
To be honest, I'm worried about the complexity that we are adding here compared to the value, since the option cannot be on by default. I think we already have too much complexity in Elasticsearch and should not add more unless necessary. So I'd vote to close this PR and resurrect it if there are performance issues, due to multiple calls to the Term Vectors API, that would be solved with the cache.
That's a possibility, but I don't think it adds that much complexity. Note that we go from 30ms (file system cache) to 1ms on most requests when this caching mechanism is on, which it will be by default when performing an MLT query. Also, one of the raisons d'être of this PR is to deprecate the MLT API. So either we push this PR, or we fix the problems with the current MLT API in #8028, or we do both. I'll let @s1monw cast the final vote on this one.
Closing in favor of #10217 |