Critical Lucene LRUQueryCache bug may cause query cache to evict all cached items #100755
Comments
Pinging @elastic/es-search (Team:Search)
This has been fixed in Lucene by apache/lucene#12614
@psc0606 Could you share what you did to resolve the issue on your cluster? I may be running into something similar on my end.
The fix was included in Lucene 9.9, released earlier this week. Elasticsearch 8.12 will be based on it and contain the fix. Is there anything left to fix on the Elasticsearch side?
@YousF9 You can wait for the Elasticsearch 8.12 release, as @javanna said, or you can download the source code of the corresponding Lucene version, apply the change from apache/lucene#12614, compile and package it into a jar, replace the Lucene jar in the Elasticsearch cluster, and then restart node by node.
@javanna No fixes required on the Elasticsearch side.
We are also facing a similar issue with cache evictions after upgrading Elasticsearch from 8.7 to 8.10. Even after upgrading to Elasticsearch 8.12, we still see similar behaviour: cache evictions happen even though we have not crossed the cache size threshold (3.5 GB, 10% of heap), the cache size is only around 2.5 GB, and the number of queries does not exceed 3k. Could you let us know if you have any suggestions?
Elasticsearch Version
8.9.0
Installed Plugins
No response
Java Version
bundled
OS Version
Linux 3.10.0-514.21.1.el7.x86_64 #1 SMP Thu May 25 17:04:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Affected versions
Elasticsearch 8.0.0 and later
Lucene 9.0.0 and later
Problem Description
Refer to Lucene bug apache/lucene#12614 (Lucene 9.7.0).
Recently we upgraded our ES cluster from 7.1.1 to 8.9.0, but after running for a few days we got lots of slow-query logs.
We pulled some slow query requests from the logs and analyzed them with the profile API. We could not reproduce the problem through the profile API, where the search cost was under 10ms; but when we ran the same DSL normally, without profile, the search cost was above 200ms.
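For reference, a profiled search looks like the following (a sketch in Dev Tools console syntax; the index name and query body are placeholders, not the actual slow query from our logs):

```
GET /my-index/_search
{
  "profile": true,
  "query": {
    "match": { "message": "example" }
  }
}
```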
Therefore, we checked some monitoring indicators in Grafana and found an abnormality in the query cache.
(To reproduce the problem, we set `indices.queries.cache.size` to 350mb.)

[Grafana screenshot: ramBytesUsed]
[Grafana screenshot: search time cost]
We have identified the cause and fixed the issue in our cluster. The root of the problem lies in the above-mentioned Lucene bug apache/lucene#12614.
Lucene bug explanation:

The class `ElasticsearchLRUQueryCache` inherits from Lucene's `LRUQueryCache`. `LRUQueryCache` has two conditions that trigger cache eviction:

- `indices.queries.cache.size`, default 10% of the JVM heap size.
- `indices.queries.cache.count`, default 10_000.

If either condition is met, cache eviction is triggered. But Lucene fails to calculate `ramBytesUsed` correctly.

When a `query` and its `cached` entry are put into `LRUQueryCache`, a query that implements `Accountable` is accounted as `((Accountable) query).ramBytesUsed()`; otherwise it is accounted at the constant `QUERY_DEFAULT_RAM_BYTES_USED` size. But on cache eviction, only the constant `QUERY_DEFAULT_RAM_BYTES_USED` is subtracted, which can be smaller than the `ramBytesUsed()` that was added on insertion.

The result of long-running cache eviction is that the cache is actually empty, but `ramBytesUsed` is still larger than `indices.queries.cache.size`. Even if the cache is cleared, the eviction conditions are still met, meaning no item will ever be cached again.
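To make the asymmetry concrete, here is a minimal, self-contained Java sketch of the accounting drift (a simplified model, not the actual Lucene source; the class and field names are invented for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of the accounting bug; not the real Lucene LRUQueryCache.
public class CacheAccountingModel {
    static final long QUERY_DEFAULT_RAM_BYTES_USED = 1024;

    interface Accountable {
        long ramBytesUsed();
    }

    private final long maxRamBytes;
    private final Deque<Object> lru = new ArrayDeque<>();
    private long ramBytesUsed;

    public CacheAccountingModel(long maxRamBytes) {
        this.maxRamBytes = maxRamBytes;
    }

    public void put(Object query) {
        // On put, an Accountable query is counted at its real size.
        ramBytesUsed += (query instanceof Accountable a)
                ? a.ramBytesUsed()
                : QUERY_DEFAULT_RAM_BYTES_USED;
        lru.addLast(query);

        while (ramBytesUsed > maxRamBytes && !lru.isEmpty()) {
            lru.removeFirst();
            // BUG: eviction always subtracts the constant, even for
            // Accountable queries that were added at a larger size,
            // so ramBytesUsed drifts upward over time.
            ramBytesUsed -= QUERY_DEFAULT_RAM_BYTES_USED;
        }
        // Once ramBytesUsed > maxRamBytes with an empty lru,
        // every future put is immediately evicted: the cache is dead.
    }

    public static void main(String[] args) {
        CacheAccountingModel cache = new CacheAccountingModel(8 * 1024);
        Accountable bigQuery = () -> 4096; // reports 4x the default size
        for (int i = 0; i < 10; i++) {
            cache.put(bigQuery);
        }
        // Each put adds 4096 but each eviction removes only 1024, so the
        // counter ends far above the 8 KB limit with nothing cached.
        System.out.println("ramBytesUsed = " + cache.ramBytesUsed
                + ", cached items = " + cache.lru.size());
    }
}
```

The fix in apache/lucene#12614 makes eviction use the same per-query accounting as insertion, so the counter returns to zero once the cache is empty.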
Slow-log DSL example:
Steps to Reproduce

1. Set `indices.queries.cache.size` to 350mb (see the config snippet below).
2. Leave `indices.queries.cache.count` at its default value.
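A minimal `elasticsearch.yml` snippet for the reproduction (a sketch; `indices.queries.cache.size` is a static node-level setting, so each node must be restarted for it to take effect):

```yaml
# Shrink the query cache so the ramBytesUsed drift shows up quickly.
indices.queries.cache.size: 350mb
# indices.queries.cache.count is intentionally left at its default (10000).
```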
Logs (if relevant)
No response