
Critical Lucene LRUQueryCache bug may cause the query cache to evict all cached items #100755

Closed
psc0606 opened this issue Oct 12, 2023 · 7 comments
Labels
>bug · :Search/Search (Search-related issues that do not fall into other categories) · Team:Search (Meta label for search team)

Comments

@psc0606

psc0606 commented Oct 12, 2023

Elasticsearch Version

8.9.0

Installed Plugins

No response

Java Version

bundled

OS Version

Linux 3.10.0-514.21.1.el7.x86_64 #1 SMP Thu May 25 17:04:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Affected Versions

Elasticsearch 8.0.0 and later
Lucene 9.0.0 and later

Problem Description

Refer to the Lucene 9.7.0 bug apache/lucene#12614.

Recently we upgraded our Elasticsearch cluster from 7.1.1 to 8.9.0, but after running for a few days we started seeing lots of slow query logs.
We took some of the slow requests from the logs and analyzed them with the profile API, but we could not reproduce the problem that way: with profiling enabled the search cost was under 10ms, while the same DSL run normally without profiling took more than 200ms.

We therefore checked some monitoring indicators in Grafana and found an anomaly in the query cache
(to reproduce the problem, we set indices.queries.cache.size to 350mb):
ramBytesUsed: [Grafana screenshots]

Search time cost: [Grafana screenshot]

We have identified the cause and fixed the issue in our cluster. The root of the problem is the above-mentioned Lucene bug apache/lucene#12614.

Lucene bug explanation:
The class ElasticsearchLRUQueryCache inherits from Lucene's LRUQueryCache. LRUQueryCache has two conditions that trigger cache eviction:

  • indices.queries.cache.size, default 10% of the JVM heap size
  • indices.queries.cache.count, default 10,000

If either condition is met, cache eviction is triggered. But Lucene fails to keep the ramBytesUsed accounting correct.

When a query and its cached result are put into LRUQueryCache, a query that implements Accountable is accounted for with ((Accountable) query).ramBytesUsed(); any other query is accounted for with the constant QUERY_DEFAULT_RAM_BYTES_USED.

  private void putIfAbsent(Query query, CacheAndCount cached, IndexReader.CacheHelper cacheHelper) {
    assert query instanceof BoostQuery == false;
    assert query instanceof ConstantScoreQuery == false;
    // under a lock to make sure that mostRecentlyUsedQueries and cache remain sync'ed
    lock.lock();
    try {
      Query singleton = uniqueQueries.putIfAbsent(query, query);
      if (singleton == null) {
        if (query instanceof Accountable) {
          onQueryCache(
              query, LINKED_HASHTABLE_RAM_BYTES_PER_ENTRY + ((Accountable) query).ramBytesUsed());
        } else {
          onQueryCache(query, LINKED_HASHTABLE_RAM_BYTES_PER_ENTRY + QUERY_DEFAULT_RAM_BYTES_USED);
        }
      } else {
        query = singleton;
      }
      final IndexReader.CacheKey key = cacheHelper.getKey();
      LeafCache leafCache = cache.get(key);
      if (leafCache == null) {
        leafCache = new LeafCache(key);
        final LeafCache previous = cache.put(key, leafCache);
        ramBytesUsed += HASHTABLE_RAM_BYTES_PER_ENTRY;
        assert previous == null;
        // we just created a new leaf cache, need to register a close listener
        cacheHelper.addClosedListener(this::clearCoreCacheKey);
      }
      leafCache.putIfAbsent(query, cached);
      evictIfNecessary();
    } finally {
      lock.unlock();
    }
  }

But on cache eviction, only the constant QUERY_DEFAULT_RAM_BYTES_USED is subtracted per query, which for Accountable queries is smaller than the amount that was added.

The result of long-running eviction is that the cache is actually empty, but ramBytesUsed is still larger than indices.queries.cache.size. Even though the cache has been cleared, the eviction condition remains satisfied, which means no item will ever be cached again.
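
To make the failure mode concrete, here is a minimal, self-contained sketch of the asymmetric accounting (only the size limit is modeled; the class, names, and byte sizes are made up for illustration and are not the actual Lucene source):

  import java.util.LinkedHashMap;
  import java.util.Map;

  // Toy model of the accounting mismatch: insertion charges the "real" size of an
  // Accountable query, but eviction only credits back a small constant.
  public class AsymmetricAccountingSketch {
    static final long QUERY_DEFAULT_BYTES = 1024;  // stand-in for QUERY_DEFAULT_RAM_BYTES_USED
    static final long MAX_RAM_BYTES = 16 * 1024;   // stand-in for indices.queries.cache.size

    final Map<String, Long> cache = new LinkedHashMap<>(); // query -> its real size in bytes
    long ramBytesUsed = 0;

    void put(String query, long realBytes) {
      cache.put(query, realBytes);
      ramBytesUsed += realBytes;                   // on insert: the real size is added
      // evict while the size limit is exceeded and there is something left to evict
      while (ramBytesUsed > MAX_RAM_BYTES && !cache.isEmpty()) {
        String eldest = cache.keySet().iterator().next();
        cache.remove(eldest);
        ramBytesUsed -= QUERY_DEFAULT_BYTES;       // on evict: only the constant is subtracted
      }
    }

    public static void main(String[] args) {
      AsymmetricAccountingSketch c = new AsymmetricAccountingSketch();
      for (int i = 0; i < 100; i++) {
        c.put("q" + i, 4096);                      // "large" Accountable-style queries
      }
      // The cache ends up empty, yet ramBytesUsed still exceeds the limit,
      // so the eviction condition stays satisfied and nothing can be cached anymore.
      System.out.println("cached=" + c.cache.size() + ", ramBytesUsed=" + c.ramBytesUsed);
    }
  }

Once the counter is stuck above the limit like this, every new entry is evicted immediately, which matches the empty-cache-but-high-ramBytesUsed pattern shown in the graphs above.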

Slow-log DSL example:

{"from":0,"size":8,"query":{"bool":{"filter":[{"terms":{"status":["active","blocked","soft_blocked"],"boost":1}},{"term":{"visitor":{"value":0}}},{"term":{"account_off":{"value":0}}},{"terms":{"user_role":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],"boost":1}}],"should":[{"constant_score":{"filter":{"term":{"nickname.keyword":{"value":"yin y"}}},"boost":5}},{"constant_score":{"filter":{"match_phrase_prefix":{"nickname.single":{"query":"yin y"}}},"boost":5}},{"constant_score":{"filter":{"match":{"nickname.single":{"query":"yin y","minimum_should_match":"100%"}}},"boost":1}},{"constant_score":{"filter":{"match_phrase":{"nickname.pinyin":{"query":"yin y"}}},"boost":1}},{"constant_score":{"filter":{"term":{"role":{"value":"AUTHOR"}}},"boost":0.5}}],"minimum_should_match":"1","boost":1}},"_source":{"includes":["id","nickname","is_official","user_role"],"excludes":[]},"sort":[{"_score":{"order":"desc"}},{"follower_count":{"order":"desc"}}]}

Steps to Reproduce

  1. Set indices.queries.cache.size to 350mb.
  2. Leave indices.queries.cache.count at its default value.
  3. Construct a query with a terms clause inside a filter, or any DSL that Elasticsearch will cache.
  4. Fill up the query cache.
  5. Keep triggering cache evictions for a few hours to reproduce the bug.

Logs (if relevant)

No response

psc0606 added the >bug and needs:triage labels on Oct 12, 2023
psc0606 changed the title from "Lucene LRUQueryCache bug may cause query-cache evict all cached item" to "Critical lucene LRUQueryCache bug may cause query cache evict all cached item" on Oct 12, 2023
mark-vieira added the :Search/Search label on Oct 12, 2023
elasticsearchmachine added the Team:Search label and removed the needs:triage label on Oct 12, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@romseygeek
Contributor

This has been fixed in lucene by apache/lucene#12614
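
The essence of the fix is to remove the mismatch described above, i.e. to credit back on eviction the same estimate that was charged on insertion. A hedged sketch of that idea follows; it is an illustration only, not the actual Lucene patch, and the helper name and the constant's value are assumptions:

  import org.apache.lucene.search.Query;
  import org.apache.lucene.util.Accountable;

  // Illustrative helper: compute one RAM estimate per query and use the same value both
  // when the query is cached and when it is evicted, so ramBytesUsed drops back to zero
  // once the cache is empty. (Class name and constant value are assumptions, not copied
  // from the Lucene patch.)
  final class QueryRamEstimate {
    private static final long QUERY_DEFAULT_RAM_BYTES_USED = 1024; // placeholder value

    static long of(Query query) {
      return query instanceof Accountable
          ? ((Accountable) query).ramBytesUsed()
          : QUERY_DEFAULT_RAM_BYTES_USED;
    }
  }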

@YousF9

YousF9 commented Dec 1, 2023

@psc0606 Could you share what you did to resolve the issue on your cluster? I may be running into something similar on my end.

@javanna
Member

javanna commented Dec 7, 2023

The fix was included in Lucene 9.9, released earlier this week. Elasticsearch 8.12 will be based on it and contain the fix. Is there anything left to fix on the Elasticsearch side?

@psc0606
Author

psc0606 commented Jan 2, 2024

@YousF9 You can wait for Elasticsearch 8.12, as @javanna said. Alternatively, download the source code of the corresponding Lucene version, apply the simple modifications from apache/lucene#12614, compile and package it into a jar, replace the Lucene jar in your Elasticsearch cluster, and then restart node by node.

@psc0606
Author

psc0606 commented Jan 2, 2024

@javanna No fixes required on the Elasticsearch side.

@javanna javanna closed this as completed Jan 2, 2024
@gangaeswari

We are also facing a similar issue with cache evictions after upgrading Elasticsearch from 8.7 to 8.10.

Even after upgrading to Elasticsearch 8.12, we still see the same behaviour: cache evictions happen although we have not crossed the cache size threshold (3.5 GB, which is 10% of the heap); the cache size only reaches about 2.5 GB, and the number of queries does not exceed 3k.

Could you please let us know if you have any suggestions here?
