
Critical Lucene LRUQueryCache bug may cause the query cache to evict all cached items #100755

Closed
psc0606 opened this issue Oct 12, 2023 · 7 comments
Labels
>bug · :Search/Search (Search-related issues that do not fall into other categories) · Team:Search (Meta label for search team)

Comments

@psc0606

psc0606 commented Oct 12, 2023

Elasticsearch Version

8.9.0

Installed Plugins

No response

Java Version

bundled

OS Version

Linux 3.10.0-514.21.1.el7.x86_64 #1 SMP Thu May 25 17:04:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Affected Versions

Elasticsearch 8.0.0 and later
Lucene 9.0.0 and later

Problem Description

Refer to the Lucene 9.7.0 bug apache/lucene#12614.

Recently we upgraded our Elasticsearch cluster from 7.1.1 to 8.9.0, but after running for a few days we started seeing lots of slow query logs.
We took some of the slow requests from the logs and analyzed them with the profile API, but we could not reproduce the problem that way: with profiling enabled the search cost was under 10ms, while the same DSL run normally without profiling took more than 200ms.

We therefore checked some monitoring indicators in Grafana and found an anomaly in the query cache
(to reproduce the problem, we set indices.queries.cache.size to 350mb):
ramBytesUsed: [Grafana screenshots]

Search time cost: [Grafana screenshot]

We have identified the cause and fixed the issue in our cluster. The root of the problem is the above-mentioned Lucene bug apache/lucene#12614.

Lucene bug explanation:
The class ElasticsearchLRUQueryCache inherits from Lucene's LRUQueryCache. LRUQueryCache has two conditions that trigger cache eviction:

  • indices.queries.cache.size, default 10% of the JVM heap size
  • indices.queries.cache.count, default 10,000

If either condition is met, cache eviction is triggered. But Lucene fails to keep the ramBytesUsed accounting correct.

When a query and its cached result are put into LRUQueryCache, a query that implements Accountable is accounted for with ((Accountable) query).ramBytesUsed(); any other query is accounted for with the constant QUERY_DEFAULT_RAM_BYTES_USED.

  private void putIfAbsent(Query query, CacheAndCount cached, IndexReader.CacheHelper cacheHelper) {
    assert query instanceof BoostQuery == false;
    assert query instanceof ConstantScoreQuery == false;
    // under a lock to make sure that mostRecentlyUsedQueries and cache remain sync'ed
    lock.lock();
    try {
      Query singleton = uniqueQueries.putIfAbsent(query, query);
      if (singleton == null) {
        if (query instanceof Accountable) {
          onQueryCache(
              query, LINKED_HASHTABLE_RAM_BYTES_PER_ENTRY + ((Accountable) query).ramBytesUsed());
        } else {
          onQueryCache(query, LINKED_HASHTABLE_RAM_BYTES_PER_ENTRY + QUERY_DEFAULT_RAM_BYTES_USED);
        }
      } else {
        query = singleton;
      }
      final IndexReader.CacheKey key = cacheHelper.getKey();
      LeafCache leafCache = cache.get(key);
      if (leafCache == null) {
        leafCache = new LeafCache(key);
        final LeafCache previous = cache.put(key, leafCache);
        ramBytesUsed += HASHTABLE_RAM_BYTES_PER_ENTRY;
        assert previous == null;
        // we just created a new leaf cache, need to register a close listener
        cacheHelper.addClosedListener(this::clearCoreCacheKey);
      }
      leafCache.putIfAbsent(query, cached);
      evictIfNecessary();
    } finally {
      lock.unlock();
    }
  }

But on cache eviction, only the constant QUERY_DEFAULT_RAM_BYTES_USED is subtracted per query, which for Accountable queries is smaller than the amount that was added.

The result of long-running eviction is that the cache is actually empty, but ramBytesUsed is still larger than indices.queries.cache.size. Even though the cache has been cleared, the eviction condition remains satisfied, which means no item will ever be cached again.
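
To make the failure mode concrete, here is a minimal, self-contained sketch of the asymmetric accounting (only the size limit is modeled; the class, names, and byte sizes are made up for illustration and are not the actual Lucene source):

  import java.util.LinkedHashMap;
  import java.util.Map;

  // Toy model of the accounting mismatch: insertion charges the "real" size of an
  // Accountable query, but eviction only credits back a small constant.
  public class AsymmetricAccountingSketch {
    static final long QUERY_DEFAULT_BYTES = 1024;  // stand-in for QUERY_DEFAULT_RAM_BYTES_USED
    static final long MAX_RAM_BYTES = 16 * 1024;   // stand-in for indices.queries.cache.size

    final Map<String, Long> cache = new LinkedHashMap<>(); // query -> its real size in bytes
    long ramBytesUsed = 0;

    void put(String query, long realBytes) {
      cache.put(query, realBytes);
      ramBytesUsed += realBytes;                   // on insert: the real size is added
      // evict while the size limit is exceeded and there is something left to evict
      while (ramBytesUsed > MAX_RAM_BYTES && !cache.isEmpty()) {
        String eldest = cache.keySet().iterator().next();
        cache.remove(eldest);
        ramBytesUsed -= QUERY_DEFAULT_BYTES;       // on evict: only the constant is subtracted
      }
    }

    public static void main(String[] args) {
      AsymmetricAccountingSketch c = new AsymmetricAccountingSketch();
      for (int i = 0; i < 100; i++) {
        c.put("q" + i, 4096);                      // "large" Accountable-style queries
      }
      // The cache ends up empty, yet ramBytesUsed still exceeds the limit,
      // so the eviction condition stays satisfied and nothing can be cached anymore.
      System.out.println("cached=" + c.cache.size() + ", ramBytesUsed=" + c.ramBytesUsed);
    }
  }

Once the counter is stuck above the limit like this, every new entry is evicted immediately, which matches the empty-cache-but-high-ramBytesUsed pattern shown in the graphs above.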

Slow-log DSL example:

{"from":0,"size":8,"query":{"bool":{"filter":[{"terms":{"status":["active","blocked","soft_blocked"],"boost":1}},{"term":{"visitor":{"value":0}}},{"term":{"account_off":{"value":0}}},{"terms":{"user_role":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],"boost":1}}],"should":[{"constant_score":{"filter":{"term":{"nickname.keyword":{"value":"yin y"}}},"boost":5}},{"constant_score":{"filter":{"match_phrase_prefix":{"nickname.single":{"query":"yin y"}}},"boost":5}},{"constant_score":{"filter":{"match":{"nickname.single":{"query":"yin y","minimum_should_match":"100%"}}},"boost":1}},{"constant_score":{"filter":{"match_phrase":{"nickname.pinyin":{"query":"yin y"}}},"boost":1}},{"constant_score":{"filter":{"term":{"role":{"value":"AUTHOR"}}},"boost":0.5}}],"minimum_should_match":"1","boost":1}},"_source":{"includes":["id","nickname","is_official","user_role"],"excludes":[]},"sort":[{"_score":{"order":"desc"}},{"follower_count":{"order":"desc"}}]}

Steps to Reproduce

  1. Set indices.queries.cache.size to 350mb.
  2. Leave indices.queries.cache.count at its default value.
  3. Construct a query with a terms clause inside a filter, or any DSL that Elasticsearch will cache.
  4. Fill up the query cache.
  5. Keep triggering cache evictions for a few hours to reproduce the bug.

Logs (if relevant)

No response

psc0606 added the >bug and needs:triage labels on Oct 12, 2023
psc0606 changed the title from "Lucene LRUQueryCache bug may cause query-cache evict all cached item" to "Critical lucene LRUQueryCache bug may cause query cache evict all cached item" on Oct 12, 2023
mark-vieira added the :Search/Search label on Oct 12, 2023
elasticsearchmachine added the Team:Search label and removed the needs:triage label on Oct 12, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@romseygeek
Contributor

This has been fixed in lucene by apache/lucene#12614
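
The essence of the fix is to remove the mismatch described above, i.e. to credit back on eviction the same estimate that was charged on insertion. A hedged sketch of that idea follows; it is an illustration only, not the actual Lucene patch, and the helper name and the constant's value are assumptions:

  import org.apache.lucene.search.Query;
  import org.apache.lucene.util.Accountable;

  // Illustrative helper: compute one RAM estimate per query and use the same value both
  // when the query is cached and when it is evicted, so ramBytesUsed drops back to zero
  // once the cache is empty. (Class name and constant value are assumptions, not copied
  // from the Lucene patch.)
  final class QueryRamEstimate {
    private static final long QUERY_DEFAULT_RAM_BYTES_USED = 1024; // placeholder value

    static long of(Query query) {
      return query instanceof Accountable
          ? ((Accountable) query).ramBytesUsed()
          : QUERY_DEFAULT_RAM_BYTES_USED;
    }
  }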

@YousF9

YousF9 commented Dec 1, 2023

@psc0606 Could you share what you did to resolve the issue on your cluster? I may be running into something similar on my end.

@javanna
Member

javanna commented Dec 7, 2023

The fix was included in Lucene 9.9, released earlier this week. Elasticsearch 8.12 will be based on it and contain the fix. Is there anything left to fix on the Elasticsearch side?

@psc0606
Author

psc0606 commented Jan 2, 2024

@YousF9 You can wait for Elasticsearch 8.12, as @javanna said. Alternatively, download the source code of the corresponding Lucene version, apply the simple modifications from apache/lucene#12614, compile and package it into a jar, replace the Lucene jar in your Elasticsearch cluster, and then restart node by node.

@psc0606
Author

psc0606 commented Jan 2, 2024

@javanna No fixes required on the Elasticsearch side.

@javanna javanna closed this as completed Jan 2, 2024
@gangaeswari

We are also facing a similar issue with cache evictions after upgrading Elasticsearch from 8.7 to 8.10.

Even after upgrading to Elasticsearch 8.12, we still see the same behaviour: cache evictions happen although we have not crossed the cache size threshold (3.5 GB, which is 10% of the heap); the cache size only reaches about 2.5 GB, and the number of queries does not exceed 3k.

Could you please let us know if you have any suggestions here?
