
1.3.4 heap used growing and crashing ES #8249

Closed
johnc10uk opened this issue Oct 28, 2014 · 29 comments · Fixed by #8304

@johnc10uk

I have upgraded from v1.0.0 to 1.3.4 and now see heap usage grow from 60% to 99% over 7 days, which crashes ES. When it hits 70% heap used, I see 100% usage on one CPU at a time. ES is working fine otherwise; it just appears to be a memory leak or a failure to complete GC. Can I safely downgrade to v1.3.1 to see if that fixes the problem? We currently have 55M documents, 23GB in size.

Thanks, John.

@clintongormley

@johnc10uk A few questions:

  • what custom settings are you using?
  • what is your ES_HEAP_SIZE?
  • do you have swap disabled?

Can you send the output of these two requests:

curl localhost:9200/_nodes > nodes.json
curl localhost:9200/_nodes/stats?fields=* > nodes_stats.json
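(For context, a minimal sketch of where these settings usually live; the values below are illustrative assumptions, not taken from this thread:)

# environment (e.g. /etc/default/elasticsearch): size of the JVM heap
ES_HEAP_SIZE=8g
# elasticsearch.yml: lock the heap in RAM so the OS cannot swap it out
bootstrap.mlockall: true
# or disable swap entirely at the OS level
sudo swapoff -a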

@johnc10uk
Author

No custom settings, 8.8GB heap, and no swap. Again, this was working fine on 1.0.0.

Gist: https://gist.github.com/johnc10uk/441b58a6181bc07af4c1

@johnc10uk
Author

Tried doubling RAM to 16GB, but it just lasts longer before filling the heap. Upgraded Java from 1.7.0_25 to 1.7.0_55, no difference.

java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)

@clintongormley

@johnc10uk thanks for the info. Could I ask you to provide this output as well please:

curl localhost:9200/_nodes/hot_threads > hot_threads.txt

@clintongormley

Thanks. Nothing terribly unusual so far. Do you have a heap dump left from when one of the nodes crashed? If not, could you take a heap dump when memory usage is high, and share it with us?

thanks

@clintongormley

Link for heap dump: http://stackoverflow.com/questions/17135721/generating-heap-dumps-java-jre7

Please could you .tar.gz it, and you can share its location with me privately: clinton dot gormley at elasticsearch dot com.

Also, your mappings and config please.
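(For reference, a minimal sketch of capturing and packaging a heap dump with jmap; the user, PID placeholder, and file paths are assumptions, not from this thread:)

# run as the same user as the Elasticsearch process; <pid> is the ES JVM's process id
sudo -u elasticsearch jmap -dump:format=b,file=/tmp/es-heap.hprof <pid>
# compress before sharing
tar -czf es-heap.hprof.tar.gz /tmp/es-heap.hprof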

@johnc10uk
Author

Had to add -F to get jmap to work ("Unable to open socket file: target process not responding or HotSpot VM not loaded").

jmap then took elasticsearch1 out of the cluster and was taking forever to dump. Had to kill it; the dump was only 10MB. Found out I need to run jmap as the elasticsearch user. Emailed a link to the files on S3.

Thanks, John.

@johnc10uk
Author

Ended up restarting both the elasticsearch1 and elasticsearch3 server processes to get everything back in sync after node 1 dropped out. However, the heap used on node 2 also dropped to about 10%, and all nodes are now at about 40% heap used. Some sort of communication failure not completing jobs?

@mikemccand
Contributor

I looked at the heap dump here: it's dominated by the filter cache, which for some reason is not clearing itself. What settings do you have for the filter cache?

It's filled with many (5.5 million) TermFilter instances, and the Term in each of these is typically very large (~200 bytes).

@clintongormley

@johnc10uk when I looked at the stats you sent, filters were only using 0.5GB of memory. It looks like you are using the default filter cache settings, which default to 10% of the heap, i.e. 0.9GB in your case (see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-cache.html#filter).

Please could you keep an eye on the output from the following request, and see if the filter cache is growing unbounded:

curl -XGET "http://localhost:9200/_nodes/stats/indices/filter_cache?human"

In particular, I'd be interested in hearing if the filter cache usage doesn't grow, but memory usage does.
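(For reference, the 10% default mentioned above corresponds to the node-level setting below in elasticsearch.yml; the value shown is illustrative:)

# elasticsearch.yml: cap the filter cache (percentage of heap or an absolute value, e.g. 512mb)
indices.cache.filter.size: 10%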

@johnc10uk
Author

Yes, all defaults. The HQ node diagnostics are currently showing a 320MB filter cache, and I don't think it ever reached 880MB to start any evictions. I'll keep a log of the filter cache; it is growing.

:{"filter_cache":{"memory_size":"322.3mb","memory_size_in_bytes":337965460,"evictions":0}}}

{"filter_cache":{"memory_size":"325.8mb","memory_size_in_bytes":341677692,"evictions":0}}}

Thanks, John.

@Yakx

Yakx commented Oct 29, 2014

We are experiencing this in production as well, since the upgrade to 1.3.4.
When do you plan to fix this?

@johnc10uk
Author

We have turned off caching for the TermFilter you identified, which should stop the memory creep (it didn't need caching anyway). Restarted each node to clear out the heap and will monitor. Thanks for the help, very informative.
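(A minimal sketch of what turning off caching on a terms filter looks like in the 1.x query DSL; the index and field names are illustrative, not the actual mapping from this issue:)

curl -XGET 'http://localhost:9200/myindex/_search' -d '{
  "query": {
    "filtered": {
      "query":  { "match_all": {} },
      "filter": {
        "terms": {
          "user_id": ["id-0001", "id-0002", "id-0003"],
          "_cache": false
        }
      }
    }
  }
}'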

@cregev

cregev commented Oct 29, 2014

Hi,
We have upgraded our Elasticsearch cluster to version 1.3.4. After the upgrade we see heap usage grow from 60% to 99% within a few hours; when the heap reaches 99%, some of the machines disconnect from the cluster and then reconnect.
In addition, when heap usage reaches 90%+ it does not return to normal.

I have added the output of your esdiagdump.sh script so you can check our cluster stats.
Please advise.

https://drive.google.com/a/totango.com/file/d/0B17CHQUEl6W8dE15MGVveWlycFk/view?usp=sharing

Thanks,
Costya.

@clintongormley

@CostyaRegev you have high filter eviction rates. I'm guessing that you're also using filters with very long keys which don't match any documents?

@cregev

cregev commented Oct 29, 2014

@clintongormley - as for your question:
This is a test environment of ours, and we don't query it much. We usually have fewer than 1 query per second, and at most 20 filter cache evictions per second. Most of the time there are 0 evictions per second. And yet, the JVM memory usage of some nodes hits the high nineties (in percent) and never goes down. The nodes also crash every now and then. BTW, the CPU usage percentage is not that high, which might indicate a memory leak of some sort.

Here's our Marvel screenshot (attachment: screen shot 2014-10-29 at 6:07 PM).

@clintongormley

20 filter evictions per second is a lot - it indicates that you're caching lots of filters that shouldn't be cached. Also, you didn't answer my question about the length of cache keys. Are you using filters with, eg, many string IDs in them?

@cregev

cregev commented Oct 29, 2014

Practically, no - not in our test environment. But then again, we have a search shard query rate of <0.1 almost always - it's mostly idle. And we have almost no indexing at all throughout the day (except for a few peaks)... Why is all the memory occupied, to the point that nodes just crash?

@clintongormley

@CostyaRegev Please try this to confirm it is the same problem:

curl -XPOST 'http://localhost:9200/_cache/clear?filter=true'

Within one minute of sending the request, you should see your heap size decrease (although it may take a while longer for a GC to kick in).

If that doesn't work then I suggest taking a heap dump of one of the nodes, and using eg the YourKit profiler to figure out what is taking so much memory.
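(One way to watch for that drop is to poll the JVM heap stats; a sketch using the 1.x nodes stats API:)

curl -XGET "http://localhost:9200/_nodes/stats/jvm?human"
# jvm.mem.heap_used_percent in the response shows the heap usage per node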

@cregev

cregev commented Oct 29, 2014

After trying the request you sent, I did indeed see our heap size decrease.
Is there a workaround for this problem? When will you fix it?

@s1monw
Contributor

s1monw commented Oct 29, 2014

@clintongormley regarding waiting here, maybe we should do the same thing we did for FieldData in 65ce5ac.

@clintongormley

@s1monw I've opened #8285 for that.

@CostyaRegev The problem is as follows:

The filter cache size is the sum of the size of the values in the cache. The size does not take the size of the cache keys into account, because this info is difficult to access.

Previously, cached filters used 1 bit for every document in the segment, so the sum of the size of the values worked as a reasonable heuristic for the total amount of memory consumed. When the cache filled up with too many values, old filters were evicted.

Now, cached filters which match (almost) all docs or (almost) no docs can be represented in a much more efficient manner, and consume very little memory indeed. The result is that many more of these filters can be cached before triggering evictions.

If you have:

  • many unique filters
  • with long cache keys (eg terms filters on thousands of IDs)
  • that match almost all or almost no docs

...then you will run into this situation, where the cache keys can consume all of your heap, while the sum of the size of the values in the cache is small.
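(A sketch of the kind of request that hits this worst case; the index, field, and IDs are illustrative. In practice such filters carry thousands of long string IDs and each request differs slightly, so every one becomes a new cache entry whose key is the whole filter:)

curl -XGET 'http://localhost:9200/myindex/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "terms": { "user_id": ["id-0001", "id-0002", "id-0003"] }
      }
    }
  }
}'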

We are looking at ways to deal with this situation better in #8268. Either we will figure out a way to include the size of cache keys in the calculation (difficult), or we will just limit the number of entries allowed in the cache (easy).

Workarounds for now:

@johnc10uk
Author

@clintongormley as far as getting the size of the cache keys goes, is that something that could/should be added to the Lucene API?

@clintongormley

@johnc10uk no - the caching happens on our side.

@Yakx

Yakx commented Nov 1, 2014

ES folks,
Thanks for the suggestions so far; they helped. However, IMO the problem is much bigger than you think. We have found the following:

  1. The ID cache has the same problem as field data and the filter cache. We had to stop using all our parent-child queries. This was not a big loss, as this model is practically unusable since it is way too slow.
  2. THE MAIN problem is cluster management. When a node is crashing after an OOM, cluster management does not recognize that, and then:
     • calls to the status API return all green
     • the cluster is not using replicas, and queries never return

Our production cluster is crashing every day. The workarounds suggested reduce the number of crashes in our test environment and will be implemented in production.
The main problem, though, is cluster recovery.

Let us know what other info/help you need in order to fix this.

@clintongormley

@Yakx it sounds like you have a number of configuration issues, which are unrelated to this thread. Please ask these questions in the mailing list instead: http://elasticsearch.org/community

@clintongormley

Closed by #8304
