New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
1.3.4 heap used growing and crashing ES #8249
Comments
@johnc10uk Few questions:
can you send the output of these two requests:
|
No custom settings, 8.8GB heap and no swap. Again, this was working fine at 1.0.0 . <script src="https://gist.github.com/johnc10uk/441b58a6181bc07af4c1.js"></script> |
Tried doubling RAM to 16GB but just lasts longer before filling heap. Upgraded Java from 25 to 55, no difference. java version "1.7.0_55" |
@johnc10uk thanks for the info. Could I ask you to provide this output as well please:
|
Added to Gist https://gist.github.com/johnc10uk/441b58a6181bc07af4c1 |
Thanks. Nothing terribly unusual so far. Do you have a heap dump left from when one of the nodes crashed? If not, could you take a heap dump when memory usage is high, and share it with us? thanks |
Link for heap dump: http://stackoverflow.com/questions/17135721/generating-heap-dumps-java-jre7 Please could you .tar.gz it, and you can share its location with me privately: clinton dot gormley at elasticsearch dot com. Also, your mappings and config please. |
Had to add -F to get jmap to work "Unable to open socket file: target process not responding or HotSpot VM not loaded" jmap has just taken elasticsearch1 out of the cluster and taking forever to dump. Had to kill it, dump only 10Mb. Found out need to run jmap as elasticsearch user. Emailed link to files on S3. Thanks, John. |
Ended up with both elasticsearch1 and 3 server processes being restarted to get all in sync after 1 dropped out. However the heap used on 2 also dropped to about 10% and all now at 40% heap used. Some sort of communication failure not completing jobs? |
I looked at the heap dump here: it's dominated by the filter cache, which for some reason is not clearing itself. What settings do you have for the filter cache? It's filled with many (5.5 million) TermFilter instances, and the Term in each of these is typically very large (~200 bytes). |
@johnc10uk when I looked at the stats you sent, filters were only using 0.5GB of memory. It looks like you are using the default filter cache settings, which defaults to 10% of heap, ie 0.9GB in your case (see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-cache.html#filter) Please could you keep an eye on the output from the following request, and see if the filter cache is growing unbounded:
In particular, I'd be interested in hearing if the filter cache usage doesn't grow, but memory usage does. |
Yes, all defaults. The HQ node diagnostics is currently showing 320Mb filter cache and I don't think it ever reached 880Mb to start any evictions. I'll keep a log of filter cache, it is growing. :{"filter_cache":{"memory_size":"322.3mb","memory_size_in_bytes":337965460,"evictions":0}}} {"filter_cache":{"memory_size":"325.8mb","memory_size_in_bytes":341677692,"evictions":0}}} Thanks, John. |
We are expiriencing this, in production, as well since the upgrade to 1.3.4. |
We have turned off caching for the TermFilter you identified which should stop the memory creep (it didn't need caching anyway). Restarted each node to clear out heap and will monitor. Thanks for help, very informative. |
Hi , I have added your esdiagdump.sh script output , so you could check our cluster stats. https://drive.google.com/a/totango.com/file/d/0B17CHQUEl6W8dE15MGVveWlycFk/view?usp=sharing Thanks, |
@CostyaRegev you have high filter eviction rates. I'm guessing that you're also using filters with very long keys which don't match any documents? |
@clintongormley - as for your question: |
20 filter evictions per second is a lot - it indicates that you're caching lots of filters that shouldn't be cached. Also, you didn't answer my question about the length of cache keys. Are you using filters with, eg, many string IDs in them? |
Practically, no - not in our test environment. But then again, we have search shard query rate of <0.1 almost always - it's mostly idle. And we have almost no indexing at all all day long (except for a few peaks)... Why is all the memory occupied to the point that nodes just crash? |
@CostyaRegev Please try this to confirm it is the same problem:
Within one minute of sending the request, you should see your heap size decrease (although it may take a while longer for a GC to kick in). If that doesn't work then I suggest taking a heap dump of one of the nodes, and using eg the YourKit profiler to figure out what is taking so much memory. |
After trying the sent request , indeed i saw our heap size decrease. |
@clintongormley regarding waiting here maybe we should do the same thing we did for FieldData here too in 65ce5ac |
@s1monw I've opened #8285 for that. @CostyaRegev The problem is as follows: The filter cache size is the sum of the size of the values in the cache. The size does not take the size of the cache keys into account, because this info is difficult to access. Previously, cached filters used 1 bit for every document in the segment, so the sum of the size of the values worked as a reasonable heuristic for the total amount of memory consumed. When the cache filled up with too many values, old filters were evicted. Now, cached filters which match (almost) all docs or (almost) no docs can be represented in a much more efficient manner, and consume very little memory indeed. The result is that many more of these filters can be cached before triggering evictions. If you have:
...then you will run into this situation, where the cache keys can consume all of your heap, while the sum of the size of the values in the cache is small. We are looking at ways to deal with this situation better in #8268. Either we will figure out a way to include the size of cache keys in the calculation (difficult), or we will just limit the number of entries allowed in the cache (easy). Workarounds for now:
|
@clintongormley as far as getting the size of the cache keys, is that something that could/should be added to Lucene API? |
@johnc10uk no - the caching happens on our side. |
ES folks,
Our production cluster is crashing every day. The workarounds suggested reduce the number of crashes on our test env and will be implemented in production. Let us know what other info/help you need in order to fix. |
@Yakx it sounds like you have a number of configuration issues, which are unrelated to this thread. Please ask these questions in the mailing list instead: http://elasticsearch.org/community |
Closed by #8304 |
I have upgraded from v1.0.0 to 1.3.4 and now see the heap used grow from 60 to 99% over 7 days and crashes ES. When hits 70% heap used see 100% on one CPU at a time. ES is working fine, just appears to be a memory leak or failing to complete GC. Can I safely downgrade to v1.3.1 to see if fixes the problem? Currently have 55M documents, 23GB size.
Thanks, John.
The text was updated successfully, but these errors were encountered: