Large index no longer initialises under 1.4.0 and 1.4.0 Beta 1 due to OutOfMemoryException #8394
After a little digging in the code I came across the "index.load_fixed_bitset_filters_eagerly" setting. Setting this to false seems to avoid my initial problem. Has the default changed? Is this something new? Are there any impacts I might need to look out for in setting this to false?
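For reference, a minimal sketch of the settings payload being discussed. The setting name comes straight from this thread; the index name and the idea of sending it via the `_settings` endpoint (or putting it in `elasticsearch.yml` / the index creation request, since it may not be dynamically updatable on 1.x) are assumptions for illustration:

```python
import json

# Index-settings payload that disables eager loading of the fixed
# bitset filters used by nested queries (setting name from this thread).
settings = {
    "index": {
        "load_fixed_bitset_filters_eagerly": False,
    }
}

# Hypothetically sent as e.g.:  PUT /my_index/_settings
body = json.dumps(settings)
print(body)
```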
Hey @andrassy, how many nested object fields do you have in all your mappings? Since 1.4 we eagerly load the filters and keep them around to make nested query execution as fast as possible. Under the hood the nested query relies on the fact that these filters are in memory as bitsets.
_stats reports a doc count just above 600 million for the index (which includes the nested docs, right?), with 10 shards across 5 data nodes at present. There are quite a few nested mappings which we do use, but I think we're probably not hitting the full parent doc set because of other filters applied when we actually query. Would that keep the bitset filter caches smaller? It's just that we haven't hit any OOM limits recently, having operated on 1.3.x and prior versions for some time. We could restructure the data to avoid many of the nested mappings, I think, but that will take us some time :( and involve code changes right the way up our stack. We'll try with index.load_fixed_bitset_filters_eagerly set to false and see how we get on. Thought it was worth sharing the issue here. Thanks for the rapid response @martijnvg!
@andrassy Sharing this is really important! ES may need to change its default behaviour when it comes to eagerly loading the filters associated with nested object fields. Yes, the doc count does include nested documents. You said you have quite a few nested object fields. Can you share how many nested fields you have (check the mapping), or an estimate? In the node stats API we also expose how much memory the bitset filters take (under the fixed_bit_set_memory_in_bytes key). Are you able to check this?
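For anyone looking for that key: a sketch of where `fixed_bit_set_memory_in_bytes` sits in a node stats response. The response excerpt below is hypothetical and heavily trimmed (node name and the 10 GB value are made up; the nesting under `indices.segments` reflects the 1.x segments stats):

```python
# Hypothetical, trimmed excerpt of a GET /_nodes/stats response.
nodes_stats = {
    "nodes": {
        "node1": {
            "indices": {
                "segments": {
                    "fixed_bit_set_memory_in_bytes": 10737418240
                }
            }
        }
    }
}

# Sum the bitset memory across all nodes in the response.
total = sum(
    n["indices"]["segments"]["fixed_bit_set_memory_in_bytes"]
    for n in nodes_stats["nodes"].values()
)
print(total / 1024 ** 3, "GiB")
```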
We have three types within the index with 5, 5, and 4 (14 in total) nested properties. fixed_bit_set_memory_in_bytes currently says 0, but I only just started recovery with load_fixed_bitset_filters_eagerly set to false. I'll check again once we've seen some traffic - it'll probably be Monday now as it's our DEV box and everyone else has gone home already :D
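A back-of-envelope estimate of what those numbers imply, assuming roughly one bit per Lucene document per eagerly loaded filter (one filter per nested field, plus one shared filter for the root documents; the per-segment split over 10 shards comes out to about the same total):

```python
# Rough upper bound on fixed bitset memory for this index.
lucene_docs = 600_000_000   # doc count reported by _stats above
nested_fields = 14          # 5 + 5 + 4 across the three types

bytes_per_filter = lucene_docs / 8          # ~1 bit per Lucene doc
total_bytes = bytes_per_filter * (nested_fields + 1)
print(round(total_bytes / 1024 ** 3, 2), "GiB across the cluster")
```

On its own that is only around 1 GiB, which is why the OOM on an 8 GB heap was surprising and worth digging into.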
Ok, it would be great to know how much fixed_bit_set_memory_in_bytes reports once you've seen some traffic. Do you by any chance also have a …? Also, beyond that, do you have any other warming configured (warmer queries, eager field data loading)?
Also, if you are able to share your mappings (or a dummy mapping that shows the structure of your nested object fields), that would help us see whether we can improve this. Having 14 nested fields and 600M docs shouldn't result in an OOM with your available heap space.
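A dummy mapping of the kind being asked for might look like the sketch below. The type and field names are entirely made up; only the `"type": "nested"` structure mirrors what the thread is discussing:

```python
# Hypothetical mapping sketch: one type with several nested object
# fields (all names invented for illustration).
mapping = {
    "event": {
        "properties": {
            "tags": {
                "type": "nested",
                "properties": {
                    "key":   {"type": "string"},
                    "value": {"type": "string"},
                },
            },
            "metrics": {
                "type": "nested",
                "properties": {
                    "name":    {"type": "string"},
                    "reading": {"type": "double"},
                },
            },
        }
    }
}

# Count the nested object fields, as one would when auditing a mapping.
nested_fields = [
    field
    for field, spec in mapping["event"]["properties"].items()
    if spec.get("type") == "nested"
]
print(len(nested_fields), "nested object fields")
```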
I just upgraded from 1.3.2-1 to 1.4.1 and am seeing the following OOMs:
Is this related to this problem? And if so, do I have to change something else for my indexes, or should this change in 1.4.1 have fixed this already? See also my comment in: #8487 |
@portante ES version 1.4.1 should have fixed the OOM issue related to the fixed bitset cache. If possible, can you share the following:
@martijnvg: I loaded the above into the following gist: https://gist.github.com/portante/711aa2428461a7485384 I did not provide all the mappings for each index; instead I gave you one representative mapping of each type: sosreport, sar, and marvel (which is already known). I also provided /_cat/shards output so you can see the relative sizes of the indexes. The vos.sar-* indexes are about 10-13 GB, while all the others seem to be in sub-1 GB ranges. I have successfully loaded all .marvel-*, tvos.*, and vos.sosreport-* indexes, but have been unsuccessful with the vos.sar-* indexes.
I see that the fixed bitset cache already takes 10GB and many of your shards are not started. In total you have assigned 206GB of JVM heap to ES, which feels more than sufficient, so I don't see directly why you would run OOM. In general, though, this amount of heap for a single node is too high and should be split across more nodes (which can be on the same physical machine). That being said, this shouldn't result in the situation you're in now.

Also, the vos.sar-20141019 index has 14 unique nested object fields in total. Do the other indices have the same nested fields? And how many Lucene documents do those indices have in total, more or less? (This is different from the number of documents in ES when nested fields have been defined in the mapping.) This can be found in the indices stats API under the docs stats.

As I commented earlier here, since 1.4 ES eagerly loads a data structure into memory so that nested queries/filters and nested aggregations run fast, instead of loading it on first use. In order to get all shards started I recommend setting index.load_fixed_bitset_filters_eagerly to false.
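The Lucene-vs-ES document distinction drawn above can be sketched numerically: each nested value is indexed as its own hidden Lucene document next to its root document, so an ES document with N nested values occupies N + 1 Lucene documents. The counts below are illustrative, not taken from the gist:

```python
# Each nested value becomes a hidden Lucene document, so an ES document
# with N nested values occupies N + 1 Lucene documents.
def lucene_doc_count(es_docs, avg_nested_values_per_doc):
    return es_docs * (avg_nested_values_per_doc + 1)

# Illustrative: 10M ES docs averaging 20 nested values each.
print(lucene_doc_count(10_000_000, 20))
```

This is why an index can look modest by ES doc count yet still need large fixed bitsets, which are sized per Lucene document.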
@portante It is better to run the indices stats after you've configured the mentioned setting; the cat indices API may also give a better view of the metric: localhost:9200/_cat/indices/vos.sar-*
@martijnvg, can you explain why having more memory is too much? I can certainly break this up, but that seems counter-intuitive. All the vos.sar-* indices CAN have 14 unique nested fields. Most have about 6 to 8, if I understand the data set correctly. In the provided gist you can see that value: https://gist.github.com/portante/711aa2428461a7485384#file-shards-cat-L71 Each indexed sar document represents one sample collected as reported by the sar tooling. I had disabled the index warmers on those large indexes as a work-around. After enabling the warmers and applying the setting above, the instance now takes about 3 minutes to load up from ES start. Much better. Thanks!
@portante This is the reason: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/heap-sizing.html#compressed_oops Now that all shards are started, can you share how many docs all the vos.sar-* indices have?
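The gist of the linked heap-sizing guide, as a sketch (the ~32 GB threshold is approximate and JVM-dependent): below it the JVM can use compressed 32-bit object pointers, above it every reference widens to 64 bits, so a much larger heap can actually hold less live data per GB. That is why a single 206 GB heap is worse than several sub-32 GB nodes:

```python
# Rule of thumb from the heap-sizing guide: heaps below ~32 GB can use
# compressed (32-bit) object pointers; above that, references double in
# size. Threshold is approximate and varies by JVM.
COMPRESSED_OOPS_LIMIT_GB = 32

def uses_compressed_oops(heap_gb):
    return heap_gb < COMPRESSED_OOPS_LIMIT_GB

# A 31 GB node benefits; the single 206 GB node from this thread does not.
print(uses_compressed_oops(31), uses_compressed_oops(206))
```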
@martijnvg, I have updated the gist with the output requested above (using wildcards on the _cat command did not work for me for some reason); see https://gist.github.com/portante/711aa2428461a7485384#file-shards-txt I'll have to think about compressed_oops and how we can restructure to take advantage of it. It seems like it would be a nice feature for ES to break itself up into smaller instances automatically instead of requiring users to do it.
I think this ticket can be closed now? Feel free to reopen if more discussion is needed |
We have one particularly large index in our cluster - it contains tens of millions of documents and has quite a lot of nested fields too. Prior to 1.4.0 Beta 1 (including 1.2.x and 1.3.x) the index re-initialised fine on a node with 8GB allocated to Elasticsearch (16GB+ available in the OS). Since 1.4.0 Beta 1 (and still on 1.4.0) we're getting an OOM exception (startup log and exception stack below). At this point the node ceases recovery (expected, I guess) and becomes unresponsive. All data nodes suffer the same fate and the entire cluster becomes unresponsive.