
Large index no longer initialises under 1.4.0 and 1.4.0 Beta 1 due to OutOfMemoryException #8394

Closed
andrassy opened this issue Nov 7, 2014 · 17 comments

@andrassy

andrassy commented Nov 7, 2014

We have one particularly large index in our cluster - it contains tens of millions of documents and has quite a lot of nested fields too. Prior to 1.4.0 Beta 1 (including 1.2.x and 1.3.x) the index re-initialised fine on a node with 8GB allocated to Elasticsearch (16GB+ available to the OS). Since 1.4.0 Beta 1 (and still on 1.4.0) we get an OutOfMemoryError (startup log and exception stack below). At this point the node ceases recovery (expected, I guess) and becomes unresponsive. All data nodes suffer the same fate and the entire cluster becomes unresponsive.

[2014-11-07 17:12:39,895][WARN ][common.jna               ] unable to link C library. native methods (mlockall) will be disabled.
[2014-11-07 17:12:40,077][INFO ][node                     ] [dvlp_FRONTEND2] version[1.4.0], pid[9052], build[bc94bd8/2014-11-05T14:26:12Z]
[2014-11-07 17:12:40,077][INFO ][node                     ] [dvlp_FRONTEND2] initializing ...
[2014-11-07 17:12:40,129][INFO ][plugins                  ] [dvlp_FRONTEND2] loaded [cloud-aws], sites [bigdesk, head, inquisitor, kopf]
[2014-11-07 17:12:45,220][INFO ][node                     ] [dvlp_FRONTEND2] initialized
[2014-11-07 17:12:45,220][INFO ][node                     ] [dvlp_FRONTEND2] starting ...
[2014-11-07 17:12:45,438][INFO ][transport                ] [dvlp_FRONTEND2] bound_address {inet[/0:0:0:0:0:0:0:0:50882]}, publish_address {inet[FRONTEND2/192.168.10.73:50882]}
[2014-11-07 17:12:45,452][INFO ][discovery                ] [dvlp_FRONTEND2] dvlp/C2f-euXcRc-cEv3dnsBnXw
[2014-11-07 17:13:15,451][WARN ][discovery                ] [dvlp_FRONTEND2] waited for 30s and no initial state was set by the discovery
[2014-11-07 17:13:15,468][INFO ][http                     ] [dvlp_FRONTEND2] bound_address {inet[/0:0:0:0:0:0:0:0:50881]}, publish_address {inet[frontend2/192.168.10.73:50881]}
[2014-11-07 17:13:15,468][INFO ][node                     ] [dvlp_FRONTEND2] started
[2014-11-07 17:13:48,552][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:14:51,597][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:15:54,633][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:16:57,647][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:18:00,664][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:19:03,675][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:20:06,684][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:20:36,950][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [NodeDisconnectedException[[dvlp_FRONTEND2_coordinator][inet[/192.168.10.73:55591]][internal:discovery/zen/join] disconnected]]
[2014-11-07 17:20:41,171][WARN ][transport.netty          ] [dvlp_FRONTEND2] Message not fully read (response) for [85] handler future(org.elasticsearch.transport.EmptyTransportResponseHandler@2060e2c8), error [true], resetting
[2014-11-07 17:20:41,171][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND1_coordinator][4y8Hh5kAQPK2Ie3gzc58Ww][FRONTEND1][inet[/192.168.10.70:55858]]{datacentrename=site1, data=false, nodename=dvlp_FRONTEND1_coordinator, master=true}], reason [RemoteTransportException[Failed to deserialize exception response from stream]; nested: TransportSerializationException[Failed to deserialize exception response from stream]; nested: StreamCorruptedException[unexpected end of block data]; ]
[2014-11-07 17:20:45,520][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][-O87CxU3RRSTHZkuC985Yw][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [RemoteTransportException[[dvlp_FRONTEND2_coordinator][inet[/192.168.10.73:55591]][internal:discovery/zen/join]]; nested: ElasticsearchIllegalStateException[Node [[dvlp_FRONTEND2_coordinator][-O87CxU3RRSTHZkuC985Yw][FRONTEND2][inet[FRONTEND2/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}] not master for join request from [[dvlp_FRONTEND2][C2f-euXcRc-cEv3dnsBnXw][FRONTEND2][inet[/192.168.10.73:50882]]{datacentrename=site2, nodename=dvlp_FRONTEND2, master=false}]]; ], tried [3] times
[2014-11-07 17:20:48,831][INFO ][cluster.service          ] [dvlp_FRONTEND2] detected_master [dvlp_FRONTEND2_coordinator][-O87CxU3RRSTHZkuC985Yw][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}, added {[dvlp_DEVBATH01.exabre.co.uk_loadbalancer][8i4izXAUQiWeS2arwV9LeA][DEVBATH01][inet[/192.168.10.65:12184]]{datacentrename=site1, data=false, nodename=dvlp_DEVBATH01.exabre.co.uk_loadbalancer, master=true},[dvlp_FRONTEND2_coordinator][-O87CxU3RRSTHZkuC985Yw][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true},[dvlp_FRONTEND2_loadbalancer][joVXc_fGTx-SC_YwJ2YBmQ][FRONTEND2][inet[/192.168.10.73:65341]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_loadbalancer, master=false},[dvlp_FRONTEND1_loadbalancer][snDHwo0YTR6VsAFV9nBcxw][FRONTEND1][inet[/192.168.10.70:55054]]{datacentrename=site1, data=false, nodename=dvlp_FRONTEND1_loadbalancer, master=false},}, reason: zen-disco-receive(from master [[dvlp_FRONTEND2_coordinator][-O87CxU3RRSTHZkuC985Yw][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}])
[2014-11-07 17:21:01,937][INFO ][cluster.service          ] [dvlp_FRONTEND2] added {[dvlp_FRONTEND1_coordinator][4y8Hh5kAQPK2Ie3gzc58Ww][FRONTEND1][inet[/192.168.10.70:55858]]{datacentrename=site1, data=false, nodename=dvlp_FRONTEND1_coordinator, master=true},}, reason: zen-disco-receive(from master [[dvlp_FRONTEND2_coordinator][-O87CxU3RRSTHZkuC985Yw][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}])
[2014-11-07 17:25:25,598][INFO ][monitor.jvm              ] [dvlp_FRONTEND2] [gc][old][739][27] duration [8s], collections [1]/[9s], total [8s]/[8.8s], memory [7.8gb]->[7.7gb]/[7.9gb], all_pools {[young] [172.4mb]->[46.5mb]/[199.6mb]}{[survivor] [24.9mb]->[0b]/[24.9mb]}{[old] [7.6gb]->[7.7gb]/[7.7gb]}
[2014-11-07 17:25:46,387][INFO ][monitor.jvm              ] [dvlp_FRONTEND2] [gc][old][746][32] duration [5s], collections [1]/[6s], total [5s]/[23.6s], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [195mb]->[199.6mb]/[199.6mb]}{[survivor] [0b]->[10.9mb]/[24.9mb]}{[old] [7.7gb]->[7.7gb]/[7.7gb]}
[2014-11-07 17:28:16,136][WARN ][index.warmer             ] [dvlp_FRONTEND2] [dvlp_13_67_item_20140410][7] failed to load fixed bitset for [org.elasticsearch.index.search.nested.NonNestedDocsFilter@fd00879d]
org.elasticsearch.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2201)
    at org.elasticsearch.common.cache.LocalCache.get(LocalCache.java:3937)
    at org.elasticsearch.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4739)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.getAndLoadIfNotPresent(FixedBitSetFilterCache.java:139)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.access$100(FixedBitSetFilterCache.java:75)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$FixedBitSetFilterWarmer$1.run(FixedBitSetFilterCache.java:287)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.util.FixedBitSet.<init>(FixedBitSet.java:187)
    at org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(MultiTermQueryWrapperFilter.java:104)
    at org.elasticsearch.common.lucene.search.NotFilter.getDocIdSet(NotFilter.java:49)
    at org.elasticsearch.index.search.nested.NonNestedDocsFilter.getDocIdSet(NonNestedDocsFilter.java:46)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$2.call(FixedBitSetFilterCache.java:142)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$2.call(FixedBitSetFilterCache.java:139)
    at org.elasticsearch.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4742)
    at org.elasticsearch.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
    at org.elasticsearch.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
    at org.elasticsearch.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
    at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
    ... 8 more
[2014-11-07 17:28:29,215][INFO ][monitor.jvm              ] [dvlp_FRONTEND2] [gc][old][749][40] duration [22.9s], collections [4]/[2.3m], total [22.9s]/[1m], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [199.5mb]->[199.6mb]/[199.6mb]}{[survivor] [22.9mb]->[23.1mb]/[24.9mb]}{[old] [7.7gb]->[7.7gb]/[7.7gb]}
[2014-11-07 17:28:23,797][WARN ][index.warmer             ] [dvlp_FRONTEND2] [dvlp_13_67_item_20140410][7] failed to load fixed bitset for [org.elasticsearch.index.search.nested.NestedDocsFilter@fd00879d]
org.elasticsearch.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2201)
    at org.elasticsearch.common.cache.LocalCache.get(LocalCache.java:3937)
    at org.elasticsearch.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4739)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.getAndLoadIfNotPresent(FixedBitSetFilterCache.java:139)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.access$100(FixedBitSetFilterCache.java:75)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$FixedBitSetFilterWarmer$1.run(FixedBitSetFilterCache.java:287)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.util.FixedBitSet.<init>(FixedBitSet.java:187)
    at org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(MultiTermQueryWrapperFilter.java:104)
    at org.elasticsearch.index.search.nested.NestedDocsFilter.getDocIdSet(NestedDocsFilter.java:50)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$2.call(FixedBitSetFilterCache.java:142)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$2.call(FixedBitSetFilterCache.java:139)
    at org.elasticsearch.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4742)
    at org.elasticsearch.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
    at org.elasticsearch.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
    at org.elasticsearch.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
    at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
    ... 8 more
@andrassy andrassy changed the title Index no longer initialises under 1.4.0 and 1.4.0 Beta 1 due to OutOfMemoryException Large index no longer initialises under 1.4.0 and 1.4.0 Beta 1 due to OutOfMemoryException Nov 7, 2014
@andrassy
Author

andrassy commented Nov 7, 2014

A little bit of digging in the code and I came across the "index.load_fixed_bitset_filters_eagerly" setting. Setting this to false seems to avoid my initial problem. Has the default changed? Is this something new? Are there any impacts I might need to look out for in setting this to false?
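
For anyone else landing here, a minimal sketch of one way the setting can be applied, assuming it is honoured like any other index-level setting supplied at index-creation time (the index name below is made up for illustration; see further down the thread for the elasticsearch.yml variant):

# hypothetical index name; sets the eager-loading setting at creation time
curl -XPUT 'localhost:9200/my_big_index' -d '{
  "settings": {
    "index.load_fixed_bitset_filters_eagerly": false
  }
}'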

@martijnvg
Member

Hey @andrassy, how many nested object fields do you have across all your mappings?

Since 1.4 we eagerly load these filters and keep them around to make nested query execution as fast as possible. Under the hood the nested query relies on these filters being in memory as bitsets.

The index.load_fixed_bitset_filters_eagerly setting has been added to disable the eager loading, but at some point, when running nested queries, these filters will end up on the heap as bitsets anyway. Disabling it may make sense in your case if you have many nested object fields but not all of them are actually used.

@andrassy
Author

andrassy commented Nov 7, 2014

_stats reports a doc count just above 600 million for the index (which includes the nested docs, right?) - 10 shards across 5 data nodes at present. There are quite a few nested mappings which we do use, but I think we're probably not hitting the full parent doc set because other filters are applied when we actually query - would that keep the bitset filter caches smaller? It's just that we don't seem to have hit any OOM limits recently, having operated on 1.3.x and prior versions for some time.

We could restructure the data to avoid many of the nested mappings, I think, but that'll take us some time :( and involve code changes right the way up our stack. We'll try with the index.load_fixed_bitset_filters_eagerly setting set to false and see how we get on.

Thought it was worth sharing the issue here. Thanks for the rapid response @martijnvg!

@martijnvg
Member

@andrassy Sharing this is really important! ES may need to change its default behaviour when it comes to eager loading the filters associated with nested object fields.

Yes, the doc count does include nested documents. You said you have quite a few nested object fields. Can you share how many nested fields you have (check the mapping), or an estimate?

In the node stats api we also expose how much memory the fixed bitset cache is taking (under the fixed_bit_set_memory_in_bytes key). Are you able to check this?
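
For reference, a rough sketch of how to pull that number out, assuming the 1.4.x stats layout where the key sits under the segments section of the per-node indices stats:

curl 'localhost:9200/_nodes/stats/indices?pretty' | grep fixed_bit_set_memory_in_bytes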

@andrassy
Author

andrassy commented Nov 7, 2014

We have three types within the index with 5, 5, and 4 (totalling 14) nested properties.

fixed_bit_set_memory_in_bytes currently says 0, but I only just started to recover with load_fixed_bitset_filters_eagerly set to false. I'll check again once we've seen some traffic - it'll probably be Monday now as it's our DEV box and everyone else went home already :D

@martijnvg
Member

Ok, it would be great to know how much fixed_bit_set_memory_in_bytes reports.

Do you by any chance also have any _parent fields configured in your mappings? Each parent type increases the number of entries in the bitset cache.

Also, beyond that, do you have any other warming configured (warmer queries, eager field data loading)?

@martijnvg
Member

Also, if you are able to share your mappings (or a dummy mapping that shows the structure of your nested object fields) that would be helpful, so we can see if we can improve this. Having 14 nested fields and 600M docs shouldn't result in an OOM with your available heap space.
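
Purely for illustration (index, type, and field names below are invented), a dummy mapping with nested object fields looks roughly like this:

curl -XPUT 'localhost:9200/dummy_index' -d '{
  "mappings": {
    "item": {
      "properties": {
        "attributes": {
          "type": "nested",
          "properties": { "name": { "type": "string" }, "value": { "type": "string" } }
        },
        "price_history": {
          "type": "nested",
          "properties": { "date": { "type": "date" }, "price": { "type": "double" } }
        }
      }
    }
  }
}'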

@martijnvg
Member

With the changes to the default eager-loading behaviour and to the reliance on bitset-based filters in nested and parent/child made via #8454, #8440 and #8414, running out of memory as happened here should no longer occur.

@portante

portante commented Dec 5, 2014

I just upgraded from 1.3.2-1 to 1.4.1 and am seeing the following OOMs:

[2014-12-05 13:32:14,166][WARN ][index.warmer             ] [Patriots] [foo.bar-20140831][0] failed to load fixed bitset for [org.elasticsearch.index.search.nested.NonNestedDocsFilter@a801f786]  
org.elasticsearch.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: Java heap space  
        at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2201)  
        at org.elasticsearch.common.cache.LocalCache.get(LocalCache.java:3937)  
        at org.elasticsearch.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4739)  
        at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.getAndLoadIfNotPresent(FixedBitSetFilterCache.java:137)  
        at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.access$100(FixedBitSetFilterCache.java:73)  
        at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$FixedBitSetFilterWarmer$1.run(FixedBitSetFilterCache.java:278)  
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)  
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)  
        at java.lang.Thread.run(Thread.java:745)  
Caused by: java.lang.OutOfMemoryError: Java heap space  

Is this related to this problem? And if so, do I have to change something else for my indexes, or should this change in 1.4.1 have fixed this already?

See also my comment in: #8487

@martijnvg
Member

@portante ES version 1.4.1 should have fixed the OOM issue related to the fixed bitset cache.

If possible, can you share the following (example curl commands after the list):

  • Your mappings: localhost:9200/_mappings
  • Cluster stats: localhost:9200/_cluster/stats?human&pretty
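
For example, as plain curl against a local node (host and port assumed to be the defaults):

curl 'localhost:9200/_mappings?pretty'
curl 'localhost:9200/_cluster/stats?human&pretty'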

@portante

portante commented Dec 5, 2014

@martijnvg: Loaded the above in the following gist: https://gist.github.com/portante/711aa2428461a7485384

I did not provide all the mappings for each index; instead, I gave you one representative mapping of each type: sosreport, sar, and marvel (which is already known).

I also provided a /_cat/shards output so you can see the relative sizes of the indexes. The vos.sar-* indexes are about 10 - 13 GB, while all the others seem to be in sub-1 GB ranges.

I have successfully loaded all of the .marvel*, tvos.*, and vos.sosreport-* indexes, but have been unsuccessful with the vos.sar-* indexes.

@martijnvg
Member

I see that the fixed bitset cache already takes 10GB and that many of your shards are not started. In total you have assigned 206GB of JVM heap to ES, which feels more than sufficient, so I don't directly see why you would run out of memory. However, in general this amount of heap for a single node is too high and should be split across more nodes (they can be on the same physical machine). That being said, this shouldn't result in the situation you're in now.

Also, the vos.sar-20141019 index has 14 unique nested object fields in total. Do the other indices have the same nested fields? And how many Lucene documents do those indices have in total, more or less? (This is different from the number of documents reported by ES when nested fields are defined in the mapping.) It can be found in the indices stats api under the docs stats.

As I commented earlier here, since 1.4 ES eagerly loads a data structure into memory so that nested queries/filters and nested aggregations run fast, instead of loading it lazily when it is first needed.

In order to get all shards started I recommend setting index.load_fixed_bitset_filters_eagerly to false in your elasticsearch.yml file and restarting. This disables the eager loading and prevents the OOM shown in the stack trace you sent earlier.
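
Concretely, that is a single line in the elasticsearch.yml file, followed by a restart:

index.load_fixed_bitset_filters_eagerly: false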

@martijnvg
Member

@portante It is better to run the indices stats after you have configured the mentioned setting, and the cat indices api may give a better view of the metric: localhost:9200/_cat/indices/vos.sar-*

@portante

portante commented Dec 6, 2014

@martijnvg, can you explain why having more memory is too much? I can certainly break this up, but that seems counterintuitive.

All the vos.sar-* indices CAN have 14 unique nested fields. Most have about 6-8, if I understand the data set correctly.

In the provided gist, you can see that value: https://gist.github.com/portante/711aa2428461a7485384#file-shards-cat-L71

Each indexed sar document represents one sample collected as reported by the sadf command from sysstat. On some systems they might collect 144 samples a day (10 minute intervals); some have 8,600+ samples a day (10 second intervals). What really seems to affect the size of things is the number of nested elements. We have seen VM hosts with close to 1,000 NICs servicing VMs, which ends up as one nested doc per NIC in the net-dev and net-edev docs. Or they might have 400+ block devices, ending up with that many nested docs for disks.

I had disabled the index warmers on those large indexes as a work-around. After re-enabling the warmers and applying the setting above, the instance now takes about 3 minutes to load up from ES start.

Much better. Thanks!

@martijnvg
Member

@portante This is the reason: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/heap-sizing.html#compressed_oops - in short, heaps above roughly 32GB lose compressed object pointers, so a very large single heap wastes memory on pointer overhead and tends to suffer longer GC pauses.

Now that all shards are started, can you share how many docs all the vos.sar-* indices have?
The best way to share this is by running: curl 'localhost:9200/_cat/shards/vos.sar-*'.
This gives a good indication of how much heap memory the fixed bitset cache will take if everything is loaded.
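
As a rough back-of-envelope only (not an official formula): each cache entry is a FixedBitSet over a segment's Lucene documents, i.e. roughly one bit per Lucene doc per cached filter, so:

approx heap for the cache ≈ (Lucene docs on the node / 8) bytes × (number of cached filters)

For example, a hypothetical 1 billion Lucene docs on a node with ~15 cached filters (one per nested path plus the non-nested filter seen in the stack traces above) would be about 1e9/8 bytes × 15 ≈ 1.9GB.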

@portante

portante commented Dec 8, 2014

@martijnvg, I have updated that gist with the output requested above (using wildcards on the _cat command did not work for me for some reason), see https://gist.github.com/portante/711aa2428461a7485384#file-shards-txt

I'll have to think about compressed_oops and how we can restructure to take advantage of it. It seems like it would be a nice feature for ES to break itself up into smaller instances automatically instead of requiring users to do it.

@clintongormley

I think this ticket can be closed now? Feel free to reopen if more discussion is needed
