
Large index no longer initialises under 1.4.0 and 1.4.0 Beta 1 due to OutOfMemoryException #8394

Closed
andrassy opened this issue Nov 7, 2014 · 17 comments

@andrassy

andrassy commented Nov 7, 2014

We have one particularly large index in our cluster - it contains tens of millions of documents and has quite a lot of nested fields too. Prior to 1.4.0 Beta 1 (including 1.2.x and 1.3.x) the index re-initialised fine on a node with 8GB allocated to Elasticsearch (16GB+ available to the OS). Since 1.4.0 Beta 1 (and still on 1.4.0) we get an OutOfMemoryError (startup log and exception stack below). At this point the node ceases recovery (expected, I guess) and becomes unresponsive. All data nodes suffer the same fate and the entire cluster becomes unresponsive.

[2014-11-07 17:12:39,895][WARN ][common.jna               ] unable to link C library. native methods (mlockall) will be disabled.
[2014-11-07 17:12:40,077][INFO ][node                     ] [dvlp_FRONTEND2] version[1.4.0], pid[9052], build[bc94bd8/2014-11-05T14:26:12Z]
[2014-11-07 17:12:40,077][INFO ][node                     ] [dvlp_FRONTEND2] initializing ...
[2014-11-07 17:12:40,129][INFO ][plugins                  ] [dvlp_FRONTEND2] loaded [cloud-aws], sites [bigdesk, head, inquisitor, kopf]
[2014-11-07 17:12:45,220][INFO ][node                     ] [dvlp_FRONTEND2] initialized
[2014-11-07 17:12:45,220][INFO ][node                     ] [dvlp_FRONTEND2] starting ...
[2014-11-07 17:12:45,438][INFO ][transport                ] [dvlp_FRONTEND2] bound_address {inet[/0:0:0:0:0:0:0:0:50882]}, publish_address {inet[FRONTEND2/192.168.10.73:50882]}
[2014-11-07 17:12:45,452][INFO ][discovery                ] [dvlp_FRONTEND2] dvlp/C2f-euXcRc-cEv3dnsBnXw
[2014-11-07 17:13:15,451][WARN ][discovery                ] [dvlp_FRONTEND2] waited for 30s and no initial state was set by the discovery
[2014-11-07 17:13:15,468][INFO ][http                     ] [dvlp_FRONTEND2] bound_address {inet[/0:0:0:0:0:0:0:0:50881]}, publish_address {inet[frontend2/192.168.10.73:50881]}
[2014-11-07 17:13:15,468][INFO ][node                     ] [dvlp_FRONTEND2] started
[2014-11-07 17:13:48,552][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:14:51,597][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:15:54,633][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:16:57,647][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:18:00,664][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:19:03,675][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:20:06,684][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2014-11-07 17:20:36,950][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][jwhGk5NyTx-E1HInKTLDkg][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [NodeDisconnectedException[[dvlp_FRONTEND2_coordinator][inet[/192.168.10.73:55591]][internal:discovery/zen/join] disconnected]]
[2014-11-07 17:20:41,171][WARN ][transport.netty          ] [dvlp_FRONTEND2] Message not fully read (response) for [85] handler future(org.elasticsearch.transport.EmptyTransportResponseHandler@2060e2c8), error [true], resetting
[2014-11-07 17:20:41,171][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND1_coordinator][4y8Hh5kAQPK2Ie3gzc58Ww][FRONTEND1][inet[/192.168.10.70:55858]]{datacentrename=site1, data=false, nodename=dvlp_FRONTEND1_coordinator, master=true}], reason [RemoteTransportException[Failed to deserialize exception response from stream]; nested: TransportSerializationException[Failed to deserialize exception response from stream]; nested: StreamCorruptedException[unexpected end of block data]; ]
[2014-11-07 17:20:45,520][INFO ][discovery.zen            ] [dvlp_FRONTEND2] failed to send join request to master [[dvlp_FRONTEND2_coordinator][-O87CxU3RRSTHZkuC985Yw][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}], reason [RemoteTransportException[[dvlp_FRONTEND2_coordinator][inet[/192.168.10.73:55591]][internal:discovery/zen/join]]; nested: ElasticsearchIllegalStateException[Node [[dvlp_FRONTEND2_coordinator][-O87CxU3RRSTHZkuC985Yw][FRONTEND2][inet[FRONTEND2/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}] not master for join request from [[dvlp_FRONTEND2][C2f-euXcRc-cEv3dnsBnXw][FRONTEND2][inet[/192.168.10.73:50882]]{datacentrename=site2, nodename=dvlp_FRONTEND2, master=false}]]; ], tried [3] times
[2014-11-07 17:20:48,831][INFO ][cluster.service          ] [dvlp_FRONTEND2] detected_master [dvlp_FRONTEND2_coordinator][-O87CxU3RRSTHZkuC985Yw][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}, added {[dvlp_DEVBATH01.exabre.co.uk_loadbalancer][8i4izXAUQiWeS2arwV9LeA][DEVBATH01][inet[/192.168.10.65:12184]]{datacentrename=site1, data=false, nodename=dvlp_DEVBATH01.exabre.co.uk_loadbalancer, master=true},[dvlp_FRONTEND2_coordinator][-O87CxU3RRSTHZkuC985Yw][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true},[dvlp_FRONTEND2_loadbalancer][joVXc_fGTx-SC_YwJ2YBmQ][FRONTEND2][inet[/192.168.10.73:65341]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_loadbalancer, master=false},[dvlp_FRONTEND1_loadbalancer][snDHwo0YTR6VsAFV9nBcxw][FRONTEND1][inet[/192.168.10.70:55054]]{datacentrename=site1, data=false, nodename=dvlp_FRONTEND1_loadbalancer, master=false},}, reason: zen-disco-receive(from master [[dvlp_FRONTEND2_coordinator][-O87CxU3RRSTHZkuC985Yw][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}])
[2014-11-07 17:21:01,937][INFO ][cluster.service          ] [dvlp_FRONTEND2] added {[dvlp_FRONTEND1_coordinator][4y8Hh5kAQPK2Ie3gzc58Ww][FRONTEND1][inet[/192.168.10.70:55858]]{datacentrename=site1, data=false, nodename=dvlp_FRONTEND1_coordinator, master=true},}, reason: zen-disco-receive(from master [[dvlp_FRONTEND2_coordinator][-O87CxU3RRSTHZkuC985Yw][FRONTEND2][inet[/192.168.10.73:55591]]{datacentrename=site2, data=false, nodename=dvlp_FRONTEND2_coordinator, master=true}])
[2014-11-07 17:25:25,598][INFO ][monitor.jvm              ] [dvlp_FRONTEND2] [gc][old][739][27] duration [8s], collections [1]/[9s], total [8s]/[8.8s], memory [7.8gb]->[7.7gb]/[7.9gb], all_pools {[young] [172.4mb]->[46.5mb]/[199.6mb]}{[survivor] [24.9mb]->[0b]/[24.9mb]}{[old] [7.6gb]->[7.7gb]/[7.7gb]}
[2014-11-07 17:25:46,387][INFO ][monitor.jvm              ] [dvlp_FRONTEND2] [gc][old][746][32] duration [5s], collections [1]/[6s], total [5s]/[23.6s], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [195mb]->[199.6mb]/[199.6mb]}{[survivor] [0b]->[10.9mb]/[24.9mb]}{[old] [7.7gb]->[7.7gb]/[7.7gb]}
[2014-11-07 17:28:16,136][WARN ][index.warmer             ] [dvlp_FRONTEND2] [dvlp_13_67_item_20140410][7] failed to load fixed bitset for [org.elasticsearch.index.search.nested.NonNestedDocsFilter@fd00879d]
org.elasticsearch.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2201)
    at org.elasticsearch.common.cache.LocalCache.get(LocalCache.java:3937)
    at org.elasticsearch.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4739)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.getAndLoadIfNotPresent(FixedBitSetFilterCache.java:139)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.access$100(FixedBitSetFilterCache.java:75)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$FixedBitSetFilterWarmer$1.run(FixedBitSetFilterCache.java:287)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.util.FixedBitSet.<init>(FixedBitSet.java:187)
    at org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(MultiTermQueryWrapperFilter.java:104)
    at org.elasticsearch.common.lucene.search.NotFilter.getDocIdSet(NotFilter.java:49)
    at org.elasticsearch.index.search.nested.NonNestedDocsFilter.getDocIdSet(NonNestedDocsFilter.java:46)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$2.call(FixedBitSetFilterCache.java:142)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$2.call(FixedBitSetFilterCache.java:139)
    at org.elasticsearch.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4742)
    at org.elasticsearch.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
    at org.elasticsearch.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
    at org.elasticsearch.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
    at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
    ... 8 more
[2014-11-07 17:28:29,215][INFO ][monitor.jvm              ] [dvlp_FRONTEND2] [gc][old][749][40] duration [22.9s], collections [4]/[2.3m], total [22.9s]/[1m], memory [7.9gb]->[7.9gb]/[7.9gb], all_pools {[young] [199.5mb]->[199.6mb]/[199.6mb]}{[survivor] [22.9mb]->[23.1mb]/[24.9mb]}{[old] [7.7gb]->[7.7gb]/[7.7gb]}
[2014-11-07 17:28:23,797][WARN ][index.warmer             ] [dvlp_FRONTEND2] [dvlp_13_67_item_20140410][7] failed to load fixed bitset for [org.elasticsearch.index.search.nested.NestedDocsFilter@fd00879d]
org.elasticsearch.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2201)
    at org.elasticsearch.common.cache.LocalCache.get(LocalCache.java:3937)
    at org.elasticsearch.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4739)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.getAndLoadIfNotPresent(FixedBitSetFilterCache.java:139)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.access$100(FixedBitSetFilterCache.java:75)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$FixedBitSetFilterWarmer$1.run(FixedBitSetFilterCache.java:287)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.util.FixedBitSet.<init>(FixedBitSet.java:187)
    at org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(MultiTermQueryWrapperFilter.java:104)
    at org.elasticsearch.index.search.nested.NestedDocsFilter.getDocIdSet(NestedDocsFilter.java:50)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$2.call(FixedBitSetFilterCache.java:142)
    at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$2.call(FixedBitSetFilterCache.java:139)
    at org.elasticsearch.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4742)
    at org.elasticsearch.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
    at org.elasticsearch.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
    at org.elasticsearch.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
    at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
    ... 8 more
@andrassy andrassy changed the title Index no longer initialises under 1.4.0 and 1.4.0 Beta 1 due to OutOfMemoryException Large index no longer initialises under 1.4.0 and 1.4.0 Beta 1 due to OutOfMemoryException Nov 7, 2014
@andrassy
Author

andrassy commented Nov 7, 2014

A little bit of digging in the code and I came across the "index.load_fixed_bitset_filters_eagerly" setting. Setting this to false seems to avoid my initial problem. Has the default changed? Is this something new? Are there any impacts I might need to look out for in setting this to false?
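
For anyone else landing here, a minimal sketch of one way the setting can be applied, assuming it is honoured like any other index-level setting supplied at index-creation time (the index name below is made up for illustration; see further down the thread for the elasticsearch.yml variant):

# hypothetical index name; sets the eager-loading setting at creation time
curl -XPUT 'localhost:9200/my_big_index' -d '{
  "settings": {
    "index.load_fixed_bitset_filters_eagerly": false
  }
}'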

@martijnvg
Member

Hey @andrassy, how many nested object fields do you have across all your mappings?

Since 1.4 we eagerly load these filters and keep them around to make nested query execution as fast as possible. Under the hood the nested query relies on these filters being in memory as bitsets.

The index.load_fixed_bitset_filters_eagerly setting has been added to disable the eager loading, but at some point, when running nested queries, these filters will end up on the heap as bitsets anyway. Disabling it may make sense in your case if you have many nested object fields but not all of them are actually used.

@andrassy
Author

andrassy commented Nov 7, 2014

_stats reports a doc count just above 600 million for the index (which includes the nested docs, right?) - 10 shards across 5 data nodes at present. There are quite a few nested mappings which we do use, but I think we're probably not hitting the full parent doc set because other filters are applied when we actually query - would that keep the bitset filter caches smaller? It's just that we don't seem to have hit any OOM limits recently, having operated on 1.3.x and prior versions for some time.

We could restructure the data to avoid many of the nested mappings, I think, but that'll take us some time :( and involve code changes right the way up our stack. We'll try with the index.load_fixed_bitset_filters_eagerly setting set to false and see how we get on.

Thought it was worth sharing the issue here. Thanks for the rapid response @martijnvg!

@martijnvg
Member

@andrassy Sharing this is really important! ES may need to change its default behaviour when it comes to eager loading the filters associated with nested object fields.

Yes, the doc count does include nested documents. You said you have quite a few nested object fields. Can you share how many nested fields you have (check the mapping), or an estimate?

In the node stats api we also expose how much memory the fixed bitset cache is taking (under the fixed_bit_set_memory_in_bytes key). Are you able to check this?
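
For reference, a rough sketch of how to pull that number out, assuming the 1.4.x stats layout where the key sits under the segments section of the per-node indices stats:

curl 'localhost:9200/_nodes/stats/indices?pretty' | grep fixed_bit_set_memory_in_bytes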

@andrassy
Author

andrassy commented Nov 7, 2014

We have three types within the index with 5, 5, and 4 (totalling 14) nested properties.

fixed_bit_set_memory_in_bytes currently says 0, but I only just started to recover with load_fixed_bitset_filters_eagerly set to false. I'll check again once we've seen some traffic - it'll probably be Monday now as it's our DEV box and everyone else went home already :D

@martijnvg
Member

Ok, it would be great to know how much fixed_bit_set_memory_in_bytes reports.

Do you by any chance also have any _parent fields configured in your mappings? Each parent type increases the number of entries in the bitset cache.

Also, beyond that, do you have any other warming configured (warmer queries, eager field data loading)?

@martijnvg
Member

Also, if you are able to share your mappings (or a dummy mapping that shows the structure of your nested object fields) that would be helpful, so we can see if we can improve this. Having 14 nested fields and 600M docs shouldn't result in an OOM with your available heap space.
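
Purely for illustration (index, type, and field names below are invented), a dummy mapping with nested object fields looks roughly like this:

curl -XPUT 'localhost:9200/dummy_index' -d '{
  "mappings": {
    "item": {
      "properties": {
        "attributes": {
          "type": "nested",
          "properties": { "name": { "type": "string" }, "value": { "type": "string" } }
        },
        "price_history": {
          "type": "nested",
          "properties": { "date": { "type": "date" }, "price": { "type": "double" } }
        }
      }
    }
  }
}'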

@martijnvg
Member

With the changes to the default eager-loading behaviour and to the reliance on bitset-based filters in nested and parent/child made via #8454, #8440 and #8414, running out of memory as happened here should no longer occur.

@portante

portante commented Dec 5, 2014

I just upgraded from 1.3.2-1 to 1.4.1 and am seeing the following OOMs:

[2014-12-05 13:32:14,166][WARN ][index.warmer             ] [Patriots] [foo.bar-20140831][0] failed to load fixed bitset for [org.elasticsearch.index.search.nested.NonNestedDocsFilter@a801f786]  
org.elasticsearch.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: Java heap space  
        at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2201)  
        at org.elasticsearch.common.cache.LocalCache.get(LocalCache.java:3937)  
        at org.elasticsearch.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4739)  
        at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.getAndLoadIfNotPresent(FixedBitSetFilterCache.java:137)  
        at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.access$100(FixedBitSetFilterCache.java:73)  
        at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$FixedBitSetFilterWarmer$1.run(FixedBitSetFilterCache.java:278)  
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)  
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)  
        at java.lang.Thread.run(Thread.java:745)  
Caused by: java.lang.OutOfMemoryError: Java heap space  

Is this related to this problem? And if so, do I have to change something else for my indexes, or should this change in 1.4.1 have fixed this already?

See also my comment in: #8487

@martijnvg
Member

@portante ES version 1.4.1 should have fixed the OOM issue related to the fixed bitset cache.

If possible, can you share the following (example curl commands after the list):

  • Your mappings: localhost:9200/_mappings
  • Cluster stats: localhost:9200/_cluster/stats?human&pretty
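
For example, as plain curl against a local node (host and port assumed to be the defaults):

curl 'localhost:9200/_mappings?pretty'
curl 'localhost:9200/_cluster/stats?human&pretty'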

@portante

portante commented Dec 5, 2014

@martijnvg: Loaded the above in the following gist: https://gist.github.com/portante/711aa2428461a7485384

I did not provide all the mappings for each index; instead, I gave you one representative mapping of each type: sosreport, sar, and marvel (which is already known).

I also provided a /_cat/shards output so you can see the relative sizes of the indexes. The vos.sar-* indexes are about 10 - 13 GB, while all the others seem to be in sub-1 GB ranges.

I have successfully loaded all of the .marvel*, tvos.*, and vos.sosreport-* indexes, but have been unsuccessful with the vos.sar-* indexes.

@martijnvg
Member

I see that the fixed bitset cache already takes 10GB and that many of your shards are not started. In total you have assigned 206GB of JVM heap to ES, which feels more than sufficient, so I don't directly see why you would run out of memory. However, in general this amount of heap for a single node is too high and should be split across more nodes (they can be on the same physical machine). That being said, this shouldn't result in the situation you're in now.

Also, the vos.sar-20141019 index has 14 unique nested object fields in total. Do the other indices have the same nested fields? And how many Lucene documents do those indices have in total, more or less? (This is different from the number of documents reported by ES when nested fields are defined in the mapping.) It can be found in the indices stats api under the docs stats.

As I commented earlier here, since 1.4 ES eagerly loads a data structure into memory so that nested queries/filters and nested aggregations run fast, instead of loading it lazily when it is first needed.

In order to get all shards started I recommend setting index.load_fixed_bitset_filters_eagerly to false in your elasticsearch.yml file and restarting. This disables the eager loading and prevents the OOM shown in the stack trace you sent earlier.
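
Concretely, that is a single line in the elasticsearch.yml file, followed by a restart:

index.load_fixed_bitset_filters_eagerly: false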

@martijnvg
Member

@portante It is better to run the indices stats after you have configured the mentioned setting, and the cat indices api may give a better view of the metric: localhost:9200/_cat/indices/vos.sar-*

@portante

portante commented Dec 6, 2014

@martijnvg, can you explain why having more memory is too much? I can certainly break this up, but that seems counterintuitive.

All the vos.sar-* indices CAN have 14 unique nested fields. Most have about 6-8, if I understand the data set correctly.

In the provided gist, you can see that value: https://gist.github.com/portante/711aa2428461a7485384#file-shards-cat-L71

Each indexed sar document represents one sample collected as reported by the sadf command from sysstat. On some systems they might collect 144 samples a day (10 minute intervals); some have 8,600+ samples a day (10 second intervals). What really seems to affect the size of things is the number of nested elements. We have seen VM hosts with close to 1,000 NICs servicing VMs, which ends up as one nested doc per NIC in the net-dev and net-edev docs. Or they might have 400+ block devices, ending up with that many nested docs for disks.

I had disabled the index warmers on those large indexes as a work-around. After re-enabling the warmers and applying the setting above, the instance now takes about 3 minutes to load up from ES start.

Much better. Thanks!

@martijnvg
Member

@portante This is the reason: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/heap-sizing.html#compressed_oops - in short, heaps above roughly 32GB lose compressed object pointers, so a very large single heap wastes memory on pointer overhead and tends to suffer longer GC pauses.

Now that all shards are started, can you share how many docs all the vos.sar-* indices have?
The best way to share this is by running: curl 'localhost:9200/_cat/shards/vos.sar-*'.
This gives a good indication of how much heap memory the fixed bitset cache will take if everything is loaded.
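
As a rough back-of-envelope only (not an official formula): each cache entry is a FixedBitSet over a segment's Lucene documents, i.e. roughly one bit per Lucene doc per cached filter, so:

approx heap for the cache ≈ (Lucene docs on the node / 8) bytes × (number of cached filters)

For example, a hypothetical 1 billion Lucene docs on a node with ~15 cached filters (one per nested path plus the non-nested filter seen in the stack traces above) would be about 1e9/8 bytes × 15 ≈ 1.9GB.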

@portante

portante commented Dec 8, 2014

@martijnvg, I have updated that gist with the output requested above (using wildcards on the _cat command did not work for me for some reason), see https://gist.github.com/portante/711aa2428461a7485384#file-shards-txt

I'll have to think about compressed_oops and how we can restructure to take advantage of it. It seems like it would be a nice feature for ES to break itself up into smaller instances automatically instead of requiring users to do it.

@clintongormley

I think this ticket can be closed now? Feel free to reopen if more discussion is needed
