ES 1.4.2 random node disconnect #9212
This may be a more suitable question for the mailing list(?). How big is your cluster: node count, index count, shards and replicas, total data size, heap size, ES version? What is your config?
Right now: … Current yml config here: …
That's a lot of shards. You might be running into resource limits. Check your ulimit on filehandles.
The file limit is not the problem; it is set to … I have another cluster with an older ES version, with over 9000 shards on 3 nodes, and my nodes don't get random disconnects there.
Can you try closing some old indices for a while and seeing if it helps?
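For what it's worth, closing (rather than deleting) indices is a cheap experiment; a minimal sketch, assuming a hypothetical daily-index naming pattern and localhost as a placeholder:

```sh
# Closing keeps the data on disk but takes the shards out of the active cluster.
curl -XPOST 'localhost:9200/logs-2014.12.*/_close'
# Reopen later with:
# curl -XPOST 'localhost:9200/logs-2014.12.*/_open'
```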
Ok, I will try that with about 1/3 of my total shards and see what happens. I will do it Monday, so have a nice weekend till then :)
@Revan007 did you see any long GCs on the node? Can you set the …
Hi, I am also seeing the same issue on our 3-node cluster - I have posted a lot of details in the elasticsearch users group here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/elasticsearch/fQkhRJ6Md9c I've messed around with lowering tcp keep alive, setting fd.ping_timeout, etc., and the issue is still happening. The root cause seems to be a Netty-level timeout. The last change I tried, last night, was to increase fd.ping_timeout - the issue is still happening frequently and causing failures on the client side in our application.
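For reference, the zen fault-detection knobs mentioned above live in elasticsearch.yml; a hedged sketch (the values are illustrative, not recommendations, and the path assumes a Debian/RPM package install):

```sh
# Append fault-detection overrides to the node config; requires a node restart.
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5
EOF
```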
How do I enable transport.netty debug logging? I have also read this: https://github.com/foundit/elasticsearch-transport-module - Recommended tweaks to existing settings: how can I modify these values in ES?
You can do something like the following. (This works - I wasn't sure whether logger.transport or logger.org.elasticsearch.transport was the right one - I tried both and this works.) curl -XPUT :9200/_cluster/settings -d '
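The body of the command above was truncated in the thread; a hedged reconstruction (the host is a placeholder), raising the transport logger to DEBUG as a transient cluster setting:

```sh
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "logger.transport": "DEBUG" }
}'
```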
Thank you @rangagopalan. I guess I will wait for another node failure to see the results. And here is the _nodes/stats info:
My node failed... these are the logs, in addition to the ones posted before:
@Revan007 and @bleskes - This issue seems to be faced by others as well - see https://groups.google.com/forum/#!searchin/elasticsearch/ReceiveTimeoutTransportException/elasticsearch/EvDNO_vALNg/mVzurNj1KRAJ The solution there was to reduce the shards/indices/replicas combination, since that might help the cluster nodes stats API return within the hard-coded timeout of 15 seconds (assuming that is the cause of the connection failures). I am trying it here on my 3-node cluster (we recently switched to time-based indexes, and at least temporarily for testing I was able to close a bunch of older time-based indices to reduce the total shards+replicas from 1200 to about 450). I will post again in a few hours on whether this helped eliminate/reduce the timeouts. If this works, perhaps we can also request an enhancement to make this 15-second timeout configurable for use in clusters with a larger number of shards/indices. (I believe the hard-coded limit is in the following code, but I could be mistaken: https://github.com/elasticsearch/elasticsearch/blob/3712d979519db5453bea49c34642a391c51d88b3/src/main/java/org/elasticsearch/cluster/InternalClusterInfoService.java)
@Revan007 - it seems something is causing …
@rangagopalan these stats should be very fast (pending some old solved bugs - which version are you on?). Any reason why you point at the InternalClusterInfoService? It is not used for client calls but rather for the disk threshold allocation decider. Do you have issues there?
Am using 1.4.2 - more details of my env are in the post here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/elasticsearch/fQkhRJ6Md9c No, I wasn't using the internal class in any way - I was just theorizing that, since the timeout was seen in the nodes/stats call, perhaps this 15-second timeout set in the internal cluster info service class was applicable. The test I am trying is to reduce the time for any cluster statistics calls by reducing the number of indices/replicas/shards (what I am doing may not be applicable if the stats APIs always return quickly and aren't dependent on the number of indexes). I am trying out what was posted in the other link I referred to - to see if the timeouts/disconnects stop when I reduce the total number of indexes/shards.
@bleskes. That was when I did a manual restart of that node. If I let it recover on its own, it takes like 10 minutes. Here is the data node log when I let it recover on its own:
@Revan007 what takes 10 minutes exactly? I want to make sure I understand the sequence of events. For what it's worth, it may be that these are genuine network issues. The last error in your logs shows a socket read timeout:
Hmm... it has been almost 20 hours now and no node disconnect so far. I am not sure if my change fixed it. I will wait a little longer till I can say for sure.
An update - it's been about 18 hours since I closed a bunch of older indexes to reduce the total number of shards from about 1100 to below 500. No timeouts since then.
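Should anyone want to track shard counts while experimenting, the _cat API gives a quick view; a small sketch (localhost is a placeholder, and unassigned shards have no node column, so the per-node tally is approximate):

```sh
# Total shard copies in the cluster:
curl -s 'localhost:9200/_cat/shards' | wc -l
# Rough per-node breakdown (last column is the node name when assigned):
curl -s 'localhost:9200/_cat/shards' | awk '{print $NF}' | sort | uniq -c
```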
Ok, it did happen again, but after 22 hours this time. It took ~17 minutes to disconnect the node, go to yellow state, and recover. That means that for 17 minutes, when I tried to run a search against the master, it wouldn't give me any response. Here is the complete log from the master and the data node:
Damn, I have received 2 more node timeouts since, so I guess I was just lucky before that it took more time. @rangagopalan still going ok for you after reducing the number of open shards? I don't find this a solution, because I am using 1 index per day with 9 shards and I have months of data; I will be over 3000 shards whatever I do...
I used to run into similar issues until I reduced the number of active … - Tin
@Revan007 - Yeah - things look fine here since the reduction of indexes/shards - no timeouts (going for more than 26 hours now).
@Revan007 @rangagopalan there are two things at play here - timeouts on indices/node stats, and network-level disconnects. They are two different things. On a slow file system, indices stats (and to a lesser degree node stats) can take longer to complete when you have many shards on a node. In extreme cases this can cause a timeout. This is however very different from network disconnects, which should not be affected by the number of shards. That said, it is possible that the shards put too much load on the node, causing it not to respond in time to pings from the master, which in turn causes the node to be thrown out of the cluster and the network channels to be closed. This is however not the case here, as you can see the network is closed because of a socket-level timeout (still possible under load, but less likely):
@Revan007: what do you mean exactly when you say "I tried to query search the master it wouldn't give me any response."? Did search requests to the master not return? A couple of side notes:
@bleskes I will reduce the number of shards per index to 5 soon, but I cannot reduce the number of indexes; they have to be per day because of the amount of data. So in 1 year I will still have 365 x 3 = 1095 indices (3 types of indices). The indices stats timeout of 15000 ms has to be made configurable somehow in a future ES update/fix.
@Revan007 I still think there is some confusion here, as the indices stats API doesn't have a timeout. That 15s timeout reference (which I agree should be configurable) is for internal disk-free-space monitoring calls issued from the master. I'm sorry - but it is not clear to me what exactly you mean when you say "takes 17 minutes to d/c the failed node the query still works fine." Node stats / indices stats should never be in the way of search calls - they use another thread pool. I think something else is going on, and it seems to all point at networking issues or general node load - do you monitor these?
That's OK - just make sure you need 5 shards - you might be surprised how much you can get out of a single one.
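If it helps, the shard count for new daily indices can be set once via an index template; a sketch with hypothetical names (template name, index pattern, and localhost are placeholders):

```sh
# Every new index matching "logs-*" starts with 1 primary and 1 replica.
curl -XPUT 'localhost:9200/_template/daily_single_shard' -d '{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'
```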
Well, before, when I was querying the master node (doing a search, curl -XGET bla bla bla) when a node started to fail:
I would get no result; nothing would return. It takes like 17 minutes for the master to start spitting tons of errors and disconnect that node; then it starts to recover the shards, and when the node is reconnected I could query it and receive results again. Now I am querying the balancer node, and even if a node starts to fail, at least I am able to query during that time.
So because I have 5 nodes I should preferably have 5 shards, 1 per node. Should I understand that if I now start another 4 nodes and have 9 nodes, 1 shard per node, I should not see those disconnects anymore?
It really sounds to me like there are some networking issues causing both symptoms. Can you try monitoring that outside of ES? If you don't have any monitoring infra in place, maybe start a ping process or something between the nodes - see the sketch below.
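A minimal sketch of that kind of out-of-band watch, assuming nothing about your setup beyond shell access (the IPs and log path are placeholders):

```sh
# Log any inter-node ping loss with a timestamp, once every 5 seconds.
while true; do
  for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
    if ! ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
      echo "$(date -Is) lost ping to $host"
    fi
  done
  sleep 5
done >> /var/log/es-ping-watch.log
```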
Remember you have replicas and also more indices (one per day) that should be spread around. I would try 1 shard and see if you have enough indexing/search capacity based on your data volumes. That will also give you an idea how much a shard can take, and based on that you can decide how many shards you need.
I still don't see any correlation between the number of shards and disconnects.
Just an update confirming that there is definitely some kind of problem related to the number of shards per node. We had a separate testing system that was working fine - a single node that had about 647 shards/replicas (set up with 1 replica, so there were 647 unassigned shards).
I believe (as suggested here too: https://groups.google.com/d/msg/elasticsearch/EvDNO_vALNg/BPK5yYSUFeQJ) that there is some kind of problem within Elasticsearch node communications/monitoring when, in a multi-node cluster, the number of shards/replicas per node goes above a certain number. From our experience we can say that 647 shards/replicas per node surely causes it (on our testing server), and on the production cluster I believe we saw the issue at about 400 or 450 shards/replicas per node. I think at the very least the Elasticsearch documentation should be updated to provide recommendations/guidelines on cluster sizing to avoid this kind of issue.
Ok, here is the bug. If you can't upgrade your kernel right now, you can do: sudo ethtool -K eth0 sg off Good luck!
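For anyone checking whether they are affected before applying this workaround, a hedged sketch (eth0 is an assumption; substitute your interface, and note the affected kernel range is as reported in this thread, not verified here):

```sh
uname -r                               # compare against the kernel versions discussed above
dmesg | grep -i 'rides the rocket'     # kernel log line associated with this bug
ethtool -k eth0 | grep scatter-gather  # current offload state
sudo ethtool -K eth0 sg off            # the workaround from the comment above
```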
Thanks a lot @Revan007 |
Agree. Given the vagueness of the issue I think there could be a few bugs at play and we may be experiencing different ones that manifest themselves in a similar way. Or I could just have a badly configured cluster... 😜 |
@wilb the issue's description might be vague and the various comments may have different causes, but I know for a fact that our issue is worked around by using a specific kernel.
@faxm0dem don't want you to think that I was having a dig there - just meant that this feels like it could be one of those issues that could have multiple root causes that all manifest themselves in similar ways. |
I am also having this issue and have submitted a support ticket, but have yet to resolve it. We're running ES 1.4.4 on Ubuntu 14.04 on GCE. We did not have the "rides the rocket" log lines in syslog and have tested on 3.16 and 3.19 kernels. No resolution yet. |
We upgraded to Ubuntu 14.04.02, and saw the same issue again after just about a week of runtime. :( |
I got to the bottom of what my particular issue was. Not sure it's likely to help others, but worth detailing... I use Chef to configure our cluster, one of the things that gets configured is the S3 Repo for backups. Unfortunately I wasn't doing this in an idempotent manner, which resulted in the call coming in every 30 minutes. This wasn't causing any problems initially, but the call will verify the repository by default (this can be disabled by a query string, but I wasn't doing that). Over time this appeared to become more of an expensive operation (I assume because it's an active backup repo full of data) - this looks to have been causing hosts to briefly lock and result in them temporarily dropping out of the cluster whilst the verify was taking place. Disabling the verify made things behave in a much more sane manner so I rewrote the logic to actually be idempotent. On top of this, there appears to be a point of degradation at which there is no return unless you restart your master. As I mentioned above, calls to the local node status endpoint of the master start to take 20-30 seconds to return and at this point the cluster generally struggles along with many more disconnects until you take action. |
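For anyone hitting the same thing: the query string mentioned above is the verify flag on repository registration. A hedged sketch of the idempotent re-registration (repo name, bucket, and host are hypothetical):

```sh
# Re-register the S3 repository without the expensive verification step;
# the trailing ?verify=false is the query string referred to above.
curl -XPUT 'localhost:9200/_snapshot/backups?verify=false' -d '{
  "type": "s3",
  "settings": { "bucket": "my-es-backups" }
}'
```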
At least for us, upgrading to the 3.18 kernel fixed the issue. Our cluster has now been running for 3 weeks without any node disconnect, whereas earlier we had a disconnect at least once a week. It does seem that there are multiple issues which can cause node disconnects. Maybe the improvement they made in 1.6.0 to make cluster state changes async might help.
Thank you - that looks exactly like the issue. I'd held off from doing an upgrade to 1.5.x, but it looks like I should definitely do so.
It sounds like we've arrived at the root cause of this problem, so I'm going to close the issue. If anybody disagrees, ping me to reopen. |
Not really: @wilb had an unrelated issue |
Ping @clintongormley |
Nope, no kernel messages pointing in that direction. It's definitely something that happened between the Linux versions I mentioned somewhere above.
OK, reopening |
Just FYI, we're seeing this pretty continuously with 1.6.0 on Ubuntu 14.04. Going to try some of the suggestions above. |
@bradmac downgrading the kernel should work if …
What I'm seeing is that the server node is not stuck; most requests to it are successfully processed. It's just that apparently some of the threads in the client-side thread pool are unable to connect to it. This is under lightly loaded conditions.
@bradmac different issue I'd say |
Ugh, we just had this issue happen twice in 3 hours. The first time it happened to one node in our cluster, the second time to a different node. In both cases, 1 of the 4 nodes went into a garbage-collection loop. When this happened, the other 3 nodes were disconnected from the GC-looping node and the cluster became almost entirely unresponsive. Node 75's GC loop -- this stopped on its own:
Node 75's GC loop an hour later -- went on for an hour or more:
Node 77's GC loop -- stopped when we restarted ES:
@diranged that's not surprising. you have enormous heaps (90GB!) and they're full. You need some tuning advice. I suggest asking about that on the forum: http://discuss.elastic.co/ |
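For context on why 90 GB heaps hurt: beyond roughly 30-32 GB the JVM loses compressed object pointers, and very large heaps mean very long GC pauses. A sketch of the usual ES 1.x sizing knob (the value is illustrative, not a recommendation for this cluster):

```sh
# Set before starting Elasticsearch; bin/elasticsearch reads this in ES 1.x.
export ES_HEAP_SIZE=30g
```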
If anybody is facing an issue with the following error:
exception caught on transport layer [[id: 0x32cd6d09, /172.31.6.91:38524 => /172.31.18.78:9300]], closing connection java.io.StreamCorruptedException: invalid internal transport message format, got (50,4f,53,54)
The 9300 port is used for inter-node communication and uses an internal binary protocol, so you can't use it from a browser. In order to use the 9300 port you have to use the Java API, either the Node client or the Transport client, which both understand the internal binary protocol. From a browser you should only use the 9200 port, which is the one that exposes the REST API. If you are using an Amazon load balancer to access your cluster, then you should change your instance protocol in the listener settings to 9200.
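Worth noting: (50,4f,53,54) is hex ASCII for "POST", i.e. an HTTP request arrived on the binary transport port. A quick way to see the distinction from the shell (the hostname is a placeholder):

```sh
curl -s 'http://es-node:9200/'  # 9200 speaks HTTP/REST and answers normally
curl -s 'http://es-node:9300/'  # 9300 speaks the binary transport protocol;
                                # HTTP sent here makes the node log the
                                # StreamCorruptedException shown above
```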
@pavanIntel and how is this relevant? |
Nothing further on this ticket. Closing |
I have the same problem, and I have run out of ideas. The "rides the rocket" socket bug is out of the question, since we use a recent kernel (3.16.7-ckt20-1+deb8u1). ES version 1.7.4. Debian Jessie. Java build 1.8.0_66-internal-b17. Here is the debug log:
It is true that we have a rather big cluster, but only the master disconnects, not the nodes. They communicate through a VPN tunnel; maybe somebody has another idea how to improve this. Is it normal that the nodes are queried with stats so often, i.e. every few seconds? Thank you!
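If those frequent stats calls turn out to be the master's disk-usage polling (the InternalClusterInfoService discussed earlier in this thread), the polling interval is a setting; a sketch, assuming your version accepts it as a transient cluster setting (localhost is a placeholder):

```sh
# Lengthen the disk-usage polling interval from its default (30s).
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.info.update.interval": "60s" }
}'
```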
Hey,
I have been having trouble for a while now. I am getting random node disconnects and I cannot explain why.
There is no increase in traffic (search or index) when this happens; it feels completely random to me.
I first thought it could be the aws-cloud plugin, so I removed it, used unicast, and pointed directly to my nodes' IPs, but that didn't seem to be the problem.
I changed the type of instances (now m3.2xlarge), added more instances, made many modifications in the ES yml config, and still nothing.
Changed Oracle Java from 1.7 to 1.8, changed the CMS collector to G1GC, and still nothing.
I am out of ideas... how can I get more info on what is going on?
Here are the logs I can see from the master node and the data node:
http://pastebin.com/GhKfRkaa