
COMET - accumulomaster out of memory issue #14

Closed
kthare10 opened this issue Feb 20, 2019 · 7 comments

kthare10 commented Feb 20, 2019

In the COMET cluster running in AWS, the node running accumulomaster also hosts the COMET head node.
In the current deployment, the EC2 instance is of type small, which has 2 GB of RAM.

Issue:
The accumulomaster process is killed due to an OutOfMemoryError, roughly every 4-5 days.

The Accumulo process is killed by the JVM's -XX:OnOutOfMemoryError handler, as can be seen in the logs: /opt/accumulo-1.9.1/logs/gc_accumulomaster.out

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 3018"...

Investigation so far:
The current deployment configures Accumulo to use about 1 GB of memory, distributed across its processes (see the accumulo-env.sh sketch after this list):

  • accumulomaster configured with -Xmx128m -Xms128m
  • gc configured with -Xmx64m -Xms64m
  • monitor configured with -Xmx64m -Xms64m
  • tracer configured with -Xmx128m -Xms64m
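
In the Accumulo 1.9.x example configs these limits live in conf/accumulo-env.sh; a sketch of the relevant lines, assuming the stock layout (the tracer falling under ACCUMULO_OTHER_OPTS is an assumption, and our deployed file may differ):

# Sketch of conf/accumulo-env.sh heap settings (stock 1.9.x example layout assumed).
test -z "$ACCUMULO_MASTER_OPTS"  && export ACCUMULO_MASTER_OPTS="${POLICY} -Xmx128m -Xms128m"
test -z "$ACCUMULO_GC_OPTS"      && export ACCUMULO_GC_OPTS="-Xmx64m -Xms64m"
test -z "$ACCUMULO_MONITOR_OPTS" && export ACCUMULO_MONITOR_OPTS="${POLICY} -Xmx64m -Xms64m"
test -z "$ACCUMULO_OTHER_OPTS"   && export ACCUMULO_OTHER_OPTS="-Xmx128m -Xms64m"  # tracer, etc.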

Also, no memory parameters are passed to comet when launching it. By default, a Spring Boot app uses the JVM's default memory settings, which results in the comet process taking up to 1 GB of memory.
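
If comet's heap needs to be capped explicitly rather than left to JVM defaults, the launch command could pin it along these lines (the jar path is a placeholder, not the actual deployment path):

# Hypothetical launch with an explicit heap cap; adjust the path and limits as needed.
java -Xms256m -Xmx256m -jar /opt/comet/comet.jar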

With the above two configurations we are pretty much exhausting the EC2 instance's memory.

To verify this theory, I have launched comet only on the comet-head node VM. accumulomaster is now running standalone with memory options -Xmx256m -Xms256m.

I also have a script running on the accumulomaster node collecting memory usage. I will monitor the system to check whether the memory issue still occurs, and based on those observations will update the deployment and configuration scripts.

kthare10 changed the title from "COMET memory leak" to "COMET - accumulomaster out of memory issue" on Feb 20, 2019
kthare10 commented

Master was again killed with an OutOfMemoryError within a day. An additional observation: on each occurrence, the following trace appears in the Accumulo logs.

2019-02-21 00:32:24,586 [rpc.CustomNonBlockingServer$CustomFrameBuffer] ERROR: Read a frame size of 1246907721, which is bigger than the maximum allowable buffer size for ALL connections.
2019-02-21 00:32:30,001 [rpc.CustomNonBlockingServer$CustomFrameBuffer] ERROR: Read a frame size of 1195725856, which is bigger than the maximum allowable buffer size for ALL connections.
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 2916"...

kthare10 commented Feb 26, 2019

Per the Accumulo documentation, the recommended memory for running the master is 5 GB. Re-spawned the EC2 instance for the Accumulo master as an EC2 large instance.

The script stats.sh used to collect measurements is available at /root on 18.191.186.255 (master). The measurements collected are redirected to meas.txt.
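
For reference, a minimal sketch of what such a sampling script can look like (the actual stats.sh was not posted here, so this is illustrative only):

#!/bin/sh
# Illustrative memory sampler; not the actual stats.sh.
while true; do
    date >> /root/meas.txt
    free -m >> /root/meas.txt
    ps -eo pid,rss,comm --sort=-rss | head -5 >> /root/meas.txt
    sleep 60
done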

efu39 commented Mar 12, 2019

meas3.txt, collected by stats.sh, shows that when the Master had just started it had already consumed 223 MB; it went down again on 3/11, having consumed 289 MB an hour before the crash. Memory consumption was quite stable over the week (277-292 MB across the past 6 days). Based on this, I increased the heap size limit to 512 MB so we have a better margin, and will check how it goes.
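
Assuming the stock accumulo-env.sh variable names, the change amounts to something like:

# Sketch of the heap bump in conf/accumulo-env.sh (stock 1.9.x variable name assumed).
export ACCUMULO_MASTER_OPTS="${POLICY} -Xmx512m -Xms512m"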

efu39 commented Mar 22, 2019

The Master crashed twice again this week. With the recent crashes, the log always shows something like
"Read a frame size of 1195725856, which is bigger than the maximum allowable buffer size for ALL connections." and the crash happened right after this message. This occurred on both gc and master.
It might be that something other than an Accumulo process is erroneously connecting to the port and sending non-Thrift messages, based on:
https://issues.apache.org/jira/browse/ACCUMULO-2351
https://issues.apache.org/jira/browse/ACCUMULO-2360
http://apache-accumulo.1065345.n5.nabble.com/accumulo-master-is-down-td12877.html
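
Notably, the reported "frame sizes" decode to printable ASCII, which supports the non-Thrift-client theory; a quick check:

# The first 4 bytes of a non-Thrift message get interpreted as a Thrift frame length.
printf '%08x' 1195725856 | xxd -r -p   # -> "GET " (looks like an HTTP request)
printf '%08x' 1246907721 | xxd -r -p   # -> "JRMI" (Java RMI handshake magic)
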
I have set the tserver.server.message.size.max and general.server.message.size.max values to 50M (the default was 1G) to see if the issue goes away; a sketch of the change is below.
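
One way to apply the change is from the Accumulo shell (a sketch; the instance name in the prompt is a placeholder, the properties can also be set in accumulo-site.xml, and some may require a process restart to take effect):

# In the Accumulo shell:
root@comet> config -s general.server.message.size.max=50M
root@comet> config -s tserver.server.message.size.max=50M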

efu39 commented Mar 22, 2019

From tcpdump, there are indeed suspicious TCP SYN packets from worker-18.sfj.corp.censys.io:

reading from file ./tcpdump_6_master.txt, link-type EN10MB (Ethernet)
04:46:58.358091 IP (tos 0x0, ttl 44, id 37218, offset 0, flags [DF], proto TCP (6), length 60)
    worker-18.sfj.corp.censys.io.49008 > accumulomaster.8222: Flags [S], cksum 0x1a14 (correct), seq 2103285987, win 29200, options [mss 1460,sackOK,TS val 4145745134 ecr 0,nop,wscale 9], length 0
	0x0000:  4500 003c 9162 4000 2c06 fb59 c66c 4330  E..<.b@.,..Y.lC0
	0x0010:  ac1f 0c44 bf70 201e 7d5d 98e3 0000 0000  ...D.p..}]......
	0x0020:  a002 7210 1a14 0000 0204 05b4 0402 080a  ..r.............
	0x0030:  f71b 0cee 0000 0000 0103 0309            ............
04:47:00.907022 IP (tos 0x0, ttl 44, id 6409, offset 0, flags [DF], proto TCP (6), length 60)
    worker-18.sfj.corp.censys.io.55556 > accumulomaster.8222: Flags [S], cksum 0x753c (correct), seq 2224558839, win 29200, options [mss 1460,sackOK,TS val 4145747683 ecr 0,nop,wscale 9], length 0
	0x0000:  4500 003c 1909 4000 2c06 73b3 c66c 4330  E..<..@.,.s..lC0
	0x0010:  ac1f 0c44 d904 201e 8498 12f7 0000 0000  ...D............
	0x0020:  a002 7210 753c 0000 0204 05b4 0402 080a  ..r.u<..........
	0x0030:  f71b 16e3 0000 0000 0103 0309            ............
04:47:01.121572 IP (tos 0x0, ttl 44, id 38561, offset 0, flags [DF], proto TCP (6), length 60)
    worker-18.sfj.corp.censys.io.mercury-disc > accumulomaster.8222: Flags [S], cksum 0x9b5e (correct), seq 2307562133, win 29200, options [mss 1460,sackOK,TS val 4145747897 ecr 0,nop,wscale 9], length 0
	0x0000:  4500 003c 96a1 4000 2c06 f61a c66c 4330  E..<..@.,....lC0
	0x0010:  ac1f 0c44 257c 201e 898a 9a95 0000 0000  ...D%|..........
	0x0020:  a002 7210 9b5e 0000 0204 05b4 0402 080a  ..r..^..........
	0x0030:  f71b 17b9 0000 0000 0103 0309            ............
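
For reference, output of this shape comes from capturing on the master and then replaying the capture with hex dumps, roughly along these lines (the exact flags used were not recorded, so this is a sketch):

# Capture traffic to the master's port, then replay it with verbose hex output.
tcpdump -i eth0 -w tcpdump_6_master.txt 'tcp port 8222'
tcpdump -r tcpdump_6_master.txt -vv -X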

efu39 commented Mar 29, 2019

No crash has been observed for a week after setting the security group policy to accept inbound traffic to the accumulomaster and gc ports only from within the same security group.
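
The equivalent tightening via the AWS CLI looks roughly like this (the group ID and port are placeholders, not the actual deployment values):

# Hypothetical: drop the public rule, then allow only intra-group traffic.
aws ec2 revoke-security-group-ingress    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 8222 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 8222 --source-group sg-0123456789abcdef0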

efu39 removed the bug label on Apr 2, 2019

efu39 commented Apr 8, 2019

The issue is not a bug in COMET or Accumulo.
The CloudFormation script will be updated to block unexpected traffic from the public internet.

efu39 closed this as completed on Apr 8, 2019