
COMET - accumulomaster out of memory issue #14

Closed
kthare10 opened this issue Feb 20, 2019 · 7 comments

kthare10 commented Feb 20, 2019

In the COMET cluster running in AWS, the node running accumulomaster also hosts the COMET head node.
In the current deployment, the EC2 instance is of type small, which has 2 GB of RAM.

Issue:
The accumulomaster process is killed due to an OutOfMemoryError, roughly every 4-5 days.

The Accumulo process is killed by the JVM's -XX:OnOutOfMemoryError handler, as can be seen in the logs: /opt/accumulo-1.9.1/logs/gc_accumulomaster.out

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 3018"...

Investigation so far:
The current deployment configures Accumulo to use about 1 GB of memory, distributed across its processes (see the accumulo-env.sh sketch after this list):

  • accumulomaster configured with -Xmx128m -Xms128m
  • gc configured with -Xmx64m -Xms64m
  • monitor configured with -Xmx64m -Xms64m
  • tracer configured with -Xmx128m -Xms64m
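
In the Accumulo 1.9.x example configs these limits live in conf/accumulo-env.sh; a sketch of the relevant lines, assuming the stock layout (the tracer falling under ACCUMULO_OTHER_OPTS is an assumption, and our deployed file may differ):

# Sketch of conf/accumulo-env.sh heap settings (stock 1.9.x example layout assumed).
test -z "$ACCUMULO_MASTER_OPTS"  && export ACCUMULO_MASTER_OPTS="${POLICY} -Xmx128m -Xms128m"
test -z "$ACCUMULO_GC_OPTS"      && export ACCUMULO_GC_OPTS="-Xmx64m -Xms64m"
test -z "$ACCUMULO_MONITOR_OPTS" && export ACCUMULO_MONITOR_OPTS="${POLICY} -Xmx64m -Xms64m"
test -z "$ACCUMULO_OTHER_OPTS"   && export ACCUMULO_OTHER_OPTS="-Xmx128m -Xms64m"  # tracer, etc.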

Also, no memory parameters are passed to comet when launching it. By default, a Spring Boot app uses the JVM's default memory settings, which results in the comet process taking up to 1 GB of memory.
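
If comet's heap needs to be capped explicitly rather than left to JVM defaults, the launch command could pin it along these lines (the jar path is a placeholder, not the actual deployment path):

# Hypothetical launch with an explicit heap cap; adjust the path and limits as needed.
java -Xms256m -Xmx256m -jar /opt/comet/comet.jar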

With the above two configurations we are pretty much exhausting the EC2 instance's memory.

To verify this theory, I have launched comet only on the comet-head node VM. accumulomaster is now running standalone with memory options -Xmx256m -Xms256m.

I also have a script running on the accumulomaster node collecting memory usage. I will monitor the system to check whether the memory issue still occurs, and based on those observations will update the deployment and configuration scripts.

kthare10 changed the title from "COMET memory leak" to "COMET - accumulomaster out of memory issue" on Feb 20, 2019
kthare10 commented

Master was again killed with an OutOfMemoryError within a day. An additional observation: on each occurrence, the following trace appears in the Accumulo logs.

2019-02-21 00:32:24,586 [rpc.CustomNonBlockingServer$CustomFrameBuffer] ERROR: Read a frame size of 1246907721, which is bigger than the maximum allowable buffer size for ALL connections.
2019-02-21 00:32:30,001 [rpc.CustomNonBlockingServer$CustomFrameBuffer] ERROR: Read a frame size of 1195725856, which is bigger than the maximum allowable buffer size for ALL connections.
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 2916"...

kthare10 commented Feb 26, 2019

Per the Accumulo documentation, the recommended memory for running the master is 5 GB. Re-spawned the EC2 instance for the Accumulo master as an EC2 large instance.

The script stats.sh used to collect measurements is available at /root on 18.191.186.255 (master). The measurements collected are redirected to meas.txt.
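
For reference, a minimal sketch of what such a sampling script can look like (the actual stats.sh was not posted here, so this is illustrative only):

#!/bin/sh
# Illustrative memory sampler; not the actual stats.sh.
while true; do
    date >> /root/meas.txt
    free -m >> /root/meas.txt
    ps -eo pid,rss,comm --sort=-rss | head -5 >> /root/meas.txt
    sleep 60
done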

efu39 commented Mar 12, 2019

meas3.txt, collected by stats.sh, shows that when the Master had just started it had already consumed 223 MB; it went down again on 3/11, having consumed 289 MB an hour before the crash. Memory consumption was quite stable over the week (277-292 MB across the past 6 days). Based on this, I increased the heap size limit to 512 MB so we have a better margin, and will check how it goes.
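
Assuming the stock accumulo-env.sh variable names, the change amounts to something like:

# Sketch of the heap bump in conf/accumulo-env.sh (stock 1.9.x variable name assumed).
export ACCUMULO_MASTER_OPTS="${POLICY} -Xmx512m -Xms512m"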

efu39 commented Mar 22, 2019

The Master crashed twice again this week. With the recent crashes, the log always shows something like
"Read a frame size of 1195725856, which is bigger than the maximum allowable buffer size for ALL connections." and the crash happened right after this message. This occurred on both gc and master.
It might be that something other than an Accumulo process is erroneously connecting to the port and sending non-Thrift messages, based on:
https://issues.apache.org/jira/browse/ACCUMULO-2351
https://issues.apache.org/jira/browse/ACCUMULO-2360
http://apache-accumulo.1065345.n5.nabble.com/accumulo-master-is-down-td12877.html
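
Notably, the reported "frame sizes" decode to printable ASCII, which supports the non-Thrift-client theory; a quick check:

# The first 4 bytes of a non-Thrift message get interpreted as a Thrift frame length.
printf '%08x' 1195725856 | xxd -r -p   # -> "GET " (looks like an HTTP request)
printf '%08x' 1246907721 | xxd -r -p   # -> "JRMI" (Java RMI handshake magic)
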
I have set the tserver.server.message.size.max and general.server.message.size.max values to 50M (the default was 1G) to see if the issue goes away; a sketch of the change is below.
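
One way to apply the change is from the Accumulo shell (a sketch; the instance name in the prompt is a placeholder, the properties can also be set in accumulo-site.xml, and some may require a process restart to take effect):

# In the Accumulo shell:
root@comet> config -s general.server.message.size.max=50M
root@comet> config -s tserver.server.message.size.max=50M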

efu39 commented Mar 22, 2019

From tcpdump, there are indeed suspicious TCP SYN packets from worker-18.sfj.corp.censys.io:

reading from file ./tcpdump_6_master.txt, link-type EN10MB (Ethernet)
04:46:58.358091 IP (tos 0x0, ttl 44, id 37218, offset 0, flags [DF], proto TCP (6), length 60)
    worker-18.sfj.corp.censys.io.49008 > accumulomaster.8222: Flags [S], cksum 0x1a14 (correct), seq 2103285987, win 29200, options [mss 1460,sackOK,TS val 4145745134 ecr 0,nop,wscale 9], length 0
	0x0000:  4500 003c 9162 4000 2c06 fb59 c66c 4330  E..<.b@.,..Y.lC0
	0x0010:  ac1f 0c44 bf70 201e 7d5d 98e3 0000 0000  ...D.p..}]......
	0x0020:  a002 7210 1a14 0000 0204 05b4 0402 080a  ..r.............
	0x0030:  f71b 0cee 0000 0000 0103 0309            ............
04:47:00.907022 IP (tos 0x0, ttl 44, id 6409, offset 0, flags [DF], proto TCP (6), length 60)
    worker-18.sfj.corp.censys.io.55556 > accumulomaster.8222: Flags [S], cksum 0x753c (correct), seq 2224558839, win 29200, options [mss 1460,sackOK,TS val 4145747683 ecr 0,nop,wscale 9], length 0
	0x0000:  4500 003c 1909 4000 2c06 73b3 c66c 4330  E..<..@.,.s..lC0
	0x0010:  ac1f 0c44 d904 201e 8498 12f7 0000 0000  ...D............
	0x0020:  a002 7210 753c 0000 0204 05b4 0402 080a  ..r.u<..........
	0x0030:  f71b 16e3 0000 0000 0103 0309            ............
04:47:01.121572 IP (tos 0x0, ttl 44, id 38561, offset 0, flags [DF], proto TCP (6), length 60)
    worker-18.sfj.corp.censys.io.mercury-disc > accumulomaster.8222: Flags [S], cksum 0x9b5e (correct), seq 2307562133, win 29200, options [mss 1460,sackOK,TS val 4145747897 ecr 0,nop,wscale 9], length 0
	0x0000:  4500 003c 96a1 4000 2c06 f61a c66c 4330  E..<..@.,....lC0
	0x0010:  ac1f 0c44 257c 201e 898a 9a95 0000 0000  ...D%|..........
	0x0020:  a002 7210 9b5e 0000 0204 05b4 0402 080a  ..r..^..........
	0x0030:  f71b 17b9 0000 0000 0103 0309            ............
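
For reference, output of this shape comes from capturing on the master and then replaying the capture with hex dumps, roughly along these lines (the exact flags used were not recorded, so this is a sketch):

# Capture traffic to the master's port, then replay it with verbose hex output.
tcpdump -i eth0 -w tcpdump_6_master.txt 'tcp port 8222'
tcpdump -r tcpdump_6_master.txt -vv -X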

efu39 commented Mar 29, 2019

No crash has been observed for a week after setting the security group policy to accept inbound traffic to the accumulomaster and gc ports only from within the same security group.
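
The equivalent tightening via the AWS CLI looks roughly like this (the group ID and port are placeholders, not the actual deployment values):

# Hypothetical: drop the public rule, then allow only intra-group traffic.
aws ec2 revoke-security-group-ingress    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 8222 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 8222 --source-group sg-0123456789abcdef0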

efu39 removed the bug label on Apr 2, 2019

efu39 commented Apr 8, 2019

The issue is not a bug in COMET or Accumulo.
The CloudFormation script will be updated to block unexpected traffic from the public internet.

efu39 closed this as completed on Apr 8, 2019