tsd fsck warning message #895

lordang · 2016-11-23T02:01:45Z

It seems I have a name/UID mapping error in the tsdb-uid table.
Our cluster has a very large number of tagv values because we use the client IP as a tagv.
When I ran the uid fsck command, the Java process used all of the RAM (we have 128 GB),
ran GC continuously, and consumed all of the CPU.
Then I got the following warning message.

2016-11-23 10:19:16,882 WARN [New I/O worker #6] Scanner: RegionInfo(table="tsdb-uid", region_name="tsdb-uid,\x1Bx\xB4\xB7,1475154085093.65338521f3a7a06523eec77f11e2ca23.", stop_key="168220431") pretends to not know Scanner(table="tsdb-uid", start_key="!q\x10\xB2", stop_key="", columns=org.hbase.async.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 1484925, already closed?
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:1966)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30438)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2016)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:110)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:90)
at java.lang.Thread.run(Thread.java:745)

Caused by RPC: GetNextRowsRequest(scanner_id=0x000000000006A87D, max_num_rows=1024, region=null, attempt=0), populate_blockcache=true, max_num_rows=1024, max_num_kvs=4096, region=null, filter=null, scanner_id=0x000000000006A87D). I will retry to open a scanner but this is typically because you've been holding the scanner open and idle for too long (possibly due to a long GC pause on your side or in the RegionServer)
2016-11-23 10:19:16,887 ERROR [main] UidManager: Duplicate reverse tagv mapping: 284081517 -> 284081517 and 284081517 -> 217110B2. kv=KeyValue(key="!q\x10\xB2", family="name", qualifier="tagv", value="284081517", timestamp=1460540461278)
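
For reference, the row key in the ERROR line is printed in an escaped form (printable ASCII as-is, other bytes as `\xNN`), while the same message prints the UID as hex. The sketch below is only an illustration of how the two line up; `UidKeyDecoder`, `unescape` and `toHex` are hypothetical helper names, not part of OpenTSDB or asynchbase.

```java
import java.io.ByteArrayOutputStream;

class UidKeyDecoder {
    /** Converts a log-escaped key (printable ASCII plus \xNN escapes) into raw bytes. */
    static byte[] unescape(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            boolean isHexEscape = c == '\\'
                    && i + 3 < escaped.length()
                    && escaped.charAt(i + 1) == 'x';
            if (isHexEscape) {
                // Two hex digits follow the "\x" prefix.
                out.write(Integer.parseInt(escaped.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                // Any other character is taken as a single ASCII byte.
                out.write((byte) c);
                i++;
            }
        }
        return out.toByteArray();
    }

    /** Renders bytes as upper-case hex, the way the UID appears in the fsck message. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Row key copied from the ERROR line above.
        System.out.println(toHex(unescape("!q\\x10\\xB2"))); // prints 217110B2
    }
}
```

So the key `!q\x10\xB2` is the 4-byte UID `217110B2` named in the message, and its stored name is `284081517`.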

Can I ignore this message, keep fsck running, and wait for it to finish?
Or must I add more RAM and try again?

manolama · 2016-11-27T20:03:55Z

Hello @lordang. The scanner exception you're seeing is normal for a JVM undergoing massive GC, as the underlying connection to HBase will be killed after a timeout period.

But fsck shouldn't eat up 128G of RAM, so it sounds like there's a bug in there. If you could restart it and take a heap dump of the JVM at around 4G or so, I'd love to see it. Then we can fix it up. Thanks!
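
One way to catch the heap at roughly that size, as a rough sketch only: the code below uses the JDK's `HotSpotDiagnosticMXBean` to write an `.hprof` file once used heap crosses a threshold. It assumes a HotSpot JVM and that it can be called from inside the fsck process (for example from a temporary debugging thread); `HeapDumpAtThreshold`, `dumpIfAbove` and the file name are made-up names for illustration. For an unmodified process, running `jmap -dump:format=b,file=<file> <pid>` from outside produces the same kind of file.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

class HeapDumpAtThreshold {
    /** Heap currently in use by this JVM, in bytes. */
    static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /**
     * Writes a heap dump to `path` if used heap is at or above `thresholdBytes`.
     * Returns true if a dump was written. The target file must not already exist.
     */
    static boolean dumpIfAbove(long thresholdBytes, String path) throws IOException {
        if (usedHeapBytes() < thresholdBytes) {
            return false;
        }
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live=true dumps only objects that are still reachable.
        diagnostics.dumpHeap(path, true);
        return true;
    }

    public static void main(String[] args) throws Exception {
        long fourGiB = 4L * 1024 * 1024 * 1024;
        // Poll once a second until the threshold is crossed, then dump once.
        while (!dumpIfAbove(fourGiB, "fsck_dump.hprof")) {
            Thread.sleep(1000);
        }
    }
}
```

Dumping with `live=true` forces a full GC first, which keeps the file smaller but means the snapshot shows retained objects only.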

lordang · 2016-12-01T02:44:42Z

I took a heap dump, but it's too big to attach on GitHub; it's about 4 GB. How can I share it?

manolama · 2016-12-13T06:24:28Z

If you can put it on Dropbox or Google Drive, that would be great.

lordang · 2016-12-26T06:47:23Z

Here's my heap dump.
https://www.dropbox.com/s/halddyh80kyuxb0/fsck_dump.hprof?dl=0

manolama added the bug label Nov 27, 2016
