
OpenTSDB v2.0.0: java.lang.StackOverflowError: null #334

Open · dennismphil opened this issue May 12, 2014 · 29 comments


dennismphil commented May 12, 2014

Using OpenTSDB version [2.0.0] (Const.java modified with MAX_NUM_TAGS = 16, if that matters)

From OpenTSDB logs:

10:52:52.111 INFO  [TsdbQuery.call] - TsdbQuery(start_time=1084632741867, end_time=1398916800000, metric=[0, 0, 1] (usage), tags={}, rate=false, aggregator=sum, group_bys=(app [0, 0, 2], user_id [0, 0, 6], )) matched 2120235 rows in 494634 spans in 9464ms
10:52:55.847 ERROR [RegionClient.exceptionCaught] - Unexpected exception from downstream on [id: 0x7f739d1e, /127.0.0.1:37253 => /127.0.0.1:60020]
java.lang.StackOverflowError: null
    at com.stumbleupon.async.Deferred.doCall(Deferred.java:1278) ~[suasync-1.4.0.jar:fe17b98]
...
...

OpenTSDB eventually becomes unresponsive until a `service opentsdb restart`.

dennismphil (Author) commented May 14, 2014

Ended up increasing the stack size via JVMARGS in /usr/share/opentsdb/bin/tsdb.
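For reference, this workaround amounts to adding an -Xss option to the JVMARGS line in the startup script. A hypothetical excerpt (the exact line varies by package and version):

```shell
# Hypothetical excerpt from /usr/share/opentsdb/bin/tsdb; the actual
# contents vary by package. Adding -Xss raises the per-thread stack
# size (16 MB here, versus a JVM default of well under 1 MB).
JVMARGS="-Xss16m -enableassertions -enablesystemassertions"
```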

manolama (Member) commented Jun 3, 2014

Likely need to re-work the query code so that it isn't overflowing the stack. Users should never have to change the stack size.

manolama added the bug label Jun 3, 2014

also (Contributor) commented Jul 23, 2014

What part of the query code causes the stack overflow? I'm also getting this exception, and would like to work on a patch.

manolama added a commit to manolama/opentsdb that referenced this issue Dec 8, 2014

Modify the TsdbQuery scanner to stop at 1024 calls to .nextRows() so that we can continue in a different thread. It has a thread pool for now but I'd like to eventually share that with Netty. This avoids the stack overflow problem in OpenTSDB#334 where the callback chain simply grew too long.
tsuna (Member) commented Feb 13, 2015

I happened to discuss this issue with @manolama today, I'm not sure I agree with the fix and the analysis being done here. Can you post a longer section of the stack trace, to understand why there was a StackOverflowError?

scicco commented Feb 13, 2015

Hello, I'm experiencing the same issue. The version is:

OpenTSDB version [2.0.0] built from revision 14fd1b1 in a MINT state
Built on Wed Mar 05 18:55:53 GMT+100 2014 by ..

Here is my stack trace:

2015-02-13 14:30:32,706 ERROR [New I/O worker #32] RegionClient: Unexpected exception from downstream on [id: 0x282632ab, /192.168.0.113:37836 => /192.168.0.113:60020]
java.lang.StackOverflowError: null
    at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257) ~[suasync-1.4.0.jar:fe17b98]
    at com.stumbleupon.async.Deferred.access$300(Deferred.java:430) ~[suasync-1.4.0.jar:fe17b98]
    at com.stumbleupon.async.Deferred$Continue.call(Deferred.java:1366) ~[suasync-1.4.0.jar:fe17b98]
    at com.stumbleupon.async.Deferred.doCall(Deferred.java:1278) ~[suasync-1.4.0.jar:fe17b98]
    at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257) ~[suasync-1.4.0.jar:fe17b98]
    at com.stumbleupon.async.Deferred.access$300(Deferred.java:430) ~[suasync-1.4.0.jar:fe17b98]
    at com.stumbleupon.async.Deferred$Continue.call(Deferred.java:1366) ~[suasync-1.4.0.jar:fe17b98]
    at com.stumbleupon.async.Deferred.doCall(Deferred.java:1278) ~[suasync-1.4.0.jar:fe17b98]
    [... the same four Deferred frames repeat for the remainder of the trace]

and so on

tsuna (Member) commented Feb 13, 2015

Thanks, this helps. So we see that we have a long chain of Deferreds strung together and they all seem to be firing immediately. This means it's very likely the bug is in AsyncHBase, and the right fix is probably there, not in OpenTSDB.

Let me dig some more in the code...
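The failure mode described here, a long chain of Deferreds all firing synchronously, can be reproduced in miniature. This is only a sketch, not the real suasync API: each callback completion invokes the next via a direct recursive call instead of a loop or trampoline, so every link in the chain holds a stack frame and a long enough chain throws StackOverflowError.

```java
// Minimal sketch of the failure mode (NOT the real suasync API): if each
// Deferred delivers its result to the next callback with a direct
// recursive call, a long enough chain exhausts the thread stack.
public class CallbackChainDemo {

    // Simulates firing a chain of `remaining` callbacks via direct recursion.
    static void fire(int remaining) {
        if (remaining > 0) {
            fire(remaining - 1); // one stack frame per callback in the chain
        }
    }

    // Returns true if firing a chain of the given length overflowed the stack.
    static boolean overflows(int chainLength) {
        try {
            fire(chainLength);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("chain of 100:         overflows = " + overflows(100));
        System.out.println("chain of 100 million: overflows = " + overflows(100_000_000));
    }
}
```

Raising -Xss only moves the threshold; a trampoline (running continuations in a loop, or handing off to another thread as manolama's patch does) removes the recursion entirely.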

scicco commented Feb 24, 2015

I've fixed it by increasing JVMARGS as @dennismphil did.

alienth (Contributor) commented Apr 22, 2015

Curiously, I upped my stack size and that didn't address the issue. I'm still getting a StackOverflowError, with exactly 1024 Deferred entries in the chain.

alienth (Contributor) commented May 15, 2015

Additional curiosity: bumping -Xss didn't help me, but bumping VMThreadStackSize to 4096 did result in an unexpected change. It still threw a StackOverflowError, but there are now 2048 Deferred entries in the error output instead of 1024.

alberts commented May 21, 2015

Also running into this in production with 2.1.0 final. Any new info/workarounds?

alienth (Contributor) commented May 21, 2015

I chatted with @manolama about this in IRC. My issue is that I'm making a query which is fetching a tonne of rows and it will always hit the stackoverflow. For my case, he recommended I try out this patch: manolama@c591503

Once I get that built and give it a try I'll provide an update.

alberts commented May 21, 2015

I suspect this cluster might have an HBase issue, but it's hard to see from the OpenTSDB log itself.

Running an fsck now, and we get a couple of the following per second.

The HBase cluster is busy, but not that busy.

2015-05-21 22:54:43,061 WARN  [New I/O worker #5] Scanner: RegionInfo(table="tsdb", region_name="tsdb,\x01\x86\xFDT\xEDd\xE0\x00\x00\x01\x00\x00\x01\x00\x00\x02\x00\x05\x9E\x00\x00\x03\x00\x00\xE4\x00\x00\x05\x00\x01w\x00\x00\x1B\x00\x01o,1426021282127.d50ad79cc9dc959bf4b329f62b88b19d.", stop_key=[1, -114, 46, 84, -6, 63, -128, 0, 0, 1, 0, 0, 1, 0, 0, 2, 0, 7, -45, 0, 0, 3, 0, 4, 99, 0, 0, 24, 0, 0, -79]) pretends to not know Scanner(table="tsdb", start_key=[1, -117, 125, 84, -13, 111, -64, 0, 0, 23, 0, 0, 100, 0, 0, 100, 0, -43, 13], stop_key="\x01\x94z", columns={"t"}, populate_blockcache=true, max_num_rows=128, max_num_kvs=4096, region=null, filter=null, scanner_id=0x000000000000016E).  I will retry to open a scanner but this is typically because you've been holding the scanner open and idle for too long (possibly due to a long GC pause on your side or in the RegionServer)
org.hbase.async.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 33134, already closed?
        at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3166)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30808)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2029)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
        at java.lang.Thread.run(Thread.java:745)

Caused by RPC: GetNextRowsRequest(scanner_id=0x000000000000016E, max_num_rows=128, region=null, attempt=0)
        at org.hbase.async.UnknownScannerException.make(UnknownScannerException.java:60) ~[asynchbase-1.6.0.jar:na]
        at org.hbase.async.UnknownScannerException.make(UnknownScannerException.java:32) ~[asynchbase-1.6.0.jar:na]
        at org.hbase.async.RegionClient.makeException(RegionClient.java:1448) [asynchbase-1.6.0.jar:na]
        at org.hbase.async.RegionClient.decodeException(RegionClient.java:1468) [asynchbase-1.6.0.jar:na]
        at org.hbase.async.RegionClient.decode(RegionClient.java:1299) [asynchbase-1.6.0.jar:na]
        at org.hbase.async.RegionClient.decode(RegionClient.java:89) [asynchbase-1.6.0.jar:na]
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) [netty-3.9.4.Final.jar:na]
        at org.hbase.async.RegionClient.handleUpstream(RegionClient.java:1082) [asynchbase-1.6.0.jar:na]
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) [netty-3.9.4.Final.jar:na]
        at org.hbase.async.HBaseClient$RegionClientPipeline.sendUpstream(HBaseClient.java:2677) [asynchbase-1.6.0.jar:na]
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.9.4.Final.jar:na]
        at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.9.4.Final.jar:na]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
alberts commented May 23, 2015

Still investigating, but it seems that our issues were caused by some queries of certain metrics (potentially over invalid or future time ranges) which were failing, causing valid queries (maybe in the same RPC batch?) to hang forever too.

manolama added a commit to manolama/opentsdb that referenced this issue May 28, 2015

Modify the TsdbQuery scanner to stop at 1024 calls to .nextRows() so that we can continue in a different thread. It has a thread pool for now but I'd like to eventually share that with Netty. This avoids the stack overflow problem in OpenTSDB#334 where the callback chain simply grew too long.

liorsav commented Jun 2, 2015

Hi,
We are working with v2.1.0 and getting the same StackOverflowError.
Could you please advise on the recommended solution or workaround? Is there a patch that can be installed?

Many Thanks,
Lior

liorsav commented Jun 8, 2015

Hi,
After applying manolama's patch (manolama/opentsdb@c591503), the freezes stopped; however, we started getting an OOM error.

Exception in thread "pool-7-thread-1" java.lang.OutOfMemoryError: Java heap space

Using jhat, we noticed a very large number of Callback objects.

Please advise,
Thanks,
Lior

GiuVi commented Jun 12, 2015

Hi all,
we are working with a 4-node Hadoop cluster and a metric with 5 tags: domain, node (cardinality = 33,000), group, counter (cardinality = 10), subCounter (cardinality = 3).
10,000 data points per second are written, each about 100 bytes in size.
That is a production of 33,000 × 10 × 3 ≈ 1,000,000 data points per minute.

From OpenTSDB log:
2015-06-11 17:18:34,215 ERROR [New I/O worker #34] RegionClient: Unexpected exception from downstream on [id: 0xd72ae84b, /163.162.107.244:52916 => /163.162.107.240:60020]
java.lang.StackOverflowError: null
at com.stumbleupon.async.Deferred.access$100(Deferred.java:430) ~[async-1.4.0.jar:na]
at com.stumbleupon.async.Deferred$Continue.call(Deferred.java:1358) ~[async-1.4.0.jar:na]

Can someone help us?
Thanks,
Giulia

fbobobo commented Aug 14, 2015

I'm also having this issue, with the same logs as those given above. Running OpenTSDB 2.1 on CentOS 6 with an HDP 2.0 cluster. The HBase cluster looks completely fine.

fbobobo commented Sep 15, 2015

Hello, is there any news on this problem?

xicabin commented Sep 15, 2015

I finally changed the JVM args to work around this problem:

JVMARGS ... -Xss16m ...

fbobobo commented Sep 15, 2015

Already done for me (-Xss100m), and the problem is still present.
We will do a regular daily rolling restart; it's somewhat ugly, but we have no other choice for the moment.

manolama (Member) commented Apr 7, 2016

Found one case of this happening with 2.0 where too many time series in the output (i.e. 16k or more) will throw this StackOverflowError: https://github.com/OpenTSDB/opentsdb/blob/master/src/tsd/HttpJsonSerializer.java#L842-L850

ntirupattur commented Jul 6, 2016

We are hitting this issue on 2.2. Could you please give an update on when this will be fixed? Thanks.

todd-richmond commented Jul 7, 2016

We are using collectd with a "tag" patch that allows adding tags to any metric. Our stack failure occurs in a case where the value set for one tag is huge (30k values) and a query requests a day or two of data.

opsun (Contributor) commented Nov 20, 2017

Same issue on v2.3.0.

2017-11-20 15:21:07,485 ERROR [AsyncHBase I/O Worker #9] RegionClient: Unexpected exception from downstream on [id: 0x8d650fd6, /10.108.2.18:46612 => /10.105.39.139:60020]
java.lang.StackOverflowError: null
        at net.opentsdb.core.Span.seekRow(Span.java:365) ~[tsdb-2.3.0.jar:]
        at net.opentsdb.core.Span.access$100(Span.java:36) ~[tsdb-2.3.0.jar:]
        at net.opentsdb.core.Span$Iterator.seek(Span.java:444) ~[tsdb-2.3.0.jar:]
        at net.opentsdb.core.AggregationIterator.<init>(AggregationIterator.java:363) ~[tsdb-2.3.0.jar:]
        at net.opentsdb.core.AggregationIterator.create(AggregationIterator.java:325) ~[tsdb-2.3.0.jar:]
        at net.opentsdb.core.SpanGroup.iterator(SpanGroup.java:487) ~[tsdb-2.3.0.jar:]
        at net.opentsdb.core.SpanGroup.iterator(SpanGroup.java:54) ~[tsdb-2.3.0.jar:]
        at net.opentsdb.tsd.HttpJsonSerializer$1DPsResolver$WriteToBuffer.call(HttpJsonSerializer.java:787) ~[tsdb-2.3.0.jar:]
        at net.opentsdb.tsd.HttpJsonSerializer$1DPsResolver$WriteToBuffer.call(HttpJsonSerializer.java:671) ~[tsdb-2.3.0.jar:]
        at com.stumbleupon.async.Deferred.doCall(Deferred.java:1278) ~[async-1.4.0.jar:na]
        at com.stumbleupon.async.Deferred.addCallbacks(Deferred.java:688) ~[async-1.4.0.jar:na]
        at com.stumbleupon.async.Deferred.addCallback(Deferred.java:724) ~[async-1.4.0.jar:na]
        at net.opentsdb.tsd.HttpJsonSerializer$1DPsResolver.call(HttpJsonSerializer.java:860) ~[tsdb-2.3.0.jar:]
        at net.opentsdb.tsd.HttpJsonSerializer$1DPsResolver.call(HttpJsonSerializer.java:625) ~[tsdb-2.3.0.jar:]
        at com.stumbleupon.async.Deferred.doCall(Deferred.java:1278) ~[async-1.4.0.jar:na]
        at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257) ~[async-1.4.0.jar:na]
        at com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313) ~[async-1.4.0.jar:na]
        at com.stumbleupon.async.Deferred.doCall(Deferred.java:1284) ~[async-1.4.0.jar:na]
        at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257) ~[async-1.4.0.jar:na]
        at com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313) ~[async-1.4.0.jar:na]
        at com.stumbleupon.async.Deferred.doCall(Deferred.java:1284) ~[async-1.4.0.jar:na]
        at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257) ~[async-1.4.0.jar:na]
        at com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313) ~[async-1.4.0.jar:na]
        at com.stumbleupon.async.Deferred.doCall(Deferred.java:1284) ~[async-1.4.0.jar:na]
yangzj commented Mar 2, 2018

@opsun — same issue here.

chroth7 commented Jan 29, 2019

FWIW, I can confirm @todd-richmond's finding above (very high tagv cardinality, in my case almost 1 million): querying hours worked, but not days.

I restructured the data using "Shift to Metric" as discussed here: http://opentsdb.net/docs/build/html/user_guide/writing/index.html#time-series-cardinality

(losing the aggregation in this dimension is no issue for me)

... and it works lightning-fast again.

HTH
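The "shift to metric" restructuring mentioned above can be sketched as follows. This is a hedged illustration; the helper name and the dot-joining rule are assumptions, not the exact procedure from the OpenTSDB docs.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the "shift to metric" restructuring: a high-cardinality
// tag value is folded into the metric name, so each series gets its own
// metric (and row-key prefix) instead of sharing one metric with huge
// tag-value fan-out. Names here are illustrative.
public class ShiftToMetric {

    // Removes the given tag and appends its value to the metric name.
    static String shiftToMetric(String metric, String tagKey, Map<String, String> tags) {
        String value = tags.remove(tagKey); // drop the high-cardinality tag
        return value == null ? metric : metric + "." + value;
    }

    public static void main(String[] args) {
        Map<String, String> tags = new HashMap<>();
        tags.put("user_id", "42");
        tags.put("app", "web");
        // before: usage{user_id=42, app=web}; after: usage.42{app=web}
        System.out.println(shiftToMetric("usage", "user_id", tags) + " " + tags);
    }
}
```

With the high-cardinality tag folded into the metric name, a query for one value no longer scans every row under the shared metric; the trade-off, as noted above, is losing aggregation across that dimension.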

manolama (Member) commented Feb 1, 2019

Yeah this is better in 3.x but still has an issue with the UID resolution if all of the UIDs are in cache. I'm re-working the pipeline one more time in 3.x and that'll clean it up.

asdf2014 commented Apr 12, 2019

Hi @manolama, I tried the latest version of 3.x and it still failed.

asdf2014 commented Apr 12, 2019

@manolama After adding the -Xss16m option, it works! However, the same JVM arguments are used on OpenTSDB 2.x and it still fails… 😅
