We are running OpenTSDB 2.4.0 (we had also seen this with 2.2.0) with one OpenTSDB table on a MapR 5.2.2 cluster, with MapR-DB as the underlying datastore. We have been experiencing significant slowness with writes to our OpenTSDB table, and we have determined that the slowness lies not in our MapR HBase puts but in the OpenTSDB process itself.
Typically we see write throughput of ~100,000 records/sec, but over the past 3 weeks it has fallen to only about 1,000/sec, a huge performance degradation.
When running stack traces against our OpenTSDB processes, we see many threads stuck WAITING at this point:
"OpenTSDB I/O Worker #8" #17 prio=5 os_prio=0 tid=0x00007ffff042f800 nid=0x509b in Object.wait() [0x00007fffd9b0a000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1138)
- locked <0x00000006535b77e0> (a com.stumbleupon.async.Deferred$Signal)
at com.stumbleupon.async.Deferred.joinUninterruptibly(Deferred.java:1064)
at net.opentsdb.uid.UniqueId.suggest(UniqueId.java:769)
at net.opentsdb.core.TSDB.suggestMetrics(TSDB.java:999)
at net.opentsdb.tsd.SuggestRpc.execute(SuggestRpc.java:80)
at net.opentsdb.tsd.RpcHandler.handleHttpQuery(RpcHandler.java:283)
at net.opentsdb.tsd.RpcHandler.messageReceived(RpcHandler.java:134)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.handler.timeout.IdleStateAwareChannelUpstreamHandler.handleUpstream(IdleStateAwareChannelUpstreamHandler.java:36)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.handler.timeout.IdleStateHandler.messageReceived(IdleStateHandler.java:294)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.handler.codec.http.HttpContentEncoder.messageReceived(HttpContentEncoder.java:82)
at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.SimpleChannelHandler.messageReceived(SimpleChannelHandler.java:142)
at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
at net.opentsdb.tsd.ConnectionManager.handleUpstream(ConnectionManager.java:87)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This thread waits on 'net.opentsdb.uid.UniqueId.suggest(UniqueId.java:769)' for a long time; stack traces taken at later intervals show the thread still waiting on this call to return.
We have checked our OpenTSDB stats endpoint and find that we are nowhere near our UID limits:
tsd.uid.cache-hit 1600184822 23523956 kind=metrics host=client001
tsd.uid.cache-miss 1600184822 711344 kind=metrics host=client001
tsd.uid.cache-size 1600184822 1413859 kind=metrics host=client001
tsd.uid.random-collisions 1600184822 13 kind=metrics host=client001
tsd.uid.ids-used 1600184822 0 kind=metrics host=client001
tsd.uid.ids-available 1600184822 0 kind=metrics host=client001
tsd.uid.cache-hit 1600184822 48115583 kind=tagk host=client001
tsd.uid.cache-miss 1600184822 62 kind=tagk host=client001
tsd.uid.cache-size 1600184822 102 kind=tagk host=client001
tsd.uid.random-collisions 1600184822 0 kind=tagk host=client001
tsd.uid.ids-used 1600184822 1377 kind=tagk host=client001
tsd.uid.ids-available 1600184822 4294965918 kind=tagk host=client001
tsd.uid.cache-hit 1600184822 48107077 kind=tagv host=client001
tsd.uid.cache-miss 1600184822 7369 kind=tagv host=client001
tsd.uid.cache-size 1600184822 14710 kind=tagv host=client001
tsd.uid.random-collisions 1600184822 0 kind=tagv host=client001
tsd.uid.ids-used 1600184822 41099 kind=tagv host=client001
tsd.uid.ids-available 1600184822 4294926196 kind=tagv host=client001
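For anyone checking the same counters: the stats endpoint returns one stat per line in the form `metric timestamp value tag=value ...`. A minimal Python sketch for pulling the UID counters out of that text (the `parse_stats` helper is hypothetical, not part of OpenTSDB; values below are taken from the output above):

```python
def parse_stats(text):
    """Parse OpenTSDB plain-text stats output.

    Each line looks like: 'metric timestamp value tag1=v1 tag2=v2 ...'
    """
    stats = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip malformed lines
        metric, ts, value = parts[0], int(parts[1]), int(parts[2])
        tags = dict(p.split("=", 1) for p in parts[3:])
        stats.append({"metric": metric, "timestamp": ts,
                      "value": value, "tags": tags})
    return stats

sample = """\
tsd.uid.cache-hit 1600184822 23523956 kind=metrics host=client001
tsd.uid.ids-available 1600184822 4294965918 kind=tagk host=client001
"""
for s in parse_stats(sample):
    print(s["metric"], s["tags"]["kind"], s["value"])
```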
We have already deduced that HBase writes are not the issue: there is very little delay between HBase receiving put requests and writing to MapR-DB, a strong indicator that the delay occurs within the OpenTSDB process.
Does anyone have any ideas about what might be causing the suggest unique-ID calls to take so long? Or any other ideas on where to look for the decreased write rates?
Thank you!
Hmm, suggest calls only come from reads, and I always recommend that folks separate their writers from their readers. That would help the write rate.
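One way to do that split is to run separate TSD instances for ingest and queries. A config sketch, assuming the `tsd.mode` option (`rw`/`ro`) available in 2.4 (ports and filenames here are illustrative):

```
# opentsdb-write.conf -- ingest-only instance; collectors point here
tsd.network.port = 4242
tsd.mode = rw

# opentsdb-read.conf -- query instance; ro blocks data-point writes
tsd.network.port = 4243
tsd.mode = ro
```

With dashboards and suggest queries hitting only the reader port, a hung suggest call can no longer tie up the worker threads on the ingest path.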
Unfortunately, the TSD code has deferred.joinUninterruptibly() calls everywhere, so it will never time out waiting on an async call to HBase. And by default AsyncHBase doesn't set a timeout on RPCs, so an RPC can be sent to a region server and, if the server never responds, the thread will simply hang in the TSDB JVM until it is restarted.
Try setting hbase.rpc.timeout to a reasonable value, like 60000, for a JVM that reads from HBase; that will let the threads recover. See https://opentsdb.github.io/asynchbase/docs/build/html/configuration.html. Also check the region-server stats to see whether the read queue is full; there could be an unhandled exception in the AsyncHBase code that dropped a response from the server on the floor.
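For example, in the reader TSD's opentsdb.conf (the value is in milliseconds; 60000 is just a suggested starting point, not a tuned value):

```
# AsyncHBase RPC timeout in ms; RPCs that get no response within this
# window fail with an exception instead of blocking the worker forever.
hbase.rpc.timeout = 60000
```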