
milliseconds metrics may cause the compaction hang and huge region tmp files and region server down. #490

Closed
tonywutao opened this issue Apr 11, 2015 · 14 comments


@tonywutao

Environment:

  1. OpenTSDB 2.0.1 and HBase CDH4.3.0.
  2. OpenTSDB compaction enabled.
  3. We use millisecond-precision timestamps in our OpenTSDB metrics.

Problem
After OpenTSDB compaction, we found some regions stuck in pending_close, and the size of the region tmp folder on HDFS kept increasing and never stopped. This brought the region server down and generated over 50GB of region tmp files, while the metric data itself was less than 500MB.

Root cause

  1. When millisecond timestamps are used in OpenTSDB metrics, a single row (one hour) may hold over 32,000 data points, and each millisecond column qualifier uses 4 bytes. After compaction, the merged column's (rowkey + columnfamily:qualifier) size can exceed 128KB (hfile.index.block.max.size).
     With second-resolution metrics there are at most 3,600 data points per row (one hour), each with a 2-byte qualifier, so the limit is never reached.
  2. If the size of (rowkey + columnfamily:qualifier) exceeds hfile.index.block.max.size, the memstore flush can enter an infinite loop while writing the HFile index.

That's why the compaction hangs, the tmp folder of the regions on HDFS grows without bound, and the region server eventually goes down.
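The arithmetic above can be sketched in a few lines. This is a rough estimate only: the row key width (default UID sizes, a single tag pair) and the point counts are assumptions for illustration, not measured values from this cluster.

```python
# Back-of-the-envelope estimate of the compacted cell key size.
# Assumptions (hypothetical, for illustration): default 3-byte UIDs,
# one tagk/tagv pair, column family "t".
ROWKEY = 3 + 4 + (3 + 3)      # metric UID + base timestamp + one tag pair
CF_QUAL_PREFIX = len(b"t:")   # column family plus separator

def compacted_cell_key_size(points, qualifier_bytes):
    """Size of (rowkey + columnfamily:qualifier) for one compacted column."""
    return ROWKEY + CF_QUAL_PREFIX + points * qualifier_bytes

INDEX_LIMIT = 128 * 1024      # hfile.index.block.max.size default (128KB)

# Second resolution: at most 3600 points per hourly row, 2-byte qualifiers.
seconds = compacted_cell_key_size(3600, 2)    # a few KB, well under the limit
# Millisecond resolution: "over 32,000" points per row (say 33,000),
# 4-byte qualifiers.
millis = compacted_cell_key_size(33000, 4)    # exceeds the 128KB limit

print(seconds, seconds < INDEX_LIMIT)
print(millis, millis >= INDEX_LIMIT)
```

With these numbers the second-resolution row stays in the kilobyte range, while the millisecond row crosses the 128KB threshold and triggers the flush loop described above.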

For now, we have moved the aggregation to the application side instead of OpenTSDB to work around this problem.

@tonywutao tonywutao changed the title milliseconds metrics may cause the compaction hang and region server down. milliseconds metrics may cause the compaction hang and huge region tmp files and region server down. Apr 11, 2015
@manolama
Member

Oh wow, that's really interesting, and awesome work figuring that out. I'll add some notes to our documentation. I'll also upstream the Yahoo append code, which would avoid this issue because it places the offsets and values in the column value instead of the qualifier.

@manolama
Member

Updated the documentation http://opentsdb.net/docs/build/html/user_guide/troubleshooting.html. Thanks again for finding this!

@ghost

ghost commented Jul 9, 2015

I have experienced this problem for a while. Could you please show me how I can disable compaction in OpenTSDB 2.1.0?
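For reference, OpenTSDB's own compaction is controlled by a single setting. A minimal sketch, assuming the standard `tsd.storage.enable_compaction` key in `opentsdb.conf` (verify against your version's configuration docs):

```
# opentsdb.conf — disable OpenTSDB's row compaction (requires a TSD restart)
tsd.storage.enable_compaction = false
```

Note this only disables OpenTSDB's row compaction; HBase's own compactions are a separate mechanism, as pointed out below.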

@CamJN
Contributor

CamJN commented Dec 14, 2015

I think I'm running into this: I'm using millisecond timestamps and seeing crashes when a compaction happens. However, I've disabled compactions in OpenTSDB, yet HBase seems to imply compactions are still being done when I issue a query?

@jtamplin
Contributor

HBase compactions are different from OpenTSDB compactions.

@trixpan trixpan mentioned this issue Jan 3, 2016
@shubhamagarwal003

@CamJN were you able to solve this issue?

@CamJN
Contributor

CamJN commented Jan 4, 2016

No, I had to turn compactions off, and now my queries take 11-17 seconds to run.

@dacjames

@CamJN We resolved this issue by increasing hfile.index.block.max.size to 512KB.
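A sketch of that workaround in `hbase-site.xml` (the property name is the one discussed in this thread; the value is bytes, and region servers need a restart to pick it up):

```xml
<!-- hbase-site.xml: raise the HFile index block size limit -->
<property>
  <name>hfile.index.block.max.size</name>
  <value>524288</value> <!-- 512KB, up from the 128KB default -->
</property>
```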

@enis

enis commented Jul 23, 2016

There seems to be a bug in HBase that is causing this. When writing HFile intermediate-level index blocks (the index is multi-level), we keep the indexes for the next level (and the root) so the multi-level index can be built recursively. It seems we never finish the intermediate-level index block writes because the estimated size of the root-level index never comes down as expected. We're working on a patch in HBase and will update here once we know more.

@enis

enis commented Jul 23, 2016

Independently, we were also testing with hfile.index.block.max.size set to 512KB to workaround the issue until we have a fix in HBase.

@enis

enis commented Jul 26, 2016

If anybody runs into this, the fix will come as part of https://issues.apache.org/jira/browse/HBASE-16288.

@enis

enis commented Jul 26, 2016

@manolama, @tonywutao, BTW the issue does not only happen during compactions. We have seen it in region recovery as well, and it can theoretically happen in flushes too. We can update the troubleshooting guide to note that the temporary workaround of setting hfile.index.block.max.size to 512KB is not needed with HBASE-16288.

@dacjames

@enis Excellent! We had traced the problem exactly as you describe, and reporting it to HBase was on my list, but it looks like you beat me to it. I can confirm this happens on region recovery. FYI, the tmp file grows indefinitely: ours hit 12TB before the cluster ran out of disk space!

@enis

enis commented Aug 1, 2016

Thanks @dacjames. FYI, I have just committed the fix for https://issues.apache.org/jira/browse/HBASE-16288 to all HBase 1.1+ branches.
