
milliseconds metrics may cause the compaction hang and huge region tmp files and region server down. #490

Closed
tonywutao opened this issue Apr 11, 2015 · 14 comments


@tonywutao

Environment:

  1. OpenTSDB 2.0.1 and HBase CDH4.3.0.
  2. OpenTSDB compaction enabled.
  3. We use millisecond-precision timestamps in our OpenTSDB metrics.

Problem
After OpenTSDB compaction, we found some regions stuck in pending_close, and the size of the region tmp folder on HDFS kept increasing and never stopped. This brought the region server down and generated over 50GB of region tmp files, while the metric data itself was less than 500MB.

Root cause

  1. When millisecond timestamps are used in OpenTSDB metrics, a single row (one hour) may hold over 32,000 data points, and each millisecond column qualifier uses 4 bytes. After compaction, the merged column's (rowkey + columnfamily:qualifier) size can exceed 128KB (hfile.index.block.max.size).
     With second-resolution metrics there are at most 3,600 data points per row (one hour), each with a 2-byte qualifier, so the limit is never reached.
  2. If the size of (rowkey + columnfamily:qualifier) exceeds hfile.index.block.max.size, the memstore flush can enter an infinite loop while writing the HFile index.

That's why the compaction hangs, the tmp folder of the regions on HDFS grows without bound, and the region server eventually goes down.
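The arithmetic above can be sketched in a few lines. This is a rough estimate only: the row key width (default UID sizes, a single tag pair) and the point counts are assumptions for illustration, not measured values from this cluster.

```python
# Back-of-the-envelope estimate of the compacted cell key size.
# Assumptions (hypothetical, for illustration): default 3-byte UIDs,
# one tagk/tagv pair, column family "t".
ROWKEY = 3 + 4 + (3 + 3)      # metric UID + base timestamp + one tag pair
CF_QUAL_PREFIX = len(b"t:")   # column family plus separator

def compacted_cell_key_size(points, qualifier_bytes):
    """Size of (rowkey + columnfamily:qualifier) for one compacted column."""
    return ROWKEY + CF_QUAL_PREFIX + points * qualifier_bytes

INDEX_LIMIT = 128 * 1024      # hfile.index.block.max.size default (128KB)

# Second resolution: at most 3600 points per hourly row, 2-byte qualifiers.
seconds = compacted_cell_key_size(3600, 2)    # a few KB, well under the limit
# Millisecond resolution: "over 32,000" points per row (say 33,000),
# 4-byte qualifiers.
millis = compacted_cell_key_size(33000, 4)    # exceeds the 128KB limit

print(seconds, seconds < INDEX_LIMIT)
print(millis, millis >= INDEX_LIMIT)
```

With these numbers the second-resolution row stays in the kilobyte range, while the millisecond row crosses the 128KB threshold and triggers the flush loop described above.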

For now, we have moved the aggregation to the application side instead of OpenTSDB to work around this problem.

@tonywutao tonywutao changed the title milliseconds metrics may cause the compaction hang and region server down. milliseconds metrics may cause the compaction hang and huge region tmp files and region server down. Apr 11, 2015
@manolama
Member

Oh wow, that's really interesting, and awesome work figuring that out. I'll add some notes to our documentation. I'll also upstream the Yahoo append code, which would avoid this issue because it places the offsets and values in the column value instead of the qualifier.

@manolama
Member

Updated the documentation http://opentsdb.net/docs/build/html/user_guide/troubleshooting.html. Thanks again for finding this!

@ghost

ghost commented Jul 9, 2015

I have experienced this problem for a while. Could you please show me how I can disable compaction in OpenTSDB 2.1.0?
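For reference, OpenTSDB's own compaction is controlled by a single setting. A minimal sketch, assuming the standard `tsd.storage.enable_compaction` key in `opentsdb.conf` (verify against your version's configuration docs):

```
# opentsdb.conf — disable OpenTSDB's row compaction (requires a TSD restart)
tsd.storage.enable_compaction = false
```

Note this only disables OpenTSDB's row compaction; HBase's own compactions are a separate mechanism, as pointed out below.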

@CamJN
Contributor

CamJN commented Dec 14, 2015

I think I'm running into this: I'm using millisecond timestamps and seeing crashes when a compaction happens. However, I've disabled compactions in OpenTSDB, yet HBase seems to imply compactions are still being done when I issue a query?

@jtamplin
Contributor

HBase compactions are different from OpenTSDB compactions.

@trixpan trixpan mentioned this issue Jan 3, 2016
@shubhamagarwal003

@CamJN were you able to solve this issue?

@CamJN
Contributor

CamJN commented Jan 4, 2016

No, I had to turn compactions off, and now my queries take 11-17 seconds to run.

@dacjames

@CamJN We resolved this issue by increasing hfile.index.block.max.size to 512KB.
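A sketch of that workaround in `hbase-site.xml` (the property name is the one discussed in this thread; the value is bytes, and region servers need a restart to pick it up):

```xml
<!-- hbase-site.xml: raise the HFile index block size limit -->
<property>
  <name>hfile.index.block.max.size</name>
  <value>524288</value> <!-- 512KB, up from the 128KB default -->
</property>
```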

@enis

enis commented Jul 23, 2016

There seems to be a bug in HBase that is causing this. When writing HFile intermediate-level index blocks (the index is multi-level), we keep the indexes for the next level (and the root) so the multi-level index can be built recursively. It seems we never finish the intermediate-level index block writes because the estimated size of the root-level index never comes down as expected. We're working on a patch in HBase and will update here once we know more.

@enis

enis commented Jul 23, 2016

Independently, we were also testing with hfile.index.block.max.size set to 512KB to workaround the issue until we have a fix in HBase.

@enis

enis commented Jul 26, 2016

If anybody runs into this, the fix will come as part of https://issues.apache.org/jira/browse/HBASE-16288.

@enis

enis commented Jul 26, 2016

@manolama, @tonywutao, BTW the issue does not only happen during compactions. We have seen it in region recovery as well, and it can theoretically happen in flushes too. We can update the troubleshooting guide to note that the temporary workaround of setting hfile.index.block.max.size to 512KB is not needed with HBASE-16288.

@dacjames

@enis Excellent! We had traced the problem exactly as you describe, and reporting it to HBase was on my list, but it looks like you beat me to it. I can confirm this happens on region recovery. FYI, the tmp file grows indefinitely: ours hit 12TB before the cluster ran out of disk space!

@enis

enis commented Aug 1, 2016

Thanks @dacjames. FYI, I have just committed the fix for https://issues.apache.org/jira/browse/HBASE-16288 to all HBase 1.1+ branches.
