milliseconds metrics may cause the compaction hang and huge region tmp files and region server down. #490
Comments
Oh wow, that's really interesting and awesome work on figuring that out. I'll add some notes to our documentation. Also, I'll upstream the Yahoo append code that would avoid this issue, as it places the offsets and values in the column value instead of the qualifier.
Updated the documentation http://opentsdb.net/docs/build/html/user_guide/troubleshooting.html. Thanks again for finding this!
I have experienced this problem for a while. Could you please show me how I can disable compaction in OpenTSDB 2.1.0?
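For readers landing here with the same question: OpenTSDB's internal row compaction (distinct from HBase compactions, as noted below) is controlled by the tsd.storage.enable_compaction setting. A minimal sketch, assuming the standard opentsdb.conf properties format:

```
# opentsdb.conf -- disable OpenTSDB's internal row compaction
# (note: this does NOT affect HBase's own region compactions)
tsd.storage.enable_compaction = false
```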
I think I'm running into this. I'm using millisecond timestamps and seeing crashes when a compaction happens anyway. However, I've disabled compactions in OpenTSDB, but HBase seems to imply compactions are still being done when I issue a query?
HBase compactions are different from OpenTSDB compactions.
@CamJN were you able to solve this issue?
No, I had to turn compactions off, and now my queries take 11-17 seconds to run.
@CamJN We resolved this issue by increasing hfile.index.block.max.size to 512KB.
There seems to be a bug in HBase that is causing this. We are writing HFile intermediate-level index blocks, which are multi-level. We keep the indexes for the next level (and the root) to create this multi-level index recursively. It seems that we never end the intermediate-level index block writes because the estimated size of the root-level index never comes down as expected. We're working on a patch in HBase; will update here once we know more.
Independently, we were also testing with hfile.index.block.max.size set to 512KB to work around the issue until we have a fix in HBase.
If anybody runs into this, the fix will come as part of https://issues.apache.org/jira/browse/HBASE-16288.
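For reference, the 512KB workaround discussed in this thread would look like this in hbase-site.xml (524288 bytes = 512KB; the property name is the one quoted in the comments above):

```xml
<!-- hbase-site.xml: raise the HFile index block size limit from the
     128KB default to 512KB, per the workaround discussed above -->
<property>
  <name>hfile.index.block.max.size</name>
  <value>524288</value>
</property>
```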
@manolama, @tonywutao: the issue will not only happen during compactions, BTW. We have seen it in region recovery as well, and it can theoretically happen during flushes too. We can update the troubleshooting guide to note that the temporary fix of setting hfile.index.block.max.size to 512KB is not needed with HBASE-16288.
@enis Excellent! We had traced the problem exactly as you describe, and reporting it to HBase was on my list, but it looks like you beat me to it. I can confirm this happens on region recovery. FYI, the tmp file grows indefinitely: ours hit 12TB before the cluster ran out of disk space!
Thanks @dacjames. FYI, I have just committed the fix for https://issues.apache.org/jira/browse/HBASE-16288 to all HBase 1.1+ branches.
Environment:
Problem
After an OpenTSDB compaction, we found some regions stuck in PENDING_CLOSE, and the size of the region tmp folder on HDFS kept increasing, never stopping. This brought the region server down and generated over 50GB of region tmp files, even though the metric data itself was less than 500MB.
Root cause
1. With second-precision metrics there are at most 3600 data points in one row (one hour), and each column qualifier uses 2 bytes, so there is no problem. With millisecond-precision metrics, a row can hold up to 3,600,000 data points, each with a 4-byte qualifier, so the compacted qualifier can grow to several megabytes.
2. If the size of (rowkey + columnfamily:qualifier) exceeds hfile.index.block.max.size, the memstore flush can enter an infinite loop while writing the HFile index.
That's why the compaction hangs, the tmp folder of regions on HDFS grows without bound, and the region server eventually goes down.
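The arithmetic behind the root cause can be sketched as follows. The 2-byte and 4-byte qualifier widths are OpenTSDB's second- and millisecond-precision encodings, and 128KB is HBase's default hfile.index.block.max.size:

```python
# Back-of-the-envelope check of the compacted qualifier size.
# OpenTSDB's row compaction concatenates every column qualifier of a
# one-hour row into a single qualifier on one column.

SECOND_POINTS_PER_ROW = 3600           # one data point per second, one-hour row
MS_POINTS_PER_ROW = 3600 * 1000        # one data point per millisecond

second_qualifier_bytes = SECOND_POINTS_PER_ROW * 2   # 2-byte qualifiers
ms_qualifier_bytes = MS_POINTS_PER_ROW * 4           # 4-byte qualifiers

HFILE_INDEX_BLOCK_MAX = 128 * 1024     # HBase default: 131,072 bytes

print(second_qualifier_bytes)                        # 7200 -> well under the limit
print(ms_qualifier_bytes)                            # 14400000 -> far over the limit
print(ms_qualifier_bytes > HFILE_INDEX_BLOCK_MAX)    # True -> triggers the hang
```

A fully dense second-precision row stays around 7KB, while a dense millisecond-precision row can exceed 14MB, far past the 128KB index block limit that the flush/compaction path trips over.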
For now, we have moved the aggregation to the application side instead of OpenTSDB to work around this problem.
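A minimal sketch of that app-side aggregation, assuming a hypothetical downsample_to_seconds helper that averages millisecond samples into one data point per second before anything is written to OpenTSDB:

```python
from collections import defaultdict


def downsample_to_seconds(points):
    """Average millisecond-precision (ts_ms, value) samples per second.

    Hypothetical helper illustrating the workaround: collapse each
    second's millisecond samples into a single data point so OpenTSDB
    rows never accumulate millions of millisecond columns.
    """
    buckets = defaultdict(list)
    for ts_ms, value in points:
        buckets[ts_ms // 1000].append(value)
    return sorted((sec, sum(vs) / len(vs)) for sec, vs in buckets.items())


# Three samples in second 1 collapse to their average; second 2 keeps its
# single sample.
print(downsample_to_seconds([(1000, 1.0), (1250, 2.0), (1999, 3.0), (2500, 4.0)]))
# [(1, 2.0), (2, 4.0)]
```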