Added BlockCache to HFile reader. #13427
base: release-0.14.2
Conversation
Summary: HBase 2.4+ does not allocate a block cache automatically within the CacheConfig but requires a BlockCache instance to be passed in. This differs from HBase 1.x, where CacheConfig allocated and used a static global BlockCache instance. The BlockCache speeds up lookups from the HFile. To emulate the HBase 1.x behavior, we allocate a static global block cache here if it is enabled. The BlockCache can be controlled via hfile configs passed to the process through the hadoop configuration. https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheFactory.java

Logs which show the BlockCache was allocated:

[Driver] BlockCacheFactory: Allocating BlockCache size=3.20 GB, blockSize=64 KB
[Driver] HoodieAvroHFileReader: Allocated a new global block cache for hfile readers LruBlockCache{blockCount=0, currentSize=2.40 MB, freeSize=3.20 GB, maxSize=3.20 GB, heapSize=2.40 MB, minSize=3.04 GB, minFactor=0.95, multiSize=1.52 GB, multiFactor=0.5, singleSize=778.24 MB, singleFactor=0.25}

Test Plan: Tested using a perf tool to look up 10K keys from RI. With the BlockCache enabled, lookups of existing keys are about 4x faster.

With BlockCache:
[Driver] HoodieRecordIndexPerf: Time to lookup 10000 existing keys: 204462
[Driver] HoodieRecordIndexPerf: Time to lookup 10000 missing keys: 95169

Without BlockCache:
[Driver] HoodieRecordIndexPerf: Time to lookup 10000 existing keys: 817908
[Driver] HoodieRecordIndexPerf: Time to lookup 10000 missing keys: 93470

Reviewers: syalla, O955 Project Hoodie Project Reviewer, #hoodie_blocking_reviewers, balajee
Reviewed By: O955 Project Hoodie Project Reviewer, #hoodie_blocking_reviewers, balajee
Subscribers: balajee
Tags: #hudi_0.14, #has_java
JIRA Issues: HUDI-6311
Differential Revision: https://code.uberinternal.com/D16768071
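The allocate-once-per-process behavior described above can be sketched as a lazily initialized static singleton. This is not the patch's actual code: the real implementation calls HBase's BlockCacheFactory, which needs an HBase runtime, so a simple access-ordered LinkedHashMap stands in for LruBlockCache here, and all names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GlobalBlockCacheSketch {
  // Process-wide cache shared by all HFile readers, mirroring the static
  // cache HBase 1.x kept inside CacheConfig. Stand-in type, not BlockCache.
  private static Map<String, byte[]> globalBlockCache;

  // Lazily create the cache on first use; synchronized so concurrent
  // readers observe a single instance.
  static synchronized Map<String, byte[]> getOrCreateBlockCache(final int maxBlocks) {
    if (globalBlockCache == null) {
      // accessOrder=true gives LRU eviction, roughly like LruBlockCache.
      globalBlockCache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
          return size() > maxBlocks;
        }
      };
    }
    return globalBlockCache;
  }
}
```

Every reader that asks for the cache gets the same instance, so blocks cached while reading one HFile stay available to later readers of the same file, which is where the 4x lookup speedup above comes from.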
Summary: This library is required for enabling the hfile BlockCache.

Reviewers: syalla, O955 Project Hoodie Project Reviewer, #hoodie_blocking_reviewers, jingli
Reviewed By: O955 Project Hoodie Project Reviewer, #hoodie_blocking_reviewers, jingli
Tags: #hudi_0.14
JIRA Issues: HUDI-6594
Differential Revision: https://code.uberinternal.com/D17294457
@@ -197,9 +199,15 @@ protected <T> ClosableIterator<HoodieRecord<T>> lookupRecords(List<String> sorte
blockContentLoc.getContentPositionInLogFile(),
blockContentLoc.getBlockSize());

// HFile BlockCache key is based on the name of the file. The inlinefs path has the format
// inlinefs://<path_to_outer_file>/<outer_file_scheme>/?start_offset=start_offset>&length=<length> whose filename
// component is parsed as "?start_offset=start_offset>&length=<length>" ignoring the actual outer file itself.
"start_offset>" ==> "<start_offset>"
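The parsing behavior the diff comment describes can be demonstrated without Hadoop: the file-name component of a path is whatever follows the last '/', which for an inlinefs path is only the query string, regardless of the outer file. A minimal sketch (the class and method names are illustrative, not from the patch):

```java
public class InlineFsNameSketch {
  // Returns the file-name component of a path, i.e. everything after the
  // last '/'. For inlinefs paths of the form
  // inlinefs://<path_to_outer_file>/<outer_file_scheme>/?start_offset=<o>&length=<l>
  // this is just the query string, so it is identical for inline slices
  // of different outer files that share offset and length.
  static String nameComponent(String inlineFsPath) {
    return inlineFsPath.substring(inlineFsPath.lastIndexOf('/') + 1);
  }
}
```

This is why the raw file name alone cannot serve as a BlockCache key for inlinefs paths.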
// inlinefs://<path_to_outer_file>/<outer_file_scheme>/?start_offset=start_offset>&length=<length> whose filename
// component is parsed as "?start_offset=start_offset>&length=<length>" ignoring the actual outer file itself.
// To keep this unique (as cache keys should be unique for each file), we add a hash which is calculated from the
// entire filename. So the cache key would now be ?start_offset=start_offset>&length=<length>&pathHash=<hash>
same typo
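The pathHash scheme from the diff comment can be sketched in a few lines. The patch does not show which hash function it uses, so `String.hashCode()` below is an assumption, and the class and method names are illustrative:

```java
public class CacheKeySketch {
  // Builds a cache key from the inlinefs path: the file-name component
  // (everything after the last '/') plus a hash of the entire path, so
  // inline slices of different outer files get distinct keys.
  static String cacheKey(String inlineFsPath) {
    String name = inlineFsPath.substring(inlineFsPath.lastIndexOf('/') + 1);
    // Hash of the full path, not just the name component, keeps the key
    // unique per outer file. String.hashCode() is a stand-in here.
    return name + "&pathHash=" + Integer.toHexString(inlineFsPath.hashCode());
  }
}
```

With this, two inline slices at the same offset and length in different outer files no longer collide in the BlockCache.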
}

// The BlockCache keys are based on the name of the files being cached. Within HUDI, there is a non-zero chance that
// two different files can have the same name. Also, when multiple datasets are being accessed in the same JVM, there
- When do two files have the same name? Same file name in different paths?
@@ -83,6 +87,11 @@ public class HoodieAvroHFileReader extends HoodieAvroFileReaderBase implements H

private static final Logger LOG = LoggerFactory.getLogger(HoodieAvroHFileReader.class);

// The global block cache instance used across all HFile readers
private static BlockCache GLOBAL_HFILE_BLOCK_CACHE;
- Why don't we just make it per file? After a file has been read, why do we need to keep its content in the cache?
- We have removed all the hbase dependencies, does it require us to bring some dependency back? If so, we probably cannot do it. Or just implement one by ourselves.
@@ -470,6 +470,7 @@
<include>org.apache.hbase.thirdparty:hbase-shaded-miscellaneous</include>
<include>org.apache.hbase.thirdparty:hbase-shaded-netty</include>
<include>org.apache.hbase.thirdparty:hbase-shaded-protobuf</include>
These Hbase dependencies have been removed from HUDI.
hey @rmahindra: can you take a first stab at reviewing this patch?
Change Logs
HBase 2.4+ does not allocate a block cache automatically within the CacheConfig
but requires a BlockCache instance to be passed in. This is different from HBase 1.x
where CacheConfig allocated and used a static global BlockCache instance.
Impact
Lookups from MDT partitions like RI can be faster.
Risk level (write none, low, medium, or high below)
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
ticket number here and follow the instruction to make changes to the website.
Contributor's checklist