Added BlockCache to HFile reader. #13427


Open · wants to merge 2 commits into release-0.14.2

Conversation

pacificleo

Change Logs

HBase 2.4+ does not allocate a block cache automatically within the CacheConfig
and instead requires a BlockCache instance to be passed in. This differs from HBase 1.x,
where CacheConfig allocated and used a static global BlockCache instance.

BlockCache speeds up lookups from the HFile. To emulate the HBase 1.x behavior,
we allocate a static global block cache here if caching is enabled. The BlockCache
can be controlled via the hfile configs passed to the process through the Hadoop
configuration; see BlockCacheFactory:
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheFactory.java
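For reference, a minimal sketch of the approach, assuming the HBase 2.4 APIs referenced above (BlockCacheFactory.createBlockCache and the CacheConfig(Configuration, BlockCache) constructor); the class and method names here are illustrative, not the ones used in this patch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.io.hfile.BlockCache;
import org.apache.hadoop.hbase.io.hfile.BlockCacheFactory;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;

// Hypothetical sketch: one process-wide BlockCache shared by all HFile
// readers, mirroring the static global cache of HBase 1.x.
public final class GlobalHFileBlockCacheExample {

  private static volatile BlockCache globalCache;

  private GlobalHFileBlockCacheExample() {}

  public static CacheConfig cacheConfigFor(Configuration conf) {
    if (globalCache == null) {
      synchronized (GlobalHFileBlockCacheExample.class) {
        if (globalCache == null) {
          // Honors the hfile.* configs in the Hadoop configuration
          // (e.g. hfile.block.cache.size); returns null when the
          // block cache is disabled.
          globalCache = BlockCacheFactory.createBlockCache(conf);
        }
      }
    }
    // Fall back to a plain CacheConfig when the cache is disabled.
    return globalCache == null ? new CacheConfig(conf) : new CacheConfig(conf, globalCache);
  }
}
```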

Logs showing that the BlockCache was allocated:

[Driver] BlockCacheFactory: Allocating BlockCache size=3.20 GB, blockSize=64 KB
[Driver] HoodieAvroHFileReader: Allocated a new global block cache for hfile readers LruBlockCache{blockCount=0, currentSize=2.40 MB, freeSize=3.20 GB, maxSize=3.20 GB, heapSize=2.40 MB, minSize=3.04 GB, minFactor=0.95, multiSize=1.52 GB, multiFactor=0.5, singleSize=778.24 MB, singleFactor=0.25}

Test Plan:
Tested using a perf tool to look up 10K keys from the record index (RI). With BlockCache enabled, lookups of existing keys are about 4x faster (204462 vs. 817908); missing-key lookup times are essentially unchanged.

With BlockCache:
[Driver] HoodieRecordIndexPerf: Time to lookup 10000 existing keys: 204462
[Driver] HoodieRecordIndexPerf: Time to lookup 10000 missing keys:   95169

Without BlockCache:
[Driver] HoodieRecordIndexPerf: Time to lookup 10000 existing keys: 817908
[Driver] HoodieRecordIndexPerf: Time to lookup 10000 missing keys: 93470

Impact

Lookups from MDT partitions like RI can be faster.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed


Reviewers: syalla, O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, balajee

Reviewed By: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, balajee

Subscribers: balajee

Tags: #hudi_0.14, #has_java

JIRA Issues: HUDI-6311

Differential Revision: https://code.uberinternal.com/D16768071
Summary: This library is required for enabling the HFile BlockCache.

Reviewers: syalla, O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, jingli

Reviewed By: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, jingli

Tags: #hudi_0.14

JIRA Issues: HUDI-6594

Differential Revision: https://code.uberinternal.com/D17294457
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Jun 12, 2025
@hudi-bot

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@@ -197,9 +199,15 @@ protected <T> ClosableIterator<HoodieRecord<T>> lookupRecords(List<String> sorte
blockContentLoc.getContentPositionInLogFile(),
blockContentLoc.getBlockSize());

// HFile BlockCache key is based on the name of the file. The inlinefs path has the format
// inlinefs://<path_to_outer_file>/<outer_file_scheme>/?start_offset=start_offset>&length=<length> whose filename
// component is parsed as "?start_offset=start_offset>&length=<length>" ignoring the actual outer file itself.
Contributor:
"start_offset>" ==> "<start_offset>"


// inlinefs://<path_to_outer_file>/<outer_file_scheme>/?start_offset=start_offset>&length=<length> whose filename
// component is parsed as "?start_offset=start_offset>&length=<length>" ignoring the actual outer file itself.
// To keep this unique (as cache keys should be unique for each file), we add a hash which is calculated from the
// entire filename. So the cache key would now be ?start_offset=start_offset>&length=<length>&pathHash=<hash>
Contributor:
same typo
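To illustrate the cache-key scheme described in the quoted comment, a hypothetical sketch; the helper name and the use of String.hashCode() over the full path are assumptions for this example, not necessarily how the patch computes the hash:

```java
import org.apache.hadoop.fs.Path;

// Illustrative only: append a hash of the full inlinefs path so two inline
// HFiles whose name components ("?start_offset=<start_offset>&length=<length>")
// collide still get distinct BlockCache keys.
static String uniqueCacheKeyName(Path inlineFsPath) {
  // getName() returns only the last path component, which drops the outer file.
  String name = inlineFsPath.getName();
  // The patch derives the hash from the entire filename; String.hashCode()
  // is a stand-in here.
  return name + "&pathHash=" + Integer.toHexString(inlineFsPath.toString().hashCode());
}
```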

}

// The BlockCache keys are based on the name of the files being cached. Within HUDI, there is a non-zero chance that
// two different files can have the same name. Also, when multiple datasets are being accessed in the same JVM, there
@linliu-code (Contributor), Jun 16, 2025:
  1. When do two files have the same name? Same file name in different paths?

@@ -83,6 +87,11 @@ public class HoodieAvroHFileReader extends HoodieAvroFileReaderBase implements H

private static final Logger LOG = LoggerFactory.getLogger(HoodieAvroHFileReader.class);

// The global block cache instance used across all HFile readers
private static BlockCache GLOBAL_HFILE_BLOCK_CACHE;
Contributor:
  1. Why don't we just make it per file? After a file has been read, why do we need to keep its content in the cache?
  2. We have removed all the hbase dependencies, does it require us to bring some dependency back? If so, we probably cannot do it. Or just implement one by ourselves.

@@ -470,6 +470,7 @@
<include>org.apache.hbase.thirdparty:hbase-shaded-miscellaneous</include>
<include>org.apache.hbase.thirdparty:hbase-shaded-netty</include>
<include>org.apache.hbase.thirdparty:hbase-shaded-protobuf</include>
Contributor:
These HBase dependencies have been removed from HUDI.

@nsivabalan (Contributor):
hey @rmahindra: can you take a first stab at reviewing this patch?
