Added BlockCache to HFile reader. #13427


Open · wants to merge 2 commits into release-0.14.2

Conversation

pacificleo

Change Logs

HBase 2.4+ does not allocate a block cache automatically within the CacheConfig
and instead requires a BlockCache instance to be passed in. This differs from HBase 1.x,
where CacheConfig allocated and used a static global BlockCache instance.

BlockCache speeds up lookups from the HFile. To emulate the HBase 1.x behavior,
we allocate a static global block cache here if caching is enabled. The BlockCache
can be controlled via the hfile configs passed to the process through the Hadoop
configuration; see BlockCacheFactory:
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheFactory.java
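For reference, a minimal sketch of the approach, assuming the HBase 2.4 APIs referenced above (BlockCacheFactory.createBlockCache and the CacheConfig(Configuration, BlockCache) constructor); the class and method names here are illustrative, not the ones used in this patch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.io.hfile.BlockCache;
import org.apache.hadoop.hbase.io.hfile.BlockCacheFactory;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;

// Hypothetical sketch: one process-wide BlockCache shared by all HFile
// readers, mirroring the static global cache of HBase 1.x.
public final class GlobalHFileBlockCacheExample {

  private static volatile BlockCache globalCache;

  private GlobalHFileBlockCacheExample() {}

  public static CacheConfig cacheConfigFor(Configuration conf) {
    if (globalCache == null) {
      synchronized (GlobalHFileBlockCacheExample.class) {
        if (globalCache == null) {
          // Honors the hfile.* configs in the Hadoop configuration
          // (e.g. hfile.block.cache.size); returns null when the
          // block cache is disabled.
          globalCache = BlockCacheFactory.createBlockCache(conf);
        }
      }
    }
    // Fall back to a plain CacheConfig when the cache is disabled.
    return globalCache == null ? new CacheConfig(conf) : new CacheConfig(conf, globalCache);
  }
}
```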

Logs showing that the BlockCache was allocated:

[Driver] BlockCacheFactory: Allocating BlockCache size=3.20 GB, blockSize=64 KB
[Driver] HoodieAvroHFileReader: Allocated a new global block cache for hfile readers LruBlockCache{blockCount=0, currentSize=2.40 MB, freeSize=3.20 GB, maxSize=3.20 GB, heapSize=2.40 MB, minSize=3.04 GB, minFactor=0.95, multiSize=1.52 GB, multiFactor=0.5, singleSize=778.24 MB, singleFactor=0.25}

Test Plan:
Tested using a perf tool to look up 10K keys from the record index (RI). With BlockCache enabled, lookups of existing keys are about 4x faster (204462 vs. 817908); missing-key lookup times are essentially unchanged.

With BlockCache:
[Driver] HoodieRecordIndexPerf: Time to lookup 10000 existing keys: 204462
[Driver] HoodieRecordIndexPerf: Time to lookup 10000 missing keys:   95169

Without BlockCache:
[Driver] HoodieRecordIndexPerf: Time to lookup 10000 existing keys: 817908
[Driver] HoodieRecordIndexPerf: Time to lookup 10000 missing keys: 93470

Impact

Lookups from MDT partitions like RI can be faster.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed


Reviewers: syalla, O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, balajee

Reviewed By: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, balajee

Subscribers: balajee

Tags: #hudi_0.14, #has_java

JIRA Issues: HUDI-6311

Differential Revision: https://code.uberinternal.com/D16768071
Summary: This library is required for enabling the HFile BlockCache.

Reviewers: syalla, O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, jingli

Reviewed By: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, jingli

Tags: #hudi_0.14

JIRA Issues: HUDI-6594

Differential Revision: https://code.uberinternal.com/D17294457
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Jun 12, 2025
@hudi-bot

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@@ -197,9 +199,15 @@ protected <T> ClosableIterator<HoodieRecord<T>> lookupRecords(List<String> sorte
blockContentLoc.getContentPositionInLogFile(),
blockContentLoc.getBlockSize());

// HFile BlockCache key is based on the name of the file. The inlinefs path has the format
// inlinefs://<path_to_outer_file>/<outer_file_scheme>/?start_offset=start_offset>&length=<length> whose filename
// component is parsed as "?start_offset=start_offset>&length=<length>" ignoring the actual outer file itself.
Contributor:
"start_offset>" ==> "<start_offset>"


// inlinefs://<path_to_outer_file>/<outer_file_scheme>/?start_offset=start_offset>&length=<length> whose filename
// component is parsed as "?start_offset=start_offset>&length=<length>" ignoring the actual outer file itself.
// To keep this unique (as cache keys should be unique for each file), we add a hash which is calculated from the
// entire filename. So the cache key would now be ?start_offset=start_offset>&length=<length>&pathHash=<hash>
Contributor:
same typo
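To illustrate the cache-key scheme described in the quoted comment, a hypothetical sketch; the helper name and the use of String.hashCode() over the full path are assumptions for this example, not necessarily how the patch computes the hash:

```java
import org.apache.hadoop.fs.Path;

// Illustrative only: append a hash of the full inlinefs path so two inline
// HFiles whose name components ("?start_offset=<start_offset>&length=<length>")
// collide still get distinct BlockCache keys.
static String uniqueCacheKeyName(Path inlineFsPath) {
  // getName() returns only the last path component, which drops the outer file.
  String name = inlineFsPath.getName();
  // The patch derives the hash from the entire filename; String.hashCode()
  // is a stand-in here.
  return name + "&pathHash=" + Integer.toHexString(inlineFsPath.toString().hashCode());
}
```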

}

// The BlockCache keys are based on the name of the files being cached. Within HUDI, there is a non-zero chance that
// two different files can have the same name. Also, when multiple datasets are being accessed in the same JVM, there
@linliu-code (Contributor), Jun 16, 2025:
  1. When do two files have the same name? Same file name in different paths?

@@ -83,6 +87,11 @@ public class HoodieAvroHFileReader extends HoodieAvroFileReaderBase implements H

private static final Logger LOG = LoggerFactory.getLogger(HoodieAvroHFileReader.class);

// The global block cache instance used across all HFile readers
private static BlockCache GLOBAL_HFILE_BLOCK_CACHE;
Contributor:
  1. Why don't we just make it per file? After a file has been read, why do we need to keep its content in the cache?
  2. We have removed all the hbase dependencies, does it require us to bring some dependency back? If so, we probably cannot do it. Or just implement one by ourselves.

@@ -470,6 +470,7 @@
<include>org.apache.hbase.thirdparty:hbase-shaded-miscellaneous</include>
<include>org.apache.hbase.thirdparty:hbase-shaded-netty</include>
<include>org.apache.hbase.thirdparty:hbase-shaded-protobuf</include>
Contributor:
These HBase dependencies have been removed from HUDI.

@nsivabalan (Contributor):
hey @rmahindra: can you take a first stab at reviewing this patch?
