
[HUDI-9504] support in-memory buffer sort in append write #13409

Open
HuangZhenQiu wants to merge 1 commit into master from HUDI-9504-batch-sort

Conversation

HuangZhenQiu
Contributor

Change Logs

Add an in-memory buffer sort to the append write function to improve the Parquet compression ratio. In our experiments and testing, it improved the compression ratio by up to 300% with the right sort key and buffer size configuration.

Impact

Users can enable the feature through the buffer sort configuration options.
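
A minimal, hypothetical sketch of how the sort options discussed in this PR might be set on a Flink job, assuming the option keys settle on write.sort.enabled and write.sort.keys as suggested in the review below (the final names and defaults may differ):

import org.apache.flink.configuration.Configuration;

public class SortedAppendWriteExample {
  public static void main(String[] args) {
    // Hypothetical usage sketch; the option keys follow the naming discussed in
    // this PR and may change before the feature is merged.
    Configuration conf = new Configuration();
    conf.setString("write.sort.enabled", "true");          // enable in-memory buffer sort for append writes
    conf.setString("write.sort.keys", "city,event_time");  // example sort columns, comma separated
    // ...pass conf to the Hudi Flink pipeline / sink builder as usual.
  }
}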

Risk level (write none, low, medium or high below)

low

Documentation Update

This is a new feature. A Jira ticket will be created to update the website.

  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Jun 9, 2025
@HuangZhenQiu HuangZhenQiu force-pushed the HUDI-9504-batch-sort branch from 45cf09a to 2efb9bf on June 9, 2025 07:10
@HuangZhenQiu HuangZhenQiu force-pushed the HUDI-9504-batch-sort branch 5 times, most recently from 08ab684 to 123870c on June 11, 2025 21:12
@HuangZhenQiu
Contributor Author

@zhangyue19921010 @danny0405
Updated the diff with BinaryInMemorySortBuffer.

@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:XL PR with lines of changes > 1000 labels Jun 12, 2025
@zhangyue19921010
Contributor

@zhangyue19921010 @danny0405 Updated the diff with BinaryInMemorySortBuffer.

Will finish my review later this week.

@xushiyan xushiyan requested a review from danny0405 June 13, 2025 14:05
Contributor

@danny0405 danny0405 left a comment

cc @cshuo for the first round of review

@@ -586,6 +586,27 @@ private FlinkOptions() {
.withDescription("Maximum memory in MB for a write task, when the threshold hits,\n"
+ "it flushes the max size data bucket to avoid OOM, default 1GB");

@AdvancedConfig
public static final ConfigOption<Boolean> WRITE_BUFFER_SORT_ENABLED = ConfigOptions
.key("write.buffer.sort.enabled")
Contributor

write.sort.enabled?


@AdvancedConfig
public static final ConfigOption<String> WRITE_BUFFER_SORT_KEYS = ConfigOptions
.key("write.buffer.sort.keys")
Contributor

write.sort.keys?


public AppendWriteFunctionWithBufferSort(Configuration config, RowType rowType) {
super(config, rowType);
this.writebufferSize = config.get(FlinkOptions.WRITE_BUFFER_SIZE);
Contributor

Maybe WRITE_BUFFER_SIZE is not needed, since memorySegmentPool has a limited memory size, which can be used to trigger buffer flushing.
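
A simplified, self-contained sketch of the flush trigger being suggested here: instead of counting records, flush when an estimated memory budget (standing in for the memory segment pool size) is exhausted. The class and field names are illustrative only, not the actual Flink or Hudi APIs:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: flushing is driven by a memory budget rather than a record count.
class MemoryBoundedSortBuffer<T> {
  private final List<T> buffer = new ArrayList<>();
  private final long maxBytes;   // budget standing in for the memory segment pool size
  private long usedBytes = 0L;

  MemoryBoundedSortBuffer(long maxBytes) {
    this.maxBytes = maxBytes;
  }

  // Returns true when the caller should sort and flush the buffer.
  boolean add(T record, long estimatedRecordBytes) {
    buffer.add(record);
    usedBytes += estimatedRecordBytes;
    return usedBytes >= maxBytes;
  }

  // Sorts the buffered records, returns them, and resets the buffer.
  List<T> drainSorted(Comparator<T> comparator) {
    buffer.sort(comparator);
    List<T> sorted = new ArrayList<>(buffer);
    buffer.clear();
    usedBytes = 0L;
    return sorted;
  }
}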

@zhangyue19921010
Contributor

@HuangZhenQiu Still working on this?

@HuangZhenQiu
Contributor Author

@zhangyue19921010 Yes. I was OOO last week. Will update the diff this week.

@HuangZhenQiu HuangZhenQiu force-pushed the HUDI-9504-batch-sort branch from 36ed297 to c692d75 on July 10, 2025 08:06
Contributor Author

@HuangZhenQiu HuangZhenQiu left a comment

Thanks for the review. @cshuo @zhangyue19921010

@HuangZhenQiu HuangZhenQiu force-pushed the HUDI-9504-batch-sort branch 3 times, most recently from 3f5ee19 to 1d86872 on July 11, 2025 16:50
@cshuo
Contributor

cshuo commented Jul 14, 2025

Thanks for the review. @cshuo @zhangyue19921010

Ok, will take another look soon.

Contributor

@cshuo cshuo left a comment

@HuangZhenQiu Thanks for updating, I left some comments.

@HuangZhenQiu HuangZhenQiu force-pushed the HUDI-9504-batch-sort branch 2 times, most recently from f27de92 to d5210d8 on July 18, 2025 17:38
@HuangZhenQiu HuangZhenQiu force-pushed the HUDI-9504-batch-sort branch 2 times, most recently from fd80089 to 3c2854e on July 18, 2025 17:49
@HuangZhenQiu
Contributor Author

@cshuo
Thanks for the valuable comments. I resolved all of them except the buffer size option. Shall we keep it to give users flexibility when adopting the feature?

@cshuo
Contributor

cshuo commented Jul 21, 2025

@cshuo Thanks for the valuable comments. I resolved all of them except the buffer size option. Shall we keep it to give users flexibility when adopting the feature?

Thanks for updating. It seems the PR doesn't fix the comments here, i.e., with the current implementation, records are only partially ordered within a parquet file, since a file may contain batches from multiple sortAndSend calls. We should keep all records within a file strictly ordered to fully leverage the advantages of the sorting.

@HuangZhenQiu
Contributor Author

Small files are not good for query performance. But if we keep a whole parquet file ordered, we lose data freshness: sort time increases a lot and causes high back pressure in the Flink job. Thus, we use the buffer size to control row-group-level ordering and the compression ratio. It is a trade-off that balances data freshness and storage size without keeping parquet files sorted at the file level. We will leverage table services to do the stitching later.

@cshuo
Contributor

cshuo commented Jul 21, 2025

Small files are not good for query performance.

As mentioned above, we can trigger flushing by buffer memory size and set the size properly to relieve the small-files pressure. Also, the current implementation doesn't seem to ensure that the data is ordered at the row group level either, since a row group is switched when it reaches the configured size limit, e.g., 120 MB by default (HoodieStorageConfig#PARQUET_BLOCK_SIZE).

But if we keep a whole parquet file ordered, we lose data freshness.

Actually, data freshness is determined by the checkpoint interval. The writer flushes and commits the written files during checkpointing; until that point the data remains invisible.

Sort time increases a lot and causes high back pressure in the Flink job.

Agreed that it will need more sort time to keep the whole file ordered. I'm not sure how significant the impact is; I remember @Alowator has an ingestion benchmark that includes sorting of the binary buffer here, and reported that the sort is fast enough that it doesn't affect write performance, with a default batch size of 256 MB to trigger flushing. Maybe you can double-check that. cc @HuangZhenQiu

@cshuo
Contributor

cshuo commented Jul 21, 2025

cc @danny0405 @zhangyue19921010 for final review.

@HuangZhenQiu HuangZhenQiu force-pushed the HUDI-9504-batch-sort branch 4 times, most recently from e3058eb to a1ae791 on July 29, 2025 01:01
@HuangZhenQiu HuangZhenQiu force-pushed the HUDI-9504-batch-sort branch from a1ae791 to 9446081 on July 29, 2025 17:09
@hudi-bot

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

*
* @throws IOException
*/
private void sortAndSend() throws IOException {
Contributor

During sort and send, we still use a common write helper, which means this sort is still a partial sort?

Contributor

Based on our experience, global sorting for Parquet files requires additional memory and inevitably leads to small-file issues, although the sorting process itself is highly efficient (accounting for less than 5% in flame graphs). For partial sorting implementations, I recommend uniformly updating all references (including configuration names, javadocs, method names, etc.) to use the term "Partial Order" to prevent unnecessary confusion among other users.

.key("write.sort.enabled")
.booleanType()
.defaultValue(false) // default no sort
.withDescription("Whether to enable buffer sort within append write function.");
Contributor

I still think we need to implement complete sorting semantics.

We can refer to StreamWriteFunction and use buckets to control the flush implementation.

Fortunately, most of the code in this PR can be reused; we just need to improve the flush control.
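
A rough sketch of the "complete sorting semantics" being asked for, using hypothetical helper names (not the actual StreamWriteFunction or write-helper APIs): buffer everything destined for one file as a bucket, sort the whole bucket once at flush time, and hand the fully ordered batch to the writer so each file is strictly ordered.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration: each bucket is sorted as a whole right before it is
// written, so every output file is strictly ordered instead of containing
// several independently sorted batches.
class SortedBucketFlusher<T> {
  private final List<T> bucket = new ArrayList<>();
  private final Comparator<T> sortOrder;
  private final Consumer<List<T>> fileWriter;  // stands in for the actual write helper

  SortedBucketFlusher(Comparator<T> sortOrder, Consumer<List<T>> fileWriter) {
    this.sortOrder = sortOrder;
    this.fileWriter = fileWriter;
  }

  void add(T record) {
    bucket.add(record);            // buffer until flush (e.g., memory limit or checkpoint)
  }

  void flush() {
    bucket.sort(sortOrder);        // one sort over the whole bucket -> fully ordered file
    fileWriter.accept(new ArrayList<>(bucket));
    bucket.clear();
  }
}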

Labels
size:L PR with lines of changes in (300, 1000]
6 participants