
[HUDI-9527] Switch to HoodieFileGroupReader in HoodieTableMetadataUtil #13445


Closed

Conversation

Contributor

@the-other-tim-brown the-other-tim-brown commented Jun 16, 2025

Change Logs

  • Removes usage of HoodieMergedLogRecordScanner in HoodieTableMetadataUtil and replaces it with HoodieFileGroupReader
  • Fixes handling of deleted records when reading as HoodieRecord

Impact

  • Uses the new standard way of reading

Risk level (write none, low, medium, or high below)

Low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Jun 16, 2025
@the-other-tim-brown the-other-tim-brown marked this pull request as ready for review June 16, 2025 21:17
Comment on lines 1021 to 1023
HoodieTableMetaClient datasetMetaClient,
Option<Schema> writerSchemaOpt,
String latestCommitTimestamp,
Contributor:

Incorrect indentation.

@@ -171,9 +183,6 @@ public Comparable convertValueToEngineType(Comparable value) {

@Override
public InternalRow getDeleteRow(InternalRow record, String recordKey) {
if (record != null) {
Contributor:

Why remove this? The record should already have been marked as a delete in this case.

Contributor Author:

The Spark row doesn't have an operation type field like the Flink RowData does. Now this will always use an InternalRow to represent the row.

Contributor:

Are you saying the records written to a log data block from Spark are never deletes? That is not true when a customized delete marker is present.

Can we return a HoodieInternalRow with isDelete set up in there?

Contributor Author:

When we read the data from the log block, it is not going to have an operation type set. We can determine it is a delete by inspecting the record's _hoodie_is_deleted field, by a custom field used to mark deletes, or, in the case of custom mergers and payloads, by the merger outputting an empty Option to signal the delete.
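(For illustration only, not code from this PR: a minimal sketch of the three detection paths described above, using plain Avro types. The helper class and the custom-marker parameters are assumptions; only the _hoodie_is_deleted meta field and the empty-merge-result convention come from the discussion.)

import java.util.Optional;
import org.apache.avro.generic.GenericRecord;

// Sketch only: ways a log-block record can be recognized as a delete.
final class DeleteDetectionSketch {
  // 1. The standard meta field set to true when the payload flags a delete.
  static boolean deletedByMetaField(GenericRecord record) {
    return record.getSchema().getField("_hoodie_is_deleted") != null
        && Boolean.TRUE.equals(record.get("_hoodie_is_deleted"));
  }

  // 2. A table-specific delete marker, e.g. an "op" column equal to "d" (assumed names).
  static boolean deletedByCustomMarker(GenericRecord record, String markerField, Object markerValue) {
    return record.getSchema().getField(markerField) != null
        && markerValue.equals(record.get(markerField));
  }

  // 3. A custom merger/payload signalling the delete by producing no merged record at all.
  static boolean deletedByMergeResult(Optional<GenericRecord> mergeResult) {
    return !mergeResult.isPresent();
  }
}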

Contributor (@danny0405), Jun 18, 2025:

Returning an InternalRow with just the record key and partition path set up will lose the other payload fields; not sure if it works for CDC read scenarios (the deletes for retraction).

Contributor Author:

I updated the handling here to simply wrap the row if it is non-null, so we can still detect that it is a delete but otherwise do not modify it.

public HoodieRecord<InternalRow> constructHoodieRecord(InternalRow row, Schema schema, Option<String> orderingFieldName) {
if (row instanceof HoodieInternalRow && ((HoodieInternalRow) row).isDeleteOperation()) {
return new HoodieEmptyRecord<>(
new HoodieKey(row.getUTF8String(HoodieRecord.RECORD_KEY_META_FIELD_ORD).toString(), partitionPath),
Contributor:

If the row has already been marked as a delete, can we just construct a HoodieSparkRecord?

Contributor Author:
The HoodieSparkRecord will be marked as a delete if the row is null. Using an empty record is the common pattern adopted across the engines.
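(Illustration only, with placeholder types rather than Hudi classes: the "empty record for deletes" dispatch the comment refers to. A delete keeps only the key; a non-delete wraps the engine row.)

// Sketch of the cross-engine pattern: key-only record for deletes, full row otherwise.
final class RecordDispatchSketch {
  interface RecordLike { String key(); }                    // stand-in for HoodieRecord<T>

  static final class EmptyRecord implements RecordLike {    // stand-in for HoodieEmptyRecord
    private final String key;
    EmptyRecord(String key) { this.key = key; }
    @Override public String key() { return key; }
  }

  static final class RowRecord<T> implements RecordLike {   // stand-in for the engine-specific record
    private final String key;
    private final T row;
    RowRecord(String key, T row) { this.key = key; this.row = row; }
    @Override public String key() { return key; }
    T row() { return row; }
  }

  static <T> RecordLike construct(String key, T row, boolean isDelete) {
    return isDelete ? new EmptyRecord(key) : new RowRecord<>(key, row);
  }
}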

Contributor:

@the-other-tim-brown, @danny0405, should we fix HoodieDeleteHelper as well, since for Avro we use HoodieAvroRecord with EmptyHoodieRecordPayload? What about Flink and Hive?

Contributor Author:
@linliu-code what is the impact of using EmptyHoodieRecordPayload vs HoodieEmptyRecord?

@@ -120,7 +118,7 @@ public GenericRecord convertToAvroRecord(IndexedRecord record, Schema schema) {

@Override
public IndexedRecord getDeleteRow(IndexedRecord record, String recordKey) {
- throw new UnsupportedOperationException("Not supported for " + this.getClass().getSimpleName());
+ return new DeleteRecord(recordKey, record);
Contributor:

The API is designed to return a record with the record key already set up in the record fields instead of a wrapper. Things like DeleteRecord.getRecordKey are not a general API for readers.

Contributor Author:
Updated this to match the pattern with HoodieInternalRow

* @param orderingFieldName The name of the ordering field, if any.
* @return A new instance of {@link HoodieRecord}.
*/
public abstract HoodieRecord<T> constructHoodieRecord(T record, Schema schema, Option<String> orderingFieldName);
Contributor:
I don't think we need this.

Contributor Author:

Without this, we will need a way to determine whether the record is a delete record even when the data may not be set. If there is a way to do this across all engine types (not just Flink), then we can use that. Without this, you will get runtime errors when trying to read as a HoodieRecord.

Contributor:

> If there is a way to do this across all engine types (not just Flink), then we can use that.

Let's figure out a solution for it.

Contributor Author:

The solution is already proposed here and it works.

Contributor Author:

The only other option I see is to make HoodieReaderContext expose an isDeleteOperation(T record) method and then use that to set the value of isDelete on this line, but that is just a roundabout way to get back to the same result.
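(Illustration only: roughly what such a hook could look like. The interface and row type below are placeholders standing in for HoodieReaderContext and the engine row; only the isDeleteOperation name and the emitDelete usage mirror the excerpts in this thread.)

// Sketch: an engine-level "is this record a delete?" hook.
interface ReaderContextSketch<T> {
  boolean isDeleteOperation(T record);
}

// Stand-in for an engine row that carries a delete flag (like HoodieInternalRow#isDeleteOperation).
final class FlaggedRow {
  final Object[] values;
  final boolean deleteOperation;
  FlaggedRow(Object[] values, boolean deleteOperation) {
    this.values = values;
    this.deleteOperation = deleteOperation;
  }
}

final class FlaggedRowContext implements ReaderContextSketch<FlaggedRow> {
  @Override
  public boolean isDeleteOperation(FlaggedRow record) {
    return record != null && record.deleteOperation;
  }
}

// Caller side, mirroring a later excerpt: boolean isDelete = emitDelete && context.isDeleteOperation(next);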

Contributor:

Let's give it a try. In the end, I think we should introduce our engine-agnostic data structures there to unify all the discrepancies.

Contributor Author:

Another option is to have the FGReader return BufferedRecords directly, which will already have the context for whether a row represents a delete. There are some other changes that may need to come with that, like making the ordering value lookup lazy to avoid extra overhead.

Contributor:

The tricky part here is that we could only have a BufferedRecord when there are log files or merging.

Contributor Author:
Pushed an update to add a check for whether the record is a delete operation in the reader context

this.isDeleteOperation = false;
}

private HoodieInternalRow(UTF8String recordKey,
Contributor:

If this is private anyway, can you take isDeleteOperation in as an argument so we can set it to true on line 114? Just that, looking at the constructor arguments, it's not very apparent that this is for a delete operation.

Contributor Author:
Yes, makes sense. I've cleaned up the constructors to all route through a common all-args constructor.
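(Illustration only, with a simplified field set rather than the real HoodieInternalRow: the "route everything through one all-args constructor" pattern, so flags such as isDeleteOperation are assigned in exactly one place.)

final class InternalRowSketch {
  private final String recordKey;
  private final String partitionPath;
  private final boolean isDeleteOperation;

  // Public constructor for a regular row: delegates with the flag defaulted to false.
  InternalRowSketch(String recordKey, String partitionPath) {
    this(recordKey, partitionPath, false);
  }

  // Factory for a delete row: delegates with the flag set to true.
  static InternalRowSketch deleteRow(String recordKey, String partitionPath) {
    return new InternalRowSketch(recordKey, partitionPath, true);
  }

  // The single all-args constructor: the only place fields are assigned.
  private InternalRowSketch(String recordKey, String partitionPath, boolean isDeleteOperation) {
    this.recordKey = recordKey;
    this.partitionPath = partitionPath;
    this.isDeleteOperation = isDeleteOperation;
  }

  String recordKey() { return recordKey; }
  String partitionPath() { return partitionPath; }
  boolean isDeleteOperation() { return isDeleteOperation; }
}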

private final String[] metaFields;
private final IndexedRecord record;

public DeleteIndexedRecord(String recordKey, String partitionPath, IndexedRecord record) {
Contributor:
IndexedDeleteRecord

@@ -58,6 +58,8 @@ public class HoodieInternalRow extends InternalRow {
*/
private final UTF8String[] metaFields;
private final InternalRow sourceRow;
// indicates whether this row represents a delete operation. Used in the CDC read.
private final boolean isDeleteOperation;
Contributor:

Why not just isDeleted?

Contributor Author:
This was to better align with the operation type used in Flink

@@ -107,7 +107,7 @@ public int hashCode() {

@Override
public String toString() {
return "DeleteRecord {"
return "DeleteIndexedRecord {"
Contributor:
should this be reverted?

Contributor Author:

Yes, this was a bad find-and-replace on my part.

HoodieReaderContext<T> readerContext) {
if (writerSchemaOpt.isPresent() && !logFilePaths.isEmpty()) {
List<HoodieLogFile> logFiles = logFilePaths.stream().map(HoodieLogFile::new).collect(Collectors.toList());
FileSlice fileSlice = new FileSlice(partitionPath, logFiles.get(0).getFileId(), logFiles.get(0).getDeltaCommitTime());
Contributor:

The delta commit time of the first log file may not be the file slice's base instant time, right? But I see here we are setting it so. Can you confirm this does not have any other side effects?

Contributor Author:

Right now we don't rely on the instant time anywhere in the FileGroupReader, but we can change this if there is a way to set it properly.

.withFileSlice(fileSlice)
.withDataSchema(tableSchema)
.withRequestedSchema(HoodieAvroUtils.getRecordKeySchema())
.withLatestCommitTime(metaClient.getActiveTimeline().filterCompletedInstants().lastInstant().map(HoodieInstant::requestedTime).orElse(""))
Contributor:

Can we compute the latestInstant in the driver and let the executors access it directly?

Contributor Author:
Yes, updated

@@ -850,6 +846,7 @@ public static HoodieData<HoodieRecord> convertMetadataToRecordIndexRecords(Hoodi
StorageConfiguration storageConfiguration = dataTableMetaClient.getStorageConf();
Option<Schema> writerSchemaOpt = tryResolveSchemaForTable(dataTableMetaClient);
Option<Schema> finalWriterSchemaOpt = writerSchemaOpt;
ReaderContextFactory<T> readerContextFactory = engineContext.getReaderContextFactory(dataTableMetaClient);
Contributor:

Curious, should we consider initializing InstantRange?

Contributor Author:

I don't see it set in the current code; what would be the side effect of not adding it?

Comment on lines 202 to 204
if (operation == HoodieOperation.DELETE) {
return new HoodieEmptyRecord<>(hoodieKey, HoodieOperation.DELETE, getOrderingValue(rowData, schema, orderingFieldName), HoodieRecord.HoodieRecordType.FLINK);
}
Contributor:

We can probably use the HoodieDeleteHelper.createDeleteRecord function.

@@ -384,10 +384,8 @@ public ClosableIterator<T> getClosableIterator() throws IOException {
* @return An iterator over the records that wraps the engine-specific record in a HoodieRecord.
*/
public ClosableIterator<HoodieRecord<T>> getClosableHoodieRecordIterator() throws IOException {
return new CloseableMappingIterator<>(getClosableIterator(), nextRecord -> {
BufferedRecord<T> bufferedRecord = BufferedRecord.forRecordWithContext(nextRecord, readerContext.getSchemaHandler().getRequestedSchema(), readerContext, orderingFieldName, false);
Contributor:

If we fix the isDelete flag here, can the newly introduced #constructHoodieRecord be eliminated?

Contributor Author:

Yes, it just needs some way to set this properly.

@@ -41,12 +42,12 @@
*/
public class BufferedRecord<T> implements Serializable {
private final String recordKey;
- private final Comparable orderingValue;
+ private final Lazy<Comparable> orderingValue;
Contributor Author:

When deriving the BufferedRecord from the record with emitDeletes, we can run into issues trying to extract the ordering value since the record is null. This allows us to only perform that evaluation if required.
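(Illustration only, a generic memoizing supplier rather than Hudi's Lazy type: the ordering value is computed at most once, and never at all for a delete whose payload is null unless something actually asks for it.)

import java.util.function.Supplier;

final class LazyValue<V> {
  private Supplier<V> supplier;   // dropped after the first evaluation
  private V value;
  private boolean evaluated;

  LazyValue(Supplier<V> supplier) {
    this.supplier = supplier;
  }

  synchronized V get() {
    if (!evaluated) {
      value = supplier.get();     // e.g. extract the ordering field from the engine row
      supplier = null;
      evaluated = true;
    }
    return value;
  }
}

// Usage sketch (extractOrderingValue is a hypothetical helper):
// new LazyValue<>(() -> extractOrderingValue(row, schema, orderingFieldName))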

Contributor:

We already cache the orderingVal in HoodieRecord; the Lazy wrapper will also increase the number of bytes to serialize for the spillable map.

byte[] avroBytes = recordSerializer.serialize(record.getRecord());
output.writeInt(avroBytes.length);
output.writeBytes(avroBytes);
byte[] recordBytes = record.getRecord() == null ? new byte[0] : recordSerializer.serialize(record.getRecord());
Contributor Author:

Previously, this would fail for deletes where the record is null.
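(Illustration only, using plain java.io streams instead of the record serializer and Kryo output in the actual code: a zero length on the wire stands for a delete whose payload is null, matching the excerpt above; the read side is an assumed counterpart.)

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class NullSafePayloadIo {
  // Write: a null payload is encoded as length 0 with no bytes.
  static void write(DataOutputStream out, byte[] recordBytes) throws IOException {
    byte[] safeBytes = recordBytes == null ? new byte[0] : recordBytes;
    out.writeInt(safeBytes.length);
    out.write(safeBytes);
  }

  // Read: length 0 is decoded back to a null payload (a delete).
  static byte[] read(DataInputStream in) throws IOException {
    int length = in.readInt();
    if (length == 0) {
      return null;
    }
    byte[] bytes = new byte[length];
    in.readFully(bytes);
    return bytes;
  }
}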


@@ -192,7 +192,7 @@ public SizeEstimator<BufferedRecord<T>> getRecordSizeEstimator() {
}

public CustomSerializer<BufferedRecord<T>> getRecordSerializer() {
- return new DefaultSerializer<>();
+ return new BufferedRecordSerializer<>();
Contributor:

Why this change? It seems unnecessary based on the benchmark: #13408 (comment)

private final ByteArrayOutputStream baos;
private final RecordSerializer<T> recordSerializer;
// Caching kryo serializer to avoid creating kryo instance for every serde operation
private static final ThreadLocal<InternalSerializerInstance> SERIALIZER_REF =
Contributor:

The Kryo creation happens once for each file group reader, so the cost should be low?
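(Illustration only, with a stand-in serializer type rather than the real Kryo instance: the ThreadLocal caching pattern in question, where each thread builds the serializer once and reuses it for every serde call.)

import java.nio.charset.StandardCharsets;

final class SerializerCacheSketch {
  // Stand-in for a serializer that is expensive to construct (e.g. a Kryo instance with registrations).
  static final class ExpensiveSerializer {
    byte[] serialize(Object value) {
      return String.valueOf(value).getBytes(StandardCharsets.UTF_8);
    }
  }

  // One instance per thread, created lazily on first use and then reused.
  private static final ThreadLocal<ExpensiveSerializer> SERIALIZER_REF =
      ThreadLocal.withInitial(ExpensiveSerializer::new);

  static byte[] serialize(Object value) {
    return SERIALIZER_REF.get().serialize(value);
  }
}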

@@ -389,7 +392,8 @@ public ClosableIterator<T> getClosableIterator() throws IOException {
*/
public ClosableIterator<HoodieRecord<T>> getClosableHoodieRecordIterator() throws IOException {
return new CloseableMappingIterator<>(getClosableIterator(), nextRecord -> {
BufferedRecord<T> bufferedRecord = BufferedRecord.forRecordWithContext(nextRecord, readerContext.getSchemaHandler().getRequestedSchema(), readerContext, orderingFieldName, false);
boolean isDelete = emitDelete && readerContext.isDeleteOperation(nextRecord);
Contributor:

A delete is a DELETE; you cannot emit a delete when emitDelete is false.

@@ -128,6 +130,7 @@ private HoodieFileGroupReader(HoodieReaderContext<T> readerContext, HoodieStorag
this.props = props;
this.start = start;
this.length = length;
this.emitDelete = emitDelete;
Contributor:
I don't think we need this.

Contributor

@danny0405 danny0405 commented Jun 19, 2025

@the-other-tim-brown I think the emitDeletes support for the HoodieRecord iterator brings in more overhead than I thought. Can we drop it in this PR? The delete-key fetching should only be used in the legacy code path; now that we have streaming write to the MDT, that code should be removed in the future anyway (once the streaming write is stable).

emitDeletes is introduced mainly for streaming read scenarios with engine-specific rows.

Also, can we move the size estimation changes into a separate PR to make the review of the current one easier?

Contributor Author (@the-other-tim-brown):


@danny0405 I don't understand what you are recommending here. Streaming write to the MDT is only used for Spark, and I don't think there are plans to use it for other engines.

Contributor Author (@the-other-tim-brown):

Started a new PR with the same end result of moving off the deprecated code but with a smaller changeset: #13470

Contributor (@danny0405):

> Streaming write to the MDT is only used for Spark and I don't think there are plans to use it for other engines.

That's true. For Flink and Java, there is not even a solution or plan to support the RLI there; that is why I said code like constructing the RLI from files should be deemed legacy.

@the-other-tim-brown the-other-tim-brown deleted the HUDI-9527 branch June 23, 2025 00:04