-
Notifications
You must be signed in to change notification settings - Fork 2.4k
[HUDI-9505] Hudi 1.1 blocker code change of index look up #13414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-9505] Hudi 1.1 blocker code change of index look up #13414
Conversation
bff05a9
to
a18a18b
Compare
cccb3cc
to
7f6b80b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it deserves a new table version based on the fact there is no file format or layout change, maybe just a new metadata config should be enough.
Hi Danny, thanks for chiming in here. I added more details in the PR descriptions. If your concerns persists, let's have a discussion with Ethan and vc regarding the decision making. Thank you! |
There is layout change in the secondary index partition in MDT where the file group mapping from the record key to the file group index is changed. Unless the table version is changed and the mapping or sharding (we can decide a name to use) algorithm is stored in the table config, there is no way to differentiate how secondary index is sharded. |
In general, I prefer the mapping or sharding (we can decide a name to use) algorithm to be stored in the table config along with the table version upgrade so it's easier to change the mapping if needed and the reader can use that to determine how to look up the index. There are at least two occurrences, this secondary index and partitioned record index, in which having the new table config would help migration and avoid table version upgrade in the future. |
eef34e1
to
f202555
Compare
Mulling if we should change this constraint instead. This may be a good point to introduce a index layout version and have that inside the index definition metadata. Similar to log files. This will help readers/writers evolve index storage. for e.g compaction could seamlessly upgrade index partitions. |
Prefer index definition. vs adding a new table config. |
In general I prefer adding the info in the |
@yihua asked for PR owner @Davis-Zhang-Onehouse to work on the following AI:
Proposed changeIntroducing a new attribute "partitionStrategy". Possible values:
![]() Also from the schema evolution perspective about this json. If we didn't find such field, depends on the index type, we infer the default strategy. For SI, in case we could not find the value, it means using Implementation detailsRead pathWhenever we do lookup using indexes, we will read the index def file and add partitionStrategy value to the context. If it is Also for all the other indexes, we need to do the same as they share the same code path. It is an index look up interface change so it impacts code shared by all indexes. Write pathIf it is creating a new interface, we always write partitionStrategy attribute in the strategy. For SI it will be of value HASH_ON_SECONDARY_KEY_RECORD_KEY. For others, we will come up name mapping to their existing behavior. If it is updating existing indexes due to change of data, follow the "Read path" logic. Miscrevert the table version related code change. @yihua please comment on the proposed plan when you get a chance. thanks |
Looks good to me overall. |
I prefer not to introduce a new concept like Instead I propose
|
Actually this proposed index layout version is more general. I prefer this too. |
@yihua @vinothchandar new items to factor in: backwards and forward compatibility. I spotted major issues and the PR is blocked on your feedback CompatibilityForward compatibilityIf SI using version 2 (hash partition on data column value only) and hudi is of old binary, what happens is hudi does not has the concept of index version and will treat the new SI version as if it is the old one. As a result, Read pathit should be fine since it will use prefix lookup which naturally compatible with the new partition strategy. Write pathWrite path is messed up as the old hudi binary will write to new index version with old partition strategy. Backward compatibilitythis should be fine as the new hudi binary will properly recognize the version (or the absence of the version) and adapt properly. Details are covered in the previous thread. Fundamental limitation of the index version designOld hudi binary only recognize and respect table version. Introducing index version means user must use a version that recognize and honor this. In industry the standard procedure is
|
An alternative approach - create SI V2 as a new index categoryInstead of creating SI with multiple versions, we can do new version of SI as a completely different index type - we write to different metadata partition path, etc. There is no compatibility issue anymore. Also with the approch, the old binary does not know the new SI version, so it should not do anything with it (not sure if we follow this compatibility best practice). The new binary can do the following:
This avoids version management of any form. But the old binary must tolerate unrecognized MDT partitions which I'm not sure |
@Davis-Zhang-Onehouse at a high level, both the reader and writer paths should follow the new index layout version added in the index definition (i.e., the index layout version is Here are more details on how it should work:
Upgrade and downgrade:
MDT reader and write compatibility:
|
A new metadata partition path should be avoided for only layout difference. That's why the layout version plays a role. |
ok, so the solution still requires table version upgrade, sorry I should have read vc's reply. |
945a9e9
to
0c87abf
Compare
e3e7001
to
7692e90
Compare
CI failures are not relevant to the code changes |
feff82f
to
04e3a63
Compare
@Davis-Zhang-Onehouse When you get a chance, please bring the PR description inline with the approach now implemented |
done |
04e3a63
to
afa2a89
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a pass. Can take another once you address some of these.
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
Outdated
Show resolved
Hide resolved
...di-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
Outdated
Show resolved
Hide resolved
public Map<ConfigProperty, String> upgrade(HoodieWriteConfig config, HoodieEngineContext context, | ||
String instantTime, SupportsUpgradeDowngrade upgradeDowngradeHelper) { | ||
HoodieTable table = upgradeDowngradeHelper.getTable(config, context); | ||
SecondaryIndexUpgradeDowngradeHelper.dropSecondaryIndexPartitions(config, context, table, "upgrading"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to "upgrading to table version 9"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to recreate the index as well? or we push that to the first write after upgrade
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
user needs to explicitly recreate the index. we will make sure it is captured in the doc update.
If we want first write after upgrade to auto recreate, let's plan for the priority and delivery time. As MVP I didn't include this.
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
Outdated
Show resolved
Hide resolved
@@ -134,6 +140,8 @@ public void createOrUpdateColumnStatsIndexDefinition(HoodieTableMetaClient metaC | |||
.withIndexType(PARTITION_NAME_COLUMN_STATS) | |||
.withIndexFunction(PARTITION_NAME_COLUMN_STATS) | |||
.withSourceFields(columnsToIndex) | |||
// Use the existing version if exists, otherwise fall back to the default version. | |||
.withVersion(getExistingHoodieIndexVersionOrDefault(PARTITION_NAME_COLUMN_STATS, metaClient)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think we want to hand control of what version is written to user - right? the versions should be picked by index def building for writing internally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, user cannot control.
then why we have "existing" or "default"?
- Existing: if index is already created, write using the existing version.
- default: if index is not created (first time creating index), use index version based on the table version
JavaRDD<String> partitionedKeyRDD = HoodieJavaRDD.getJavaRDD(records) | ||
.map(HoodieRecord::getRecordKey) | ||
.keyBy(k -> HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(k, numFileGroups)) | ||
.keyBy(k -> HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(k, numFileGroups, partitionPath, | ||
getExistingHoodieIndexVersionOrDefault(partitionPath, hoodieTable.getMetaClient()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
callout: we should ensure none of these calls from executors read the index defs json. it should be read once on the driver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's an excellent callout. I reviewed all call site and revised as appropriate.
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
Show resolved
Hide resolved
...spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
Outdated
Show resolved
Hide resolved
@@ -294,18 +303,20 @@ public Map<String, HoodieRecordGlobalLocation> readRecordIndex(List<String> reco | |||
* @param secondaryKeys The list of secondary keys to read | |||
*/ | |||
@Override | |||
public Map<String, HoodieRecordGlobalLocation> readSecondaryIndex(List<String> secondaryKeys, String partitionName) { | |||
public HoodiePairData<String, HoodieRecordGlobalLocation> readSecondaryIndex(HoodieData<String> secondaryKeys, String partitionName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you point me to changes where we read v1 and v2 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or we assume we just read v2.. Reason is - the writer version can be 6,8,9.. and if its 6, don't we write v1 of SI? so the code here, shoudl read both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this PR I didn't add the change and it does not break anything.
write path: it does the proper hashing based on index, so it is good
read path: it read all file index and prefix matching which is compatible to handle both v1 and v2.
we have the code change of what you want in the productionization PR, which is separate from what this PR is - minimum code change to unblock hudi 1.x.
The read path has larger code change and we should not do everything in 1 single giant PR
faef689
to
9d86873
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinothchandar @Davis-Zhang-Onehouse
Curerntly the SI is not even been utilized in any community user production use cases yet, is it okay we just migrate to new hashing func for SI and add doc to our site the SI in 1.0.2 and 1.1 are not compatible, add tutorials to guid user how to do a munal upgrade then? So that we avoid the chaos introduced in this PR.
each index comes with its version number. So later storage change on index do not require table version upgrade anymore
This is not true, as long as the index layout changes, a table upgrade/downgrade is required.
Also for all the other indexes, we need to do the same as they share the same code path. It is an index look up interface change so it impacts code shared by all indexes.
Can we limit the change to just SI as much as possible, I don't think the other index types need a multiple version in the mid-term future.
@danny0405 we need a versioning schema for index storage as well, just like log blocks.. So this is useful foundation work regardless |
Change Logs
<secondary col value>_<record col value>: null
records of SI were hash partitioned by
<secondary col value>_<record col value>
Now we change it to hash partitioned by
<secondary col value>
, the storage format is unchanged.The motivation is that the index query pattern typically copes with cases where only "secondary col value" are available in the lookup query
by changing the partitioning strategy, we can do SI file group pruning to ensure we only involve file groups of interest instead of scanning all SI records. The downsides of skewed record distributions of SI records across file groups are well understood.
Because we change the partition strategy of SI, the change cannot go out as it is. The functional requirement is "when the code change is actively functioning, reader and writer path will all function as if SI file groups are partitioned based on
<secondary col value>
", and we DO NOT want to support old & new modes at the same time. The new strategy does not work with the existing SI data layout if a table already comes with SI and is hash partitioned by<secondary col value>_<record col value>
.Without extra protection, SI failed to lookup the records from the file groups it needs due to the file group pruning enforced by the new read pattern, causing data correctness issue. As discussed with hudi PMCs, we agree table version upgrade is indispensable.For future extensibility, we introduced index version field in index_def.json. Starting table version 9, each index comes with its version number. So later storage change on index do not require table version upgrade anymore. If the field is not present, then value version 1 is assumed.
We handled the differences between SI layouts 1 and 2, inside the reader/write code. Inside the code, the layout version is handled specifically for each index type.
We still need to bump up the table version (to handle a downgrade where SI layout 2, can't be read by releases < 1.1).
https://issues.apache.org/jira/browse/HUDI-9516
Impact
Faster SI lookup with acceptable downsides as explained above.
Risk level (write none, low medium or high below)
None
Documentation Update
Need to work on table version upgrade guide. Tracked by
https://issues.apache.org/jira/browse/HUDI-9505
Contributor's checklist