Iceberg metadata super optimization #43460

stephen-shelby · 2024-04-01T08:15:47Z

Enhancement

This issue tracks Iceberg metadata optimization patches. The first patch is #43459.

There are currently four important optimizations:

Iceberg metadata distributed plan. (Performance improves by roughly n to 2n times, where n is the number of BEs.)
Iceberg manifest cache. (Job planning latency is reduced to around 100 ms.)
A distributed plan framework for external tables, based on metadata files.
Refactoring of some Iceberg metadata components.
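
The n to 2n speedup noted for the distributed plan comes from spreading manifest reading across BEs instead of doing all of it in one place. A minimal sketch of that idea follows, assuming a plain round-robin assignment. The class `ManifestAssignmentSketch`, the method `assign`, and the manifest file names are hypothetical and are not taken from the StarRocks code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not the StarRocks implementation.
final class ManifestAssignmentSketch {
    // Assigns manifest files to backends round-robin, so each of the
    // backendCount backends reads roughly 1/backendCount of the manifests.
    static List<List<String>> assign(List<String> manifests, int backendCount) {
        if (backendCount <= 0) {
            throw new IllegalArgumentException("backendCount must be positive");
        }
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < backendCount; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int i = 0; i < manifests.size(); i++) {
            buckets.get(i % backendCount).add(manifests.get(i));
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Made-up manifest names, used only to show the assignment.
        List<String> manifests = List.of("m0.avro", "m1.avro", "m2.avro", "m3.avro", "m4.avro");
        List<List<String>> plan = assign(manifests, 3);
        for (int be = 0; be < plan.size(); be++) {
            System.out.println("BE " + be + " reads " + plan.get(be));
        }
    }
}
```

Round-robin ignores manifest size and BE load; a load-aware strategy would replace the `i % backendCount` choice.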

There is still a lot of optimization work to come, including the following:

Add more automatic distribution strategies based on BE load, such as CPU, I/O, and memory.
Enhance time travel statements such as AS OF.
Refactor the current split structures, such as RemoteFileInfo.
Provide a default optimal threshold for distributed planning, determined through extensive testing.
Adapt the partition data parser to tables with partition evolution.
Continuously optimize performance.
Add cache metrics such as hit rate.
Provide benchmarks to guarantee performance.
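
For the cache metrics item, a hit rate is simply hits divided by total lookups. The sketch below shows one way to count both on a small LRU cache. It is an assumption-laden illustration: `CountingLruCacheSketch`, its `get` and `hitRate` methods, and the string keys are hypothetical and do not describe the actual manifest cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only: not the StarRocks manifest cache.
final class CountingLruCacheSketch<K, V> {
    private final int capacity;
    private final LinkedHashMap<K, V> entries;
    private long hits;
    private long misses;

    CountingLruCacheSketch(int capacity) {
        this.capacity = capacity;
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > CountingLruCacheSketch.this.capacity;
            }
        };
    }

    // Returns the cached value, or loads and stores it on a miss.
    // A loader that returns null is counted as a miss on every call.
    synchronized V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value != null) {
            hits++;
            return value;
        }
        misses++;
        value = loader.apply(key);
        entries.put(key, value);
        return value;
    }

    // Fraction of lookups served from the cache; 0.0 before any lookup.
    synchronized double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        CountingLruCacheSketch<String, Integer> cache = new CountingLruCacheSketch<>(2);
        cache.get("manifest-a", String::length); // miss
        cache.get("manifest-a", String::length); // hit
        System.out.println("hit rate = " + cache.hitRate());
    }
}
```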

…uted_plan interface (#45404)

Why I'm doing: We use ConnectContext.get() to read session variables during Iceberg job planning. If planning does not run in the query thread, the connect context cannot be obtained. There are two cases:
1. In prepareMetadata, job planning is executed in the thread pool.
2. In distributed plan mode, the metadata collect job is executed in another thread.
Both can also hold at once (1 && 2): plan_mode is distributed and planning runs under the thread pool. So the connect context handling must be adapted to both cases.

What I'm doing: Adapting ConnectContext and Tracers to multi-threading.

Fixes #43460

Signed-off-by: stephen <stephen5217@163.com>
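
The problem described in this commit is the usual thread-local one: a context stored per thread is invisible to a task that runs in a pool thread. A minimal sketch of one common remedy follows, capturing the caller's context and reinstalling it inside the worker. `SessionContextSketch` and `propagate` are hypothetical stand-ins; this is not the ConnectContext class and not necessarily how #45404 solves it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only: SessionContextSketch stands in for a
// thread-local context such as ConnectContext; it is not the StarRocks class.
final class SessionContextSketch {
    private static final ThreadLocal<SessionContextSketch> CURRENT = new ThreadLocal<>();
    final String planMode;

    SessionContextSketch(String planMode) {
        this.planMode = planMode;
    }

    static SessionContextSketch get() {
        return CURRENT.get();
    }

    static void set(SessionContextSketch ctx) {
        CURRENT.set(ctx);
    }

    static void clear() {
        CURRENT.remove();
    }

    // Captures the caller's context now and reinstalls it in whichever
    // thread later runs the task, restoring that thread's state afterwards.
    static <T> Callable<T> propagate(Callable<T> task) {
        SessionContextSketch captured = get();
        return () -> {
            SessionContextSketch previous = get();
            set(captured);
            try {
                return task.call();
            } finally {
                if (previous == null) {
                    clear();
                } else {
                    set(previous);
                }
            }
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            set(new SessionContextSketch("distributed"));
            // Without propagate(), get() would return null in the pool thread.
            Callable<String> readMode = () -> get().planMode;
            System.out.println(pool.submit(propagate(readMode)).get());
        } finally {
            clear();
            pool.shutdown();
        }
    }
}
```

Restoring the previous value in `finally` matters because pool threads are reused, and a leaked context would be visible to the next unrelated task.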

stephen-shelby added the type/enhancement label Apr 1, 2024

stephen-shelby mentioned this issue Apr 1, 2024

[Feature] introduce iceberg metadata super optimization #43459

Closed

24 tasks

Dshadowzh mentioned this issue Apr 12, 2024

StarRocks Roadmap 2024 #39686

Open

60 tasks

stephen-shelby linked a pull request Apr 22, 2024 that will close this issue

[Feature] Introduce meta spec interface #44527

Merged

24 tasks

stephen-shelby mentioned this issue Apr 22, 2024

[Refactor] adjust some iceberg metadata config #44562

Merged

24 tasks

andyziye closed this as completed in #44527 Apr 23, 2024

andyziye reopened this Apr 23, 2024

stephen-shelby mentioned this issue Apr 23, 2024

[Enhancement] Implement iceberg metadata scan node #44581

Merged

24 tasks

stephen-shelby linked a pull request Apr 23, 2024 that will close this issue

[Enhancement] Implement iceberg metadata scan node #44581

Merged

24 tasks

stephen-shelby closed this as completed in #44581 Apr 23, 2024

stephen-shelby mentioned this issue Apr 23, 2024

[Enhancement] iceberg metadata reader execution #44632

Merged

24 tasks

stephen-shelby linked a pull request Apr 23, 2024 that will close this issue

[Enhancement] iceberg metadata reader execution #44632

Merged

24 tasks

This was referenced Apr 24, 2024

[Enhancement] support to collect column statistics on iceberg distributed plan #44647

Merged

[Enhancement] support metadata collect job #44679

Merged

stephen-shelby linked a pull request Apr 24, 2024 that will close this issue

[Enhancement] support metadata collect job #44679

Merged

24 tasks

stephen-shelby mentioned this issue Apr 24, 2024

[Enhancement] Iceberg metadata distributed plan e2e #44703

Merged

24 tasks

This was referenced May 15, 2024

[Feature] cp iceberg metadata optimize to branch-3.3 #45637

Closed

[Feature] iceberg metadata super optimize (backport #45479) #45640

Merged

stephen-shelby linked a pull request May 15, 2024 that will close this issue

[Feature] iceberg metadata super optimize (backport #45479) #45640

Merged

24 tasks

wanpengfei-git reopened this May 15, 2024

wanpengfei-git added the version:3.3.0 label May 15, 2024

Samrose-Ahmed mentioned this issue Jun 2, 2024

Optimize Iceberg table count #46525

Open
