Iceberg metadata super optimization #43460

Open
1 of 9 tasks
stephen-shelby opened this issue Apr 1, 2024 · 0 comments · Fixed by #44527, #44581, #44632, #44679 or #45640
Labels
type/enhancement Make an enhancement to StarRocks version:3.3.0

Comments

@stephen-shelby

stephen-shelby commented Apr 1, 2024

Enhancement

This issue tracks the patches related to Iceberg metadata optimization. The first patch is #43459.

There are currently four important optimizations:

  1. Iceberg metadata distributed plan. (Performance improves by roughly n to 2n times, where n is the number of BEs.)
  2. Iceberg manifest cache. (Job planning latency drops to around 100 ms.)
  3. A distributed plan framework for external tables based on metadata files.
  4. Refactoring of some Iceberg metadata code.

More optimization work is planned, including:

  • Add more automatic distribution strategies based on BE load, such as CPU, I/O, and memory.
  • Enhance time travel statements such as AS OF.
  • Refactor the current split structures, such as RemoteFileInfo.
  • Provide a well-tuned default threshold for the distributed plan, based on extensive testing.
  • Adapt the partition data parser to tables with partition evolution.
  • Continuously optimize performance.
  • Add cache metrics such as hit rate.
  • Provide a benchmark to guard against performance regressions.
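
To make the "cache metrics such as hit rate" item concrete, here is a minimal sketch of a bounded LRU cache that counts hits and misses. It is an illustration only: the class name ManifestCacheSketch, its methods, and the loader callback are hypothetical and are not taken from the StarRocks code base or from any of the linked patches.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a small LRU cache that tracks its own hit rate.
// Not the StarRocks manifest cache; names and structure are assumptions.
public class ManifestCacheSketch<K, V> {
    private final Map<K, V> entries;
    private long hits;
    private long misses;

    public ManifestCacheSketch(final int maxEntries) {
        // accessOrder=true keeps the least recently used entry first,
        // so removeEldestEntry evicts in LRU order.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Return the cached value, or load and cache it on a miss.
    public synchronized V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value != null) {
            hits++;
            return value;
        }
        misses++;
        value = loader.apply(key);
        entries.put(key, value);
        return value;
    }

    // Fraction of lookups served from the cache; 0.0 before any lookup.
    public synchronized double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        ManifestCacheSketch<String, String> cache = new ManifestCacheSketch<>(2);
        cache.get("manifest-a", k -> "parsed " + k); // miss
        cache.get("manifest-a", k -> "parsed " + k); // hit
        System.out.println("hit rate = " + cache.hitRate());
    }
}
```

A real implementation would more likely expose these counters through the system's existing metrics facility instead of a method on the cache itself.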
@stephen-shelby stephen-shelby added the type/enhancement Make an enhancement to StarRocks label Apr 1, 2024
@stephen-shelby stephen-shelby linked a pull request Apr 22, 2024 that will close this issue
24 tasks
@andyziye andyziye reopened this Apr 23, 2024
@stephen-shelby stephen-shelby linked a pull request Apr 23, 2024 that will close this issue
24 tasks
@stephen-shelby stephen-shelby linked a pull request Apr 23, 2024 that will close this issue
24 tasks
@stephen-shelby stephen-shelby linked a pull request Apr 24, 2024 that will close this issue
24 tasks
imay pushed a commit that referenced this issue May 11, 2024
…uted_plan interface (#45404)

Why I'm doing:
We use ConnectContext.get() to read session variables during Iceberg job planning. If planning does not run in the query thread, the connect context cannot be obtained. There are two such cases:

1. In prepareMetadata, job planning is executed in a thread pool.
2. In distributed plan mode, the metadata collect job is executed in another thread.

Cases 1 and 2 can also occur together: plan_mode is distributed and planning runs in the thread pool.
So we need to handle the connect context in both cases.

What I'm doing:
Adapting ConnectContext and Tracers under multi-threading
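
The underlying problem is general: a value stored in a thread-local is invisible to tasks that run on pool threads. The following is a minimal sketch of one common way to handle it, by capturing the caller's context when the task is created and installing it in the worker for the duration of the task. QueryContext, CURRENT, and withContext are hypothetical names invented for this illustration; this is not the ConnectContext API and not necessarily how the patch does it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextPropagationSketch {
    // Hypothetical stand-in for a per-query context (not StarRocks' ConnectContext).
    static final class QueryContext {
        final String planMode;
        QueryContext(String planMode) { this.planMode = planMode; }
    }

    // Each thread sees only the value it set itself.
    static final ThreadLocal<QueryContext> CURRENT = new ThreadLocal<>();

    // Capture the caller's context now and install it in whichever
    // thread eventually runs the task.
    static <T> Callable<T> withContext(Callable<T> task) {
        final QueryContext captured = CURRENT.get();
        return () -> {
            QueryContext previous = CURRENT.get();
            CURRENT.set(captured);
            try {
                return task.call();
            } finally {
                // Restore, so a pooled thread does not leak context into later tasks.
                if (previous == null) {
                    CURRENT.remove();
                } else {
                    CURRENT.set(previous);
                }
            }
        };
    }

    static String readPlanMode() {
        QueryContext ctx = CURRENT.get();
        return ctx == null ? "<no context>" : ctx.planMode;
    }

    // Runs the same task on a pool thread without and with propagation.
    static String[] demo() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            CURRENT.set(new QueryContext("distributed"));
            Callable<String> task = ContextPropagationSketch::readPlanMode;
            String bare = pool.submit(task).get();
            String wrapped = pool.submit(withContext(task)).get();
            return new String[] {bare, wrapped};
        } finally {
            CURRENT.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] result = demo();
        System.out.println(result[0]); // <no context>
        System.out.println(result[1]); // distributed
    }
}
```

Restoring the previous value in the finally block matters with pooled threads, because the same worker is reused for unrelated tasks.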

Fixes #43460

Signed-off-by: stephen <stephen5217@163.com>
stephen-shelby added a commit to stephen-shelby/starrocks that referenced this issue May 15, 2024
…uted_plan interface (StarRocks#45404)

@stephen-shelby stephen-shelby linked a pull request May 15, 2024 that will close this issue
24 tasks
node pushed a commit to vivo/starrocks that referenced this issue May 17, 2024
…uted_plan interface (StarRocks#45404)
