feat(mito): adjust seg size of inverted index to finer granularity instead of row group level #3289

Merged

Contributor

@zhongzc zhongzc commented Feb 6, 2024

I hereby agree to the terms of the GreptimeDB CLA

What's changed and what's your intention?

Reducing the inverted index segment row count from 102400 (the row group size) to 1024 lets the index filter at a finer granularity than the row group level, which greatly improves its filtering capability.
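As a rough illustration of why smaller segments help, the sketch below maps a per-segment match bitmap to row ranges inside each row group. It is not the code from this PR: `segments_to_row_ranges`, `rows_to_read`, and their parameters are hypothetical names, and it assumes the row group size is a multiple of the segment size (as with 102400 and 1024).

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Hypothetical helper (not from the PR): turns a per-segment match bitmap
/// into row ranges relative to the start of each row group.
/// Assumes `row_group_rows` is a multiple of `segment_rows`.
fn segments_to_row_ranges(
    matched: &[bool],
    segment_rows: usize,
    row_group_rows: usize,
    total_rows: usize,
) -> BTreeMap<usize, Vec<Range<usize>>> {
    assert!(segment_rows > 0 && row_group_rows % segment_rows == 0);
    let mut out: BTreeMap<usize, Vec<Range<usize>>> = BTreeMap::new();
    for (seg, &hit) in matched.iter().enumerate() {
        if !hit {
            continue; // the index says no row in this segment can match
        }
        let start = seg * segment_rows;
        if start >= total_rows {
            break;
        }
        // The last segment of a file may be shorter than `segment_rows`.
        let end = (start + segment_rows).min(total_rows);
        let row_group = start / row_group_rows;
        let base = row_group * row_group_rows;
        let (s, e) = (start - base, end - base);
        let ranges = out.entry(row_group).or_default();
        // Merge with the previous range when the segments are adjacent.
        if let Some(last) = ranges.last_mut() {
            if last.end == s {
                last.end = e;
                continue;
            }
        }
        ranges.push(s..e);
    }
    out
}

/// Total number of rows that still have to be read and checked precisely.
fn rows_to_read(ranges: &BTreeMap<usize, Vec<Range<usize>>>) -> usize {
    ranges.values().flatten().map(|r| r.end - r.start).sum()
}
```

With a segment as large as a row group, a single matching row forces the whole row group to be read. With smaller segments, only the matching segments inside that row group are read, and row groups with no matching segment drop out of the map entirely.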

In my own test scenario, this yielded an improvement of more than 5x.

2024-02-06T11:12:47.571445Z DEBUG mito2::sst::parquet::reader: Read parquet 4651449581568(1083, 0) 0ae417c3-8a95-4aa1-a582-db73ad3eaa3e, range: (Timestamp { value: 1707157248569, unit: Millisecond }, Timestamp { value: 1707157257567, unit: Millisecond }), 7/7 row groups, metrics: Metrics { num_row_groups_before_filtering: 7, num_row_groups_inverted_index_filtered: 0, num_row_groups_min_max_filtered: 0, num_rows_precise_filtered: 97632, num_rows_in_row_group_before_filtering: 672768, num_rows_in_row_group_inverted_index_filtered: 574464, build_cost: 1.150289ms, scan_cost: 406.620401ms, num_record_batches: 96, num_batches: 672, num_rows: 672 }
2024-02-06T11:12:47.571455Z DEBUG mito2::sst::parquet::reader: Read parquet 4651449581568(1083, 0) 7fbf7779-43c0-457a-b006-8ba8bf4d28ec, range: (Timestamp { value: 1707157256604, unit: Millisecond }, Timestamp { value: 1707157268548, unit: Millisecond }), 8/8 row groups, metrics: Metrics { num_row_groups_before_filtering: 8, num_row_groups_inverted_index_filtered: 0, num_row_groups_min_max_filtered: 0, num_rows_precise_filtered: 97632, num_rows_in_row_group_before_filtering: 781056, num_rows_in_row_group_inverted_index_filtered: 682752, build_cost: 854.454µs, scan_cost: 375.459905ms, num_record_batches: 96, num_batches: 672, num_rows: 672 }
2024-02-06T11:12:47.571466Z DEBUG mito2::sst::parquet::reader: Read parquet 4651449581568(1083, 0) ed666445-3036-4236-b682-780f8a2c44a9, range: (Timestamp { value: 1707157267615, unit: Millisecond }, Timestamp { value: 1707157291488, unit: Millisecond }), 18/18 row groups, metrics: Metrics { num_row_groups_before_filtering: 18, num_row_groups_inverted_index_filtered: 0, num_row_groups_min_max_filtered: 0, num_rows_precise_filtered: 97984, num_rows_in_row_group_before_filtering: 1767168, num_rows_in_row_group_inverted_index_filtered: 1667840, build_cost: 1.864051ms, scan_cost: 194.583036ms, num_record_batches: 97, num_batches: 672, num_rows: 1344 }
2024-02-06T11:12:47.571497Z DEBUG mito2::sst::parquet::reader: Read parquet 4651449581568(1083, 0) 106a54ea-5706-4654-8ed7-1c8ced239787, range: (Timestamp { value: 1707155901111, unit: Millisecond }, Timestamp { value: 1707157249316, unit: Millisecond }), 96/304 row groups, metrics: Metrics { num_row_groups_before_filtering: 304, num_row_groups_inverted_index_filtered: 208, num_row_groups_min_max_filtered: 0, num_rows_precise_filtered: 103840, num_rows_in_row_group_before_filtering: 9830400, num_rows_in_row_group_inverted_index_filtered: 9700352, build_cost: 1.419874ms, scan_cost: 92.19075ms, num_record_batches: 127, num_batches: 699, num_rows: 26208 }
2024-02-06T11:12:47.579477Z DEBUG mito2::memtable::time_series: Iter 4651449581568(1083, 0) time series memtable, metrics: Metrics { total_series: 70656, num_pruned_series: 69984, num_rows: 672, num_batches: 672, scan_cost: 496.600474ms }
2024-02-06T11:12:47.580200Z DEBUG mito2::read::seq_scan: Seq scan finished, region_id: 4651449581568(1083, 0), metrics: Metrics { build_reader_cost: 32.264549ms, scan_cost: 485.816121ms, convert_cost: 27.524708ms }, use_parallel: true, parallelism: 6
2024-02-06T11:12:47.580220Z DEBUG mito2::read::merge: Merge reader finished, metrics: Metrics { scan_cost: 451.252364ms, num_fetch_by_batches: 3387, num_fetch_by_rows: 0, num_input_rows: 29568, num_duplicate_rows: 0, num_output_rows: 29568, num_deleted_rows: 0, fetch_cost: 447.486077ms }
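The row counts in the log line for file 106a54ea-5706-4654-8ed7-1c8ced239787 can be cross-checked with a little arithmetic. The sketch below uses the values copied from that line. The struct, its field names, and the reading of each metric (rows before filtering, rows dropped by the inverted index, rows dropped by precise filtering) are assumptions inferred from the metric names and from the fact that the numbers add up, not definitions taken from the source.

```rust
/// Hypothetical container for four values copied from the log line above.
struct ReaderMetrics {
    rows_before_filtering: u64, // num_rows_in_row_group_before_filtering
    rows_index_filtered: u64,   // num_rows_in_row_group_inverted_index_filtered
    rows_precise_filtered: u64, // num_rows_precise_filtered
    rows_returned: u64,         // num_rows
}

impl ReaderMetrics {
    /// Rows left after segment-level pruning by the inverted index
    /// (assumed meaning of the two row-group metrics).
    fn rows_after_index(&self) -> u64 {
        self.rows_before_filtering - self.rows_index_filtered
    }

    /// Rows left after the precise filter is applied to the surviving rows.
    fn rows_after_precise(&self) -> u64 {
        self.rows_after_index() - self.rows_precise_filtered
    }

    /// Fraction of candidate rows that the inverted index pruned.
    fn index_pruned_fraction(&self) -> f64 {
        self.rows_index_filtered as f64 / self.rows_before_filtering as f64
    }
}

/// Metrics of the 96/304 row group file in the log above.
fn largest_file_metrics() -> ReaderMetrics {
    ReaderMetrics {
        rows_before_filtering: 9_830_400,
        rows_index_filtered: 9_700_352,
        rows_precise_filtered: 103_840,
        rows_returned: 26_208,
    }
}
```

Under this reading, 9830400 is 96 row groups of 102400 rows, the index leaves 130048 rows (127 segments of 1024 rows), and precise filtering leaves 26208 rows, which equals the logged `num_rows`. The index prunes roughly 98.7% of the rows in the row groups that survived row-group-level filtering.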

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR does not require documentation updates.

Refer to a related PR or issue link (optional)

…stead of row group level

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>
@github-actions github-actions bot added docs-not-required This change does not impact docs. Size: M labels Feb 6, 2024
Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>
Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>
@github-actions github-actions bot added Size: L and removed Size: M labels Feb 6, 2024
Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

codecov bot commented Feb 7, 2024

Codecov Report

Attention: 19 lines in your changes are missing coverage. Please review.

Comparison is base (e4cd294) 85.67% compared to head (b223293) 85.08%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3289      +/-   ##
==========================================
- Coverage   85.67%   85.08%   -0.60%     
==========================================
  Files         865      872       +7     
  Lines      141035   141688     +653     
==========================================
- Hits       120838   120550     -288     
- Misses      20197    21138     +941     

Contributor

@evenyag evenyag left a comment

LGTM

Contributor

@killme2008 killme2008 left a comment

LGTM 👍

@zhongzc zhongzc added this pull request to the merge queue Feb 7, 2024
Merged via the queue into GreptimeTeam:main with commit 141ed51 Feb 7, 2024
16 checks passed
@zhongzc zhongzc deleted the zhongzc/inverted-index-fine-grained branch February 7, 2024 08:32
Labels
docs-not-required This change does not impact docs.
3 participants