feat(parquet-writer)!: enable multiple compaction output #5292

v0y4g3r · 2025-01-06T03:44:51Z

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

I have written the necessary rustdoc comments.
I have added the necessary unit tests and integration tests.
This PR requires documentation updates.
API changes are backward compatible.
Schema or data changes are backward compatible.

coderabbitai · 2025-01-06T03:44:57Z

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


…pdated references and implementations in `access_layer.rs`, `write_cache.rs`, and related test files to use the new struct name.
- **Add `max_file_size` support in compaction:** Introduced `max_file_size` option in `PickerOutput`, `SerializedPickerOutput`, and `WriteOptions` in `compactor.rs`, `picker.rs`, `twcs.rs`, and `window.rs`.
- **Enhance Parquet writing logic:** Modified `parquet.rs` and `parquet/writer.rs` to support optional `max_file_size` and added a test case `test_write_multiple_files` to verify writing multiple files based on size constraints.

**Refactor Parquet Writer Initialization and File Handling**
- Updated `ParquetWriter` in `writer.rs` to handle `current_indexer` as an `Option`, allowing for more flexible initialization and management.
- Introduced `finish_current_file` method to encapsulate logic for completing and transitioning between SST files, improving code clarity and maintainability.
- Enhanced error handling and logging with `debug` statements for better traceability during file operations.

- **Removed Output Size Enforcement in `twcs.rs`:**
  - Deleted the `enforce_max_output_size` function and related logic to simplify compaction input handling.
- **Added Max File Size Option in `parquet.rs`:**
  - Introduced `max_file_size` in `WriteOptions` to control the maximum size of output files.
- **Refactored Indexer Management in `parquet/writer.rs`:**
  - Changed `current_indexer` from an `Option` to a direct `Indexer` type.
  - Implemented `roll_to_next_file` to handle file transitions when exceeding `max_file_size`.
  - Simplified indexer initialization and management logic.

- **Refactored SST File Handling**:
  - Introduced `FilePathProvider` trait and its implementations (`WriteCachePathProvider`, `RegionFilePathFactory`) to manage SST and index file paths.
  - Updated `AccessLayer`, `WriteCache`, and `ParquetWriter` to use `FilePathProvider` for path management.
  - Modified `SstWriteRequest` and `SstUploadRequest` to use path providers instead of direct paths.
  - Files affected: `access_layer.rs`, `write_cache.rs`, `parquet.rs`, `writer.rs`.
- **Enhanced Indexer Management**:
  - Replaced `IndexerBuilder` with `IndexerBuilderImpl` and made it async to support dynamic indexer creation.
  - Updated `ParquetWriter` to handle multiple indexers and file IDs.
  - Files affected: `index.rs`, `parquet.rs`, `writer.rs`.
- **Removed Redundant File ID Handling**:
  - Removed `file_id` from `SstWriteRequest` and `CompactionOutput`.
  - Updated related logic to dynamically generate file IDs where necessary.
  - Files affected: `compaction.rs`, `flush.rs`, `picker.rs`, `twcs.rs`, `window.rs`.
- **Test Adjustments**:
  - Updated tests to align with new path and indexer management.
  - Introduced `FixedPathProvider` and `NoopIndexBuilder` for testing purposes.
  - Files affected: `sst_util.rs`, `version_util.rs`, `parquet.rs`.
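
To make the size-based rollover concrete, here is a minimal sketch of a writer that closes the current output and starts a new one once an optional `max_file_size` is reached. It is not the PR's `ParquetWriter`: `RollingWriter`, `FileInfo`, and `write_batch` are hypothetical names, byte counts are passed in by the caller rather than measured from Parquet output, and file IDs are plain counters.

```rust
/// Hypothetical stand-in for a finished SST file.
#[derive(Debug, PartialEq)]
pub struct FileInfo {
    pub file_id: u64,
    pub size: usize,
    pub num_rows: usize,
}

/// Toy writer that rolls to a new output file once the current one
/// reaches `max_file_size`. `None` means "never roll".
pub struct RollingWriter {
    max_file_size: Option<usize>,
    next_file_id: u64,
    current: Option<FileInfo>,
    finished: Vec<FileInfo>,
}

impl RollingWriter {
    pub fn new(max_file_size: Option<usize>) -> Self {
        Self {
            max_file_size,
            next_file_id: 0,
            current: None,
            finished: Vec::new(),
        }
    }

    /// Appends one batch. The output file is created lazily, so an empty
    /// input never produces an empty file.
    pub fn write_batch(&mut self, rows: usize, bytes: usize) {
        if rows == 0 {
            return;
        }
        if self.current.is_none() {
            // File IDs are generated on demand (assumption: a simple counter).
            self.current = Some(FileInfo {
                file_id: self.next_file_id,
                size: 0,
                num_rows: 0,
            });
            self.next_file_id += 1;
        }
        let mut reached_limit = false;
        if let Some(file) = self.current.as_mut() {
            file.size += bytes;
            file.num_rows += rows;
            reached_limit = self.max_file_size.map_or(false, |max| file.size >= max);
        }
        if reached_limit {
            self.finish_current_file();
        }
    }

    /// Closes the current file, if any, and records it as finished.
    fn finish_current_file(&mut self) {
        if let Some(file) = self.current.take() {
            self.finished.push(file);
        }
    }

    /// Flushes the last file and returns every file written.
    pub fn finish(mut self) -> Vec<FileInfo> {
        self.finish_current_file();
        self.finished
    }
}
```

Because the limit is checked after each batch, a file can exceed `max_file_size` by up to one batch; whether the real writer behaves the same way is not stated in the commit message.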

### Add Benchmarking and Refactor Compaction Logic
- **Benchmarking**: Added a new benchmark `run_bench` in `Cargo.toml` and implemented benchmarks in `benches/run_bench.rs` using Criterion for `find_sorted_runs` and `reduce_runs` functions.
- **Compaction Module Enhancements**:
  - Made `run.rs` public and refactored the `Ranged` and `Item` traits to be public.
  - Simplified the logic in `find_sorted_runs` and `reduce_runs` by removing `MergeItems` and related functions.
  - Introduced `find_overlapping_items` for identifying overlapping items.
- **Code Cleanup**: Removed redundant code and tests related to `MergeItems` in `run.rs`.
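
For readers unfamiliar with sorted runs, the following sketch partitions time ranges into runs in which no two ranges overlap. It assumes a greedy first-fit strategy over plain `(start, end)` tuples with inclusive bounds; the PR's `find_sorted_runs` works on the `Ranged`/`Item` traits and may use a different algorithm.

```rust
/// Inclusive time range `(start, end)`; a stand-in for a ranged SST file.
pub type Range = (i64, i64);

/// Greedy first-fit partition of ranges into sorted runs.
/// Within each returned run, ranges are ordered and pairwise non-overlapping.
pub fn find_sorted_runs(mut items: Vec<Range>) -> Vec<Vec<Range>> {
    // Process ranges in ascending order of start (then end).
    items.sort();
    let mut runs: Vec<Vec<Range>> = Vec::new();
    for item in items {
        // First run whose last range ends strictly before this one starts.
        let slot = runs
            .iter()
            .position(|run| run.last().map_or(true, |last| last.1 < item.0));
        match slot {
            Some(i) => runs[i].push(item),
            None => runs.push(vec![item]),
        }
    }
    runs
}
```

With this sketch, a single returned run means the input files do not overlap at all, while more runs indicate deeper overlap.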

### Enhance Compaction Logic and Add Benchmarks
- **Compaction Logic Improvements**:
  - Updated `reduce_runs` function in `src/mito2/src/compaction/run.rs` to remove the target parameter and improve the logic for selecting files to merge based on minimum penalty.
  - Enhanced `find_overlapping_items` to handle unsorted inputs and improve overlap detection efficiency.
- **Benchmark Enhancements**:
  - Added `bench_find_overlapping_items` in `src/mito2/benches/run_bench.rs` to benchmark the new `find_overlapping_items` function.
  - Extended existing benchmarks to include larger data sizes.
- **Testing Enhancements**:
  - Updated tests in `src/mito2/src/compaction/run.rs` to reflect changes in `reduce_runs` and added new tests for `find_overlapping_items`.
- **Logging and Debugging**:
  - Improved logging in `src/mito2/src/compaction/twcs.rs` to provide more detailed information about compaction decisions.
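
The commit message does not show the signature of `find_overlapping_items`, so the sketch below only illustrates the general idea of overlap detection on unsorted input: sort indices by start time, then sweep while tracking the range with the largest end seen so far. `find_overlapping_indices` is a hypothetical name, and inclusive `(start, end)` tuples are an assumption.

```rust
/// Returns the indices (in ascending order) of all ranges that overlap at
/// least one other range. Input may be unsorted; bounds are inclusive.
pub fn find_overlapping_indices(items: &[(i64, i64)]) -> Vec<usize> {
    // Sort indices instead of the items so the caller's slice is untouched.
    let mut order: Vec<usize> = (0..items.len()).collect();
    order.sort_by_key(|&i| items[i]);

    let mut overlapping = vec![false; items.len()];
    // Index of the range with the largest end among those already visited.
    let mut widest: Option<usize> = None;
    for &i in &order {
        if let Some(w) = widest {
            // A range overlaps an earlier one iff it starts at or before
            // the largest end seen so far.
            if items[i].0 <= items[w].1 {
                overlapping[i] = true;
                overlapping[w] = true;
            }
        }
        if widest.map_or(true, |w| items[i].1 > items[w].1) {
            widest = Some(i);
        }
    }
    (0..items.len()).filter(|&i| overlapping[i]).collect()
}
```

The sweep costs one sort plus a linear pass, compared with the quadratic cost of checking every pair.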

### Refactor and Enhance Compaction Logic
- **Refactor `find_overlapping_items` Function**: Changed the function signature to accept slices instead of mutable vectors in `run.rs`.
- **Rename and Update Struct Fields**: Renamed `penalty` to `size` in `SortedRun` struct and updated related logic in `run.rs`.
- **Enhance `reduce_runs` Function**: Improved logic to sort runs by size and limit probe runs to 100 in `run.rs`.
- **Add `merge_seq_files` Function**: Introduced a new function `merge_seq_files` in `run.rs` for merging sequential files.
- **Modify `TwcsPicker` Logic**: Updated the compaction logic to use `merge_seq_files` when only one run is found in `twcs.rs`.
- **Remove `enforce_file_num` Function**: Deleted the `enforce_file_num` function and its related test cases in `twcs.rs`.

### Enhance Compaction Logic and Testing
- **Add `merge_seq_files` Functionality**: Implemented the `merge_seq_files` function in `run.rs` to optimize file merging based on scoring systems. Updated benchmarks in `run_bench.rs` to include `bench_merge_seq_files`.
- **Improve Compaction Strategy in `twcs.rs`**: Modified the compaction logic to handle file merging more effectively, considering file size and overlap.
- **Update Tests**: Enhanced test coverage in `compaction_test.rs` and `append_mode_test.rs` to validate new compaction logic and file merging strategies.
- **Remove Unused Function**: Deleted `new_file_handles` from `test_util.rs` as it was no longer needed.
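
The scoring used by `merge_seq_files` is not described here. As one plausible illustration only, the sketch below picks the longest contiguous window of sequential (non-overlapping) files whose combined size stays within a size cap, using a two-pointer scan. `pick_seq_window` and the "longest window" criterion are assumptions, not the PR's logic.

```rust
/// Picks the longest contiguous window of at least two files whose total
/// size does not exceed `max_file_size`. Returns a half-open index range
/// `(start, end)`, or `None` if no such window exists.
pub fn pick_seq_window(sizes: &[u64], max_file_size: u64) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize)> = None;
    let mut start = 0;
    let mut total = 0u64;
    for (end, &size) in sizes.iter().enumerate() {
        total += size;
        // Shrink from the left until the window fits the cap again.
        while total > max_file_size && start <= end {
            total -= sizes[start];
            start += 1;
        }
        let len = end + 1 - start;
        // Earlier windows win ties because only strictly longer ones replace them.
        if len >= 2 && best.map_or(true, |(s, e)| len > e - s) {
            best = Some((start, end + 1));
        }
    }
    best
}
```

Merging only neighbouring files keeps the result a single sorted run, which is why the window must be contiguous.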

### Refactor TWCS Compaction Options
- **Refactor Compaction Logic**: Simplified the TWCS compaction logic by replacing multiple parameters (`max_active_window_runs`, `max_active_window_files`, `max_inactive_window_runs`, `max_inactive_window_files`) with a single `trigger_file_num` parameter in `picker.rs`, `twcs.rs`, and `options.rs`.
- **Update Tests**: Adjusted test cases to reflect the new compaction logic in `append_mode_test.rs`, `compaction_test.rs`, `filter_deleted_test.rs`, `merge_mode_test.rs`, and various test files under `tests/cases`.
- **Modify Engine Options**: Updated engine option keys to use `trigger_file_num` in `mito_engine_options.rs` and `region_request.rs`.
- **Fuzz Testing**: Updated fuzz test generators and translators to accommodate the new compaction parameter in `alter_expr.rs` and related files.

This refactor aims to streamline the compaction configuration by reducing the number of parameters and simplifying the codebase.
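
A rough sketch of what a single-parameter configuration could look like follows. The option key string, the default of 4, and the `>=` trigger condition are assumptions made for illustration; only the names `TwcsOptions`, `trigger_file_num`, and `TWCS_TRIGGER_FILE_NUM` come from the PR.

```rust
use std::collections::HashMap;

/// Option key; the exact string is an assumption, not taken from the PR.
pub const TWCS_TRIGGER_FILE_NUM: &str = "compaction.twcs.trigger_file_num";

/// Simplified options with the single remaining knob.
#[derive(Debug, PartialEq)]
pub struct TwcsOptions {
    pub trigger_file_num: usize,
}

impl Default for TwcsOptions {
    fn default() -> Self {
        // Placeholder default chosen for this sketch.
        Self { trigger_file_num: 4 }
    }
}

/// Parses the option from a string map, falling back to the default
/// when the key is absent.
pub fn parse_twcs_options(opts: &HashMap<String, String>) -> Result<TwcsOptions, String> {
    match opts.get(TWCS_TRIGGER_FILE_NUM) {
        None => Ok(TwcsOptions::default()),
        Some(raw) => raw
            .parse::<usize>()
            .map(|n| TwcsOptions { trigger_file_num: n })
            .map_err(|e| format!("invalid {}: {}", TWCS_TRIGGER_FILE_NUM, e)),
    }
}

/// Assumed trigger rule: compact a window once it holds at least
/// `trigger_file_num` files.
pub fn should_compact(files_in_window: usize, opts: &TwcsOptions) -> bool {
    files_in_window >= opts.trigger_file_num
}
```

Collapsing four thresholds into one means active and inactive windows share the same trigger, which is the simplification the commit message describes.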

Copilot

Pull Request Overview

This PR introduces a new TWCS compaction parameter (trigger_file_num) to replace the previous active/inactive window file/run limits, enabling multiple output SST files based on size thresholds. Core changes include:

- Replaced old max_active_window_* and max_inactive_window_* options with trigger_file_num
- Updated API layers, protobuf mappings, and default structs to reflect the new option
- Refactored the Parquet writer to split SST files at a configurable max_file_size and added end-to-end tests

Reviewed Changes

Copilot reviewed 36 out of 36 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| tests-fuzz/src/generator/alter_expr.rs | Updated fuzz generator to emit `TwcsTriggerFileNum` option |
| src/store-api/src/region_request.rs | Removed deprecated TWCS options, added mapping for trigger num |
| src/store-api/src/mito_engine_options.rs | Dropped old constants, introduced `TWCS_TRIGGER_FILE_NUM` |
| src/mito2/src/worker/handle_alter.rs | Handle `TWCS_TRIGGER_FILE_NUM` in metadata updates |
| src/mito2/src/sst/parquet/writer.rs | Reworked writer to finish multiple files based on size |
| src/mito2/src/region/options.rs | Simplified `TwcsOptions` to only include `trigger_file_num` |
| src/mito2/src/compaction/twcs.rs | Updated picker logic to use `trigger_file_num` and merge files |
| src/mito2/src/compaction/picker.rs | Propagated new max file size field in `PickerOutput` |

Comments suppressed due to low confidence (1)

src/mito2/benches/run_bench.rs:1

The benchmark uses the criterion crate but there’s no indication it’s listed under [dev-dependencies] in Cargo.toml. Ensure criterion = "^0.3" (or correct version) is added to avoid build errors when running benches.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

Copilot · 2025-05-18T18:02:44Z

src/mito2/src/sst/parquet/writer.rs

                }
            }
        }



Calling finish_current_file unconditionally at the end of write_all will panic if no rows were written (the assertion stats.num_rows > 0 will fail). Add a guard to early-return when stats.num_rows == 0 or self.writer.is_none() to avoid panics.

Suggested change

        // Guard to prevent panics if no rows were written or writer is uninitialized.
        if stats.num_rows == 0 || self.writer.is_none() {
            return Ok(results);
        }
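
The effect of the suggested guard can be shown with a stripped-down model. In this sketch `write_all` takes row counts instead of a real batch source, `SstInfo` is reduced to a row count, and a boolean stands in for `self.writer`; none of this mirrors the actual `ParquetWriter` beyond the guard itself.

```rust
#[derive(Debug, Default)]
pub struct SourceStats {
    pub num_rows: usize,
}

#[derive(Debug, PartialEq)]
pub struct SstInfo {
    pub num_rows: usize,
}

/// Simplified `write_all`: consumes batches (row counts) and finishes
/// one file at the end, unless nothing was written.
pub fn write_all(batches: &[usize]) -> Result<Vec<SstInfo>, String> {
    let mut results = Vec::new();
    let mut stats = SourceStats::default();
    // Stand-in for `self.writer`: only "opened" once a non-empty batch arrives.
    let mut writer_open = false;
    for &rows in batches {
        if rows == 0 {
            continue;
        }
        writer_open = true;
        stats.num_rows += rows;
    }
    // The guard from the review: skip finishing when there is nothing to finish.
    if stats.num_rows == 0 || !writer_open {
        return Ok(results);
    }
    results.push(finish_current_file(&stats)?);
    Ok(results)
}

/// Mirrors the assertion mentioned in the review comment.
fn finish_current_file(stats: &SourceStats) -> Result<SstInfo, String> {
    assert!(stats.num_rows > 0, "finish called without any rows");
    Ok(SstInfo { num_rows: stats.num_rows })
}
```

Without the guard, an empty input would reach the assertion and panic; with it, the caller simply receives an empty list of files.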

Copilot · 2025-05-18T18:02:44Z

src/mito2/src/sst/parquet/writer.rs

+        stats: &mut SourceStats,
+    ) -> Result<()> {
+        // maybe_init_writer will re-create a new file.
+        if let Some(mut current_writer) = mem::take(&mut self.writer) {


After taking self.writer, the associated current_indexer remains set to the old instance. This could lead to using a finished indexer on the next file. Consider mem::take(&mut self.current_indexer) inside this block to clear it.
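
A small model of this suggestion, assuming `current_indexer` is an `Option` as in the earlier revision of the writer: both fields are taken together so nothing from the finished file can leak into the next one. The `Writer` and `Indexer` types below are hypothetical simplifications with a byte buffer in place of the Parquet writer.

```rust
use std::mem;

/// Hypothetical indexer state for one output file.
#[derive(Debug, Default)]
pub struct Indexer {
    pub rows_indexed: usize,
    pub finished: bool,
}

pub struct Writer {
    pub writer: Option<Vec<u8>>,
    pub current_indexer: Option<Indexer>,
}

impl Writer {
    /// Finishes the current file and clears both per-file fields.
    /// Returns `(bytes_written, rows_indexed)`, or `None` if no file is open.
    pub fn finish_current_file(&mut self) -> Option<(usize, usize)> {
        // `mem::take` leaves `None` behind, same as `Option::take`.
        let bytes = mem::take(&mut self.writer)?;
        // Take the indexer as well, so the next file starts with a fresh one.
        let mut indexer = mem::take(&mut self.current_indexer).unwrap_or_default();
        indexer.finished = true;
        Some((bytes.len(), indexer.rows_indexed))
    }
}
```

After the call both fields are `None`, so whatever lazily re-creates the writer is forced to build a new indexer too.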

github-actions bot added the docs-not-required This change does not impact docs. label Jan 6, 2025

v0y4g3r changed the title ~~feat(parquet-writer): enable multiple compaction output~~ feat!(parquet-writer): enable multiple compaction output Jan 6, 2025

v0y4g3r force-pushed the feat/multiple-compaction-output branch from d410f7f to edaa6c3 Compare March 13, 2025 07:03

v0y4g3r marked this pull request as ready for review March 13, 2025 08:04

v0y4g3r requested review from evenyag and waynexia as code owners March 13, 2025 08:04

github-actions bot added docs-required This change requires docs update. and removed docs-not-required This change does not impact docs. labels Mar 13, 2025

sunng87 mentioned this pull request Mar 13, 2025

Update docs for feat!(parquet-writer): enable multiple compaction output GreptimeTeam/docs#1562

Closed

v0y4g3r changed the title ~~feat!(parquet-writer): enable multiple compaction output~~ feat(parquet-writer)!: enable multiple compaction output Mar 13, 2025

sunng87 mentioned this pull request Mar 13, 2025

Update docs for feat(parquet-writer)!: enable multiple compaction output GreptimeTeam/docs#1563

Closed

fengjiachun added this to the v0.15 milestone Apr 21, 2025

v0y4g3r and others added 7 commits May 14, 2025 06:32

chore: rebase main

4f41116

v0y4g3r force-pushed the feat/multiple-compaction-output branch from edaa6c3 to e8f16ca Compare May 18, 2025 16:06

v0y4g3r requested a review from a team as a code owner May 18, 2025 16:06

waynexia requested a review from Copilot May 18, 2025 18:00

Copilot AI reviewed May 18, 2025

v0y4g3r closed this May 19, 2025
