Optimize compaction strategy to avoid merging SST files without overlapping time ranges #3416

v0y4g3r opened this issue Feb 29, 2024 · 6 comments
v0y4g3r commented Feb 29, 2024

What type of enhancement is this?

Performance

What does the enhancement do?

With the introduction of the new merge tree memtable, we may find that SST files in L0 can be as large as several GiB. In this case, merging L0 files regardless of their size can be expensive and does little for read performance. We should adopt a "tiered" merging approach that skips large files and files that do not overlap with others in terms of time ranges.
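For illustration, a minimal sketch of the idea (the `SstFile` type, `pick_files`, and `max_input_size` below are hypothetical, not mito2's actual picker): only small files whose time ranges overlap with another small file are selected, while oversized files and files that overlap nothing are left untouched.

```rust
// Hypothetical SST metadata; the real mito2 FileMeta carries more fields.
#[derive(Debug, Clone)]
struct SstFile {
    id: u32,
    size: u64,
    // Inclusive time range covered by the file, in milliseconds.
    time_range: (i64, i64),
}

fn overlaps(a: &SstFile, b: &SstFile) -> bool {
    a.time_range.0 <= b.time_range.1 && b.time_range.0 <= a.time_range.1
}

/// Returns the files worth merging: small files that overlap at least one
/// other small file. Large files and non-overlapping files are skipped.
fn pick_files(files: &[SstFile], max_input_size: u64) -> Vec<SstFile> {
    let small: Vec<SstFile> = files
        .iter()
        .filter(|f| f.size <= max_input_size)
        .cloned()
        .collect();
    small
        .iter()
        .filter(|f| small.iter().any(|other| other.id != f.id && overlaps(f, other)))
        .cloned()
        .collect()
}

fn main() {
    let files = vec![
        SstFile { id: 1, size: 4 << 30, time_range: (0, 100) },     // too large, skipped
        SstFile { id: 2, size: 400 << 20, time_range: (0, 50) },    // overlaps id 3
        SstFile { id: 3, size: 420 << 20, time_range: (40, 90) },   // overlaps id 2
        SstFile { id: 4, size: 300 << 20, time_range: (200, 300) }, // overlaps nothing, skipped
    ];
    let picked = pick_files(&files, 1 << 30);
    println!("picked: {:?}", picked.iter().map(|f| f.id).collect::<Vec<_>>());
}
```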

Tasks

Implementation challenges

No response

evenyag commented Feb 29, 2024

Currently we compact small files together with a large file, e.g.

3.8G 4649c98-4edc-444e-a9bd-920d9b88b033.parquet
452M 8fea78f6-ad12-4344-a43d-c93092bfe068.parquet
426M d3a2ddcd-1ff4-4d82-a87d-d47d1f8490ba.parquet
423M fb727d76-2afd-487e-a286-cdeea29048dd.parquet

becomes

4.5G 5d79971b-ce56-4984-af07-5e60fa6f9436.parquet

We might stop compacting a window if the output file is too large.

However, we need to choose files carefully: since we remove deleted keys and their deletion markers during compaction, we should only compact files whose keys overlap.
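As a rough sketch of the size rule (the `FileHandle` type, `group_by_output_size`, and `max_output_size` are hypothetical, not the real picker): the files of one time window are grouped greedily so that each group's estimated merged size stays under a cap; a group that ends up holding only the already-large file is simply not rewritten.

```rust
#[derive(Debug, Clone)]
struct FileHandle {
    name: &'static str,
    size: u64,
}

fn group_by_output_size(mut files: Vec<FileHandle>, max_output_size: u64) -> Vec<Vec<FileHandle>> {
    // Sort ascending so small files are merged together first.
    files.sort_by_key(|f| f.size);
    let mut groups: Vec<Vec<FileHandle>> = Vec::new();
    let mut current: Vec<FileHandle> = Vec::new();
    let mut current_size = 0u64;
    for file in files {
        // Start a new output group once the estimate would exceed the cap.
        if !current.is_empty() && current_size + file.size > max_output_size {
            groups.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += file.size;
        current.push(file);
    }
    if !current.is_empty() {
        groups.push(current);
    }
    groups
}

fn main() {
    let gib = 1u64 << 30;
    let files = vec![
        FileHandle { name: "4649c98.parquet", size: 38 * gib / 10 }, // ~3.8G
        FileHandle { name: "8fea78f6.parquet", size: 452 << 20 },
        FileHandle { name: "d3a2ddcd.parquet", size: 426 << 20 },
        FileHandle { name: "fb727d76.parquet", size: 423 << 20 },
    ];
    // With a 2 GiB cap, the three ~430M files form one group and the 3.8G file
    // sits alone in another; only groups with more than one file are compacted.
    for group in group_by_output_size(files, 2 * gib) {
        let names: Vec<_> = group.iter().map(|f| f.name).collect();
        println!("{:?}", names);
    }
}
```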

v0y4g3r commented Mar 29, 2024

For active windows, we enforce max_sorted_runs (maybe applying to both L0 and L1) and max_files (applying to all files in the active window, or just L0 in the active window?). If the current number of sorted runs exceeds that limit, we find the cheapest set of files to merge to get back under it.

For example, for an active window with max_sorted_runs=1

[compaction diagram]

For inactive windows, considering that out-of-order insertions may still happen, we allow no more than one run in both L0 and L1 (the typical leveled compaction strategy). If the current number of runs exceeds that value, all overlapping files in both L0 and L1 are picked and compacted into a new file in L1.

When an active window shifts to an inactive window, we check whether there is only one sorted run across L0 and L1 (this does not require another routine; in the current implementation every compaction already checks every time window).

[compaction diagram]
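A hedged sketch of the run-counting part (option names such as `max_sorted_runs` follow this proposal and do not exist in mito2 yet): files with overlapping time ranges must belong to different sorted runs, so the minimum number of runs in a window equals the maximum number of files covering any single timestamp.

```rust
fn sorted_runs(ranges: &[(i64, i64)]) -> usize {
    // Sweep over range endpoints: +1 when a file starts, -1 after it ends.
    let mut events: Vec<(i64, i32)> = Vec::new();
    for &(start, end) in ranges {
        events.push((start, 1));
        events.push((end + 1, -1));
    }
    events.sort();
    let (mut depth, mut max_depth) = (0i32, 0i32);
    for (_, delta) in events {
        depth += delta;
        max_depth = max_depth.max(depth);
    }
    max_depth as usize
}

fn needs_compaction(ranges: &[(i64, i64)], max_sorted_runs: usize) -> bool {
    sorted_runs(ranges) > max_sorted_runs
}

fn main() {
    // Three files in one active window; the first two overlap in time.
    let window = [(0, 100), (50, 150), (200, 300)];
    assert_eq!(sorted_runs(&window), 2);
    // With max_sorted_runs = 1 the window must be compacted.
    assert!(needs_compaction(&window, 1));
}
```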

evenyag commented Mar 29, 2024

We should ensure all overlapping files in the same time window are compacted into one sorted run, since we only have 2 levels and always remove delete tombstones after compaction. Otherwise, we should add a flag to the merge reader to control whether to remove delete tombstones, and only filter deleted items when we are sure all overlapping files are being compacted.
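A minimal sketch of such a flag (the `MergeReader` and `filter_deleted` names below are hypothetical, not the actual mito2 merge reader): delete tombstones are only dropped when `filter_deleted` is set, i.e. when the compaction is known to include every overlapping file.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum OpType {
    Put,
    Delete,
}

#[derive(Debug, Clone)]
struct Row {
    key: String,
    op: OpType,
}

struct MergeReader {
    // Rows already merged and sorted by key, newest version wins.
    rows: Vec<Row>,
    /// Only remove delete tombstones when all overlapping files are inputs.
    filter_deleted: bool,
}

impl Iterator for MergeReader {
    type Item = Row;

    fn next(&mut self) -> Option<Row> {
        while !self.rows.is_empty() {
            let row = self.rows.remove(0);
            if self.filter_deleted && row.op == OpType::Delete {
                // Safe to drop: no file outside this compaction can still
                // hold a value that this tombstone must shadow.
                continue;
            }
            return Some(row);
        }
        None
    }
}

fn main() {
    let rows = vec![
        Row { key: "a".into(), op: OpType::Put },
        Row { key: "b".into(), op: OpType::Delete },
    ];
    // Partial compaction: the tombstone for "b" must be kept.
    let partial = MergeReader { rows: rows.clone(), filter_deleted: false };
    assert_eq!(partial.count(), 2);
    // Full compaction of all overlapping files: the tombstone can be dropped.
    let survivors: Vec<String> = MergeReader { rows, filter_deleted: true }
        .map(|row| row.key)
        .collect();
    assert_eq!(survivors, vec!["a"]);
}
```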

v0y4g3r commented Mar 29, 2024

We should ensure all overlapping files in the same time window are compacted into one sorted run, since we only have 2 levels and always remove delete tombstones after compaction. Otherwise, we should add a flag to the merge reader to control whether to remove delete tombstones, and only filter deleted items when we are sure all overlapping files are being compacted.

Just like Cassandra's size-tiered compaction strategy, tombstones are only dropped when all overlapping files are involved in the current compaction.

evenyag commented Mar 29, 2024

I also found a potential issue: compaction jobs might block flush jobs. When using object stores, a compaction job with many input files results in a large number of file purge tasks. Deleting an object in the object store is slower than deleting a file on local disk. These purge tasks use the same background scheduler as flush jobs and slow them down.

2024-03-28T06:59:10.833054Z  INFO mito2::compaction::twcs: Compacted SST files, input: [FileMeta { region_id: 5175435591680(1205, 0), file_id: FileId(9f93f433-2928-474c-99b8-0e313aeeff81), time_range: (Timestamp { value: 1708489285951, unit: Millisecond }, Timestamp { value: 1708489285951, unit: Millisecond }), level: 0, file_size: 29669, available_indexes: [InvertedIndex], index_file_size: 20331 },
// ......
FileMeta { region_id: 5175435591680(1205, 0), file_id: FileId(c07bf121-2ef6-4e2c-8381-67b27958195f), time_range: (Timestamp { value: 1708489210951, unit: Millisecond }, Timestamp { value: 1708489285951, unit: Millisecond }), level: 0, file_size: 32240, available_indexes: [InvertedIndex], index_file_size: 20333 }, FileMeta { region_id: 5175435591680(1205, 0), file_id: FileId(be30fabe-d0b4-4e54-89c2-ae860b3cd9ae), time_range: (Timestamp { value: 1702940890950, unit: Millisecond }, Timestamp { value: 1708489555951, unit: Millisecond }), level: 1, file_size: 91674531, available_indexes: [InvertedIndex], index_file_size: 41016404 }], output: [FileMeta { region_id: 5175435591680(1205, 0), file_id: FileId(8343abbc-2d5f-4ba4-a8f8-bf8efb4d71a8), time_range: (Timestamp { value: 1702940890950, unit: Millisecond }, Timestamp { value: 1708489555951, unit: Millisecond }), level: 1, file_size: 91674772, available_indexes: [InvertedIndex], index_file_size: 41016403 }], window: Some(31536000)

2024-03-28T06:59:10.899509Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 9f93f433-2928-474c-99b8-0e313aeeff81, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.903589Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 9031acad-daa6-44a7-a4d4-a85593236fbb, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.906535Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 51d57191-4007-495a-8b2b-e4a6e74a77bc, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.907597Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 7b7b74a8-1062-49ea-9582-cde00f27bf5b, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.927402Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 53f8d4cb-1af1-4772-9471-a811c7ccf424, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.932686Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 438cbe35-b94c-4e1e-8671-78d7614acc46, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.937580Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 94592fe4-f407-4cdd-b875-f7dce7c916aa, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.938907Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 7d8942b5-edf4-4531-a4b1-6b007bc8d41c, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.957885Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 2f892e83-8477-46de-89df-6a1d829f6fef, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.969388Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 9aef716f-6e28-45bb-ab61-b4e585df5c99, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.972234Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: dd9be29e-4ac5-4fc6-9c3a-00dac12f4995, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.973382Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: 2bbbb635-b09f-4c83-93bc-141211557447, region: 5175435591680(1205, 0)
2024-03-28T06:59:10.991235Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: b6e7eb67-967c-4b98-bf1a-8ae463708ad8, region: 5175435591680(1205, 0)
2024-03-28T06:59:11.006431Z  INFO mito2::sst::file_purger: Successfully deleted SST file, file_id: f1da6408-6f8a-45dd-94f3-067d1d444e87, region: 5175435591680(1205, 0)
......

What's more, compaction jobs and flush jobs also use the same background worker pool and job limit.

evenyag commented Mar 31, 2024

I also found a potential issue: compaction jobs might block flush jobs. When using object stores, a compaction job with many input files results in a large number of file purge tasks. Deleting an object in the object store is slower than deleting a file on local disk. These purge tasks use the same background scheduler as flush jobs and slow them down.

I created a PR #3621 for this.

For flush and compaction jobs, I wonder if we should use dedicated schedulers.
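As a sketch of what dedicated schedulers could look like (a std-only toy, not the actual mito2 scheduler): flush jobs and compaction/purge jobs each get their own queue and worker thread, so a backlog of slow object-store deletions no longer sits in front of flushes.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

enum Job {
    Flush(&'static str),
    Purge(&'static str),
}

struct Scheduler {
    sender: mpsc::Sender<Job>,
    worker: thread::JoinHandle<()>,
}

fn spawn_scheduler(name: &'static str) -> Scheduler {
    let (sender, receiver) = mpsc::channel::<Job>();
    let worker = thread::spawn(move || {
        for job in receiver {
            match job {
                Job::Flush(region) => println!("[{name}] flushed {region}"),
                Job::Purge(file) => {
                    // Object store deletes are slow; simulate the latency.
                    thread::sleep(Duration::from_millis(50));
                    println!("[{name}] purged {file}");
                }
            }
        }
    });
    Scheduler { sender, worker }
}

fn main() {
    // Two independent queues: the purge backlog no longer delays flushes.
    let flush_scheduler = spawn_scheduler("flush");
    let purge_scheduler = spawn_scheduler("purge");

    for _ in 0..10 {
        purge_scheduler.sender.send(Job::Purge("some-file.parquet")).unwrap();
    }
    flush_scheduler.sender.send(Job::Flush("region-1205")).unwrap();

    drop(flush_scheduler.sender);
    drop(purge_scheduler.sender);
    flush_scheduler.worker.join().unwrap();
    purge_scheduler.worker.join().unwrap();
}
```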
