Arrow result collector: improved merge algorithm #625
Merged
…y combine

Follows the same general pattern as DuckDB's batch-index collector: each per-thread local collector is tagged with a unique batch_index; per-batch chunks flow straight through to ArrowQueryResult and the consumer combines them lazily (no eager decode/sort/rebuild, no std::sort).

Highlights:
* New OrderPreservationType enum (NO_ORDER, INSERTION_ORDER, FIXED_ORDER) on PhysicalOperator, with default NO_ORDER (Ladybug makes no insertion-order guarantee). OrderBy and TopK override operatorOrder() to FIXED_ORDER.
* New PhysicalPlanUtil::getOrderPreservation walks the physical plan and asks each operator's metadata; returns FIXED_ORDER only if a fixed-order operator is on the data path, otherwise NO_ORDER. Walks child[0] only, since Ladybug physical plans have a single data-flow edge per operator.
* ArrowResultCollectorSharedState now stores per-batch Arrow chunks and CSR metadata in std::map<batch_index_t, ...>, plus a BatchIndexAssigner that hands out unique batch_index values per local collector (atomic fetch_add). Each local collector is assigned a batch_index at initLocalStateInternal time.
* merge() in the cheap NO_ORDER path moves per-batch chunks into the global map (O(log N) per call, no pairwise merge, no sort). Result construction is zero-work: it threads per-batch chunks into ArrowQueryResult in batch_index order and does no merging.
* ArrowQueryResult::ArrowChunkedArray: a ChunkedArray-style view of the merged Arrow arrays (similar to Arrow C++ lib's arrow::ChunkedArray). Backed by a shared vector that the hasNextArrowChunk / getNextArrowChunk iteration API also reads from, so both APIs coexist. Chunks are in batch_index order; Python users can construct pyarrow.ChunkedArray from these chunks and call combine_chunks() to materialize on the consumer side (arrow::Concat isn't linked).
* ArrowQueryResult::combineCSRChunks(): lazy k-way merge of per-batch CSR chunks in batch_index order. Replaces the eager std::sort + decode/rebuild that flattenCSRMetadata used to do at result-construction time. Within each chunk, the per-batch CSR tracker requires src to be non-decreasing, so emitting in (src, batch_index, scan_order_within_batch) order is correct without any global sort. Merged result is cached for repeated access. The FIXED_ORDER path still goes through mergeCSRMetadata, which does its own pairwise sort to preserve global order.

Algorithmic cost drops from O(N^2 * M) for the previous pairwise merge (N threads, M edges) plus O(M log M) for the eager std::sort, to O(M log K) for the lazy k-way merge (K batches). Same correctness as before, but the scan-time hot path only does map insertions and result construction does no merging work.
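The lazy k-way merge described above can be illustrated with a minimal C++17 sketch. This is not the PR's combineCSRChunks() implementation: `Edge`, `Chunk`, and `kWayMerge` are hypothetical names, and a plain vector of (src, dst) pairs stands in for a per-batch CSR chunk. It assumes each chunk already has non-decreasing src and that chunks are supplied in batch_index order.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Hypothetical stand-in for one edge of a per-batch CSR chunk.
struct Edge {
    uint64_t src;
    uint64_t dst;
};
// Hypothetical chunk type: src is assumed non-decreasing within a chunk.
using Chunk = std::vector<Edge>;

// Merge K chunks (given in batch_index order) into one edge list ordered by
// (src, batch position, offset within batch). The heap never holds more than
// one cursor per chunk, so the cost is O(M log K) for M edges in total.
std::vector<Edge> kWayMerge(const std::vector<Chunk>& chunks) {
    // Cursor = (src of the current edge, batch position, offset in chunk).
    using Cursor = std::tuple<uint64_t, std::size_t, std::size_t>;
    std::priority_queue<Cursor, std::vector<Cursor>, std::greater<Cursor>> heap;

    std::size_t total = 0;
    for (std::size_t b = 0; b < chunks.size(); ++b) {
        total += chunks[b].size();
        if (!chunks[b].empty()) {
            heap.emplace(chunks[b][0].src, b, 0);
        }
    }

    std::vector<Edge> merged;
    merged.reserve(total);
    while (!heap.empty()) {
        auto [src, b, i] = heap.top();
        heap.pop();
        (void)src;  // only needed as the ordering key
        merged.push_back(chunks[b][i]);
        // Advance this chunk's cursor; ties on src fall back to batch position.
        if (i + 1 < chunks[b].size()) {
            heap.emplace(chunks[b][i + 1].src, b, i + 1);
        }
    }
    return merged;
}
```

Because ties on src are broken by batch position and each chunk is consumed front to back, the output order matches (src, batch_index, scan_order_within_batch) without a global sort.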
Tested via:
Loads wikidata in about 30 seconds and uses 42 GB of memory. In theory it should be using 750 million * 8 bytes = 6 GB + indptr (which is much smaller). Needs further debugging. While I was able to run PageRank on this graph, I wasn't able to run it correctly. Icebug requires both forward and reverse edges. Doing it in a way that fits on a 64 GB machine is elusive.
Further optimizations are possible, but require careful design. Specifically, peak RSS currently includes both the chunked and the merged results, so it should be possible to combine chunks in a more memory-efficient way.
Avoid expensive sort and merge when scanning REL tables in parallel
Follows the same general pattern as DuckDB's batch-index collector: each per-thread local collector is tagged with a unique batch_index; the collector stores per-batch chunks; final result is built by walking chunks in batch_index order (no pairwise merge, no global sort for the cheap path).
Highlights:
* New OrderPreservationType enum (NO_ORDER, INSERTION_ORDER, FIXED_ORDER) on PhysicalOperator, with default NO_ORDER (Ladybug makes no insertion-order guarantee). OrderBy and TopK override operatorOrder() to FIXED_ORDER.
* New PhysicalPlanUtil::getOrderPreservation walks the physical plan and asks each operator's metadata; returns FIXED_ORDER only if a fixed-order operator is on the data path, otherwise NO_ORDER. Walks child[0] only, since Ladybug physical plans have a single data-flow edge per operator.
* ArrowResultCollectorSharedState now stores per-batch Arrow chunks and CSR metadata in std::map<batch_index_t, ...>, plus a BatchIndexAssigner that hands out unique batch_index values per local collector (atomic fetch_add). Each local collector is assigned a batch_index at initLocalStateInternal time.
* merge() in the cheap NO_ORDER path moves per-batch chunks into the global map (O(log N) map insertion, no pairwise merge, no sort). flattenCSRMetadata does one global decode/sort/rebuild at result-construction time. The FIXED_ORDER path collapses to a single chunk under key 0 via the existing pairwise mergeCSRMetadata.
* Result construction (getQueryResult) walks the per-batch map in batch_index order: the map's natural ordering gives the correct global row order without sorting.
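The batch-index bookkeeping in the highlights above can be sketched as follows. This is a simplified illustration, not the PR's code: the `SharedState` class, its `merge`/`getResult` signatures, and the use of `std::string` as a stand-in for an Arrow chunk are all assumptions; only the ideas of an atomic fetch_add assigner and a std::map keyed by batch_index come from the description.

```cpp
#include <atomic>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

using batch_index_t = uint64_t;

// Hands out a distinct batch index to each local collector.
class BatchIndexAssigner {
public:
    batch_index_t next() {
        return counter.fetch_add(1, std::memory_order_relaxed);
    }

private:
    std::atomic<batch_index_t> counter{0};
};

// Hypothetical shared state; std::string stands in for an Arrow chunk.
class SharedState {
public:
    // Move a local collector's chunks under its batch index:
    // one O(log N) map lookup, no pairwise merge, no sort.
    void merge(batch_index_t batchIndex, std::vector<std::string> chunks) {
        std::lock_guard<std::mutex> lock(mtx);
        auto& slot = batches[batchIndex];
        for (auto& chunk : chunks) {
            slot.push_back(std::move(chunk));
        }
    }

    // Walk the map in key order; std::map iteration is already sorted
    // by batch index, so no extra sorting is needed here.
    std::vector<std::string> getResult() const {
        std::lock_guard<std::mutex> lock(mtx);
        std::vector<std::string> out;
        for (const auto& entry : batches) {
            out.insert(out.end(), entry.second.begin(), entry.second.end());
        }
        return out;
    }

private:
    mutable std::mutex mtx;
    std::map<batch_index_t, std::vector<std::string>> batches;
};
```

In this sketch the order in which threads call `merge` does not affect the result order, since the map key alone determines where a batch lands.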
Algorithmic cost drops from O(N^2 * M) for the previous pairwise merge (N threads, M edges) to O(M log M) for the new approach. Same correctness as before (decode/re-encode per chunk), but the scan-time hot path only does map insertions.
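Which path a query takes depends on the order-preservation check listed in the highlights. A minimal sketch of that check, under stated assumptions: the `PhysicalOperator` and `OrderBy` structs below are simplified stand-ins rather than Ladybug's real classes, and `getOrderPreservation` is written here as a free function that follows only `children[0]`, as the description says the real utility does.

```cpp
#include <memory>
#include <utility>
#include <vector>

enum class OrderPreservationType { NO_ORDER, INSERTION_ORDER, FIXED_ORDER };

// Simplified stand-in for a physical operator (hypothetical layout).
struct PhysicalOperator {
    std::vector<std::unique_ptr<PhysicalOperator>> children;
    virtual ~PhysicalOperator() = default;
    // Default: no ordering guarantee.
    virtual OrderPreservationType operatorOrder() const {
        return OrderPreservationType::NO_ORDER;
    }
};

// Stand-in for an ordering operator such as OrderBy or TopK.
struct OrderBy : PhysicalOperator {
    OrderPreservationType operatorOrder() const override {
        return OrderPreservationType::FIXED_ORDER;
    }
};

// Follow the single data-flow edge (children[0]) from the root downwards and
// report FIXED_ORDER if any operator on that path declares a fixed order.
OrderPreservationType getOrderPreservation(const PhysicalOperator& root) {
    const PhysicalOperator* op = &root;
    while (op != nullptr) {
        if (op->operatorOrder() == OrderPreservationType::FIXED_ORDER) {
            return OrderPreservationType::FIXED_ORDER;
        }
        op = op->children.empty() ? nullptr : op->children[0].get();
    }
    return OrderPreservationType::NO_ORDER;
}
```

A plan with no ordering operator on the path would take the cheap NO_ORDER merge; a plan containing OrderBy or TopK would take the FIXED_ORDER path.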