arrow-csr: Memory efficient zero-copy path by adsharma · Pull Request #626 · LadybugDB/ladybug

adsharma · 2026-06-28T05:37:27Z

Cuts peak RSS from 45GB to around 16GB.

Details in the design doc.

…ase 1)

The DirectArrowResultCollector path already skipped ArrowArray materialization, but each per-batch CSRMetadata still built a dense global indptr (numSourceRows+1 entries, ~800MB each) and padded its tail to numSourceRows. With ~50 parallel batches that duplicated indptr was the entire ~45GB blow-up on a 0.75B-edge scan (indices/edgeIDs were already at theoretical size).

Replace the dense per-batch indptr with a sparse (srcRows, counts) representation: one run per distinct source row the batch touched. This is valid because the rel scan emits edges in non-decreasing source order per thread (CSR storage + monotonic morsel acquisition) and each source node is scanned in exactly one morsel -> one thread, so per-batch srcRows sets are disjoint and sorted.

kwayMergeCSRChunks now consumes the sparse chunks and reconstructs the single dense global indptr (800MB) once, on first consumer request:

- single-chunk fast path moves indices/edgeIDs through without copying and synthesizes the indptr from (srcRows, counts) via prefix-sum;
- multi-chunk path does a k-way union of the disjoint sorted runs via a min-heap (O(totalSrcRows + totalEdges log numChunks)) instead of the old O(maxSrcRow x numChunks) sweep, with a disjointness invariant check.

mergeCSRMetadata (FIXED_ORDER pairwise path) is adapted to consume and produce the sparse shape.

The merged CSRMetadata retains the dense indptr + flat indices/edgeIDs, so getCSRArrowArrays(), makeCSRArrowArray (which already aliases the merged vector), and the existing test API (metadata.indptr[src], metadata.indices[idx]) are unchanged.

Steady-state memory: ~45GB -> ~7.6GB (no rel rowid) / ~13.6GB (with). Phase 2 (chunked zero-copy indices) will reach the ~6.8GB/12.8GB floor.

Design: docs/design_arrow_csr_zero_copy.md
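For readers skimming the Phase 1 message, the sparse (srcRows, counts) shape and the single-chunk prefix-sum can be sketched roughly as below. This is an illustrative sketch, not the code in this PR: the struct `SparseCSRChunk` and the function `buildDenseIndptr` are hypothetical names, the element type is assumed to be `uint64_t`, and edgeIDs are left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one per-batch sparse CSR chunk.
// Field names follow the commit message; the struct itself is an assumption.
struct SparseCSRChunk {
    std::vector<uint64_t> srcRows;  // distinct source rows touched, strictly increasing
    std::vector<uint64_t> counts;   // counts[i] = number of edges for srcRows[i]
    std::vector<uint64_t> indices;  // destination rows, grouped by source row
};

// Synthesize the dense indptr (numSourceRows + 1 entries) from one sparse chunk.
std::vector<uint64_t> buildDenseIndptr(const SparseCSRChunk& chunk, uint64_t numSourceRows) {
    assert(chunk.srcRows.size() == chunk.counts.size());
    std::vector<uint64_t> indptr(numSourceRows + 1, 0);
    // Scatter each run's edge count into slot srcRow + 1.
    for (std::size_t i = 0; i < chunk.srcRows.size(); ++i) {
        assert(chunk.srcRows[i] < numSourceRows);
        indptr[chunk.srcRows[i] + 1] = chunk.counts[i];
    }
    // A running prefix sum turns per-row counts into offsets, so the edges of
    // row r live in indices[indptr[r] .. indptr[r + 1]).
    for (uint64_t r = 0; r < numSourceRows; ++r) {
        indptr[r + 1] += indptr[r];
    }
    return indptr;
}
```

The per-batch cost is proportional to the number of source rows the batch actually touched; only the final dense indptr has numSourceRows + 1 entries, and it is built once.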

Phase 2 of the Arrow CSR zero-copy work.

With Phase 1 the steady-state per-batch indptr bloat (45GB) was gone, but combineCSRChunks() still copied the full per-batch index/edgeID vectors (~6GB) into the merge argument and held all per-batch vectors + the merged copy for the duration of the k-way merge.

Changes:

- combineCSRChunks() now std::moves csrChunks into kwayMergeCSRChunks, avoiding the ~6GB argument copy and leaving csrChunks empty (hasCSRMetadata() stays true via the cached combinedCSR).
- kwayMergeCSRChunks frees each per-batch chunk's vectors as soon as its source rows are exhausted by the min-heap, so the transient peak (per-batch + merged) stays low instead of holding both for the whole merge.
- Single-chunk path and validation now handle the legacy dense-indptr form (empty srcRows/counts, pre-set indptr) used by test fixtures.
- Replaced the brittle pointer-equality test with a value-equivalence check that verifies the exported Arrow buffers hold the merged CSR values without pinning the backing-pointer choice.

Reaches the ~6.8GB floor for the 0.75B-edge sorted-by-source scan. True chunked zero-copy (no merged copy at all) requires a table SORTED BY (src) DDL, to be added separately; the merge already validates disjointness and falls back safely if the table is unsorted.
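A rough sketch of the merge shape described above: chunks are handed over by move, a min-heap keyed on each chunk's next source row drives the k-way union, and a chunk's vectors are released as soon as its last run is consumed. The names `SparseChunk`, `MergedCSR`, and `mergeSparseChunks` are hypothetical, edgeIDs are omitted, and this sketch throws on a disjointness violation where the PR describes a safe fallback; it is not the actual kwayMergeCSRChunks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical types for illustration only.
struct SparseChunk {
    std::vector<uint64_t> srcRows;  // sorted, disjoint across chunks
    std::vector<uint64_t> counts;   // edges per entry of srcRows
    std::vector<uint64_t> indices;  // destination rows, grouped by source row
};
struct MergedCSR {
    std::vector<uint64_t> indptr;
    std::vector<uint64_t> indices;
};

// Takes the chunks by value so a caller can std::move them in without a copy.
MergedCSR mergeSparseChunks(std::vector<SparseChunk> chunks, uint64_t numSourceRows) {
    using Head = std::pair<uint64_t, std::size_t>;  // (next source row, chunk index)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    std::vector<std::size_t> run(chunks.size(), 0), edgePos(chunks.size(), 0);
    std::size_t totalEdges = 0;
    for (std::size_t c = 0; c < chunks.size(); ++c) {
        totalEdges += chunks[c].indices.size();
        if (!chunks[c].srcRows.empty()) heap.emplace(chunks[c].srcRows[0], c);
    }
    MergedCSR out;
    out.indptr.assign(numSourceRows + 1, 0);
    out.indices.reserve(totalEdges);
    bool havePrev = false;
    uint64_t prev = 0;
    while (!heap.empty()) {
        const uint64_t src = heap.top().first;
        const std::size_t c = heap.top().second;
        heap.pop();
        // Disjointness/sortedness check: source rows must strictly increase.
        if ((havePrev && src <= prev) || src >= numSourceRows) {
            throw std::runtime_error("chunks are not disjoint and sorted by source row");
        }
        havePrev = true;
        prev = src;
        SparseChunk& ch = chunks[c];
        const uint64_t n = ch.counts[run[c]];
        assert(edgePos[c] + n <= ch.indices.size());
        const auto first = ch.indices.begin() + static_cast<std::ptrdiff_t>(edgePos[c]);
        out.indices.insert(out.indices.end(), first, first + static_cast<std::ptrdiff_t>(n));
        out.indptr[src + 1] = n;
        edgePos[c] += n;
        if (++run[c] < ch.srcRows.size()) {
            heap.emplace(ch.srcRows[run[c]], c);
        } else {
            // Chunk exhausted: swap with empty vectors to release its memory now.
            std::vector<uint64_t>().swap(ch.srcRows);
            std::vector<uint64_t>().swap(ch.counts);
            std::vector<uint64_t>().swap(ch.indices);
        }
    }
    // Prefix sum converts per-row counts into offsets.
    for (uint64_t r = 0; r < numSourceRows; ++r) out.indptr[r + 1] += out.indptr[r];
    return out;
}
```

A caller would invoke this as `mergeSparseChunks(std::move(chunks), numSourceRows)`; the by-value parameter is then move-constructed, which leaves the caller's vector empty rather than duplicating the per-batch data.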

adsharma changed the title ~~Implement a more memory efficient zero-copy path for scanning REL tables~~ arrow-csr: Memory efficient zero-copy path for scanning REL tables Jun 28, 2026

adsharma changed the title ~~arrow-csr: Memory efficient zero-copy path for scanning REL tables~~ arrow-csr: Memory efficient zero-copy path Jun 28, 2026

adsharma force-pushed the skip_arrow_sort2 branch from 7866dbc to ce63471 Compare June 28, 2026 05:38

adsharma merged commit baa33ab into main Jun 28, 2026
4 checks passed

adsharma deleted the skip_arrow_sort2 branch June 28, 2026 14:55

adsharma merged 2 commits into main from skip_arrow_sort2
