Support Arrow relationship table scans by adsharma · Pull Request #460 · LadybugDB/ladybug

adsharma · 2026-05-06T16:41:19Z

Related: #183

aheev · 2026-05-07T14:51:40Z

 void ArrowNodeTable::initScanState([[maybe_unused]] transaction::Transaction* transaction,
    TableScanState& scanState, [[maybe_unused]] bool resetCachedBoundNodeSelVec) const {
-    auto& arrowScanState = scanState.cast<ArrowNodeTableScanState>();
+    auto& arrowScanState = scanState.cast<ColumnarNodeTableScanState>();


My understanding is that Arrow and Parquet scans are fundamentally different, and the difference can be attributed to their underlying storage. Because Arrow tables sit in memory, we can completely own the parallelism. For Parquet, the rowGroup size is not always the same, so we can't distribute rows uniformly across morsels every time (when numMorsels != numRowGroups).

Take a look at getNextMorsel, for example. The Arrow morsel size is constant (decided by us), whereas for Parquet it is the rowGroup size. Within each morsel, parquet_node_table runs another round of batching (size: 2048) internally.

Maybe that's why separate scanStates were added to extend common columnarTableBase

PS: we can still make it consistent, but that requires significant surgery at the PhysicalOperator level.

My inclination is to make mixed tables work (it's made easier by unifying the scan states), add tests and then specialize them for performance.
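
For readers following the thread, the difference described above can be sketched roughly as follows. This is a minimal illustration, not the LadybugDB implementation: `Morsel`, `planFixedMorsels`, `planRowGroupMorsels`, and `splitIntoBatches` are hypothetical names, and only the batch size of 2048 comes from the comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration only; not the LadybugDB API.
struct Morsel {
    uint64_t startRow; // inclusive
    uint64_t endRow;   // exclusive
};

// Arrow-style planning: the data is in memory, so the scanner can pick
// one constant morsel size and cut the table evenly.
std::vector<Morsel> planFixedMorsels(uint64_t numRows, uint64_t morselSize) {
    assert(morselSize > 0);
    std::vector<Morsel> morsels;
    for (uint64_t start = 0; start < numRows; start += morselSize) {
        morsels.push_back({start, std::min(start + morselSize, numRows)});
    }
    return morsels;
}

// Parquet-style planning: one morsel per row group, so morsel sizes
// follow whatever the file layout happens to be.
std::vector<Morsel> planRowGroupMorsels(const std::vector<uint64_t>& rowGroupSizes) {
    std::vector<Morsel> morsels;
    uint64_t start = 0;
    for (uint64_t size : rowGroupSizes) {
        morsels.push_back({start, start + size});
        start += size;
    }
    return morsels;
}

// Inside a single morsel, rows are emitted in batches of at most batchSize.
std::vector<Morsel> splitIntoBatches(const Morsel& morsel, uint64_t batchSize = 2048) {
    assert(batchSize > 0);
    std::vector<Morsel> batches;
    for (uint64_t start = morsel.startRow; start < morsel.endRow; start += batchSize) {
        batches.push_back({start, std::min(start + batchSize, morsel.endRow)});
    }
    return batches;
}
```

With row groups of 5000 and 100 rows, the Parquet-style plan yields two morsels of very different sizes, and only the first needs more than one internal batch. That imbalance is the scheduling concern raised in the comment.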

aheev · 2026-05-07T14:58:59Z

    // Each scan state needs to be able to read data independently for parallel scanning
+    arrowScanState.outputToColumnarColumnIdx.assign(scanState.columnIDs.size(), -1);
+    auto tableCatalogEntry = getCatalogEntry();
+    for (size_t col = 0; col < scanState.columnIDs.size(); ++col) {


This should belong to initScanCoordination

initScanCoordination does not have the operator-specific projected columnIDs; it only coordinates table-level morsel assignment. The Arrow column mapping depends on the current scan’s requested output columns, so I kept it derived at scan/init-scan time from table metadata rather than storing it as global coordination state.
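
A rough sketch of what such a per-scan mapping might look like is below. It assumes the mapping is resolved by column name, which the thread does not confirm; `ColumnNames` is a hypothetical stand-in for the table catalog entry, and `buildOutputToColumnarColumnIdx` is an illustrative helper name rather than a function from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the table catalog entry: column ID -> property name.
using ColumnNames = std::unordered_map<uint32_t, std::string>;

// For each projected output column, find the index of the matching Arrow field.
// Columns with no matching field keep the sentinel value -1.
std::vector<int64_t> buildOutputToColumnarColumnIdx(const std::vector<uint32_t>& columnIDs,
    const ColumnNames& catalogColumnNames, const std::vector<std::string>& arrowFieldNames) {
    std::vector<int64_t> mapping;
    mapping.assign(columnIDs.size(), -1);
    for (size_t col = 0; col < columnIDs.size(); ++col) {
        auto it = catalogColumnNames.find(columnIDs[col]);
        if (it == catalogColumnNames.end()) {
            continue; // unknown column ID: leave the sentinel in place
        }
        for (size_t field = 0; field < arrowFieldNames.size(); ++field) {
            if (arrowFieldNames[field] == it->second) {
                mapping[col] = static_cast<int64_t>(field);
                break;
            }
        }
    }
    assert(mapping.size() == columnIDs.size());
    return mapping;
}
```

Because the result depends on the projected columnIDs of the current scan, two operators scanning the same table with different projections get different mappings, which is the argument for deriving it at scan time rather than in shared coordination state.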

adsharma · 2026-05-07T18:26:32Z

Changes:

Moved Arrow output-column mapping out of scan state into Arrow table helper methods.
Kept mixed native/Arrow/Parquet scan compatibility through shared cursor/key state where needed.
Fixed the Arrow rel scan cursor so it does not skip the first row for the next bound node.
Added ScanArrowRelTableOverNativeNodeTable coverage.

Consolidate columnar node scan state so mixed scans can reuse one state across native, Arrow, and Parquet-backed tables without hitting Arrow-only checked casts. Also keep multi-rel scan state valid across columnar and native rel tables by initializing each table before scan and reusing the shared output selection vector instead of replacing it with Arrow/Parquet-sized vectors.

adsharma · 2026-05-07T18:46:20Z

Removed Arrow-specific currentBatchIdx/currentMorselStartOffset/currentMorselEndOffset from ColumnarNodeTableScanState.
Moved those fields back onto ArrowNodeTableScanState.
Changed ScanNodeTable to create the correct scan-state type per current node table instead of forcing one columnar state to cover Arrow and Parquet. That preserves mixed-table behavior without leaking Arrow cursor state into the base.
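
The per-table scan-state approach could be pictured along these lines. This is a simplified sketch under assumptions: the class layout, the `createScanState` virtual, `ParquetNodeTableScanState`, and the `createScanStates` helper are hypothetical; only the ColumnarNodeTableScanState and ArrowNodeTableScanState names and the three Arrow cursor fields are taken from the comment.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

struct TableScanState {
    virtual ~TableScanState() = default;
};

// Shared columnar base: carries no Arrow-specific cursor fields.
struct ColumnarNodeTableScanState : TableScanState {};

// Arrow cursor state lives only on the Arrow-specific scan state.
struct ArrowNodeTableScanState : ColumnarNodeTableScanState {
    uint64_t currentBatchIdx = 0;
    uint64_t currentMorselStartOffset = 0;
    uint64_t currentMorselEndOffset = 0;
};

// Hypothetical Parquet counterpart, for illustration.
struct ParquetNodeTableScanState : ColumnarNodeTableScanState {};

// Each table type reports which scan state it needs (hypothetical hook).
struct NodeTable {
    virtual ~NodeTable() = default;
    virtual std::unique_ptr<TableScanState> createScanState() const {
        return std::make_unique<TableScanState>();
    }
};

struct ArrowNodeTable : NodeTable {
    std::unique_ptr<TableScanState> createScanState() const override {
        return std::make_unique<ArrowNodeTableScanState>();
    }
};

struct ParquetNodeTable : NodeTable {
    std::unique_ptr<TableScanState> createScanState() const override {
        return std::make_unique<ParquetNodeTableScanState>();
    }
};

// A mixed scan builds one correctly typed state per table instead of
// sharing a single state that every table would have to cast.
std::vector<std::unique_ptr<TableScanState>> createScanStates(
    const std::vector<const NodeTable*>& tables) {
    std::vector<std::unique_ptr<TableScanState>> states;
    for (const NodeTable* table : tables) {
        assert(table != nullptr);
        states.push_back(table->createScanState());
    }
    return states;
}
```

In this arrangement a checked cast to the Arrow state only ever sees a state created by an Arrow table, so mixed native/Arrow/Parquet scans do not trip Arrow-only casts.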

adsharma · 2026-05-07T19:37:50Z

@aheev - there are two unresolved comments that may require further discussion. Let's discuss them on an issue/discussion if necessary.

I want to close on the python + arrow bindings and share some sample code/notebooks.

aheev · 2026-05-08T16:51:03Z

Sure, we'll get back to them next week.

adsharma requested a review from aheev May 6, 2026 16:41

adsharma force-pushed the ladybug-arrow-fix branch 7 times, most recently from e04036e to e5cd068 Compare May 7, 2026 00:59

aheev reviewed May 7, 2026

Support Arrow relationship table scans

9130d12

adsharma force-pushed the ladybug-arrow-fix branch from e5cd068 to 2ae9573 Compare May 7, 2026 18:25

adsharma force-pushed the ladybug-arrow-fix branch from 2ae9573 to cd13d24 Compare May 7, 2026 18:45

adsharma merged commit 22030ee into main May 7, 2026
4 checks passed

adsharma deleted the ladybug-arrow-fix branch May 7, 2026 19:37

adsharma merged 2 commits into main from ladybug-arrow-fix
