Skip to content

Support Arrow relationship table scans#460

Merged
adsharma merged 2 commits intomainfrom
ladybug-arrow-fix
May 7, 2026
Merged

Support Arrow relationship table scans#460
adsharma merged 2 commits intomainfrom
ladybug-arrow-fix

Conversation

@adsharma
Copy link
Copy Markdown
Contributor

@adsharma adsharma commented May 6, 2026

Related: #183

@adsharma adsharma requested a review from aheev May 6, 2026 16:41
@adsharma adsharma force-pushed the ladybug-arrow-fix branch 7 times, most recently from e04036e to e5cd068 Compare May 7, 2026 00:59
Comment thread src/include/storage/table/columnar_node_table_base.h Outdated
Comment thread src/storage/table/arrow_node_table.cpp Outdated
void ArrowNodeTable::initScanState([[maybe_unused]] transaction::Transaction* transaction,
TableScanState& scanState, [[maybe_unused]] bool resetCachedBoundNodeSelVec) const {
auto& arrowScanState = scanState.cast<ArrowNodeTableScanState>();
auto& arrowScanState = scanState.cast<ColumnarNodeTableScanState>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that Arrow and Parquet scans are fundamentally different. The difference can be attribute to their inherent storage. Because arrow tables sit in memory, we can completely own the parallelism. As for parquet, rowGroup size is not always the same. So, we can't distribute them uniformly across morsels everytime(when numMorsels != numRowGroups)

Take a look at getNextMorsel for example. arrow morsel size is constant(decided by us) where as for parquet, it is the rowGroup size. In each morsel, parquet_node_table runs another batching(size: 2048) internally

Maybe that's why separate scanStates were added to extend common columnarTableBase

PS: we can still make it consistent, but requires significant surgery at PhysicalOperator level

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My inclination is to make mixed tables work (it's made easier by unifying the scan states), add tests and then specialize them for performance.

Comment thread src/include/storage/table/columnar_node_table_base.h Outdated
Comment thread src/include/storage/table/rel_table.h
Comment thread src/storage/table/arrow_node_table.cpp
Comment thread src/storage/table/arrow_node_table.cpp Outdated
// Each scan state needs to be able to read data independently for parallel scanning
arrowScanState.outputToColumnarColumnIdx.assign(scanState.columnIDs.size(), -1);
auto tableCatalogEntry = getCatalogEntry();
for (size_t col = 0; col < scanState.columnIDs.size(); ++col) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should belong to initScanCoordination

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initScanCoordination does not have the operator-specific projected columnIDs; it only coordinates table-level morsel assignment. The Arrow column mapping depends on the current scan’s requested output columns, so I kept it derived at scan/init-scan time from table metadata rather than storing it as global coordination state.

Comment thread test/api/arrow_rel_table_test.cpp
@adsharma adsharma force-pushed the ladybug-arrow-fix branch from e5cd068 to 2ae9573 Compare May 7, 2026 18:25
@adsharma
Copy link
Copy Markdown
Contributor Author

adsharma commented May 7, 2026

Changes:

  • Moved Arrow output-column mapping out of scan state into Arrow table helper methods.
  • Kept mixed native/Arrow/Parquet scan compatibility through shared cursor/key state where needed.
  • Fixed the Arrow rel scan cursor so it does not skip the first row for the next bound node.
  • Added ScanArrowRelTableOverNativeNodeTable coverage.

Consolidate columnar node scan state so mixed scans can reuse one state across native, Arrow, and Parquet-backed tables without hitting Arrow-only checked casts.

Also keep multi-rel scan state valid across columnar and native rel tables by initializing each table before scan and reusing the shared output selection vector instead of replacing it with Arrow/Parquet-sized vectors.
@adsharma adsharma force-pushed the ladybug-arrow-fix branch from 2ae9573 to cd13d24 Compare May 7, 2026 18:45
@adsharma
Copy link
Copy Markdown
Contributor Author

adsharma commented May 7, 2026

  • Removed Arrow-specific currentBatchIdx/currentMorselStartOffset/currentMorselEndOffset from ColumnarNodeTableScanState.
  • Moved those fields back onto ArrowNodeTableScanState.
  • Changed ScanNodeTable to create the correct scan-state type per current node table instead of forcing one columnar state to cover Arrow and Parquet. That preserves mixed-table behavior without leaking Arrow cursor state into the base.

@adsharma
Copy link
Copy Markdown
Contributor Author

adsharma commented May 7, 2026

@aheev - there are two unresolved comments that may require further discussion. Let's discuss them on an issue/discussion if necessary.

I want to close on the python + arrow bindings and share some sample code/notebooks.

@adsharma adsharma merged commit 22030ee into main May 7, 2026
4 checks passed
@adsharma adsharma deleted the ladybug-arrow-fix branch May 7, 2026 19:37
@aheev
Copy link
Copy Markdown
Contributor

aheev commented May 8, 2026

@aheev - there are two unresolved comments that may require further discussion. Let's discuss them on an issue/discussion if necessary.

I want to close on the python + arrow bindings and share some sample code/notebooks.

sure we'll get back to them next week

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants