e04036e to e5cd068
  void ArrowNodeTable::initScanState([[maybe_unused]] transaction::Transaction* transaction,
      TableScanState& scanState, [[maybe_unused]] bool resetCachedBoundNodeSelVec) const {
-     auto& arrowScanState = scanState.cast<ArrowNodeTableScanState>();
+     auto& arrowScanState = scanState.cast<ColumnarNodeTableScanState>();
My understanding is that Arrow and Parquet scans are fundamentally different, and the difference can be attributed to their underlying storage. Because Arrow tables sit in memory, we can completely own the parallelism. For Parquet, row group sizes are not always the same, so we can't distribute them uniformly across morsels every time (when numMorsels != numRowGroups).
Take a look at getNextMorsel, for example. The Arrow morsel size is constant (decided by us), whereas for Parquet it is the row group size. Within each morsel, parquet_node_table runs another round of batching internally (batch size: 2048).
Maybe that's why separate scan states were added to extend the common columnarTableBase.
PS: we can still make it consistent, but that requires significant surgery at the PhysicalOperator level.
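To make the contrast concrete, here is a minimal sketch of the two morsel-planning styles described above. It is not the project's code: `Morsel`, `planFixedMorsels`, `planRowGroupMorsels`, and `numInnerBatches` are hypothetical names, and the only figure taken from the discussion is the inner batch size of 2048.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical illustration type; not the project's actual class.
struct Morsel {
    uint64_t startRow; // first row covered by this morsel
    uint64_t numRows;  // number of rows in this morsel
};

// Arrow-style planning: the data is in memory, so the scanner chooses a
// constant morsel size and only the last morsel may be shorter.
std::vector<Morsel> planFixedMorsels(uint64_t totalRows, uint64_t morselSize) {
    assert(morselSize > 0);
    std::vector<Morsel> morsels;
    for (uint64_t start = 0; start < totalRows; start += morselSize) {
        morsels.push_back({start, std::min(morselSize, totalRows - start)});
    }
    return morsels;
}

// Parquet-style planning: one morsel per row group, so morsel sizes are
// dictated by the file layout and can differ widely.
std::vector<Morsel> planRowGroupMorsels(const std::vector<uint64_t>& rowGroupSizes) {
    std::vector<Morsel> morsels;
    uint64_t start = 0;
    for (uint64_t size : rowGroupSizes) {
        morsels.push_back({start, size});
        start += size;
    }
    return morsels;
}

// Within one row-group morsel, rows are consumed in fixed-size batches.
// Returns how many batches are needed to drain the morsel.
uint64_t numInnerBatches(const Morsel& morsel, uint64_t batchSize = 2048) {
    assert(batchSize > 0);
    return (morsel.numRows + batchSize - 1) / batchSize;
}
```

With row groups of 5000, 100, and 3000 rows, the Parquet-style plan yields three morsels of very different cost, while the fixed-size plan keeps every morsel except the last one equal. That imbalance is the scheduling difference being pointed out.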
My inclination is to make mixed tables work (which is made easier by unifying the scan states), add tests, and then specialize them for performance.
// Each scan state needs to be able to read data independently for parallel scanning
arrowScanState.outputToColumnarColumnIdx.assign(scanState.columnIDs.size(), -1);
auto tableCatalogEntry = getCatalogEntry();
for (size_t col = 0; col < scanState.columnIDs.size(); ++col) {
This should belong in initScanCoordination.
initScanCoordination does not have the operator-specific projected columnIDs; it only coordinates table-level morsel assignment. The Arrow column mapping depends on the current scan’s requested output columns, so I kept it derived at scan/init-scan time from table metadata rather than storing it as global coordination state.
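As a rough sketch of what deriving the mapping per scan could look like: the helper below takes the scan's projected column IDs plus a table-level lookup and produces the output-to-source index vector, with -1 for columns the columnar source does not provide. `buildOutputToColumnarColumnIdx` and the `columnIDToSourceIdx` map are assumptions for illustration; the real code resolves columns through the table catalog entry.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical helper: for each projected output column, find the index of
// the matching column in the columnar source, or leave -1 if there is none.
std::vector<int64_t> buildOutputToColumnarColumnIdx(
    const std::vector<uint64_t>& projectedColumnIDs,
    const std::unordered_map<uint64_t, int64_t>& columnIDToSourceIdx) {
    std::vector<int64_t> mapping;
    // Start with "not present" for every requested output column.
    mapping.assign(projectedColumnIDs.size(), -1);
    for (std::size_t col = 0; col < projectedColumnIDs.size(); ++col) {
        auto it = columnIDToSourceIdx.find(projectedColumnIDs[col]);
        if (it != columnIDToSourceIdx.end()) {
            mapping[col] = it->second;
        }
    }
    return mapping;
}
```

Because the result depends only on the projection and on immutable table metadata, each scan state can compute its own copy without touching shared coordination state.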
e5cd068 to 2ae9573
Changes:
Consolidate columnar node scan state so mixed scans can reuse one state across native, Arrow, and Parquet-backed tables without hitting Arrow-only checked casts. Also keep multi-rel scan state valid across columnar and native rel tables by initializing each table before scan and reusing the shared output selection vector instead of replacing it with Arrow/Parquet-sized vectors.
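The idea of one reusable state can be sketched as follows. This is an assumption-laden illustration, not the PR's implementation: `Backend`, `SharedScanState`, and `initForTable` are hypothetical names. It shows a single state object being re-initialized per table while the shared output selection vector is left in place rather than replaced.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tag for the storage behind the table currently being scanned.
enum class Backend { Native, Arrow, Parquet };

// Hypothetical unified state: one type serves every backend, so no
// backend-specific checked cast is needed when the scanned table changes.
struct SharedScanState {
    Backend backend = Backend::Native;
    std::vector<uint32_t> outputSelVec;             // shared across tables
    std::vector<int64_t> outputToColumnarColumnIdx; // rebuilt per table
};

// Called before scanning each table. Per-table fields are reset; the shared
// selection vector is deliberately left untouched.
void initForTable(SharedScanState& state, Backend backend, std::size_t numOutputColumns) {
    state.backend = backend;
    state.outputToColumnarColumnIdx.assign(numOutputColumns, -1);
}
```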
2ae9573 to cd13d24
@aheev - there are two unresolved comments that may require further discussion. Let's discuss them on an issue/discussion if necessary. I want to close on the Python + Arrow bindings and share some sample code/notebooks.
Sure, we'll get back to them next week.
Related: #183