fix parquet table scans #473
Conversation
|
@adsharma could you PTAL? |
|
benchmark is hanging. Lemme fix it |
|
minimal_test succeeds every time locally. Not sure why it fails in CI. Lemme take a look |
The CI tests run as relwithdebinfo with assertions enabled. Perhaps you're running the release builds locally? |
Understood. I am running test-build |
|
The latest commit should fix the build |
adsharma
left a comment
I'll go ahead and merge this one because it fixes many problems.
Let's continue fixing the more complex query cases where there may be duplicate tuples in the input/bound side in future PRs.
  ParquetRelTableScanState& parquetRelScanState,
  const std::vector<uint64_t>& rowGroupsToProcess,
- const std::unordered_set<common::offset_t>& boundNodeOffsets);
+ const std::unordered_map<common::offset_t, common::sel_t>& boundNodeOffsets);
A node can appear multiple times in a factorized/vectorized input because the same bound node may be paired with different upstream values. Example:
MATCH (seed:user), (a:user {id: 100})
WITH seed, a
MATCH (a)-[:follows]->(b)
RETURN seed.name, b.name
Here a is the same node in every tuple, but each tuple has a different seed. If the scanner stores only:
unordered_map<offset_t, sel_t>
then duplicate a.offset values collapse to one sel_t.
This is not a problem for native tables. Should we use the same method?
Native rel table scan does not use unordered_map<offset_t, vector<sel_t>>.
It uses the original selection vector directly and processes bound-node positions in order:
- RelTableScanState::cachedBoundNodeSelVector stores the input sel_ts.
- RelTableScanState::currBoundNodeIdx is an index into that cached selection vector.
- Native CSR scan repeatedly reads:
cachedBoundNodeSelVector[currBoundNodeIdx]
then uses that selected row to get the bound node offset.
Relevant files:
- src/include/storage/table/rel_table.h:22
- src/storage/table/rel_table.cpp:91
- src/storage/table/csr_node_group.cpp:181
So duplicates are naturally preserved because the scan state is position-driven, not offset-keyed. If the same
node offset appears in three input rows, it appears three times in cachedBoundNodeSelVector, and native scan can
flatten back to each sel_t separately.
The parquet path’s unordered_map<offset_t, sel_t> is different: it indexes by node offset, so duplicate offsets
collapse. That is less general than native CSR behavior. A closer parquet equivalent would be either:
unordered_map<offset_t, vector<sel_t>>
or, more native-like, avoid offset-key ownership and drive scanning from cachedBoundNodeSelVector /
currBoundNodeIdx, preserving each input position independently.
drive scanning from cachedBoundNodeSelVector
This is the plan. I have already implemented this in my previous PR but had to revert it to avoid risk. It also helps us use indptr for fwd scans at least
A node can appear multiple times in a factorized/vectorized input because the same bound node may be paired with different upstream values
Right now it doesn't cause issues because parquet_node_table always sends one offset at a time. We might want to revisit this if node_table behaviour changes, although I'm not sure if there's aggregation going on at the higher (operator) level
Our test coverage doesn't exercise many such self-join and complex code paths that would catch a less general implementation. We might have to create 3-hop tests that specifically catch these cases before our users do.
|
benchmark times for first 10 queries. Full suite has been running for more than half a day now |
|
Anything that takes that long is hard to reproduce and will slow us down. How about we look at query plans for: and compare them to native tables to see if there is a slowdown in the parquet path? I would disable hash indices for that work so disk usage/access is comparable. |
I think parquet would be slower for sure because we are effectively not using indptrs, and parquet doesn't have an index, so some queries might be even slower. I will run and see the difference |
|
Ran into a lot of issues while running a disable_hash_index build, so I went ahead with the 0.16.1 official release. The times are not even comparable: db size was 1.2 GB, whereas the icebug-disk parquet files were |
|
aha! failing on large queries, even with 32 GB max_db_size |
|
Without the hash_index it's even faster? lbdb size is 1 GB now. Will run the bigger queries tomorrow |
|
Same as native with hash_index: failed at query 24 with a Buffer Manager exception. But still way faster than parquet |
|
@adsharma how can I verify whether hash_index is created or not? Size on disk? If you look above, the queries w/o index are faster |
|
Run CALL show_indexes() RETURN *; and check the size on disk. |
|
Good to see benchmarks are better without hash indexes! |
|
4M * 8
it returns empty on both 0.16.1 and latest |
|
I remember fixing it, but maybe the PR didn't go out. Let me push the ART index branch, which could help you. Still validating it. |
|
Probably easier to create a second demo db with the patterns you want. No one has seen this commit; we could just reset HEAD to the previous version. |