fix parquet table scans #473

Merged
adsharma merged 4 commits into LadybugDB:main from aheev:icedisk-impl2 on May 10, 2026

Conversation

@aheev
Contributor

aheev commented May 10, 2026

  • fixed multi-hop, var-len query scans in parquet tables by adding isVisible funcs and setting selVectorToFlat in each scan (see the sketch after this list)
  • refactored parquet_rel_table scan to return only rows of a single boundNode
  • removed unnecessary scanState data
  • expanded test suite
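
For context, a minimal sketch of the two mechanisms named in the first bullet, using hypothetical names and types rather than the actual Ladybug API: a per-row isVisible predicate that a scan consults before emitting a tuple, and a selection vector reset to the flat (identity) state before each scan fills it.

  // Hypothetical sketch, not the actual Ladybug types: a per-row visibility
  // check plus resetting a selection vector to the flat (identity) state so
  // a fresh scan starts unfiltered.
  #include <cstdint>
  #include <numeric>
  #include <vector>

  using sel_t = uint32_t;
  using offset_t = uint64_t;

  struct SelectionVector {
      std::vector<sel_t> positions; // selected row positions within the vector
      void setToFlat(size_t numRows) {
          positions.resize(numRows);
          std::iota(positions.begin(), positions.end(), sel_t{0}); // 0..n-1
      }
  };

  // Illustrative visibility predicate: hide rows past the committed snapshot.
  bool isVisible(offset_t rowOffset, offset_t numCommittedRows) {
      return rowOffset < numCommittedRows;
  }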

@aheev
Contributor Author

aheev commented May 10, 2026

@adsharma could you PTAL?

@aheev
Contributor Author

aheev commented May 10, 2026

The benchmark is hanging. Lemme fix it.

@aheev
Contributor Author

aheev commented May 10, 2026

minimal_test succeeds every time locally. Not sure why it fails in CI. Lemme take a look.

@adsharma
Contributor

Unexpected error for query: Assertion failed in file "/home/runner/work/ladybug/ladybug/src/common/vector/value_vector.cpp" on line 135: srcVector->dataType.getPhysicalType() == dataType.getPhysicalType()

The CI tests run as relwithdebinfo with assertions enabled. Perhaps you're running the release builds locally?

@aheev
Contributor Author

aheev commented May 10, 2026

  Unexpected error for query: Assertion failed in file "/home/runner/work/ladybug/ladybug/src/common/vector/value_vector.cpp" on line 135: srcVector->dataType.getPhysicalType() == dataType.getPhysicalType()

  The CI tests run as relwithdebinfo with assertions enabled. Perhaps you're running the release builds locally?

Understood. I am running test-build.

@aheev
Contributor Author

aheev commented May 10, 2026

The latest commit should fix the build

Contributor

adsharma left a comment
I'll go ahead and merge this one because it fixes many problems.

Let's continue fixing the more complex query cases where there may be duplicate tuples in the input/bound side in future PRs.

      ParquetRelTableScanState& parquetRelScanState,
      const std::vector<uint64_t>& rowGroupsToProcess,
  -   const std::unordered_set<common::offset_t>& boundNodeOffsets);
  +   const std::unordered_map<common::offset_t, common::sel_t>& boundNodeOffsets);
Contributor

adsharma May 10, 2026

A node can appear multiple times in a factorized/vectorized input because the same bound node may be paired with different upstream values. Example:

  MATCH (seed:user), (a:user {id: 100})
  WITH seed, a
  MATCH (a)-[:follows]->(b)
  RETURN seed.name, b.name

Here a is the same node in every tuple, but each tuple has a different seed. If the scanner stores only:

  unordered_map<offset_t, sel_t>

then duplicate a.offset values collapse to one sel_t.
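
A tiny standalone demo of the collapse (illustrative, not the Ladybug code): inserting the same node offset twice into an offset-keyed map keeps only the last sel_t, while a vector-valued map preserves both input positions.

  // Two input tuples bind the same node offset 100 at positions 0 and 1.
  #include <cstdint>
  #include <iostream>
  #include <unordered_map>
  #include <vector>

  using offset_t = uint64_t;
  using sel_t = uint32_t;

  int main() {
      std::unordered_map<offset_t, sel_t> collapsing;
      collapsing[100] = 0;
      collapsing[100] = 1;                         // overwrites position 0
      std::cout << collapsing.size() << "\n";      // prints 1: a tuple is lost

      std::unordered_map<offset_t, std::vector<sel_t>> preserving;
      preserving[100].push_back(0);
      preserving[100].push_back(1);                // both positions survive
      std::cout << preserving[100].size() << "\n"; // prints 2
  }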

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a problem for native tables. Should we use the same method?

Native rel table scan does not use unordered_map<offset_t, vector<sel_t>>.

It uses the original selection vector directly and processes bound-node positions in order:

  • RelTableScanState::cachedBoundNodeSelVector stores the input sel_ts.
  • RelTableScanState::currBoundNodeIdx is an index into that cached selection vector.
  • Native CSR scan repeatedly reads:

  cachedBoundNodeSelVector[currBoundNodeIdx]

then uses that selected row to get the bound node offset.

Relevant files:

  • src/include/storage/table/rel_table.h:22
  • src/storage/table/rel_table.cpp:91
  • src/storage/table/csr_node_group.cpp:181

So duplicates are naturally preserved because the scan state is position-driven, not offset-keyed. If the same
node offset appears in three input rows, it appears three times in cachedBoundNodeSelVector, and native scan can
flatten back to each sel_t separately.

The parquet path’s unordered_map<offset_t, sel_t> is different: it indexes by node offset, so duplicate offsets
collapse. That is less general than native CSR behavior. A closer parquet equivalent would be either:

  unordered_map<offset_t, vector<sel_t>>

or, more native-like, avoid offset-key ownership and drive scanning from cachedBoundNodeSelVector /
currBoundNodeIdx, preserving each input position independently.
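
A rough position-driven loop in the spirit of the native CSR scan described above; the names mirror the cited fields, but this is an illustrative sketch, not the actual implementation.

  // Illustrative sketch: drive the scan by input position, not by node offset,
  // so duplicate bound-node offsets are visited once per input tuple.
  #include <cstdint>
  #include <vector>

  using offset_t = uint64_t;
  using sel_t = uint32_t;

  struct ScanState {
      std::vector<sel_t> cachedBoundNodeSelVector; // input sel_ts, in order
      size_t currBoundNodeIdx = 0;                 // cursor into the cache
  };

  void scanAllBoundNodes(ScanState& state, const std::vector<offset_t>& boundOffsets) {
      while (state.currBoundNodeIdx < state.cachedBoundNodeSelVector.size()) {
          sel_t pos = state.cachedBoundNodeSelVector[state.currBoundNodeIdx];
          offset_t boundNode = boundOffsets[pos];
          // ... scan the rel list for boundNode and emit results for `pos` ...
          (void)boundNode;
          ++state.currBoundNodeIdx;
      }
  }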

Contributor Author

  drive scanning from cachedBoundNodeSelVector

This is the plan. I had already implemented this in my previous PR but had to revert it to avoid risk. It also lets us use indptr for fwd scans, at least.

  A node can appear multiple times in a factorized/vectorized input because the same bound node may be paired with different upstream values

Right now it doesn't cause issues because parquet_node_table always sends one offset at a time. We might want to revisit this if node_table behaviour changes, although I'm not sure whether there's aggregation going on at the higher (operator) level.

Contributor

Our test coverage doesn't exercise many such self-join and complex code paths that would catch a less general implementation. We might have to create 3-hop tests that specifically catch these cases before our users do.

adsharma merged commit ec3f0b6 into LadybugDB:main on May 10, 2026
4 checks passed
@aheev
Contributor Author

aheev commented May 11, 2026

Benchmark times for the first 10 queries. The full suite has been running for more than half a day now.

==================================================================================================
Query                          Description                                         parquet avg(ms)
==================================================================================================
q01_count_nodes                Count all nodes                                               2.3
q02_count_edges_meta           Count all edges (metadata fast path)                          0.3
q03_outdeg_high                Out-degree of high-degree node (9766, deg=14                  3.5
q04_outdeg_med                 Out-degree of medium-degree node (3, deg=454                 69.7
q05_outdeg_low                 Out-degree of low-degree node (1000)                         78.0
q06_top10_outdeg               Top-10 nodes by out-degree                               146411.1
q07_count_active_src           Count nodes with at least one out-edge                   146435.0
q08_full_scan_rel              Full edge scan — count bound rel variable                146376.4
q09_full_scan_star             Full edge scan — count(*) (no rel var)                        0.2
q10_outdeg_node_a              Out-degree of second high-degree node (9765,                  2.3

aheev deleted the icedisk-impl2 branch May 11, 2026 00:57
@adsharma
Contributor

Anything that takes that long is hard to reproduce and will slow us down. How about we look at query plans for:

q06_top10_outdeg               Top-10 nodes by out-degree                               146411.1
q07_count_active_src           Count nodes with at least one out-edge                   146435.0
q08_full_scan_rel              Full edge scan — count bound rel variable                146376.4

And compare them to native tables to see if there is a slowdown in the parquet path? I would disable hash indices for that work so disk usage/access is comparable.

@aheev
Contributor Author

aheev commented May 11, 2026

  Anything that takes that long is hard to reproduce and will slow us down. How about we look at query plans for:

  q06_top10_outdeg               Top-10 nodes by out-degree                               146411.1
  q07_count_active_src           Count nodes with at least one out-edge                   146435.0
  q08_full_scan_rel              Full edge scan — count bound rel variable                146376.4

  And compare them to native tables to see if there is a slowdown in the parquet path? I would disable hash indices for that work so disk usage/access is comparable.

I think parquet would be slower for sure because we are effectively not using indptrs, and parquet doesn't have an index, so some queries might be even slower.

I will run and see the difference.
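
For readers unfamiliar with the layout: indptr is the CSR offsets array, so a bound node's out-neighbors form one contiguous slice of the indices array (the indptr_follows.parquet / indices_follows.parquet files shown below). A minimal sketch of why using it matters, illustrative rather than the actual parquet reader code:

  // Minimal CSR sketch: with indptr, the out-neighbors of node u are the
  // contiguous slice indices[indptr[u] .. indptr[u+1]), so an out-degree is
  // O(1) instead of requiring a full edge scan.
  #include <cstdint>
  #include <vector>

  using offset_t = uint64_t;

  struct Csr {
      std::vector<offset_t> indptr;  // size = numNodes + 1
      std::vector<offset_t> indices; // neighbor ids, grouped by source node
  };

  offset_t outDegree(const Csr& g, offset_t u) {
      return g.indptr[u + 1] - g.indptr[u];
  }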

@aheev
Contributor Author

aheev commented May 11, 2026

Ran into a lot of issues while running a disable_hash_index build, so I went ahead with the 0.16.1 official release.

The times are not even comparable.

python3 benchmark.py --backends native

==================================================================================================
Query                          Description                                          native avg(ms)
==================================================================================================
q01_count_nodes                Count all nodes                                               1.4
q02_count_edges_meta           Count all edges (metadata fast path)                          6.6
q03_outdeg_high                Out-degree of high-degree node (9766, deg=14                  6.4
q04_outdeg_med                 Out-degree of medium-degree node (3, deg=454                  6.3
q05_outdeg_low                 Out-degree of low-degree node (1000)                          6.4
q06_top10_outdeg               Top-10 nodes by out-degree                                  320.4
q07_count_active_src           Count nodes with at least one out-edge                      232.3
q08_full_scan_rel              Full edge scan — count bound rel variable                    89.7
q09_full_scan_star             Full edge scan — count(*) (no rel var)                        5.4
q10_outdeg_node_a              Out-degree of second high-degree node (9765,                  4.8

DB size was 1.2 GB, whereas the icebug-disk parquet files were:

-rw-rw-r-- 1 aheev aheev 253821746 May  9 09:25 indices_follows.parquet
-rw-rw-r-- 1 aheev aheev  17354289 May  9 09:25 indptr_follows.parquet
-rw-rw-r-- 1 aheev aheev  16574296 May  9 10:01 nodes_user.parquet

@aheev
Contributor Author

aheev commented May 11, 2026

Aha! Failing on large queries, even with 32 GB max_db_size.

RUN ERROR: Buffer manager exception: Unable to allocate memory! The buffer pool is full and no memory could be freed!

@aheev
Contributor Author

aheev commented May 15, 2026

Without the hash_index it's even faster? lbdb size is 1 GB now. Will run the bigger queries tmrw.


==================================================================================================
Query                          Description                                          native avg(ms)
==================================================================================================
q01_count_nodes                Count all nodes                                               0.7
q02_count_edges_meta           Count all edges (metadata fast path)                          5.4
q03_outdeg_high                Out-degree of high-degree node (9766, deg=14                  5.0
q04_outdeg_med                 Out-degree of medium-degree node (3, deg=454                  4.2
q05_outdeg_low                 Out-degree of low-degree node (1000)                          4.3
q06_top10_outdeg               Top-10 nodes by out-degree                                  211.0
q07_count_active_src           Count nodes with at least one out-edge                      171.7
q08_full_scan_rel              Full edge scan — count bound rel variable                    55.5
q09_full_scan_star             Full edge scan — count(*) (no rel var)                        4.3
q10_outdeg_node_a              Out-degree of second high-degree node (9765,                  4.2

@aheev
Contributor Author

aheev commented May 16, 2026

Same as native with hash_index: failed at query 24 with a Buffer Manager exception. But still way faster than parquet.

python3 benchmark.py --backends native

============================================================
Setting up backend: native
[0, 'id', 'INT32', 'NULL', True]
Running 1 warmup + 5 timed runs per query...
  [native] q01_count_nodes: Count all nodes
    result=[3997962]  avg=1.7ms  min=1.4ms  max=2.7ms
  [native] q02_count_edges_meta: Count all edges (metadata fast path)
    result=[69362378]  avg=8.6ms  min=7.9ms  max=9.5ms
  [native] q03_outdeg_high: Out-degree of high-degree node (9766, deg=14815)
    result=[12870]  avg=7.3ms  min=6.1ms  max=11.9ms
  [native] q04_outdeg_med: Out-degree of medium-degree node (3, deg=454)
    result=[2]  avg=6.3ms  min=5.7ms  max=7.6ms
  [native] q05_outdeg_low: Out-degree of low-degree node (1000)
    result=[28]  avg=6.3ms  min=5.9ms  max=6.6ms
  [native] q06_top10_outdeg: Top-10 nodes by out-degree
    result=[9767, 14815]  avg=350.4ms  min=328.0ms  max=367.2ms
  [native] q07_count_active_src: Count nodes with at least one out-edge
    result=[3997962]  avg=284.1ms  min=264.8ms  max=298.3ms
  [native] q08_full_scan_rel: Full edge scan — count bound rel variable
    result=[69362378]  avg=90.4ms  min=85.5ms  max=101.0ms
  [native] q09_full_scan_star: Full edge scan — count(*) (no rel var)
    result=[69362378]  avg=5.7ms  min=5.5ms  max=5.9ms
  [native] q10_outdeg_node_a: Out-degree of second high-degree node (9765, deg≈12870)
    result=[48]  avg=4.1ms  min=3.7ms  max=4.5ms
  [native] q11_indeg_high: In-degree (follower count) of high-degree node (9766)
    result=[12870]  avg=3.6ms  min=3.4ms  max=3.9ms
  [native] q12_top10_indeg: Top-10 most-followed nodes (influencer detection)
    result=[9767, 14815]  avg=4120.5ms  min=3588.2ms  max=4553.9ms
  [native] q13_mutual_follows_count: Count reciprocal (mutual) follow pairs across the full graph
    result=[34681189]  avg=9252.4ms  min=8649.7ms  max=9636.1ms
  [native] q14_mutual_follows_of_node: List nodes in mutual-follow relationship with 9766
    result=[12870]  avg=1217.2ms  min=1110.4ms  max=1456.4ms
  [native] q15_pymk_high: PYMK recommendations for 9766: top-10 2-hop nodes scored by path count
    result=[9767, 3416]  avg=357.5ms  min=327.2ms  max=384.9ms
  [native] q16_pymk_med: PYMK recommendations for 3: top-10 2-hop nodes scored by path count
    result=[32, 2]  avg=53.8ms  min=51.8ms  max=57.9ms
  [native] q17_common_follows: Common follows between 9766 and 9765 (mutual friends count)
    result=[0]  avg=13.7ms  min=12.3ms  max=15.2ms
  [native] q18_ego_net_size: Ego network size of 9766 (1-hop, undirected)
    result=[12870]  avg=11.3ms  min=10.4ms  max=12.2ms
  [native] q19_2hop_reach: Distinct nodes reachable within 2 hops from 9766 (BFS depth-2)
    result=[330350]  avg=404.6ms  min=386.3ms  max=422.6ms
  [native] q20_3hop_reach: Distinct nodes reachable within 3 hops from 3 (BFS depth-3)
    result=[12047]  avg=51.9ms  min=30.4ms  max=62.2ms
  [native] q21_shortest_path: Shortest path length between 9766 and 1000
    result=[3]  avg=6856.1ms  min=6824.8ms  max=6878.3ms
  [native] q22_all_shortest_paths: Count of all shortest paths between 9766 and 9765
    result=[32, 3]  avg=13639.9ms  min=13580.3ms  max=13703.4ms
  [native] q23_triangles_node: Triangle count through 3 (local)
    result=[2]  avg=41.8ms  min=36.4ms  max=44.6ms
  [native] q24_global_triangles: Global triangle count (all directed 3-cycles, canonical ordering)
    WARMUP ERROR: Buffer manager exception: Unable to allocate memory! The buffer pool is full and no memory could be freed!
    RUN ERROR: Buffer manager exception: Unable to allocate memory! The buffer pool is full and no memory could be freed!

@aheev
Contributor Author

aheev commented May 16, 2026

@adsharma how can I verify whether the hash_index is created or not? Size on disk? If you look above, the queries w/o the index are faster.

@adsharma
Contributor

  call show_indexes() return *;

and size on disk.

@adsharma
Contributor

Good to see benchmarks are better without hash indexes!

@aheev
Contributor Author

aheev commented May 16, 2026

4M * 8

  call show_indexes() return *;

it returns empty on both 0.16.1 and latest

@adsharma
Contributor

I remember fixing it, but maybe the PR didn't go out.

Let me push the ART index branch, which could help you. Still validating it.

@adsharma
Contributor

Probably easier to create a second demo db with the patterns you want. No one has seen this commit, so we could just reset the HEAD to the previous version.
