feat: Implement icebug-disk format by aheev · Pull Request #429 · LadybugDB/ladybug

aheev · 2026-04-28T08:52:06Z

IceDisk storage tables: added IceDiskNodeTable and IceDiskRelTable.
IceDisk test suite: added demo_db_ice_disk.test, covering node/rel table scans against IceDisk storage.
IceDisk utilities: added ice_disk_utils.h with shared helpers for path resolution and Parquet-backed row group access.
Added icebug-disk spec implementation.

Closes #469

adsharma · 2026-05-08T14:52:49Z

How's this different from:

src/include/storage/table/parquet_rel_table.h
src/include/storage/table/parquet_node_table.h

aheev · 2026-05-08T14:55:53Z

> How's this different from:
>
> src/include/storage/table/parquet_rel_table.h
> src/include/storage/table/parquet_node_table.h

Updating the description. Give me a minute.

aheev · 2026-05-08T15:15:48Z

@adsharma

Improvements:

Batch scanning in the node table.
Revamped rel table scanning. Previously, we scanned every row group for each boundNodeOffset. The scan is now driven by boundNodeOffset: depending on whether it is a fwd or bwd scan, we fetch the corresponding rowGroups.
Other improvements, such as removing unnecessary data from the scanStates, removing repeated population of state data, and caching.
Blocking users from creating mixed tables/rels.
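
The boundNodeOffset-driven rel scan described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's C++ code: `row_groups_for_node`, `indptr`, and `row_group_starts` are hypothetical names, and the sketch assumes a CSR layout in which `indptr[i]:indptr[i + 1]` is the slice of the indices table holding node `i`'s edges.

```python
from bisect import bisect_right


def row_groups_for_node(indptr, row_group_starts, node_offset):
    """Return the indices-table row groups that hold node_offset's edges.

    indptr: CSR offsets with len == n_nodes + 1 (assumed layout).
    row_group_starts: first row number of each row group, ascending, starting at 0.
    """
    begin, end = indptr[node_offset], indptr[node_offset + 1]
    if begin == end:
        return []  # the node has no edges in this direction
    # Locate the row groups containing the first and the last edge row.
    first = bisect_right(row_group_starts, begin) - 1
    last = bisect_right(row_group_starts, end - 1) - 1
    return list(range(first, last + 1))


# 4 nodes, 10 edges; the indices table is split into row groups of 4, 4 and 2 rows.
indptr = [0, 3, 3, 9, 10]
row_group_starts = [0, 4, 8]
for node in range(4):
    print(node, row_groups_for_node(indptr, row_group_starts, node))
```

Only the row groups returned for a bound node need to be read, instead of scanning every row group for every bound node. A forward scan would use the forward indptr/indices pair and a backward scan the reverse pair.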

However, the main motivation for adding new classes is to have separate entities for icebug-format, because it may diverge from the existing columnar implementations. As discussed earlier, we are planning to use or replace the existing parquet or arrow classes for normal tables, or delegate to DuckDB, anyway.

aheev · 2026-05-08T15:23:06Z

@adsharma

With this implementation, we can have a working db as the output of a non-icedisk-to-icedisk converter, or the user can pass the schema.cypher or run the commands themselves to create icebug-disk tables. If you want to attach from an existing db, we already support attaching external ladybug dbs.

The main issue is the path, though. If the user moves the db or changes the path, they might need to run the create table queries again. DuckDB provides an override option. We may have to provide a similar option in the future.
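
One way such a path override could behave is sketched below. This is purely illustrative and not something the PR implements: `resolve_table_path`, `db_dir`, and `override_root` are hypothetical names, and the sketch assumes table paths are stored relative to the database directory.

```python
from pathlib import PurePosixPath


def resolve_table_path(stored_path, db_dir, override_root=None):
    """Resolve a table's storage path (hypothetical helper).

    Relative paths are anchored at override_root when given, otherwise at db_dir.
    Absolute paths are returned unchanged.
    """
    path = PurePosixPath(stored_path)
    if path.is_absolute():
        return path
    return PurePosixPath(override_root or db_dir) / path


# The db was created under /old/db and later moved to /new/db.
print(resolve_table_path("nodes/person.parquet", "/old/db"))
print(resolve_table_path("nodes/person.parquet", "/old/db", override_root="/new/db"))
```

With relative paths plus an override root, moving the db would not require re-running the create table queries. Absolute paths would still need to be recreated.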

A couple of clarifications:

Where should the version go? I am thinking of metadata in each file.
Should we keep the current column(0) for nbr_offset in indptr, or allow users to specify it?

adsharma · 2026-05-08T15:29:53Z

The improvements are nice. But we can't have two implementations with 2x the bugs. Need to remove one and justify the other with benchmarks/data comparing the improvement.

DuckDB and Lance could be additional ColumnarBaseTable implementations. They don't remove the need for parquet.

adsharma · 2026-05-08T15:37:59Z

uvx icebug-format --help

It already creates a metadata table thusly:

        # Create global metadata
        con.execute(f"""
        CREATE TABLE {csr_table_name}_metadata AS
        SELECT {total_nodes} AS n_nodes, {total_edges} AS n_edges, {directed} AS directed
        """)

Yes, this is a good place to add version.
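
A minimal sketch of what adding a version to that metadata table might look like. The `format_version` column name and its value are assumptions, not part of the current tool, and `sqlite3` is used here only as a stand-in for the DuckDB connection so that the example runs with the standard library alone.

```python
import sqlite3

FORMAT_VERSION = 1  # hypothetical version number


def create_metadata(con, csr_table_name, total_nodes, total_edges, directed):
    """Create the global metadata table with an extra (assumed) version column."""
    con.execute(f"""
        CREATE TABLE {csr_table_name}_metadata AS
        SELECT {int(total_nodes)} AS n_nodes,
               {int(total_edges)} AS n_edges,
               {int(bool(directed))} AS directed,
               {FORMAT_VERSION} AS format_version
    """)


con = sqlite3.connect(":memory:")
create_metadata(con, "demo", 5, 8, True)
row = con.execute(
    "SELECT n_nodes, n_edges, directed, format_version FROM demo_metadata"
).fetchone()
print(row)
```

A reader could then check `format_version` before touching the indptr/indices tables and reject files written by an incompatible version.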

> Should we keep the current column(0) for nbr_offset in indptr? or allow users to specify

nbr_offset is kuzu/ladybug-specific terminology. I'd avoid it in the code/docs. The CSR format is broadly defined; include a pointer to an existing definition. My recollection is that indptr always has only one column specifying the offset into indices. However, indices can have many columns if there are properties on the edge.

We can specify that the target is always col0 and that additional columns appear in the order specified in the schema.
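
The layout described above can be illustrated with a small sketch. It assumes the conventional CSR definition (a single-column indptr holding offsets into indices) and uses hypothetical helper names `build_csr` and `neighbours`. It is not the reference implementation.

```python
def build_csr(n_nodes, edges):
    """Build CSR arrays from (source, target, *properties) tuples."""
    ordered = sorted(edges, key=lambda e: e[0])  # group edges by source node
    indptr = [0] * (n_nodes + 1)
    for src, *_ in ordered:
        indptr[src + 1] += 1  # count edges per source
    for i in range(n_nodes):
        indptr[i + 1] += indptr[i]  # prefix sum gives each node's start offset
    # indices rows: col0 is the target, remaining columns are edge properties
    indices = [tuple(e[1:]) for e in ordered]
    return indptr, indices


def neighbours(indptr, indices, node):
    """Rows of the indices table that belong to `node`."""
    return indices[indptr[node]:indptr[node + 1]]


edges = [(0, 1, 0.5), (2, 0, 1.5), (0, 2, 2.0)]  # (source, target, weight)
indptr, indices = build_csr(3, edges)
print(indptr)
print(neighbours(indptr, indices, 0))
```

Here indptr stays a single column regardless of how many edge properties exist, while each indices row carries the target in col0 followed by the properties in schema order.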

The distinction between schema.cypher and a catalog entry is mechanical. I don't think we should spend a lot of time specifying that, or we'll be duplicating a significant chunk of docs.ladybugdb.com.

However, getting other graph databases and analytical packages to adopt icebug is an explicit goal. So I'd try to strike a balance between over-specifying (as opposed to just referring to the reference implementation) and adding ladybug-specific assumptions that would rub non-ladybug people the wrong way.

aheev · 2026-05-08T16:28:09Z

dataset addition PR: LadybugDB/dataset#1

aheev · 2026-05-08T16:33:14Z

> The improvements are nice. But we can't have two implementations with 2x the bugs. Need to remove one and justify the other with benchmarks/data comparing the improvement.
>
> DuckDB and Lance could be additional ColumnarBaseTable implementations. They don't remove the need for parquet.

I will run a benchmark tomorrow and attach the results.

But what about support for normal parquet tables? We can just remove them from the docs until we change the existing classes to normal tables.

aheev · 2026-05-08T16:36:15Z

> uvx icebug-format --help
>
> It already creates a metadata table thusly:
>
>         # Create global metadata
>         con.execute(f"""
>         CREATE TABLE {csr_table_name}_metadata AS
>         SELECT {total_nodes} AS n_nodes, {total_edges} AS n_edges, {directed} AS directed
>         """)

I will update the tool and add it to this repo tomorrow.

adsharma · 2026-05-08T17:22:42Z

> I will update the tool and add it to this repo tomorrow.

Ladybug-Memory/icebug-format#2 (comment)
Ladybug-Memory/icebug-format#2 (comment)

If the issue is that icebug-format is in a for-profit company namespace, we can move it somewhere else. But the ladybugdb core repo is the wrong place for the spec and the script.

aheev · 2026-05-09T01:40:51Z

> > I will update the tool and add it to this repo tomorrow.
>
> Ladybug-Memory/icebug-format#2 (comment) Ladybug-Memory/icebug-format#2 (comment)
>
> If the issue is that icebug-format is in a for-profit company namespace, we can move it somewhere else. But the ladybugdb core repo is the wrong place for the spec and the script.

It's not really about the company. But since the CLI tool necessarily converts to LadybugDB's icebug-format impl, it should live under the LadybugDB org.

Here's my proposal:

ladybugDB icebug impl spec and cli tools under LadybugDB org
Just the icebug-format spec in icebug-format

adsharma · 2026-05-09T01:46:50Z

The cli tool should be generic and usable with another database that wants to implement icebug-format.

Previously, people thought of specs as detailed docs that someone else would read to clean-room implement an idea. This existed in the era of closed-source software. An emerging consensus (probably in the last few weeks of social media) now claims that "AI slop" or a working prototype is a better idea than writing specs.

While that idea has problems, it's likely how things are going to work in the near future. People are going to use icebug-format.py with agents to come up with new implementations. For that reason, I prefer the CLI and the spec to live in the same repo.

aheev · 2026-05-09T01:48:57Z

> The cli tool should be generic and usable with another database that wants to implement icebug-format.
>
> Previously, people thought of specs as detailed docs that someone else would read to clean-room implement an idea. This existed in the era of closed-source software. An emerging consensus (probably in the last few weeks of social media) now claims that "AI slop" or a working prototype is a better idea than writing specs.
>
> While that idea has problems, it's likely how things are going to work in the near future. People are going to use icebug-format.py with agents to come up with new implementations. For that reason, I prefer the CLI and the spec to live in the same repo.

Understood. I thought it was only going to be used for ladybug.

aheev force-pushed the icedisk-impl branch 5 times, most recently from 69f32c5 to 4c5f9bf · May 8, 2026 09:41

aheev changed the title ~~[WIP] feat: add icebug-disk tables~~ feat: add icebug-disk tables May 8, 2026

aheev force-pushed the icedisk-impl branch from 8cdeae7 to 09ef657 · May 8, 2026 14:47

aheev changed the title ~~feat: add icebug-disk tables~~ feat: Implement icebug-disk format May 8, 2026

aheev mentioned this pull request May 8, 2026

add icebug-disk dataset LadybugDB/dataset#1

Open

aheev added 8 commits May 9, 2026 09:24

feat: add icebug-disk tables

8391a85

fix tests in demo_db ice_disk test

cb20856

fix path validations

8358cf5

fix IceDiskNodeTable::getNumTotalRows

3e895f2

fix IceDiskNodeTable scan init

5aadc7c

fix ice disk node table scan

2a522e8

move rowGroupStartOffsets into ice disk node table shared state

5a98715

remove this in ice_disk_node_table.cpp

d2b5c4e

aheev added 8 commits May 9, 2026 09:24

fix ice-disk rel scan

f426ddf

move const data out of ice-disk shared states

b49b49e

fix demo_db ice_disk storage paths

532d776

fix ice-disk node table scan initScanState

723ef93

fix reset indicesRowGroupStartOffsets

7ae4efc

fix/optimize ice-disk rel table scan

4e8f407

revert table_path, indptr, indices options

39635d6

add ice-disk impl spec

ace05d2

aheev force-pushed the icedisk-impl branch from f3ce6b2 to ace05d2 · May 9, 2026 03:55
