Skip to content

feat: Implement icebug-disk format#429

Open
aheev wants to merge 16 commits intoLadybugDB:mainfrom
aheev:icedisk-impl
Open

feat: Implement icebug-disk format#429
aheev wants to merge 16 commits intoLadybugDB:mainfrom
aheev:icedisk-impl

Conversation

@aheev
Copy link
Copy Markdown
Contributor

@aheev aheev commented Apr 28, 2026

  • Added IceDisk storage tables: Added IceDiskNodeTable and IceDiskRelTable
  • IceDisk test suite: Added demo_db_ice_disk.test covering node/rel table scans against IceDisk storage.
  • IceDisk utilities: Added ice_disk_utils.h with shared helpers for path resolution and Parquet-backed row group access.
  • added icebug-disk spec implementation

Closes #469

@aheev aheev force-pushed the icedisk-impl branch 5 times, most recently from 69f32c5 to 4c5f9bf Compare May 8, 2026 09:41
@aheev aheev changed the title [WIP] feat: add icebug-disk tables feat: add icebug-disk tables May 8, 2026
@aheev aheev changed the title feat: add icebug-disk tables feat: Implement icebug-disk format May 8, 2026
@adsharma
Copy link
Copy Markdown
Contributor

adsharma commented May 8, 2026

How's this different from:

src/include/storage/table/parquet_rel_table.h
src/include/storage/table/parquet_node_table.h

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 8, 2026

How's this different from:

src/include/storage/table/parquet_rel_table.h
src/include/storage/table/parquet_node_table.h

updating description. Give me a minute

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 8, 2026

@adsharma

Improvements:

  • Batch scanning in node table
  • revamped rel_table scanning. Previously, we used to scan every row group for each boundNodeOffset. Changed it to boundNodeOffset driven scan. Based on fwd or bwd scan, we'll fetch the rowGroups
  • Other improvements like removing unnecessary data from scanStates, removing repeated state data population, caching
  • blocking user from creating mixed tables/rels

However, the main motivation behind adding new classes is having separate entities for icebug-format because it may diverge from existing columnar implementations. As discussed earlier, we are planning to use/replace exisiting parquet or arrow classes for normal tables or delegate to duckDB anyway

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 8, 2026

@adsharma

With this implementation, we can have a working db as an output from non-icedisk-to-icedisk-converter or user can pass the schema.cypher or run the cmds themselves to create icebug-disk tables. If you want to attach from existing db, we already support attaching external ladybug dbs

The main issue is path though. If the user moves the db or path, he/she might need to trigger create table queries again. DuckDB provides override option. We may have to provide similar option in the future

Couple of clarifications:

  • Where to add version? I am thinking metadata in each file
  • Should we keep the current column(0) for nbr_offset in in indptr? or allow users to specify

@adsharma
Copy link
Copy Markdown
Contributor

adsharma commented May 8, 2026

The improvements are nice. But we can't have two implementations with 2x the bugs. Need to remove one and justify the other with benchmarks/data comparing the improvement.

DuckDB and Lance could be additional ColumnarBaseTable implementations. They don't remove the need for parquet.

@adsharma
Copy link
Copy Markdown
Contributor

adsharma commented May 8, 2026

uvx icebug-format --help

It already creates a metadata table thusly:

        # Create global metadata
        con.execute(f"""
        CREATE TABLE {csr_table_name}_metadata AS
        SELECT {total_nodes} AS n_nodes, {total_edges} AS n_edges, {directed} AS directed
        """)

Yes, this is a good place to add version.

Should we keep the current column(0) for nbr_offset in in indptr? or allow users to specify

nbr_offset is kuzu/ladybug specific terminology. I'd avoid it in the code/docs. CSR format is broadly defined. Include a pointer to existing definition. My recollection is that indptr always has only one column specifying the offset into indices. However, indices can have many columns if there are properties on the edge.

We can specify that target is always col0 and additional columns appear in an order as specified in the schema.

The distinction between schema.cypher and catalog entry are mechanical. I don't think we should spend a lot of time specifying that or we'll be duplicating a significant chunk of docs.ladybugdb.com.

However, getting other Graph databases and analytical packages to adopt icebug is an explicit goal. So I'd try to strike a balance between over-specifying (as opposed to just refer to reference implementation) and adding ladybug specific assumptions that would rub the non-ladybug people the wrong way.

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 8, 2026

dataset addition PR: LadybugDB/dataset#1

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 8, 2026

The improvements are nice. But we can't have two implementations with 2x the bugs. Need to remove one and justify the other with benchmarks/data comparing the improvement.

DuckDB and Lance could be additional ColumnarBaseTable implementations. They don't remove the need for parquet.

I will run a benchmark tmrw and attach the results

But, what about support for normal parquet tables? We can just remove them from docs until we change the existing classes to normal tables

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 8, 2026

uvx icebug-format --help

It already creates a metadata table thusly:

        # Create global metadata
        con.execute(f"""
        CREATE TABLE {csr_table_name}_metadata AS
        SELECT {total_nodes} AS n_nodes, {total_edges} AS n_edges, {directed} AS directed
        """)

I will update the tool and add it in this repo tmrw

@adsharma
Copy link
Copy Markdown
Contributor

adsharma commented May 8, 2026

I will update the tool and add it in this repo tmrw

Ladybug-Memory/icebug-format#2 (comment)
Ladybug-Memory/icebug-format#2 (comment)

If the issue is icebug-format is in a for profit company namespace, we can move it some place else. But ladybugdb core repo is the wrong place for the spec and the script.

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 9, 2026

I will update the tool and add it in this repo tmrw

Ladybug-Memory/icebug-format#2 (comment) Ladybug-Memory/icebug-format#2 (comment)

If the issue is icebug-format is in a for profit company namespace, we can move it some place else. But ladybugdb core repo is the wrong place for the spec and the script.

Not really about about the company. But since the cli tool necessarily converts to ladybugDB' icebug-format impl, It should live under LadybugDB org

Here's my propposal:

  • ladybugDB icebug impl spec and cli tools under LadybugDB org
  • Just the icebug-format spec in icebug-format

@adsharma
Copy link
Copy Markdown
Contributor

adsharma commented May 9, 2026

The cli tool should be generic and usable with another database that wants to implement icebug-format.

Previously people thought of specs as detailed docs that someone else would read to clean room implement an idea. This existed in the era of closed source software. An emerging consensus (probably in the last few weeks of social media) now claims that "AI slop" or a working prototype is a better idea than writing specs.

While that idea has problems, it's likely how things are going to work in the near future. People are going use icebug-format.py with agents to come up with new implementations. For that reason, I prefer the cli and the spec to live in the same repo.

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 9, 2026

The cli tool should be generic and usable with another database that wants to implement icebug-format.

Previously people thought of specs as detailed docs that someone else would read to clean room implement an idea. This existed in the era of closed source software. An emerging consensus (probably in the last few weeks of social media) now claims that "AI slop" or a working prototype is a better idea than writing specs.

While that idea has problems, it's likely how things are going to work in the near future. People are going use icebug-format.py with agents to come up with new implementations. For that reason, I prefer the cli and the spec to live in the same repo.

Understood. I thought it's gonnna be used only for ladybug

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Add Icebug-disk spec impl

2 participants