Skip to content

Add icebug mem impl#496

Open
aheev wants to merge 9 commits into
LadybugDB:mainfrom
aheev:add-icebug-mem-impl
Open

Add icebug mem impl#496
aheev wants to merge 9 commits into
LadybugDB:mainfrom
aheev:add-icebug-mem-impl

Conversation

@aheev
Copy link
Copy Markdown
Contributor

@aheev aheev commented May 18, 2026

  • ArrowNodeTable is reused for icebug-memory node tables
  • ArrowRelTable is extended to support icebug-memory rel tables
  • new API createArrowCsrRelTable is added to arrow_table_support to support icebug-memory rel table creation. It takes indices, indptr arrow data, bundles them in ArrowCsrRelData and adds it in csrRegistry

Closes: #329

@adsharma
Copy link
Copy Markdown
Contributor

I realize this is not ready for review. But thanks for sharing it early for feedback!

Two high level concerns and details:

  • Adds a lot of new code. One of my concerns as a maintainer is to minimize the amount of code to maintain
  • Ice-mem branding. These things change and I would keep them on the docs/communication and use more technical names such as arrow-csr in the code.

Recommendations:

  • Keep ArrowNodeTable; do not add IceMemNodeTable.
  • Extend ArrowRelTable with a layout enum, e.g. TRIPLES vs CSR.
  • Put the layout-specific edge cursor behind a small helper, not a new table class.
  • For TRIPLES, keep current behavior: read from/to, lookup node PKs, emit matches.
  • For CSR, store indices and indptr Arrow arrays in the Arrow rel table data. For FWD scans, use the bound node offset to directly slice CSR. For BWD scans, either require/provide reverse CSR or explicitly fall back to a full scan.
  • Extend ArrowTableSupport registration with a tagged rel payload, rather than inventing icebug-memory catalog format.

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 19, 2026

  • Ice-mem branding. These things change and I would keep them on the docs/communication and use more technical names such as arrow-csr in the code.

Brand names tend to change far less frequently than technical details, wouldn’t you agree? Perhaps only on rare occasions. Also, if we were to name it arrow-csr, it might give the impression that we are confining the ice-mem format within both arrow and CSR

Recommendations:

  • Keep ArrowNodeTable; do not add IceMemNodeTable.
  • Extend ArrowRelTable with a layout enum, e.g. TRIPLES vs CSR.
  • Put the layout-specific edge cursor behind a small helper, not a new table class.
  • For TRIPLES, keep current behavior: read from/to, lookup node PKs, emit matches.
  • For CSR, store indices and indptr Arrow arrays in the Arrow rel table data. For FWD scans, use the bound node offset to directly slice CSR. For BWD scans, either require/provide reverse CSR or explicitly fall back to a full scan.

There is already a significant amount of redundant state and data in both the table and scan state classes due to inheritance from the native classes. For instance, if you look at setToTable, much of the code is not strictly necessary but is retained to remain compliant with the native class structure. Consequently, any changes to the native classes can propagate and affect the external storage classes as well. I have faced these issues while working on icebug-disk implementation

void ArrowRelTableScanState::setToTable(const transaction::Transaction* transaction, Table* table_,
    std::vector<column_id_t> columnIDs_, std::vector<ColumnPredicateSet> columnPredicateSets_,
    RelDataDirection direction_) {
    // Same behavior as IceDiskRelTable: no local table for external data sources.
    TableScanState::setToTable(transaction, table_, std::move(columnIDs_),
        std::move(columnPredicateSets_));
    columns.resize(columnIDs.size());
    direction = direction_;
    for (size_t i = 0; i < columnIDs.size(); ++i) {
        auto columnID = columnIDs[i];
        if (columnID == INVALID_COLUMN_ID || columnID == ROW_IDX_COLUMN_ID) {
            columns[i] = nullptr;
        } else {
            columns[i] = table->cast<RelTable>().getColumn(columnID, direction);
        }
    }
    csrOffsetColumn = table->cast<RelTable>().getCSROffsetColumn(direction);
    csrLengthColumn = table->cast<RelTable>().getCSRLengthColumn(direction);
    nodeGroupIdx = INVALID_NODE_GROUP_IDX;
}

Even with the ColumnarTableBase abstraction and the majority of common utilities in ArrowUtils, the Arrow classes remain large and complex. Introducing new data and states from icebug-memory only adds to this complexity, increasing redundant compliance code, more conditional checks, and potential risk.

  • Extend ArrowTableSupport registration with a tagged rel payload, rather than inventing icebug-memory catalog format.

didn't get this part

@adsharma
Copy link
Copy Markdown
Contributor

Extend ArrowTableSupport registration with a tagged rel payload, rather than inventing icebug-memory catalog format.

The Tag in this context is TRIPLES or CSR (feel free to rename). The status right now:

Parquet Arrow
Triples
CSR

In the future there may be a legit use case for running cypher over a triple table stored in parquet (we support this for duckdb/sqlite/postgres).

I'm thinking about how to implement this matrix with the most code reuse and maintainability. I agree that sometimes its necessary to duplicate code for clarity. Inheritance is not the best way to share code etc.

But I'm not convinced that we should have 4 cases above * {node, rel} = 8 classes.

@aheev aheev force-pushed the add-icebug-mem-impl branch from 69ff7fc to 42ecaa6 Compare May 19, 2026 16:50
@adsharma
Copy link
Copy Markdown
Contributor

Also think about it from the perspective of someone reading the code. They already know what "Arrow" or "CSR" means. If not, it's probably 2 secs away in the browser. But searching for icemem?

@aheev aheev changed the title [WIP] Add icebug mem impl Add icebug mem impl May 20, 2026
@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 20, 2026

@adsharma could you PTAL? Couple of clarifications

  • arrow C data interface doesn't support reading metadata out of the box. How do we go about version checks?
  • are we planning to have an icebug-format converter similar to icebug-disk?

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 20, 2026

python PR: LadybugDB/ladybug-python#17

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 20, 2026

@adsharma should we add support for optional bwd indices and indptr in icebug-disk too?

@aheev aheev force-pushed the add-icebug-mem-impl branch from 9803e33 to 87b5945 Compare May 20, 2026 05:03
@adsharma
Copy link
Copy Markdown
Contributor

should we add support for optional bwd indices and indptr in icebug-disk too?

The problem with providing both fwd and bwd direction in user facing APIs is that users can screw up and provide logically inconsistent data.

This is why storing data in one direction (fwd) and then using --directed is preferable for logical consistency. icebug-format cli.py does this.

However, as a lower level API (when processing directed and computing bwd edges is expensive), there could be a case for an internal API.

For now, let's do only the logically consistent version and punt on the other option.

Not clear we need to version these things, since everything is super well defined and has been around for many years in the industry. Arrow for 10 years and CSR for even longer.

Example: https://github.com/tensorflow/community/blob/master/rfcs/20200519-csr-sparse-matrix.md

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 20, 2026

should we add support for optional bwd indices and indptr in icebug-disk too?

The problem with providing both fwd and bwd direction in user facing APIs is that users can screw up and provide logically inconsistent data.

This is why storing data in one direction (fwd) and then using --directed is preferable for logical consistency. icebug-format cli.py does this.

However, as a lower level API (when processing directed and computing bwd edges is expensive), there could be a case for an internal API.

For now, let's do only the logically consistent version and punt on the other option.

I don't follow. Can you put these together in a issue/disc and discuss there?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Implement icebug-memory tables

2 participants