feat(ontology): Parquet persistence — save/load OntologyRuntime by DecisionNerd · Pull Request #668 · DecisionNerd/graphforge

DecisionNerd · 2026-05-30T15:39:06Z

Closes #562

Sixth issue in the v0.5.0: Ontology Runtime milestone.

Summary

save_parquet(runtime, dir) — writes all 8 Arrow tables as Parquet files to dir:

ontology_meta.parquet, entity_types.parquet, relation_types.parquet, property_types.parquet, type_constraints.parquet, semantic_flags.parquet, cardinality_rules.parquet, aliases.parquet
Each file carries GraphForge metadata in Arrow schema key-value metadata: graphforge.ontology_id, graphforge.ontology_version, graphforge.ontology_checksum, graphforge.writer_version

load_parquet(dir) — reads the 8 Parquet files and reconstructs the full OntologyRuntime without YAML/JSON parsing or re-validation:

Rebuilds entity_name_to_id, entity_id_to_name, relation_name_to_id, relation_id_to_name
Rebuilds property_name_to_id from the property_types table
Rebuilds ancestors and descendants inheritance closure from the entity_types table

New error variants: OntologyError::Parquet(String) and OntologyError::ChecksumMismatch { cached, computed } (reserved for future stale-cache detection)

Test plan

save_parquet_creates_files — all 8 .parquet files exist after save
load_parquet_roundtrip_row_counts — row counts match original after round-trip
load_parquet_lookup_maps_reconstructed — entity_name_to_id and relation_name_to_id intact
load_parquet_ancestors_reconstructed — Employee⊆Person closure preserved
save_load_full_example — storage.md example compiles, saves, loads correctly
cargo test -p gf-ontology — 64 tests pass
cargo clippy -p gf-ontology -- -D warnings — clean

🤖 Generated with Claude Code

Note

Add Parquet save/load persistence for `OntologyRuntime`

Adds save_parquet and load_parquet as public APIs in the gf-ontology crate, writing/reading OntologyRuntime state as eight Parquet files in a directory (entity types, relation types, property types, type constraints, semantic flags, cardinality rules, aliases, and ontology meta).
Each Parquet file embeds GraphForge key-value metadata in the Arrow schema; load_parquet verifies an optional expected checksum against the stored metadata, returning ChecksumMismatch on mismatch.
Loading reconstructs name/id lookup maps and entity inheritance ancestor/descendant closures directly from the Parquet tables, bypassing YAML/JSON re-parsing.
Two new OntologyError variants are added: Parquet(String) for I/O errors and ChecksumMismatch { cached, computed } for integrity failures.

Macroscope summarized 5ff4b71.
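To illustrate the lookup-map reconstruction described in the note above, here is a minimal std-only sketch. It assumes rows arrive as `(id, name)` pairs with `u32` ids; the function name `rebuild_name_maps`, the row shape, and the duplicate checks are hypothetical and are not taken from persistence.rs.

```rust
use std::collections::HashMap;

/// Rebuilds name->id and id->name maps from (id, name) rows.
/// Hypothetical row shape: the real column types are not shown in this PR.
fn rebuild_name_maps(
    rows: &[(u32, &str)],
) -> Result<(HashMap<String, u32>, HashMap<u32, String>), String> {
    let mut name_to_id = HashMap::with_capacity(rows.len());
    let mut id_to_name = HashMap::with_capacity(rows.len());
    for &(id, name) in rows {
        // Reject duplicates rather than silently overwriting an entry.
        if name_to_id.insert(name.to_owned(), id).is_some() {
            return Err(format!("duplicate entity name {name}"));
        }
        if id_to_name.insert(id, name.to_owned()).is_some() {
            return Err(format!("duplicate entity id {id}"));
        }
    }
    Ok((name_to_id, id_to_name))
}
```

Because both maps come from the same pass over the table, they cannot drift apart, which is presumably why the loader can skip re-parsing the YAML/JSON source.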

Summary by CodeRabbit

New Features
- Parquet-based persistence for ontology runtime with save/load snapshot support.
- Optional checksum verification to detect mismatched or stale snapshots.
Bug Fixes
- Improved error handling and clearer error variants for Parquet I/O and checksum failures.
Chores
- Added Parquet format support as a dependency.

- persistence.rs: save_parquet() writes 8 Arrow tables as Parquet files with GraphForge metadata (ontology_id, version, checksum, writer_version) embedded in Arrow schema metadata; load_parquet() reads them back and reconstructs all lookup maps (name↔id, property_name_to_id) and the inheritance closure without re-parsing or re-validating
- error.rs: add OntologyError::Parquet and OntologyError::ChecksumMismatch
- Cargo.toml: add parquet = { workspace = true }
- 5 new tests (64 total): file creation, roundtrip row counts, lookup maps reconstructed, ancestors closure reconstructed, full example doc

Closes #562

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-05-30T15:39:16Z

Walkthrough

Adds Parquet snapshot persistence to OntologyRuntime: new parquet dependency, OntologyError Parquet/checksum variants, save_parquet and load_parquet(expected_checksum), helpers to read/write Arrow/Parquet preserving schema metadata, runtime reconstruction logic, and tests including checksum validation.

Changes

Parquet Persistence Implementation

Layer / File(s)	Summary
Error types and module setup `crates/gf-ontology/Cargo.toml`, `crates/gf-ontology/src/error.rs`, `crates/gf-ontology/src/lib.rs`	`parquet` dependency added; `OntologyError` gains `Parquet` and `ChecksumMismatch` variants; crate exposes `persistence` module and re-exports `load_parquet`/`save_parquet`.
Parquet save and load operations `crates/gf-ontology/src/persistence.rs`	`save_parquet` writes eight `OntologyRuntime` RecordBatches as `{table}.parquet` with GraphForge metadata; `load_parquet(dir, expected_checksum)` reads the eight files, optionally verifies `graphforge.ontology_checksum`, and coordinates reconstruction; includes Parquet IO error mapping and writer helper.
Runtime reconstruction and I/O helpers `crates/gf-ontology/src/persistence.rs`	Rebuilds entity/relation name↔id maps, property composite-key map, and computes transitive inheritance closure; provides `read_batch_parquet` preserving schema for zero-row files and `try_string_col` for safe string extraction.
Integration tests `crates/gf-ontology/src/persistence.rs`	Tests validate Parquet file creation, row-count round-trips, lookup-map and inheritance reconstruction, YAML→compile→save→load end-to-end flow, and checksum success/failure behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

DecisionNerd/graphforge#666: Related Arrow compiler changes and shared checksum/metadata handling used by Parquet persistence.
DecisionNerd/graphforge#664: Also updates crates/gf-ontology/src/error.rs; both PRs extend OntologyError with related variants.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding Parquet persistence (save/load functionality) for OntologyRuntime, which is the primary objective of this PR.
Linked Issues check	✅ Passed	The PR successfully implements all core coding requirements from `#562`: save_parquet writes eight Parquet files with embedded GraphForge metadata, load_parquet reconstructs OntologyRuntime with checksum validation and lookup map/inheritance closure rebuilding, new OntologyError variants are added, and integration tests verify the functionality.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to implementing Parquet persistence: dependency addition in Cargo.toml, error type extensions, module exposure, and persistence implementation with comprehensive tests. No unrelated modifications detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check	✅ Passed	PR description is comprehensive and addresses the template structure well, covering objectives, changes, testing, and implementation details.


codecov · 2026-05-30T15:39:55Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.08%. Comparing base (80386b0) to head (5ff4b71).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #668   +/-   ##
=======================================
  Coverage   97.08%   97.08%           
=======================================
  Files           2        2           
  Lines         274      274           
  Branches       41       41           
=======================================
  Hits          266      266           
  Misses          5        5           
  Partials        3        3

Flag	Coverage Δ
full-coverage	`97.08% <ø> (ø)`

Flags with carried forward coverage won't be shown.

Components	Coverage Δ
parser	`∅ <ø> (∅)`
planner	`∅ <ø> (∅)`
executor	`∅ <ø> (∅)`
storage	`∅ <ø> (∅)`
ast	`∅ <ø> (∅)`
types	`∅ <ø> (∅)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

crates/gf-ontology/src/persistence.rs (2)
74-74: 💤 Low value

Derive writer_version from the crate version instead of a hardcoded literal.

"0.5.0" will drift from the actual crate version on the next bump. Prefer env!("CARGO_PKG_VERSION").
♻️ Suggested change
-        ("graphforge.writer_version".into(), "0.5.0".into()),
+        ("graphforge.writer_version".into(), env!("CARGO_PKG_VERSION").into()),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gf-ontology/src/persistence.rs` at line 74, Replace the hardcoded
writer version string for the "graphforge.writer_version" entry with the crate's
actual package version at compile time by using env!("CARGO_PKG_VERSION");
locate the tuple ("graphforge.writer_version", "0.5.0".into()) in persistence.rs
and change it to derive its value from env!("CARGO_PKG_VERSION") (ensuring you
still call .into() or otherwise convert to the expected type).
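A minimal sketch of the suggestion above. The review proposes `env!("CARGO_PKG_VERSION")`, which only compiles under Cargo; this variant swaps in `option_env!` with an `"unknown"` fallback so that it also builds with plain `rustc`. The constant `WRITER_VERSION` and the helper `writer_metadata` are hypothetical names, not taken from persistence.rs.

```rust
/// Crate version captured at compile time. Falls back to "unknown" when the
/// build is not driven by Cargo (assumption: the real code would use env!).
const WRITER_VERSION: &str = match option_env!("CARGO_PKG_VERSION") {
    Some(v) => v,
    None => "unknown",
};

/// Hypothetical helper that builds the key-value metadata pair.
fn writer_metadata() -> Vec<(String, String)> {
    vec![("graphforge.writer_version".into(), WRITER_VERSION.into())]
}
```

Either macro is resolved at compile time, so the stored writer version tracks the crate version on every bump without a manual edit.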
174-252: ⚖️ Poor tradeoff

Schema column indices in persistence already match the compiler’s batch layout

rebuild_entity_maps, rebuild_relation_maps, rebuild_property_map, and rebuild_inheritance_closure use batch.column(n) indices that line up with the current field order in crates/gf-ontology/src/schemas.rs (ENTITY_TYPES_SCHEMA, RELATION_TYPES_SCHEMA, PROPERTY_TYPES_SCHEMA) and the array order passed to RecordBatch::try_new in crates/gf-ontology/src/compiler.rs (so there’s no current mismatch risk; the downcast_ref checks also guard expected physical types).

To make this resilient to future schema/column reordering (especially where multiple fields share the same Arrow type), consider deriving indices from batch.schema() via index_of("...") (or asserting field names at runtime).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gf-ontology/src/persistence.rs` around lines 174 - 252, The current
functions (rebuild_entity_maps, rebuild_relation_maps, rebuild_property_map, and
rebuild_inheritance_closure) rely on hardcoded column indices which will break
if schemas reorder; change each function to lookup column indices by name via
batch.schema().index_of("field_name") (or call schema.field_with_name and
unwrap/error) for the specific fields used (e.g., "id", "name" for
ENTITY/RELATION, "property_type_id"/"owner_type_id"/"name" for PROPERTY), then
use those index values for batch.column(...) and keep the existing downcast_ref
checks; add clear error mapping to OntologyError when a field name is missing or
has the wrong physical type.
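The idea of resolving columns by name rather than by position can be sketched without the arrow crate. Here a plain slice of field names stands in for the schema, and the free function `index_of` is a hypothetical stand-in for the `batch.schema().index_of(...)` lookup the review mentions. The field names follow the review's examples.

```rust
/// Stand-in for a schema lookup: finds a field's position by name and
/// returns a recoverable error instead of panicking when it is missing.
fn index_of(fields: &[&str], name: &str) -> Result<usize, String> {
    fields
        .iter()
        .position(|f| *f == name)
        .ok_or_else(|| format!("missing field {name:?}"))
}

/// Resolves the (id, name) column positions for an entity-like table.
fn entity_columns(fields: &[&str]) -> Result<(usize, usize), String> {
    Ok((index_of(fields, "id")?, index_of(fields, "name")?))
}
```

With this approach a reordered schema still resolves to the right columns, and a renamed or missing field surfaces as an error at load time.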

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/gf-ontology/src/persistence.rs`:
- Around line 125-164: load_parquet currently ignores the per-file schema
metadata that save_parquet sets (graphforge.ontology_checksum), so stale/corrupt
snapshots never trigger OntologyError::ChecksumMismatch; update load_parquet to
read the schema metadata produced by read_batch_parquet for each table (use
TABLE_NAMES and the results from read_batch_parquet), extract the
"graphforge.ontology_checksum" value and compare it against an expected checksum
(either add an expected_checksum parameter to load_parquet or accept an
Option<&str> so callers can opt-in); if any file's checksum differs or is
missing return Err(OntologyError::ChecksumMismatch) before reconstructing maps
(so rebuild_entity_maps, rebuild_relation_maps, rebuild_inheritance_closure
never run on bad data), and add a unit test that writes parquet via save_parquet
and verifies load_parquet returns ChecksumMismatch when the stored checksum is
changed.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fdfbaeb7-e8d5-4a4e-9e12-13fb8619525b

📥 Commits

Reviewing files that changed from the base of the PR and between 80386b0 and 45cd6a8.

📒 Files selected for processing (4)

crates/gf-ontology/Cargo.toml
crates/gf-ontology/src/error.rs
crates/gf-ontology/src/lib.rs
crates/gf-ontology/src/persistence.rs

- try_string_col: replace panicking expect() with Result return for public save_parquet function
- load_parquet: add expected_checksum param; reads graphforge.ontology_checksum from schema metadata and returns OntologyError::ChecksumMismatch on mismatch
- read_batch_parquet: use builder schema (not batch schema) in concat_batches to preserve Arrow key-value metadata through Parquet round-trip
- 2 new tests: correct checksum passes, wrong checksum returns ChecksumMismatch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

DecisionNerd · 2026-05-30T15:54:10Z

Fixes Applied Successfully

Fixed 1 file based on 2 CodeRabbit feedback items.

Files modified:

crates/gf-ontology/src/persistence.rs

Fix #1 (LOW): try_string_col replaces the panicking string_col helper. It now returns an OntologyError::Parquet on malformed input instead of panicking.

Fix #2 (HIGH): load_parquet gains an expected_checksum: Option<&str> parameter. When Some, it reads graphforge.ontology_checksum from the Parquet file's Arrow schema metadata and returns OntologyError::ChecksumMismatch if the values differ. Also fixed a subtle Parquet metadata round-trip bug where concat_batches was using the batch schema (which loses custom Arrow metadata) instead of the builder schema (which preserves it).
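The opt-in check described in Fix #2 could look roughly like the following std-only sketch. The `LoadError` enum, the `verify_checksum` function, and the `MissingChecksum` variant are hypothetical. Which value maps to `cached` and which to `computed` is also an assumption, since the thread does not say.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum LoadError {
    /// Mirrors the field names of OntologyError::ChecksumMismatch; the
    /// mapping of stored/expected onto cached/computed is assumed here.
    ChecksumMismatch { cached: String, computed: String },
    /// Hypothetical: the file carries no checksum entry at all.
    MissingChecksum,
}

const CHECKSUM_KEY: &str = "graphforge.ontology_checksum";

/// Compares the checksum stored in schema metadata against the caller's
/// expectation. Passing None skips verification entirely.
fn verify_checksum(
    metadata: &HashMap<String, String>,
    expected: Option<&str>,
) -> Result<(), LoadError> {
    let Some(expected) = expected else {
        return Ok(());
    };
    let stored = metadata.get(CHECKSUM_KEY).ok_or(LoadError::MissingChecksum)?;
    if stored.as_str() != expected {
        return Err(LoadError::ChecksumMismatch {
            cached: stored.clone(),
            computed: expected.to_owned(),
        });
    }
    Ok(())
}
```

Running this check before any map reconstruction means a stale snapshot is rejected early and the caller can fall back to recompiling from source.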

Commit: 5ff4b71

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/gf-ontology/src/persistence.rs (1)

303-313: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Panic risk: unknown parent id triggers unwrap() on a recoverable load path.

desc.get_mut(&parent).unwrap() (Line 310) panics if a parent_id references an entity id that isn't present in entity_name_to_id (e.g. a corrupt or partially-written snapshot). Since load_parquet performs no validation and expected_checksum is optional (None skips the checksum guard entirely), this is reachable and crashes instead of returning a recoverable OntologyError. Prefer surfacing a Parquet error so callers can fall back to recompiling from source.

🛡️ Proposed fix to avoid the panic

         for id in &all_ids {
             let mut current = *id;
             while let Some(&parent) = parent_of.get(&current) {
                 if anc[id].contains(&parent) {
                     break;
                 }
-                anc.get_mut(id).unwrap().insert(parent);
-                desc.get_mut(&parent).unwrap().insert(*id);
+                let desc_set = desc.get_mut(&parent).ok_or_else(|| {
+                    OntologyError::Parquet(format!(
+                        "entity_types references unknown parent id {parent}"
+                    ))
+                })?;
+                desc_set.insert(*id);
+                anc.get_mut(id).unwrap().insert(parent);
                 current = parent;
             }
         }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gf-ontology/src/persistence.rs` around lines 303 - 313, The loop that
walks parent_of uses desc.get_mut(&parent).unwrap(), which can panic if a parent
id is missing (e.g., corrupt snapshot); instead detect the missing parent and
return a recoverable Parquet-style OntologyError from load_parquet so callers
can fallback to recompiling. Replace unwrap on desc.get_mut(&parent) with an
explicit check (e.g., desc.get_mut(&parent).ok_or_else(||
OntologyError::Parquet("missing parent id ...".into()))?) or similar error
construction, and propagate that Result out of the function (update the loop to
return Err when a parent is absent). Ensure the error mentions the missing
parent id and that this change covers desc, anc, parent_of, all_ids and
integrates with load_parquet's Result return type rather than panicking.
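The panic-free closure walk discussed above can be sketched with std collections only. This is an illustration under assumptions, not the code in persistence.rs: ids are taken to be `u32`, rows are `(id, optional parent id)` pairs, and the function name `inheritance_closure` and the `String` error type are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

type Closure = HashMap<u32, HashSet<u32>>;

/// Computes ancestor and descendant sets from (id, optional parent id) rows,
/// returning an error instead of panicking on an unknown parent id.
fn inheritance_closure(rows: &[(u32, Option<u32>)]) -> Result<(Closure, Closure), String> {
    let known: HashSet<u32> = rows.iter().map(|&(id, _)| id).collect();
    let mut parent_of = HashMap::new();
    for &(id, parent) in rows {
        if let Some(p) = parent {
            // Validate up front so the walk below never sees an unknown id.
            if !known.contains(&p) {
                return Err(format!("entity {id} references unknown parent id {p}"));
            }
            parent_of.insert(id, p);
        }
    }
    let mut anc: Closure = known.iter().map(|&id| (id, HashSet::new())).collect();
    let mut desc: Closure = anc.clone();
    for &id in &known {
        let mut current = id;
        while let Some(&parent) = parent_of.get(&current) {
            // insert returns false on a repeat, which also ends cyclic chains.
            if !anc.entry(id).or_default().insert(parent) {
                break;
            }
            desc.entry(parent).or_default().insert(id);
            current = parent;
        }
    }
    Ok((anc, desc))
}
```

Validating parent ids before the walk keeps the loop body free of `unwrap()`, so a corrupt snapshot yields an error the caller can handle.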


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 43896030-bff8-4693-9665-c02acebd05bf

📥 Commits

Reviewing files that changed from the base of the PR and between 45cd6a8 and 5ff4b71.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock, !**/*.lock

📒 Files selected for processing (1)

crates/gf-ontology/src/persistence.rs

macroscopeapp · 2026-05-30T16:00:16Z

+fn try_string_col(
+    batch: &RecordBatch,
+    col: usize,
+    row: usize,
+    label: &str,
+) -> Result<String, OntologyError> {
+    let arr = batch
+        .column(col)
+        .as_any()
+        .downcast_ref::<StringArray>()
+        .ok_or_else(|| OntologyError::Parquet(format!("{label} col {col} is not Utf8")))?;
+    if row >= batch.num_rows() {
+        return Err(OntologyError::Parquet(format!(
+            "{label}: row {row} out of range (have {} rows)",
+            batch.num_rows()
+        )));
+    }
+    Ok(arr.value(row).to_owned())
+}


🟡 Medium src/persistence.rs:339

try_string_col validates row bounds but not col bounds. batch.column(col) panics on out-of-bounds column indices, so a corrupt or mismatched Parquet file with fewer than 4 columns causes a panic instead of returning OntologyError::Parquet. Consider adding a bounds check for col before accessing batch.column(col).

 fn try_string_col(
     batch: &RecordBatch,
     col: usize,
     row: usize,
     label: &str,
 ) -> Result<String, OntologyError> {
+    if col >= batch.num_columns() {
+        return Err(OntologyError::Parquet(format!(
+            "{label}: col {col} out of range (have {} columns)",
+            batch.num_columns()
+        )));
+    }
     let arr = batch
         .column(col)
         .as_any()
         .downcast_ref::<StringArray>()
         .ok_or_else(|| OntologyError::Parquet(format!("{label} col {col} is not Utf8")))?;

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:

In file crates/gf-ontology/src/persistence.rs around lines 339-357: `try_string_col` validates `row` bounds but not `col` bounds. `batch.column(col)` panics on out-of-bounds column indices, so a corrupt or mismatched Parquet file with fewer than 4 columns causes a panic instead of returning `OntologyError::Parquet`. Consider adding a bounds check for `col` before accessing `batch.column(col)`. Evidence trail: crates/gf-ontology/src/persistence.rs lines 339-357 (REVIEWED_COMMIT): `try_string_col` checks row bounds (line 350) but not col bounds before `batch.column(col)` (line 345-346). Callers at lines 63-65 pass col indices 0, 1, 3. Arrow docs at https://docs.rs/arrow/latest/arrow/record_batch/struct.RecordBatch.html confirm `column(index)` panics on out-of-bounds.

macroscopeapp Bot reviewed May 30, 2026

View reviewed changes

Comment thread crates/gf-ontology/src/persistence.rs Outdated

coderabbitai Bot reviewed May 30, 2026

View reviewed changes

Comment thread crates/gf-ontology/src/persistence.rs Outdated

DecisionNerd and others added 2 commits May 30, 2026 09:47

chore: update Cargo.lock after parquet dep

ee6d8f4

coderabbitai Bot reviewed May 30, 2026

View reviewed changes

macroscopeapp Bot reviewed May 30, 2026

View reviewed changes

DecisionNerd merged commit 883a208 into main May 30, 2026
42 checks passed

DecisionNerd deleted the feature/562-ontology-parquet branch May 30, 2026 16:00

DecisionNerd merged 3 commits into main from feature/562-ontology-parquet
