Skip to content

feat(ontology): Parquet persistence — save/load OntologyRuntime#668

Merged
DecisionNerd merged 3 commits into
mainfrom
feature/562-ontology-parquet
May 30, 2026
Merged

feat(ontology): Parquet persistence — save/load OntologyRuntime#668
DecisionNerd merged 3 commits into
mainfrom
feature/562-ontology-parquet

Conversation

@DecisionNerd
Copy link
Copy Markdown
Owner

@DecisionNerd DecisionNerd commented May 30, 2026

Closes #562

Sixth issue in the v0.5.0: Ontology Runtime milestone.

Summary

save_parquet(runtime, dir) — writes all 8 Arrow tables as Parquet files to dir:

  • ontology_meta.parquet, entity_types.parquet, relation_types.parquet, property_types.parquet, type_constraints.parquet, semantic_flags.parquet, cardinality_rules.parquet, aliases.parquet
  • Each file carries GraphForge metadata in Arrow schema key-value metadata: graphforge.ontology_id, graphforge.ontology_version, graphforge.ontology_checksum, graphforge.writer_version

load_parquet(dir) — reads the 8 Parquet files and reconstructs the full OntologyRuntime without YAML/JSON parsing or re-validation:

  • Rebuilds entity_name_to_id, entity_id_to_name, relation_name_to_id, relation_id_to_name
  • Rebuilds property_name_to_id from the property_types table
  • Rebuilds ancestors and descendants inheritance closure from the entity_types table

New error variants: OntologyError::Parquet(String) and OntologyError::ChecksumMismatch { cached, computed } (reserved for future stale-cache detection)

Test plan

  • save_parquet_creates_files — all 8 .parquet files exist after save
  • load_parquet_roundtrip_row_counts — row counts match original after round-trip
  • load_parquet_lookup_maps_reconstructedentity_name_to_id and relation_name_to_id intact
  • load_parquet_ancestors_reconstructed — Employee⊆Person closure preserved
  • save_load_full_example — storage.md example compiles, saves, loads correctly
  • cargo test -p gf-ontology — 64 tests pass
  • cargo clippy -p gf-ontology -- -D warnings — clean

🤖 Generated with Claude Code

Note

Add Parquet save/load persistence for OntologyRuntime

  • Adds save_parquet and load_parquet as public APIs in the gf-ontology crate, writing/reading OntologyRuntime state as eight Parquet files in a directory (entity types, relation types, property types, type constraints, semantic flags, cardinality rules, aliases, and ontology meta).
  • Each Parquet file embeds GraphForge key-value metadata in the Arrow schema; load_parquet verifies an optional expected checksum against the stored metadata, returning ChecksumMismatch on mismatch.
  • Loading reconstructs name/id lookup maps and entity inheritance ancestor/descendant closures directly from the Parquet tables, bypassing YAML/JSON re-parsing.
  • Two new OntologyError variants are added: Parquet(String) for I/O errors and ChecksumMismatch { cached, computed } for integrity failures.

Macroscope summarized 5ff4b71.

Summary by CodeRabbit

  • New Features

    • Parquet-based persistence for ontology runtime with save/load snapshot support.
    • Optional checksum verification to detect mismatched or stale snapshots.
  • Bug Fixes

    • Improved error handling and clearer error variants for Parquet I/O and checksum failures.
  • Chores

    • Added Parquet format support as a dependency.

Review Change Stack

- persistence.rs: save_parquet() writes 8 Arrow tables as Parquet files
  with GraphForge metadata (ontology_id, version, checksum, writer_version)
  embedded in Arrow schema metadata; load_parquet() reads them back and
  reconstructs all lookup maps (name↔id, property_name_to_id) and the
  inheritance closure without re-parsing or re-validating
- error.rs: add OntologyError::Parquet and OntologyError::ChecksumMismatch
- Cargo.toml: add parquet = { workspace = true }
- 5 new tests (64 total): file creation, roundtrip row counts, lookup
  maps reconstructed, ancestors closure reconstructed, full example doc

Closes #562

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 30, 2026

Walkthrough

Adds Parquet snapshot persistence to OntologyRuntime: new parquet dependency, OntologyError Parquet/checksum variants, save_parquet and load_parquet(expected_checksum), helpers to read/write Arrow/Parquet preserving schema metadata, runtime reconstruction logic, and tests including checksum validation.

Changes

Parquet Persistence Implementation

Layer / File(s) Summary
Error types and module setup
crates/gf-ontology/Cargo.toml, crates/gf-ontology/src/error.rs, crates/gf-ontology/src/lib.rs
parquet dependency added; OntologyError gains Parquet and ChecksumMismatch variants; crate exposes persistence module and re-exports load_parquet/save_parquet.
Parquet save and load operations
crates/gf-ontology/src/persistence.rs
save_parquet writes eight OntologyRuntime RecordBatches as {table}.parquet with GraphForge metadata; load_parquet(dir, expected_checksum) reads the eight files, optionally verifies graphforge.ontology_checksum, and coordinates reconstruction; includes Parquet IO error mapping and writer helper.
Runtime reconstruction and I/O helpers
crates/gf-ontology/src/persistence.rs
Rebuilds entity/relation name↔id maps, property composite-key map, and computes transitive inheritance closure; provides read_batch_parquet preserving schema for zero-row files and try_string_col for safe string extraction.
Integration tests
crates/gf-ontology/src/persistence.rs
Tests validate Parquet file creation, row-count round-trips, lookup-map and inheritance reconstruction, YAML→compile→save→load end-to-end flow, and checksum success/failure behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding Parquet persistence (save/load functionality) for OntologyRuntime, which is the primary objective of this PR.
Linked Issues check ✅ Passed The PR successfully implements all core coding requirements from #562: save_parquet writes eight Parquet files with embedded GraphForge metadata, load_parquet reconstructs OntologyRuntime with checksum validation and lookup map/inheritance closure rebuilding, new OntologyError variants are added, and integration tests verify the functionality.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing Parquet persistence: dependency addition in Cargo.toml, error type extensions, module exposure, and persistence implementation with comprehensive tests. No unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed PR description is comprehensive and addresses the template structure well, covering objectives, changes, testing, and implementation details.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/562-ontology-parquet

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.08%. Comparing base (80386b0) to head (5ff4b71).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #668   +/-   ##
=======================================
  Coverage   97.08%   97.08%           
=======================================
  Files           2        2           
  Lines         274      274           
  Branches       41       41           
=======================================
  Hits          266      266           
  Misses          5        5           
  Partials        3        3           
Flag Coverage Δ
full-coverage 97.08% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
parser ∅ <ø> (∅)
planner ∅ <ø> (∅)
executor ∅ <ø> (∅)
storage ∅ <ø> (∅)
ast ∅ <ø> (∅)
types ∅ <ø> (∅)

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80386b0...5ff4b71. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread crates/gf-ontology/src/persistence.rs Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
crates/gf-ontology/src/persistence.rs (2)

74-74: 💤 Low value

Derive writer_version from the crate version instead of a hardcoded literal.

"0.5.0" will drift from the actual crate version on the next bump. Prefer env!("CARGO_PKG_VERSION").

♻️ Suggested change
-        ("graphforge.writer_version".into(), "0.5.0".into()),
+        ("graphforge.writer_version".into(), env!("CARGO_PKG_VERSION").into()),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gf-ontology/src/persistence.rs` at line 74, Replace the hardcoded
writer version string for the "graphforge.writer_version" entry with the crate's
actual package version at compile time by using env!("CARGO_PKG_VERSION");
locate the tuple ("graphforge.writer_version", "0.5.0".into()) in persistence.rs
and change it to derive its value from env!("CARGO_PKG_VERSION") (ensuring you
still call .into() or otherwise convert to the expected type).

174-252: ⚖️ Poor tradeoff

Schema column indices in persistence already match the compiler’s batch layout

rebuild_entity_maps, rebuild_relation_maps, rebuild_property_map, and rebuild_inheritance_closure use batch.column(n) indices that line up with the current field order in crates/gf-ontology/src/schemas.rs (ENTITY_TYPES_SCHEMA, RELATION_TYPES_SCHEMA, PROPERTY_TYPES_SCHEMA) and the array order passed to RecordBatch::try_new in crates/gf-ontology/src/compiler.rs (so there’s no current mismatch risk; the downcast_ref checks also guard expected physical types).

To make this resilient to future schema/column reordering (especially where multiple fields share the same Arrow type), consider deriving indices from batch.schema() via index_of("...") (or asserting field names at runtime).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gf-ontology/src/persistence.rs` around lines 174 - 252, The current
functions (rebuild_entity_maps, rebuild_relation_maps, rebuild_property_map, and
rebuild_inheritance_closure) rely on hardcoded column indices which will break
if schemas reorder; change each function to lookup column indices by name via
batch.schema().index_of("field_name") (or call schema.field_with_name and
unwrap/error) for the specific fields used (e.g., "id", "name" for
ENTITY/RELATION, "property_type_id"/"owner_type_id"/"name" for PROPERTY), then
use those index values for batch.column(...) and keep the existing downcast_ref
checks; add clear error mapping to OntologyError when a field name is missing or
has the wrong physical type.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/gf-ontology/src/persistence.rs`:
- Around line 125-164: load_parquet currently ignores the per-file schema
metadata that save_parquet sets (graphforge.ontology_checksum), so stale/corrupt
snapshots never trigger OntologyError::ChecksumMismatch; update load_parquet to
read the schema metadata produced by read_batch_parquet for each table (use
TABLE_NAMES and the results from read_batch_parquet), extract the
"graphforge.ontology_checksum" value and compare it against an expected checksum
(either add an expected_checksum parameter to load_parquet or accept an
Option<&str> so callers can opt-in); if any file's checksum differs or is
missing return Err(OntologyError::ChecksumMismatch) before reconstructing maps
(so rebuild_entity_maps, rebuild_relation_maps, rebuild_inheritance_closure
never run on bad data), and add a unit test that writes parquet via save_parquet
and verifies load_parquet returns ChecksumMismatch when the stored checksum is
changed.

---

Nitpick comments:
In `@crates/gf-ontology/src/persistence.rs`:
- Line 74: Replace the hardcoded writer version string for the
"graphforge.writer_version" entry with the crate's actual package version at
compile time by using env!("CARGO_PKG_VERSION"); locate the tuple
("graphforge.writer_version", "0.5.0".into()) in persistence.rs and change it to
derive its value from env!("CARGO_PKG_VERSION") (ensuring you still call .into()
or otherwise convert to the expected type).
- Around line 174-252: The current functions (rebuild_entity_maps,
rebuild_relation_maps, rebuild_property_map, and rebuild_inheritance_closure)
rely on hardcoded column indices which will break if schemas reorder; change
each function to lookup column indices by name via
batch.schema().index_of("field_name") (or call schema.field_with_name and
unwrap/error) for the specific fields used (e.g., "id", "name" for
ENTITY/RELATION, "property_type_id"/"owner_type_id"/"name" for PROPERTY), then
use those index values for batch.column(...) and keep the existing downcast_ref
checks; add clear error mapping to OntologyError when a field name is missing or
has the wrong physical type.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fdfbaeb7-e8d5-4a4e-9e12-13fb8619525b

📥 Commits

Reviewing files that changed from the base of the PR and between 80386b0 and 45cd6a8.

📒 Files selected for processing (4)
  • crates/gf-ontology/Cargo.toml
  • crates/gf-ontology/src/error.rs
  • crates/gf-ontology/src/lib.rs
  • crates/gf-ontology/src/persistence.rs

Comment thread crates/gf-ontology/src/persistence.rs Outdated
DecisionNerd and others added 2 commits May 30, 2026 09:47
- try_string_col: replace panicking expect() with Result return for
  public save_parquet function
- load_parquet: add expected_checksum param; reads graphforge.ontology_checksum
  from schema metadata and returns OntologyError::ChecksumMismatch on mismatch
- read_batch_parquet: use builder schema (not batch schema) in concat_batches
  to preserve Arrow key-value metadata through Parquet round-trip
- 2 new tests: correct checksum passes, wrong checksum returns ChecksumMismatch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@DecisionNerd
Copy link
Copy Markdown
Owner Author

Fixes Applied Successfully

Fixed 1 file based on 2 CodeRabbit feedback items.

Files modified:

  • crates/gf-ontology/src/persistence.rs

Fix #1 (LOW): try_string_col replaces the panicking string_col helper — now returns on malformed instead of panicking.

Fix #2 (HIGH): load_parquet gains an expected_checksum: Option<&str> parameter. When Some, it reads graphforge.ontology_checksum from the Parquet file's Arrow schema metadata and returns OntologyError::ChecksumMismatch if the values differ. Also fixed a subtle Parquet metadata round-trip bug where concat_batches was using the batch schema (which loses custom Arrow metadata) instead of the builder schema (which preserves it).

Commit: 5ff4b71

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/gf-ontology/src/persistence.rs (1)

303-313: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Panic risk: unknown parent id triggers unwrap() on a recoverable load path.

desc.get_mut(&parent).unwrap() (Line 310) panics if a parent_id references an entity id that isn't present in entity_name_to_id (e.g. a corrupt or partially-written snapshot). Since load_parquet performs no validation and expected_checksum is optional (None skips the checksum guard entirely), this is reachable and crashes instead of returning a recoverable OntologyError. Prefer surfacing a Parquet error so callers can fall back to recompiling from source.

🛡️ Proposed fix to avoid the panic
         for id in &all_ids {
             let mut current = *id;
             while let Some(&parent) = parent_of.get(&current) {
                 if anc[id].contains(&parent) {
                     break;
                 }
-                anc.get_mut(id).unwrap().insert(parent);
-                desc.get_mut(&parent).unwrap().insert(*id);
+                let desc_set = desc.get_mut(&parent).ok_or_else(|| {
+                    OntologyError::Parquet(format!(
+                        "entity_types references unknown parent id {parent}"
+                    ))
+                })?;
+                desc_set.insert(*id);
+                anc.get_mut(id).unwrap().insert(parent);
                 current = parent;
             }
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gf-ontology/src/persistence.rs` around lines 303 - 313, The loop that
walks parent_of uses desc.get_mut(&parent).unwrap(), which can panic if a parent
id is missing (e.g., corrupt snapshot); instead detect the missing parent and
return a recoverable Parquet-style OntologyError from load_parquet so callers
can fallback to recompiling. Replace unwrap on desc.get_mut(&parent) with an
explicit check (e.g., desc.get_mut(&parent).ok_or_else(||
OntologyError::Parquet("missing parent id ...".into()))?) or similar error
construction, and propagate that Result out of the function (update the loop to
return Err when a parent is absent). Ensure the error mentions the missing
parent id and that this change covers desc, anc, parent_of, all_ids and
integrates with load_parquet's Result return type rather than panicking.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/gf-ontology/src/persistence.rs`:
- Around line 303-313: The loop that walks parent_of uses
desc.get_mut(&parent).unwrap(), which can panic if a parent id is missing (e.g.,
corrupt snapshot); instead detect the missing parent and return a recoverable
Parquet-style OntologyError from load_parquet so callers can fallback to
recompiling. Replace unwrap on desc.get_mut(&parent) with an explicit check
(e.g., desc.get_mut(&parent).ok_or_else(|| OntologyError::Parquet("missing
parent id ...".into()))?) or similar error construction, and propagate that
Result out of the function (update the loop to return Err when a parent is
absent). Ensure the error mentions the missing parent id and that this change
covers desc, anc, parent_of, all_ids and integrates with load_parquet's Result
return type rather than panicking.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 43896030-bff8-4693-9665-c02acebd05bf

📥 Commits

Reviewing files that changed from the base of the PR and between 45cd6a8 and 5ff4b71.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (1)
  • crates/gf-ontology/src/persistence.rs

Comment on lines +339 to +357
fn try_string_col(
batch: &RecordBatch,
col: usize,
row: usize,
label: &str,
) -> Result<String, OntologyError> {
let arr = batch
.column(col)
.as_any()
.downcast_ref::<StringArray>()
.ok_or_else(|| OntologyError::Parquet(format!("{label} col {col} is not Utf8")))?;
if row >= batch.num_rows() {
return Err(OntologyError::Parquet(format!(
"{label}: row {row} out of range (have {} rows)",
batch.num_rows()
)));
}
Ok(arr.value(row).to_owned())
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium src/persistence.rs:339

try_string_col validates row bounds but not col bounds. batch.column(col) panics on out-of-bounds column indices, so a corrupt or mismatched Parquet file with fewer than 4 columns causes a panic instead of returning OntologyError::Parquet. Consider adding a bounds check for col before accessing batch.column(col).

 fn try_string_col(
     batch: &RecordBatch,
     col: usize,
     row: usize,
     label: &str,
 ) -> Result<String, OntologyError> {
+    if col >= batch.num_columns() {
+        return Err(OntologyError::Parquet(format!(
+            "{label}: col {col} out of range (have {} columns)",
+            batch.num_columns()
+        )));
+    }
     let arr = batch
         .column(col)
         .as_any()
         .downcast_ref::<StringArray>()
         .ok_or_else(|| OntologyError::Parquet(format!("{label} col {col} is not Utf8")))?;
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file crates/gf-ontology/src/persistence.rs around lines 339-357:

`try_string_col` validates `row` bounds but not `col` bounds. `batch.column(col)` panics on out-of-bounds column indices, so a corrupt or mismatched Parquet file with fewer than 4 columns causes a panic instead of returning `OntologyError::Parquet`. Consider adding a bounds check for `col` before accessing `batch.column(col)`.

Evidence trail:
crates/gf-ontology/src/persistence.rs lines 339-357 (REVIEWED_COMMIT): `try_string_col` checks row bounds (line 350) but not col bounds before `batch.column(col)` (line 345-346). Callers at lines 63-65 pass col indices 0, 1, 3. Arrow docs at https://docs.rs/arrow/latest/arrow/record_batch/struct.RecordBatch.html confirm `column(index)` panics on out-of-bounds.

@DecisionNerd DecisionNerd merged commit 883a208 into main May 30, 2026
42 checks passed
@DecisionNerd DecisionNerd deleted the feature/562-ontology-parquet branch May 30, 2026 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rust: persist and reload ontology runtime tables as Parquet

1 participant