feat(ontology): Parquet persistence — save/load OntologyRuntime#668
Conversation
- persistence.rs: save_parquet() writes 8 Arrow tables as Parquet files
with GraphForge metadata (ontology_id, version, checksum, writer_version)
embedded in Arrow schema metadata; load_parquet() reads them back and
reconstructs all lookup maps (name↔id, property_name_to_id) and the
inheritance closure without re-parsing or re-validating
- error.rs: add OntologyError::Parquet and OntologyError::ChecksumMismatch
- Cargo.toml: add parquet = { workspace = true }
- 5 new tests (64 total): file creation, roundtrip row counts, lookup
maps reconstructed, ancestors closure reconstructed, full example doc
Closes #562
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WalkthroughAdds Parquet snapshot persistence to OntologyRuntime: new ChangesParquet Persistence Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #668 +/- ##
=======================================
Coverage 97.08% 97.08%
=======================================
Files 2 2
Lines 274 274
Branches 41 41
=======================================
Hits 266 266
Misses 5 5
Partials 3 3
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
crates/gf-ontology/src/persistence.rs (2)
74-74: 💤 Low valueDerive
writer_versionfrom the crate version instead of a hardcoded literal.
"0.5.0"will drift from the actual crate version on the next bump. Preferenv!("CARGO_PKG_VERSION").♻️ Suggested change
- ("graphforge.writer_version".into(), "0.5.0".into()), + ("graphforge.writer_version".into(), env!("CARGO_PKG_VERSION").into()),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gf-ontology/src/persistence.rs` at line 74, Replace the hardcoded writer version string for the "graphforge.writer_version" entry with the crate's actual package version at compile time by using env!("CARGO_PKG_VERSION"); locate the tuple ("graphforge.writer_version", "0.5.0".into()) in persistence.rs and change it to derive its value from env!("CARGO_PKG_VERSION") (ensuring you still call .into() or otherwise convert to the expected type).
174-252: ⚖️ Poor tradeoffSchema column indices in persistence already match the compiler’s batch layout
rebuild_entity_maps,rebuild_relation_maps,rebuild_property_map, andrebuild_inheritance_closureusebatch.column(n)indices that line up with the current field order incrates/gf-ontology/src/schemas.rs(ENTITY_TYPES_SCHEMA,RELATION_TYPES_SCHEMA,PROPERTY_TYPES_SCHEMA) and the array order passed toRecordBatch::try_newincrates/gf-ontology/src/compiler.rs(so there’s no current mismatch risk; thedowncast_refchecks also guard expected physical types).To make this resilient to future schema/column reordering (especially where multiple fields share the same Arrow type), consider deriving indices from
batch.schema()viaindex_of("...")(or asserting field names at runtime).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gf-ontology/src/persistence.rs` around lines 174 - 252, The current functions (rebuild_entity_maps, rebuild_relation_maps, rebuild_property_map, and rebuild_inheritance_closure) rely on hardcoded column indices which will break if schemas reorder; change each function to lookup column indices by name via batch.schema().index_of("field_name") (or call schema.field_with_name and unwrap/error) for the specific fields used (e.g., "id", "name" for ENTITY/RELATION, "property_type_id"/"owner_type_id"/"name" for PROPERTY), then use those index values for batch.column(...) and keep the existing downcast_ref checks; add clear error mapping to OntologyError when a field name is missing or has the wrong physical type.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/gf-ontology/src/persistence.rs`:
- Around line 125-164: load_parquet currently ignores the per-file schema
metadata that save_parquet sets (graphforge.ontology_checksum), so stale/corrupt
snapshots never trigger OntologyError::ChecksumMismatch; update load_parquet to
read the schema metadata produced by read_batch_parquet for each table (use
TABLE_NAMES and the results from read_batch_parquet), extract the
"graphforge.ontology_checksum" value and compare it against an expected checksum
(either add an expected_checksum parameter to load_parquet or accept an
Option<&str> so callers can opt-in); if any file's checksum differs or is
missing return Err(OntologyError::ChecksumMismatch) before reconstructing maps
(so rebuild_entity_maps, rebuild_relation_maps, rebuild_inheritance_closure
never run on bad data), and add a unit test that writes parquet via save_parquet
and verifies load_parquet returns ChecksumMismatch when the stored checksum is
changed.
---
Nitpick comments:
In `@crates/gf-ontology/src/persistence.rs`:
- Line 74: Replace the hardcoded writer version string for the
"graphforge.writer_version" entry with the crate's actual package version at
compile time by using env!("CARGO_PKG_VERSION"); locate the tuple
("graphforge.writer_version", "0.5.0".into()) in persistence.rs and change it to
derive its value from env!("CARGO_PKG_VERSION") (ensuring you still call .into()
or otherwise convert to the expected type).
- Around line 174-252: The current functions (rebuild_entity_maps,
rebuild_relation_maps, rebuild_property_map, and rebuild_inheritance_closure)
rely on hardcoded column indices which will break if schemas reorder; change
each function to lookup column indices by name via
batch.schema().index_of("field_name") (or call schema.field_with_name and
unwrap/error) for the specific fields used (e.g., "id", "name" for
ENTITY/RELATION, "property_type_id"/"owner_type_id"/"name" for PROPERTY), then
use those index values for batch.column(...) and keep the existing downcast_ref
checks; add clear error mapping to OntologyError when a field name is missing or
has the wrong physical type.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fdfbaeb7-e8d5-4a4e-9e12-13fb8619525b
📒 Files selected for processing (4)
crates/gf-ontology/Cargo.tomlcrates/gf-ontology/src/error.rscrates/gf-ontology/src/lib.rscrates/gf-ontology/src/persistence.rs
- try_string_col: replace panicking expect() with Result return for public save_parquet function - load_parquet: add expected_checksum param; reads graphforge.ontology_checksum from schema metadata and returns OntologyError::ChecksumMismatch on mismatch - read_batch_parquet: use builder schema (not batch schema) in concat_batches to preserve Arrow key-value metadata through Parquet round-trip - 2 new tests: correct checksum passes, wrong checksum returns ChecksumMismatch Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes Applied SuccessfullyFixed 1 file based on 2 CodeRabbit feedback items. Files modified:
Fix #1 (LOW): Fix #2 (HIGH): Commit: |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/gf-ontology/src/persistence.rs (1)
303-313:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPanic risk: unknown parent id triggers
unwrap()on a recoverable load path.
desc.get_mut(&parent).unwrap()(Line 310) panics if aparent_idreferences an entity id that isn't present inentity_name_to_id(e.g. a corrupt or partially-written snapshot). Sinceload_parquetperforms no validation andexpected_checksumis optional (Noneskips the checksum guard entirely), this is reachable and crashes instead of returning a recoverableOntologyError. Prefer surfacing aParqueterror so callers can fall back to recompiling from source.🛡️ Proposed fix to avoid the panic
for id in &all_ids { let mut current = *id; while let Some(&parent) = parent_of.get(¤t) { if anc[id].contains(&parent) { break; } - anc.get_mut(id).unwrap().insert(parent); - desc.get_mut(&parent).unwrap().insert(*id); + let desc_set = desc.get_mut(&parent).ok_or_else(|| { + OntologyError::Parquet(format!( + "entity_types references unknown parent id {parent}" + )) + })?; + desc_set.insert(*id); + anc.get_mut(id).unwrap().insert(parent); current = parent; } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gf-ontology/src/persistence.rs` around lines 303 - 313, The loop that walks parent_of uses desc.get_mut(&parent).unwrap(), which can panic if a parent id is missing (e.g., corrupt snapshot); instead detect the missing parent and return a recoverable Parquet-style OntologyError from load_parquet so callers can fallback to recompiling. Replace unwrap on desc.get_mut(&parent) with an explicit check (e.g., desc.get_mut(&parent).ok_or_else(|| OntologyError::Parquet("missing parent id ...".into()))?) or similar error construction, and propagate that Result out of the function (update the loop to return Err when a parent is absent). Ensure the error mentions the missing parent id and that this change covers desc, anc, parent_of, all_ids and integrates with load_parquet's Result return type rather than panicking.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/gf-ontology/src/persistence.rs`:
- Around line 303-313: The loop that walks parent_of uses
desc.get_mut(&parent).unwrap(), which can panic if a parent id is missing (e.g.,
corrupt snapshot); instead detect the missing parent and return a recoverable
Parquet-style OntologyError from load_parquet so callers can fallback to
recompiling. Replace unwrap on desc.get_mut(&parent) with an explicit check
(e.g., desc.get_mut(&parent).ok_or_else(|| OntologyError::Parquet("missing
parent id ...".into()))?) or similar error construction, and propagate that
Result out of the function (update the loop to return Err when a parent is
absent). Ensure the error mentions the missing parent id and that this change
covers desc, anc, parent_of, all_ids and integrates with load_parquet's Result
return type rather than panicking.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 43896030-bff8-4693-9665-c02acebd05bf
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock,!**/*.lock
📒 Files selected for processing (1)
crates/gf-ontology/src/persistence.rs
| fn try_string_col( | ||
| batch: &RecordBatch, | ||
| col: usize, | ||
| row: usize, | ||
| label: &str, | ||
| ) -> Result<String, OntologyError> { | ||
| let arr = batch | ||
| .column(col) | ||
| .as_any() | ||
| .downcast_ref::<StringArray>() | ||
| .ok_or_else(|| OntologyError::Parquet(format!("{label} col {col} is not Utf8")))?; | ||
| if row >= batch.num_rows() { | ||
| return Err(OntologyError::Parquet(format!( | ||
| "{label}: row {row} out of range (have {} rows)", | ||
| batch.num_rows() | ||
| ))); | ||
| } | ||
| Ok(arr.value(row).to_owned()) | ||
| } |
There was a problem hiding this comment.
🟡 Medium src/persistence.rs:339
try_string_col validates row bounds but not col bounds. batch.column(col) panics on out-of-bounds column indices, so a corrupt or mismatched Parquet file with fewer than 4 columns causes a panic instead of returning OntologyError::Parquet. Consider adding a bounds check for col before accessing batch.column(col).
fn try_string_col(
batch: &RecordBatch,
col: usize,
row: usize,
label: &str,
) -> Result<String, OntologyError> {
+ if col >= batch.num_columns() {
+ return Err(OntologyError::Parquet(format!(
+ "{label}: col {col} out of range (have {} columns)",
+ batch.num_columns()
+ )));
+ }
let arr = batch
.column(col)
.as_any()
.downcast_ref::<StringArray>()
.ok_or_else(|| OntologyError::Parquet(format!("{label} col {col} is not Utf8")))?;🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file crates/gf-ontology/src/persistence.rs around lines 339-357:
`try_string_col` validates `row` bounds but not `col` bounds. `batch.column(col)` panics on out-of-bounds column indices, so a corrupt or mismatched Parquet file with fewer than 4 columns causes a panic instead of returning `OntologyError::Parquet`. Consider adding a bounds check for `col` before accessing `batch.column(col)`.
Evidence trail:
crates/gf-ontology/src/persistence.rs lines 339-357 (REVIEWED_COMMIT): `try_string_col` checks row bounds (line 350) but not col bounds before `batch.column(col)` (line 345-346). Callers at lines 63-65 pass col indices 0, 1, 3. Arrow docs at https://docs.rs/arrow/latest/arrow/record_batch/struct.RecordBatch.html confirm `column(index)` panics on out-of-bounds.
Closes #562
Sixth issue in the v0.5.0: Ontology Runtime milestone.
Summary
save_parquet(runtime, dir)— writes all 8 Arrow tables as Parquet files todir:ontology_meta.parquet,entity_types.parquet,relation_types.parquet,property_types.parquet,type_constraints.parquet,semantic_flags.parquet,cardinality_rules.parquet,aliases.parquetgraphforge.ontology_id,graphforge.ontology_version,graphforge.ontology_checksum,graphforge.writer_versionload_parquet(dir)— reads the 8 Parquet files and reconstructs the fullOntologyRuntimewithout YAML/JSON parsing or re-validation:entity_name_to_id,entity_id_to_name,relation_name_to_id,relation_id_to_nameproperty_name_to_idfrom theproperty_typestableancestorsanddescendantsinheritance closure from theentity_typestableNew error variants:
OntologyError::Parquet(String)andOntologyError::ChecksumMismatch { cached, computed }(reserved for future stale-cache detection)Test plan
save_parquet_creates_files— all 8.parquetfiles exist after saveload_parquet_roundtrip_row_counts— row counts match original after round-tripload_parquet_lookup_maps_reconstructed—entity_name_to_idandrelation_name_to_idintactload_parquet_ancestors_reconstructed— Employee⊆Person closure preservedsave_load_full_example— storage.md example compiles, saves, loads correctlycargo test -p gf-ontology— 64 tests passcargo clippy -p gf-ontology -- -D warnings— clean🤖 Generated with Claude Code
Note
Add Parquet save/load persistence for
OntologyRuntimesave_parquetandload_parquetas public APIs in thegf-ontologycrate, writing/readingOntologyRuntimestate as eight Parquet files in a directory (entity types, relation types, property types, type constraints, semantic flags, cardinality rules, aliases, and ontology meta).load_parquetverifies an optional expected checksum against the stored metadata, returningChecksumMismatchon mismatch.OntologyErrorvariants are added:Parquet(String)for I/O errors andChecksumMismatch { cached, computed }for integrity failures.Macroscope summarized 5ff4b71.
Summary by CodeRabbit
New Features
Bug Fixes
Chores