release: AIngle v0.6.3 — total data integrity hardening#83
Merged
ApiliumDevTeam merged 16 commits into main on Mar 19, 2026
Conversation
- GraphDB.flush() now flushes DAG store alongside triple store
- DAG persistent init failure is now fatal (no silent in-memory fallback)
- DAG action failures on triple insert/delete return errors instead of being silently swallowed — prevents triples existing without an audit trail
- GraphQL mutations now record DAG actions (previously bypassed DAG/Raft entirely, causing split-brain in cluster mode)
- DagStore.put() validates parent hashes exist before accepting actions, preventing orphaned entries that break traversal and time-travel queries
- Corrupted actions in DAG backend are now logged during index rebuild instead of being silently skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
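The parent-hash check can be sketched std-only; `Action` and `DagStore` here are simplified stand-ins (String hashes, a HashMap store) for AIngle's real DAG types:

```rust
use std::collections::HashMap;

// Illustrative stand-ins: the real store uses cryptographic hashes,
// this sketch uses Strings to show the validation logic only.
struct Action {
    hash: String,
    parents: Vec<String>,
}

struct DagStore {
    actions: HashMap<String, Action>,
}

impl DagStore {
    fn new() -> Self {
        DagStore { actions: HashMap::new() }
    }

    /// Reject actions whose parents are unknown, so traversal and
    /// time-travel queries never hit an orphaned entry.
    fn put(&mut self, action: Action) -> Result<(), String> {
        for p in &action.parents {
            if !self.actions.contains_key(p) {
                return Err(format!("unknown parent hash: {p}"));
            }
        }
        self.actions.insert(action.hash.clone(), action);
        Ok(())
    }
}
```

Rejecting at `put()` time keeps the invariant local: every stored action's ancestry is fully resolvable.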
ProofStore was purely in-memory — all proofs lost on restart. This adds a ProofBackend trait (mirroring the DagBackend pattern) with Memory and Sled implementations.

- New `proofs/backend.rs` with ProofBackend trait, MemoryProofBackend, and SledProofBackend (tree "proofs" in a dedicated sled DB)
- Refactored ProofStore to use backend trait instead of HashMap
- `ProofStore::with_sled(path)` constructor for persistent storage
- `ProofStore::flush()` for durable writes
- `AppState::with_db_path` creates Sled-backed ProofStore (uses `proofs.sled` sibling directory to avoid sled lock contention with the graph DB)
- `AppState::flush()` now flushes proof store alongside graph
- Stats rebuilt from backend on startup (no tokio lock needed)
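The backend pattern can be sketched std-only; `MemoryProofBackend` mirrors the name above, but the method set and bodies here are illustrative assumptions, and the Sled variant is omitted:

```rust
use std::collections::HashMap;

// Sketch of the ProofBackend pattern: the store talks to a trait, so
// swapping Memory for Sled changes durability without touching callers.
trait ProofBackend {
    fn put(&mut self, id: String, proof: Vec<u8>);
    fn get(&self, id: &str) -> Option<Vec<u8>>;
    fn list_all(&self) -> Vec<(String, Vec<u8>)>;
    fn flush(&mut self) {} // no-op default; durable backends override
}

struct MemoryProofBackend {
    proofs: HashMap<String, Vec<u8>>,
}

impl ProofBackend for MemoryProofBackend {
    fn put(&mut self, id: String, proof: Vec<u8>) {
        self.proofs.insert(id, proof);
    }
    fn get(&self, id: &str) -> Option<Vec<u8>> {
        self.proofs.get(id).cloned()
    }
    fn list_all(&self) -> Vec<(String, Vec<u8>)> {
        self.proofs.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
    }
}

// The store holds a boxed backend, chosen at construction time.
struct ProofStore {
    backend: Box<dyn ProofBackend>,
}
```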
ClusterSnapshot did not include proofs — new nodes joining a cluster started with zero proofs. This adds proof snapshot export/import to the Raft state machine.

- Added ProofSnapshot struct and ProofSnapshotProvider trait
- ClusterSnapshot now has `proofs: Vec<ProofSnapshot>` field (backward-compatible via serde(default))
- Blake3 checksum now covers proofs alongside triples + ineru_ltm
- CortexSnapshotBuilder exports proofs via provider during build
- install_snapshot() imports proofs when present
- ProofStore implements ProofSnapshotProvider (sync methods for export/import via backend.list_all)
- Wired into cluster_init: proof provider set on CortexStateMachine
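The checksum point is easy to demonstrate: once proofs feed the digest, two snapshots differing only in proofs get different checksums. This sketch uses std's DefaultHasher purely as a stand-in for the Blake3 digest:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative only: the real snapshot uses Blake3 over serialized
// triples, ineru_ltm, and (now) proofs. The shape is the same: every
// replicated data set must contribute to the digest.
fn snapshot_checksum(triples: &[String], proofs: &[String]) -> u64 {
    let mut h = DefaultHasher::new();
    for t in triples {
        t.hash(&mut h);
    }
    for p in proofs {
        p.hash(&mut h); // previously omitted, so proof drift was invisible
    }
    h.finish()
}
```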
Ineru only saved on explicit shutdown — a crash meant data loss. This adds a configurable periodic flush task.

- Added `flush_interval_secs` to CortexConfig (default: 300s)
- Added `--flush-interval <SECS>` CLI argument (0 = disabled)
- Spawns tokio task that calls state.flush() at the configured interval
- Flushes graph DB, proof store, and Ineru snapshot atomically
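A std-thread sketch of the flush loop (the PR spawns a tokio task, but the shape is the same); `FlushTarget` is a hypothetical stand-in for AppState, and a zero interval disables the task, as with `--flush-interval 0`:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for AppState: flush() here just counts calls.
struct FlushTarget {
    flushes: AtomicUsize,
}

impl FlushTarget {
    fn flush(&self) {
        self.flushes.fetch_add(1, Ordering::SeqCst);
    }
}

fn spawn_periodic_flush(
    target: Arc<FlushTarget>,
    interval: Duration,
    stop: Arc<AtomicBool>,
) -> Option<thread::JoinHandle<()>> {
    if interval.is_zero() {
        return None; // interval 0 = feature disabled
    }
    Some(thread::spawn(move || {
        while !stop.load(Ordering::SeqCst) {
            thread::sleep(interval);
            target.flush();
        }
    }))
}
```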
AuditLog::record() used `let _ = writeln!()` which silently ignored write failures, and never called sync_all() — data could be lost on crash.

- writeln! errors now logged via log::error!
- file.sync_all() called after each write (log::warn on failure)
- OpenOptions::open failures logged instead of silently ignored
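The fixed record() shape, roughly: report open and write errors instead of discarding them, and fsync each entry. eprintln! stands in here for the log crate calls named above:

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

// Sketch of the hardened append path: every failure mode is surfaced,
// and sync_all() makes each entry durable before record() returns.
fn record(path: &Path, entry: &str) {
    match OpenOptions::new().create(true).append(true).open(path) {
        Ok(mut f) => {
            if let Err(e) = writeln!(f, "{entry}") {
                eprintln!("audit log write failed: {e}");
                return;
            }
            if let Err(e) = f.sync_all() {
                eprintln!("audit log fsync failed: {e}");
            }
        }
        Err(e) => eprintln!("audit log open failed: {e}"),
    }
}
```

A failed fsync is logged as a warning rather than an error in the PR; this sketch does not distinguish the levels.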
insert_batch() performed individual puts — a failure mid-batch left the store in an inconsistent state with some triples written and indexes partially updated.

- Added apply_batch() to StorageBackend trait (default: sequential puts)
- SledBackend overrides with sled::Batch for atomic writes
- Refactored GraphStore::insert_batch() into 3 phases:
  1. Collect non-duplicate triples
  2. Atomic backend batch write
  3. Update indexes only on success
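The three phases can be sketched with a HashMap standing in for the backend; in the real code phase 2 is a single sled::Batch write, which is what makes the step atomic:

```rust
use std::collections::{HashMap, HashSet};

// Simplified stand-ins: key/value Strings instead of real triples.
struct GraphStore {
    backend: HashMap<String, String>,
    index: HashSet<String>,
}

impl GraphStore {
    /// Returns (inserted, duplicates).
    fn insert_batch(&mut self, triples: Vec<(String, String)>) -> (usize, usize) {
        // Phase 1: collect non-duplicate triples.
        let mut fresh = Vec::new();
        let mut dups = 0;
        for (k, v) in triples {
            if self.backend.contains_key(&k) {
                dups += 1;
            } else {
                fresh.push((k, v));
            }
        }
        let keys: Vec<String> = fresh.iter().map(|(k, _)| k.clone()).collect();
        // Phase 2: one backend batch write (sled::Batch in the override).
        for (k, v) in fresh {
            self.backend.insert(k, v);
        }
        // Phase 3: update indexes only after the write succeeded.
        let inserted = keys.len();
        for k in keys {
            self.index.insert(k);
        }
        (inserted, dups)
    }
}
```

Ordering the index update after the batch write means a failed write leaves both structures untouched instead of half-updated.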
Gossip only synced triples — DAG actions were not replicated between peers. This adds tip-based DAG sync to the P2P layer.

- New p2p/dag_sync.rs module with tip collection, missing action computation, serialized action fetch, and batch ingestion
- Added DagTipSync, RequestDagActions, SendDagActions message variants (feature-gated under "dag")
- Gossip loop (Task 2) now sends DagTipSync alongside BloomSync
- Message handler (Task 3) handles all 3 DAG message types:
  - DagTipSync: compute missing actions via DagStore::compute_missing()
  - RequestDagActions: fetch and send actions by hash
  - SendDagActions: ingest received actions via DagStore::ingest()
- Reuses existing DagStore::compute_missing() and ingest() APIs
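At its core the missing-action computation is a set difference over the tips a peer advertises; a minimal sketch with String hashes standing in for real action hashes:

```rust
use std::collections::HashSet;

// Sketch of the DagTipSync step: given our known hashes and a peer's
// advertised tips, return the tips we still need to request.
fn compute_missing(local: &HashSet<String>, peer_tips: &[String]) -> Vec<String> {
    peer_tips
        .iter()
        .filter(|t| !local.contains(*t))
        .cloned()
        .collect()
}
```

The returned hashes would then drive a RequestDagActions message, with the peer answering via SendDagActions.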
End-to-end tests verifying data flows correctly across all AIngle subsystems — these caught the sled lock contention bug where ProofStore and GraphDB shared the same sled path.

- ProofStore Sled round-trip (20 proofs write/reopen/delete/verify)
- Graph+DAG triple materialization consistency (50 triples + deletes)
- Batch insert index consistency (100 triples, duplicate handling)
- AppState flush/restore full cycle (triples + proofs survive restart)
- Raft snapshot with proofs serialization round-trip
- Snapshot checksum changes when proofs are included
- Graph Sled persistence with float precision verification
- Audit log fsync integrity (50 entries write/reopen/query filters)
Audit found 5 locations where important data was written to disk without sync_all(), meaning a crash or power loss could lose the data even after a "successful" write returned.

- main.rs: DAG signing key — `let _ = write_all` replaced with proper error handling + fsync (key loss broke all future DAG signatures)
- kaneru/persistence.rs: Agent state + LearningEngine saves now fsync after write_all (ML weights/Q-values could be lost)
- ineru/lib.rs: Memory snapshot now uses File::create + write_all + sync_all instead of std::fs::write (which never fsyncs)
- p2p/identity.rs: Node Ed25519 key now fsynced on both Unix and non-Unix (identity loss = can't rejoin P2P mesh)
- p2p/peer_store.rs: Known peers JSON now fsynced (peer list loss = must rediscover all network peers)
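The core distinction is worth showing: std::fs::write returns once the data reaches the page cache and never calls fsync, so a durable save needs the explicit create + write_all + sync_all sequence. A minimal helper assuming nothing beyond std (`durable_write` is a hypothetical name, not the PR's):

```rust
use std::fs::File;
use std::io::Write;
use std::path::Path;

// Durable replacement for std::fs::write: same arguments, but the data
// is flushed to stable storage before Ok is returned.
fn durable_write(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let mut f = File::create(path)?;
    f.write_all(bytes)?;
    f.sync_all()?; // block until the OS reports the data is on disk
    Ok(())
}
```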
Mutex lock poisoning in the WAL writer caused panics that crashed the entire node. The WAL is the most critical data path — a panic here takes down all Raft consensus operations. All 4 .lock().unwrap() calls replaced with .lock().map_err() that returns io::Error, propagating the failure gracefully instead of aborting the process.
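The conversion from a poisoned lock into an io::Error can be sketched in a few lines; `lock_wal` is a hypothetical helper name, not the PR's actual code:

```rust
use std::io;
use std::sync::{Mutex, MutexGuard};

// A poisoned mutex becomes an io::Error the caller can propagate,
// instead of a panic that takes down all Raft consensus operations.
fn lock_wal<T>(m: &Mutex<T>) -> io::Result<MutexGuard<'_, T>> {
    m.lock()
        .map_err(|_| io::Error::new(io::ErrorKind::Other, "WAL mutex poisoned"))
}
```

Callers written with `?` then surface the failure through their normal io::Result paths.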
RuleEngine used .write().unwrap() and .read().unwrap() on 9 lock acquisitions. A panic in any thread holding these locks would cascade to crash all subsequent validation/inference operations. Replaced all 9 occurrences with .unwrap_or_else(|p| p.into_inner()) which recovers the data from a poisoned lock and continues operating. Stats and inferred triples are non-critical — crashing the server over a stats counter is disproportionate.
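The recovery idiom itself, applied to a hypothetical stats counter (the RuleEngine's actual fields are not shown here):

```rust
use std::sync::RwLock;

// into_inner() on the poison error hands back the guard anyway,
// accepting possibly stale non-critical data rather than crashing
// every subsequent caller.
fn bump_stats(stats: &RwLock<u64>) {
    let mut guard = stats.write().unwrap_or_else(|p| p.into_inner());
    *guard += 1;
}
```

This is appropriate precisely because the data is non-critical; for invariant-bearing state, propagating the error (as the WAL fix does) is the safer choice.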
KaneruAgent::learn() called .unwrap() on current_state (Option) and observation_history.back() (Option) — both panic if called before any observation is recorded. Replaced with early returns + log::warn for graceful degradation. The agent now safely skips learning when called in an invalid state instead of crashing the entire process.
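A sketch of the guard pattern with let-else; the struct and its fields are illustrative stand-ins, not Kaneru's actual types, and eprintln! stands in for log::warn:

```rust
// Hypothetical simplified agent: learn() returns whether an update ran.
struct Agent {
    current_state: Option<String>,
    history: Vec<String>,
}

impl Agent {
    fn learn(&mut self) -> bool {
        let Some(state) = self.current_state.as_ref() else {
            eprintln!("learn() called before any observation; skipping");
            return false;
        };
        let Some(last) = self.history.last() else {
            eprintln!("learn() called with empty observation history; skipping");
            return false;
        };
        // ... the Q-value update using `state` and `last` would go here ...
        let _ = (state, last);
        true
    }
}
```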
p2p_status() and list_peers() called serde_json::to_value().unwrap() which would panic the server if serialization ever failed. Replaced with match that returns 500 Internal Server Error with error details.
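The fix reduces to mapping a serialization Result onto HTTP status codes; this sketch abstracts serde_json behind a plain Result and uses (status, body) tuples instead of a real response type:

```rust
// `to_json` stands in for the serde_json::to_value call: on failure the
// handler now answers 500 with details instead of panicking the server.
fn respond(to_json: Result<String, String>) -> (u16, String) {
    match to_json {
        Ok(body) => (200, body),
        Err(e) => (500, format!("serialization failed: {e}")),
    }
}
```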
This release fixes 16 bugs found during an exhaustive input/output audit of the entire AIngle data pipeline:

## Data integrity (6 fixes)

- Persistent ProofStore with Sled backend (was in-memory only)
- Proofs included in Raft cluster snapshots (new nodes got 0 proofs)
- Periodic auto-flush every 300s (crash = data loss window reduced)
- Audit log fsync + error reporting (was silently dropping writes)
- Atomic batch insert via sled::Batch (partial writes impossible)
- P2P DAG action sync via tip exchange (DAG wasn't replicated)

## fsync hardening (5 fixes)

- DAG signing key write — error handling + fsync
- Kaneru agent state + ML weights — fsync after save
- Ineru memory snapshot — fsync after write
- P2P node identity key — fsync on all platforms
- Peer store JSON — fsync after write

## Panic elimination (5 fixes)

- WAL writer: lock().unwrap() → lock().map_err() (4 sites)
- Rule engine: poisoned lock recovery (9 sites)
- Kaneru agent: unwrap on Option → graceful early return
- P2P REST endpoints: unwrap → HTTP 500 error response
- ProofStore init: blocking_write removed entirely

## Testing

- 8 new cross-subsystem data integrity tests
- 1092+ tests passing across all core crates, 0 failures
Exposes GraphStore::insert_batch() via REST API for efficient bulk data loading. Uses sled::Batch for atomic writes when using the Sled backend.
- POST /api/v1/triples/batch with JSON body {"triples": [...]}
- Returns 201 with inserted IDs, total count, and duplicate count
- Validates all inputs before writing (empty subject/predicate → 400)
- Namespace scoping enforced per triple
- Duplicates silently skipped (reported in response)
- Audit log records batch_create with insert/duplicate counts
- Events broadcast for each new triple
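The validate-before-write step above can be sketched as a pure function over the request body; `TripleInput` and the (status, message) error tuple are simplified stand-ins for the handler's real types:

```rust
// All triples are checked up front, so a bad entry rejects the whole
// request with 400 before anything touches the store.
struct TripleInput {
    subject: String,
    predicate: String,
    object: String,
}

fn validate_batch(triples: &[TripleInput]) -> Result<(), (u16, String)> {
    for (i, t) in triples.iter().enumerate() {
        if t.subject.trim().is_empty() || t.predicate.trim().is_empty() {
            return Err((400, format!("triple {i}: empty subject or predicate")));
        }
    }
    Ok(())
}
```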
GET /api/v1/proofs/:id/verify returned 422 when the stored proof_data didn't match the expected ZkProof structure (e.g. user submitted arbitrary JSON without the required commitment/challenge/response fields). Now:

- Malformed proof data → 200 with valid:false + error details
- Proof not found → 404
- Valid proof → 200 with valid:true

This matches the semantic contract: verification tells you whether a proof is valid; it shouldn't fail with a server error just because the proof data is garbage.
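The status mapping above, as a small sketch (the enum and names are illustrative, not the endpoint's actual types):

```rust
// Only "proof not found" is an HTTP-level failure; malformed proof data
// is a negative verification result, not a server error.
enum VerifyOutcome {
    Valid,
    Invalid(String), // includes malformed proof_data, with details
    NotFound,
}

fn status_for(outcome: &VerifyOutcome) -> u16 {
    match outcome {
        VerifyOutcome::Valid => 200,        // body: valid:true
        VerifyOutcome::Invalid(_) => 200,   // body: valid:false + error
        VerifyOutcome::NotFound => 404,
    }
}
```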
Summary
Complete data integrity audit and hardening across the entire AIngle pipeline. Fixes 18 bugs found during exhaustive input/output testing of all subsystems.
Data integrity (7 fixes)
- Persistent ProofStore: `ProofBackend` trait with Sled implementation
- Cluster snapshots: `ProofSnapshotProvider` trait + checksum coverage
- Periodic auto-flush task (`--flush-interval`)
- Audit log: `writeln!` without `sync_all()`, errors silently ignored with `let _`
- Atomic batch insert: `insert_batch()` did sequential puts; partial failure = inconsistency. Now uses `sled::Batch`
- P2P DAG action sync via tip exchange
- Sled lock contention: ProofStore and GraphDB shared a sled path, causing `WouldBlock`. Fixed with separate `proofs.sled` directory

fsync hardening (5 fixes)
- DAG signing key: `let _ = write_all` replaced with error handling + fsync
- Kaneru agent state + LearningEngine saves: fsync after `write_all`
- Ineru memory snapshot: `std::fs::write` (never fsyncs) replaced with `File::create` + `sync_all`
- P2P node identity key: fsynced on both Unix and non-Unix
- Peer store JSON: fsynced after write

Panic elimination (5 fixes)
- WAL writer: `lock().unwrap()` → `lock().map_err()` (4 sites)
- Rule engine: poisoned lock recovery via `unwrap_or_else(|p| p.into_inner())` (9 sites)
- Kaneru agent: `unwrap()` on Option → graceful early return with `log::warn`
- P2P REST endpoints: `to_value().unwrap()` → match with HTTP 500
- ProofStore init: `blocking_write` removed entirely

New features
- `POST /api/v1/triples/batch` — atomic bulk triple insert endpoint with duplicate detection
- `--flush-interval <SECS>` CLI flag for configurable periodic flush (0 = disabled)

Bug fixes
- `GET /api/v1/proofs/:id/verify` returned 422 on malformed proof data — now returns 200 with `valid: false` + error details

Testing

- 8 new cross-subsystem data integrity tests
Test plan
- `cargo check --workspace` — clean compilation
- `cargo test -p aingle_graph --features dag` — 244 passed
- `cargo test -p aingle_cortex --lib` — 153 passed
- `cargo test -p aingle_cortex --test data_integrity_test` — 8 passed
- `cargo test -p aingle_cortex --test proof_system_test` — 14 passed
- `cargo test -p aingle_cortex --test rate_limiting_test` — 16 passed
- `cargo test -p aingle_raft` — 33 passed
- `cargo test -p aingle_wal` — 20 passed
- `cargo test -p aingle_logic` — 33 passed
- `cargo test -p ineru` — 57 passed
- `cargo test -p kaneru` — 444 passed
- `aingle-cortex --version` → v0.6.3

🤖 Generated with Claude Code