
fix(recovery): SSTable auto-repair, chaos tests, WAL decoupling, group commit config#381

Merged
ElioNeto merged 2 commits into main from fix/final-batch-complex on May 26, 2026

@ElioNeto
Owner

Summary

Four complex issues resolved, plus an analysis of a fifth (MEM-MEMTABLE-002):

CRASH-SST-001 (#357) — SSTable Auto-Repair

  • discover_sstables_from_disk validates each SSTable on startup
  • Corrupted/truncated SSTables are moved to quarantine/ subdirectory
  • Data recovered via WAL replay (runs before SSTable discovery)
  • Added quarantined_count and recovered_count to engine stats
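
The startup scan above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `scan_and_quarantine`, `ScanStats`, the `.sst` extension, and the `FOOTER_MAGIC` check are all hypothetical stand-ins, since the real SSTable format and validation logic are not shown here.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical footer magic; the real on-disk format is not shown in this PR.
const FOOTER_MAGIC: &[u8; 8] = b"SSTMAGIC";

/// Result of one startup scan (hypothetical type).
#[derive(Debug, Default, PartialEq)]
pub struct ScanStats {
    pub valid: Vec<PathBuf>,
    pub quarantined_count: usize,
}

/// Stand-in validity check: the file must end with the footer magic.
fn is_valid_sstable(path: &Path) -> io::Result<bool> {
    let bytes = fs::read(path)?;
    Ok(bytes.ends_with(FOOTER_MAGIC))
}

/// Validates every `*.sst` file in `dir` and moves invalid ones to `dir/quarantine/`.
pub fn scan_and_quarantine(dir: &Path) -> io::Result<ScanStats> {
    let quarantine = dir.join("quarantine");
    let mut stats = ScanStats::default();

    // Collect candidates first so files are not renamed while iterating the directory.
    let mut candidates = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_sst = path.extension().and_then(|e| e.to_str()) == Some("sst");
        if path.is_file() && is_sst {
            candidates.push(path);
        }
    }
    candidates.sort();

    for path in candidates {
        if is_valid_sstable(&path)? {
            stats.valid.push(path);
        } else if let Some(name) = path.file_name() {
            // Move the file aside instead of deleting it, so it can be inspected later.
            fs::create_dir_all(&quarantine)?;
            fs::rename(&path, quarantine.join(name))?;
            stats.quarantined_count += 1;
        }
    }
    Ok(stats)
}
```

Moving the file rather than deleting it keeps the corrupted bytes available for debugging, while the engine continues with only the validated tables.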

TEST-003 (#373) — Chaos Testing

  • 6 chaos tests for I/O fault tolerance and corruption handling
  • Tests: deleted SSTable, compact with missing SSTable, restart with missing SSTable, corrupted data blocks, corrupted bloom filter, missing WAL
  • All tests verify engine survives without panic
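
For illustration, the "corrupted data blocks" case could be exercised along these lines. This is a sketch under stated assumptions: `encode_block`, `decode_block`, `BlockError`, and the FNV-1a checksum trailer are hypothetical and are not the engine's actual block format or test helpers.

```rust
use std::panic;

#[derive(Debug, PartialEq)]
pub enum BlockError {
    TooShort,
    ChecksumMismatch,
}

/// 32-bit FNV-1a, used here only as a simple stand-in checksum.
fn checksum(data: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5;
    for &byte in data {
        hash ^= byte as u32;
        hash = hash.wrapping_mul(0x0100_0193);
    }
    hash
}

/// Appends a little-endian checksum trailer to the payload.
pub fn encode_block(payload: &[u8]) -> Vec<u8> {
    let mut block = payload.to_vec();
    block.extend_from_slice(&checksum(payload).to_le_bytes());
    block
}

/// Returns the payload, or an error if the block is truncated or corrupted.
pub fn decode_block(block: &[u8]) -> Result<&[u8], BlockError> {
    if block.len() < 4 {
        return Err(BlockError::TooShort);
    }
    let (payload, tail) = block.split_at(block.len() - 4);
    let stored = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    if stored != checksum(payload) {
        return Err(BlockError::ChecksumMismatch);
    }
    Ok(payload)
}

/// Chaos check: corrupt each byte in turn and require an error, never a panic.
pub fn survives_single_byte_corruption(payload: &[u8]) -> bool {
    let clean = encode_block(payload);
    (0..clean.len()).all(|i| {
        let mut bad = clean.clone();
        bad[i] ^= 0xFF;
        let outcome = panic::catch_unwind(|| decode_block(&bad).is_err());
        matches!(outcome, Ok(true))
    })
}
```

The point of the pattern is that corruption surfaces as a `Result::Err` the caller can handle, which is what "survives without panic" requires of the read path.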

WRITE-SERIAL-001 (#362) — Decoupled WAL I/O

  • Writes now use two-phase approach: brief lock to clone Arc handle, then WAL I/O performed OUTSIDE the core lock
  • Re-acquire lock briefly for memtable insert
  • Concurrent writers can perform WAL I/O in parallel
  • Crash-safe: WAL-before-memtable order preserved; a crash between the WAL write and the memtable insert recovers via WAL replay
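
The locking pattern described above might look roughly like this. It is a simplified sketch, not the engine's code: `Engine`, `Core`, and the in-memory `WriteAheadLog` are stand-ins, and only the names `write_record` and `Arc<WriteAheadLog>` come from the commit message.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// Stand-in WAL that synchronizes internally with its own Mutex.
pub struct WriteAheadLog {
    records: Mutex<Vec<(Vec<u8>, Vec<u8>)>>,
}

impl WriteAheadLog {
    pub fn new() -> Self {
        Self { records: Mutex::new(Vec::new()) }
    }

    /// In a real WAL this would append to a file; here it appends to memory.
    pub fn write_record(&self, key: &[u8], value: &[u8]) {
        self.records.lock().unwrap().push((key.to_vec(), value.to_vec()));
    }

    pub fn record_count(&self) -> usize {
        self.records.lock().unwrap().len()
    }
}

/// State guarded by the core lock (hypothetical layout).
struct Core {
    wal: Arc<WriteAheadLog>,
    memtable: BTreeMap<Vec<u8>, Vec<u8>>,
}

pub struct Engine {
    core: Mutex<Core>,
}

impl Engine {
    pub fn new() -> Self {
        let core = Core { wal: Arc::new(WriteAheadLog::new()), memtable: BTreeMap::new() };
        Self { core: Mutex::new(core) }
    }

    pub fn put(&self, key: &[u8], value: &[u8]) {
        // Step 1: hold the core lock only long enough to clone the WAL handle.
        let wal = {
            let core = self.core.lock().unwrap();
            Arc::clone(&core.wal)
        };
        // Step 2: WAL I/O happens with the core lock released.
        wal.write_record(key, value);
        // Step 3: re-acquire the core lock briefly for the memtable insert.
        self.core.lock().unwrap().memtable.insert(key.to_vec(), value.to_vec());
    }

    pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.core.lock().unwrap().memtable.get(key).cloned()
    }

    pub fn wal_record_count(&self) -> usize {
        self.core.lock().unwrap().wal.record_count()
    }
}
```

In this sketch, two concurrent writers to the same key could apply to the memtable in a different order than their WAL records were written; the PR text does not say how the engine handles that case.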

WRITE-WAL-001 (#363) — WAL sync_interval config

  • Added wal_sync_interval to EngineOptions and WalConfig
  • Wired set_sync_interval() through engine initialization
  • Default remains 4 (existing behavior)
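
As a rough illustration of how such a setting could drive group commit, the sketch below assumes the interval counts appended records between fsyncs; the PR does not state the unit. `SyncCounter` and `on_append` are hypothetical names, and only `wal_sync_interval`, `WalConfig`, `set_sync_interval`, and the default of 4 come from this PR.

```rust
/// Simplified stand-in for the WAL configuration.
#[derive(Debug, Clone, PartialEq)]
pub struct WalConfig {
    pub wal_sync_interval: usize,
}

impl Default for WalConfig {
    fn default() -> Self {
        // The PR states that the default remains 4.
        Self { wal_sync_interval: 4 }
    }
}

/// Tracks appended records and decides when the caller should fsync (hypothetical helper).
pub struct SyncCounter {
    interval: usize,
    pending: usize,
    pub syncs: usize,
}

impl SyncCounter {
    pub fn new(config: &WalConfig) -> Self {
        // Clamp to 1 so an interval of 0 cannot disable syncing in this sketch.
        Self { interval: config.wal_sync_interval.max(1), pending: 0, syncs: 0 }
    }

    pub fn set_sync_interval(&mut self, interval: usize) {
        self.interval = interval.max(1);
    }

    /// Records one append; returns true when an fsync is due.
    pub fn on_append(&mut self) -> bool {
        self.pending += 1;
        if self.pending >= self.interval {
            self.pending = 0;
            self.syncs += 1;
            true
        } else {
            false
        }
    }
}
```

Under this assumption, a larger interval batches more records per fsync and trades durability of the most recent writes for throughput, while an interval of 1 syncs on every append.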

MEM-MEMTABLE-002 (#367) — Analysis

Closes #357
Closes #373
Closes #362
Closes #363

ElioNeto added 2 commits May 26, 2026 14:25
CRASH-SST-001 (#357):
- discover_sstables_from_disk now validates each SSTable on startup
- Corrupted/truncated SSTables are moved to quarantine/ subdirectory
- Data recovered via WAL replay (runs before SSTable discovery)
- Added quarantined_count and recovered_count to stats

TEST-003 (#373):
- 6 chaos tests for I/O fault tolerance and corruption handling
- Tests: deleted SSTable, compact with missing SSTable, restart with
  missing SSTable, corrupted data blocks, corrupted bloom filter, missing WAL
- All tests verify engine survives without panic or crash

Closes #357
Closes #373

WRITE-SERIAL-001 (#362):
- Writes (put, delete, delete_range) now use a two-phase approach:
  1. Brief lock to clone Arc<WriteAheadLog> handle
  2. WAL I/O performed OUTSIDE the core lock (write_record is internally
     synchronized via WAL's own Mutex)
  3. Re-acquire lock briefly for memtable insert
- This allows concurrent writers to perform WAL I/O in parallel instead
  of serializing all I/O behind the core lock
- Crash-safe: WAL-before-memtable order preserved; crash between WAL
  write and memtable insert recovers via WAL replay

WRITE-WAL-001 (#363):
- Added wal_sync_interval to EngineOptions and WalConfig
- Wired set_sync_interval() through engine initialization so the
  WAL fsync interval is configurable via LsmConfig
- Default remains 4 (existing behavior)

Closes #362
Closes #363
@ElioNeto ElioNeto merged commit 53eeea3 into main May 26, 2026
14 checks passed