A production-pattern storage engine in C++17 on Linux, implementing the core internals of systems like LevelDB, Bitcask, and NetApp ONTAP's WAFL: append-only writes, write-ahead logging, crash recovery, LRU caching, and compaction.
StoreLite is a log-structured key-value storage engine inspired by LSM-tree based systems such as LevelDB and RocksDB.
| Mode | Write | Read | Compaction Savings |
|---|---|---|---|
| No-sync / No-WAL (max throughput) | 407K ops/s / 58.7 MB/s | 308K ops/s | 76.3% |
| WAL disabled, fsync enabled | 255 ops/s | 445K ops/s | 2.2% |
| Full durable (sync + WAL) | 130 ops/s | 397K ops/s | 55.1% |
The 3,100x write throughput gap between durable and no-sync mode is not a bug. It is the fundamental storage engineering tradeoff —
fdatasyncwaits for disk confirmation on every write. This is why real systems use WAL group commit, write coalescing, and battery-backed NVRAM.
Put("k", "v")
│
▼
┌─────────┐ O_SYNC ┌───────────┐
│ WAL │ ────────► │ wal.log │ ← durable before segment write
└────┬────┘ └───────────┘
│
▼
┌──────────────┐ pwrite ┌──────────────────┐
│ Active Seg │ ───────► │ N.seg (disk) │ ← append-only
│ (write-only) │ └──────────────────┘
└──────┬───────┘
│ rotate when full (default 64MB)
▼
┌──────────────┐
│ Immutable │ ──► Compactor ──► merged segment
│ Segments │ (background) stale data reclaimed
└──────────────┘
│
▼
┌──────────────┐
│ Hash Index │ key → { seg_id, offset, val_len, timestamp_us }
│ (in-memory) │
└──────┬───────┘
│ cache-first on reads
▼
┌──────────────┐
│ LRU Cache │ O(1) get/put — doubly-linked list + hashmap
└──────────────┘
┌──────────┬────────────────┬──────────┬──────────┬──────────────────────┐
│ crc32: 4 │ timestamp_us:8 │ key_len:4│ val_len:4│ key │ value │
└──────────┴────────────────┴──────────┴──────────┴───────┴──────────────┘
◄──────────────── 20-byte header ─────────────────► variable length
val_len == 0xFFFFFFFF → tombstone (deletion marker)
- CRC32 covers all bytes after the CRC field — detects bit rot and partial writes
- Timestamp enables last-write-wins conflict resolution during compaction
- All multi-byte fields are little-endian
| File | Lines | Responsibility |
|---|---|---|
record.cpp |
~130 | CRC32 (table-driven), encode/decode, tombstone encoding |
segment.cpp |
~160 | Append-only file; pwrite/pread/fdatasync; full scan for recovery |
wal.cpp |
~90 | O_SYNC write-ahead log; replay on crash; truncate on clean shutdown |
lru_cache.cpp |
~70 | Fixed-capacity LRU; O(1) via doubly-linked list + unordered_map |
compactor.cpp |
~100 | Scan N segments → emit 1; keep latest timestamp per key; unlink stale files |
engine.cpp |
~300 | Orchestrator; std::shared_mutex for concurrent reads; segment rotation |
main.cpp |
~180 | CLI benchmark (write/read/mixed/delete) + interactive REPL |
| Concept | Where used |
|---|---|
| Log-structured writes | Every write is a sequential pwrite — no random I/O on write path |
| WAL + crash recovery | O_SYNC guarantees durability; WAL replayed on startup; partial records at end ignored |
fdatasync vs fsync |
Data-only flush per write — skips inode metadata update |
pwrite / pread |
Offset-explicit I/O — safe for concurrent readers without seeking |
| Tombstone deletions | Deletions write a sentinel record; compaction purges both tombstone and shadowed data |
| Segment rotation | Active segment sealed at segment_max_bytes; becomes immutable candidate for compaction |
| LRU O(1) eviction | Doubly-linked list (access order) + hashmap (O(1) lookup) — classic interview data structure |
std::shared_mutex |
Multiple concurrent reads; exclusive lock only on writes and segment rotation |
Demonstrates the fdatasync cost: 255 writes/sec vs 407K — 1,600x slowdown for durability
Production-safe mode: WAL flushed before segment write, fdatasync on every record
Requirements: g++ >= 9, make, Linux (kernel ≥ 3.0)
git clone https://github.com/<your-username>/storeLite
cd StoreLite
makemake test# Max throughput (OS-buffered writes)
./storelite --bench --no-sync --no-wal
# fsync per write, no WAL
./storelite --bench --no-wal -n 1000
# Full durability: WAL + fsync (slow — intended)
./storelite --bench
# Custom: 500k ops, 512B values
./storelite --bench --no-sync --no-wal -n 500000 -v 512./storelite --interactive --data-dir /tmp/mydb> put name amit
OK
> put project StoreLite
OK
> get name
amit
> del name
OK
> get name
(nil)
> scan
project = StoreLite
(1 keys)
> stats
keys=1 segs=1 bytes=1024 chits=2 cmiss=1
> compact
Compacted: kept=1 dropped=1
> quit
Workload: 100k unique keys written twice (100k stale entries)
50k keys then deleted (50k tombstones)
Before: 2 segments, ~34.6 MB
After: 1 segment, ~8.2 MB
Savings: 76.3%
Algorithm:
1. Rotate active segment → immutable
2. Scan all immutable segments oldest-first
3. Build map: key → latest {value, timestamp_us}
4. Emit only live (non-tombstone) entries to new segment
5. Atomically update in-memory index
6. unlink() old segment files
| Suite | Count | What is tested |
|---|---|---|
| CRC32 | 4 | Known vector 0xCBF43926, determinism, collision detection |
| Record encode/decode | 6 | Round-trip, tombstone, CRC corruption rejection, short buffer |
| LRU Cache | 8 | Basic get/put, eviction order, invalidation, hit/miss counters |
| Segment | 4 | Append, scan, tombstone records, persist across reopen |
| WAL | 5 | Log/replay, tombstone replay, clear, empty check |
| Engine (integration) | 75 | put/get/del, overwrite, persistence across reopen, scan, post-compaction reads |
Traditional B-tree storage engines (like the ones used in older databases) require random I/O for updates — finding the right page, modifying it, writing it back. Log-structured engines convert all writes to sequential appends, which is dramatically faster on both spinning disks (no seek time) and SSDs (avoids write amplification).
The tradeoff: read amplification and space amplification grow over time as stale data accumulates — which is exactly what compaction solves.
This is the same architecture used in LevelDB, RocksDB, Apache Cassandra, and the core write path of NetApp ONTAP's WAFL (Write Anywhere File Layout).
MIT





