Description
During compaction (execute_compaction() in compaction.rs), tombstones are detected by checking if the value is empty (value.is_empty()), rather than checking the LogRecord::is_deleted flag. This is fragile and can cause data loss.
Location
src/core/engine/compaction.rs line 114–115:
// Skip tombstones (empty values) during compaction
if !value.is_empty() {
// keep this record
}
The problem
- The
LogRecord metadata (including is_deleted) is lost during flush — flush_memtable_impl() converts LogRecord to (Vec<u8>, Vec<u8>), discarding the flag
- Compaction then has no way to distinguish a tombstone from a legitimate empty value
- A key with an intentionally empty value (
"") would be permanently deleted during compaction
Root cause
Table::build() stores data as BTreeMap<Vec<u8>, Vec<u8>> (key → value bytes). The is_deleted, timestamp, and column_family fields are lost. This was a tradeoff when the duplicate MemTable types were unified (#140).
Proposed fix
- Change
Table::build() to store BTreeMap<Vec<u8>, LogRecord> instead of BTreeMap<Vec<u8>, Vec<u8>>
- Update compaction to check
LogRecord::is_deleted instead of value.is_empty()
- Update
flush_memtable_impl() to preserve LogRecord metadata when building tables
- Update
VersionSet::get() and scan() to extract value from LogRecord
Impact
- Low probability but high severity: valid data with empty strings would be silently deleted
- Fix requires propagating LogRecord through the entire Table/VersionSet layer
Severity
High — potential silent data loss for empty-string values.
Labels
Description
During compaction (
execute_compaction()incompaction.rs), tombstones are detected by checking if the value is empty (value.is_empty()), rather than checking theLogRecord::is_deletedflag. This is fragile and can cause data loss.Location
src/core/engine/compaction.rsline 114–115:The problem
LogRecordmetadata (includingis_deleted) is lost during flush —flush_memtable_impl()convertsLogRecordto(Vec<u8>, Vec<u8>), discarding the flag"") would be permanently deleted during compactionRoot cause
Table::build()stores data asBTreeMap<Vec<u8>, Vec<u8>>(key → value bytes). Theis_deleted,timestamp, andcolumn_familyfields are lost. This was a tradeoff when the duplicateMemTabletypes were unified (#140).Proposed fix
Table::build()to storeBTreeMap<Vec<u8>, LogRecord>instead ofBTreeMap<Vec<u8>, Vec<u8>>LogRecord::is_deletedinstead ofvalue.is_empty()flush_memtable_impl()to preserve LogRecord metadata when building tablesVersionSet::get()andscan()to extract value from LogRecordImpact
Severity
High — potential silent data loss for empty-string values.
Labels