
To improve build time, we are considering replacing the hashdb LMDB data store with RocksDB.

RocksDB (rocksdb.org) has particular provisions and constraints that are significant to the design of dist_hash:

  • RocksDB is optimized for fast storage on one node where the size of the data exceeds the size of memory.
  • RocksDB maintains DB integrity during writes across threads using its own locks.
  • RocksDB does not maintain DB integrity during writes across processes or machines. Different processes or machines must each use their own DB; merging them into one DB must be done by a single process.
  • RocksDB achieves fast storage via its merge (compaction) architecture, which requires background threads. Write amplification can occur if compaction cannot keep up; status checks are available to observe this.
  • RocksDB uses "column families" to store multiple tables in one database. Although each table is managed in its own files, write integrity is preserved across the whole database. Column families add some design complexity and processing burden.
  • Despite this, dist_hash uses column families to wrap its tables under one DB (see the sketch after this list).
  • Unlike LMDB, RocksDB keys and values can be megabytes long.
  • Unlike LMDB, RocksDB does not allow multimaps (duplicate keys). To keep keys unique, information that would otherwise go in the data field must be appended to the key. RocksDB provides range scans, which can find all entries sharing a key prefix. The hash source store uses this approach (see the sketch under Hash Source Store below).
  • RocksDB architecture employs multiple optimization techniques including levels and Bloom filters.
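To illustrate the column-family provision, here is a minimal sketch of opening one DB with a column family per hashdb table. The family names and DB path are illustrative assumptions, not dist_hash's actual names; the calls are the stock RocksDB C++ API.

```cpp
#include <vector>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.create_missing_column_families = true;

  // One column family per table; names here are illustrative.
  std::vector<rocksdb::ColumnFamilyDescriptor> families = {
      {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
      {"source_id_store", rocksdb::ColumnFamilyOptions()},
      {"source_name_store", rocksdb::ColumnFamilyOptions()},
      {"source_data_store", rocksdb::ColumnFamilyOptions()},
      {"hash_data_store", rocksdb::ColumnFamilyOptions()},
      {"hash_source_store", rocksdb::ColumnFamilyOptions()}};

  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  rocksdb::DB* db = nullptr;
  rocksdb::Status status = rocksdb::DB::Open(
      options, "/tmp/dist_hash_db", families, &handles, &db);
  if (!status.ok()) return 1;

  // Each family lands in its own files, but a rocksdb::WriteBatch that
  // spans families is still committed atomically, which is what preserves
  // write integrity across the database.
  for (auto* handle : handles) db->DestroyColumnFamilyHandle(handle);
  delete db;
  return 0;
}
```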

Tables

hashdb stores source and hash information in several tables:

Source ID Store

Binds source hash values to source ID values.

  • key = file hash
  • value = source ID, starting at 1.

A special key, 0x00, is used to store the current highest source ID value. When a new source ID is requested, it will be this value plus one.

Interfaces:

  • bool insert(file hash, &changes, &source_id) acquires a source ID for the file hash. Returns true and a new source ID, or false and the existing source ID.
  • bool find(file hash, &source_id) returns the source ID and true if present, else 0 and false.
  • it iterator() returns a new source ID iterator. Delete it when done.
  • string next(it) returns the next source ID for the given iterator, else "".
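A minimal sketch of insert against a raw RocksDB handle, assuming fixed-width little-endian encoding helpers defined here for illustration. The &changes tracking parameter is omitted, and the read-increment-write on the counter key would need external locking, since RocksDB's own locks cover only individual writes:

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include "rocksdb/db.h"

// Illustrative fixed-width little-endian encoding of a uint64_t.
static std::string encode_uint64(uint64_t v) {
  return std::string(reinterpret_cast<const char*>(&v), sizeof(v));
}
static uint64_t decode_uint64(const std::string& s) {
  uint64_t v = 0;
  std::memcpy(&v, s.data(), sizeof(v));
  return v;
}

// Returns true with a new source ID, or false with the existing one.
// The special key 0x00 holds the current highest source ID.
bool insert(rocksdb::DB* db, const std::string& file_hash,
            uint64_t& source_id) {
  static const std::string counter_key("\x00", 1);
  std::string value;
  if (db->Get(rocksdb::ReadOptions(), file_hash, &value).ok()) {
    source_id = decode_uint64(value);  // already bound
    return false;
  }
  uint64_t highest = 0;
  if (db->Get(rocksdb::ReadOptions(), counter_key, &value).ok()) {
    highest = decode_uint64(value);
  }
  source_id = highest + 1;  // new source IDs start at 1
  db->Put(rocksdb::WriteOptions(), counter_key, encode_uint64(source_id));
  db->Put(rocksdb::WriteOptions(), file_hash, encode_uint64(source_id));
  return true;
}
```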

Source Name Store

Stores all (repository name, filename) tuples associated with each source ID.

  • key = source_id + repository name + filename
  • value = ""

Interfaces:

  • void insert(source_id, repository name, filename) adds the record, updating status if it is already there.
  • bool find(source_id, max_count, &names[]) gets up to max_count (repository name, filename) pairs; returns false if the source_id is not present.
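A sketch of one possible key encoding, assuming a fixed-width big-endian source ID prefix (so keys sort by source ID) and a length-prefixed repository name so the key can be parsed back apart. These encoding choices are illustrative assumptions, not the documented format:

```cpp
#include <cstdint>
#include <string>

// Hypothetical encoding: 8-byte big-endian source ID (sorts numerically),
// then a 4-byte big-endian repository-name length, then the two strings.
static std::string make_name_key(uint64_t source_id,
                                 const std::string& repository_name,
                                 const std::string& filename) {
  std::string key;
  for (int shift = 56; shift >= 0; shift -= 8) {
    key.push_back(static_cast<char>((source_id >> shift) & 0xff));
  }
  uint32_t len = static_cast<uint32_t>(repository_name.size());
  for (int shift = 24; shift >= 0; shift -= 8) {
    key.push_back(static_cast<char>((len >> shift) & 0xff));
  }
  key += repository_name;
  key += filename;
  return key;  // stored with value "" per the table definition
}
```

Because the value is empty, all information lives in the key; find can then range-scan over keys sharing the 8-byte source ID prefix.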

Source Data Store

The source data store maps the source file hash to the source metadata.

  • key = the source file hash
  • value = source metadata, multiple fields:
    • file_hash The full file hash.
    • size The file size.
    • type A label indicating the type of file, currently not used.
    • count The count of block hashes for the source; this will be greater than the number actually stored if the source contains all-zero blocks.
    • zero_count The number of block hashes skipped because the block was all zeros.
    • non_probative_count The number of blocks marked as non-probative.

Interfaces:

  • add(file hash, <source metadata>) Adds source metadata for the file hash. Merge is used so that an unexpected change request can be logged.
  • data = get(file hash) Gets the source metadata for the file hash.
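A sketch of one possible flat encoding for the metadata value. The struct layout and the numeric type field are assumptions made for illustration, not the actual hashdb encoding:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical flat value layout; field order follows the list above.
// The file type label is reduced to a numeric code here for simplicity.
struct source_metadata_t {
  uint64_t size;                 // file size in bytes
  uint64_t type;                 // file type code, currently not used
  uint64_t count;                // block hashes for the source
  uint64_t zero_count;           // block hashes skipped as all-zero
  uint64_t non_probative_count;  // blocks marked non-probative
};

// The full file hash leads the value, matching the field list above,
// even though it duplicates the key.
static std::string encode_metadata(const std::string& file_hash,
                                   const source_metadata_t& m) {
  std::string value(file_hash);
  value.append(reinterpret_cast<const char*>(&m), sizeof(m));
  return value;
}
```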

Hash Data Store

Stores entropy, block label, and count metadata associated with a hash.

  • key = block hash
  • value = entropy + block label + count

Interfaces:

  • merge(block hash, (entropy, label, count)) Adds the record, incrementing the count if it is already there.
  • (entropy, label, count) = get(block hash) Get metadata about the block hash, count=0 if not present.
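A sketch of how merge could be implemented with a RocksDB associative merge operator. The 8-byte count encoding and the operator name are assumptions, and the entropy and label fields are omitted for brevity:

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include "rocksdb/merge_operator.h"

// Adds the operand's count to the existing count.  A full implementation
// would also carry the entropy and block label alongside the count.
class CountMergeOperator : public rocksdb::AssociativeMergeOperator {
 public:
  bool Merge(const rocksdb::Slice& /*key*/,
             const rocksdb::Slice* existing_value,
             const rocksdb::Slice& value,
             std::string* new_value,
             rocksdb::Logger* /*logger*/) const override {
    uint64_t count = 0;
    if (existing_value != nullptr) {
      std::memcpy(&count, existing_value->data(), sizeof(count));
    }
    uint64_t delta = 0;
    std::memcpy(&delta, value.data(), sizeof(delta));
    count += delta;
    new_value->assign(reinterpret_cast<const char*>(&count), sizeof(count));
    return true;
  }
  const char* Name() const override { return "CountMergeOperator"; }
};
```

The operator would be registered through options.merge_operator before opening the DB, after which db->Merge(write_options, block_hash, encoded_delta) performs the increment without a read-modify-write cycle in the caller.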

Hash Source Store

Stores the sub_count contributed by each source ID for a hash.

  • key = block hash + source ID
  • value = sub_count

Interfaces:

  • merge(block hash, source_id, sub_count) Adds the record, incrementing sub_count if it is already there.
  • [(source_id, sub_count)] = get(block hash, max_count) Gets up to max_count (source_id, sub_count) tuples (see the sketch below).
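A sketch of get using a range scan over the block-hash prefix, assuming a fixed-length block hash (so the prefix boundary is unambiguous) and an 8-byte source ID suffix; these layout details are assumptions for illustration:

```cpp
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>
#include <utility>
#include <vector>
#include "rocksdb/db.h"

// Scans keys of the form block hash + source ID and returns up to
// max_count (source_id, sub_count) tuples for the given block hash.
static std::vector<std::pair<uint64_t, uint64_t>> get(
    rocksdb::DB* db, const std::string& block_hash, size_t max_count) {
  std::vector<std::pair<uint64_t, uint64_t>> tuples;
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  for (it->Seek(block_hash);
       it->Valid() && it->key().starts_with(block_hash) &&
           tuples.size() < max_count;
       it->Next()) {
    // The source ID is the key suffix after the fixed-length block hash.
    uint64_t source_id = 0;
    std::memcpy(&source_id, it->key().data() + block_hash.size(),
                sizeof(source_id));
    uint64_t sub_count = 0;
    std::memcpy(&sub_count, it->value().data(), sizeof(sub_count));
    tuples.emplace_back(source_id, sub_count);
  }
  return tuples;
}
```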