
To improve build time, we are considering replacing the hashdb LMDB data store with RocksDB.

RocksDB (rocksdb.org) has particular provisions and constraints that are significant to the design of dist_hash:

  • RocksDB is optimized for fast storage on one node where the size of the data exceeds the size of memory.
  • RocksDB maintains DB integrity during writes across threads using its own locks.
  • RocksDB does not maintain DB integrity during writes across processes or machines. Different processes or machines must each use their own DB; merging them into one DB must be done by a single process.
  • RocksDB achieves fast storage via its merge (compaction) architecture, which requires background threads. Write amplification can occur if compaction cannot keep up; status checks are available to observe this.
  • RocksDB uses "column families" to store multiple tables in one database. Although each table is managed in its own files, write integrity is preserved across the whole database. Column families add some design complexity and processing burden.
  • Despite this, dist_hash uses column families to wrap its tables under one DB (see the sketch after this list).
  • Unlike LMDB, RocksDB keys and values can be megabytes long.
  • Unlike LMDB, RocksDB does not allow multimaps (duplicate keys). To keep keys unique, information that would otherwise go in the data field must be appended to the key. RocksDB provides range scans, which can find all entries sharing a key prefix. The hash source store uses this approach (see the sketch under Hash Source Store below).
  • RocksDB architecture employs multiple optimization techniques including levels and Bloom filters.
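To illustrate the column-family provision, here is a minimal sketch of opening one DB with a column family per hashdb table. The family names and DB path are illustrative assumptions, not dist_hash's actual names; the calls are the stock RocksDB C++ API.

```cpp
#include <vector>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.create_missing_column_families = true;

  // One column family per table; names here are illustrative.
  std::vector<rocksdb::ColumnFamilyDescriptor> families = {
      {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
      {"source_id_store", rocksdb::ColumnFamilyOptions()},
      {"source_name_store", rocksdb::ColumnFamilyOptions()},
      {"source_data_store", rocksdb::ColumnFamilyOptions()},
      {"hash_data_store", rocksdb::ColumnFamilyOptions()},
      {"hash_source_store", rocksdb::ColumnFamilyOptions()}};

  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  rocksdb::DB* db = nullptr;
  rocksdb::Status status = rocksdb::DB::Open(
      options, "/tmp/dist_hash_db", families, &handles, &db);
  if (!status.ok()) return 1;

  // Each family lands in its own files, but a rocksdb::WriteBatch that
  // spans families is still committed atomically, which is what preserves
  // write integrity across the database.
  for (auto* handle : handles) db->DestroyColumnFamilyHandle(handle);
  delete db;
  return 0;
}
```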

Tables

hashdb stores source and hash information in several tables:

Source ID Store

Binds source hash values to source ID values.

  • key = file hash
  • value = source ID, starting at 1.

A special key, 0x00, is used to store the current highest source ID value. When a new source ID is requested, it will be this value plus one.

Interfaces:

  • bool insert(file hash, &changes, &source_id) acquires a source ID for the file hash. Returns true and a new source ID, or false and the existing source ID.
  • bool find(file hash, &source_id) returns the source ID and true if present, else 0 and false.
  • it iterator() returns a new source ID iterator. Delete it when done.
  • string next(it) returns the next source ID for the given iterator, else "".
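A minimal sketch of insert against a raw RocksDB handle, assuming fixed-width little-endian encoding helpers defined here for illustration. The &changes tracking parameter is omitted, and the read-increment-write on the counter key would need external locking, since RocksDB's own locks cover only individual writes:

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include "rocksdb/db.h"

// Illustrative fixed-width little-endian encoding of a uint64_t.
static std::string encode_uint64(uint64_t v) {
  return std::string(reinterpret_cast<const char*>(&v), sizeof(v));
}
static uint64_t decode_uint64(const std::string& s) {
  uint64_t v = 0;
  std::memcpy(&v, s.data(), sizeof(v));
  return v;
}

// Returns true with a new source ID, or false with the existing one.
// The special key 0x00 holds the current highest source ID.
bool insert(rocksdb::DB* db, const std::string& file_hash,
            uint64_t& source_id) {
  static const std::string counter_key("\x00", 1);
  std::string value;
  if (db->Get(rocksdb::ReadOptions(), file_hash, &value).ok()) {
    source_id = decode_uint64(value);  // already bound
    return false;
  }
  uint64_t highest = 0;
  if (db->Get(rocksdb::ReadOptions(), counter_key, &value).ok()) {
    highest = decode_uint64(value);
  }
  source_id = highest + 1;  // new source IDs start at 1
  db->Put(rocksdb::WriteOptions(), counter_key, encode_uint64(source_id));
  db->Put(rocksdb::WriteOptions(), file_hash, encode_uint64(source_id));
  return true;
}
```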

Source Name Store

Stores all (repository name, filename) tuples associated with each source ID.

  • key = source_id + repository name + filename
  • value = ""

Interfaces:

  • void insert(source_id, repository name, filename) adds the record, updating status if it is already there.
  • bool find(source_id, max_count, &names[]) gets up to max_count (repository name, filename) pairs; returns false if the source_id is not present.
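A sketch of one possible key encoding, assuming a fixed-width big-endian source ID prefix (so keys sort by source ID) and a length-prefixed repository name so the key can be parsed back apart. These encoding choices are illustrative assumptions, not the documented format:

```cpp
#include <cstdint>
#include <string>

// Hypothetical encoding: 8-byte big-endian source ID (sorts numerically),
// then a 4-byte big-endian repository-name length, then the two strings.
static std::string make_name_key(uint64_t source_id,
                                 const std::string& repository_name,
                                 const std::string& filename) {
  std::string key;
  for (int shift = 56; shift >= 0; shift -= 8) {
    key.push_back(static_cast<char>((source_id >> shift) & 0xff));
  }
  uint32_t len = static_cast<uint32_t>(repository_name.size());
  for (int shift = 24; shift >= 0; shift -= 8) {
    key.push_back(static_cast<char>((len >> shift) & 0xff));
  }
  key += repository_name;
  key += filename;
  return key;  // stored with value "" per the table definition
}
```

Because the value is empty, all information lives in the key; find can then range-scan over keys sharing the 8-byte source ID prefix.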

Source Data Store

The source data store maps the source file hash to the source metadata.

  • key = the source file hash
  • value = source metadata, multiple fields:
    • file_hash The full file hash.
    • size The file size.
    • type A label indicating the type of file, currently not used.
    • count The count of block hashes for the source; this will be greater than the number actually stored if the source contains all-zero blocks.
    • zero_count The number of block hashes skipped because the block was all zeros.
    • non_probative_count The number of blocks marked as non-probative.

Interfaces:

  • add(file hash, <source metadata>) Adds source metadata for the file hash. Merge is used so that an unexpected change request can be logged.
  • data = get(file hash) Gets the source metadata for the file hash.
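A sketch of one possible flat encoding for the metadata value. The struct layout and the numeric type field are assumptions made for illustration, not the actual hashdb encoding:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical flat value layout; field order follows the list above.
// The file type label is reduced to a numeric code here for simplicity.
struct source_metadata_t {
  uint64_t size;                 // file size in bytes
  uint64_t type;                 // file type code, currently not used
  uint64_t count;                // block hashes for the source
  uint64_t zero_count;           // block hashes skipped as all-zero
  uint64_t non_probative_count;  // blocks marked non-probative
};

// The full file hash leads the value, matching the field list above,
// even though it duplicates the key.
static std::string encode_metadata(const std::string& file_hash,
                                   const source_metadata_t& m) {
  std::string value(file_hash);
  value.append(reinterpret_cast<const char*>(&m), sizeof(m));
  return value;
}
```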

Hash Data Store

Stores entropy, block label, and count metadata associated with a hash.

  • key = block hash
  • value = entropy + block label + count

Interfaces:

  • merge(block hash, (entropy, label, count)) Adds the record, incrementing the count if it is already there.
  • (entropy, label, count) = get(block hash) Get metadata about the block hash, count=0 if not present.
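A sketch of how merge could be implemented with a RocksDB associative merge operator. The 8-byte count encoding and the operator name are assumptions, and the entropy and label fields are omitted for brevity:

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include "rocksdb/merge_operator.h"

// Adds the operand's count to the existing count.  A full implementation
// would also carry the entropy and block label alongside the count.
class CountMergeOperator : public rocksdb::AssociativeMergeOperator {
 public:
  bool Merge(const rocksdb::Slice& /*key*/,
             const rocksdb::Slice* existing_value,
             const rocksdb::Slice& value,
             std::string* new_value,
             rocksdb::Logger* /*logger*/) const override {
    uint64_t count = 0;
    if (existing_value != nullptr) {
      std::memcpy(&count, existing_value->data(), sizeof(count));
    }
    uint64_t delta = 0;
    std::memcpy(&delta, value.data(), sizeof(delta));
    count += delta;
    new_value->assign(reinterpret_cast<const char*>(&count), sizeof(count));
    return true;
  }
  const char* Name() const override { return "CountMergeOperator"; }
};
```

The operator would be registered through options.merge_operator before opening the DB, after which db->Merge(write_options, block_hash, encoded_delta) performs the increment without a read-modify-write cycle in the caller.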

Hash Source Store

Stores the sub_count contributed by each source ID for a hash.

  • key = block hash + source ID
  • value = sub_count

Interfaces:

  • merge(block hash, source_id, sub_count) Adds the record, incrementing sub_count if it is already there.
  • [(source_id, sub_count)] = get(block hash, max_count) Gets up to max_count (source_id, sub_count) tuples (see the sketch below).
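A sketch of get using a range scan over the block-hash prefix, assuming a fixed-length block hash (so the prefix boundary is unambiguous) and an 8-byte source ID suffix; these layout details are assumptions for illustration:

```cpp
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>
#include <utility>
#include <vector>
#include "rocksdb/db.h"

// Scans keys of the form block hash + source ID and returns up to
// max_count (source_id, sub_count) tuples for the given block hash.
static std::vector<std::pair<uint64_t, uint64_t>> get(
    rocksdb::DB* db, const std::string& block_hash, size_t max_count) {
  std::vector<std::pair<uint64_t, uint64_t>> tuples;
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  for (it->Seek(block_hash);
       it->Valid() && it->key().starts_with(block_hash) &&
           tuples.size() < max_count;
       it->Next()) {
    // The source ID is the key suffix after the fixed-length block hash.
    uint64_t source_id = 0;
    std::memcpy(&source_id, it->key().data() + block_hash.size(),
                sizeof(source_id));
    uint64_t sub_count = 0;
    std::memcpy(&sub_count, it->value().data(), sizeof(sub_count));
    tuples.emplace_back(source_id, sub_count);
  }
  return tuples;
}
```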