mv delayed bloom

Status

  • merged to master - October 10, 2016
  • code complete - October 6, 2016
  • development started - September 15, 2016

History / Context

This branch began as a single-purpose branch to address a random read performance opportunity that became obvious during the dynamic block size work. Testing of the random read change then uncovered a problem with the write throttle under heavy load. Both issues are addressed within this branch.

The random read performance improvement only impacts random reads / random iterators when the size of the dataset is large and the available RAM is small in comparison. This is the condition where it is highly likely that the desired .sst table file is not present in the leveldb file cache. The read operation therefore forces leveldb to open an .sst table file. The open operation requires:

  • read the file footer (at the end of the file)
  • read the file index block
  • decompress the file index block
  • read the bloom filter
  • decompress the bloom filter
  • read the file statistics table
  • decompress the file statistics table

Then the read operation uses the bloom filter and file index to determine whether the desired key is present, potentially reading one data block that might contain the key (data block read, data block decompress).

The optimization acknowledges that the bloom filter can reach 600K bytes in size while the typical data block is only 4K. Reading 600K to avoid a potential 4K read is not necessarily the correct action if the .sst file might be opened for only a single read. Similarly, iterators never reference the bloom filter, so random iterators would start faster if their .sst files did not load the bloom filter at file open. The code now delays loading the bloom filter until the second random read operation.
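A minimal sketch of the delayed-load idea follows, assuming a simplified Table class; KeyMayMatch(), reads_, and the helper names are illustrative, not the actual leveldb members. The first random read answers "maybe" and costs at most one ~4K data block read instead of a ~600K filter load; the second read triggers the load.

```cpp
#include <atomic>
#include <cstdint>

class Table {
 public:
  // Called for every random read (Get) against this table.
  bool KeyMayMatch(uint64_t block_offset, const char* key) {
    // First random read: skip the (potentially ~600K) bloom filter load
    // and answer "maybe", costing at most one ~4K data block read.
    if (reads_.fetch_add(1) == 0) return true;
    // Second random read onward: perform the delayed load once.
    if (!filter_loaded_.load()) LoadBloomFilter();
    return FilterSaysMayMatch(block_offset, key);
  }
  // Iterators never consult the bloom filter, so they never trigger the load.

 private:
  void LoadBloomFilter() { filter_loaded_.store(true); /* read + decompress */ }
  bool FilterSaysMayMatch(uint64_t, const char*) { return true; }  // stub
  std::atomic<uint64_t> reads_{0};
  std::atomic<bool> filter_loaded_{false};
};
```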

The exception to the above optimization is that newly created .sst files within the overlapped levels (levels 0 and 1), i.e. the output files of a compaction, do load their bloom filter when opened on a background compaction thread. Performing the load on the background thread adds no latency to any user request. Newly opened input files used by a background compaction do not load their bloom filter.

Testing of the above optimization demonstrated a 2x performance increase on random read operations with a 4 terabyte dataset across 64 vnodes (leveldb databases). The server was booted with only 8 gigabytes of RAM.

The write throttle problem became obvious during the creation of the 4 terabyte dataset and was completely independent of the new branch. Two functions within db/version_set.cc were not tuned together and therefore created extremely heavy throttling by effectively fighting each other. The two routines are VersionSet::Finalize() and VersionSet::UpdatePenalty(). The key issue was Finalize()'s treatment of level 1 versus level 2. Finalize() would not schedule a level 1 compaction until level 2's disk usage dropped below a designated threshold. UpdatePenalty() did not acknowledge this decision in its throttle penalty computation: it began assigning a non-linear penalty (throttle multiplier) once level 1 exceeded 8 files needing compaction. Testing demonstrated level 1 reaching 30 or more pending files. The penalty computed by UpdatePenalty() effectively blocked the write channel.

The primary correction was to have VersionSet::Finalize() delay level 1 compactions only while the file count at level 1 was below the trigger count UpdatePenalty() uses to begin penalty assignment. UpdatePenalty() also received minor adjustments to its penalty computation constants, but those adjustments are not the primary algorithmic correction. The fix is the change to Finalize() to match UpdatePenalty().

Branch description

db/builder.cc

The BuildTable() routine converts the "imm_" write buffer into a level 0 table file.

The BuildTable() routine already performed a test open of a newly created .sst table file by calling TableCache::NewIterator(). The NewIterator() call now includes an optional fourth parameter that receives a pointer to the Table object related to the .sst table file. The Table object pointer is then used to force a read of the bloom filter if the table file belongs to an overlapped level. The assumption is that overlapped levels receive heavy access and therefore should have the bloom filter ready once activated.

The Table object is "owned" by the iterator. The pointer is only valid for as long as the variable "it" is valid.
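A sketch of this test-open pattern, assuming simplified stand-ins for TableCache, Table, and Iterator; the real NewIterator() signature also carries options, file number, and file size:

```cpp
#include <memory>

struct Table {
  void ReadFilter() { /* load and decompress the bloom filter now */ }
};

struct Iterator {
  std::shared_ptr<Table> owned_table;  // the iterator "owns" the Table
};

struct TableCache {
  // Optional out-parameter receives a pointer to the Table object.
  Iterator NewIterator(Table** table_out) {
    Iterator it{std::make_shared<Table>()};
    if (table_out != nullptr) *table_out = it.owned_table.get();
    return it;
  }
};

void TestOpenNewFile(TableCache& cache, int level) {
  Table* table = nullptr;
  Iterator it = cache.NewIterator(&table);  // verifies the new .sst opens
  // Overlapped levels (0 and 1) expect heavy access: preload the filter so
  // it is ready once the file is activated.
  if (level <= 1 && table != nullptr) table->ReadFilter();
}  // iterator destroyed here; "table" is no longer valid
```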

db/c_test.c

Corrected unrelated typo: "(null" -> "(null)"

db/db_impl.cc

Exactly like BuildTable() discussed previously, the DBImpl::FinishCompactionOutputFile() function performs a test open of a newly created .sst table file by calling TableCache::NewIterator(). Again, the call now includes the optional fourth parameter that receives a pointer to the related Table object. The function loads the bloom filter for the new .sst file if the file will reside in an overlapped level.

db/table_cache.cc & .h

Previously, a TableAndFile object existed within only one entry in the file cache. The delayed bloom filter logic creates a situation where the same TableAndFile object can exist in two file cache entries: the old entry that tracks the memory usage without the bloom filter and a new entry that tracks the memory usage including the size of the bloom filter. There is no guarantee that older Get or Iterate operations will release their file cache handles synchronously with the delayed load of the bloom filter. Therefore, DeleteEntry() has to track how many file cache entries are using the TableAndFile object. The TableAndFile object is deleted only once all of its file cache entries have been deleted, in whatever order that happens.

The TableAndFile object now contains a volatile variable, user_count. It is maintained via atomic operations and controls when the TableAndFile is actually deleted within DeleteEntry().
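A sketch of the reference-counted lifetime, rendered here with std::atomic rather than the volatile variable the text describes; the real DeleteEntry() is a cache callback with a different signature:

```cpp
#include <atomic>

struct TableAndFile {
  // One count per file cache entry currently referencing this object.
  std::atomic<int> user_count{1};
  /* Table* table; RandomAccessFile* file;  (omitted) */
};

// Invoked as each file cache entry is evicted.  The TableAndFile is deleted
// only when the last referencing entry goes away, regardless of the order
// in which the entries are evicted.
void DeleteEntry(void* value) {
  TableAndFile* tf = static_cast<TableAndFile*>(value);
  if (tf->user_count.fetch_sub(1) == 1) delete tf;  // previous count was 1
}
```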

The TableCache::FindTable() function now has a sixth parameter: a bool that lets the caller indicate whether the requested table is intended for use with an iterator. The flag lets the function decide whether to read the bloom filter when the table already exists in the file cache (which implies this is not its first use). Iterators do not use the bloom filter.
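A hypothetical rendering of the flag's effect; the surrounding parameters, helper names, and types are placeholders, not the actual signature:

```cpp
#include <cstdint>

struct Status {};
struct Handle { bool filter_loaded = false; };

Handle* LookupFileCache(uint64_t /*file_number*/) { return nullptr; }  // stub
void LoadDelayedBloomFilter(Handle* h) { h->filter_loaded = true; }    // stub

Status FindTable(uint64_t file_number, uint64_t file_size, int level,
                 Handle** handle, bool is_for_iterator /* new parameter */) {
  Handle* h = LookupFileCache(file_number);
  if (h != nullptr && !is_for_iterator && !h->filter_loaded) {
    // Already cached, so this is not the file's first use: a Get() caller
    // triggers the delayed bloom filter load, an iterator caller does not.
    LoadDelayedBloomFilter(h);
  }
  // ... on a miss, open the file and insert a new cache entry ...
  (void)file_size; (void)level;
  *handle = h;
  return Status{};
}
```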

db/version_set.cc

The changes in db/version_set.cc relate only to the write throttle bug.

VersionSet::Finalize() previously withheld compactions on level 1 (the top overlapped level) while level 2 (the landing level) still exceeded its "desired size". The goal was to minimize the total number of bytes that could potentially participate in a compaction of level 1 and level 2 files. This goal worked against the penalty calculation in VersionSet::UpdatePenalty(): an extremely high penalty could result from Finalize() intentionally not scheduling needed compactions for level 1 when the number of files on level 1 exceeded config::kL0_CompactionTrigger. The penalty could be large enough to effectively block all writes.

VersionSet::Finalize() now ignores the previous optimization whenever the number of level 1 files exceeds config::kL0_CompactionTrigger (currently 6 files).
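A sketch of the revised gating; the function name and parameters are illustrative, and only the shape of the condition comes from the description above:

```cpp
#include <cstdint>

namespace config { const int kL0_CompactionTrigger = 6; }

bool DelayLevel1Compaction(int level1_file_count,
                           uint64_t level2_bytes,
                           uint64_t level2_desired_bytes) {
  // New behavior: abandon the "wait for level 2 to shrink" optimization
  // once level 1 has accumulated enough files that UpdatePenalty() would
  // otherwise start assigning a penalty.
  if (level1_file_count > config::kL0_CompactionTrigger) return false;
  // Old behavior, still used for small level 1 file counts.
  return level2_bytes > level2_desired_bytes;
}
```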

VersionSet::UpdatePenalty() received a couple of parameter tunings. The penalty for overlapped levels is now only considered if the number of files is greater than config::kL0_SlowdownWritesTrigger; previously the threshold was config::kL0_CompactionTrigger (a smaller number). This change allows Finalize() to wait for kL0_CompactionTrigger files before submitting a compaction without UpdatePenalty() assigning a penalty at the same time.

Other constants within UpdatePenalty() were changed to make the penalty less dramatic for overlapped levels.
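A sketch of the retuned trigger; the constant values and the growth curve are illustrative, and only the threshold change is taken from the text:

```cpp
namespace config {
const int kL0_CompactionTrigger = 6;      // per the text above
const int kL0_SlowdownWritesTrigger = 8;  // illustrative value
}

int OverlappedLevelPenalty(int file_count) {
  // Previously keyed off kL0_CompactionTrigger, which started the penalty
  // while Finalize() was still legitimately waiting to submit a compaction.
  if (file_count <= config::kL0_SlowdownWritesTrigger) return 0;
  const int excess = file_count - config::kL0_SlowdownWritesTrigger;
  return excess * excess;  // non-linear growth; constants illustrative
}
```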

table/table.cc & include/leveldb/table.h

The Table class supports a new routine, ReadFilter(). This routine encapsulates the steps necessary to load a bloom filter. Previously the same steps were executed during table open, as part of the ReadMeta() routine.

Table::ReadMeta() no longer reads the bloom filter. It saves the information needed for a later operation within the filter_handle and filter_policy variables which are new to the Rep class.

Table::ReadFilter() contains the code previously within ReadMeta() to actually retrieve the bloom filter from disk.
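A sketch of the split, assuming a simplified Rep and placeholder handle storage; the real code keeps a block handle and a filter policy object:

```cpp
#include <string>

struct FilterPolicy {};

struct Rep {
  // New members: enough information saved by ReadMeta() for ReadFilter()
  // to locate and load the bloom filter later.
  std::string filter_handle;                    // where the filter block lives
  const FilterPolicy* filter_policy = nullptr;  // policy to decode it
  bool filter_loaded = false;
};

void ReadMeta(Rep* rep) {
  // Previously this routine also read and decompressed the bloom filter.
  // Now it only records what ReadFilter() will need later.
  rep->filter_handle = "<handle from metaindex block>";  // placeholder
  rep->filter_policy = nullptr;                          // stubbed
}

void ReadFilter(Rep* rep) {
  // Contains the code previously in ReadMeta(): fetch and decompress the
  // bloom filter block described by rep->filter_handle.
  rep->filter_loaded = true;
}
```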

tools/sst_scan.cc

sst_scan must now call ReadFilter() so that the statistics it reports for an .sst file can include the bloom filter, which is no longer loaded at file open.

util/hot_backup.cc

Unrelated edit to clear a warning.

util/throttle.cc

Restores the old adjustment to the throttle that divided by the number of compaction threads. Without this adjustment, the updated code in VersionSet::UpdatePenalty() was over-throttling write operations.
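A sketch of the restored adjustment; the function name and inputs are illustrative, and the real throttle computation in util/throttle.cc has more inputs:

```cpp
#include <cstdint>

uint64_t AdjustedThrottleMicros(uint64_t raw_throttle_micros,
                                unsigned compaction_thread_count) {
  // Multiple compaction threads drain the backlog in parallel, so the
  // per-write delay is divided by the thread count.  Without this division,
  // the penalties from VersionSet::UpdatePenalty() over-throttle writes.
  if (compaction_thread_count == 0) compaction_thread_count = 1;
  return raw_throttle_micros / compaction_thread_count;
}
```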