
mv throttle 5


Status

  • merged to master -
  • code complete - July 22, 2015
  • development started - June 26, 2015

History / Context

Basho's write throttle is an ongoing learning and tuning experience. This branch updates changes first released in branch mv-sequential-tunings. The mv-throttle-5 branch corrects a couple of bugs from the mv-sequential-tunings branch as well as improves overall throughput of leveldb under heavy loads.

The write throttle is a key modification to leveldb to enhance users' performance experience. The write throttle gradually slows individual user write requests when leveldb's compactions are falling behind. The compactions fall behind when more user write requests accumulate in lower levels without being merged into higher levels. The worst case occurs when one of two conditions causes Google's original write logic to stall (block all writes) until compactions catch up. Basho's write throttle incrementally delays user write requests to give compactions time to catch up without causing sudden blockage of all writes.
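As a rough illustration of the difference between a hard stall and an incremental throttle, the sketch below delays an individual write by an amount that grows with the compaction backlog. The names (ComputeThrottleUsec, ThrottledWrite) and the simple linear formula are hypothetical, not Basho's actual code.

```cpp
// Illustrative sketch only, not Basho's code: names and formula are hypothetical.
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical: derive a small per-write delay from how far compactions lag.
uint64_t ComputeThrottleUsec(uint64_t pending_compaction_bytes,
                             uint64_t desired_bytes) {
  if (pending_compaction_bytes <= desired_bytes)
    return 0;  // compactions are keeping up: no delay
  // The delay grows gradually with the backlog instead of blocking outright.
  return (pending_compaction_bytes - desired_bytes) / 1024;
}

void ThrottledWrite(uint64_t pending_bytes, uint64_t desired_bytes) {
  uint64_t delay_usec = ComputeThrottleUsec(pending_bytes, desired_bytes);
  if (delay_usec > 0)
    std::this_thread::sleep_for(std::chrono::microseconds(delay_usec));
  // ... perform the actual write here ...
}
```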

Google divides the "levels" into two types: overlapping and sorted. Google's original leveldb allows .sst table files in level-0 to have overlapping key ranges. Any given key can be present in one or more of the .sst table files in level-0. Levels 1 through 6 are sorted. Each .sst table file in a given sorted level must contain a unique, contiguous key range. A given key can only exist in one possible file within each level. Different revisions of a key can exist once in every level-1 through level-6. Only the key in the lowest level is considered "current".

Basho previously redefined both level-0 and level-1 to be overlapping levels. This change increased compaction performance by gathering a larger byte count of keys in overlapping files before merging them with the first sorted level, now level-2. Basho's nickname for level-2 is the "landing level": all keys land on level-2 in sorted order for the first time.

A previous branch, mv-sequential-tunings, addressed a key shortcoming of Basho's original throttle. The original throttle only activated when work queues accumulated too many compactions across many databases. Thus the original throttle responded only to global loads of all open databases (Riak vnodes). A single database with a focused write load, such as a Riak handoff, would not activate the write throttle. The single database would then experience stalls. The stalls could last long enough to trigger timeout errors in network connections, which in turn failed the Riak handoff process. The mv-sequential-tunings branch added logic to activate the write throttle when a single database accumulates too many waiting compactions.

This branch fixes two bugs in mv-sequential-tunings' logic and changes the compaction selection rules. The selection rule changes result in better throughput in all tested workloads. There are also adjustments to the penalty computation that result in smaller worst-case latencies.

The first fix is to count the move of a .sst table file via rename as a compaction. The global write throttle estimates the average amount of backlog by dividing the size of the waiting queue by the number of compactions. Move operations did not exist prior to the mv-sequential-tunings branch, so there was previously no need to count a "move" as a "compaction"; that requirement was then missed when "move" was activated.

The second fix is to always use the larger of the global throttle value and the unadjusted throttle value when a database (vnode) knows it is behind in compactions. The mv-sequential-tunings branch always used the unadjusted throttle, which could be significantly smaller than the global throttle during heavy loads. Selecting the wrong throttle value led to erratic latencies and reduced throughput.

Basho's original compaction selection algorithm grew out of Google's algorithm. Both algorithms tended to give lower levels priority over higher levels. This tendency was amplified in Basho's algorithm by the addition of simultaneous compaction support. Under long periods of heavy load, especially load targeting only one or a few databases (vnodes), the middle levels would over-accumulate .sst table files. An over-accumulation in level-2, the landing level, caused extremely inefficient compaction behavior. It could take the compaction threads hours to correct the inefficient .sst table file positioning within the levels once the write load ended. This branch now blocks .sst table files from compacting upward if the receiving level is larger than its "desired" size. The blocking effectively gives the higher levels priority in compaction selection when they accumulate too many files. The new compaction selection algorithm also better supports tiered storage arrays when the "slow" array is considerably slower than the "fast" array and the "fast" array is quickly reaching its storage capacity.

Branch Description

Note: Portions of the logic change in VersionSet::UpdatePenalty() (db/version_set.cc) were checked into the repository without a formal branch, due to the timing of a customer-facing issue. The code from that direct checkin was messy. Specifically, a large #if 0/#else/#endif block is eliminated in this branch to make the code flow more naturally. The original, old code was previously within the #if 0/#else section.

db/db_impl.cc, routine DBImpl::BackgroundCompaction()

"if" block govern by IsTrivialMove() causes a rename operation to quickly move a .sst file from one level's directory to another. This move operation needs to count as "compaction" when the throttle compares work queue backlog to compaction completion rate. The new call to SetThrottleWriteRate() causes the compaction count to increase without changing the accumulation of time spent compacting and number of keys compacted, i.e. first two call parameters are zero.

db/db_impl.cc, routine DBImpl::DoCompactionWork()

The SetThrottleWriteRate() call was adjusted: the fourth parameter is eliminated in this branch. The previous tracking of work queue size was done incorrectly. The work queue size should be a snapshot per tracking period, not an accumulation of the sizes seen at the time of each compaction. The snapshot code change is discussed as part of the throttle.cc changes.

db/db_impl.cc, DB::CheckAvailableCompactions(), DBImpl::CheckAvailableCompactions()

CheckAvailableCompactions() is a new Basho specific API call. It allows a completing compaction thread to suggest that other databases recheck whether they have an eligible grooming compaction. Grooming compactions do NOT accumulate on the thread work queues, so any given database might previously have had a grooming compaction rejected with no way to "retry" it. This API call creates a clean means of providing that retry.

DB::CheckAvailableCompactions() provides a default implementation for the base class. DBImpl::CheckAvailableCompactions() provides the live implementation used in leveldb.
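A sketch of the shape of this API, assuming the usual leveldb base-class/implementation split; the method bodies here are illustrative guesses, not the actual Basho implementation.

```cpp
// Shape of the new API as described above; bodies are illustrative only.
class DB {
 public:
  virtual ~DB() {}
  // Default no-op so existing DB subclasses keep compiling.
  virtual void CheckAvailableCompactions() {}
};

class DBImpl : public DB {
 public:
  // Live implementation: re-evaluate whether this database now has an
  // eligible grooming compaction and, if so, schedule it.  (The actual
  // scheduling calls are omitted / assumed.)
  void CheckAvailableCompactions() override {
    // MutexLock l(&mutex_);
    // MaybeScheduleCompaction();
  }
};
```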

db/db_impl.h

Add CheckAvailableCompactions() declaration for implementation class.

db/db_test.cc

Correct a test's kMaxFiles constant to account for Basho's leveldb having two overlapped levels instead of just one.

db/version_set.cc, routine Version::Get()

Correct the Basho specific performance counter to accumulate the number of files searched on a level, not the number of times a level gets searched.

db/version_set.cc, routine Finalize()

This routine embodies the compaction selection algorithm. The key adjustment is to have each decision also verify that the size of the destination level, i.e. "level+1", is currently within size tolerance. This branch uses m_DesiredBytesForLevel for the decision. m_MaxBytesForLevel seems more intuitive, but once any level gets above that value a write throttle penalty occurs and performance goes down. (m_DesiredBytesForLevel + m_MaxBytesForLevel)/2 might be a better choice, but requires future testing to verify. That value would allow grooming compactions to potentially clear the destination level naturally instead of via a blockage of the lower level.
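A sketch of the selection check described above, using the m_DesiredBytesForLevel name from this page; the surrounding structures are simplified stand-ins for the real Version/VersionSet data.

```cpp
// Simplified illustration of the destination-level size check in Finalize().
#include <cstdint>

struct LevelStatsSketch {
  uint64_t bytes[7];                  // total .sst bytes currently in each level
  uint64_t m_DesiredBytesForLevel[7]; // per-level "desired" size targets
};

// Return true if "level" may push a compaction upward into "level+1".
bool DestinationHasRoom(const LevelStatsSketch& v, int level) {
  // Block the upward compaction while the receiving level is already past its
  // desired size; this effectively hands priority to the higher level.
  return v.bytes[level + 1] <= v.m_DesiredBytesForLevel[level + 1];
}
```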

db/version_set.cc, routine UpdatePenalty()

This routine previously had two distinct blocks: one for overlapping levels and one for sorted levels. This tuning now uses the same penalty calculation for both types of levels. Only the parameters used in calculating the penalty differ.

The parameters became progressively smaller during the testing of this branch. There might now be an even simpler way to code the penalty, but it is not worth the effort at this time.

The overlapping levels continue to set penalty parameters based upon the count of files in the level, not the byte count of the level. The file count is compared against kL0_SlowdownWritesTrigger and kL0_CompactionTrigger. This branch significantly changed the penalty parameters for level-1; the parameters started high and gradually settled on lower values during testing. Level-0 is unchanged.

The sorted levels use the same penalty parameters for any level that exceeds m_MaxBytesForLevel. This has not changed. The branch changes the parameter calculations for level-2 (the landing level) when its total byte count falls between m_DesiredBytesForLevel and m_MaxBytesForLevel. The parameters create a sliding scale from a light to a slightly heavy penalty across that range.
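A hypothetical sketch of the sliding-scale idea for the landing level. The coefficients (1 through 10, 100) are made up for illustration; the actual UpdatePenalty() parameters differ.

```cpp
// Illustration of a sliding-scale penalty between "desired" and "max" sizes.
#include <cstdint>

uint64_t LandingLevelPenalty(uint64_t level_bytes,
                             uint64_t desired_bytes,
                             uint64_t max_bytes) {
  if (level_bytes <= desired_bytes)
    return 0;                     // within tolerance: no penalty
  if (level_bytes >= max_bytes)
    return 100;                   // above max: the usual heavy penalty applies
  // Linear interpolation from light (1) to slightly heavy (10) as the level
  // fills the gap between desired and max.
  return 1 + (9 * (level_bytes - desired_bytes)) / (max_bytes - desired_bytes);
}
```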

db/version_set.h, WriteThrottleUsec()

This is a key change in this branch. GetThrottleWriteRate() is the rate used for global write throttling when the work queues persistently have waiting compactions. GetUnadjustedThrottleRate() is the rate used by a single database (vnode) that is behind in its compactions. Testing demonstrated that the single database case needed to use whichever throttle is larger. Otherwise the single database would actually contribute more load to an overloaded compaction queue instead of helping reduce the burden.
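A sketch of the selection rule, assuming the GetThrottleWriteRate() and GetUnadjustedThrottleRate() names mentioned above; the surrounding signature and the exact conditions are simplified.

```cpp
// Simplified illustration of choosing the larger throttle for a lagging database.
#include <algorithm>
#include <cstdint>

uint64_t GetThrottleWriteRate();       // global throttle (queues backed up)
uint64_t GetUnadjustedThrottleRate();  // per-database baseline throttle

// Delay applied to a write on a database that knows it is behind on its own
// compactions: take whichever throttle is larger, so a hot vnode cannot keep
// piling work onto an already overloaded compaction queue.
uint64_t WriteThrottleUsecSketch(bool this_db_is_behind) {
  if (!this_db_is_behind)
    return GetThrottleWriteRate();     // otherwise the global throttle, as before
  return std::max(GetThrottleWriteRate(), GetUnadjustedThrottleRate());
}
```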

include/leveldb/db.h

Declare the new Basho specific API function in the base class.

include/leveldb/env.h

Remove a Basho specific API function in the Env class. The function is no longer necessary. It was part of an incorrect design.

util/env_posix.cc, routine GetBackgroundBacklog()

The code design that used this function was incorrect. The corrected design gets similar information in a different fashion within the main throttle loop (util/throttle.cc).

util/thread_tasks.cc, entire file

This is a new file. Previously both the declaration and definition of the various ThreadTasks existed within thread_tasks.h. The addition of the GroomingPollTask object and its usage within CompactionTask necessitated the creation of this new source file.

CompactionTask::operator() performs the execution as previously implemented. The function contains new code that checks whether the main work queue is in a state that might (but is not guaranteed to) support a grooming compaction. If so, it creates a GroomingPollTask object to query all the open databases. The subtlety here is that a grooming compaction can only occur on one specific thread of the thread pool, and it is highly probable that the current thread is that specific thread. The poll therefore needs to occur from a different thread pool and on a different thread, so that the current thread can complete and be ready to accept a future grooming task.

This thread-to-thread handoff is not perfect. It is quite possible that the current thread will yield its CPU time for the entire execution of the GroomingPollTask on another thread. No grooming compaction will start in such a case, but that is acceptable.

GroomingPollTask::operator() calls the DBList object with the function that should execute on every user database. The first parameter of ScanDBs is "false" to indicate that internal databases, such as AAE, should not be polled.

The routine is inefficient in that it will always call every database. If one database successfully starts a grooming task, it is extremely unlikely that any subsequent database will start one too, so the scan could exit early. However, ScanDBs() does not have abort logic at this time.
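A sketch of the polling flow. The "false" first parameter to ScanDBs() comes from the description above; the second parameter's type, the DBList() accessor, and the use of CheckAvailableCompactions() as the per-database callback are assumptions.

```cpp
// Simplified illustration of GroomingPollTask polling every user database.
class DBImpl {
 public:
  void CheckAvailableCompactions();  // re-check for an eligible grooming compaction
};

class DBListImpl {
 public:
  // Apply a member function to every open database; "false" limits the scan
  // to user databases (internal databases such as AAE are skipped).
  void ScanDBs(bool include_internal, void (DBImpl::*func)());
};

DBListImpl* DBList();  // accessor for the global list of open databases (assumed)

class GroomingPollTask /* : public ThreadTask */ {
 public:
  // Executes on a different thread pool than the one that owns grooming
  // compactions, so the originating compaction thread stays free.
  void operator()() {
    DBList()->ScanDBs(false, &DBImpl::CheckAvailableCompactions);
  }
};
```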

util/thread_tasks.h

Add the GroomingPollTask object. Move its operator() and CompactionTask's operator() into the new source file util/thread_tasks.cc.

util/throttle.cc

m_Backlog: Previously the throttle tracked the number of backlog items present at each compaction. Several line changes within ThrottleThread() implement a new approach to backlog tracking. Now the current period's m_Backlog is kept out of the accumulation and is set later to the number of backlog items existing at that moment.

Mathematically this is changing from "average backlog existing per compaction" to "backlog to compaction ratio over the hour". The latter is more appropriate for estimating a throttle that balances the ingest rate against the compaction rate. The former can cause huge, unreasonable throttle spikes.
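A sketch contrasting the two calculations. The period structure and the averaging window are assumptions based on the description above.

```cpp
// Illustration of "backlog to compaction ratio over the window" (new) versus
// "average backlog per compaction" (old).  Names are hypothetical.
#include <cstdint>

struct ThrottlePeriod {
  uint64_t compactions;       // compactions completed during this period
  uint64_t backlog_snapshot;  // work-queue depth sampled once, at period end
};

// New approach: one backlog snapshot per period divided by compactions
// completed across the whole window.
double BacklogRatio(const ThrottlePeriod* periods, int n) {
  uint64_t backlog = 0, compactions = 0;
  for (int i = 0; i < n; ++i) {
    backlog     += periods[i].backlog_snapshot;
    compactions += periods[i].compactions;
  }
  return compactions ? static_cast<double>(backlog) / compactions : 0.0;
}
// The previous code instead summed the queue depth observed at *every*
// compaction, which let a long queue be counted many times and produced
// throttle spikes.
```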

(note to self, still some debug code at end of SetThrottleWriteRate() ... clean it out)
