mv tuning9

Matthew Von-Maszewski edited this page Jul 9, 2014 · 7 revisions

Status

  • merged to master - July 7, 2014
  • code complete - July 3, 2014
  • development started - June 30, 2014

History / Context

leveldb's performance is highly influenced by three caches: file cache, page cache, and block cache. The file cache is the most important: leveldb's performance drops 60% or more when the file cache thrashes. The page cache is the next most important, especially in environments with heavy write loads and frequent reads of recently written data. The block cache is typically helpful, but has the flaw of effectively flushing recent data after a compaction cycle. The page cache is currently the only way to cover the block cache's flaw.

Basho's leveldb has used static rules for which files get directed to the page cache and which do not. This branch makes the rules dynamic based upon current memory usage. Memory available to the block cache is implicitly diverted to page cache. "Implicitly diverting" means that the block cache will restrict its size to leave room in the operating system for more page cache memory.

The posix_fadvise() function is the tool for directing newly written data into the operating system's page cache or directly to disk. The previous static rules always sent level 0 and level 1 files to the page cache, and all other files directly to disk. The static rules were suboptimal in two situations: systems with a large ratio of physical RAM to vnodes (databases), and systems with a small ratio of physical RAM to vnodes. Hence, almost all situations were suboptimal. The static rules favored simple code over higher performance.
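As an illustration of the mechanism, here is a minimal sketch of one common way to use posix_fadvise() on Linux: write the file, flush it, and optionally ask the kernel to drop the pages. The helper name WriteWithCacheHint and its parameters are hypothetical, and this is not taken from Basho's source. The advice is only a hint, so the kernel may ignore it.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper (not from the leveldb source): writes `data` to `path`
// and, when keep_in_page_cache is false, asks the kernel to drop the file's
// pages from the page cache once they are safely on disk.
bool WriteWithCacheHint(const std::string& path, const std::string& data,
                        bool keep_in_page_cache) {
  int fd = open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;

  bool ok = true;
  std::size_t written = 0;
  while (ok && written < data.size()) {
    ssize_t n = write(fd, data.data() + written, data.size() - written);
    if (n <= 0) {
      ok = false;  // treat any short-circuit as failure; no retry logic here
    } else {
      written += static_cast<std::size_t>(n);
    }
  }

  // Dirty pages cannot be dropped, so flush before giving the advice.
  if (ok && fdatasync(fd) != 0) ok = false;

  if (ok && !keep_in_page_cache) {
    // posix_fadvise() returns an error number directly (0 on success).
    // offset 0 with len 0 means "the whole file".
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0) ok = false;
  }

  if (close(fd) != 0) ok = false;
  return ok;
}
```

Calling WriteWithCacheHint(path, data, true) leaves the freshly written pages in the page cache, while passing false requests that they be released after the flush.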

This mv-tuning9 branch adds one function that estimates the impact of sending the next compaction file into the page cache versus sending it directly to disk. The function selects the upcoming file to go to the page cache only if all current files at the lower (higher-volatility) levels can already fit within the page cache. To decide this, the function takes the amount of RAM currently allocated to the block cache and compares it to the total file size of all current files in the lower levels. The new file goes to the page cache only if there is space available.

Branch Description

db/db_impl.cc

All of the core logic is within this source file. The routine DBImpl::OpenCompactionOutputFile() previously contained a hard-coded test against IsLevelOverlapped() to decide page cache versus disk. IsLevelOverlapped() returns true for levels 0 and 1. A new dynamic routine, Send2PageCache(), now encapsulates the page cache routing decision.

Send2PageCache() is called much earlier in the OpenCompactionOutputFile() routine than where the "if" test occurs. This is because it uses data structures within the "versions_" object that are not thread safe. The call to Send2PageCache() occurs during the window in which OpenCompactionOutputFile() holds the lock that protects the "versions_" object.

The logic of Send2PageCache() is quite simple. Level 0 always goes to the page cache. Any other level goes to the page cache only if the total size of its predecessor levels is smaller than the block cache's current size.
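The decision rule can be sketched roughly as follows. This is a simplified stand-in, not the actual routine: the name Send2PageCacheSketch, the level_bytes vector, and the block_cache_bytes parameter are assumptions made for illustration, whereas the real code reads these values from the "versions_" object and the block cache.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the routing rule described above.
//   level             - level the new compaction output file belongs to
//   level_bytes[i]    - assumed total size of all current files in level i
//   block_cache_bytes - assumed current size of the block cache
// Returns true if the new file should be sent to the page cache.
bool Send2PageCacheSketch(int level,
                          const std::vector<uint64_t>& level_bytes,
                          uint64_t block_cache_bytes) {
  assert(level >= 0 &&
         static_cast<std::size_t>(level) < level_bytes.size());

  // Level 0 always goes to the page cache.
  if (level == 0) return true;

  // Sum the sizes of all predecessor (lower-numbered) levels.
  uint64_t lower_total = 0;
  for (int i = 0; i < level; ++i) {
    lower_total += level_bytes[i];
  }

  // Only route to the page cache if the predecessors already fit.
  return lower_total < block_cache_bytes;
}
```

With level sizes of 100, 200, and 400 bytes, a level 2 file is routed to the page cache when the block cache size is 301 (since 100 + 200 < 301) but not when it is 300.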

util/cache2.cc

DoubleCache::GetCapacity() is upgraded to deliver the size of the block cache in two different ways. The new, second parameter of the function, EstimatePageCache, selects which size is returned. The default is to return the size AFTER subtracting an estimate of page cache usage for ALL files. The new size, returned when EstimatePageCache is false, makes no page cache allowance. This second size lets Send2PageCache() view the entire block cache size and determine how best to populate it (or not) with lower level file data.
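The two sizes can be pictured with a small sketch. The struct CacheSizerSketch and its fields are hypothetical and greatly simplified. The real DoubleCache::GetCapacity() takes the flag as its second parameter and computes its estimate internally, neither of which is reproduced here.

```cpp
#include <cstdint>

// Hypothetical, simplified model of the two capacity views.
struct CacheSizerSketch {
  uint64_t total_bytes;          // assumed memory allotted to the block cache
  uint64_t page_cache_estimate;  // assumed estimate of page cache use by all files

  // estimate_page_cache == true  : size after subtracting the estimate (default)
  // estimate_page_cache == false : full size, with no page cache allowance
  uint64_t GetCapacity(bool estimate_page_cache = true) const {
    if (!estimate_page_cache) return total_bytes;
    // Clamp at zero so an oversized estimate cannot underflow.
    return total_bytes > page_cache_estimate
               ? total_bytes - page_cache_estimate
               : 0;
  }
};
```

In this model a caller such as Send2PageCache() would pass false to see the whole allotment, while ordinary block cache sizing would use the default.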