mv write sizing

Matthew Von-Maszewski edited this page May 8, 2014 · 4 revisions

Status

  • merged to master May 8, 2014
  • code complete May 8, 2014
  • development started April 30, 2014

History / Context

Basho's Riak uses POSIX memory-mapped files for all data written within leveldb. Google's original code used a fixed 2 megabyte mapping window. Basho raised this to 20 megabytes to increase write throughput. The 20 megabyte size created two problems:

  1. Each database / vnode holds at least two files open: the manifest file and the recovery log file. With a 20 megabyte window per file, each vnode allocates 40 megabytes of memory space. A server with 64 vnodes plus 64 anti_entropy vnodes would allocate 5 gigabytes of space just for these files. It is unlikely that physical RAM is actually allocated to all 5 gigabytes at one time, but that memory usage must still be assumed in order to safely calculate the memory available for file and block caches.

  2. Riak's anti-entropy and 2i features tend to create smaller .sst table files. The entire 20 megabyte mapping window is always written to disk before the file is truncated to the actual size of its data. The wasted space at the end of the file can represent a large amount of unnecessary disk activity on these common, smaller files.
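The arithmetic behind both problems can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and it assumes one 20 megabyte window per open file (the reading that matches the 5 gigabyte total) and that a "megabyte" means 1024 * 1024 bytes.

```cpp
#include <cstdint>

// Assumption: "megabyte" is treated as 1024 * 1024 bytes.
constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;

// Hypothetical helper: address space claimed by the files that every
// vnode keeps open (manifest + recovery log), one window per file.
constexpr std::uint64_t ReservedBytes(std::uint64_t vnodes,
                                      std::uint64_t files_per_vnode,
                                      std::uint64_t window_bytes) {
  return vnodes * files_per_vnode * window_bytes;
}

// Hypothetical helper: bytes written past the end of the real data when
// the final window is flushed in full before the file is truncated.
constexpr std::uint64_t WastedBytes(std::uint64_t data_bytes,
                                    std::uint64_t window_bytes) {
  return data_bytes % window_bytes == 0
             ? 0
             : window_bytes - data_bytes % window_bytes;
}

// 64 vnodes + 64 anti_entropy vnodes, two files each, 20 MiB per file.
static_assert(ReservedBytes(64 + 64, 2, 20 * kMiB) == 5120 * kMiB,
              "128 vnodes reserve 5 GiB");
```

Under the same assumptions, `WastedBytes(1 * kMiB, 20 * kMiB)` gives 19 MiB of extra writes for a 1 MiB .sst file, which is the kind of overhead the second item describes.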

This branch modifies the file creation routines to include an mmap size parameter. The sizes differ based upon the file type/usage:

  • Manifest file: 4,096 byte window. File state information is gradually appended to this file over time. There is no performance benefit to having a large mmap window.

  • Level 1+ .sst files: 20 megabyte window. This is the same as before this branch. However, there is a new Option parameter, mmap_size, that will globally change the window size. (The mmap_size option is from the GitHub user Licenser and is detailed in pull request #122.)

  • Level-0 .sst files and Recovery .log files: window size varies. The window size for these two file types is based upon the value of Options.write_buffer_size. When write_buffer_size is "large" (greater than 10 megabytes), the window size is 2/3 of write_buffer_size. This size is a guess at how to write the file in two buffers without too much extra space at the end. Similarly, when write_buffer_size is "small" (10 megabytes or less), the window size is 1.2 times write_buffer_size, in hopes of fitting the entire file within one buffer.
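The sizing rules in the list above can be summarized as a single selection function. The sketch below is not the code from this branch: the `FileKind` enum and `MmapWindowSize` name are hypothetical, "10 megabytes" is assumed to mean 10 * 1024 * 1024 bytes, and the integer rounding is a choice made for the example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical file categories; these names are illustrative only and
// are not taken from the leveldb source.
enum class FileKind { kManifest, kRecoveryLog, kLevel0Table, kSortedTable };

// Assumption: "megabyte" is treated as 1024 * 1024 bytes.
constexpr std::size_t kMiB = 1024 * 1024;

// Returns the mmap window size in bytes for a new file.
// write_buffer_size mirrors Options.write_buffer_size; mmap_size mirrors
// the new Options.mmap_size and defaults to the previous 20 megabytes.
inline std::size_t MmapWindowSize(FileKind kind,
                                  std::size_t write_buffer_size,
                                  std::size_t mmap_size = 20 * kMiB) {
  switch (kind) {
    case FileKind::kManifest:
      return 4096;  // small appends only, no benefit from a large window
    case FileKind::kSortedTable:
      return mmap_size;  // level 1+ .sst files
    case FileKind::kRecoveryLog:
    case FileKind::kLevel0Table:
      if (write_buffer_size > 10 * kMiB) {
        // "large": aim for two windows per file (2/3 of the buffer each).
        // Dividing first avoids overflow, at the cost of rounding down.
        return write_buffer_size / 3 * 2;
      }
      // "small": aim for the whole file in one window (1.2x the buffer).
      return write_buffer_size / 10 * 12;
  }
  assert(false && "unknown FileKind");
  return mmap_size;
}
```

For example, under these assumptions a 60 megabyte write_buffer_size yields a 40 megabyte window for level-0 and recovery log files, while a 5 megabyte write_buffer_size yields a 6 megabyte window.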

Branch description

db/db_impl.cc