Mv sst fadvise

Matthew Von-Maszewski edited this page Jul 22, 2013 · 6 revisions

Status

  • merged to "master" July 22, 2013
  • development started July 11, 2013

History / Context

posix_fadvise() calls were added in haste during the Riak 1.2 development cycle. This was long before there was a good understanding of how leveldb's file operations interacted. The mv-sst-fadvise branch cleans up the usage of posix_fadvise(). Better posix_fadvise() usage leads to better Linux page cache management and more consistent latencies on disk operations.

A race condition was found during the work on this branch. It is theoretically possible for a new .sst table file to be closed before all of its write operations have completed. To make matters worse, the write operations happen on a background thread that has no way to communicate with Google's logging mechanism. This branch both eliminates the race condition and forces error reporting to syslog and the performance counters. Note: there are no known instances of this race condition actually occurring, but there is no way to be sure that it never occurred.

Branch description

Added NewWriteOnlyFile() to util/env_posix.cc and include/leveldb/env.h. This call marks written files that will not be immediately opened again within the current process. The old function, NewWritableFile(), still exists and is used by .sst table files (and unit tests) that need to be able to quickly open a file just written.

Files created/written by NewWriteOnlyFile() push their write and close operations to a background thread queue. Files created/written by NewWritableFile() perform all their operations on the current thread. Under the covers, the two calls differ only in that the former sets is_write_only_ to true in the PosixMmapFile object.

The race condition introduced in Riak 1.2 came from using the background thread for all file writes while using the current thread for close operations on files that would be re-opened quickly. First, the files that would be re-opened quickly were already being handled on background threads, so they did not benefit from further threading. Second, the extra thread performing the write operations might not have finished before the other thread closed the file handle.
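The ordering hazard described above can be modeled deterministically. The following is a minimal sketch, not leveldb code: SketchBackground and SketchWriteThenClose are hypothetical names, and the "background thread" is a plain queue that only runs when drained, which stands in for the worst case where the worker has not caught up yet. Real thread scheduling is nondeterministic, so the bad ordering would only appear occasionally.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a background thread queue. Jobs run only when
// Drain() is called, modeling a worker that lags behind the current thread.
struct SketchBackground {
  std::deque<std::function<void()>> jobs;

  void Drain() {
    while (!jobs.empty()) {
      std::function<void()> job = std::move(jobs.front());
      jobs.pop_front();
      job();
    }
  }
};

// Records the order in which "write" and "close" take effect.
//   (true,  false): the Riak 1.2 mix, write queued but close done inline
//   (true,  true):  both queued, as the page describes for NewWriteOnlyFile()
//   (false, false): both inline, as the page describes for NewWritableFile()
std::vector<std::string> SketchWriteThenClose(bool write_in_background,
                                              bool close_in_background) {
  std::vector<std::string> order;
  SketchBackground bg;

  auto write = [&order] { order.push_back("write"); };
  auto close = [&order] { order.push_back("close"); };

  if (write_in_background) bg.jobs.push_back(write); else write();
  if (close_in_background) bg.jobs.push_back(close); else close();

  bg.Drain();  // the lagging worker finally runs whatever was queued
  return order;
}
```

Only the mixed case lets the close overtake the write. Keeping both operations on the same thread, whichever one it is, preserves their order.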

NewWriteOnlyFile() is now used for recovery log files.

The routines BGFileCloser(), BGFileCloser2(), BGFileUnmapper(), and BGFileUnmapper2() in env_posix.cc were modified to check the return values of the Unix functions they call. If an error is seen, the new performance counter ePerfBGWriteError is incremented and a message is written to syslog. Syslog is unfortunately necessary since leveldb does not currently pass handles to the LOG file. The intent is to have the Riak admin console monitor this new performance counter, as it already monitors ePerfReadBlockError, to alert Riak users.
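The check-count-and-log pattern described above might look roughly like the following. This is a sketch under assumptions: SketchCloseAndReport and g_sketch_bg_write_errors are hypothetical names, the atomic counter merely stands in for ePerfBGWriteError, and the message text is invented. Only close(), syslog(), and strerror() are real library calls.

```cpp
#include <syslog.h>
#include <unistd.h>

#include <atomic>
#include <cerrno>
#include <cstring>

// Hypothetical stand-in for the ePerfBGWriteError performance counter.
std::atomic<unsigned long> g_sketch_bg_write_errors{0};

// Closes fd from a background job. On failure, bumps the counter and writes
// a message to syslog, since no LOG file handle is available here.
// Returns true when close() succeeded.
bool SketchCloseAndReport(int fd) {
  if (close(fd) == 0) return true;

  int err = errno;  // capture before another call can overwrite it
  ++g_sketch_bg_write_errors;
  syslog(LOG_ERR, "background close failed: %s", std::strerror(err));
  return false;
}
```

The same shape would apply to other calls made on the background thread, such as munmap() or ftruncate(), each of which also reports failure through its return value and errno.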

(Also note that the above four routines are highly redundant and beg for a clean, consolidated rewrite. The day for the rewrite is unfortunately not today. This branch may go into a patch release soon, so changes needed to be minimized.)

The virtual function SetMetadataOffset() was added to the WritableFile class and implemented in PosixMmapFile. This call allows the function creating a new .sst table file to provide a hint as to what portion of the new file should remain in the page cache. All .sst table files are immediately reopened after their creation. The metadata is reread as part of the open. There is measurable benefit to having the metadata marked in the Linux page cache to speed the reopen processing.
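One way such a hint could translate into posix_fadvise() calls is sketched below. The page does not say which advice values the branch passes, so the split used here (POSIX_FADV_DONTNEED for the data blocks before the metadata offset, POSIX_FADV_WILLNEED for the metadata after it) is an assumption, and SketchAdvice, SketchPlanAdvice, and SketchApplyAdvice are hypothetical names.

```cpp
#include <fcntl.h>
#include <sys/types.h>

#include <vector>

// One posix_fadvise() call to make: byte range plus advice value.
struct SketchAdvice {
  off_t offset;
  off_t length;
  int advice;
};

// Plans the advice for a freshly written file of file_size bytes whose
// metadata starts at metadata_offset. Assumed policy: drop the data blocks
// from the page cache, keep the metadata that the reopen will reread.
std::vector<SketchAdvice> SketchPlanAdvice(off_t file_size,
                                           off_t metadata_offset) {
  std::vector<SketchAdvice> plan;
  if (metadata_offset <= 0 || metadata_offset >= file_size) {
    // No usable hint: release the whole file.
    plan.push_back({0, file_size, POSIX_FADV_DONTNEED});
    return plan;
  }
  plan.push_back({0, metadata_offset, POSIX_FADV_DONTNEED});
  plan.push_back({metadata_offset, file_size - metadata_offset,
                  POSIX_FADV_WILLNEED});
  return plan;
}

// Applies the plan to an open descriptor. posix_fadvise() returns 0 on
// success and an error number on failure (it does not set errno), so the
// first nonzero result is returned to the caller.
int SketchApplyAdvice(int fd, const std::vector<SketchAdvice>& plan) {
  for (const SketchAdvice& a : plan) {
    int rc = posix_fadvise(fd, a.offset, a.length, a.advice);
    if (rc != 0) return rc;
  }
  return 0;
}
```

Separating the plan from the system calls keeps the policy easy to inspect. For a 1000-byte file with metadata starting at byte 800, the plan is two ranges: bytes 0 to 799 released and bytes 800 to 999 kept.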