Skip to content

mv valgrind cleanup

Matthew Von-Maszewski edited this page Sep 4, 2013 · 16 revisions

Status

  • merged to master September 4, 2013
  • development started August 27, 2013

History / Context

Valgrind is an excellent program diagnostic tool (www.valgrind.org). Unfortunately, it is easy to let a program get into a condition such that valgrind generates more "noise" than actionable information. The Riak branch of leveldb was in such a condition. Some of the noise was from the original Google implementation. Some of the noise was from programmer laziness in the Riak modifications. This branch cleans up all the noise and addresses an actual edge case memory leak detected by valgrind. The memory leak joined the code base with Riak 1.4.

Valgrind was used with each of the unit test programs ("make check"). The branch cleans up the leveldb code and/or the unit test as appropriate to achieve a clean valgrind execution of the following command:

valgrind --leak-check=full --show-reachable=yes ./unit_test_program

"unit_test_program" is a placeholder for each of the Google unit test programs created by "make check". Warning: valgrind makes some programs run for a long, long time. Example: "skiplist_test" requires 26 hours to complete and "db_bench" requires over an hour under valgrind. And valgrind limits everything to a single CPU. Pick a machine with a fast clock rate over one with tons of CPUs. The unit tests addressed by this branch are:

arena_test  coding_test      c_test         env_test           log_test       table_test         write_batch_test
bloom_test  corruption_test  dbformat_test  filename_test      memenv_test    version_edit_test
cache_test  crc32c_test      db_test        filter_block_test  skiplist_test  version_set_test

Branch description

pthread exit values

Any thread created with pthread_create allocates a block of memory that is later used to return the exit value of the thread. valgrind reports that exit value memory as a leak unless the code calls either pthread_join() or pthread_detach(). None of the original Google leveldb or unit test code included either call. Likewise, none of the Riak changes included either call.

Many of the unit tests utilize the Env::StartThread call to create a thread. Previously this call had a "void" return value. It now returns the pthread thread id. The Env::StartThread declaration changed in include/leveldb/env.h. Also the EnvPosix::StartThread declaration and implementation changed within util/env_posix.cc. The unit tests utilizing the call now save the return value and immediate call pthread_detach() with that value. Memory leak fixed.

Google's original PosixEnv::Schedule() routine in util/env_posix.cc called pthread_create directly instead of calling Env::StartThread. Riak's expansion of this routine to 4 background threads followed the same design. The pthread thread id has always been saved in both cases, but never used. The thread id is now used in the PosixEnv::~PosixEnv destructor to properly close all threads via pthread_join().

leveldb shutdown

Google's leveldb implementation lacks an API call for shutting down its background compaction thread. This branch adds a static function Env::Shutdown(). Env::Shutdown() is contained in util/env_posix.cc. The function current takes shutdown steps via the PosixEnv array to kill background threads, via the new shutdown calls in util/throttle.cc, and via new shutdown call in util/comparator.cc:

  • PosixEnv::~PosixEnv() now uses the previously ignored pthread thread id to synchronize a shutdown of its contained background threads (util/env_posix.cc). PosixEnv::BGThread() was previously an infinite loop. It now monitors a state variable to exit the loop. PosixBGThread() does not exit until its thread queue is empty. This behavior solves two problems. The common problem is that leveldb has no awareness of thread #3 which performs disk writes and closes asynchronously. Those need to complete! The race condition problem is that a leveldb database close should purge all entries from the other 3 queues. But the database close could be on a thread that is independent of the thread performing the shutdown (Travis CI is suspected to have hit this). This means close thread waits upon compactions to abort, but the thread that would perform the abort would be gone ... without this wait to finish queue logic.

  • util/throttle.cc & .h now have a new global shutdown function. It manipulates a new pthread_conditional and status flag to kill off the throttle maintenance thread.

  • util/comparator.cc & .h now have a ComparatorShutdown() global function. This cleans up the global comparator that leveldb creates upon first use of an Options object.

overlapped file memory leak

Riak 1.4 added code to force overlapped .sst table files to never flush from the file cache. The overlapped .sst files populate levels 0 and 1. They are accessed heavily. It was possible in several Riak environments for these heavily files to be flush regularly, requiring expensive reload operation to happen equally regularly. This modification provides a huge performance savings in particular physical environments and/or workloads.

However, there were some exception scenarios where "never flush" became a memory leak. mv-clean-overlaps branch previously addressed the most common scenario. valgrind use with Google's db_test program identified 4 more scenarios:

  • current overlapped files were not purged from memory on vnode / database close (fixed by mv-clean-overlaps)
  • new level 0 files, when aborted, still held memory (DBImpl::WriteLevel0Table in db/db_impl.cc)
  • new level 1 file s, when aborted, still held memory (DBImpl::DoCompactionWork in db/db_impl.cc)
  • level 0 or level 1 files are compacted into higher levels, but are still are part of an iterator (Riak's AAE or 2i) would not release memory when iterator released (Version::~Version in db/version_set.cc). This change was better, more generic, than mv-clean-overlaps branch and therefore mv-clean-overlaps code in VersionSet::~VersionSet was removed.
  • the repair operation builds its own map of .sst table files, some of which are overlapped files. It needed its own release code in db/repair.cc.