
mv tuning7


Status

  • merged to master
  • code complete - May 30, 2014
  • development started - April 10, 2014

History / Context

This discussion covers the mv-tuning7 branches in both the eleveldb repository and the leveldb repository. Most of the changes and fixes are motivated by Scott Fritchie's fault injection work. The most notable fix relates to a SEGFAULT error that can occur if Erlang garbage collects its heap before the glibc destructor code completes. (The discovery of this latter issue required coupling Valgrind with the fault injection.) The "Issues" addressed are:

  • eleveldb Issue #107: the failure in case C1 is fixed. The failure in case C4 was moved to eleveldb Issue #112.
  • eleveldb Issue #110: we refer to this as the SegFault bug. The "Invalid write of size 8" is glibc changing the vptr of the C++ object that resides in the Erlang heap.
  • eleveldb Issue #71: this issue had to be addressed in order to properly correct eleveldb Issue #110 above.

Branch description

Object close and release recoding

The bulk of the changes in eleveldb's mv-tuning7 branch relate to eleveldb Issue #110 mentioned above. A race condition was unknowingly added in Riak 1.3 when the simple "struct" that was kept within Erlang's heap was replaced with a "class" that contained virtual members. The memory debugging tool Valgrind noted that an 8 byte region at the base of a supposedly destroyed object was randomly being written after Erlang garbage collected. Debugging of the exact object and memory location by Steve Vinoski suggested that the object's vptr was changing very "late". An Internet search revealed this:

    http://code.google.com/p/thread-sanitizer/wiki/PopularDataRaces#Data_race_on_vptr

The situation discussed in the above link completely described the problem. The eleveldb code was then reworked to keep the C++ class object on the C heap. Only a pointer to the object exists on the Erlang heap now.
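
As a rough illustration of the new layout (a minimal sketch, not the actual eleveldb code; DbObject, DbHandle, and the member names are placeholders), the memory that Erlang garbage collects now holds nothing but a plain pointer, while the object with virtual members lives on the C heap and is destroyed explicitly:

    #include <cstddef>

    // The C++ object with virtual members now lives only on the C heap.
    class DbObject {             // placeholder name
    public:
        virtual ~DbObject() {}   // virtual members imply a vptr inside the object
        // ... leveldb::DB handle, reference counts, etc.
    };

    // The only thing stored in the Erlang-managed resource memory is a raw
    // pointer.  Erlang's garbage collector can discard this slot at any time
    // without any C++ destructor writing into the Erlang heap.
    struct DbHandle {
        DbObject * volatile m_DbPtr;
    };

    int main() {
        DbHandle handle;                  // stands in for enif_alloc_resource() memory
        handle.m_DbPtr = new DbObject();  // the object itself is on the C heap
        delete handle.m_DbPtr;            // explicit destruction, independent of the GC
        handle.m_DbPtr = NULL;
        return 0;
    }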

The race condition involved database close operations and iterator close operations. Both databases and iterators placed their "class" objects within the Erlang heap. Everything worked fine as long as the close operation (either database or iterator) was initiated via the eleveldb API and completely destroyed the object before Erlang's garbage collection reached the object's position in the Erlang heap. Unfortunately, this assumption could fail if the Erlang process managing the database or iterator died. The assumption could also fail if the API call completed most, but not all, of the glibc destructor code before Erlang's garbage collection reached the object. This last condition happened to be nicely stimulated by Valgrind.

New close protocol:

  • Claim the close: the eleveldb API code now schedules an async thread to perform the actual close work (relates to eleveldb Issue #75). The "claim" needs to happen during the initial API call to reduce the chance of another contender acting on the close instead of the async thread.

  • Perform the close: the winner of the claim performs the close steps. The close can take significant time and block its thread. Hence the desire to execute the close on an async thread whenever possible.

"Claim the close" has three potential contenders for an iterator:

  1. The eleveldb API call to close the iterator,
  2. A closing database will attempt to close any lingering iterators, and
  3. Erlang's garbage collector can attempt to close the iterator.

A database object has only two contenders for its close: the eleveldb API call to close the database and Erlang's garbage collector.

The claim is managed via an atomic compare and swap operation against the object's pointer within the Erlang heap. The first thread to successfully swap the pointer becomes responsible for the actual close operation (either immediately or by an asynchronous thread task).
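
A minimal sketch of the claim step, with hypothetical names (ItrObject, ItrHandle, and ClaimClose are illustrative, not the actual eleveldb code): each contender tries to atomically swap the pointer slot held in the Erlang heap; only the thread whose compare-and-swap succeeds goes on to perform the close.

    #include <atomic>
    #include <cstddef>
    #include <cstdio>

    class ItrObject { };   // placeholder for the real C++ iterator object

    // Pointer slot that lives inside the Erlang-heap resource.
    struct ItrHandle {
        std::atomic<ItrObject *> m_ItrPtr;
    };

    // Returns the object pointer if this caller won the claim, NULL otherwise.
    // All three contenders (API close, database close, Erlang GC) use the same
    // routine, so exactly one of them ends up owning the close.
    ItrObject * ClaimClose(ItrHandle & handle) {
        ItrObject * expected = handle.m_ItrPtr.load();
        while (NULL != expected) {
            // The winner atomically replaces the pointer with NULL; losers
            // observe the updated (NULL) value and back off.
            if (handle.m_ItrPtr.compare_exchange_weak(expected, (ItrObject *)NULL))
                return expected;                 // this thread owns the close
        }
        return NULL;                             // someone else claimed it first
    }

    int main() {
        ItrHandle handle;
        handle.m_ItrPtr.store(new ItrObject());

        ItrObject * claimed = ClaimClose(handle);
        if (NULL != claimed)
            delete claimed;                      // winner performs / schedules the close

        // A later claim attempt (e.g. the garbage collector) finds NULL and does nothing.
        std::printf("second claim: %p\n", (void *)ClaimClose(handle));
        return 0;
    }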

The atomic compare and swap does create a very, very remote chance of a new type of SegFault. The following three increasingly unlikely conditions must all be met for the failure:

  • Erlang's garbage collector must win the "claim", completely execute the close, and finish its garbage collection.
  • The Erlang heap must have been obtained through an mmap() allocation, and that allocation must have been released back to the operating system.
  • The API close must still attempt its "claim" after the above two.

The new design should only have Erlang's garbage collector winning the claim if the process holding the iterator or database object reference dies. When the process dies, the likelihood of that process making the close API calls dies with it.

eleveldb: c_src/eleveldb.cc and src/eleveldb.erl

Two close APIs were changed from synchronous to asynchronous operations: eleveldb_close() and iterator_close(). Both functions now use the "Claim the close" protocol discussed above. The asynchronous task creation mimics the technique already employed by other API calls.

eleveldb: c_src/refobjects.cc and .h

The close protocol relies upon the new member variable m_ErlangThisPtr within the ErlRefObject class. The old protocol depended heavily upon m_CloseRequested instead; m_CloseRequested is still maintained, but is no longer essential to the protocol.

ErlRefObject::InitiateCloseRequest() is only called by the winner of the claim. It calls an object-specific Shutdown() function, then waits on a condition variable until all references to the object have been released. Object destruction then completes, performed either by the thread that was waiting or by the thread that just performed the last release. Either is fine.

ErlRefObject::DecRef() is an important partner to InitiateCloseRequest(). It detects the release of the last external reference to the object and notifies the closing thread that the close can now finalize.
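
A hedged sketch of the shutdown handshake between InitiateCloseRequest() and DecRef(), rebuilt here with standard C++ primitives (the real ErlRefObject uses eleveldb's own reference counting and synchronization helpers, so treat the names and details as illustrative):

    #include <atomic>
    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <thread>

    class RefObject {
    public:
        RefObject() : m_RefCount(1) {}
        virtual ~RefObject() {}

        void IncRef() { ++m_RefCount; }

        // The last release wakes the closing thread so destruction can finish.
        void DecRef() {
            if (0 == --m_RefCount) {
                std::lock_guard<std::mutex> lock(m_Mutex);
                m_Cond.notify_all();
            }
        }

        // Called only by the winner of the "claim": run the object-specific
        // shutdown, then wait until every outstanding reference is released.
        void InitiateCloseRequest() {
            Shutdown();
            std::unique_lock<std::mutex> lock(m_Mutex);
            m_Cond.wait(lock, [this] { return 0 == m_RefCount.load(); });
            // destruction may now complete safely
        }

    protected:
        virtual void Shutdown() { DecRef(); }  // release the construction reference

        std::atomic<int>        m_RefCount;
        std::mutex              m_Mutex;
        std::condition_variable m_Cond;
    };

    int main() {
        RefObject obj;
        obj.IncRef();                       // a worker thread still holds a reference
        std::thread worker([&obj] {
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
            obj.DecRef();                   // last release notifies the closing thread
        });
        obj.InitiateCloseRequest();         // blocks until the worker releases
        worker.join();
        return 0;
    }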

eleveldb: c_src/workitems.cc and .h

Adds CloseTask and ItrCloseTask objects to manage close operations on a worker (async) thread. Code that previously executed synchronously within c_src/eleveldb.cc was moved here.
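
A minimal sketch of the work-item idea, assuming a generic thread submission function (CloseTask, SubmitWork, and AsyncClose below are illustrative names, not the actual eleveldb classes): the API call claims the close and queues a task, and the worker thread executes the potentially slow close off the calling thread.

    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <thread>

    class DbObject {
    public:
        void RunClose() { /* flush, release file handles, ... */ }
    };

    // Task object holding the claimed pointer; its body is the close work
    // that used to run synchronously inside the NIF call.
    struct CloseTask {
        DbObject * m_Db;
        void operator()() const {
            m_Db->RunClose();   // may block for a long time
            delete m_Db;
        }
    };

    // Stand-in for eleveldb's worker-thread pool submission.
    void SubmitWork(CloseTask task) {
        std::thread(task).detach();
    }

    // Shape of the reworked close entry point: claim, queue, return immediately.
    bool AsyncClose(std::atomic<DbObject *> & slot) {
        DbObject * claimed = slot.exchange(NULL);   // claim the close
        if (NULL == claimed)
            return false;                           // another contender already claimed it
        SubmitWork(CloseTask{claimed});             // real work happens on an async thread
        return true;
    }

    int main() {
        std::atomic<DbObject *> slot(new DbObject());
        AsyncClose(slot);        // returns quickly; close runs on the worker thread
        AsyncClose(slot);        // second attempt finds NULL and does nothing
        std::this_thread::sleep_for(std::chrono::milliseconds(10));  // let the toy worker finish
        return 0;
    }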

leveldb: db/db_impl.cc

Several Log() operations were either moved outside regions of code where the global mutex_ is held, or in some cases eliminated. Log() statements can block. If the global mutex_ is held while a Log() statement blocks, all other activity within the same database is subject to being blocked as well, so a single blocking Log() statement could stall everything. This was highly likely in situations with heavy disk activity.
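
A rough sketch of the pattern, assuming simplified names (not the actual leveldb code): gather whatever needs to be reported while the mutex is held, then release the lock before the potentially blocking Log() call.

    #include <cstdio>
    #include <mutex>
    #include <string>

    std::mutex mutex_;                    // stands in for leveldb's global db mutex

    void Log(const char * msg) {          // stand-in for leveldb's blocking Log()
        std::fprintf(stderr, "%s\n", msg);
    }

    void CompactionFinished() {
        std::string summary;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            // ... update version / compaction state while holding the mutex ...
            summary = "compacted 4 files";          // capture the message only
        }
        // The write to the LOG file can block on disk I/O; doing it here,
        // after the mutex is released, no longer stalls every other thread
        // that needs the same mutex.
        Log(summary.c_str());
    }

    int main() {
        CompactionFinished();
        return 0;
    }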

An s.ok() check was added within DBImpl::Recover() to prevent the system from hiding a recovery log failure.
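
In sketch form (RecoverManifest() and ReplayRecoveryLog() are hypothetical stand-ins, not the actual leveldb routines), the guard keeps an earlier recovery failure from being silently replaced by the result of a later step:

    #include <cstdio>

    struct Status {                       // minimal stand-in for leveldb::Status
        bool ok_;
        bool ok() const { return ok_; }
    };

    Status RecoverManifest()   { return Status{false}; }  // pretend this step failed
    Status ReplayRecoveryLog() { return Status{true};  }

    Status Recover() {
        Status s = RecoverManifest();
        if (s.ok()) {
            // Without the s.ok() guard, the failed step above would be
            // silently overwritten by the result of the log replay.
            s = ReplayRecoveryLog();
        }
        return s;
    }

    int main() {
        std::printf("recover ok: %d\n", Recover().ok());
        return 0;
    }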

The hardcoded 75,000 key limit was raised to 300,000. The 75,000 limit led to many, many small .sst table files; a 4 terabyte database could have over 100,000 files open. This constant will likely be replaced later with a parameter that varies by level.

leveldb: db/version_set.cc

The changes parallel the work discussed for db_impl.cc above.

leveldb: util/cache2.cc

The file cache automatically closes a file that has not been actively used. The original setting closed files after 4 days. This is too short considering that AAE activates every 7 days and forces all the files open again when rebuilding hash trees. With the new 10 day setting, the idle close will therefore only trigger on systems that do not use AAE.

leveldb: util/env_posix.cc

Several file operations now have retry logic. This logic was added in response to Scott Fritchie's fault injection scenarios. It is unclear how many real world failure scenarios the logic actually handles.
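
A hedged sketch of the retry idea for one operation (a plain POSIX write(); the actual env_posix.cc changes cover several file activities and their own retry policy): retry short writes and transient errors a bounded number of times before reporting a hard failure.

    #include <cerrno>
    #include <cstddef>
    #include <unistd.h>

    // Attempt to write the full buffer, retrying short writes and transient
    // errors a bounded number of times.  Returns true on success.
    bool RetryingWrite(int fd, const char * buf, size_t count) {
        int retries = 5;                           // arbitrary bound for the sketch
        while (0 < count && 0 <= retries) {
            ssize_t written = ::write(fd, buf, count);
            if (0 < written) {
                buf   += written;                  // partial write: advance and continue
                count -= (size_t)written;
            } else if (-1 == written && (EINTR == errno || EAGAIN == errno)) {
                --retries;                         // transient error: retry
            } else {
                return false;                      // hard failure
            }
        }
        return 0 == count;
    }

    int main() {
        const char msg[] = "hello\n";
        return RetryingWrite(1, msg, sizeof(msg) - 1) ? 0 : 1;
    }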

leveldb: util/posix_logger.h

The spin lock previously present in this file has been removed. fwrite() is defined in the C standard to be thread safe.
