New spin_rw_mutex implementation with greatly improved performance (#1787)

You can see the performance in the following two tables. It's a
benchmark using our unit test spin_rw_test.cpp. We vary the ratio of
writers to readers from 1:9 (one write lock and modify for every 9 read
locks) to 1:9999.

All times are in seconds for this workload (smaller is better). The runs were
executed on Linux on a machine with 32 physical cores (64 hyperthread cores).
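To make the workload concrete, each benchmark thread does roughly the
following (an illustrative sketch only -- the names and bookkeeping here are
invented for the example, and the real spin_rw_test.cpp differs in its
details):

    // Sketch: 'iters' lock acquisitions per thread, one write lock for
    // every 'ratio' read locks (ratio = 9, 99, 999, or 9999). Readers only
    // read the shared data; the writer modifies it under exclusive access.
    static spin_rw_mutex mutex;         // from OpenImageIO/thread.h
    static long long shared_value = 0;

    void benchmark_thread (int iters, int ratio)
    {
        long long local_sum = 0;
        for (int i = 0; i < iters; ++i) {
            if (i % (ratio+1) == 0) {
                spin_rw_mutex::write_lock_guard lock (mutex);
                shared_value += 1;              // exclusive: modify
            } else {
                spin_rw_mutex::read_lock_guard lock (mutex);
                local_sum += shared_value;      // shared: read only
            }
        }
    }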

Old code:

    threads  1:9     1:99      1:999     1:9999
    --------------------------------------------
     1       0.3      0.3       0.3        0.3
     2       0.5      0.5       0.6        0.5
     4       5.4      3.4       3.2        2.8
     8       9.9      9.5      10.6        9.0
    12      12.3     13.1      11.8       12.1
    16      13.7     14.3      14.0       14.8
    20      15.4     16.2      16.1       17.5
    24      17.9     16.9      18.3       18.1
    28      20.0     21.0      21.2       20.9
    32      21.4     22.2      22.8       20.9
    64      20.9     22.4      22.2       21.4

New code (this patch):

    threads  1:9     1:99      1:999     1:9999
    --------------------------------------------
     1       0.2      0.2       0.2        0.2
     2       0.9      0.7       0.5        0.5
     4       1.4      1.0       0.8        0.8
     8       3.6      1.5       1.1        1.0
    12       5.1      2.4       1.4        1.2
    16       6.0      2.8       1.8        1.4
    20       6.8      3.6       2.1        1.7
    24       8.4      4.4       2.4        2.0
    28       9.1      5.0       2.8        2.2
    32      10.8      5.3       3.2        2.4
    64      11.8      5.8       4.2        3.0

So the performance of the new code has three interesting properties:
(a) For every thread count and every write-to-read ratio, it is superior to
the old code (the only exception: 2 threads with a workload heavily weighted
toward writers).
(b) For every workload, the new code scales better with thread count than
the old code did.
(c) Whereas the old code has similar performance regardless of workload, the
new code gets remarkably more efficient as use is dominated by readers --
there is VERY little interference between simultaneous readers.

I don't expect much in OIIO to speed up *today* as a result of this,
because there are only a couple of places where we use spin_rw_mutex. But
I'm laying the groundwork for improvements I'm making to the
ImageCache/TextureSystem, which currently doesn't use rw locks; I'm trying
out an improvement that will use them, and I think this will be a key
component of making it scale better with larger numbers of cores.
lgritz committed Oct 22, 2017
1 parent e94bc7e commit c940875
Showing 2 changed files with 119 additions and 2 deletions.
120 changes: 118 additions & 2 deletions src/include/OpenImageIO/thread.h
@@ -164,10 +164,10 @@ pause (int delay)
 // Helper class to deliver ever longer pauses until we yield our timeslice.
 class atomic_backoff {
 public:
-    atomic_backoff () : m_count(1) { }
+    atomic_backoff (int pausemax=16) : m_count(1), m_pausemax(pausemax) { }

     void operator() () {
-        if (m_count <= 16) {
+        if (m_count <= m_pausemax) {
             pause (m_count);
             m_count *= 2;
         } else {
@@ -177,6 +177,7 @@ class atomic_backoff {

 private:
     int m_count;
+    int m_pausemax;
 };
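For reference, the typical usage pattern for a backoff object like this is a
spin-acquire loop (a minimal sketch, not code from this commit -- the flag
and the acquire function are invented for illustration):

    #include <atomic>
    // atomic_backoff is the helper shown above, from OpenImageIO/thread.h.

    std::atomic<int> flag { 0 };

    void illustrative_spin_acquire ()
    {
        atomic_backoff backoff;   // pauses double each call until pausemax,
                                  // after which the timeslice is yielded
        // Retry the atomic test-and-set, backing off between attempts.
        while (flag.exchange (1, std::memory_order_acquire) != 0)
            backoff();
    }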


@@ -284,6 +285,10 @@ typedef spin_mutex::lock_guard spin_lock;



+#if 0
+
+// OLD CODE vvvvvvvv
+

 /// Spinning reader/writer mutex. This is just like spin_mutex, except
 /// that there are separate locking mechanisms for "writers" (exclusive
@@ -395,6 +400,117 @@ class spin_rw_mutex {
};


+#else
+
+// vvv New spin rw lock Oct 2017
+
+/// Spinning reader/writer mutex. This is just like spin_mutex, except
+/// that there are separate locking mechanisms for "writers" (exclusive
+/// holders of the lock, presumably because they are modifying whatever
+/// the lock is protecting) and "readers" (non-exclusive, non-modifying
+/// tasks that may access the protectee simultaneously).
+class spin_rw_mutex {
+public:
+    /// Default constructor -- initialize to unlocked.
+    ///
+    spin_rw_mutex () { }
+
+    ~spin_rw_mutex () { }
+
+    // Do not allow copy or assignment.
+    spin_rw_mutex (const spin_rw_mutex &) = delete;
+    const spin_rw_mutex& operator= (const spin_rw_mutex&) = delete;
+
+    /// Acquire the reader lock.
+    ///
+    void read_lock () {
+        // First, increment the reader count; if it turns out nobody was
+        // writing, we're done. This means that acquiring a read lock when
+        // nobody is writing is a single atomic operation.
+        int oldval = m_bits.fetch_add (1, std::memory_order_acquire);
+        if (! (oldval & WRITER))
+            return;
+        // Oops, we incremented readers but somebody was writing. Backtrack
+        // by subtracting, and do things the hard way.
+        int expected = (--m_bits) & NOTWRITER;
+
+        // Do compare-and-exchange until we can increase the number of
+        // readers by one and have no writers.
+        if (m_bits.compare_exchange_weak(expected, expected+1, std::memory_order_acquire))
+            return;
+        atomic_backoff backoff;
+        do {
+            backoff();
+            expected = m_bits.load() & NOTWRITER;
+        } while (! m_bits.compare_exchange_weak(expected, expected+1, std::memory_order_acquire));
+    }
+
+    /// Release the reader lock.
+    ///
+    void read_unlock () {
+        // Atomically reduce the number of readers. It's at least 1,
+        // and the WRITER bit should definitely not be set, so this just
+        // boils down to an atomic decrement of m_bits.
+        m_bits.fetch_sub (1, std::memory_order_release);
+    }
+
+    /// Acquire the writer lock.
+    ///
+    void write_lock () {
+        // Do compare-and-exchange until we have just ourselves as writer
+        int expected = 0;
+        if (m_bits.compare_exchange_weak(expected, WRITER, std::memory_order_acquire))
+            return;
+        atomic_backoff backoff;
+        do {
+            backoff();
+            expected = 0;
+        } while (! m_bits.compare_exchange_weak(expected, WRITER, std::memory_order_acquire));
+    }
+
+    /// Release the writer lock.
+    ///
+    void write_unlock () {
+        // Remove the writer bit
+        m_bits.fetch_sub (WRITER, std::memory_order_release);
+    }
+
+    /// Helper class: scoped read lock for a spin_rw_mutex -- grabs the
+    /// read lock upon construction, releases the lock when it exits scope.
+    class read_lock_guard {
+    public:
+        read_lock_guard (spin_rw_mutex &fm) : m_fm(fm) { m_fm.read_lock(); }
+        ~read_lock_guard () { m_fm.read_unlock(); }
+    private:
+        read_lock_guard(const read_lock_guard& other) = delete;
+        read_lock_guard& operator = (const read_lock_guard& other) = delete;
+        spin_rw_mutex & m_fm;
+    };

+    /// Helper class: scoped write lock for a spin_rw_mutex -- grabs the
+    /// write lock upon construction, releases the lock when it exits scope.
+    class write_lock_guard {
+    public:
+        write_lock_guard (spin_rw_mutex &fm) : m_fm(fm) { m_fm.write_lock(); }
+        ~write_lock_guard () { m_fm.write_unlock(); }
+    private:
+        write_lock_guard(const write_lock_guard& other) = delete;
+        write_lock_guard& operator = (const write_lock_guard& other) = delete;
+        spin_rw_mutex & m_fm;
+    };
+
+private:
+    // Use one word to hold the reader count, with a high bit indicating
+    // that it's locked for writing. This will only work if we have
+    // fewer than 2^30 simultaneous readers. I think that should hold
+    // us for some time.
+    enum { WRITER = 1<<30, NOTWRITER = WRITER-1 };
+    std::atomic<int> m_bits { 0 };
+};
+
+#endif


typedef spin_rw_mutex::read_lock_guard spin_rw_read_lock;
typedef spin_rw_mutex::write_lock_guard spin_rw_write_lock;
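For reference, a minimal usage sketch built on these typedefs (illustrative
only, not part of the patch; it assumes the OpenImageIO namespace is in
scope):

    #include <map>
    #include <string>
    #include <OpenImageIO/thread.h>

    static spin_rw_mutex table_mutex;
    static std::map<std::string, int> table;

    // Many threads may hold the read lock at once; on the fast path each
    // read acquisition is a single atomic fetch_add.
    int lookup (const std::string &key)
    {
        spin_rw_read_lock lock (table_mutex);
        auto it = table.find (key);
        return it == table.end() ? -1 : it->second;
    }

    // A writer excludes both readers and other writers while it modifies.
    void insert (const std::string &key, int value)
    {
        spin_rw_write_lock lock (table_mutex);
        table[key] = value;
    }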

1 change: 1 addition & 0 deletions src/libutil/spin_rw_test.cpp
@@ -153,6 +153,7 @@ int main (int argc, char *argv[])
         std::cout << Strutil::format ("%2d\t%s\t%5.1fs, range %.1f\t(%d iters/thread)\n",
                                       nt, Strutil::timeintervalformat(t),
                                       t, range, its);
+        std::cout.flush();
         if (! wedge)
             break;    // don't loop if we're not wedging
     }
