New spin_rw_mutex implementation with greatly improved performance #1787

lgritz · 2017-10-14T06:21:57Z

You can see the performance in the following two tables. It's a
benchmark using our unit test spin_rw_test.cpp. We vary the ratio of
writers to readers from 1:9 (one write lock and modify for every 9 read
locks) to 1:9999.

All times are in seconds for this workload (smaller is better), and
they were executed on Linux on a machine with 32 physical cores (64
hyperthread cores).
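For readers who want a feel for the shape of this workload, here is a minimal sketch, not the actual spin_rw_test.cpp. It assumes `std::shared_mutex` as a stand-in for spin_rw_mutex so that it builds with only the standard library (C++17, linked with thread support). The function name `run_rw_workload` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the benchmark loop. Each thread performs
// `iterations` lock operations; one in every (reads_per_write + 1) is a
// write lock that modifies the shared value, the rest are read locks.
// Returns the final shared value, which equals the total number of writes.
inline long long run_rw_workload(int nthreads, int iterations,
                                 int reads_per_write)
{
    std::shared_mutex mutex;      // assumption: stand-in for spin_rw_mutex
    long long shared_value = 0;   // protected by `mutex`

    auto worker = [&]() {
        long long sink = 0;       // gives the read path something to do
        for (int i = 0; i < iterations; ++i) {
            if (i % (reads_per_write + 1) == 0) {
                std::unique_lock<std::shared_mutex> lock(mutex);  // write
                ++shared_value;
            } else {
                std::shared_lock<std::shared_mutex> lock(mutex);  // read
                sink += shared_value;
            }
        }
        (void)sink;
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back(worker);
    for (auto& th : threads)
        th.join();
    return shared_value;
}
```

Calling `run_rw_workload(4, 1000, 9)` corresponds to the 1:9 column with 4 threads. Timing the call for different thread counts and ratios would produce a table of the same shape as the ones below, though not the same numbers.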

Old code:

threads  1:9     1:99      1:999     1:9999
--------------------------------------------
 1       0.3      0.3       0.3        0.3
 2       0.5      0.5       0.6        0.5
 4       5.4      3.4       3.2        2.8
 8       9.9      9.5      10.6        9.0
12      12.3     13.1      11.8       12.1
16      13.7     14.3      14.0       14.8
20      15.4     16.2      16.1       17.5
24      17.9     16.9      18.3       18.1
28      20.0     21.0      21.2       20.9
32      21.4     22.2      22.8       20.9
64      20.9     22.4      22.2       21.4

New code (this patch):

threads  1:9     1:99      1:999     1:9999
--------------------------------------------
 1       0.2      0.2       0.2        0.2
 2       0.9      0.7       0.5        0.5
 4       1.4      1.0       0.8        0.8
 8       3.6      1.5       1.1        1.0
12       5.1      2.4       1.4        1.2
16       6.0      2.8       1.8        1.4
20       6.8      3.6       2.1        1.7
24       8.4      4.4       2.4        2.0
28       9.1      5.0       2.8        2.2
32      10.8      5.3       3.2        2.4
64      11.8      5.8       4.2        3.0
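To put the two tables side by side, the sketch below computes the old/new time ratio for the 64-thread rows. The numbers are copied from the tables above; the helper names `speedup` and `print_speedups_64` are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdio>

// 64-thread rows copied from the two tables above (seconds),
// in column order 1:9, 1:99, 1:999, 1:9999.
constexpr std::array<double, 4> old_64 = { 20.9, 22.4, 22.2, 21.4 };
constexpr std::array<double, 4> new_64 = { 11.8, 5.8, 4.2, 3.0 };

// Ratio of old time to new time; values above 1 mean the new code is faster.
inline double speedup(double old_seconds, double new_seconds)
{
    return old_seconds / new_seconds;
}

// Print one line per write-to-read ratio.
inline void print_speedups_64()
{
    const char* ratios[4] = { "1:9", "1:99", "1:999", "1:9999" };
    for (std::size_t i = 0; i < old_64.size(); ++i)
        std::printf("%-7s %.1fx\n", ratios[i], speedup(old_64[i], new_64[i]));
}
```

At 64 threads this works out to roughly 1.8x, 3.9x, 5.3x and 7.1x for the four ratios, so the gain grows as the workload becomes more read-heavy.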

So the performance of the new code has three interesting properties:
(a) For nearly every thread count and write-to-read ratio, it is at
least as fast as the old code (the only exception is 2 threads with the
write-heavy 1:9 and 1:99 ratios, where it is slower). (b) For every
workload, the new code scales better with thread count than the old
code did. (c) Whereas the old code has similar performance regardless
of workload, the new code gets remarkably more efficient as the workload
becomes dominated by readers: there is VERY little interference between
simultaneous readers.
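As a generic illustration of why readers can stay out of each other's way, here is a minimal reader-writer spin lock sketch. It is not the implementation in this patch, just one common design: a single atomic word holds a reader count plus a writer bit, so a reader only needs one successful compare-and-swap and never waits on other readers. All names (`SketchSpinRWLock`, `count_with_lock`) are hypothetical, and this simple version makes no attempt to prevent writer starvation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Illustrative only: one atomic word holds a reader count in the low bits
// and a writer flag in a high bit.
class SketchSpinRWLock {
public:
    void read_lock()
    {
        for (;;) {
            int old = bits_.load(std::memory_order_relaxed);
            // Readers proceed whenever no writer holds the lock.
            if (!(old & kWriter)
                && bits_.compare_exchange_weak(old, old + 1,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed))
                return;
            std::this_thread::yield();
        }
    }
    void read_unlock() { bits_.fetch_sub(1, std::memory_order_release); }

    void write_lock()
    {
        for (;;) {
            int expected = 0;  // a writer needs no readers and no writer
            if (bits_.compare_exchange_weak(expected, kWriter,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed))
                return;
            std::this_thread::yield();
        }
    }
    void write_unlock() { bits_.store(0, std::memory_order_release); }

private:
    static constexpr int kWriter = 1 << 30;
    std::atomic<int> bits_ { 0 };
};

// Small exercise: every thread does `writes` locked increments, each
// followed by one locked read. Returns the final counter value.
inline long long count_with_lock(int nthreads, int writes)
{
    SketchSpinRWLock lock;
    long long counter = 0;  // protected by `lock`
    auto worker = [&]() {
        for (int i = 0; i < writes; ++i) {
            lock.write_lock();
            ++counter;
            lock.write_unlock();
            lock.read_lock();
            long long seen = counter;  // shared read
            lock.read_unlock();
            (void)seen;
        }
    };
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back(worker);
    for (auto& th : threads)
        th.join();
    return counter;
}
```

In this design, simultaneous readers still touch the same cache line, but none of them ever blocks another; only a writer forces everyone else to spin.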

I don't expect much in OIIO to speed up today as a result of this,
because there are only a couple of places where we use spin_rw_mutex.
But I'm laying the groundwork for some improvements to the
ImageCache/TextureSystem. It currently doesn't use rw locks, but I'm
trying out a change that will, and I think this is going to be a key
component in making it scale better with larger numbers of cores.


fpsunflower · 2017-10-19T23:15:11Z

LGTM!

lgritz merged commit c940875 into AcademySoftwareFoundation:master Oct 22, 2017

lgritz deleted the lg-rwmutex branch October 22, 2017 05:06
