New spin_rw_mutex implementation with greatly improved performance #1787

Merged
merged 1 commit into from Oct 22, 2017

Collaborator

@lgritz lgritz commented Oct 14, 2017

The two tables below show the performance, measured with a benchmark
built on our unit test spin_rw_test.cpp. We vary the ratio of write
locks to read locks from 1:9 (one write lock and modify for every 9 read
locks) to 1:9999.

All times are in seconds for this workload (smaller is better). The
runs were executed on Linux on a machine with 32 physical cores (64
hyperthread cores).
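
For readers who want a feel for the workload shape, here is a minimal sketch of a mixed read/write loop at a given ratio. It is not the code in spin_rw_test.cpp: the function name `run_mixed_workload` and its parameters are made up for illustration, and `std::shared_mutex` stands in for `spin_rw_mutex`.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical helper, not taken from spin_rw_test.cpp.
// Each thread performs `iterations` lock operations. One in every
// (reads_per_write + 1) operations takes the write lock and modifies the
// shared value; the rest take the read lock and only inspect it.
// Returns the final shared value, which equals the total number of writes.
long run_mixed_workload(int nthreads, int reads_per_write, int iterations)
{
    std::shared_mutex mutex;  // stand-in for spin_rw_mutex (assumption)
    long shared_value = 0;

    auto worker = [&]() {
        long observed = 0;
        for (int i = 0; i < iterations; ++i) {
            if (i % (reads_per_write + 1) == 0) {
                std::unique_lock<std::shared_mutex> lock(mutex);
                ++shared_value;  // the "modify" step
            } else {
                std::shared_lock<std::shared_mutex> lock(mutex);
                // The value only grows, so successive reads never go backwards.
                assert(shared_value >= observed);
                observed = shared_value;
            }
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back(worker);
    for (auto& th : threads)
        th.join();
    return shared_value;
}
```

Calling `run_mixed_workload(4, 9, 1000)` corresponds to the 1:9 column with 4 threads; wrapping the call in a `std::chrono::steady_clock` measurement would give a wall-clock time comparable in spirit, though not in absolute value, to the tables.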

Old code:

    threads  1:9     1:99      1:999     1:9999
    --------------------------------------------
     1       0.3      0.3       0.3        0.3
     2       0.5      0.5       0.6        0.5
     4       5.4      3.4       3.2        2.8
     8       9.9      9.5      10.6        9.0
    12      12.3     13.1      11.8       12.1
    16      13.7     14.3      14.0       14.8
    20      15.4     16.2      16.1       17.5
    24      17.9     16.9      18.3       18.1
    28      20.0     21.0      21.2       20.9
    32      21.4     22.2      22.8       20.9
    64      20.9     22.4      22.2       21.4

New code (this patch):

    threads  1:9     1:99      1:999     1:9999
    --------------------------------------------
     1       0.2      0.2       0.2        0.2
     2       0.9      0.7       0.5        0.5
     4       1.4      1.0       0.8        0.8
     8       3.6      1.5       1.1        1.0
    12       5.1      2.4       1.4        1.2
    16       6.0      2.8       1.8        1.4
    20       6.8      3.6       2.1        1.7
    24       8.4      4.4       2.4        2.0
    28       9.1      5.0       2.8        2.2
    32      10.8      5.3       3.2        2.4
    64      11.8      5.8       4.2        3.0
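
To put a number on the difference, the 64-thread rows of the two tables can be turned into speedup factors (old time divided by new time). The helper names below are made up for illustration; the timings are copied from the tables.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: how many times faster the new code is.
double speedup(double old_seconds, double new_seconds)
{
    assert(new_seconds > 0.0);
    return old_seconds / new_seconds;
}

// Prints the speedup for each workload in the 64-thread row.
void print_64_thread_speedups()
{
    const char* ratio[]  = { "1:9", "1:99", "1:999", "1:9999" };
    const double old_s[] = { 20.9, 22.4, 22.2, 21.4 };  // old code, 64 threads
    const double new_s[] = { 11.8, 5.8, 4.2, 3.0 };     // new code, 64 threads
    for (int i = 0; i < 4; ++i)
        std::printf("%-7s %.1fx\n", ratio[i], speedup(old_s[i], new_s[i]));
}
```

At 64 threads this works out to roughly 1.8x for the 1:9 workload and roughly 7.1x for the 1:9999 workload.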

So the performance of the new code has three interesting properties:
(a) For nearly every thread count and write-to-read ratio, it is
faster than the old code (the exception is 2 threads, where it is
slower at 1:9 and 1:99 and ties at 1:9999). (b) For every workload,
the new code scales better with thread count than the old code did.
(c) Whereas the old code has similar performance regardless of
workload, the new code gets remarkably more efficient as use is
dominated by readers, because there is VERY little interference
between simultaneous readers.
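
As a generic illustration of why simultaneous readers need not block each other, here is a minimal reader-writer spin lock sketch. It is NOT the implementation in this patch; the class name `SketchSpinRWLock` and its design (a single atomic counter, with -1 meaning "writer active") are assumptions chosen for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative reader-writer spin lock; NOT the code in this patch.
//   state_ >= 0  : number of readers currently holding the lock
//   state_ == -1 : one writer holds the lock exclusively
class SketchSpinRWLock {
public:
    void read_lock()
    {
        int s = state_.load(std::memory_order_relaxed);
        for (;;) {
            if (s < 0) {
                // A writer is active: back off and look again.
                std::this_thread::yield();
                s = state_.load(std::memory_order_relaxed);
                continue;
            }
            // No writer: try to bump the reader count from s to s + 1.
            if (state_.compare_exchange_weak(s, s + 1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
                return;
            // CAS failed; s now holds the current value, so just retry.
        }
    }

    void read_unlock()
    {
        int prev = state_.fetch_sub(1, std::memory_order_release);
        assert(prev > 0);  // must have been held by at least one reader
        (void)prev;
    }

    void write_lock()
    {
        int expected = 0;  // a writer needs no readers and no other writer
        while (!state_.compare_exchange_weak(expected, -1,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = 0;
            std::this_thread::yield();
        }
    }

    void write_unlock() { state_.store(0, std::memory_order_release); }

private:
    std::atomic<int> state_{0};
};
```

In this sketch a reader never waits for another reader's critical section, only for a writer. Readers do still contend on the single atomic counter, and a steady stream of readers can starve writers, so a production lock would need more care than this.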

I don't expect much in OIIO to speed up *today* as a result of this,
because there are only a couple of places where we use spin_rw_mutex.
But I'm laying the groundwork for improvements to the
ImageCache/TextureSystem, which currently doesn't use rw locks. I'm
trying out a change that will utilize them, and I think this will be a
key component in making it scale better with larger numbers of cores.

@fpsunflower
Contributor

LGTM!

@lgritz lgritz merged commit c940875 into AcademySoftwareFoundation:master Oct 22, 2017
@lgritz lgritz deleted the lg-rwmutex branch October 22, 2017 05:06