spinlock tweaks that finally make it as good or better than TBB. by lgritz · Pull Request #589 · AcademySoftwareFoundation/OpenImageIO

lgritz · 2013-05-04T18:18:29Z

I dove really deep into the code and finally found the reason why (in particular, on my work machines which are Linux 64 bit gcc 4.4) my own spin locks continued to be not quite as good as TBB. Actually, it was worse... for many threads, mine were faster, but for <= 4 threads, TBB was faster, and of course those are important cases:

threads   TBB    No TBB 
  1       0.4     0.9  
  2       0.5     1.1  
  4       1.6     2.2  
  8       5.6     4.8  
 12      10.3     5.9  
 16       8.9     6.2  
 20       8.0     6.1  
 24       7.2     5.2  
 28       6.9     5.7  
 32       7.1     5.7

After a LOT of experimentation, I was able to narrow it down to one teeny thing... the gcc intrinsic __sync_lock_release (&m_locked) isn't anywhere near as efficient as "by hand" having a release barrier and then an assignment. I'm not sure why. I certainly don't care why. But there it is:

                          No TBB
threads   TBB    No TBB   new                             
  1       0.4     0.9     0.4 
  2       0.5     1.1     0.5
  4       1.6     2.2     1.5
  8       5.6     4.8     3.2
 12      10.3     5.9     4.2
 16       8.9     6.2     4.3
 20       8.0     6.1     4.2
 24       7.2     5.2     4.0
 28       6.9     5.7     3.6
 32       7.1     5.7     4.0

Crazy, no? Our home-spun spinlock is now as fast as TBB for <= 4 threads, and handily beats it for >= 8 threads. (It's a 12 core machine, incidentally, which is why performance flattens out at some point, but I still like to time it over-threaded.)
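For readers who want to see the shape of the change, here is a minimal sketch of an unlock written as "a release barrier and then an assignment", as opposed to calling `__sync_lock_release (&m_locked)`. It is not the code from this patch: the class name `SketchSpinLock` is hypothetical, and it uses C++11 `std::atomic` rather than the gcc 4.4 era primitives, which is an assumption made so the example is portable.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical minimal spin lock for illustration; not the class from the patch.
class SketchSpinLock {
public:
    // Try once to take the lock; returns true if we got it.
    bool try_lock() {
        return m_locked.exchange(1, std::memory_order_acquire) == 0;
    }

    void lock() {
        // Attempt the atomic exchange; while the lock is held by someone
        // else, spin on a cheap relaxed load before trying again.
        while (!try_lock()) {
            while (m_locked.load(std::memory_order_relaxed) != 0) { }
        }
    }

    void unlock() {
        // Sanity check: only a held lock should be released.
        assert(m_locked.load(std::memory_order_relaxed) == 1);
        // "By hand" release: a release barrier, then a plain assignment.
        std::atomic_thread_fence(std::memory_order_release);
        m_locked.store(0, std::memory_order_relaxed);
    }

private:
    std::atomic<int> m_locked{0};  // 0 = free, 1 = held
};
```

Whether this form beats the intrinsic will depend on the compiler and platform, so it is worth timing both rather than assuming the result carries over.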

I also did some other cleanups, like removing the unused (badly performing) apple special cases and some other things that made debugging easier for me.

I'm really only able to test on 64 bit Linux and OSX.

So, dear readers, please please do me the favor of downloading this patch and doing the following on whatever platforms you have handy:

make nuke ; make USE_TBB=1
build/ARCH/libOpenImageIO/spinlock_test --wedge --threads 32 --trials 5
  # clip the resulting output
make nuke ; make USE_TBB=0
build/ARCH/libOpenImageIO/spinlock_test --wedge --threads 32 --trials 5
  # clip the resulting output

And post the timings here.

If it seems to hold for everybody, across various platforms and compilers (and I'm especially eager to hear about Windows [all flavors] and 32-bit Linux, neither of which I'm able to test myself), then I'll augment the patch to get rid of the remaining TBB code entirely, as we'll no longer want or need it.

lgritz · 2013-05-04T18:22:16Z

If you do the compile/timing and post here, please don't forget to say which OS, HW type, number of cores, and compiler release, so that if there are any outliers, we can focus there.

hobbes1069 · 2013-05-05T12:26:01Z

w/ TBB
hw threads = 3
threads time (best of 5)
------- ----------
 1  0.3   range 0.39 (40000000 iters/thread)
 2  1.3   range 0.80 (20000000 iters/thread)
 4  2.9   range 0.38 (10000000 iters/thread)
 8  3.2   range 0.34 (5000000 iters/thread)
12  3.2   range 0.18 (3333333 iters/thread)
16  3.4   range 0.21 (2500000 iters/thread)
20  3.5   range 0.19 (2000000 iters/thread)
24  3.5   range 0.51 (1666666 iters/thread)
28  3.0   range 0.63 (1428571 iters/thread)
32  3.6   range 0.57 (1250000 iters/thread)

w/o TBB
hw threads = 3
threads time (best of 5)

 1  0.3   range 0.60 (40000000 iters/thread)
 2  1.1   range 0.13 (20000000 iters/thread)
 4  1.5   range 0.24 (10000000 iters/thread)
 8  1.5   range 0.18 (5000000 iters/thread)
12  1.5   range 0.18 (3333333 iters/thread)
16  1.4   range 0.30 (2500000 iters/thread)
20  1.6   range 0.11 (2000000 iters/thread)
24  1.6   range 0.12 (1666666 iters/thread)
28  1.4   range 0.19 (1428571 iters/thread)
32  1.4   range 0.43 (1250000 iters/thread)

Wow! Over twice as fast with thread overload!

OS: Fedora Linux 18 - x86_64
HW: AMD Athlon(tm) II X3 455 Processor (3.3 GHz), 8GB memory
GCC: 4.7.2

You could build for i686 on a x86_64 linux install, or do you worry that
the results would be skewed? Let me know and I'll try it.

Thanks,
Richard

hobbes1069 · 2013-05-05T13:04:24Z

I can test gcc 4.8 as well if you think it would be any different.

Thanks,
Richard

hobbes1069 · 2013-05-05T16:00:00Z

Same as before but with GCC 4.8.0:
w/ TBB
hw threads = 3
threads time (best of 5)

1 0.3 range 0.01 (40000000 iters/thread)
2 1.0 range 0.11 (20000000 iters/thread)
4 1.4 range 0.12 (10000000 iters/thread)
8 1.4 range 0.31 (5000000 iters/thread)
12 1.7 range 0.14 (3333333 iters/thread)
16 1.8 range 0.13 (2500000 iters/thread)
20 1.9 range 0.11 (2000000 iters/thread)
24 1.9 range 0.21 (1666666 iters/thread)
28 2.1 range 0.07 (1428571 iters/thread)
32 2.2 range 0.06 (1250000 iters/thread)

w/o TBB
hw threads = 3
threads time (best of 5)

1 0.2 range 0.01 (40000000 iters/thread)
2 0.4 range 0.02 (20000000 iters/thread)
4 0.6 range 0.08 (10000000 iters/thread)
8 0.6 range 0.55 (5000000 iters/thread)
12 0.6 range 0.27 (3333333 iters/thread)
16 0.6 range 0.25 (2500000 iters/thread)
20 0.5 range 0.64 (2000000 iters/thread)
24 0.5 range 0.64 (1666666 iters/thread)
28 0.6 range 0.31 (1428571 iters/thread)
32 0.6 range 0.34 (1250000 iters/thread)

Richard

brechtvl · 2013-05-08T17:46:09Z

Tested on Windows 7 64 bit, on an Intel Core i7 3615QM @ 2.30 GHz (4 cores with hyperthreading), compiler is Visual Studio 2008.

I've never had TBB working on Windows, gives all kinds of linking errors that I couldn't solve easily.

No TBB

hw threads = 8
threads time (best of 5)
------- ----------
 1      3.1s      3.1s, range 0.0       (160000000 iters/thread)
 2      3.7s      3.7s, range 0.1       (80000000 iters/thread)
 4      8.6s      8.6s, range 0.5       (40000000 iters/thread)
 8      14.2s    14.2s, range 0.1       (20000000 iters/thread)
12      11.8s    11.8s, range 1.9       (13333333 iters/thread)
...
This was taking too long but you get the idea

New

hw threads = 8                                        
threads time (best of 5)                              
------- ----------                                    
 1        0.4   range 0.02      (40000000 iters/thread)
 2        0.5   range 0.01      (20000000 iters/thread)
 4        0.7   range 0.11      (10000000 iters/thread)
 8        1.4   range 0.10      (5000000 iters/thread)
12        1.4   range 0.04      (3333333 iters/thread)
16        1.0   range 0.01      (2500000 iters/thread)
20        1.0   range 0.05      (2000000 iters/thread)
24        1.0   range 0.07      (1666666 iters/thread)
28        1.0   range 0.02      (1428571 iters/thread)
32        1.0   range 0.02      (1250000 iters/thread)

lgritz · 2013-05-08T18:02:12Z

Thanks so much.

Yay, looks like a success to me.

lgritz · 2013-05-08T18:11:38Z

Incidentally, this test has proper assertions and verifies that the locks guard the accumulator variable properly, as well as timing it.
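As a rough illustration of that kind of check, the sketch below times a lock while asserting that the guarded accumulator ends up with the expected count. It is not the actual `spinlock_test` source: `FlagSpinLock`, `TrialResult`, and `time_lock` are hypothetical names, the even split of iterations across threads is inferred from the "iters/thread" column in the posted output, and treating "range" as the spread between the slowest and fastest trial is an assumption.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in lock (hypothetical); any type with lock()/unlock() would do.
struct FlagSpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { } }
    void unlock() { flag.clear(std::memory_order_release); }
};

struct TrialResult {
    double best_seconds;    // fastest trial
    double range_seconds;   // slowest minus fastest (assumed meaning of "range")
    long long accumulator;  // final count from the last trial
};

// Split total_iters evenly across nthreads and time `trials` runs.
TrialResult time_lock(int nthreads, long long total_iters, int trials) {
    const long long iters = total_iters / nthreads;
    double best = 1e300, worst = 0.0;
    long long accum = 0;
    for (int trial = 0; trial < trials; ++trial) {
        FlagSpinLock lock;
        accum = 0;
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int t = 0; t < nthreads; ++t) {
            pool.emplace_back([&] {
                for (long long i = 0; i < iters; ++i) {
                    std::lock_guard<FlagSpinLock> guard(lock);
                    ++accum;  // only safe because the lock is held
                }
            });
        }
        for (auto& th : pool) th.join();
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        // If the lock failed to guard the accumulator, updates would be lost.
        assert(accum == iters * nthreads);
        best = std::min(best, elapsed.count());
        worst = std::max(worst, elapsed.count());
    }
    return {best, worst - best, accum};
}
```

Calling `time_lock(4, 40000000, 5)` would correspond loosely to the 4-thread row of a best-of-5 run; smaller iteration counts are enough to exercise the correctness assertion.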

lgritz · 2013-05-08T18:49:14Z

Augmented the patch to change the default to USE_TBB=0, and will merge soon. We'll try this out for a short time, and if there are no complaints, I'll submit another patch that excises TBB entirely.


lgritz added 2 commits May 8, 2013 11:46

spinlock tweaks that finally make it as good or better than TBB.

14d5d8e

Turn TBB use off by default

2d340df

lgritz merged commit 2d340df into AcademySoftwareFoundation:master May 8, 2013

GerHobbelt pushed a commit to GerHobbelt/oiio that referenced this pull request Dec 10, 2024

Add minor & major version to the configuration file (AcademySoftwareF…

7822163

…oundation#589)

lgritz merged 2 commits into AcademySoftwareFoundation:master from lgritz:lg-spinlock
