
spinlock tweaks that finally make it as good or better than TBB. #589

Merged
lgritz merged 2 commits into AcademySoftwareFoundation:master from lgritz:lg-spinlock
May 8, 2013

@lgritz
Collaborator

lgritz commented May 4, 2013

I dove really deep into the code and finally found the reason why (in particular, on my work machines which are Linux 64 bit gcc 4.4) my own spin locks continued to be not quite as good as TBB. Actually, it was worse... for many threads, mine were faster, but for <= 4 threads, TBB was faster, and of course those are important cases:

threads   TBB    No TBB 
  1       0.4     0.9  
  2       0.5     1.1  
  4       1.6     2.2  
  8       5.6     4.8  
 12      10.3     5.9  
 16       8.9     6.2  
 20       8.0     6.1  
 24       7.2     5.2  
 28       6.9     5.7  
 32       7.1     5.7   

After a LOT of experimentation, I was able to narrow it down to one teeny thing... the gcc intrinsic __sync_lock_release (&m_locked) isn't anywhere near as efficient as "by hand" having a release barrier and then an assignment. I'm not sure why. I certainly don't care why. But there it is:

                          No TBB
threads   TBB    No TBB   new                             
  1       0.4     0.9     0.4 
  2       0.5     1.1     0.5
  4       1.6     2.2     1.5
  8       5.6     4.8     3.2
 12      10.3     5.9     4.2
 16       8.9     6.2     4.3
 20       8.0     6.1     4.2
 24       7.2     5.2     4.0
 28       6.9     5.7     3.6
 32       7.1     5.7     4.0

Crazy, no? Our home-spun spinlock is now as fast as TBB for <= 4 threads, and handily beats it for >= 8 threads. (It's a 12 core machine, incidentally, which is why performance flattens out at some point, but I still like to time it over-threaded.)
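
To make the change described above concrete, here is a minimal sketch of a spinlock whose unlock is written "by hand" as a release barrier followed by a plain assignment. It is an illustration only, not the code in this patch: it assumes a C++11 compiler and uses `std::atomic` in place of the gcc builtins, and the class name `SpinLock` and its members are hypothetical. The variant that measured slower would instead release with the gcc builtin `__sync_lock_release(&m_locked)` on a plain integer.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical minimal spinlock, written for illustration only.
class SpinLock {
public:
    void lock() {
        // Atomically swap in 1; if the old value was 0 we now own the lock.
        while (m_locked.exchange(1, std::memory_order_acquire) != 0) {
            // Contended: spin on cheap relaxed loads until the lock looks
            // free, then retry the exchange.
            while (m_locked.load(std::memory_order_relaxed) != 0)
                std::this_thread::yield();
        }
    }

    bool try_lock() {
        // Single attempt, no spinning.
        return m_locked.exchange(1, std::memory_order_acquire) == 0;
    }

    void unlock() {
        assert(m_locked.load(std::memory_order_relaxed) != 0
               && "unlock() called on a lock that is not held");
        // "By hand" release: an explicit release barrier, then an ordinary
        // store of 0. The fence keeps the critical section's writes from
        // being reordered after the store.
        std::atomic_thread_fence(std::memory_order_release);
        m_locked.store(0, std::memory_order_relaxed);
    }

private:
    std::atomic<int> m_locked{0};
};
```

Because the class provides `lock()`, `try_lock()`, and `unlock()`, it can be used with `std::lock_guard<SpinLock>`. Whether the fence-plus-store form is actually faster than an intrinsic release depends on the compiler and CPU, which is exactly what the timings requested below are meant to find out.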

I also did some other cleanups, like removing the unused (badly performing) apple special cases and some other things that made debugging easier for me.

I'm really only able to test on 64 bit Linux and OSX.

So, dear readers, please please do me the favor of downloading this patch and doing the following on whatever platforms you have handy:

make nuke ; make USE_TBB=1
build/ARCH/libOpenImageIO/spinlock_test --wedge --threads 32 --trials 5
  # clip the resulting output
make nuke ; make USE_TBB=0
build/ARCH/libOpenImageIO/spinlock_test --wedge --threads 32 --trials 5
  # clip the resulting output

And post the timings here.

If it seems to hold for everybody, across various platforms and compilers (and I'm especially eager to hear about Windows [all flavors] and 32-bit Linux, none of which I'm able to test myself), then I'll augment the patch to get rid of the remaining TBB code entirely, as we'll no longer want or need it.

@lgritz
Collaborator Author

lgritz commented May 4, 2013

If you do the compile/timing and post here, please don't forget to say which OS, HW type, number of cores, and compiler release, so that if there are any outliers, we can focus there.

@hobbes1069
Contributor

w/ TBB
hw threads = 3
threads time (best of 5)
------- ----------
 1        0.3   range 0.39      (40000000 iters/thread)
 2        1.3   range 0.80      (20000000 iters/thread)
 4        2.9   range 0.38      (10000000 iters/thread)
 8        3.2   range 0.34      (5000000 iters/thread)
12        3.2   range 0.18      (3333333 iters/thread)
16        3.4   range 0.21      (2500000 iters/thread)
20        3.5   range 0.19      (2000000 iters/thread)
24        3.5   range 0.51      (1666666 iters/thread)
28        3.0   range 0.63      (1428571 iters/thread)
32        3.6   range 0.57      (1250000 iters/thread)

w/o TBB
hw threads = 3
threads time (best of 5)
------- ----------
 1        0.3   range 0.60      (40000000 iters/thread)
 2        1.1   range 0.13      (20000000 iters/thread)
 4        1.5   range 0.24      (10000000 iters/thread)
 8        1.5   range 0.18      (5000000 iters/thread)
12        1.5   range 0.18      (3333333 iters/thread)
16        1.4   range 0.30      (2500000 iters/thread)
20        1.6   range 0.11      (2000000 iters/thread)
24        1.6   range 0.12      (1666666 iters/thread)
28        1.4   range 0.19      (1428571 iters/thread)
32        1.4   range 0.43      (1250000 iters/thread)

Wow! Over twice as fast with thread overload!

OS: Fedora Linux 18 - x86_64
HW: AMD Athlon(tm) II X3 455 Processor (3.3 GHz), 8GB memory
GCC: 4.7.2

You could build for i686 on a x86_64 linux install, or do you worry that
the results would be skewed? Let me know and I'll try it.

Thanks,
Richard

@hobbes1069
Contributor

I can test gcc 4.8 as well if you think it would be any different.

Thanks,
Richard

@hobbes1069
Contributor

Same as before but with GCC 4.8.0:
w/ TBB
hw threads = 3
threads time (best of 5)
------- ----------
 1        0.3   range 0.01      (40000000 iters/thread)
 2        1.0   range 0.11      (20000000 iters/thread)
 4        1.4   range 0.12      (10000000 iters/thread)
 8        1.4   range 0.31      (5000000 iters/thread)
12        1.7   range 0.14      (3333333 iters/thread)
16        1.8   range 0.13      (2500000 iters/thread)
20        1.9   range 0.11      (2000000 iters/thread)
24        1.9   range 0.21      (1666666 iters/thread)
28        2.1   range 0.07      (1428571 iters/thread)
32        2.2   range 0.06      (1250000 iters/thread)

w/o TBB
hw threads = 3
threads time (best of 5)
------- ----------
 1        0.2   range 0.01      (40000000 iters/thread)
 2        0.4   range 0.02      (20000000 iters/thread)
 4        0.6   range 0.08      (10000000 iters/thread)
 8        0.6   range 0.55      (5000000 iters/thread)
12        0.6   range 0.27      (3333333 iters/thread)
16        0.6   range 0.25      (2500000 iters/thread)
20        0.5   range 0.64      (2000000 iters/thread)
24        0.5   range 0.64      (1666666 iters/thread)
28        0.6   range 0.31      (1428571 iters/thread)
32        0.6   range 0.34      (1250000 iters/thread)

Richard

@brechtvl
Contributor

brechtvl commented May 8, 2013

Tested on Windows 7 64 bit, on an Intel Core i7 3615QM @ 2.30 GHz (4 cores with hyperthreading), compiler is Visual Studio 2008.

I've never had TBB working on Windows; it gives all kinds of linking errors that I couldn't solve easily.

No TBB

hw threads = 8
threads time (best of 5)
------- ----------
 1      3.1s      3.1s, range 0.0       (160000000 iters/thread)
 2      3.7s      3.7s, range 0.1       (80000000 iters/thread)
 4      8.6s      8.6s, range 0.5       (40000000 iters/thread)
 8      14.2s    14.2s, range 0.1       (20000000 iters/thread)
12      11.8s    11.8s, range 1.9       (13333333 iters/thread)
...
This was taking too long but you get the idea

New

hw threads = 8                                        
threads time (best of 5)                              
------- ----------                                    
 1        0.4   range 0.02      (40000000 iters/thread)
 2        0.5   range 0.01      (20000000 iters/thread)
 4        0.7   range 0.11      (10000000 iters/thread)
 8        1.4   range 0.10      (5000000 iters/thread)
12        1.4   range 0.04      (3333333 iters/thread)
16        1.0   range 0.01      (2500000 iters/thread)
20        1.0   range 0.05      (2000000 iters/thread)
24        1.0   range 0.07      (1666666 iters/thread)
28        1.0   range 0.02      (1428571 iters/thread)
32        1.0   range 0.02      (1250000 iters/thread)

@lgritz
Collaborator Author

lgritz commented May 8, 2013

Thanks so much.

Yay, looks like a success to me.

@lgritz
Collaborator Author

lgritz commented May 8, 2013

Incidentally, this test has proper assertions and verifies that the locks guard the accumulator variable properly, as well as timing it.
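
For readers who want to reproduce the idea without building the project, the following is a rough sketch of what such a check could look like. It is not the actual `spinlock_test` source. Every name in it (`FlagSpinLock`, `locked_accumulate`, `time_trials`, `Timing`) is hypothetical, and it assumes that "range" in the posted output means the spread between the slowest and fastest trial. The split of a fixed total iteration count across threads mirrors the iters/thread column in the timings above.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Stand-in lock so the sketch is self-contained (not the project's spin_mutex).
struct FlagSpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();
    }
    void unlock() { flag.clear(std::memory_order_release); }
};

// Each of `nthreads` threads adds 1 to a shared, deliberately non-atomic
// accumulator `iters` times, always under the lock. Returns the final value.
long long locked_accumulate(int nthreads, long long iters) {
    FlagSpinLock lock;
    long long accum = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&] {
            for (long long i = 0; i < iters; ++i) {
                lock.lock();
                ++accum;  // only safe because the lock is held
                lock.unlock();
            }
        });
    }
    for (auto& th : pool)
        th.join();
    return accum;
}

struct Timing {
    double best;   // fastest trial, in seconds
    double range;  // slowest minus fastest (assumed meaning of "range")
};

// Run `trials` timed trials, asserting after each one that no increment
// was lost, and report the best time and the spread.
Timing time_trials(int nthreads, long long total_iters, int trials) {
    const long long iters = total_iters / nthreads;  // work per thread
    double best = 0.0, worst = 0.0;
    for (int i = 0; i < trials; ++i) {
        auto start = std::chrono::steady_clock::now();
        long long result = locked_accumulate(nthreads, iters);
        std::chrono::duration<double> secs =
            std::chrono::steady_clock::now() - start;
        assert(result == iters * nthreads);  // the lock guarded every update
        best = (i == 0) ? secs.count() : std::min(best, secs.count());
        worst = (i == 0) ? secs.count() : std::max(worst, secs.count());
    }
    return {best, worst - best};
}
```

Calling `time_trials(n, 40000000, 5)` for n = 1, 2, 4, 8, ... would give a wedge with the same shape as the tables in this thread. If the lock were broken, the accumulator would typically end up short and the assertion would fire.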

@lgritz
Collaborator Author

lgritz commented May 8, 2013

Augmented the patch to change the default to USE_TBB=0; I will merge it soon. We'll try this out for a short time, and if there are no complaints, I'll submit another patch that excises TBB entirely.

@lgritz lgritz merged commit 2d340df into AcademySoftwareFoundation:master May 8, 2013