spinlock tweaks that finally make it as good or better than TBB. #589
lgritz merged 2 commits into AcademySoftwareFoundation:master
If you do the compile/timing and post here, please don't forget to say which OS, HW type, number of cores, and compiler release, so that if there are any outliers, we can focus there.
12 3.2 range 0.18 (3333333 iters/thread)
w/o TBB
12 1.5 range 0.18 (3333333 iters/thread)
Wow! Over twice as fast with thread overload!
OS: Fedora Linux 18 - x86_64
You could build for i686 on an x86_64 Linux install, or do you worry that
Thanks,
I can test gcc 4.8 as well if you think it would be any different.
Thanks,
Same as before but with GCC 4.8.0:
1 0.3 range 0.01 (40000000 iters/thread)
w/o TBB
1 0.2 range 0.01 (40000000 iters/thread)
Richard
Tested on Windows 7 64-bit, on an Intel Core i7 3615QM @ 2.30 GHz (4 cores with hyperthreading), compiled with Visual Studio 2008. I've never had TBB working on Windows; it gives all kinds of linking errors that I couldn't solve easily.
No TBB New
Thanks so much. Yay, looks like a success to me.
Incidentally, this test has proper assertions and verifies that the locks guard the accumulator variable properly, as well as timing it.
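To illustrate what such a check looks like, here is a minimal sketch, not the actual test in the patch. The names `TinySpinLock`, `locked_accumulate`, and `timed_accumulate` are hypothetical, and it assumes a C++11 compiler with `std::thread` rather than whatever threading layer the real test uses. Each thread increments a shared accumulator under the lock; if the lock fails to guard the variable, increments are lost and the final total comes up short.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-in spin lock built on std::atomic_flag (an assumption,
// not the lock from the patch).
class TinySpinLock {
public:
    void lock() {
        // Spin until we are the thread that flips the flag from clear to set.
        while (m_flag.test_and_set(std::memory_order_acquire)) {
        }
    }
    void unlock() { m_flag.clear(std::memory_order_release); }

private:
    std::atomic_flag m_flag = ATOMIC_FLAG_INIT;
};

// Run nthreads threads, each adding 1 to a shared accumulator iters times
// while holding the lock. Returns the final accumulator value.
long long locked_accumulate(int nthreads, long long iters) {
    TinySpinLock lock;
    long long accum = 0;  // deliberately NOT atomic: the lock must protect it
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&lock, &accum, iters]() {
            for (long long i = 0; i < iters; ++i) {
                lock.lock();
                ++accum;
                lock.unlock();
            }
        });
    }
    for (auto& th : threads)
        th.join();
    return accum;
}

// Time one run in seconds and assert that no increments were lost.
double timed_accumulate(int nthreads, long long iters) {
    auto start = std::chrono::steady_clock::now();
    long long total = locked_accumulate(nthreads, iters);
    auto stop = std::chrono::steady_clock::now();
    assert(total == nthreads * iters);
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `timed_accumulate(n, total_iters / n)` for increasing `n` keeps the total work fixed while varying contention, which matches the shape of the timings posted in this thread (for example, 40000000 iterations on 1 thread versus 3333333 per thread on 12).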
Augmented the patch to change the default to USE_TBB=0, and will soon merge. For a short time we'll try this out, and if there are no complaints, I'll submit another patch that excises TBB entirely.
I dove really deep into the code and finally found the reason why (in particular, on my work machines which are Linux 64 bit gcc 4.4) my own spin locks continued to be not quite as good as TBB. Actually, it was worse... for many threads, mine were faster, but for <= 4 threads, TBB was faster, and of course those are important cases:
After a LOT of experimentation, I was able to narrow it down to one teeny thing... the gcc intrinsic __sync_lock_release (&m_locked) isn't anywhere near as efficient as "by hand" having a release barrier and then an assignment. I'm not sure why. I certainly don't care why. But there it is:
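A minimal sketch of the two unlock variants being compared, assuming a GCC-compatible compiler. The class name `SpinLockSketch` and its method names are hypothetical, and the by-hand variant uses `__atomic_thread_fence(__ATOMIC_RELEASE)` as the release barrier, which is an assumption: it needs GCC 4.7 or later, and the exact barrier used in the patch may differ. On x86 that fence emits no instruction and only stops the compiler from reordering across it, so the unlock reduces to a plain store.

```cpp
#include <cassert>

// Hypothetical spin lock using GCC/Clang builtins; illustrative only.
class SpinLockSketch {
public:
    SpinLockSketch() : m_locked(0) {}

    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() {
        // Atomically writes 1 and returns the previous value (acquire barrier).
        return __sync_lock_test_and_set(&m_locked, 1) == 0;
    }

    void lock() {
        while (!try_lock()) {
            // Spin on a plain read until the lock looks free, then retry.
            while (m_locked) {
            }
        }
    }

    // Variant A: the intrinsic described above as the slower option.
    void unlock_intrinsic() { __sync_lock_release(&m_locked); }

    // Variant B: "by hand", a release barrier followed by an assignment.
    void unlock_by_hand() {
        __atomic_thread_fence(__ATOMIC_RELEASE);  // assumed barrier, see lead-in
        m_locked = 0;
    }

private:
    volatile int m_locked;
};
```

Both variants leave the lock in the same state; any speed difference comes from the instructions the compiler emits for each, so it is worth timing both on the target compiler rather than assuming one is faster.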
Crazy, no? Our home-spun spinlock is now as fast as TBB for <= 4 threads, and handily beats it for >= 8 threads. (It's a 12 core machine, incidentally, which is why performance flattens out at some point, but I still like to time it over-threaded.)
I also did some other cleanups, like removing the unused (badly performing) Apple special cases and a few other changes that made debugging easier for me.
I'm really only able to test on 64 bit Linux and OSX.
So, dear readers, please please do me the favor of downloading this patch and doing the following on whatever platforms you have handy:
And post the timings here.
If it seems to hold for everybody, across various platforms and compilers (and I'm especially eager to hear about Windows [all flavors] and 32-bit Linux, none of which I'm able to test myself), then I'll augment the patch to get rid of the remaining TBB code entirely, as we'll no longer want or need it.