
Make atomic_test and spinlock_test run for many more iterations, and time their results. #273

Merged (5 commits, Jun 20, 2012)

Conversation

@lgritz (Collaborator) commented Apr 3, 2012

Make atomic_test and spinlock_test run for many more iterations, and time their results.

This allows us to rigorously compare the speed of our atomics and spin locks.
Also make their output a little neater by locking around the status printouts, and
eliminate the #if's that provide safety for Boost < 1.35, which is no longer supported.

@hobbes1069 (Contributor)

Here are some interesting preliminary results on my system:

Linux x86_64 (3.3.0 kernel)
AMD 3.4GHz triple core
4GB system memory

atomic test w/ TBB 1m 24.3s
atomic test w/o TBB 0m 33.8s (yes, about 40% of the time with TBB)
spinlock test w/ TBB 1m 21.3s
spinlock test w/o TBB still running

I'm not sure what's going on, but the spinlock test w/o TBB has easily gone over 5 minutes, maxing out all cores on my system, and still hasn't finished.

If you want the full logs let me know and I'll email them to you.

@hobbes1069 (Contributor)

Forgot to mention: I'm using the system TBB, version 3.0.

@hobbes1069 (Contributor)

spinlock test w/o TBB finally finished:
20m 6.2s

@2bits commented Apr 3, 2012

Here are my results on OSX Lion, 2.44 GHz Core i5 with 8GB RAM:

With USE_TBB=1 and TBB 4.0u3 installed (doesn't it use an internal TBB, though?):

33/40 Test #33: unit_atomic ......................   Passed   22.83 sec
34/40 Test #34: unit_spinlock ....................   Passed   19.21 sec

When setting USE_TBB=0:

33/40 Test #33: unit_atomic ......................   Passed   23.15 sec
34/40 Test #34: unit_spinlock ....................   Passed  135.88 sec

@hobbes1069 (Contributor)

Since you mentioned system vs. bundled TBB, I tried again with the bundled version and got essentially the same results as with the system TBB:

23/30 Test #23: unit_atomic ......................   Passed   83.91 sec
24/30 Test #24: unit_spinlock ....................   Passed   75.49 sec

@lgritz (Collaborator, Author) commented Apr 3, 2012

I'm seeing similar results myself, on both OSX and Linux: unit_atomic is about the same speed overall, but unit_spinlock is much slower. But I think I know why, and am working on a fix.

I really need to know if this holds for Windows, too.

@hobbes1069 (Contributor)

My work machine is Windows XP, but I can't install a build environment on it. If someone could supply a "ready-to-run" zip package, I'd be more than happy to try it.

@robert-matusewicz
These are the results for atomic_test on my Windows machine (2 hardware threads):
USE TBB: 1min54sec
NO TBB: 1min58sec

This was a release build linked against tbb-21_20090511oss (quite old).

I'll post the results for spinlock_test in the evening.

@lgritz (Collaborator, Author) commented Apr 4, 2012

I pushed an update that, like TBB, uses pause and yield to keep the spin loops from being quite so contentious. This gets us much closer to TBB performance for spin_lock. Can you give it a try and see if it improves things for you as well?
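
Roughly, the pattern is the following (a minimal sketch, assuming x86 and a POSIX system; the helper name is illustrative, not the actual thread.h code):

    #include <emmintrin.h>   // _mm_pause (SSE2)
    #include <sched.h>       // sched_yield (POSIX)

    // Hypothetical helper: for short waits, issue CPU "pause" hints;
    // once we've been waiting a while, give up our timeslice instead.
    inline void spin_pause (int delay)
    {
        if (delay < 32) {
            for (int i = 0; i < delay; ++i)
                _mm_pause ();    // hint to the CPU that this is a spin-wait
        } else {
            sched_yield ();      // long wait: let another thread run
        }
    }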

@hobbes1069 (Contributor)

HW Threads: 3

atomic_test:   36.0s
spinlock_test: 7m 27.0s

Definitely a lot better, but why is OSX so much faster than this?

@lgritz (Collaborator, Author) commented Apr 4, 2012

Here are my times:

OSX 10.6, 64 bit (2 cores, hyperthreaded):
atomic 21.7 TBB, 22.7 intrinsic Apple, 21.7 gcc intrinsics
spin 23.5 TBB, 1:10.9 intrinsic Apple, 33.5 gcc intrinsics

Linux 2.6.34 kernel, 64 bit (12 cores, hyperthreaded):
atomic 0:37.5 TBB, 0:36.9 gcc
spin 7:04.3 TBB, 3:46 gcc

So I'm also seeing MUCH slower times for spin on Linux than OSX. I don't know if that's due to the kernel, the particular CPU model, or simply the fact that there are more hardware cores and CPUs (can that slow things down in cases of pathological contention?). Anybody have a good guess?

These two tests are very stressful; they're not a typical use pattern that we'd see in real software. So I'm mostly satisfied that removing TBB will not be a performance penalty for Linux or OSX (and, additionally, that there's no gain from using the OSX-specific spinlock code).

I still need hard numbers for spinlock on Windows from somebody. If those are not significantly worse without TBB, I will remove TBB support entirely.

Inline code review on the diff:

    {
    if (delay < 32) {
    #if USE_TBB
    __TBB_Pause(count);

(Contributor) This does not compile due to a typo; it should be:

    __TBB_Pause(delay);

(Collaborator, Author) fixed

@edgarv (Contributor) commented Apr 4, 2012

Windows 7 64-bit, MSVC 2010. 4 cores hyperthreaded (i7-2720QM)

USE TBB:

unit_atomic   51.3s
unit_spinlock 4m 7.5s

NO TBB:

unit_atomic   51.5s
unit_spinlock 6m 7.6s

@edgarv (Contributor) commented Apr 4, 2012

Linux 2.6.42 64-bit (Fedora 15) running as a VirtualBox guest with 4 cores. The host is the same system from the previous post.

USE TBB:

unit_atomic   1m 1.4s
unit_spinlock 26.6s

NO TBB:

unit_atomic   1m 4.2s
unit_spinlock 1m 38.4s

@2bits commented Apr 4, 2012

osx @ 660b36b, 2.44 GHz Core i5, 8 GB RAM

no tbb

33/40 Test #33: unit_atomic ......................   Passed   23.04 sec
34/40 Test #34: unit_spinlock ....................   Passed   48.33 sec

use_tbb=1

33/40 Test #33: unit_atomic ......................   Passed   22.83 sec
34/40 Test #34: unit_spinlock ....................   Passed   19.13 sec

@lgritz (Collaborator, Author) commented Apr 4, 2012

OK, so here's where I think we're at: unit_atomic seems to be the same speed with and without TBB on all platforms, though it varies more than we'd expect across CPUs and OSes (even on the same CPU). unit_spinlock definitely seems slower without TBB, but not by a huge amount on most platforms tested (and it varies even more by CPU and OS).

What do you guys think?

These tests pathologically induce the highest possible amount of contention. It's not clear that you'd see a practical difference in a real application.

I think we have a few choices: (1) totally rip out TBB; (2) keep the TBB code but turn it off by default; (3) turn it off by default for platforms where there doesn't seem to be a performance hit, but leave it enabled by default for those with big apparent performance differences.

@hobbes1069 (Contributor)

I think option 2 is the best balance. It will quietly get the desired result without making it impossible for someone to change the behavior. I would leave it that way for a while and if no one complains then it could later be cleaned out entirely.

I'm currently only enabling it on 64-bit systems for Fedora because, for some reason, OIIO fails to build against the system-installed TBB on i686.

@lgritz (Collaborator, Author) commented Apr 4, 2012

Further refinement for #2:

2(a): Turn it off by default (USE_TBB=1 still re-enables it) and leave the TBB source code as it is.

2(b): Turn it off by default and remove all TBB source code from our repository; if somebody wants USE_TBB=1, they are expected to have TBB installed on their system.

@hobbes1069 (Contributor)

Obviously 2b is fine with me, since I'm not allowed to use bundled libs in Fedora anyway. And should OIIO be shipping with something that's disabled by default?

I think the only thing that would need checking is whether the USE_EXTERNAL_TBB option I patch in for Fedora works for non-*nix systems, or could be made to work easily. I can also supply a FindTBB.cmake I wrote, but it will definitely need to be updated for Windows.

@hobbes1069 (Contributor)

Oops! Now that I look at it, I'd forgotten that I got rid of my FindTBB config file in favor of a more complete one I found online; it should work for pretty much everything:

http://code.google.com/p/findtbb/

@robert-matusewicz
I like the 2b solution: if someone wants to use TBB, it's reasonable to expect TBB to be installed on the system. This is especially true for Windows, where even now, when using the bundled TBB, users need to explicitly link against the release or debug library.

FindTBB.cmake could be added to our cmake modules directory.

edit: I accidentally closed the pull request; reopening.

@lgritz (Collaborator, Author) commented Jun 12, 2012

Hi, after some weekend reading, I had some ideas for how to improve the spin locks and just pushed an update in which, on my two machines (a Mac OS X laptop, and a 12-core Linux box), my non-TBB spinlock now outperforms the TBB spinlocks (by a significant margin on Linux!).

Could those of you on this thread please try this out? Check out the branch for this pull request, do "make nuke ; make USE_TBB=1 ; make test TEST=spinlock" and then the same thing again with USE_TBB=0 (be sure to do a full make nuke first).

If the rest of you find the same thing -- that the USE_TBB=0 benchmark is no slower than the USE_TBB=1 -- then perhaps we can finally retire TBB.

For the curious, the three improvements were:

(a) exponential backoff for spin lock contention, in the same manner as TBB;

(b) try_lock can spin more efficiently with read-only (non-bus-locking) checks until there's a good chance that the lock was released. In particular, you should not spin like this:

while (! try_lock(&thelock)) ;

but instead

while (! try_lock(&thelock))
    while (thelock) ;

This is because try_lock does a compare-and-swap, which is a writing operation and therefore locks the bus. In the latter version, if the initial (bus-locking) try_lock fails, it does read-only (non-bus-locking!) checks until the lock appears to be free, then attempts the safe, bus-locking operation again.

(c) on x86_64, it's safe for spin_lock::unlock() to do a normal unlocked write. The memory ordering of these chips is such that it doesn't need the memory fences of a full atomic write. I found numerous sources for this on the net, and it is a big win. Apparently it also works on some 32-bit x86s (most of the recent chips), but I'm not very interested in sorting out exactly which x86 chips it's safe on, especially since anybody heavily threaded enough to care about this optimization is on a 64-bit system.
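
For illustration, here's a rough sketch of how (a) and (b) fit together, using gcc __sync builtins and a conservative unlock; the names are illustrative, not the actual thread.h implementation:

    #include <emmintrin.h>   // _mm_pause
    #include <sched.h>       // sched_yield

    class spin_mutex_sketch {
    public:
        spin_mutex_sketch () : m_locked(0) { }
        void lock () {
            int delay = 1;
            // Bus-locking test-and-set, retried only after the read-only
            // spin below suggests the lock has been released.
            while (__sync_lock_test_and_set (&m_locked, 1)) {
                do {
                    // (a) exponential backoff: pause a little longer each
                    // time, eventually yielding to the OS scheduler.
                    if (delay < 32) {
                        for (int i = 0; i < delay; ++i)
                            _mm_pause ();
                        delay *= 2;
                    } else {
                        sched_yield ();
                    }
                } while (m_locked);   // (b) read-only check, no bus lock
            }
        }
        void unlock () {
            __sync_lock_release (&m_locked);   // release semantics
        }
    private:
        volatile int m_locked;
    };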

OK, so let me know! If it works, it's a performance improvement as well as a way for us to shed the pesky TBB dependency. Win, win.

@hobbes1069 (Contributor)

With TBB
Test #25: unit_spinlock .................... Passed 82.20 sec

Without TBB I'm getting a failed test:

/home/build/rpmbuild/oiio-test/BUILD/lgritz-oiio-18410b6/src/libOpenImageIO/spinlock_test.cpp:93:
FAILED: a == (int)(numthreads * iterations)
values were '1598735224' and '1600000000'
Time: 20.7s

@lgritz (Collaborator, Author) commented Jun 12, 2012

Hmmm... maybe my unlock() optimization is not as safe as I thought.

Richard, does this happen consistently?

If you change the line in thread.h from

    *(int *)&m_locked = 0;

to

    m_locked = 0;

does that make it pass the test? (Though I expect it to be slower.)

What kind of machine is it? (CPU model, number of CPUs, number of cores)
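
For context, here's a hypothetical illustration of the difference between the two variants (not the actual thread.h code; assumes gcc and that m_locked is normally accessed atomically):

    volatile int m_locked = 1;

    void unlock_fast ()    // the x86_64 "optimization" in question
    {
        // Plain, unfenced store: it relies entirely on the hardware's
        // store ordering, with no compiler or CPU barrier issued.
        *(int *)&m_locked = 0;
    }

    void unlock_safe ()    // the conservative version
    {
        // Release operation: issues whatever barriers the platform
        // requires, so writes made inside the critical section are
        // visible before the lock appears free.
        __sync_lock_release (&m_locked);
    }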

@lgritz (Collaborator, Author) commented Jun 12, 2012

I hate this shit. How could C++ not define atomic counters, spin locks, and blocking mutexes as part of the language, and have any compiler worth its salt ensure that it's correct and using the fastest possible implementation? (I know, C++11 has atomics, but it's 15 years too late, and I think it still doesn't have a spin mutex.)
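
(For what it's worth, C++11's std::atomic_flag is enough to roll a minimal spin mutex by hand; a sketch, not something this patch can rely on:)

    #include <atomic>

    class spin_mutex_cpp11 {
    public:
        void lock ()   { while (m_flag.test_and_set (std::memory_order_acquire)) { } }
        void unlock () { m_flag.clear (std::memory_order_release); }
    private:
        std::atomic_flag m_flag = ATOMIC_FLAG_INIT;
    };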

@hobbes1069 (Contributor)

The test failed every time; I think I ran it 3 or 4 times.

Making the change you suggested (actually, I just took out the #if x86_64 clause) let the test complete:
Time: 2m 12.4s

Here's the CPU info; of course, it's repeated two more times for the other two cores:

$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 5
model name : AMD Athlon(tm) II X3 455 Processor
stepping : 3
microcode : 0x10000c8
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 3
core id : 0
cpu cores : 3
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6630.20
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

@lgritz (Collaborator, Author) commented Jun 12, 2012

I wonder if that's an Intel vs AMD issue. In any case, it doesn't seem safe to do that particular optimization. I will push an update to this patch that removes it.

@hobbes1069 (Contributor)

I assume it's possible, but is it practical or advantageous to investigate how TBB handles this?

@lgritz (Collaborator, Author) commented Jun 15, 2012

I've looked at the TBB code extensively. It's very convoluted, but I think I'm doing basically the same thing overall. They do not appear to use the locking improvement (which seems to help and to be safe) or the unlock "improvement" (which turned out not to be safe, so I have removed it).

@lgritz (Collaborator, Author) commented Jun 15, 2012

Anybody else want to comment or report timings?

Are there any objections to my committing this to master (i.e., improving the non-TBB case, but not yet changing the default for USE_TBB)?

I'm also still open to changing USE_TBB to default to 0, or even removing TBB entirely, if everybody's timings show that non-TBB is not significantly worse than TBB across a wide variety of hardware and OS configurations.

Commit messages in this pull request:

- Make atomic_test and spinlock_test run for many more iterations, and time their results.
- … spinning too tightly. Also, disable the Apple OSSpinLock calls; they seem even slower than our own code.
- Two improvements: (a) exponential backoff for spin lock contention; (b) try_lock can spin more efficiently with read-only (non-bus-locking) checks until there's a good chance that the lock was released.
lgritz added a commit that referenced this pull request Jun 20, 2012
Substantial improvements to spin_mutex when USE_TBB=0.
@lgritz merged commit 4337147 into AcademySoftwareFoundation:master Jun 20, 2012
@lgritz (Collaborator, Author) commented Jun 20, 2012

I've merged this set of changes. Note that the default is still USE_TBB=1, but this substantially improves the performance of the code if you compile with USE_TBB=0. I'm still gathering benchmarks and testing; it's possible that very soon we will reach consensus that the right thing to do is change the default to the new code (USE_TBB=0). It's also possible that eventually we will make that the only choice and rip out the TBB code entirely (though for now, I think it's useful to have both for comparison purposes).

Also note that if you use both OSL and OIIO, you will probably get lots of grief if you compile those packages with differing USE_TBB settings.
