
Make atomic_test and spinlock_test run for many more iterations, and time their results. #273

Merged (5 commits, Jun 20, 2012)

Conversation

@lgritz (Collaborator) commented Apr 3, 2012

Make atomic_test and spinlock_test run for many more iterations, and time their results.

This allows us to rigorously compare the speed of our atomics and spin locks.
Also make their output a little neater by locking around the status printouts, and
eliminate the #if's that provide safety for Boost < 1.35, which is no longer supported.

@hobbes1069 (Contributor)

Here are some interesting preliminary results on my system:

Linux x86_64 (3.3.0 kernel)
AMD 3.4GHz triple core
4GB system memory

atomic test w/ TBB 1m 24.3s
atomic test w/o TBB 0m 33.8s (yes, about 40% of the time with TBB)
spinlock test w/ TBB 1m 21.3s
spinlock test w/o TBB still running

I'm not sure what's going on, but the spinlock test w/o TBB has easily gone over 5 minutes, maxing out all cores on my system, and still hasn't finished.

If you want the full logs let me know and I'll email them to you.

@hobbes1069 (Contributor)

Forgot to mention: I'm using the system TBB, version 3.0.

@hobbes1069 (Contributor)

spinlock test w/o TBB finally finished:
20m 6.2s

@2bits commented Apr 3, 2012

Here are my results on OSX Lion, 2.44 GHz Core i5 with 8GB RAM:

With USE_TBB=1 and TBB 4.0u3 installed (doesn't it use an internal TBB, though?):

33/40 Test #33: unit_atomic ......................   Passed   22.83 sec
34/40 Test #34: unit_spinlock ....................   Passed   19.21 sec

When setting USE_TBB=0:

33/40 Test #33: unit_atomic ......................   Passed   23.15 sec
34/40 Test #34: unit_spinlock ....................   Passed  135.88 sec

@hobbes1069 (Contributor)

Since you mentioned system vs. bundled TBB, I tried again with the bundled version and got essentially the same results as with the system TBB:

23/30 Test #23: unit_atomic ......................   Passed   83.91 sec
24/30 Test #24: unit_spinlock ....................   Passed   75.49 sec

@lgritz (Collaborator, Author) commented Apr 3, 2012

I'm seeing similar results myself, on both OSX and Linux: unit_atomic is about the same speed overall, but unit_spinlock is much slower. But I think I know why, and am working on a fix.

I really need to know if this holds for Windows, too.

@hobbes1069 (Contributor)

My work machine is Windows XP, but I can't install a build environment on it. If someone could supply a "ready-to-run" zip package, I'd be more than happy to try it.

@robert-matusewicz
These are the results for atomic_test on my Windows machine (2 hardware threads):
USE TBB: 1min54sec
NO TBB: 1min58sec

This was a release build linked against tbb-21_20090511oss (quite old).

I'll post the results for spinlock_test in the evening.

@lgritz (Collaborator, Author) commented Apr 4, 2012

I pushed an update that, like TBB, uses pause and yield to keep the spin loops from being quite so contentious. This gets us much closer to TBB performance for spin_lock. Can you give it a try and see if it improves things for you as well?
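
Roughly, the pattern is the following (a minimal sketch, assuming x86 and a POSIX system; the helper name is illustrative, not the actual thread.h code):

    #include <emmintrin.h>   // _mm_pause (SSE2)
    #include <sched.h>       // sched_yield (POSIX)

    // Hypothetical helper: for short waits, issue CPU "pause" hints;
    // once we've been waiting a while, give up our timeslice instead.
    inline void spin_pause (int delay)
    {
        if (delay < 32) {
            for (int i = 0; i < delay; ++i)
                _mm_pause ();    // hint to the CPU that this is a spin-wait
        } else {
            sched_yield ();      // long wait: let another thread run
        }
    }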

@hobbes1069 (Contributor)

HW Threads: 3

atomic_test:   36.0s
spinlock_test: 7m 27.0s

Definitely a lot better, but why is OSX so much faster than this?

@lgritz (Collaborator, Author) commented Apr 4, 2012

Here are my times:

OSX 10.6, 64 bit (2 cores, hyperthreaded):
atomic 21.7 TBB, 22.7 intrinsic Apple, 21.7 gcc intrinsics
spin 23.5 TBB, 1:10.9 intrinsic Apple, 33.5 gcc intrinsics

Linux 2.6.34 kernel, 64 bit (12 cores, hyperthreaded):
atomic 0:37.5 TBB, 0:36.9 gcc
spin 7:04.3 TBB, 3:46 gcc

So I'm also seeing MUCH slower times for spin on Linux than OSX. I don't know if that's due to the kernel, the particular CPU model, or simply the fact that there are more hardware cores and CPUs (can that slow things down in cases of pathological contention?). Anybody have a good guess?

These two tests are very stressful; they're not a typical use pattern that we'd see in real software. So I'm mostly satisfied that removing TBB will not be a performance penalty for Linux or OSX (and, additionally, that there's no gain from using the OSX-specific spinlock code).

I still need hard numbers for spinlock on Windows from somebody. If those are not significantly worse without TBB, I will remove TBB support entirely.

Inline code review on the diff:

    {
    if (delay < 32) {
    #if USE_TBB
    __TBB_Pause(count);

(Contributor) This does not compile due to a typo; it should be:

    __TBB_Pause(delay);

(Collaborator, Author) fixed

@edgarv (Contributor) commented Apr 4, 2012

Windows 7 64-bit, MSVC 2010. 4 cores hyperthreaded (i7-2720QM)

USE TBB:

unit_atomic   51.3s
unit_spinlock 4m 7.5s

NO TBB:

unit_atomic   51.5s
unit_spinlock 6m 7.6s

@edgarv (Contributor) commented Apr 4, 2012

Linux 2.6.42 64-bit (Fedora 15) running as a VirtualBox guest with 4 cores. The host is the same system from the previous post.

USE TBB:

unit_atomic   1m 1.4s
unit_spinlock 26.6s

NO TBB:

unit_atomic   1m 4.2s
unit_spinlock 1m 38.4s

@2bits commented Apr 4, 2012

osx @ 660b36b, 2.44 GHz Core i5, 8 GB RAM

no tbb

33/40 Test #33: unit_atomic ......................   Passed   23.04 sec
34/40 Test #34: unit_spinlock ....................   Passed   48.33 sec

use_tbb=1

33/40 Test #33: unit_atomic ......................   Passed   22.83 sec
34/40 Test #34: unit_spinlock ....................   Passed   19.13 sec

@lgritz (Collaborator, Author) commented Apr 4, 2012

OK, so here's where I think we're at: unit_atomic seems to be the same speed with and without TBB on all platforms, though it varies more than we'd expect across CPUs and OSes (even on the same CPU). unit_spinlock definitely seems slower without TBB, but not by a huge amount on most platforms tested (and it varies even more by CPU and OS).

What do you guys think?

These tests pathologically induce the highest possible amount of contention. It's not clear that you'd see a practical difference in a real application.

I think we have a few choices: (1) totally rip out TBB; (2) keep the TBB code but turn it off by default; (3) turn it off by default for platforms where there doesn't seem to be a performance hit, but leave it enabled by default for those with big apparent performance differences.

@hobbes1069 (Contributor)

I think option 2 is the best balance. It will quietly get the desired result without making it impossible for someone to change the behavior. I would leave it that way for a while and if no one complains then it could later be cleaned out entirely.

I'm currently only enabling it on 64-bit systems for Fedora because, for some reason, OIIO fails to build against the system-installed TBB on i686.

@lgritz (Collaborator, Author) commented Apr 4, 2012

Further refinement for #2:

2(a): Turn it off by default (USE_TBB=1 still re-enables it) and leave the TBB source code as it is.

2(b): Turn it off by default and remove all TBB source code from our repository; if somebody wants USE_TBB=1, they are expected to have TBB installed on their system.

@hobbes1069 (Contributor)

Obviously 2b is fine with me, since I'm not allowed to use bundled libs in Fedora anyway. And should OIIO be shipping with something that's disabled by default?

I think the only thing that would need checking is whether the USE_EXTERNAL_TBB option I patch in for Fedora works for non-*nix systems, or could be made to work easily. I can also supply a FindTBB.cmake I wrote, but it will definitely need to be updated for Windows.

@hobbes1069 (Contributor)

Oops! Now that I look at it, I'd forgotten that I got rid of my FindTBB config file in favor of a more complete one I found online; it should work for pretty much everything:

http://code.google.com/p/findtbb/

@robert-matusewicz
I like the 2b solution: if someone wants to use TBB, it's reasonable to expect TBB to be installed on the system. This is especially true for Windows, where even now, when using the bundled TBB, users need to explicitly link against the release or debug library.

FindTBB.cmake could be added to our cmake modules directory.

edit: I accidentally closed the pull request; reopening.

@lgritz (Collaborator, Author) commented Jun 12, 2012

Hi, after some weekend reading, I had some ideas for how to improve the spin locks and just pushed an update in which, on my two machines (a Mac OS X laptop, and a 12-core Linux box), my non-TBB spinlock now outperforms the TBB spinlocks (by a significant margin on Linux!).

Could those of you on this thread please try this out? Check out the branch for this pull request, do "make nuke ; make USE_TBB=1 ; make test TEST=spinlock" and then the same thing again with USE_TBB=0 (be sure to do a full make nuke first).

If the rest of you find the same thing -- that the USE_TBB=0 benchmark is no slower than the USE_TBB=1 -- then perhaps we can finally retire TBB.

For the curious, the three improvements were:

(a) exponential backoff for spin lock contention, in the same manner as TBB;

(b) try_lock can spin more efficiently with read-only (non-bus-locking) checks until there's a good chance that the lock was released. In particular, you should not spin like this:

while (! try_lock(&thelock)) ;

but instead

while (! try_lock(&thelock))
    while (thelock) ;

This is because try_lock does a compare-and-swap, which is a writing operation and therefore locks the bus. In the latter version, if the initial (bus-locking) try_lock fails, it does read-only (non-bus-locking!) checks until the lock appears to be free, then attempts the safe, bus-locking operation again.

(c) on x86_64, it's safe for spin_lock::unlock() to do a normal unlocked write. The memory ordering of these chips is such that it doesn't need the memory fences of a full atomic write. I found numerous sources for this on the net, and it is a big win. Apparently it also works on some 32-bit x86s (most of the recent chips), but I'm not very interested in sorting out exactly which x86 chips it's safe on, especially since anybody heavily threaded enough to care about this optimization is on a 64-bit system.
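
For illustration, here's a rough sketch of how (a) and (b) fit together, using gcc __sync builtins and a conservative unlock; the names are illustrative, not the actual thread.h implementation:

    #include <emmintrin.h>   // _mm_pause
    #include <sched.h>       // sched_yield

    class spin_mutex_sketch {
    public:
        spin_mutex_sketch () : m_locked(0) { }
        void lock () {
            int delay = 1;
            // Bus-locking test-and-set, retried only after the read-only
            // spin below suggests the lock has been released.
            while (__sync_lock_test_and_set (&m_locked, 1)) {
                do {
                    // (a) exponential backoff: pause a little longer each
                    // time, eventually yielding to the OS scheduler.
                    if (delay < 32) {
                        for (int i = 0; i < delay; ++i)
                            _mm_pause ();
                        delay *= 2;
                    } else {
                        sched_yield ();
                    }
                } while (m_locked);   // (b) read-only check, no bus lock
            }
        }
        void unlock () {
            __sync_lock_release (&m_locked);   // release semantics
        }
    private:
        volatile int m_locked;
    };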

OK, so let me know! If it works, it's a performance improvement as well as a way for us to shed the pesky TBB dependency. Win, win.

@hobbes1069 (Contributor)

With TBB
Test #25: unit_spinlock .................... Passed 82.20 sec

Without TBB I'm getting a failed test:

/home/build/rpmbuild/oiio-test/BUILD/lgritz-oiio-18410b6/src/libOpenImageIO/spinlock_test.cpp:93:
FAILED: a == (int)(numthreads * iterations)
values were '1598735224' and '1600000000'
Time: 20.7s

@lgritz (Collaborator, Author) commented Jun 12, 2012

Hmmm... maybe my unlock() optimization is not as safe as I thought.

Richard, does this happen consistently?

If you change the line in thread.h from

    *(int *)&m_locked = 0;

to

    m_locked = 0;

does that make it pass the test? (Though I expect it to be slower.)

What kind of machine is it? (CPU model, number of CPUs, number of cores)
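
For context, here's a hypothetical illustration of the difference between the two variants (not the actual thread.h code; assumes gcc and that m_locked is normally accessed atomically):

    volatile int m_locked = 1;

    void unlock_fast ()    // the x86_64 "optimization" in question
    {
        // Plain, unfenced store: it relies entirely on the hardware's
        // store ordering, with no compiler or CPU barrier issued.
        *(int *)&m_locked = 0;
    }

    void unlock_safe ()    // the conservative version
    {
        // Release operation: issues whatever barriers the platform
        // requires, so writes made inside the critical section are
        // visible before the lock appears free.
        __sync_lock_release (&m_locked);
    }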

@lgritz (Collaborator, Author) commented Jun 12, 2012

I hate this shit. How could C++ not define atomic counters, spin locks, and blocking mutexes as part of the language, and have any compiler worth its salt ensure that it's correct and using the fastest possible implementation? (I know, C++11 has atomics, but it's 15 years too late, and I think it still doesn't have a spin mutex.)
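
(For what it's worth, C++11's std::atomic_flag is enough to roll a minimal spin mutex by hand; a sketch, not something this patch can rely on:)

    #include <atomic>

    class spin_mutex_cpp11 {
    public:
        void lock ()   { while (m_flag.test_and_set (std::memory_order_acquire)) { } }
        void unlock () { m_flag.clear (std::memory_order_release); }
    private:
        std::atomic_flag m_flag = ATOMIC_FLAG_INIT;
    };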

@hobbes1069 (Contributor)

The test failed every time; I think I ran it 3 or 4 times.

Making the change you suggested (actually, I just took out the #if x86_64 clause) let the test complete:
Time: 2m 12.4s

Here's the CPU info; of course, it's repeated two more times for the other two cores:

$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 5
model name : AMD Athlon(tm) II X3 455 Processor
stepping : 3
microcode : 0x10000c8
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 3
core id : 0
cpu cores : 3
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips : 6630.20
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

@lgritz (Collaborator, Author) commented Jun 12, 2012

I wonder if that's an Intel vs AMD issue. In any case, it doesn't seem safe to do that particular optimization. I will push an update to this patch that removes it.

@hobbes1069 (Contributor)

I assume it's possible, but is it practical or advantageous to investigate how TBB handles this?

@lgritz (Collaborator, Author) commented Jun 15, 2012

I've looked at the TBB code extensively. It's very convoluted, but I think I'm doing basically the same thing overall. They do not appear to use the locking improvement (which seems to help and to be safe) or the unlock "improvement" (which turned out not to be safe, so I have removed it).

@lgritz (Collaborator, Author) commented Jun 15, 2012

Anybody else want to comment or report timings?

Are there any objections to my committing this to master (i.e., improving the non-TBB case, but not yet changing the default for USE_TBB)?

I'm also still open to changing USE_TBB to default to 0, or even removing TBB entirely, if everybody's timings show that non-TBB is not significantly worse than TBB across a wide variety of hardware and OS configurations.

Commit messages in this pull request:

- Make atomic_test and spinlock_test run for many more iterations, and time their results.
- … spinning too tightly. Also, disable the Apple OSSpinLock calls; they seem even slower than our own code.
- Two improvements: (a) exponential backoff for spin lock contention; (b) try_lock can spin more efficiently with read-only (non-bus-locking) checks until there's a good chance that the lock was released.
lgritz added a commit that referenced this pull request Jun 20, 2012
Substantial improvements to spin_mutex when USE_TBB=0.
@lgritz merged commit 4337147 into AcademySoftwareFoundation:master Jun 20, 2012
@lgritz (Collaborator, Author) commented Jun 20, 2012

I've merged this set of changes. Note that the default is still USE_TBB=1, but this substantially improves the performance of the code if you compile with USE_TBB=0. I'm still gathering benchmarks and testing; it's possible that very soon we will reach consensus that the right thing to do is change the default to the new code (USE_TBB=0). It's also possible that eventually we will make that the only choice and rip out the TBB code entirely (though for now, I think it's useful to have both for comparison purposes).

Also note that if you use both OSL and OIIO, you will probably get lots of grief if you compile those packages with differing USE_TBB settings.
