Make atomic_test and spinlock_test run for many more iterations, and time their results. #273
Conversation
Here are some interesting preliminary results on my system: Linux x86_64 (3.3.0 kernel), atomic test w/ TBB: 1m 24.3s. I'm not sure what's going on, but the spinlock test w/o TBB has easily gone over 5 minutes, maxing out all cores on my system, and still hasn't finished. If you want the full logs, let me know and I'll email them to you.
Forgot to mention I'm using the system TBB, version 3.0.
The spinlock test w/o TBB finally finished:
Here are my results on OSX Lion, 2.44 GHz Core i5 with 8GB RAM: With USE_TBB=1 and TBB 4.0u3 installed (doesn't it use an internal TBB, though?):
When setting USE_TBB=0:
Since you mentioned system vs. bundled TBB, I decided to try again with the bundled one, and got pretty much the same results as I got with the system TBB:
I'm seeing similar results myself, on both OSX and Linux: unit_atomic is about the same speed overall, but unit_spinlock is much slower. But I think I know why, and am working on a fix. I really need to know if this is plausible for Windows. On Apr 3, 2012, at 1:59 PM, 2bits wrote:
Larry Gritz
My work machine is Windows XP, but I can't install a build environment on it. If someone could supply a "ready-to-run" zip package, I'd be more than happy to try it.
These are the results for atomic_test on my Windows machine (2 hardware threads). This was a release build linked against tbb-21_20090511oss (quite old). I will post the results for spinlock_test in the evening.
I pushed an update that, like TBB, uses pause and yield to prevent the spin loops from being quite so contentious. This gets us much closer to TBB performance for spin_lock. Can you give it a try and see if it improves for you as well?
HW Threads: 3
Definitely a lot better, but why is OSX so much faster than this?
Here are my times: OSX 10.6, 64-bit (2 cores, hyperthreaded): Linux 2.6.34 kernel, 64-bit (12 cores, hyperthreaded): So I'm also seeing MUCH slower times for spin on Linux than on OSX. I don't know if that's due to the kernel, the particular CPU model, or just the fact that there are more HW cores and more CPUs (can that slow things down in cases of pathological contention?). Anybody have a good guess? These two tests are very stressful; they're not a typical use pattern that we'd see in real software. So I'm mostly satisfied that removing TBB will not be a performance penalty for Linux or OSX (and, additionally, that there's no gain from using the OSX-specific spinlock code). I still need hard numbers for spinlock on Windows from somebody. If those are not significantly worse without TBB, I will remove TBB support entirely.
    {
        if (delay < 32) {
    #if USE_TBB
            __TBB_Pause(count);
This does not compile due to a typo, it should be:
__TBB_Pause(delay);
fixed
Windows 7 64-bit, MSVC 2010. 4 cores hyperthreaded (i7-2720QM) USE TBB:
NO TBB:
Linux 2.6.42 64-bit (Fedora 15) running as a VirtualBox guest with 4 cores. The host is the same system from the previous post. USE TBB:
NO TBB:
osx @ 660b36b, 2.44 GHz Core i5, 8 GB RAM no tbb
use_tbb=1
OK, so here's where I think we're at: unit_atomic seems to be the same speed with and without TBB on all platforms, though it strangely varies more than we'd expect across CPUs and OSes (even with the same CPU), and unit_spinlock seems to be definitely slower without TBB, but not by a huge amount on most platforms tested (varying even more by CPU and OS). What do you guys think? These tests pathologically induce the highest possible amount of contention; it's not clear that you'd see a practical difference in a real application. I think we have a few choices: (1) totally rip out TBB; (2) keep the TBB code but turn it off by default; (3) turn it off by default for platforms where there doesn't seem to be a performance hit, but leave it enabled by default for those with big apparent performance differences.
I think option 2 is the best balance. It will quietly get the desired result without making it impossible for someone to change the behavior. I would leave it that way for a while, and if no one complains, it could later be cleaned out entirely. I'm currently only enabling it on 64-bit systems for Fedora, because for some reason OIIO fails to build against the system-installed TBB on i686.
Further refinement for #2: 2(a): Turn it off by default (USE_TBB=1 still re-enables it) and leave the TBB source code as it is. 2(b): Turn it off by default and remove all TBB source code from our repository; if somebody wants USE_TBB=1, they are expected to have TBB installed on their system.
Obviously 2b is fine with me, since I'm not allowed to use bundled libs in Fedora anyway, and should OIIO really be shipping something that's disabled by default? I think the only thing that would need to be checked is whether the USE_EXTERNAL_TBB option I patch in for Fedora works for non-*nix systems, or could be made to work easily. I can also supply a FindTBB.cmake I wrote, but it will definitely need to be updated for Windows.
Oops! Now that I look at it, I forgot that I got rid of my FindTBB config file in favor of a more complete one I found online; it should work for pretty much everything.
I like the 2b solution: if someone wants to use TBB, then obviously TBB should be installed on the system. This is especially true for Windows: even now, when using the bundled TBB, users need to explicitly link against the release or debug library. FindTBB.cmake could be added to our cmake modules directory. edit: I accidentally closed the pull request; reopening
Hi, after some weekend reading, I had some ideas for how to improve the spin locks and just pushed an update in which, on my two machines (a Mac OS X laptop and a 12-core Linux box), my non-TBB spinlock now outperforms the TBB spinlocks (by a significant margin on Linux!). Could those of you on this thread please try this out? Check out the branch for this pull request, do "make nuke ; make USE_TBB=1 ; make test TEST=spinlock", and then the same thing again with USE_TBB=0 (be sure to do a full make nuke first). If the rest of you find the same thing -- that the USE_TBB=0 benchmark is no slower than USE_TBB=1 -- then perhaps we can finally retire TBB. For the curious, the three improvements were: (a) exponential backoff for spin lock contention, in the same manner that TBB was doing; (b) try_lock can spin more efficiently with a read-only (non-bus-locking) check until there's a good chance that the lock was released. In particular, you should not spin like this: while (! try_lock(&thelock)) ; but instead
This is because try_lock does a compare-and-swap, which is a writing operation, which will lock the bus. In the latter version, if the initial (bus-locking) try-lock fails, it will do a read-only (non-bus-locking!) check until the lock appears to be free, then do the safe/locking one again. (c) On x86_64, it's safe for spin_lock::unlock() to do a normal unlocked write. The memory ordering of these chips is such that it doesn't need the memory fences of a full atomic write. I found numerous sources for this on the net, and it is a big win. Apparently it also works on some 32-bit x86's (most of the recent chips), but I am not very interested in sorting out which x86 chips it's safe on, especially since anybody heavily threaded enough to be concerned with this optimization is on a 64-bit system. OK, so let me know! If it works, it's a performance improvement as well as a way for us to shed the pesky TBB dependency. Win, win.
With TBB Without TBB I'm getting a failed test: /home/build/rpmbuild/oiio-test/BUILD/lgritz-oiio-18410b6/src/libOpenImageIO/spinlock_test.cpp:93:
Hmmm... maybe my unlock() optimization is not as safe as I thought. Richard, does this happen consistently? If you change the line in thread.h from
to
does that make it pass the test? (Though I expect it to be slower.) What kind of machine is it? (CPU model, number of CPUs, number of cores)
I hate this shit. How could C++ not define atomic counters, spin locks, and blocking mutexes as part of the language, and have any compiler worth its salt ensure that it's correct and using the fastest possible implementation? (I know, C++11 has atomics, but it's 15 years too late, and I think it still doesn't have a spin mutex.)
The test failed every time; I think I ran it 3 or 4 times. Making the change you suggested (actually, I just took out the #if x86_64 clause) let the test complete. Here's the cpu info; of course it's repeated 2 more times for the other two cores: $ cat /proc/cpuinfo
I wonder if that's an Intel vs. AMD issue. In any case, it doesn't seem safe to do that particular optimization. I will push an update to this patch that removes it.
I assume it's possible, but is it practical or advantageous to investigate how TBB handles this?
I've looked at the TBB code extensively. It's very convoluted, but I think I'm doing basically the same thing overall. They do not appear to use the locking improvement (which seems to help and be safe) or the unlock "improvement" (which turned out not to be safe, so I have removed it).
Anybody else want to comment or report timings? Are there any objections to my committing this to the master (i.e. improve the non-TBB case, but not yet change the default for USE_TBB)? I'm also still open to changing USE_TBB to default to 0, or even removing TBB entirely, if everybody's timings show that non-TBB is not significantly worse than TBB across a wide variety of hardware and OS configurations.
…time their results. This allows us to rigorously compare the speed of our atomics and spin locks. Also make their output a little neater by locking around the status printouts, and eliminate the #if's that provide safety for Boost < 1.35, which is no longer supported.
… spinning too tightly. Also, disable the Apple OSSpinLock calls, they seem even slower than our own code.
Two improvements: (a) exponential backoff for spin lock contention; (b) try_lock can spin more efficiently with read-only (non-bus-locking) checks until there's a good chance that the lock was released.
Substantial improvements to spin_mutex when USE_TBB=0.
I've merged this set of changes. Note that the default is still USE_TBB=1. But this substantially improves the performance of the code if you compile USE_TBB=0. I'm still gathering benchmarks and testing; it's possible that very soon, we will reach consensus that the right thing to do is change the default to use the new code (USE_TBB=0). Also possible that eventually we will make that the only choice and rip out the TBB code entirely (though for now, I think it's useful to have both for comparison purposes). Also note that if you use both OSL and OIIO, you will probably get lots of grief if you compile those packages with differing USE_TBB settings.