Design review: ImageInput: Add nthreads parameter #1258
lgritz wants to merge 1 commit into AcademySoftwareFoundation:master
Conversation
ImageInput: Add nthreads parameter to read_image, read_scanlines, read_tiles.
This allows control over thread spawning, similar to the parameter on
the various ImageBufAlgo functions. A value of 0 means to use the global
OIIO::attribute("nthreads") value (which in turn defaults to have a
thread fan-out equal to the hardware concurrency). For a sole thread of
execution, you want a big fan-out for tasks that can be parallelized,
whereas when the calling function is one of many concurrent threads
already underway, you just want the work done directly by the calling
thread, without spawning additional threads that will oversubscribe the
hardware.
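For concreteness, here is a rough sketch of the shape this would take. This is a simplified illustration with stand-in types, not the actual header change; the real read calls also carry stride and progress-callback arguments that are omitted here.

```cpp
// Sketch only: simplified stand-in declarations showing where the proposed
// nthreads parameter would go, not the real OpenImageIO headers.
struct TypeDesc { /* stand-in for OIIO::TypeDesc */ };

class ImageInput {
public:
    virtual ~ImageInput () {}
    // nthreads == 0 : use the global OIIO::attribute("nthreads") value
    // nthreads == 1 : do all the work in the calling thread
    // nthreads  > 1 : allow up to that many threads for parallelizable work
    virtual bool read_image (TypeDesc format, void *data, int nthreads = 0);
    virtual bool read_scanlines (int ybegin, int yend, int z,
                                 TypeDesc format, void *data,
                                 int nthreads = 0);
    virtual bool read_tiles (int xbegin, int xend, int ybegin, int yend,
                             int zbegin, int zend, TypeDesc format,
                             void *data, int nthreads = 0);
};
```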
There are currently only a couple of spots where ImageInput might spawn
threads. This is primarily for future expansion, so that we might add
more multithreading and give appropriate control to the calling
application.
The reason that we want per-call overrides rather than relying solely on
the global "nthreads" attribute is that OIIO may be used by multiple
parts of an application (or multiple dependent libraries or plugins,
each unaware of the others), and so it is not particularly safe under
those circumstances to set the global value and assume that it will stay
that way.
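To make the hazard concrete, here is a hypothetical two-component scenario (stand-in declaration, with the global attribute spelled the way it is referred to above):

```cpp
// Hypothetical illustration, not real application code.
namespace OIIO { bool attribute (const char *name, int value); }

void component_A ()
{
    // "I'm already running plenty of threads; keep OIIO serial."
    OIIO::attribute ("nthreads", 1);
    // ... proceeds to make OIIO calls expecting serial behavior ...
}

void component_B ()
{
    // Unaware of A: "use every core."  This silently changes the behavior
    // of every subsequent OIIO call in the process, including A's.
    OIIO::attribute ("nthreads", 0);
}
```

A per-call (or per-object) override lets each caller state its own policy without ever having to trust the process-wide setting.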
This is a preliminary effort for design review. Obvious extensions, if
people like the way this is going, would be to add the nthreads
parameter to the "native" versions (read_native_tiles,
read_native_scanlines), the "deep" versions, and corresponding
routines in ImageOutput. Also we should add a method to ImageBuf that
can set an "nthreads" parameter for any underlying ImageInput or other
operations done internal to the IB.
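For the ImageBuf piece, usage might look something like this; set_nthreads() is a purely hypothetical method name used for illustration:

```cpp
#include <OpenImageIO/imagebuf.h>
using namespace OIIO;

void example ()
{
    ImageBuf buf ("input.exr");
    // Hypothetical: the IB remembers a thread policy and forwards it to any
    // ImageInput/ImageOutput work it performs internally.
    buf.set_nthreads (1);      // "do internal reads in the calling thread"
    buf.read ();               // internal reads would honor the hint
}
```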
It is worth discussing alternatives. I admit that the idea of scattering
"nthreads" parameters widely through the APIs is one that does give me
some pause. But it has worked well for IBA, and at the moment I don't
have a better idea for ImageInput/ImageOutput. I'm not committed to this
yet, just testing the waters, so feedback is appreciated.
Additional question: should the default be 0 (use the global "nthreads"),
or should the default be 1 (just do the work with the calling thread)?
|
I realize this is likely more complicated, but I think the ideal for me would be that if multiple threads are requesting the same resource, then instead of the extra threads blocking, they should all just join up together and help get that resource in parallel. If some of this parallel work is done outside of OIIO's direct control, such as in OpenEXR, then I realize this wouldn't work. One hack around that limitation would be for OIIO to have a counter of how many threads are actively blocked, and then use that as the number for how many helper threads to request. This of course has some obvious issues, but I would hope that in regular usage it would provide a benefit over both the 0 and 1 thread count options. I suppose this could be built on top of the proposed interface in this ticket. |
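A very rough sketch of that "count the blocked threads" hack (purely hypothetical helper class, nothing like this exists in OIIO today): threads about to block on a shared resource register themselves, and the thread doing the work sizes its fan-out by how many waiters are parked at the moment.

```cpp
#include <atomic>

// Hypothetical sketch: a counter of threads currently blocked waiting for a
// shared resource, used as the budget for how many helpers to enlist.
class BlockedWaiterCount {
public:
    void about_to_block ()  { ++m_waiters; }
    void done_blocking ()   { --m_waiters; }
    // How many helper threads are warranted right now?
    int suggested_helpers () const { return m_waiters.load(); }
private:
    std::atomic<int> m_waiters { 0 };
};
```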
|
An ideal solution would probably involve some kind of smart app-wide thread pool that managed tasks and would ensure the cores available are never over-subscribed and also never idle as long as any currently-executing OIIO calls have work they can divide up, all the while cognizant of the idle core availability moment to moment and automatically adjudicating the "do I launch threads or do I do this work myself." But that's a big thing to bite off, and I'm certainly not enough of an expert on how to write a world-class thread pool and don't aim to put the time into becoming such an expert. Now, if there was a really nice open source solution to this that we can adopt without it being an onerous dependency or only existing on a subset of platforms we care about... but I'm not aware of one that fits the bill at this time, and also am keeping an eye on C++17 to see what thread pool facilities are adopted, if any. So currently I'm inclined to kick the long-term ideal solution down the road for a while and just take an incremental step from what we have now. If anybody has a suggestion for an already-built solution for this, I'm all ears. |
|
As an alternative to a wide swath of ImageInput and ImageOutput methods taking an 'nthreads' parameter, another potential design is simply to have open() accept a hint ("open with config" as we call it) about what thread policy to use for this ImageInput or ImageOutput. It would simply store it somewhere in the ImageInput or ImageOutput, and internals (including parts of the plugins) would just check that value when they needed to know the thread policy. The "pro" side to this alternate design is that we wouldn't need to change any of the method call signatures and wouldn't break the API or ABI at all, there would be no 'nthreads' scattered around cluttering the methods' parameters, and if we ever changed the whole scheme (like using a persistent thread pool) we wouldn't face the future prospect of deprecating and removing the 'nthreads' which would no longer be necessary or meaningful. The "con" side is that you would need to specify the thread policy when you opened the file, and it would apply to all the things you do with it until you close. You wouldn't particularly be able to change the thread policy from call to call. Then again, maybe there's no good reason why you'd want to do that? Alternate to the alternate: instead of open-with-config, we could add an actual method to set the thread policy, so you could set or change it at times other than when you first open the file. This would alter the ImageInput/ImageOutput API, but only VERY slightly. |
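Very roughly, the two flavors might look like this from the caller's side; the config attribute name and the threads() setter are illustrative guesses, not settled API:

```cpp
#include <OpenImageIO/imageio.h>
#include <string>
using namespace OIIO;

void example (const std::string &filename, float *pixels)
{
    // Flavor 1: "open with config" -- pass the hint in the config ImageSpec
    // and let the ImageInput remember it for as long as the file is open.
    ImageSpec config;
    config.attribute ("nthreads", 1);     // hint: stay in the calling thread
    ImageInput *in = ImageInput::open (filename, &config);
    if (! in)
        return;

    // Flavor 2 ("alternate to the alternate"): an explicit setter/getter so
    // the policy could also be changed between calls on an already-open
    // file.  Hypothetical method name, shown here only as a comment:
    // in->threads (4);
    // int nt = in->threads ();

    in->read_image (TypeDesc::FLOAT, pixels);
    in->close ();
    delete in;    // OIIO of this era hands back a raw pointer from open()
}
```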
|
Alternate design shown in #1259. I am leaning toward that one rather than the one discussed so far in the present ticket; it's so much less intrusive to the API. Please chime in with feedback. |
|
We've been using Intel's TBB for this sort of "magical global thread pool and task scheduler for dummies" thing, and have been pretty happy with it. I guess that might be too much of a dependency for OIIO though? For what it's worth, of the alternatives so far I definitely prefer the ones which are less intrusive on the API - it definitely seems non-ideal to add parameters to a multitude of API calls when those parameters wouldn't exist at all in the ideal future scenario where there is a global thread pool. One further alternative along those lines might be to mimic something like TBB's task_scheduler_init, where an external object provides a separate API for specifying a thread limit for all thread-aware processes launched while it's active (there are examples in the link above). The pro of that one is that there's just a single place to specify the thread count for anything, with no further intrusion into any other APIs. But the con would be that the API is giving the impression of there being a global thread pool, which there isn't at present in OIIO. |
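For reference, the TBB idiom being described looks roughly like this (real TBB usage, not anything OIIO provides; the surrounding function is just an example):

```cpp
#include <cstddef>
#include <tbb/task_scheduler_init.h>
#include <tbb/parallel_for.h>

// An RAII task_scheduler_init caps the worker-thread count for any
// TBB-scheduled work that runs while it is alive -- one place to set the
// limit, with no per-call parameters anywhere else in the API.
void halve_pixels (float *pixels, std::size_t n)
{
    tbb::task_scheduler_init init (4);      // at most 4 worker threads here
    tbb::parallel_for (std::size_t(0), n, [=](std::size_t i) {
        pixels[i] *= 0.5f;
    });
}
```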
|
I've done a little more mocking up of this (which I haven't yet pushed to the PR), and it just keeps getting worse. This turns out to be much more intrusive to the API than I originally thought, I don't like it much. The alternate #1259 (just add the separate call to set/get nthreads) is really clean in comparison, I'm now strongly inclined to go that route. I have a fairly high opinion of TBB. We used to use TBB in OIIO a bit earlier -- mainly for the spin mutexes, and when I finally got our homemade ones working as efficiently as those in TBB, we abandoned it, to applause of many people who found the TBB dependency troubling. I dunno, I might be inclined to bring it back if it fixes the thread pool problem, but one issue is that it's Intel only. We know that people use OIIO on ARM, so I'm a bit worried about relying on TBB, I suppose it would have to be a "use TBB if available" where we would always need a reasonable fallback for systems that didn't have it. Anyway, I would say, let's take a hard look at TBB for the long-term thread pool issue, but for now that's a bit of an orthogonal issue. The IBA functions have worked reasonably well, and with little complaint, with our very crude thread launching, and the obvious incremental step is to do the same things for ImageInput/ImageOutput internals. But we'll definitely investigate TBB or other thread pool solutions in the longer term. As far as thread pools go, I also think that our first step is to freeze and release OIIO 1.6, and for 1.7 and beyond switch to C++11 as a baseline. The switch to C++11 will also clean up a lot of cruft in our current thread.h, as the new standard provides a lot of what we had to create from scratch. I want to evaluate the suitability of thread pool implementations in a world with lambdas, and also with an eye to what thread pools and other parallel support might look like in C++17. |
|
TBB in OIIO requires that the parent application also use TBB for its …
Also, yay for C++11 -- that's a "burden" I'm happy to accept.
|
|
The app certainly doesn't have to use or be aware of TBB just because OIIO is. But it is very likely that if it's not using TBB, then it still has to give OIIO a hint about how many threads are safe to spawn, if it wants to avoid oversubscription of the cores. I'm not entirely sure under what circumstances it's harmful to have a bit of oversubscription, especially for short intervals. Maybe it's not a problem for apps for which most of the runtime is not spent inside OIIO. So I don't know the true answer. It depends on whether TBB has a thread policy that examines load in some way. If not, then yeah, the calling app could spawn a bunch of threads (not using TBB) that do entirely non-OIIO work, in which case OIIO (and its underlying thread pool implementation) would have no idea that it would be oversubscribing the cores. It's something else to investigate in the future, if we look at TBB for this solution. This really should be a feature of the OS or standard C++ so it can be implemented app-wide in a sane way. There is likely not a one-size-fits-all solution that can be implemented from within one library like OIIO. BTW, for the scheme I'm now favoring (#1259), do you think that the proper default is 1 (just use the calling thread, unless instructed to do otherwise), or 0 (assume you're in a mostly single-threaded app and try to fill the cores, unless instructed otherwise)? |
|
If you're using spinlocks as opposed to a regular mutex that puts the …
Imagine you create two threads to do some work that takes one thread just …
I'd hate on a 64 logical core machine to have all 64 of my threads make …
In the end I think this will require per app tuning to decide what the …
The alternative I gave earlier on is to have OIIO count how many of the …
Finally, one big issue with TBB is that it must be dynamically linked (last …
|
|
Yeah, agreed on all points, and you give some very helpful example scenarios. I think right now there are two common cases that we can deal with well:
1. The app is essentially single-threaded at the point where it calls OIIO, and wants OIIO to fan out and use all the cores for parallelizable work.
2. The app already has many threads of its own underway, and wants OIIO calls to just do their work in the calling thread without spawning more and oversubscribing the hardware.
Anything else is a much more difficult case, probably without a clean solution that we can solve on our own. So kick that down the road a bit, and in the meantime, the app either has to compromise or be proactive about giving the threading hints to individual OIIO calls or classes. As an aside, I would sure love something that gives me a TBB-like thread pool that is completely implemented within a header file / templates, and supports a build-time setting of a custom namespace so that it wouldn't conflict between various dependent libraries within an app. But that is probably less likely than wishing for a pony. I do think that a proper thread pool will probably help, and I'll keep that as a long-term goal, but tackling that is outside the scope of what I need to do right now. |
|
I'm going to close this PR, because as time goes on, I feel more and more strongly that the approach proposed by this patch is not the right one. Let's redirect any remaining conversation about the approach of #1259 to that ticket. |