Design review: ImageInput: Add nthreads parameter #1258

Closed

lgritz wants to merge 1 commit into AcademySoftwareFoundation:master from lgritz:lg-thread

Conversation

@lgritz (Collaborator) commented Nov 11, 2015

I'm not committed to this yet, just testing the waters and looking for feedback.

ImageInput: Add nthreads parameter to read_image, read_scanlines, read_tiles.

This allows control over thread spawning, similar to the parameter on the various ImageBufAlgo functions. A value of 0 means to use the global OIIO::attribute("nthreads") value (which in turn defaults to a thread fan-out equal to the hardware concurrency). When there is a sole thread of execution, you want a big fan-out for tasks that can be parallelized; when the calling function is one of many concurrent threads already underway, you want the work done directly by the calling thread, without spawning additional threads that would oversubscribe the hardware.
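
A rough sketch of the shape of the proposal (parameter lists abbreviated; this is not the final API):

    // Hypothetical declarations with the new trailing parameter:
    //   nthreads == 0 : use the global OIIO::attribute("nthreads") value
    //   nthreads == 1 : do all the work in the calling thread
    //   nthreads == N : allow a fan-out of up to N threads
    virtual bool read_image (TypeDesc format, void *data,
                             /* strides, progress callback, ... */
                             int nthreads = 0);
    virtual bool read_scanlines (int ybegin, int yend, int z,
                                 TypeDesc format, void *data,
                                 /* strides, ... */
                                 int nthreads = 0);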

There are currently only a couple spots where ImageInput might spawn threads. So this is primarily for future expansion, so that we might add more multithreading and give appropriate control to the calling application.

The reason that we want per-call overrides rather than relying solely on the global "nthreads" attribute is that OIIO may be used by multiple parts of an application (or multiple dependent libraries or plugins, each unaware of the others), and so it is not particularly safe under those circumstances to set the global value and assume that it will stay that way.
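
For example (a contrived but plausible sequence; both calls are the real OIIO::attribute API):

    // Library A, trying to be polite to its multithreaded host:
    OIIO::attribute ("nthreads", 1);
    // ... later, plugin B, unaware of A, wants a big fan-out:
    OIIO::attribute ("nthreads", 16);   // clobbers A's setting process-wide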

This is a preliminary effort for design review. Obvious extensions, if people like the way this is going, would be to add the nthreads parameter to the "native" versions (read_native_tiles, read_native_scanlines), the "deep" versions, and the corresponding routines in ImageOutput. Also, we should add a method to ImageBuf that can set an "nthreads" parameter for any underlying ImageInput or other operations done internally by the ImageBuf.

It is worth discussing alternatives. I admit that the idea of scattering "nthreads" parameters widely through the APIs gives me some pause. But it has worked well for IBA, and at the moment I don't have a better idea for ImageInput/ImageOutput. Feedback appreciated.

Additional question: should the default be 0 (use the global "nthreads"), or should the default be 1 (just do the work with the calling thread)?

@ThiagoIze (Collaborator) commented:

I realize this is likely more complicated, but I think the ideal for me would be that if multiple threads are requesting the same resource, then instead of the extra threads blocking, they should all just join up together and help get that resource in parallel. If some of this parallel work is done outside of OIIO's direct control, such as in OpenEXR, then I realize this wouldn't work.

One hack around that limitation would be for OIIO to have a counter of how many threads are actively blocked, and then use that as the number for how many helper threads to request. This of course has some obvious issues, but I would hope that in regular usage it would provide a benefit over both the 0 and 1 thread count options. I suppose this could be built on top of the proposed interface in this ticket.
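
A minimal sketch of that counter idea (hypothetical throughout; nothing here is existing OIIO API):

    #include <atomic>

    // Threads that would otherwise block on an in-progress read bump this:
    static std::atomic<int> blocked_in_oiio (0);

    // The thread doing the actual read sizes its helper request from the
    // counter, on the theory that blocked callers' cores are free to borrow:
    int helpers_to_request ()
    {
        return 1 + blocked_in_oiio.load ();
    }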

@lgritz (Collaborator, Author) commented Nov 12, 2015

An ideal solution would probably involve some kind of smart app-wide thread pool that managed tasks, ensuring the available cores are never oversubscribed but also never idle as long as any currently executing OIIO call has work it can divide up, all the while staying cognizant of idle-core availability from moment to moment and automatically adjudicating the question of "do I launch threads, or do I do this work myself?"

But that's a big thing to bite off; I'm certainly not enough of an expert on how to write a world-class thread pool, and I don't aim to put in the time to become such an expert. Now, if there were a really nice open source solution that we could adopt without it being an onerous dependency, or without it existing only on a subset of the platforms we care about... but I'm not aware of one that fits the bill at this time, and I'm also keeping an eye on C++17 to see what thread pool facilities are adopted, if any. So currently I'm inclined to kick the long-term ideal solution down the road for a while and just take an incremental step from what we have now.

If anybody has a suggestion for an already-built solution for this, I'm all ears.

@lgritz (Collaborator, Author) commented Nov 12, 2015

As an alternative to a wide swath of ImageInput and ImageOutput methods taking an 'nthreads' parameter, another potential design is simply to have open() accept a hint ("open with config," as we call it) about what thread policy to use for this ImageInput or ImageOutput. It would simply be stored somewhere in the ImageInput or ImageOutput, and the internals (including parts of the plugins) would just check that value whenever they needed to know the thread policy.
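
From the caller's side, that might look like this (the "nthreads" hint name is hypothetical, but the open-with-config mechanism itself already exists):

    // Pass a threading hint at open time via open-with-config:
    ImageSpec config;
    config.attribute ("nthreads", 1);             // hypothetical hint name
    ImageInput *in = ImageInput::open ("big.exr", &config);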

The "pro" side to this alternate design would be that we wouldn't need to change any of the method call signatures and wouldn't break the API or ABI at all, no 'nthreads' scattered around cluttering the methods' parameters, and if we ever changed the whole scheme (like using a persistent thread pool) we wouldn't be faced with future prospects of deprecating and removing the 'nthreads' which would no longer be necessary or meaningful.

The "con" side is that you would need to specify the thread policy when you opened the file and it would apply to all the things you do with it until you close. You wouldn't particularly be able to change the thread policy from call to call. Then again, maybe there's no good reason why you'd want to do that?

Alternate to the alternate: instead of open-with-config, we could add an actual method to set the thread policy, so you could set or change it at times other than when you first open the file. This would alter the ImageInput/ImageOutput API, but only VERY slightly.
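
Something along these lines (a sketch; the method name here is hypothetical, though this is essentially the shape of what #1259 proposes):

    // Hypothetical per-object thread policy setter, callable at any time:
    ImageInput *in = ImageInput::open ("big.exr");
    in->threads (1);        // subsequent calls do their work inline
    // ... reads ...
    in->threads (0);        // back to following the global attribute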

@lgritz (Collaborator, Author) commented Nov 12, 2015

Alternate design shown in #1259. I am leaning toward that one rather than the one discussed so far in the present ticket; it's so much less intrusive to the API.

Please chime in with feedback.

@johnhaddon (Contributor) commented:

We've been using Intel's TBB for this sort of "magical global thread pool and task scheduler for dummies" thing, and have been pretty happy with it. I guess that might be too much of a dependency for OIIO though?

For what it's worth, of the alternatives so far I definitely prefer the ones that are less intrusive on the API; it seems non-ideal to add parameters to a multitude of API calls when those parameters wouldn't exist at all in the ideal future scenario where there is a global thread pool.

One further alternative along those lines might be to mimic something like TBB's task_scheduler_init, where an external object provides a separate API for specifying a thread limit for all thread-aware processes launched while it's active (TBB's documentation has examples). The pro of that one is that there's just a single place to specify the thread count for anything, with no further intrusion into any other APIs. But the con is that the API gives the impression of there being a global thread pool, which there isn't at present in OIIO.
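
A minimal sketch of such an object in OIIO terms (the class itself is hypothetical; OIIO::attribute and OIIO::getattribute are real calls), with the caveat that it only manipulates the same global attribute discussed above:

    #include <OpenImageIO/imageio.h>

    // Hypothetical RAII limiter in the spirit of tbb::task_scheduler_init:
    class scoped_thread_limit {
    public:
        scoped_thread_limit (int nthreads) : m_saved (0) {
            OIIO::getattribute ("nthreads", m_saved);
            OIIO::attribute ("nthreads", nthreads);
        }
        ~scoped_thread_limit () {
            OIIO::attribute ("nthreads", m_saved);  // restore on scope exit
        }
    private:
        int m_saved;
    };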

@lgritz (Collaborator, Author) commented Nov 12, 2015

I've done a little more mocking up of this (which I haven't yet pushed to the PR), and it just keeps getting worse. This turns out to be much more intrusive to the API than I originally thought, and I don't like it much. The alternate #1259 (just add a separate call to set/get nthreads) is really clean in comparison; I'm now strongly inclined to go that route.

I have a fairly high opinion of TBB. We used TBB in OIIO early on -- mainly for the spin mutexes -- and when I finally got our homemade ones working as efficiently as those in TBB, we abandoned it, to the applause of many people who found the TBB dependency troubling. I dunno, I might be inclined to bring it back if it fixes the thread pool problem, but one issue is that it's Intel-only. We know that people use OIIO on ARM, so I'm a bit worried about relying on TBB; I suppose it would have to be "use TBB if available," and we would always need a reasonable fallback for systems that don't have it.

Anyway, I would say, let's take a hard look at TBB for the long-term thread pool issue, but for now that's a bit of an orthogonal issue. The IBA functions have worked reasonably well, and with little complaint, with our very crude thread launching, and the obvious incremental step is to do the same things for ImageInput/ImageOutput internals. But we'll definitely investigate TBB or other thread pool solutions in the longer term.

As far as thread pools go, I also think that our first step is to freeze and release OIIO 1.6, and for 1.7 and beyond switch to C++11 as a baseline. The switch to C++11 will also clean up a lot of cruft in our current thread.h, as the new standard provides a lot of what we had to create from scratch. I want to evaluate the suitability of thread pool implementations in a world with lambdas, and also with an eye to what thread pools and other parallel support might look like in C++17.

@ThiagoIze (Collaborator) commented:

TBB in OIIO requires that the parent application also use TBB for its threading, right? Otherwise we're back to the problem of more threads being used than there are cores. That's putting a huge burden on the parent application, which might not want to use TBB for its threading.

Also, yay for C++11 -- that's a "burden" I'm happy to accept.

@lgritz (Collaborator, Author) commented Nov 12, 2015

The app certainly doesn't have to use or be aware of TBB just because OIIO is. But it is very likely that if it's not using TBB, then it still has to give OIIO a hint about how many threads are safe to spawn, if it wants to avoid oversubscription of the cores.

I'm not entirely sure under what circumstances it's harmful to have a bit of oversubscription, especially for short intervals. Maybe it's not a problem for apps for which most of the runtime is not spent inside OIIO.

So I don't know the true answer. It depends on whether TBB has a thread policy that examines load in some way. If not, then yeah, the calling app could spawn a bunch of threads (not using TBB) that do entirely non-OIIO work, in which case OIIO (and its underlying thread pool implementation) would have no idea that it would be oversubscribing the cores. It's something else to investigate in the future, if we look at TBB for this solution.

This really should be a feature of the OS or standard C++ so it can be implemented app-wide in a sane way. There is likely not a one-size-fits-all solution that can be implemented from within one library like OIIO.

BTW, for the scheme I'm now favoring (#1259), do you think that the proper default is 1 (just use the calling thread, unless instructed to do otherwise), or 0 (assume you're in a mostly single-threaded app and try to fill the cores, unless instructed otherwise)?

@ThiagoIze (Collaborator) commented:

If you're using spinlocks, as opposed to a regular mutex that puts the thread to sleep, then oversubscription can be very bad, as you'll have spinning threads hogging the cores. If the parent app uses lots of spinlocks, then oversubscription in OIIO could slow things down both for OIIO and for the parent app. There are also other costs, such as the actual context switching and increased cache evictions. Some moderate amount of oversubscription is probably fine and can give a net speedup if the alternative is to undersubscribe the cores; lots of oversubscription can cause noticeable slowdowns.

Imagine you create two threads to do some work that takes one thread just 0.1ms to accomplish, and a timeslice is 10ms (I believe that's what you'd get with the Linux CFS scheduler with two threads per core). If one of those threads starts working but gets preempted, you'll end up having to wait 10.1ms for the result. That's a 100x slowdown. Granted, if all the cores are always doing useful work, that's probably fine if we just care about overall throughput, but if this preemption happened because of a spinlock, then that's a lot of wasted CPU cycles.

I'd hate, on a 64-logical-core machine, to have all 64 of my threads make OIIO calls at the same time, which in turn each spawn 64 threads -- 4096 threads total can't be good. If instead OIIO has a 64-thread pool of its own, having 128 threads running probably won't be too bad. Even if you don't use TBB, I'd still advocate a custom thread pool for this reason.

In the end, I think this will require per-app tuning to decide what the proper hint is for the number of threads to use. As for the default, I'd pick whatever seems to usually give the better results. If you find that oversubscription is helping in all the tests you do, then go with that and let the odd use case override the default.

The alternative I gave earlier is to have OIIO count how many of the host application's threads are blocked inside OIIO on file I/O, and then use that to help determine what its active thread pool size should be at that moment in time. I still think this is the best option if TBB isn't being used everywhere.

Finally, one big issue with TBB is that it must be dynamically linked (last I checked, at least). This can cause headaches when different libraries and processes that are linked together use different versions of TBB. I realize the VFX Reference Platform tries to get around this by advocating that everyone use the same version, but in practice I think this won't always work. For instance, what if I'm still supporting an older DCC based on the 2014 VFX platform, so it uses an older version of TBB, but the library of mine that the DCC loads is built to use the TBB from the current VFX Platform 2015? That won't work. I'd have to create lots of OIIO builds for each version of the DCCs I support (or drop support for older DCC versions).

@lgritz (Collaborator, Author) commented Nov 12, 2015

Yeah, agreed on all points, and you give some very helpful example scenarios.

I think right now there are two common cases that we can deal with well (the one-line setting for each is sketched after the list):

  1. An app that is essentially single-threaded and has to block on an expensive OIIO operation. You want the OIIO op to break into sub-tasks that use as many cores as are available. This describes oiiotool, maketx, any Python script using the OIIO Python APIs, and a wide range of other simple apps; the most important operations to parallelize are the expensive ones, like fully reading an image or doing IBA-like processing on an entire image. For these, you want to set OIIO's "nthreads" attribute to 0 and let it have a thread fan-out that uses all the cores.
  2. An app that is already highly multithreaded in a way that spawns the "right" number of threads overall. Those individual threads may make relatively inexpensive OIIO calls (e.g., reading a tile, not processing a whole 8k image), which may occur concurrently, but other work is happening at all times as well, and the other threads doing non-OIIO things usually don't block. An example is a renderer that uses TextureSystem. For these, you want to set OIIO's "nthreads" attribute to 1 so that work is done by the calling thread.
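
In terms of the existing global attribute, the two cases boil down to one setting apiece (OIIO::attribute is the real call; the values follow the semantics described above):

    // Case 1: essentially single-threaded app; let OIIO fan out to
    // all available cores whenever it can parallelize:
    OIIO::attribute ("nthreads", 0);

    // Case 2: the app's own threads already keep every core busy;
    // OIIO calls should do their work inline in the calling thread:
    OIIO::attribute ("nthreads", 1);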

Anything else is a much more difficult case, probably without a clean solution that we can arrive at on our own. So kick that down the road a bit; in the meantime, the app either has to compromise or be proactive about giving threading hints to individual OIIO calls or classes.

As an aside, I would sure love something that gives me a TBB-like thread pool that is completely implemented within a header file / templates, and supports a build-time setting of a custom namespace so that it wouldn't conflict among the various dependent libraries within an app. But that is probably less likely than wishing for a pony.
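
For what it's worth, the custom-namespace part of that wish is easy to sketch (all names hypothetical):

    // Header-only pool wrapped in a per-build namespace, so two libraries
    // in the same app can each embed their own copy without symbol clashes:
    #ifndef MYPOOL_NAMESPACE
    #    define MYPOOL_NAMESPACE mypool_v1
    #endif

    namespace MYPOOL_NAMESPACE {
        class thread_pool { /* header-only implementation goes here */ };
    }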

I do think that a proper thread pool will probably help, and I'll keep that as a long-term goal, but tackling that is outside the scope of what I need to do right now.

@lgritz (Collaborator, Author) commented Nov 12, 2015

I'm going to close this PR, because as time goes on, I feel more and more strongly that the approach proposed by this patch is not the right one. Let's redirect any remaining conversation about the approach of #1259 to that ticket.

@lgritz closed this Nov 12, 2015
@lgritz deleted the lg-thread branch November 14, 2015 00:45