ImageCache/ImageBuf read performance by lgritz · Pull Request #480 · AcademySoftwareFoundation/OpenImageIO

lgritz · 2012-12-31T07:33:05Z

I spent some time carefully profiling image read performance of scanline OpenEXR images in OIIO. I was particularly interested in making sure that reading via an ImageBuf or ImageCache::get_pixels did not have too much overhead compared to a raw ImageInput::read_image (presumed to be "speed of light", the barest wrapping of the underlying library calls), and how autotile/autoscanline affected things.

First thing I noticed is a big flaw in my libOpenImageIO/imagespeed_test, I had neglected to fully flush the ImageCache between sub-tests, and that was throwing off my numbers, making IB and IC look rosier than they really were. I also augmented imagespeed_test to explicitly cover several more of the combinations I describe above. Here is example output from my Macbook Pro on a 2336x1198 4-channel float OpenEXR file:

read_image speed: 0.12s
read_scanline (1 at a time) speed: 0.33s
read_scanlines (64 at a time) speed: 0.12s
ImageBuf read speed: 0.30s
ImageCache get_pixels speed: 0.28s
With autotile = 64:
ImageBuf read speed: 0.41s
ImageCache get_pixels speed: 0.39s
With autotile = 64, autoscanline = 1:
ImageBuf read speed: 0.48s
ImageCache get_pixels speed: 0.47s

Some interesting things we can say already:

Using read_scanlines in 64-scanline chunks is equally fast as a read_image (which internally uses read_scanlines in 256-scanline chunks). But read_scanline individually reading each line takes almost 3x longer! This is because when reading multiple scanlines at once, the OpenEXR library (libIlmImf) uses multiple threads to pipeline the reads and decompresses. When you read just one scanline, it doesn't have the opportunity to do that.
Reading the same file with either ImageBuf or IC::get_pixels is a high penalty -- about 2.5x slower than speed of light.
When autotile is turned on, it's even worse, and when autoscanline=1 (which is supposed to be an improvement, by making the "virtual tiles" the full width of the image), it's even slower, 4x SOL.

OK, so I carefully profiled those code paths and made a number of improvements. I'll give you the results first:

read_image speed: 0.12s
read_scanline (1 at a time) speed: 0.33s
read_scanlines (64 at a time) speed: 0.12s
ImageBuf read speed: 0.15s
ImageCache get_pixels speed: 0.14s
With autotile = 64:
ImageBuf read speed: 0.33s
ImageCache get_pixels speed: 0.32s
With autotile = 64, autoscanline = 1:
ImageBuf read speed: 0.18s
ImageCache get_pixels speed: 0.17s

Bottom line is that I've DOUBLED the image reading speed of using ImageBuf and ImageCache::get_pixels when autotile is off, they are now almost as fast as raw calls to ImageInput::read_scanlines or read_image. Autotile is still much slower, though much improved compared to before. And the combination of autotile and autoscanline has been nearly tripled in performance, and is now only incrementally slower than when autotile is off (that is, when the ImageCache reads the whole image as one tile).

So, how did I do it? Basically it broke down to three main improvements:

I found some places (including ImageBuf's local storage and the tile memory in ImageCacheTile) where I'd used std::vector as a simple memory-managed buffer. This turned out to be unwise, because vector zeroes out the memory when you resize() to allocate (only to be I immediately filled with the real data afterwards). On modern architectures, touching memory unnecessarily (especially every byte of a buffer that's bigger than L2 cache!) is one of the most expensive things you can do. I made a custom scoped_array template in imagebuf.h that does the simple memory management I was after originally. (N.B. When we can count on C++11 everywhere, we can use unique_ptr<>, but for now I just rolled my own).
The code that read scanlines for the autotile was using individual read_scanline calls, not read_scanlines. For OpenEXR in particular, this makes a big difference.
The ImageCache was rounding tile sizes up to powers of 2 (including the full-image-tile, when autotile was off, and the virtual tile size when autotile was on, and the scanline width, when autotile was on and autoscanline was on). This is a historical artifact from the time when TextureSystem required power-of-2 tile sizes. It no longer does, so I removed the rounding. Note that for my file, which coincidentally was 2336 x 1198 (slightly above a power of 2 in each dimension), that was quite a lot of extra data and memory when rounded up, as well as causing various routines that has special fast cases for "contiguous stride" data layouts to end up on the slower paths for non-contiguous strides.

And there were a few other minor refactors with smaller effects, and some minor bug fixes for things that broke with these changes. Look at the individual commit comments if you care.

The final takeaway is:

After this is committed, you should expect much faster image read speeds if you are going through the ImageBuf or ImageCache interfaces, particularly for scanline (non-tiled) images, and especially for OpenEXR where there is an actual advantage for read_scanlines versus read_scanline.
If you are using autotile, you should also be using autoscanline.

…oced array

… than std::vector in order to reduce needless initialization of the buffers.

…::vector in order to avoid useless initialization of the memory when it's allocated.

… temp memory

…tead of read_scanline

This was an artifact of an old restriction of TextureSystem which is no longer necessary.

jeremyselan · 2013-01-02T17:08:41Z

This is awesome! Great speedups, and a great write up as well.

To confirm, this is binary incompatible, correct? (I note the change to scoped_array inside the ImageBuf public header). So is the assumptions this would be rolled into a new major version? (or the next major version).

(It's also interesting to note that if ImageBuf were implemented using the pimp pattern, this probably could have been done in a binary compatible manner).

lgritz · 2013-01-02T22:45:21Z

You're right: Because this breaks binary compatibility of public types, it just has to be a 1.2 feature, it just can't be backported to 1.1.

You're also correct that ImageBuf has become too complex to have the implementation in the public headers; it really needs PIMPL. I had avoided that just to make it as performant as possible rather than needing a function call and a pointer indirection for absolutely every method, even trivial get and sets, but maybe I shouldn't be worried about that in the grand scheme of things.

Should I make PIMPL-ifying IB a part of this review, or do it separately?

jeremyselan · 2013-01-02T23:11:04Z

This commit looks good to me. (master branch only)

Definitely a separate review for pimpl-ing. We dont necessarily have to pimpl everything, so we should look at real performance numbers and see if the indirection hurts us on the little functions (getpixel, setpixel, etc). But there are a whole bunch of classes in OIIO which are not performance limited, we should pimpl those as well.

lgritz · 2013-01-02T23:15:02Z

OK, I will merge this and look into pmpling separately. Maybe there's a hybrid approach where a small number of POD-and-unlikely-to-change fields can be public and make the frequently-called access routines as fast as possible, and the tricky stuff can be pimpled without any performance impact.

ImageCache/ImageBuf read performance improvements

…ion#480) Fixes AcademySoftwareFoundation#473

lgritz added 9 commits December 30, 2012 22:49

More diagnostics in imagespeed_test to pinpoint bottlenecks

7e0f4f1

scoped_array template for simple memory management of dynamically-all…

29290ff

…oced array

ImageBuf perf improvement -- hold local pixels in scoped_array rather…

333fbb9

… than std::vector in order to reduce needless initialization of the buffers.

ImageCache perf: for tile pixel mem, use scoped_array rather than std…

e5cb84e

…::vector in order to avoid useless initialization of the memory when it's allocated.

ImageCache perf: use scoped_array rather than vector for read_untiled…

38774bf

… temp memory

ImageCache perf: read_untiled autotile case -- use read_scanlines ins…

f17cb0d

…tead of read_scanline

ImageCache perf: refactor get_pixels to do fewer tile queries

eabe9c1

ImageCache speedups -- remove pow2 roundups of autotile tile sizes.

c0d4adc

This was an artifact of an old restriction of TextureSystem which is no longer necessary.

oiiotool - fix minor bug where tiled files were output inappropriately.

6f5744e

lgritz added a commit that referenced this pull request Jan 2, 2013

Merge pull request #480 from lgritz/lg-icperf

15deabd

ImageCache/ImageBuf read performance improvements

lgritz merged commit 15deabd into AcademySoftwareFoundation:master Jan 2, 2013

lgritz deleted the lg-icperf branch January 2, 2013 23:16

GerHobbelt pushed a commit to GerHobbelt/oiio that referenced this pull request Dec 10, 2024

Fix AllocationTransform to use a proper enum. (AcademySoftwareFoundat…

07f81c5

…ion#480) Fixes AcademySoftwareFoundation#473

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ImageCache/ImageBuf read performance#480

ImageCache/ImageBuf read performance#480
lgritz merged 9 commits intoAcademySoftwareFoundation:masterfrom
lgritz:lg-icperf

lgritz commented Dec 31, 2012

Uh oh!

jeremyselan commented Jan 2, 2013

Uh oh!

lgritz commented Jan 2, 2013

Uh oh!

jeremyselan commented Jan 2, 2013

Uh oh!

lgritz commented Jan 2, 2013

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lgritz commented Dec 31, 2012

Uh oh!

jeremyselan commented Jan 2, 2013

Uh oh!

lgritz commented Jan 2, 2013

Uh oh!

jeremyselan commented Jan 2, 2013

Uh oh!

lgritz commented Jan 2, 2013

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants