Skip to content

ImageCache/ImageBuf read performance#480

Merged
lgritz merged 9 commits intoAcademySoftwareFoundation:masterfrom
lgritz:lg-icperf
Jan 2, 2013
Merged

ImageCache/ImageBuf read performance#480
lgritz merged 9 commits intoAcademySoftwareFoundation:masterfrom
lgritz:lg-icperf

Conversation

@lgritz
Copy link
Collaborator

@lgritz lgritz commented Dec 31, 2012

I spent some time carefully profiling image read performance of scanline OpenEXR images in OIIO. I was particularly interested in making sure that reading via an ImageBuf or ImageCache::get_pixels did not have too much overhead compared to a raw ImageInput::read_image (presumed to be "speed of light", the barest wrapping of the underlying library calls), and how autotile/autoscanline affected things.

First thing I noticed is a big flaw in my libOpenImageIO/imagespeed_test, I had neglected to fully flush the ImageCache between sub-tests, and that was throwing off my numbers, making IB and IC look rosier than they really were. I also augmented imagespeed_test to explicitly cover several more of the combinations I describe above. Here is example output from my Macbook Pro on a 2336x1198 4-channel float OpenEXR file:

read_image speed: 0.12s
read_scanline (1 at a time) speed: 0.33s
read_scanlines (64 at a time) speed: 0.12s
ImageBuf read speed: 0.30s
ImageCache get_pixels speed: 0.28s
With autotile = 64:
ImageBuf read speed: 0.41s
ImageCache get_pixels speed: 0.39s
With autotile = 64, autoscanline = 1:
ImageBuf read speed: 0.48s
ImageCache get_pixels speed: 0.47s

Some interesting things we can say already:

  • Using read_scanlines in 64-scanline chunks is equally fast as a read_image (which internally uses read_scanlines in 256-scanline chunks). But read_scanline individually reading each line takes almost 3x longer! This is because when reading multiple scanlines at once, the OpenEXR library (libIlmImf) uses multiple threads to pipeline the reads and decompresses. When you read just one scanline, it doesn't have the opportunity to do that.
  • Reading the same file with either ImageBuf or IC::get_pixels is a high penalty -- about 2.5x slower than speed of light.
  • When autotile is turned on, it's even worse, and when autoscanline=1 (which is supposed to be an improvement, by making the "virtual tiles" the full width of the image), it's even slower, 4x SOL.

OK, so I carefully profiled those code paths and made a number of improvements. I'll give you the results first:

read_image speed: 0.12s
read_scanline (1 at a time) speed: 0.33s
read_scanlines (64 at a time) speed: 0.12s
ImageBuf read speed: 0.15s
ImageCache get_pixels speed: 0.14s
With autotile = 64:
ImageBuf read speed: 0.33s
ImageCache get_pixels speed: 0.32s
With autotile = 64, autoscanline = 1:
ImageBuf read speed: 0.18s
ImageCache get_pixels speed: 0.17s

Bottom line is that I've DOUBLED the image reading speed of using ImageBuf and ImageCache::get_pixels when autotile is off, they are now almost as fast as raw calls to ImageInput::read_scanlines or read_image. Autotile is still much slower, though much improved compared to before. And the combination of autotile and autoscanline has been nearly tripled in performance, and is now only incrementally slower than when autotile is off (that is, when the ImageCache reads the whole image as one tile).

So, how did I do it? Basically it broke down to three main improvements:

  1. I found some places (including ImageBuf's local storage and the tile memory in ImageCacheTile) where I'd used std::vector as a simple memory-managed buffer. This turned out to be unwise, because vector zeroes out the memory when you resize() to allocate (only to be I immediately filled with the real data afterwards). On modern architectures, touching memory unnecessarily (especially every byte of a buffer that's bigger than L2 cache!) is one of the most expensive things you can do. I made a custom scoped_array template in imagebuf.h that does the simple memory management I was after originally. (N.B. When we can count on C++11 everywhere, we can use unique_ptr<>, but for now I just rolled my own).
  2. The code that read scanlines for the autotile was using individual read_scanline calls, not read_scanlines. For OpenEXR in particular, this makes a big difference.
  3. The ImageCache was rounding tile sizes up to powers of 2 (including the full-image-tile, when autotile was off, and the virtual tile size when autotile was on, and the scanline width, when autotile was on and autoscanline was on). This is a historical artifact from the time when TextureSystem required power-of-2 tile sizes. It no longer does, so I removed the rounding. Note that for my file, which coincidentally was 2336 x 1198 (slightly above a power of 2 in each dimension), that was quite a lot of extra data and memory when rounded up, as well as causing various routines that has special fast cases for "contiguous stride" data layouts to end up on the slower paths for non-contiguous strides.

And there were a few other minor refactors with smaller effects, and some minor bug fixes for things that broke with these changes. Look at the individual commit comments if you care.

The final takeaway is:

  • After this is committed, you should expect much faster image read speeds if you are going through the ImageBuf or ImageCache interfaces, particularly for scanline (non-tiled) images, and especially for OpenEXR where there is an actual advantage for read_scanlines versus read_scanline.
  • If you are using autotile, you should also be using autoscanline.

@jeremyselan
Copy link
Contributor

This is awesome! Great speedups, and a great write up as well.

To confirm, this is binary incompatible, correct? (I note the change to scoped_array inside the ImageBuf public header). So is the assumptions this would be rolled into a new major version? (or the next major version).

(It's also interesting to note that if ImageBuf were implemented using the pimp pattern, this probably could have been done in a binary compatible manner).

@lgritz
Copy link
Collaborator Author

lgritz commented Jan 2, 2013

You're right: Because this breaks binary compatibility of public types, it just has to be a 1.2 feature, it just can't be backported to 1.1.

You're also correct that ImageBuf has become too complex to have the implementation in the public headers; it really needs PIMPL. I had avoided that just to make it as performant as possible rather than needing a function call and a pointer indirection for absolutely every method, even trivial get and sets, but maybe I shouldn't be worried about that in the grand scheme of things.

Should I make PIMPL-ifying IB a part of this review, or do it separately?

@jeremyselan
Copy link
Contributor

This commit looks good to me. (master branch only)

Definitely a separate review for pimpl-ing. We dont necessarily have to pimpl everything, so we should look at real performance numbers and see if the indirection hurts us on the little functions (getpixel, setpixel, etc). But there are a whole bunch of classes in OIIO which are not performance limited, we should pimpl those as well.

@lgritz
Copy link
Collaborator Author

lgritz commented Jan 2, 2013

OK, I will merge this and look into pmpling separately. Maybe there's a hybrid approach where a small number of POD-and-unlikely-to-change fields can be public and make the frequently-called access routines as fast as possible, and the tricky stuff can be pimpled without any performance impact.

lgritz added a commit that referenced this pull request Jan 2, 2013
ImageCache/ImageBuf read performance improvements
@lgritz lgritz merged commit 15deabd into AcademySoftwareFoundation:master Jan 2, 2013
@lgritz lgritz deleted the lg-icperf branch January 2, 2013 23:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants