
Improve the downsampling algorithm #15

Merged: 3 commits into main, Feb 20, 2024
Conversation

@KYovchevski (Collaborator) commented Jul 7, 2022

While preparing the talk for UU, I noticed that the quality of images produced by our downsampler is very low compared to other downsamplers. I took the time to research why, and it turned out that we aren't taking nearly enough samples when downsampling. For example, when sampling from 2048x2048 down to 512x512, we would always use a 6x6 kernel, while other samplers would use 12x12 and adapt that number further depending on the ratio between the source and target dimensions.

I also took inspiration from how other downsamplers handle working with large numbers of samples by caching some of the math for reuse.

The result is a new implementation which preserves image quality much better, but is about twice as slow. The performance can probably be improved by splitting the ISPC kernel into two - one for 3 channels and one for 4 channels - and doing the branch in Rust instead of relying on function pointers in ISPC. We might be able to squeeze out more performance with cache optimizations, but that needs further investigation.

The old implementation is kept in both ISPC and Rust, and can be invoked using downsample_fast.

ispc_module!(downsample_ispc);

// `WeightVariables` is a generated struct, so we cannot realistically add derivable traits to it.
// Because of this we disable the clippy warning.
Member:

You can do that with bindgen, iirc even for specific structs.

However, that requires extending the bindgen opts currently passed to ispc-rs, see also Twinklebear/ispc-rs#20.

Member:

Twinklebear/ispc-rs#22 you can now poke at bindgen::Builder!

Comment on lines +25 to +29
// Keep these because we need to keep them in memory
_starts: Vec<u32>,
_weight_counts: Vec<u32>,
_weights: Vec<Rc<Vec<f32>>>,
_weights_ptrs: Vec<*const f32>,
Member:

I think these should be Pin for that to be allowed?
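For reference, Pin is arguably not needed here as long as the Rc handles are kept alive: a Vec's heap allocation does not move when the Vec (or the struct owning it) is moved, so pointers into that allocation stay valid. A minimal sketch under that assumption (hypothetical names, not the PR's actual struct):

```rust
use std::rc::Rc;

// Hypothetical mirror of the PR's cache: raw pointers into heap buffers
// stay valid as long as the owning `Rc<Vec<f32>>` handles are kept alive.
// `Pin` would only matter for self-referential data stored inline,
// not for pointers into a separate heap allocation.
struct WeightCache {
    _weights: Vec<Rc<Vec<f32>>>,  // keeps the buffers alive
    weight_ptrs: Vec<*const f32>, // what gets handed to ISPC
}

fn build_cache(rows: Vec<Vec<f32>>) -> WeightCache {
    let weights: Vec<Rc<Vec<f32>>> = rows.into_iter().map(Rc::new).collect();
    let weight_ptrs = weights.iter().map(|w| w.as_ptr()).collect();
    WeightCache { _weights: weights, weight_ptrs }
}

fn main() {
    let cache = build_cache(vec![vec![0.25f32, 0.5, 0.25]]);
    // Move the cache; the heap data (and thus the pointer) is unaffected.
    let moved = cache;
    assert_eq!(unsafe { *moved.weight_ptrs[0] }, 0.25);
}
```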

src/ispc/mod.rs Outdated
})
}

pub(crate) fn ispc_representation(&self) -> downsample_ispc::WeightCollection {
Member:

By returning a borrow here the borrow checker can help you preserve the lifetime of the pointers a bit better, since WeightCollection can never outlive &self.
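A sketch of how that borrow-tying could look (hypothetical types; the generated WeightCollection would need a lifetime attached, e.g. via PhantomData, for this to apply):

```rust
use std::marker::PhantomData;

// Hypothetical FFI-facing struct carrying a lifetime parameter so the
// borrow checker guarantees it cannot outlive the Rust-side storage
// its pointer refers to.
struct WeightCollectionRepr<'a> {
    weights: *const f32,
    len: u32,
    _marker: PhantomData<&'a [f32]>,
}

struct Weights {
    data: Vec<f32>,
}

impl Weights {
    // Borrowing `&self` means the returned repr cannot outlive `self`.
    fn ispc_representation(&self) -> WeightCollectionRepr<'_> {
        WeightCollectionRepr {
            weights: self.data.as_ptr(),
            len: self.data.len() as u32,
            _marker: PhantomData,
        }
    }
}

fn main() {
    let w = Weights { data: vec![0.25f32, 0.5, 0.25] };
    let repr = w.ispc_representation();
    assert_eq!(repr.len, 3);
    assert_eq!(unsafe { *repr.weights }, 0.25);
}
```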

src/lib.rs Outdated (resolved)
src/ispc/kernels/lanczos3.ispc Outdated (resolved)
}

uniform WeightCollection * uniform vertical_weight_collection = &cache->vertical_weights;
uniform WeightCollection * uniform horizontal_weight_collection = &cache->horizontal_weights;
Member:

If grabbing pointers of these anyway, it is perhaps easier to just pass them as two loose pointer arguments? That'll help with lifetime management on the Rust side too :)

src/lib.rs Outdated (resolved)

let mut res = Vec::with_capacity(target as usize);

let mut reuse_heap = HashMap::<_, Rc<Vec<f32>>>::with_capacity(target as usize / 2);
Member:

Is this something we should statically cache (https://crates.io/crates/once_cell) so that subsequent downsample() calls benefit from it? And/or put it in the public signature behind an opaque type so that the caller controls how the cache is shared - and when it is destroyed to free up memory?
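A sketch of the static-caching idea using std::sync::OnceLock (the once_cell API that later landed in std), with a stand-in weight computation since the real Lanczos math lives in the PR:

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Process-wide cache keyed by (src, target) so repeated downsample()
// calls with the same dimensions reuse the computed weights.
fn weight_cache() -> &'static Mutex<HashMap<(u32, u32), Vec<f32>>> {
    static CACHE: OnceLock<Mutex<HashMap<(u32, u32), Vec<f32>>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

// Stand-in for the real Lanczos weight computation.
fn compute_weights(src: u32, target: u32) -> Vec<f32> {
    vec![src as f32 / target as f32]
}

fn cached_weights(src: u32, target: u32) -> Vec<f32> {
    let mut cache = weight_cache().lock().unwrap();
    cache
        .entry((src, target))
        .or_insert_with(|| compute_weights(src, target))
        .clone()
}

fn main() {
    // Second call hits the cache instead of recomputing.
    assert_eq!(cached_weights(2048, 512), vec![4.0f32]);
    assert_eq!(cached_weights(2048, 512), vec![4.0f32]);
}
```

An opaque cache handle in the public API (as the comment suggests) would avoid the global state and let the caller decide when to drop it.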

.gitignore Outdated
Comment on lines 1 to 3
# Generated by Cargo
# will have compiled files and executables
/.vscode/
/target/
Member:

Nit: This comment doesn't really apply to /.vscode/ 😬

export void resample_with_cache(uniform uint src_width, uniform uint src_height, uniform uint target_width, uniform uint target_height, uniform uint8 num_channels,
export void resample_with_cache_3(uniform uint src_width, uniform uint src_height, uniform uint target_width, uniform uint target_height,
uniform const Cache * uniform cache, uniform uint8 scratch_space[], uniform const uint8 src_data[], uniform uint8 out_data[]) {
// TODO[#Kamen]: Ideally, we should split this function into two versions depending on channel count, and branch only once in Rust rather than twice per sample.
@MarijnS95 (Member) commented Jul 15, 2022:

Looks like you did that now.

Anyway, is this even faster than the:

void resample_with_cache(uniform const uint num_channels, ...) {}

void resample_with_cache_3(...) {
   resample_with_cache(3, ...);
}

void resample_with_cache_4(...) {
   resample_with_cache(4, ...);
}

that we discussed and prototyped before?

Do we instead have to duplicate the entire body of resample_with_cache()? Implying ISPC doesn't do constant folding? What was the perf difference?

Collaborator (Author):

No, when we did the testing, the results remained the same. Duplicating the function and changing only the number of channels we read from/write to does, however, result in a ~20% performance gain.

Member:

Wow, that's sad but a worthy tradeoff for this duplication.
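For comparison, on the Rust side this kind of channel-count specialization can be expressed without textual duplication via const generics, where the count is a compile-time constant and the per-sample branch disappears; a hypothetical sketch, not the PR's code:

```rust
// Hypothetical Rust analogue of the "duplicate per channel count"
// specialization: `CHANNELS` is a compile-time constant, so the compiler
// monomorphizes one body per count and can fully unroll the loop.
fn sum_pixel<const CHANNELS: usize>(texel: &[u8]) -> u32 {
    let mut acc = 0u32;
    for c in 0..CHANNELS {
        acc += texel[c] as u32;
    }
    acc
}

fn main() {
    let rgba = [10u8, 20, 30, 40];
    // Branch once at the call site instead of once per sample:
    assert_eq!(sum_pixel::<3>(&rgba), 60);
    assert_eq!(sum_pixel::<4>(&rgba), 100);
}
```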

@KYovchevski (Collaborator, Author):

I did some testing with what affects the performance for the test case we have (3 channels, 2048x2048 -> 512x512), and have made some interesting observations.

  • The branching when reading/writing to the pixel buffers does not seem to cause as significant a performance drop as first thought. Instead, the drop was caused by the memory reinterpretation in sample_3_channels and clean_and_write_3_channels, where a float<4> was interpreted as a float<3> to use a single write. Skipping this reinterpretation and giving the 3-channel and 4-channel versions different signatures causes a significant speed-up. This does mean, however, that the function needs both a float<3> and a float<4> that can be written to in the branch.
  • Making resample_with_cache inline causes a performance decrease of about 10% for our test case. This includes using ISPC's assume hints, which should ensure the branches are removed.

c.bench_function("Downsample `square_test.png` using resize", |b| {
b.iter(|| {
let mut dst = vec![RGB::new(0, 0, 0); target_width * target_height];
Member:

We don't really want to benchmark allocation as part of the resize test, do we? Maybe not even instantiating the resizer (which we might want to benchmark separately, since IIRC it pre-computes weight tables).
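A dependency-free sketch of hoisting the allocation out of the timed region (criterion's iter_batched offers the same via a setup closure that is excluded from the measurement); names here are illustrative:

```rust
use std::time::{Duration, Instant};

// Allocate the destination buffer once outside the timed loop so the
// measurement covers only the resize work, not the allocation.
fn time_resize<F: FnMut(&mut [u8])>(iters: u32, dst_len: usize, mut resize: F) -> Duration {
    let mut dst = vec![0u8; dst_len]; // hoisted out of the timed region
    let start = Instant::now();
    for _ in 0..iters {
        resize(&mut dst);
    }
    start.elapsed()
}

fn main() {
    let mut calls = 0u32;
    let elapsed = time_resize(5, 16, |dst| {
        dst[0] = dst[0].wrapping_add(1);
        calls += 1;
    });
    assert_eq!(calls, 5);
    let _ = elapsed;
}
```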

struct Image {
uniform const uint8* data;
uniform int<2> size;
struct WeightVariables {
Member:

Do we have a more descriptive name than "variables"? This has to do with image size/resolution, right?

float absf = abs(f);
return absf - floor(absf);
export void calculate_weights(uniform float image_scale, uniform float filter_scale, uniform const WeightVariables * uniform vars, uniform float * uniform weights) {

Member:
Suggested change


col += w * texel;
weight += w;
void clean_and_write_3_channels(varying float<3> color, varying uint64 write_address, uniform uint8* varying dst) {
Member:

Here and below: write_address and dst aren't used separately: can you combine them into one argument and let the caller perform scratch_space + scratch_write_address?


col += w * texel;
weight += w;
void clean_and_write_3_channels(varying float<3> color, varying uint64 write_address, uniform uint8* varying dst) {
Member:

s/clean/clamp?

(v.src_center - v.src_start).to_ne_bytes(),
);

let reused = reuse_heap.get(&reuse_key);
Member:

Nit: there should be a bunch of functions that help with efficiently retrieving or inserting a new item, without doing the lookup multiple times.
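The helper being alluded to is HashMap's entry API, which performs the lookup once and inserts on a miss; a sketch with illustrative key/value types:

```rust
use std::collections::HashMap;
use std::rc::Rc;

// `entry` hashes the key once; `or_insert_with` only runs the closure
// on a miss, replacing a separate `get` followed by `insert`.
fn get_or_compute(
    reuse_heap: &mut HashMap<Vec<u8>, Rc<Vec<f32>>>,
    reuse_key: Vec<u8>,
    compute: impl FnOnce() -> Vec<f32>,
) -> Rc<Vec<f32>> {
    reuse_heap
        .entry(reuse_key)
        .or_insert_with(|| Rc::new(compute()))
        .clone()
}

fn main() {
    let mut heap = HashMap::new();
    let a = get_or_compute(&mut heap, vec![1u8], || vec![0.5f32]);
    // Second lookup with the same key returns the cached Rc; the
    // compute closure is never called.
    let b = get_or_compute(&mut heap, vec![1u8], || unreachable!());
    assert!(Rc::ptr_eq(&a, &b));
}
```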

///
/// Will panic if the target width or height are higher than that of the source image.
/// For a more fine-tunable version of this function, see [downsample_with_custom_scale].
Member:

Suggested change
/// For a more fine-tunable version of this function, see [downsample_with_custom_scale].
/// For a more fine-tunable version of this function, see [`downsample_with_custom_scale()`].

downsample_with_custom_scale(src, target_width, target_height, 3.0)
}

/// Version of [downsample] which allows for a custom filter scale, thus trading between speed and final image quality.
Member:

Suggested change
/// Version of [downsample] which allows for a custom filter scale, thus trading between speed and final image quality.
/// Version of [`downsample()`] which allows for a custom filter scale, thus trading between speed and final image quality.

/// The higher the scale, the more detail is preserved, but the slower the downsampling is. Note that the effect on the detail becomes smaller the higher the scale is.
///
/// As a guideline, a `filter_scale` of 3.0 preserves detail well.
/// A scale of 1.0 preserves is good if speed is necessary, but still preserves a decent amount of detail.
Member:

preserves or is good?

src/lib.rs Outdated
Comment on lines 159 to 164
// The new implementation needs a src_height * target_width intermediate buffer.
let mut scratch_space = Vec::new();
scratch_space.resize(
(src.height * target_width * src.format.num_channels() as u32) as usize,
0u8,
);
Member:

vec![...; 0u8]?
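Spelled out, the suggestion replaces the Vec::new() plus resize(...) pair with a single vec! expression that allocates and zero-fills in one step (function name hypothetical):

```rust
// One-step allocation of the src_height * target_width intermediate
// buffer, equivalent to `Vec::new()` followed by `resize(len, 0u8)`.
fn scratch_space(src_height: u32, target_width: u32, num_channels: u32) -> Vec<u8> {
    vec![0u8; (src_height * target_width * num_channels) as usize]
}

fn main() {
    let s = scratch_space(512, 512, 3);
    assert_eq!(s.len(), 512 * 512 * 3);
    assert!(s.iter().all(|&b| b == 0));
}
```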

@MarijnS95 (Member) commented Feb 28, 2023

Since the gist of this PR is improving quality - and if I understand correctly using weight "caching" to get there without outlandish times - how does it fare against resize from a quality perspective? main is already slower than resize (45ms vs 37ms) and this PR bumps us to 61ms:

Downsample `square_test.png` using ispc_downsampler                                                                            
                        time:   [61.593 ms 61.710 ms 61.838 ms]
                        change: [+36.286% +36.552% +36.809%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Downsample `square_test.png` using resize                                                                            
                        time:   [37.770 ms 37.803 ms 37.839 ms]
                        change: [-0.2602% -0.1511% -0.0328%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 19 outliers among 100 measurements (19.00%)
  7 (7.00%) high mild
  12 (12.00%) high severe

EDIT: The win is mostly in debug/dev profiles:

$ cargo bench --profile dev
...
Downsample `square_test.png` using ispc_downsampler                                                                            
                        time:   [108.09 ms 108.13 ms 108.17 ms]
                        change: [+74.846% +75.217% +75.549%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

Benchmarking Downsample `square_test.png` using resize: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 120.5s, or reduce sample count to 10.
Downsample `square_test.png` using resize                                                                            
                        time:   [1.1973 s 1.1994 s 1.2013 s]
                        change: [+3066.5% +3072.6% +3078.3%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 23 outliers among 100 measurements (23.00%)
  8 (8.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  10 (10.00%) high severe

src/lib.rs Outdated

pub(crate) fn calculate_weights(src: u32, target: u32, filter_scale: f32) -> Vec<CachedWeight> {
assert!(
src > target,
Contributor:

We allow resampling in only width or height, but this assertion seems to break that principle.
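A sketch of a per-axis check that permits single-axis downsampling, requiring each target dimension to be at most the source and at least one to be strictly smaller (hypothetical helper, not the PR's code):

```rust
// Per-axis dimension check: allows e.g. 2048x2048 -> 2048x512, but
// still rejects upsampling and a no-op where both dimensions match.
fn assert_downsample_dims(src_w: u32, src_h: u32, target_w: u32, target_h: u32) {
    assert!(
        target_w <= src_w && target_h <= src_h,
        "target dimensions must not exceed the source"
    );
    assert!(
        target_w < src_w || target_h < src_h,
        "at least one target dimension must be smaller than the source"
    );
}

fn main() {
    // Single-axis downsampling is allowed in either direction.
    assert_downsample_dims(2048, 2048, 2048, 512);
    assert_downsample_dims(2048, 2048, 512, 2048);
}
```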

Add code to calculate the coefficients for the filter separate from the sampling

Add unit test for coefficient correctness. Improvements and fixes

Implement resampling based on cached weights

Optimize buffer read/writes for 3-channels

Remove debug code from lib.rs

Fix incorrect color clamping

Remove commented out code

Wrapped ispc pointer code into functions

Add getter and setter functions for 4 channels

Add support for 4 channel images using function pointers

Also tested it out with branching, but the performance was the same

Remove debug code. Add documentation for new downsampling function

Change resize crate dependency

Change back to previous test output name

Update build.rs to include new functions

Replace function pointers with branching

Split downsample function into two versions depending on channel count

Remove old downsampling functions. Add filter_scale variable that allows the kernel to be scaled.

This way the user can trade performance for detail, and the other way around

Remove old ISPC function name from build.rs

Remove duplicate function

Remove a skippable write in the clean_and_write ISPC functions

Cargo fmt

Fix incorrect weight

Fix incorrect output address calculation

Fix benches

Update binaries

Some cleanup

More cleanup

Missed
@KYovchevski KYovchevski merged commit bcd423a into main Feb 20, 2024
10 checks passed
@MarijnS95 MarijnS95 deleted the caching branch March 5, 2024 21:49