
Improve the downsampling algorithm #15

Merged: 3 commits into main, Feb 20, 2024
Conversation

@KYovchevski (Collaborator) commented Jul 7, 2022

While preparing the talk for UU, I noticed that the quality of images produced by our downsampler is very low compared to other downsamplers. I took the time to research why, and it turned out that we aren't taking nearly enough samples when downsampling. For example, when sampling from 2048x2048 down to 512x512, we would always use a 6x6 kernel, while other samplers would use 12x12 and adapt that number further depending on the ratio between the source and target dimensions.

I also took inspiration from how other downsamplers handle working with large numbers of samples by caching some of the math for reuse.

The result is a new implementation which preserves image quality much better, but is about twice as slow. The performance can probably be improved by splitting the ISPC kernel into two - one for 3 channels and one for 4 channels - and doing the branch in Rust instead of relying on function pointers in ISPC. We might be able to squeeze out more performance with cache optimizations, but that needs further investigation.

The old implementation is kept in both ISPC and Rust, and can be invoked using downsample_fast.

ispc_module!(downsample_ispc);

// `WeightVariables` is a generated struct, so we cannot realistically add derivable traits to it.
// Because of this we disable the clippy warning.
Member:

You can do that with bindgen, iirc even for specific structs.

However, that requires extending the bindgen opts currently passed to ispc-rs, see also Twinklebear/ispc-rs#20.

Member:

Twinklebear/ispc-rs#22 you can now poke at bindgen::Builder!

Comment on lines +25 to +29
// Keep these because we need to keep them in memory
_starts: Vec<u32>,
_weight_counts: Vec<u32>,
_weights: Vec<Rc<Vec<f32>>>,
_weights_ptrs: Vec<*const f32>,
Member:

I think these should be Pin for that to be allowed?
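For reference, Pin is arguably not needed here as long as the Rc handles are kept alive: a Vec's heap allocation does not move when the Vec (or the struct owning it) is moved, so pointers into that allocation stay valid. A minimal sketch under that assumption (hypothetical names, not the PR's actual struct):

```rust
use std::rc::Rc;

// Hypothetical mirror of the PR's cache: raw pointers into heap buffers
// stay valid as long as the owning `Rc<Vec<f32>>` handles are kept alive.
// `Pin` would only matter for self-referential data stored inline,
// not for pointers into a separate heap allocation.
struct WeightCache {
    _weights: Vec<Rc<Vec<f32>>>,  // keeps the buffers alive
    weight_ptrs: Vec<*const f32>, // what gets handed to ISPC
}

fn build_cache(rows: Vec<Vec<f32>>) -> WeightCache {
    let weights: Vec<Rc<Vec<f32>>> = rows.into_iter().map(Rc::new).collect();
    let weight_ptrs = weights.iter().map(|w| w.as_ptr()).collect();
    WeightCache { _weights: weights, weight_ptrs }
}

fn main() {
    let cache = build_cache(vec![vec![0.25f32, 0.5, 0.25]]);
    // Move the cache; the heap data (and thus the pointer) is unaffected.
    let moved = cache;
    assert_eq!(unsafe { *moved.weight_ptrs[0] }, 0.25);
}
```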

src/ispc/mod.rs Outdated
})
}

pub(crate) fn ispc_representation(&self) -> downsample_ispc::WeightCollection {
Member:

By returning a borrow here the borrow checker can help you preserve the lifetime of the pointers a bit better, since WeightCollection can never outlive &self.
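A sketch of how that borrow-tying could look (hypothetical types; the generated WeightCollection would need a lifetime attached, e.g. via PhantomData, for this to apply):

```rust
use std::marker::PhantomData;

// Hypothetical FFI-facing struct carrying a lifetime parameter so the
// borrow checker guarantees it cannot outlive the Rust-side storage
// its pointer refers to.
struct WeightCollectionRepr<'a> {
    weights: *const f32,
    len: u32,
    _marker: PhantomData<&'a [f32]>,
}

struct Weights {
    data: Vec<f32>,
}

impl Weights {
    // Borrowing `&self` means the returned repr cannot outlive `self`.
    fn ispc_representation(&self) -> WeightCollectionRepr<'_> {
        WeightCollectionRepr {
            weights: self.data.as_ptr(),
            len: self.data.len() as u32,
            _marker: PhantomData,
        }
    }
}

fn main() {
    let w = Weights { data: vec![0.25f32, 0.5, 0.25] };
    let repr = w.ispc_representation();
    assert_eq!(repr.len, 3);
    assert_eq!(unsafe { *repr.weights }, 0.25);
}
```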

src/lib.rs Outdated (resolved)
src/ispc/kernels/lanczos3.ispc Outdated (resolved)
}

uniform WeightCollection * uniform vertical_weight_collection = &cache->vertical_weights;
uniform WeightCollection * uniform horizontal_weight_collection = &cache->horizontal_weights;
Member:

If grabbing pointers of these anyway, it is perhaps easier to just pass them as two loose pointer arguments? That'll help with lifetime management on the Rust side too :)

src/lib.rs Outdated (resolved)

let mut res = Vec::with_capacity(target as usize);

let mut reuse_heap = HashMap::<_, Rc<Vec<f32>>>::with_capacity(target as usize / 2);
Member:

Is this something we should statically cache (https://crates.io/crates/once_cell) so that subsequent downsample() calls benefit from it? And/or put it in the public signature behind an opaque type so that the caller controls how the cache is shared - and when it is destroyed to free up memory?
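A sketch of the static-caching idea using std::sync::OnceLock (the once_cell API that later landed in std), with a stand-in weight computation since the real Lanczos math lives in the PR:

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Process-wide cache keyed by (src, target) so repeated downsample()
// calls with the same dimensions reuse the computed weights.
fn weight_cache() -> &'static Mutex<HashMap<(u32, u32), Vec<f32>>> {
    static CACHE: OnceLock<Mutex<HashMap<(u32, u32), Vec<f32>>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

// Stand-in for the real Lanczos weight computation.
fn compute_weights(src: u32, target: u32) -> Vec<f32> {
    vec![src as f32 / target as f32]
}

fn cached_weights(src: u32, target: u32) -> Vec<f32> {
    let mut cache = weight_cache().lock().unwrap();
    cache
        .entry((src, target))
        .or_insert_with(|| compute_weights(src, target))
        .clone()
}

fn main() {
    // Second call hits the cache instead of recomputing.
    assert_eq!(cached_weights(2048, 512), vec![4.0f32]);
    assert_eq!(cached_weights(2048, 512), vec![4.0f32]);
}
```

An opaque cache handle in the public API (as the comment suggests) would avoid the global state and let the caller decide when to drop it.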

.gitignore Outdated
Comment on lines 1 to 3
# Generated by Cargo
# will have compiled files and executables
/.vscode/
/target/
Member:

Nit: This comment doesn't really apply to /.vscode/ 😬

export void resample_with_cache(uniform uint src_width, uniform uint src_height, uniform uint target_width, uniform uint target_height, uniform uint8 num_channels,
export void resample_with_cache_3(uniform uint src_width, uniform uint src_height, uniform uint target_width, uniform uint target_height,
uniform const Cache * uniform cache, uniform uint8 scratch_space[], uniform const uint8 src_data[], uniform uint8 out_data[]) {
// TODO[#Kamen]: Ideally, we should split this function into two versions depending on channel count, and branch only once in Rust rather than twice per sample.
@MarijnS95 (Member) commented Jul 15, 2022:

Looks like you did that now.

Anyway, is this even faster than the:

void resample_with_cache(uniform const uint num_channels, ...) {}

void resample_with_cache_3(...) {
   resample_with_cache(3, ...);
}

void resample_with_cache_4(...) {
   resample_with_cache(4, ...);
}

that we discussed and prototyped before?

Do we instead have to duplicate the entire body of resample_with_cache()? Implying ISPC doesn't do constant folding? What was the perf difference?

Collaborator (Author):

No, when we did the testing, the results remained the same. Duplicating the function and changing only the number of channels we read from/write to does, however, result in a ~20% performance gain.

Member:

Wow, that's sad but a worthy tradeoff for this duplication.
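For comparison, on the Rust side this kind of channel-count specialization can be expressed without textual duplication via const generics, where the count is a compile-time constant and the per-sample branch disappears; a hypothetical sketch, not the PR's code:

```rust
// Hypothetical Rust analogue of the "duplicate per channel count"
// specialization: `CHANNELS` is a compile-time constant, so the compiler
// monomorphizes one body per count and can fully unroll the loop.
fn sum_pixel<const CHANNELS: usize>(texel: &[u8]) -> u32 {
    let mut acc = 0u32;
    for c in 0..CHANNELS {
        acc += texel[c] as u32;
    }
    acc
}

fn main() {
    let rgba = [10u8, 20, 30, 40];
    // Branch once at the call site instead of once per sample:
    assert_eq!(sum_pixel::<3>(&rgba), 60);
    assert_eq!(sum_pixel::<4>(&rgba), 100);
}
```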

@KYovchevski (Collaborator, Author):

I did some testing with what affects the performance for the test case we have (3 channels, 2048x2048 -> 512x512), and have made some interesting observations.

  • The branching when reading/writing to the pixel buffers does not seem to cause as significant a performance drop as first thought. Instead, the drop was caused by the memory reinterpretation in sample_3_channels and clean_and_write_3_channels, where a float<4> was interpreted as a float<3> to use a single write. Skipping this reinterpretation and giving the 3-channel and 4-channel versions different signatures causes a significant speed-up. This does mean, however, that the function needs both a float<3> and a float<4> that can be written to in the branch.
  • Making resample_with_cache inline causes a performance decrease of about 10% for our test case. This includes using ISPC's assume hints, which should ensure the branches are removed.

c.bench_function("Downsample `square_test.png` using resize", |b| {
b.iter(|| {
let mut dst = vec![RGB::new(0, 0, 0); target_width * target_height];
Member:

We don't really want to benchmark allocation as part of the resize test, do we? Maybe not even instantiating the resizer (which we might want to benchmark separately, since IIRC it pre-computes weight tables).
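A dependency-free sketch of hoisting the allocation out of the timed region (criterion's iter_batched offers the same via a setup closure that is excluded from the measurement); names here are illustrative:

```rust
use std::time::{Duration, Instant};

// Allocate the destination buffer once outside the timed loop so the
// measurement covers only the resize work, not the allocation.
fn time_resize<F: FnMut(&mut [u8])>(iters: u32, dst_len: usize, mut resize: F) -> Duration {
    let mut dst = vec![0u8; dst_len]; // hoisted out of the timed region
    let start = Instant::now();
    for _ in 0..iters {
        resize(&mut dst);
    }
    start.elapsed()
}

fn main() {
    let mut calls = 0u32;
    let elapsed = time_resize(5, 16, |dst| {
        dst[0] = dst[0].wrapping_add(1);
        calls += 1;
    });
    assert_eq!(calls, 5);
    let _ = elapsed;
}
```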

struct Image {
uniform const uint8* data;
uniform int<2> size;
struct WeightVariables {
Member:

Do we have a more descriptive name than "variables"? This has to do with image size/resolution, right?

float absf = abs(f);
return absf - floor(absf);
export void calculate_weights(uniform float image_scale, uniform float filter_scale, uniform const WeightVariables * uniform vars, uniform float * uniform weights) {

Member:
Suggested change


col += w * texel;
weight += w;
void clean_and_write_3_channels(varying float<3> color, varying uint64 write_address, uniform uint8* varying dst) {
Member:

Here and below: write_address and dst aren't used separately: can you combine them into one argument and let the caller perform scratch_space + scratch_write_address?


col += w * texel;
weight += w;
void clean_and_write_3_channels(varying float<3> color, varying uint64 write_address, uniform uint8* varying dst) {
Member:

s/clean/clamp?

(v.src_center - v.src_start).to_ne_bytes(),
);

let reused = reuse_heap.get(&reuse_key);
Member:

Nit: there should be a bunch of functions that help with efficiently retrieving or inserting a new item, without doing the lookup multiple times.
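The helper being alluded to is HashMap's entry API, which performs the lookup once and inserts on a miss; a sketch with illustrative key/value types:

```rust
use std::collections::HashMap;
use std::rc::Rc;

// `entry` hashes the key once; `or_insert_with` only runs the closure
// on a miss, replacing a separate `get` followed by `insert`.
fn get_or_compute(
    reuse_heap: &mut HashMap<Vec<u8>, Rc<Vec<f32>>>,
    reuse_key: Vec<u8>,
    compute: impl FnOnce() -> Vec<f32>,
) -> Rc<Vec<f32>> {
    reuse_heap
        .entry(reuse_key)
        .or_insert_with(|| Rc::new(compute()))
        .clone()
}

fn main() {
    let mut heap = HashMap::new();
    let a = get_or_compute(&mut heap, vec![1u8], || vec![0.5f32]);
    // Second lookup with the same key returns the cached Rc; the
    // compute closure is never called.
    let b = get_or_compute(&mut heap, vec![1u8], || unreachable!());
    assert!(Rc::ptr_eq(&a, &b));
}
```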

///
/// Will panic if the target width or height are higher than that of the source image.
/// For a more fine-tunable version of this function, see [downsample_with_custom_scale].
Member:

Suggested change
/// For a more fine-tunable version of this function, see [downsample_with_custom_scale].
/// For a more fine-tunable version of this function, see [`downsample_with_custom_scale()`].

downsample_with_custom_scale(src, target_width, target_height, 3.0)
}

/// Version of [downsample] which allows for a custom filter scale, thus trading between speed and final image quality.
Member:

Suggested change
/// Version of [downsample] which allows for a custom filter scale, thus trading between speed and final image quality.
/// Version of [`downsample()`] which allows for a custom filter scale, thus trading between speed and final image quality.

/// The higher the scale, the more detail is preserved, but the slower the downsampling is. Note that the effect on the detail becomes smaller the higher the scale is.
///
/// As a guideline, a `filter_scale` of 3.0 preserves detail well.
/// A scale of 1.0 preserves is good if speed is necessary, but still preserves a decent amount of detail.
Member:

preserves or is good?

src/lib.rs Outdated
Comment on lines 159 to 164
// The new implementation needs a src_height * target_width intermediate buffer.
let mut scratch_space = Vec::new();
scratch_space.resize(
(src.height * target_width * src.format.num_channels() as u32) as usize,
0u8,
);
Member:

vec![...; 0u8]?
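Spelled out, the suggestion replaces the Vec::new() plus resize(...) pair with a single vec! expression that allocates and zero-fills in one step (function name hypothetical):

```rust
// One-step allocation of the src_height * target_width intermediate
// buffer, equivalent to `Vec::new()` followed by `resize(len, 0u8)`.
fn scratch_space(src_height: u32, target_width: u32, num_channels: u32) -> Vec<u8> {
    vec![0u8; (src_height * target_width * num_channels) as usize]
}

fn main() {
    let s = scratch_space(512, 512, 3);
    assert_eq!(s.len(), 512 * 512 * 3);
    assert!(s.iter().all(|&b| b == 0));
}
```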

@MarijnS95 (Member) commented Feb 28, 2023

Since the gist of this PR is improving quality - and if I understand correctly using weight "caching" to get there without outlandish times - how does it fare against resize from a quality perspective? main is already slower than resize (45ms vs 37ms) and this PR bumps us to 61ms:

Downsample `square_test.png` using ispc_downsampler                                                                            
                        time:   [61.593 ms 61.710 ms 61.838 ms]
                        change: [+36.286% +36.552% +36.809%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Downsample `square_test.png` using resize                                                                            
                        time:   [37.770 ms 37.803 ms 37.839 ms]
                        change: [-0.2602% -0.1511% -0.0328%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 19 outliers among 100 measurements (19.00%)
  7 (7.00%) high mild
  12 (12.00%) high severe

EDIT: The win is mostly in debug/dev profiles:

$ cargo bench --profile dev
...
Downsample `square_test.png` using ispc_downsampler                                                                            
                        time:   [108.09 ms 108.13 ms 108.17 ms]
                        change: [+74.846% +75.217% +75.549%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

Benchmarking Downsample `square_test.png` using resize: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 120.5s, or reduce sample count to 10.
Downsample `square_test.png` using resize                                                                            
                        time:   [1.1973 s 1.1994 s 1.2013 s]
                        change: [+3066.5% +3072.6% +3078.3%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 23 outliers among 100 measurements (23.00%)
  8 (8.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  10 (10.00%) high severe

src/lib.rs Outdated

pub(crate) fn calculate_weights(src: u32, target: u32, filter_scale: f32) -> Vec<CachedWeight> {
assert!(
src > target,
Contributor:

We allow resampling in only width or height, but this assertion seems to break that principle.
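A sketch of a per-axis check that permits single-axis downsampling, requiring each target dimension to be at most the source and at least one to be strictly smaller (hypothetical helper, not the PR's code):

```rust
// Per-axis dimension check: allows e.g. 2048x2048 -> 2048x512, but
// still rejects upsampling and a no-op where both dimensions match.
fn assert_downsample_dims(src_w: u32, src_h: u32, target_w: u32, target_h: u32) {
    assert!(
        target_w <= src_w && target_h <= src_h,
        "target dimensions must not exceed the source"
    );
    assert!(
        target_w < src_w || target_h < src_h,
        "at least one target dimension must be smaller than the source"
    );
}

fn main() {
    // Single-axis downsampling is allowed in either direction.
    assert_downsample_dims(2048, 2048, 2048, 512);
    assert_downsample_dims(2048, 2048, 512, 2048);
}
```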

Add code to calculate the coefficients for the filter separate from the sampling

Add unit test for coefficient correctness. Improvements and fixes

Implement resampling based on cached weights

Optimize buffer read/writes for 3-channels

Remove debug code from lib.rs

Fix incorrect color clamping

Remove commented out code

Wrapped ispc pointer code into functions

Add getter and setter functions for 4 channels

Add support for 4 channel images using function pointers

Also tested it out with branching, but the performance was the same

Remove debug code. Add documentation for new downsampling function

Change resize crate dependency

Change back to previous test output name

Update build.rs to include new functions

Replace function pointers with branching

Split downsample function into two versions depending on channel count

Remove old downsampling functions. Add filter_scale variable that allows the kernel to be scaled.

This way the user can trade performance for detail, and the other way around

Remove old ISPC function name from build.rs

Remove duplicate function

Remove a skippable write in the clean_and_write ISPC functions

Cargo fmt

Fix incorrect weight

Fix incorrect output address calculation

Fix benches

Update binaries

Some cleanup

More cleanup

Missed
@KYovchevski KYovchevski merged commit bcd423a into main Feb 20, 2024
10 checks passed
@MarijnS95 MarijnS95 deleted the caching branch March 5, 2024 21:49