-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the downsampling algorithm #15
Conversation
ispc_module!(downsample_ispc); | ||
|
||
// `WeightVariables` is a generated struct, so we cannot realistically add derivable traits to it. | ||
// Because of this we disable the clippy warning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do that with bindgen, iirc even for specific structs.
However, that requires extending the bindgen opts currently passed to ispc-rs, see also Twinklebear/ispc-rs#20.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Twinklebear/ispc-rs#22 you can now poke at bindgen::Builder
!
// Keep these because we need to keep them in memory | ||
_starts: Vec<u32>, | ||
_weight_counts: Vec<u32>, | ||
_weights: Vec<Rc<Vec<f32>>>, | ||
_weights_ptrs: Vec<*const f32>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these should be Pin
for that to be allowed?
src/ispc/mod.rs
Outdated
}) | ||
} | ||
|
||
pub(crate) fn ispc_representation(&self) -> downsample_ispc::WeightCollection { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By returning a borrow here the borrow checker can help you preserve the lifetime of the pointers a bit better, since WeightCollection
can never outlive &self
.
src/ispc/kernels/lanczos3.ispc
Outdated
} | ||
|
||
uniform WeightCollection * uniform vertical_weight_collection = &cache->vertical_weights; | ||
uniform WeightCollection * uniform horizontal_weight_collection = &cache->horizontal_weights; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If grabbing pointers of these anyway, it is perhaps easier to just pass them as two loose pointer arguments? That'll help with lifetime management on the Rust side too :)
|
||
let mut res = Vec::with_capacity(target as usize); | ||
|
||
let mut reuse_heap = HashMap::<_, Rc<Vec<f32>>>::with_capacity(target as usize / 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It this something we should static
ally cache (https://crates.io/crates/once_cell) so that subsequent downsample()
calls benefit from it? And/or put it in the public signature behind an opaque function so that the caller is in control over how to share - and when to destroy to free up memory - the cache?
.gitignore
Outdated
# Generated by Cargo | ||
# will have compiled files and executables | ||
/.vscode/ | ||
/target/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This comment doesn't really apply to /.vscode/
😬
src/ispc/kernels/lanczos3.ispc
Outdated
export void resample_with_cache(uniform uint src_width, uniform uint src_height, uniform uint target_width, uniform uint target_height, uniform uint8 num_channels, | ||
export void resample_with_cache_3(uniform uint src_width, uniform uint src_height, uniform uint target_width, uniform uint target_height, | ||
uniform const Cache * uniform cache, uniform uint8 scratch_space[], uniform const uint8 src_data[], uniform uint8 out_data[]) { | ||
// TODO[#Kamen]: Ideally, we should split this function into two versions depending channel count, and branch only once in Rust than twice per sample. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you did that now.
Anyway, is this even faster than the:
void resample_with_cache(uniform const uint num_channels, ...) {}
void resample_with_cache_3(...) {
resample_with_cache(3, ...);
}
void resample_with_cache_4(...) {
resample_with_cache(4, ...);
}
We discussed and prototyped before?
And instead have to duplicate the entire body of resample_with_cache()
? Implying ISPC doesn't do constant-folding? What was the perf difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, when we did the testing, the results remained the same. Duplicating the function with just changing the number of channels we read from/write to does result in a ~20% performance gain however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, that's sad but a worthy tradeoff for this duplication.
I did some testing with what affects the performance for the test case we have (3 channels, 2048x2048 -> 512x512), and have made some interesting observations.
|
c.bench_function("Downsample `square_test.png` using resize", |b| { | ||
b.iter(|| { | ||
let mut dst = vec![RGB::new(0, 0, 0); target_width * target_height]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't really want to benchmark allocation as part of the resize test? Maybe not even instantiating the resizer (which we might want to benchmark separately as it iirc pre-computes weight tables?).
src/ispc/kernels/lanczos3.ispc
Outdated
struct Image { | ||
uniform const uint8* data; | ||
uniform int<2> size; | ||
struct WeightVariables { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a more descriptive name than "variables"? This has to do with image size/resolution, right?
src/ispc/kernels/lanczos3.ispc
Outdated
float absf = abs(f); | ||
return absf - floor(absf); | ||
export void calculate_weights(uniform float image_scale, uniform float filter_scale, uniform const WeightVariables * uniform vars, uniform float * uniform weights) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/ispc/kernels/lanczos3.ispc
Outdated
|
||
col += w * texel; | ||
weight += w; | ||
void clean_and_write_3_channels(varying float<3> color, varying uint64 write_address, uniform uint8* varying dst) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and below: write_address
and dst
aren't used separately: can you combine them into one argument and let the caller perform scratch_space + scratch_write_address
?
src/ispc/kernels/lanczos3.ispc
Outdated
|
||
col += w * texel; | ||
weight += w; | ||
void clean_and_write_3_channels(varying float<3> color, varying uint64 write_address, uniform uint8* varying dst) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/clean/clamp?
(v.src_center - v.src_start).to_ne_bytes(), | ||
); | ||
|
||
let reused = reuse_heap.get(&reuse_key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: there should be a bunch of functions that help with efficiently retrieving or inserting a new item, without doing the lookup multiple times.
/// | ||
/// Will panic if the target width or height are higher than that of the source image. | ||
/// For a more fine-tunable version of this function, see [downsample_with_custom_scale]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// For a more fine-tunable version of this function, see [downsample_with_custom_scale]. | |
/// For a more fine-tunable version of this function, see [`downsample_with_custom_scale()`]. |
downsample_with_custom_scale(src, target_width, target_height, 3.0) | ||
} | ||
|
||
/// Version of [downsample] which allows for a custom filter scale, thus trading between speed and final image quality. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Version of [downsample] which allows for a custom filter scale, thus trading between speed and final image quality. | |
/// Version of [`downsample()`] which allows for a custom filter scale, thus trading between speed and final image quality. |
/// The higher the scale, the more detail is preserved, but the slower the downsampling is. Note that the effect on the detail becomes smaller the higher the scale is. | ||
/// | ||
/// As a guideline, a `filter_scale` of 3.0 preserves detail well. | ||
/// A scale of 1.0 preserves is good if speed is necessary, but still preserves a decent amount of detail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
preserves or is good?
src/lib.rs
Outdated
// The new implementation needs a src_height * target_width intermediate buffer. | ||
let mut scratch_space = Vec::new(); | ||
scratch_space.resize( | ||
(src.height * target_width * src.format.num_channels() as u32) as usize, | ||
0u8, | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vec![...; 0u8]
?
Since the gist of this PR is improving quality - and if I understand correctly using weight ""caching"" to get there without outlandish times - how does it fare against
EDIT: The win is mostly in debug/dev profiles: $ cargo bench --profile dev
...
Downsample `square_test.png` using ispc_downsampler
time: [108.09 ms 108.13 ms 108.17 ms]
change: [+74.846% +75.217% +75.549%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
7 (7.00%) high mild
Benchmarking Downsample `square_test.png` using resize: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 120.5s, or reduce sample count to 10.
Downsample `square_test.png` using resize
time: [1.1973 s 1.1994 s 1.2013 s]
change: [+3066.5% +3072.6% +3078.3%] (p = 0.00 < 0.05)
Performance has regressed.
Found 23 outliers among 100 measurements (23.00%)
8 (8.00%) low severe
2 (2.00%) low mild
3 (3.00%) high mild
10 (10.00%) high severe |
src/lib.rs
Outdated
|
||
pub(crate) fn calculate_weights(src: u32, target: u32, filter_scale: f32) -> Vec<CachedWeight> { | ||
assert!( | ||
src > target, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We allow resampling in only width or height, but this assertion seems to break this principle
Add code to calculate the coefficients for the filter separate from the sampling Add unit test for coefficient correctness. Improvements and fixes Implement resampling based on cached weights Optimize buffer read/writes for 3-channels Remove debug code from lib.rs Fix incorrect color clamping Remove commented out code Wrapped ispc pointer code into functions Add getter and setter functions for 4 channels Add support for 4 channel images using function pointers Also tested it out with branching, but the performance was the same Remove debug code. Add documentation for new downsampling function Change resize crate dependency Change back to previous test output name Update build.rs to include new functions Replace function pointers with branching Split downsample function into two versions depending on channel count Remove old downsampling functions. Add filter_scale variable that allows the kernel to be scaled. This way the user can trade performance for detail, and the other way around Remove old ISPC function anme from build.rs Remove duplicate function Remove a skippable write in the clean_and_write ISPC functions Cargo fmt Fix incorrect weight Fix incorrect output address calculation Fix benches Update binaries Some cleanup More cleanup Missed
While preparing the talk for UU, I noticed that the quality of images made with our downsampler is very low compared to other downsamplers. I took the time to research why that is, it turned out that we aren't taking nearly enough samples when sampling down. For example, sampling from 2048x2048 down to 512x512, we would always take a 6x6 kernel to sample, while other samplers would take 12x12, and adapt that number further depending on the ratio between the source and target dimensions.
I also took inspiration from how other downsamplers handle working with large numbers of samples by caching some of the math for reuse.
The result is a new implementation which preserves image quality much better, but is about twice slower. The performance can probably be improved by splitting the ISPC kernel into two - one for 3 channels and one for 4 channels, and doing the branch in rust instead of relying on function pointers in ISPC. We might be able to squeeze out more performance with cache optimizations, but it'd need further looking into.
The old implementation is kept in both ISPC and Rust, and be invoked using
downsample_fast
.