Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cmyk_to_rgb causing off by one rounding errors. #1136

Merged
merged 17 commits into from Feb 12, 2020
Merged

Fix cmyk_to_rgb causing off by one rounding errors. #1136

merged 17 commits into from Feb 12, 2020

Conversation

philippeitis
Copy link
Contributor

Running cmyk_to_rgb(&[0, 0, 0, 254]) produces the RGB color [0, 0, 0] when it should in fact produce RGB [1, 1, 1]:
R = 255 * (1 - C/255) * (1-K/255)
R = 255 * (1-0/255) * (1-254/255)
R = 255 * 1 * 1/ 255 = 1.

This bug occurs with a range of other values as well - 159 / 255 of the values [0, 0, 0, 0..255] are affected.
The fix involves reducing the number of multiplications and divisions, and also yields a ~15-25% speedup (results are from randomly generated images, using cargo build --release).

250 px
Orig took 0.0000020s
Fast took 0.0000015s
2500 px
Orig took 0.0000154s
Fast took 0.0000121s
25000 px
Orig took 0.0002961s
Fast took 0.0002334s
250000 px
Orig took 0.0017928s
Fast took 0.0014527s
2500000 px
Orig took 0.0255831s
Fast took 0.0213677s

I license past and future contributions under the dual MIT/Apache-2.0 license,
allowing licensees to chose either at their option.

(Resubmitted PR against next branch, not master - it's not specified which branch to specify pull requests against.)

Running cmyk_to_rgb(&[0, 0, 0, 254]) produces the RGB color [0, 0, 0] when it should in fact produce RGB [1, 1, 1]:
R = 255 * (1 - C/255) * (1-K/255) 
R = 255 * (1-0/255) * (1-254/255)
R = 255 * 1 * 1/ 255 = 1.

This bug occurs with a range of other values as well - 159 / 255 of the values [0, 0, 0, 0..255] are affected.
The fix involves reducing the number of multiplications and divisions, and also yields a ~25% speedup (results are from randomly generated images). 
```
250 px
Orig took 0.0000020s
Fast took 0.0000015s
2500 px
Orig took 0.0000154s
Fast took 0.0000121s
25000 px
Orig took 0.0002961s
Fast took 0.0002334s
250000 px
Orig took 0.0017928s
Fast took 0.0014527s
2500000 px
Orig took 0.0255831s
Fast took 0.0213677s
```
@HeroicKatora
Copy link
Member

If it's a non-breaking change, it should be against master. If it is breaking then against next. You can change the target by clicking the Edit button on the top, it is not necessary to close and reopen another pull request.

@philippeitis
Copy link
Contributor Author

philippeitis commented Feb 10, 2020

Sorry - that's a pretty obvious feature to have, and I should've checked before closing and opening a new PR. I do think clarifying the purpose of master/next in the Contributors section would be helpful for future contributors.

I'm not sure if you would consider this change as breaking, since it does change output for certain values (from incorrect results to correct results).

@HeroicKatora
Copy link
Member

Ah, 'breaking' as in a SemVer sense of changes to the interface. I don't think there are many value changes that would be considered breaking unless there was an explicit promise of additional guarantees in the documentation. (Maybe in exceptional circumstances such as format detection). Fixing the correctness of color conversion is not.

@philippeitis philippeitis changed the base branch from next to master February 10, 2020 19:08
@philippeitis
Copy link
Contributor Author

philippeitis commented Feb 10, 2020

It's also possible to further optimize the code like so, but the difference is marginal (~ 1-5% on my computer) and requires specifying "target-cpu=native", otherwise performance regressions occur.

fn cmyk_to_rgb(input: &[u8]) -> Vec<u8> {
    let size = input.len() - input.len() / 4;
    let mut output = Vec::with_capacity(size);

    for pixel in input.chunks(4) {
        let c = f32::from(pixel[0]);
        let m = f32::from(pixel[1]);
        let y = f32::from(pixel[2]);
        let k = 255.0 - f32::from(pixel[3]);

        let k_sub = -k / 255.0;

        // CMY -> RGB
        let r = c.mul_add(k_sub, k) as u8;
        let g = m.mul_add(k_sub, k) as u8;
        let b = y.mul_add(k_sub, k) as u8;

        output.push(r);
        output.push(g);
        output.push(b);
    }

    output
}

@HeroicKatora
Copy link
Member

HeroicKatora commented Feb 10, 2020

I wanted to remark about how we might want to do the fix now, and performance later but then it occurred to me: Can we not avoid going through floating point entirely and stick to precise integer arithmetic? Your reformulation of the conversion should make it much more clear how to avoid the floating point steps by treating all intermediate computations as fractional arithmetic over the fixed denominator 255.

(e.g. let r = k - c * k_sub; is equivalent to (255*k - c*k)/255 = ((255 - c)*k)/255, is it not?)

@HeroicKatora
Copy link
Member

Regardless: Can you push some tests and the benchmark as a separate commit so one can verify before and after the change?

This test checks all possible colours produced by CYMK (in one channel), and checks that the RGB conversion is correct.
@philippeitis
Copy link
Contributor Author

(e.g. let r = k - c * k_sub; is equivalent to (255*k - c*k)/255 = ((255 - c)*k)/255, is it not?)

This is equivalent, but in general, division is slower than multiplication for all types, which is what makes my solution a lot faster (uses 1 div per step instead of 4), while the multiplies are pretty cheap.

@HeroicKatora
Copy link
Member

This is equivalent, but in general, division is slower than multiplication for all types, which is what makes my solution a lot faster (uses 1 div per step instead of 4), while the multiplies are pretty cheap.

Please benchmark this before claiming it. Those are fixed integer divisions of 16-bit length which will get optimized to a multiplication with the inverse (or a shift). The floating point solution has three separate u8->fp->u8 conversions. Furthermore, an integer implementation has a much better chance of being compiled to SIMD instructions.

src/jpeg/decoder.rs Show resolved Hide resolved
@philippeitis
Copy link
Contributor Author

The following function

fn cmyk_to_rgb_fast_u16(input: &[u8]) -> Vec<u8> {
    let size = input.len() - input.len() / 4;
    let mut output = Vec::with_capacity(size);

    for pixel in input.chunks(4) {
        let c = u16::from(pixel[0]);
        let m = u16::from(pixel[1]);
        let y = u16::from(pixel[2]);
        let k = 255 - u16::from(pixel[3]);

        let k_sub = k;
        let k = k * 255;
        // CMY -> RGB
        let r = (k - c * k_sub) / 255;
        let g = (k - m * k_sub) / 255;
        let b = (k - y * k_sub) / 255;

        output.push(r as u8);
        output.push(g as u8);
        output.push(b as u8);
    }

    output
}

Benchmarked, best of 10, built with cargo build --release. Orig is the original implementation, fast is the current implementation which I've pushed, and fast_u16 is the one above.

250 pixels
Orig took 0.0000035s
Fast took 0.0000027s
fast_u16 took 0.000003s
2500 pixels
Orig took 0.0000312s
Fast took 0.000024s
fast_u16 took 0.0000285s
25000 pixels
Orig took 0.0001611s
Fast took 0.0001411s
fast_u16 took 0.0001605s
250000 pixels
Orig took 0.0015028s
Fast took 0.0013628s
fast_u16 took 0.0015633s
2500000 pixels
Orig took 0.017322s
Fast took 0.013907s
Fast, no fma took 0.0160462s
25000000 pixels
Orig took 0.17552221s
Fast took 0.14268969s
fast_u16 took 0.15868449s

As we can see, the current implementation using f32 and multiplies is faster than both the original solution and proposed u16 division solution.

@HeroicKatora
Copy link
Member

HeroicKatora commented Feb 10, 2020

Can we try a seriously optimized variant? Note the elided bounds checks if you check the assembly and the resulting use of SIMD (the main point to prefer short integer arithmetic). (Prediction: 4× to 16× speedup).

pub fn cmyk_to_rgb_int(input: &[u8]) -> Vec<u8> {
    let count = input.len()/4;
    let mut output = vec![0; 3*count];
    
    let in_pixels = input[..4*count].chunks_exact(4);
    let out_pixels = output[..3*count].chunks_exact_mut(3);

    for (pixel, outp) in in_pixels.zip(out_pixels) {
        let c = u16::from(pixel[0]);
        let m = u16::from(pixel[1]);
        let y = u16::from(pixel[2]);
        let k = 255u16 - u16::from(pixel[3]);

        // CMY -> RGB
        let r = (((255 - c)*k)/255) as u8;
        let g = (((255 - m)*k)/255) as u8;
        let b = (((255 - y)*k)/255) as u8;

        outp[0] = r;
        outp[1] = g;
        outp[2] = b;
    }

    output
}

@HeroicKatora
Copy link
Member

HeroicKatora commented Feb 10, 2020

That is actually faster than the original version

But you have previously written:

25000000 pixels
Orig took 0.17552221s
Fast took 0.14268969s
fast_u16 took 0.15868449s

And now we only have a tenth of the pixels in more time:

2500000 pixels
Orig took 0.4389242
float took 0.2150764
int took 0.2374431

What are we even benchmarking here?

@philippeitis
Copy link
Contributor Author

Sorry, didn't include the release flag - it seems that the performance is in favour of your implementation, so I pushed that - correct benchmarks below:

250 pixels
Orig took 0.0000029s
float took 0.0000007s
int took 0.0000005s
2500 pixels
Orig took 0.0000154s
float took 0.0000035s
int took 0.0000025s
25000 pixels
Orig took 0.0001522s
float took 0.0000566s
int took 0.0000473s
250000 pixels
Orig took 0.001548s
float took 0.0005909s
int took 0.0004657s
2500000 pixels
Orig took 0.0172378s
float took 0.0053335s
int took 0.004726s
25000000 pixels
Orig took 0.1756037s
float took 0.0544639s
int took 0.0443116s

@HeroicKatora
Copy link
Member

HeroicKatora commented Feb 10, 2020

What hardware are you running? On two machines I could test it on, the floating point turned out faster by ~5%. Those are an i5-4690S and an i5-6200U. If you have a newer generation and that is faster for the integer version, I would use the integer version. Otherwise use your floating point division.

@philippeitis
Copy link
Contributor Author

I'm running an i5-8350U - the integer solution is 5% faster when I compile with cargo build --release.

@HeroicKatora
Copy link
Member

Then it may just flipflop depending on memory timings and exact CPU. I personally find integer code easier to reason about (don't have to remember rounding rules etc.) but if performance is so close, then it is a matter of taste.

Let's just get it in and maybe we can follow up on a more exact perf later.

@philippeitis
Copy link
Contributor Author

philippeitis commented Feb 11, 2020

I've added tests to cover the colors that the previous version failed to get correct. From what I've tried, I'd have to make cmyk_to_rgb public in order to add code in bench/ to benchmark it, which I don't know if I should do, or add a #![feature(test)] to decoder.rs directly, which also seems like the wrong choice. Other than that, the code should be ready to go.

@HeroicKatora
Copy link
Member

You can add a benchmark within the usual tests as well, just need to put it behind a cfg with the benchmarks feature, like so:

mod test {
    #[cfg(feature = "benchmarks")]
    #[bench]
    fn cmyk(b: &mut test::Bencher) {}
}

@philippeitis
Copy link
Contributor Author

I've added benchmarks - everything should be ready to merge now.

src/jpeg/decoder.rs Show resolved Hide resolved
src/jpeg/decoder.rs Show resolved Hide resolved
@HeroicKatora HeroicKatora merged commit 5a84e4c into image-rs:master Feb 12, 2020
@HeroicKatora
Copy link
Member

Thank you, and special appreciation for compairing all the various implementations.

@philippeitis philippeitis deleted the patch-1 branch February 27, 2020 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants