Vectorize Jpeg Encoder Color Conversion #1508

Merged: 5 commits, Jan 18, 2021

Conversation

tkp1n
Contributor

@tkp1n tkp1n commented Jan 17, 2021

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practices as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Touches #1476 with a vectorized implementation of the RGB -> YCbCr conversions.

It is worth noting that the following benchmarks don't show the entire picture. The current lookup-table converter uses 3 kB of lookup tables that pollute the cache and likely slow down subsequent operations in the encoding chain.
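For reference, the per-pixel math behind an RGB -> YCbCr conversion can be sketched in scalar form. This is a minimal Python sketch using the standard full-range JFIF coefficients; the function names `rgb_to_ycbcr` and `convert_block` are hypothetical, and whether the encoder applies a further level shift at this stage is not stated in this thread.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (JFIF coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def convert_block(rgb):
    """Turn interleaved RGB bytes into three planar float lists (Y, Cb, Cr)."""
    ys, cbs, crs = [], [], []
    # Walk the buffer one pixel (3 bytes) at a time.
    for i in range(0, len(rgb), 3):
        y, cb, cr = rgb_to_ycbcr(rgb[i], rgb[i + 1], rgb[i + 2])
        ys.append(y)
        cbs.append(cb)
        crs.append(cr)
    return ys, cbs, crs
```

A lookup-table converter precomputes each coefficient-times-channel product for all 256 channel values, trading the multiplications for memory reads. A vectorized converter instead evaluates the same three expressions for several pixels per instruction.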

Benchmark 🚀

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-9750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.102
  [Host]     : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT
  DefaultJob : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT

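|            Method |      Mean |    Error |   StdDev | Ratio |
|------------------ |----------:|---------:|---------:|------:|
|        ConvertLut | 224.68 ns | 1.059 ns | 0.938 ns |  1.00 |
| ConvertVectorized |  81.07 ns | 0.404 ns | 0.337 ns |  0.36 |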

@tkp1n
Contributor Author

tkp1n commented Jan 17, 2021

Looks like some of my final touches broke the tests.. I'll look into it and mark this as a draft in the meantime. All good now and ready for review 😃

@tkp1n tkp1n marked this pull request as draft January 17, 2021 13:39
@codecov

codecov bot commented Jan 17, 2021

Codecov Report

Merging #1508 (91b18b1) into master (e2961dc) will increase coverage by 0.01%.
The diff coverage is 94.52%.

@@            Coverage Diff             @@
##           master    #1508      +/-   ##
==========================================
+ Coverage   83.52%   83.53%   +0.01%     
==========================================
  Files         741      742       +1     
  Lines       32672    32732      +60     
  Branches     3662     3665       +3     
==========================================
+ Hits        27289    27344      +55     
- Misses       4669     4672       +3     
- Partials      714      716       +2     
| Flag      | Coverage Δ                   |
|-----------|------------------------------|
| unittests | 83.53% <94.52%> (+0.01%) ⬆️ |

| Impacted Files | Coverage Δ |
|----------------|------------|
| ...omponents/Encoder/YCbCrForwardConverter{TPixel}.cs | 64.28% <20.00%> (-35.72%) ⬇️ |
| .../Jpeg/Components/Encoder/RgbToYCbCrConverterLut.cs | 100.00% <100.00%> (ø) |
| ...omponents/Encoder/RgbToYCbCrConverterVectorized.cs | 100.00% <100.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tkp1n tkp1n marked this pull request as ready for review January 17, 2021 19:25
Member

@JimBobSquarePants JimBobSquarePants left a comment

@tkp1n Thanks for having a crack at this! Very much appreciated! The deinterleaving code had me beat the other week when I had a look so I was hoping someone smarter than me would have a go.

I'm gonna leave this to the brains trust to properly review because I'm still very much a SIMD noob but I've added a few comments.

@antonfirsov @saucecontrol If you have ideas and time please comment. I would have thought we could have sped up the operation a bit more than the current (welcome) improvement.

@saucecontrol
Contributor

saucecontrol commented Jan 18, 2021

Rather than having two methods with a round-trip through memory, I'd try to get this all done in one step. The best option would actually be to do the conversion in fixed point if AVX2 is available, but I assume that's a much larger change if the encoder is already taking the YCbCr input in floating point.

This version could be simplified using something like this:

// read 32 bytes, of which 24 will be used
rgb = vmovdqu(rgbptr)
RGBRGBRGBRGBRGBR|GBRGBRGBxxxxxxxx

// move 12 good bytes into each lane
rgb = vpermd(rgb, mask 0,1,2,6,3,4,5,7)
RGBRGBRGBRGBxxxx|RGBRGBRGBRGBxxxx

// group channel values together
rgb = vpshufb(rgb, mask 0,3,6,9,1,4,7,10,2,5,8,11,-1,-1,-1,-1)
RRRRGGGGBBBBxxxx|RRRRGGGGBBBBxxxx

// unpack to int16 width
rg = vpunpcklbw(rgb, zero)
R0R0R0R0G0G0G0G0|R0R0R0R0G0G0G0G0
bx = vpunpckhbw(rgb, zero)
B0B0B0B0x0x0x0x0|B0B0B0B0x0x0x0x0

// then to int32 width
r = vpunpcklwd(rg, zero)
R000R000R000R000|R000R000R000R000
g = vpunpckhwd(rg, zero)
G000G000G000G000|G000G000G000G000
b = vpunpcklwd(bx, zero)
B000B000B000B000|B000B000B000B000

// convert to float, then use the existing ConvertInternal math
// values will already be properly ordered for 32-byte stores

That's untested but should get you going in the right direction. You'll need to allow for 8 bytes of overrun on the last read, or write it such that the 8th iteration reads the last 32 bytes and uses an alternate permute mask to shuffle the 12-byte chunks into position.
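The data movement in the pseudocode above can be checked without AVX2 hardware. The following is a minimal Python sketch that models each instruction on a 32-byte list. The helpers `vpermd`, `vpshufb`, `unpack`, and `deinterleave8` are hypothetical stand-ins that imitate the instruction semantics; they are not ImageSharp or .NET APIs.

```python
def vpermd(src, idx):
    """Permute 4-byte dwords across the whole 32-byte register."""
    out = []
    for i in idx:
        out.extend(src[4 * i:4 * i + 4])
    return out


def vpshufb(src, mask):
    """Shuffle bytes within each 16-byte lane; a negative index yields zero."""
    out = []
    for lane in (0, 16):
        out.extend(0 if m < 0 else src[lane + m] for m in mask)
    return out


def unpack(a, b, width, high):
    """Interleave width-byte elements of a and b, taken from the low or
    high 8 bytes of each 16-byte lane (models vpunpck{l,h}{bw,wd})."""
    out = []
    for lane in (0, 16):
        start = lane + (8 if high else 0)
        for off in range(start, start + 8, width):
            out.extend(a[off:off + width])
            out.extend(b[off:off + width])
    return out


def to_int32s(reg):
    """Read a 32-byte register as eight little-endian 32-bit integers."""
    return [int.from_bytes(bytes(reg[i:i + 4]), "little") for i in range(0, 32, 4)]


def deinterleave8(rgb32):
    """Split 8 interleaved RGB pixels (24 bytes plus 8 bytes of overrun)
    into three registers of eight int32 values each."""
    zero = [0] * 32
    v = vpermd(list(rgb32), [0, 1, 2, 6, 3, 4, 5, 7])   # 12 good bytes per lane
    v = vpshufb(v, [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11, -1, -1, -1, -1])
    rg = unpack(v, zero, 1, high=False)                 # bytes -> int16
    bx = unpack(v, zero, 1, high=True)
    r = unpack(rg, zero, 2, high=False)                 # int16 -> int32
    g = unpack(rg, zero, 2, high=True)
    b = unpack(bx, zero, 2, high=False)
    return to_int32s(r), to_int32s(g), to_int32s(b)
```

Feeding it eight pixels followed by eight arbitrary padding bytes returns the R, G, and B values in pixel order. In this model the positions marked `x` in the diagrams come out as zero after the byte shuffle, because the `-1` mask entries clear them, so the overrun bytes never reach the outputs.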

@antonfirsov
Member

I'm super curious how much the algorithm in #1508 (comment) could give us, but if there's no time to implement it, we can merge the PR as is after changing the Vector128.Create-s.

In any case let's not close #1476 with the PR, there is definitely space left for improvements by rearranging the pipeline steps within the Encode420 / Encode444 loops. (Needs further analysis to determine what exactly has to be done.)

@tkp1n
Contributor Author

tkp1n commented Jan 18, 2021

@antonfirsov I'm already on it.. Seems to reduce times roughly from 170ns to 80ns 🚀
@saucecontrol Thanks for the input!

@JimBobSquarePants
Member

I’ll do a final pass tonight but this is looking great! 👍

Member

@JimBobSquarePants JimBobSquarePants left a comment

LGTM!

@JimBobSquarePants JimBobSquarePants merged commit eab04e4 into SixLabors:master Jan 18, 2021
@tkp1n tkp1n deleted the feature/vectorize-rgb2ycbcr-conversion branch January 20, 2021 20:09
JimBobSquarePants added a commit that referenced this pull request Mar 13, 2021
…rsion

Vectorize Jpeg Encoder Color Conversion