Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AVX2 optimizations to NLMeans. #1825

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

bradleysepos
Copy link
Contributor

@bradleysepos bradleysepos commented Jan 21, 2019

Building this code will produce a HandBrake binary that only runs on AVX2 capable processors. We need to do some work to the build system to improve this, so that only CPU feature-specific code gets compiled this way, and the rest generically. Kona started on this a few years back; the code is a bit dated now but perhaps useful for study: https://github.com/KonaBlend/HandBrake/commits/arch

Unfortunately, I have not been able to demonstrate a significant speedup. Feel free to benchmark and post results here. Do not benchmark against 1.2.0, because it will be slower due to the number of threads used with high core counts. Benchmark master with and without this patch. Takk.

Huge thanks to @maximd33 for working out the cross-lane maths.
2*border is already aligned, so in most places this is really aligning only the width. It's simpler and clearer to write NLMEANS_ALIGN(bw).
@bradleysepos bradleysepos added this to the 1.3.0 milestone Jan 21, 2019
@bradleysepos bradleysepos self-assigned this Jan 21, 2019
@Nomis101
Copy link
Contributor

First of all, thanks for the patch. But this does not build for me on Mac.

clang: error: unknown argument: '-mno-sse2avx'====== ] 74.8%

I will build without this flag and generate some benchmarks on Mac.

@bradleysepos
Copy link
Contributor Author

bradleysepos commented Jan 21, 2019

Ah, yes, I forgot to remove that. gcc only for that one. It's fine to build without it.

@Nomis101
Copy link
Contributor

Feel free to benchmark and post results here. Do not benchmark against 1.2.0, because it will be slower due to the number of threads used with high core counts. Benchmark master with and without this patch. Takk.

I just used a random video to test encoding speed (fps) of one build without vs. one build with the patch, different video codecs and different NLMeans settings. In both cases I used my own build to avoid artifacts from e.g. different build systems (so only the patch is the difference). All patches applied properly (without -mno-sse2avx). This was on macOS 10.14 on a MacBook Pro with Intel Core i7. Test Video: hevc (Main) (hvc1 / 0x31637668), yuv420p(tv, bt709), 1920x824 [SAR 1:1 DAR 240:103], 1945 kb/s, 24.94 fps, 100 tbr, 90k tbn, 25 tbc. Preset always was medium and RF=19.
cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

Results:

First row: Settings
Second row: own build 20190120190730-3b31b3d-master (2019012101)
Third row: own build 20190120190730-3b31b3d-master (2019012101) + patch

NLMeans - Light - Film - x265
222.85s (7.88 fps)
219.26s (8.01 fps)

NLMeans - Light - Animation - x265
876.37s (2.00 fps)
887.34s (1.98 fps)

NLMeans - Strong - Film - x265
223.68s (7.85 fps)
232.01s (7.57 fps)

NLMeans - Light - Film - x264
average encoding speed for job is 12.870839 fps
average encoding speed for job is 12.625109 fps

NLMeans - Light - Animation - x264
average encoding speed for job is 2.143706 fps
average encoding speed for job is 2.263086 fps

NLMeans - Strong - Film - x264
average encoding speed for job is 12.638898 fps
average encoding speed for job is 12.863046 fps

As you can see, in my test on macOS I did not see a big difference. I tried it all again with a second video, but the result was similar (even with the official nightly it is the same). I have no idea, why I did not see a great speed improvement. Do you have any idea?

Clang be like what.
@sr55
Copy link
Contributor

sr55 commented Jul 12, 2019

@bradleysepos planning on continuing with this?

@bradleysepos
Copy link
Contributor Author

Yes, waiting for @jstebbins to have a look at the arch branch. Can't merge without those improvements, as it would break on non-AVX2 CPUs.

While there isn't a significant speed improvement here, I may be able to optimize it further by ensuring the ordering across lanes is correct at the beginning, rather than blending at the end. Also, this paves the way for AVX-512 which @maximd33 might help me with.

@bradleysepos bradleysepos modified the milestones: 1.3.0, Unscheduled Aug 5, 2019
@Nomis101
Copy link
Contributor

Nomis101 commented Nov 2, 2019

Because I did see a slight speed increase around 2 % for H.265 if I build with -mavx -mavx2 for GCC.args.extra, I did give this another try. I've tested this time a bit more scientifically and made an average from 5 measurements (without patch = basal, with patch applied = patch). I tested NLMeans Light and Strong settings for H.264 and H.265. But sadly, also in my refined test, I didn't see a speed increase. Not even the 2 % increase I was seeing for -mavx -mavx2 alone. :-(

nlmeans

@bradleysepos
Copy link
Contributor Author

Thanks for testing this further. Your results seem to confirm what I've seen as well. For any gain to be had on AVX2, it seems we'll have to rewrite this to ensure ordering across lanes is correct at the beginning to avoid the blend at the end as I mentioned before.

@sr55
Copy link
Contributor

sr55 commented Mar 31, 2021

This something we want to close for now and re-visit another time?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

3 participants