Add AVX2 optimizations to NLMeans. #1825

bradleysepos · 2019-01-21T10:01:44Z

Building this code will produce a HandBrake binary that only runs on AVX2 capable processors. We need to do some work to the build system to improve this, so that only CPU feature-specific code gets compiled this way, and the rest generically. Kona started on this a few years back; the code is a bit dated now but perhaps useful for study: https://github.com/KonaBlend/HandBrake/commits/arch

Unfortunately, I have not been able to demonstrate a significant speedup. Feel free to benchmark and post results here. Do not benchmark against 1.2.0, because it will be slower due to the number of threads used with high core counts. Benchmark master with and without this patch. Takk.

@maximd33

Huge thanks to @maximd33 for working out the cross-lane maths.

2*border is already aligned, so in most places this is really aligning only the width. It's simpler and clearer to write NLMEANS_ALIGN(bw).

Nomis101 · 2019-01-21T20:51:35Z

First of all, thanks for the patch. But this does not build for me on Mac.

clang: error: unknown argument: '-mno-sse2avx'====== ] 74.8%

I will build without this flag and generate some benchmarks on Mac.

bradleysepos · 2019-01-21T20:59:32Z

Ah, yes, I forgot to remove that. gcc only for that one. It's fine to build without it.

Nomis101 · 2019-01-21T23:52:28Z

Feel free to benchmark and post results here. Do not benchmark against 1.2.0, because it will be slower due to the number of threads used with high core counts. Benchmark master with and without this patch. Takk.

I just used a random video to test encoding speed (fps) of one build without vs. one build with the patch, different video codecs and different NLMeans settings. In both cases I used my own build to avoid artifacts from e.g. different build systems (so only the patch is the difference). All patches applied properly (without -mno-sse2avx). This was on macOS 10.14 on a MacBook Pro with Intel Core i7. Test Video: hevc (Main) (hvc1 / 0x31637668), yuv420p(tv, bt709), 1920x824 [SAR 1:1 DAR 240:103], 1945 kb/s, 24.94 fps, 100 tbr, 90k tbn, 25 tbc. Preset always was medium and RF=19.
cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

Results:

First row: Settings
Second row: own build 20190120190730-3b31b3d-master (2019012101)
Third row: own build 20190120190730-3b31b3d-master (2019012101) + patch

NLMeans - Light - Film - x265
222.85s (7.88 fps)
219.26s (8.01 fps)

NLMeans - Light - Animation - x265
876.37s (2.00 fps)
887.34s (1.98 fps)

NLMeans - Strong - Film - x265
223.68s (7.85 fps)
232.01s (7.57 fps)

NLMeans - Light - Film - x264
average encoding speed for job is 12.870839 fps
average encoding speed for job is 12.625109 fps

NLMeans - Light - Animation - x264
average encoding speed for job is 2.143706 fps
average encoding speed for job is 2.263086 fps

NLMeans - Strong - Film - x264
average encoding speed for job is 12.638898 fps
average encoding speed for job is 12.863046 fps

As you can see, in my test on macOS I did not see a big difference. I tried it all again with a second video, but the result was similar (even with the official nightly it is the same). I have no idea, why I did not see a great speed improvement. Do you have any idea?

Clang be like what.

sr55 · 2019-07-12T20:27:17Z

@bradleysepos planning on continuing with this?

bradleysepos · 2019-07-12T20:30:27Z

Yes, waiting for @jstebbins to have a look at the arch branch. Can't merge without those improvements, as it would break on non-AVX2 CPUs.

While there isn't a significant speed improvement here, I may be able to optimize it further by ensuring the ordering across lanes is correct at the beginning, rather than blending at the end. Also, this paves the way for AVX-512 which @maximd33 might help me with.

Nomis101 · 2019-11-02T11:55:48Z

Because I did see a slight speed increase around 2 % for H.265 if I build with -mavx -mavx2 for GCC.args.extra, I did give this another try. I've tested this time a bit more scientifically and made an average from 5 measurements (without patch = basal, with patch applied = patch). I tested NLMeans Light and Strong settings for H.264 and H.265. But sadly, also in my refined test, I didn't see a speed increase. Not even the 2 % increase I was seeing for -mavx -mavx2 alone. :-(

bradleysepos · 2019-11-05T23:04:25Z

Thanks for testing this further. Your results seem to confirm what I've seen as well. For any gain to be had on AVX2, it seems we'll have to rewrite this to ensure ordering across lanes is correct at the beginning to avoid the blend at the end as I mentioned before.

sr55 · 2021-03-31T16:13:16Z

This something we want to close for now and re-visit another time?

bradleysepos added 3 commits January 21, 2019 04:50

libhb: NLMeans AVX2 optimization.

add6f30

Huge thanks to @maximd33 for working out the cross-lane maths.

libhb: Implement and use aligned stride throughout NLMeans.

a8f46f3

2*border is already aligned, so in most places this is really aligning only the width. It's simpler and clearer to write NLMEANS_ALIGN(bw).

libhb: Use av_malloc/av_calloc/av_free in NLMeans to ensure alignment.

3f5f3c0

bradleysepos added the Enhancement label Jan 21, 2019

bradleysepos added this to the 1.3.0 milestone Jan 21, 2019

bradleysepos self-assigned this Jan 21, 2019

libhb: Remove -mno-sse2avx.

097257d

Clang be like what.

bradleysepos modified the milestones: 1.3.0, Unscheduled Aug 5, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add AVX2 optimizations to NLMeans. #1825

Add AVX2 optimizations to NLMeans. #1825

bradleysepos commented Jan 21, 2019 •

edited

Nomis101 commented Jan 21, 2019

bradleysepos commented Jan 21, 2019 •

edited

Nomis101 commented Jan 21, 2019

sr55 commented Jul 12, 2019

bradleysepos commented Jul 12, 2019

Nomis101 commented Nov 2, 2019

bradleysepos commented Nov 5, 2019

sr55 commented Mar 31, 2021

Add AVX2 optimizations to NLMeans. #1825

Are you sure you want to change the base?

Add AVX2 optimizations to NLMeans. #1825

Conversation

bradleysepos commented Jan 21, 2019 • edited

Nomis101 commented Jan 21, 2019

bradleysepos commented Jan 21, 2019 • edited

Nomis101 commented Jan 21, 2019

sr55 commented Jul 12, 2019

bradleysepos commented Jul 12, 2019

Nomis101 commented Nov 2, 2019

bradleysepos commented Nov 5, 2019

sr55 commented Mar 31, 2021

bradleysepos commented Jan 21, 2019 •

edited

bradleysepos commented Jan 21, 2019 •

edited