New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AVX2 optimizations to NLMeans. #1825
base: master
Are you sure you want to change the base?
Conversation
Huge thanks to @maximd33 for working out the cross-lane maths.
2*border is already aligned, so in most places this is really aligning only the width. It's simpler and clearer to write NLMEANS_ALIGN(bw).
First of all, thanks for the patch. But this does not build for me on Mac.
I will build without this flag and generate some benchmarks on Mac. |
Ah, yes, I forgot to remove that. |
I just used a random video to test encoding speed (fps) of one build without vs. one build with the patch, different video codecs and different NLMeans settings. In both cases I used my own build to avoid artifacts from e.g. different build systems (so only the patch is the difference). All patches applied properly (without -mno-sse2avx). This was on macOS 10.14 on a MacBook Pro with Intel Core i7. Test Video: hevc (Main) (hvc1 / 0x31637668), yuv420p(tv, bt709), 1920x824 [SAR 1:1 DAR 240:103], 1945 kb/s, 24.94 fps, 100 tbr, 90k tbn, 25 tbc. Preset always was medium and RF=19. Results: First row: Settings NLMeans - Light - Film - x265 NLMeans - Light - Animation - x265 NLMeans - Strong - Film - x265 NLMeans - Light - Film - x264 NLMeans - Light - Animation - x264 NLMeans - Strong - Film - x264 As you can see, in my test on macOS I did not see a big difference. I tried it all again with a second video, but the result was similar (even with the official nightly it is the same). I have no idea, why I did not see a great speed improvement. Do you have any idea? |
Clang be like what.
@bradleysepos planning on continuing with this? |
Yes, waiting for @jstebbins to have a look at the arch branch. Can't merge without those improvements, as it would break on non-AVX2 CPUs. While there isn't a significant speed improvement here, I may be able to optimize it further by ensuring the ordering across lanes is correct at the beginning, rather than blending at the end. Also, this paves the way for AVX-512 which @maximd33 might help me with. |
Because I did see a slight speed increase around 2 % for H.265 if I build with |
Thanks for testing this further. Your results seem to confirm what I've seen as well. For any gain to be had on AVX2, it seems we'll have to rewrite this to ensure ordering across lanes is correct at the beginning to avoid the blend at the end as I mentioned before. |
This something we want to close for now and re-visit another time? |
Building this code will produce a HandBrake binary that only runs on AVX2 capable processors. We need to do some work to the build system to improve this, so that only CPU feature-specific code gets compiled this way, and the rest generically. Kona started on this a few years back; the code is a bit dated now but perhaps useful for study: https://github.com/KonaBlend/HandBrake/commits/arch
Unfortunately, I have not been able to demonstrate a significant speedup. Feel free to benchmark and post results here. Do not benchmark against 1.2.0, because it will be slower due to the number of threads used with high core counts. Benchmark master with and without this patch. Takk.