Add optimized SSE2 routines for bottleneck functions #42
Taking the tests out of the branch condition results in a 50% speedup with GCC when compiled with -O3.
This file implements the masking algorithm we will use in the SSE2 code. It demonstrates identical output when compared to the plain function. It's a lot slower than the regular routine when used on individual pixels, but is speedier when vectorized. Instead of branching, we use masks to composite the output values. The masks were found by breaking down the branch conditions into boolean operations, and repeatedly applying De Morgan's laws to simplify them. Since SSE has the 'andnot' operator, optimize for that form.
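As a rough illustration of the masking idea (not the patch's actual code), the sketch below replaces a branch with an all-ones/all-zeros mask, first on a single byte and then on 16 bytes with SSE2. The function names and the "keep `a` if it exceeds a threshold" condition are hypothetical, chosen only to keep the example small. Note how the vector version builds the inverted mask and then leans on `_mm_andnot_si128`, which computes `(~x) & y`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Branching reference (hypothetical condition): keep a if a > threshold, else b. */
static uint8_t pick_branch(uint8_t a, uint8_t b, uint8_t threshold)
{
    if (a > threshold)
        return a;
    return b;
}

/* Same result without a branch: the comparison becomes a 0xFF/0x00 mask. */
static uint8_t pick_mask(uint8_t a, uint8_t b, uint8_t threshold)
{
    uint8_t mask = (uint8_t)-(a > threshold);   /* 0xFF if true, 0x00 if false */
    return (uint8_t)((mask & a) | (~mask & b));
}

/* Sixteen pixels at once; falls back to the scalar mask form without SSE2. */
static void pick16(const uint8_t *a, const uint8_t *b, uint8_t *out, uint8_t threshold)
{
#ifdef __SSE2__
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vt = _mm_set1_epi8((char)threshold);
    /* SSE2 has no unsigned byte compare; a saturating subtract is zero
     * exactly where a <= threshold, so this mask is the INVERSE condition. */
    __m128i le = _mm_cmpeq_epi8(_mm_subs_epu8(va, vt), _mm_setzero_si128());
    /* (le & b) | (~le & a): the second term is the 'andnot' form. */
    __m128i r = _mm_or_si128(_mm_and_si128(le, vb), _mm_andnot_si128(le, va));
    _mm_storeu_si128((__m128i *)out, r);
#else
    for (size_t i = 0; i < 16; i++)
        out[i] = pick_mask(a[i], b[i], threshold);
#endif
}

/* Exhaustively check that the masked form matches the branching form. */
static void check_pick(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            for (int t = 0; t < 256; t += 51)
                assert(pick_mask((uint8_t)a, (uint8_t)b, (uint8_t)t) ==
                       pick_branch((uint8_t)a, (uint8_t)b, (uint8_t)t));
}
```

On a single pixel the masked form does more work than the branch, which matches the observation that it only pays off once vectorized.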
To convert the algorithms to SSE2, we need to know exactly which width and type of int we're dealing with. Make this a uint16_t; the type looks large enough for a counter that updates every frame. At 10 Hz, it will take almost 2 hours for the counter to saturate; enough time to finally accept a static object.
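The "almost 2 hours" figure can be checked with a little arithmetic: a uint16_t tops out at 65535, so at 10 updates per second it takes about 6553 seconds, or roughly 1.8 hours. The helpers below are a hypothetical sketch of that calculation and of a counter that sticks at its maximum instead of wrapping; they are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Increment a 16-bit counter, sticking at UINT16_MAX instead of wrapping.
 * (Hypothetical helper; the patch may saturate in a different way.) */
static uint16_t saturating_inc_u16(uint16_t counter)
{
    return (counter == UINT16_MAX) ? counter : (uint16_t)(counter + 1);
}

/* Seconds until a counter starting at zero saturates at `hz` updates per second. */
static double seconds_to_saturate_u16(double hz)
{
    return (double)UINT16_MAX / hz;
}

/* Sanity-check the arithmetic behind the commit message. */
static void check_counter(void)
{
    double hours = seconds_to_saturate_u16(10.0) / 3600.0;
    assert(hours > 1.8 && hours < 2.0);
    assert(saturating_inc_u16(UINT16_MAX) == UINT16_MAX);
}
```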
Directly run two functions on the same input and check whether they give the same output; don't just rely on printing a few numbers to the screen and eyeballing the results.
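A minimal sketch of that kind of check, assuming two hypothetical stand-in functions rather than the real `alg_*` routines: feed both the same pseudo-random buffer and compare the outputs byte for byte with `memcmp`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef void (*frame_fn)(const uint8_t *in, uint8_t *out, size_t n);

/* Stand-in "plain" routine (hypothetical): binarize around 128 with a branch. */
static void threshold_plain(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (in[i] > 128)
            out[i] = 255;
        else
            out[i] = 0;
    }
}

/* Stand-in "optimized" routine (hypothetical): same result via a mask. */
static void threshold_masked(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)-(in[i] > 128);
}

/* Run f and g on the same seeded random input; return 1 if the outputs
 * are byte-identical, 0 otherwise (including on allocation failure). */
static int outputs_match(frame_fn f, frame_fn g, size_t n, unsigned seed)
{
    uint8_t *in = malloc(n), *a = malloc(n), *b = malloc(n);
    int ok = 0;

    if (in && a && b) {
        srand(seed);
        for (size_t i = 0; i < n; i++)
            in[i] = (uint8_t)(rand() & 0xFF);
        f(in, a, n);
        g(in, b, n);
        ok = (memcmp(a, b, n) == 0);
    }
    free(in);
    free(a);
    free(b);
    return ok;
}

/* Example use: fail loudly if the two variants ever disagree. */
static void check_threshold(void)
{
    assert(outputs_match(threshold_plain, threshold_masked, 4096, 1u));
}
```

Using a fixed seed keeps a failing case reproducible; running several seeds and sizes widens the coverage cheaply.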
First, awesome writeup on the code and changes. I will admit that SSE optimization isn't my niche, and I would like to make sure there aren't any cross-platform conflicts before a commit. I can test it on my Pi, but are there any other known conflicts that could occur with, say, BSD, Arch, CYGWIN, etc.?
It is also on my list. I will test on an x86 Atom machine and also check on some ARM devices. The timeframe is next week (Oct 20).
Also, I might have broken this with the latest change to ffmpeg.c. That module contained the following code: `#if !defined(__SSE_MATH__) && (defined(__i386__) || defined(__x86_64__))`. It was placed in a non-functional routine. I've been trying to read up on its use and placement, but haven't yet figured out where the "right" place for it is.
The way this patch is written, the SSE2 code path will only be compiled when the [...] There should be no conflicts based on UNIX flavor (as mentioned: BSD, Arch, Cygwin), because all the SSE2 code does is perform the same routine with a different kind of variable: a 16-lane vector int type instead of an unsigned char. Platform differences don't enter into it. Here are the two reasons why I wouldn't be jumping to pull this code just now:
Here's a quote from the Intel intrinsics manual about the EMMS instruction:
So it's included on x86 platforms which don't use SSE math. That's basically anything older than the Pentium 3, which introduced the SSE instructions. The code won't break this patch, because by definition my code will only be compiled when SSE math is enabled.
I would appreciate it if someone could optimize "mjpegtoyuv420p" and "decode_jpeg_raw" for ARM :) Those are currently bottlenecks for ARM.
Not to derail the discussion, but this code in
Couldn't this be replaced with:
I mean, I didn't test this or anything, and maybe I overlooked something obvious here, but that looks like an easy win.
Also, why the
Or am I missing something obvious? So many questions...
Tosiara, how's this for a
@aklomp thanks a lot, looks promising, I will give it a try! Sorry for going off-topic.
OK. I'm going to leave this for the moment and instead focus on the long list of bugs, leaks, lost functionality, and documentation fixes. There are a lot of gold nuggets in here, both in the base pull request and in the off-topic suggestions for performance improvement.
Care to comment on why this should be closed?
Ah. It wasn't deliberately closed. It seems GitHub automatically closed it because I deleted the (unused since 2015) unstable branch which this pull request is trying to merge into. I think we should update this one to be merging into master.
I see, that explains it. In fairness, I also haven't done anything to further this issue in the past two years. The patches themselves probably won't even fit on the latest codebase. Looking back, I'd probably take a different approach to what I tried to do here: start with building a test framework, insert tests for the functions in question, then carefully add SIMD versions of those functions. As it stands, I can understand the reluctance to merge it. I'll think about what to do with this patchset; it's been a while since I was involved in Motion development.
Would be nice to implement SSE2/SIMD/NEON (whatever possible) optimizations in the new codebase.
Upstream merge
This pull request provides SSE2 vectorized implementations of `alg_update_reference_frame()` and `alg_noise_tune()`. Profiling with Callgrind on my Atom server showed that the first was the most expensive single function call. Rewriting it in branchless SSE2 code cuts Motion's load average roughly in half for me. Per-function benchmarking shows a speedup of around 2× for `alg_update_reference_frame()`, and around 4× for `alg_noise_tune()`. Results differ across hardware and compilers, but always show significant speedup.
The plain functions have been lifted out of `alg.c` and placed in the new `alg/` subdirectory, along with their SSE2 versions. In `alg.c`, the preprocessor chooses between including plain or SSE2 functions at compile time. This is perhaps not in line with the rest of the codebase, but made it possible to build a test harness around the functions that checks their correctness and does performance benchmarks. This harness can be found in `alg/tests`.
My website has a writeup of how I converted `alg_update_reference_frame()` to branchless code. It demonstrates how to derive the logic step by step, and is a kind of giant comment on the code. Hopefully it will help in reviewing the code for correctness.
This code took quite some time to write, but probably deserves more real-world testing than it's had so far. All comments welcome!
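For readers unfamiliar with the compile-time selection described above, here is a single-file sketch of the general pattern. It assumes the compiler-defined `__SSE2__` macro as the switch, which may not be the exact condition the patch uses, and the saturating-add routine is a hypothetical stand-in for the real `alg_*` functions, which live in separate files under `alg/`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar building block, also used for the tail of the SSE2 loop. */
static uint8_t add_sat_one(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255 ? 255 : sum);
}

#if defined(__SSE2__)   /* assumption: compiler-defined SSE2 macro as the switch */
#include <emmintrin.h>

/* SSE2 variant: 16 saturating byte additions per iteration. */
static void add_sat(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_adds_epu8(va, vb));
    }
    for (; i < n; i++)              /* leftover bytes */
        out[i] = add_sat_one(a[i], b[i]);
}
#else

/* Plain variant: same signature, so callers do not care which one was built. */
static void add_sat(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = add_sat_one(a[i], b[i]);
}
#endif

/* Whichever variant was compiled must agree with the scalar definition. */
static void check_add_sat(void)
{
    uint8_t a[37], b[37], out[37];
    for (size_t i = 0; i < 37; i++) {
        a[i] = (uint8_t)(i * 7);
        b[i] = (uint8_t)(250 - i);
    }
    add_sat(a, b, out, 37);
    for (size_t i = 0; i < 37; i++)
        assert(out[i] == add_sat_one(a[i], b[i]));
}
```

Because both variants share one signature, a test harness can be built against either of them without changes, which is the property the description relies on.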