Add optimized SSE2 routines for bottleneck functions #42

Closed · wants to merge 11 commits

Conversation

@aklomp (Contributor) commented Oct 6, 2014

This pull request provides SSE2 vectorized implementations of alg_update_reference_frame() and alg_noise_tune(). Profiling with Callgrind on my Atom server showed that the first was the most expensive single function call. Rewriting it in branchless SSE2 code cuts Motion's load average roughly in half for me. Per-function benchmarking shows a speedup of around 2× for alg_update_reference_frame(), and around 4× for alg_noise_tune(). Results differ across hardware and compilers, but always show significant speedup.

The plain functions have been lifted out of alg.c and placed in the new alg/ subdirectory, along with their SSE2 versions. In alg.c, the preprocessor chooses between including plain or SSE2 functions at compile time. This is perhaps not in line with the rest of the codebase, but made it possible to build a test harness around the functions that checks their correctness and does performance benchmarks. This harness can be found in alg/tests.

My website has a writeup of how I converted alg_update_reference_frame() to branchless code. It demonstrates how to derive the logic step by step, and is a kind of giant comment on the code. Hopefully it will help in reviewing the code for correctness.

This code took quite some time to write, but probably deserves more real-world testing than it's had so far. All comments welcome!

Taking the tests out of the branch condition results in a 50% speedup with GCC when compiled with -O3.

This file implements the masking algorithm we will use in the SSE2 code. It demonstrates identical output when compared to the plain function. It's a lot slower than the regular routine when used on individual pixels, but is speedier when vectorized.

Instead of branching, we use masks to composite the output values. The
masks were found by breaking down the branch conditions into boolean
operations, and repeatedly applying De Morgan's laws to simplify them.
Since SSE has the 'andnot' operator, optimize for that form.
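
As a minimal sketch of that compositing pattern (illustrative only; the function name here is mine, not from this branch):

#include <emmintrin.h>

/* Branchless per-lane select: out = mask ? a : b.
   _mm_andnot_si128(mask, b) computes (~mask) & b, so the whole
   expression is (mask & a) | (~mask & b), with no branches. */
static inline __m128i select_si128(__m128i mask, __m128i a, __m128i b)
{
    return _mm_or_si128(_mm_and_si128(mask, a),
                        _mm_andnot_si128(mask, b));
}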

To convert the algorithms to SSE2, we need to know exactly which width and type of int we're dealing with. Make this a uint16_t; the type is large enough for a counter that updates every frame. At 10 Hz, it takes almost 2 hours for the counter to saturate (65535 / 10 ≈ 6554 seconds); enough time to finally accept a static object.

Directly run two functions on the same input and check whether they give the same output; don't just rely on printing a few numbers to the screen and eyeballing the results.
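
A minimal sketch of that idea (the function names and signatures here are assumptions for illustration, not the actual alg/tests interface):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical declarations; the real routines take Motion's own types. */
void alg_noise_tune_plain(uint8_t *buf, size_t len);
void alg_noise_tune_sse2(uint8_t *buf, size_t len);

#define N 4096

int main(void)
{
    uint8_t in[N], a[N], b[N];

    /* Same random input for both implementations. */
    for (size_t i = 0; i < N; i++)
        in[i] = (uint8_t)rand();

    memcpy(a, in, N);
    memcpy(b, in, N);
    alg_noise_tune_plain(a, N);
    alg_noise_tune_sse2(b, N);

    if (memcmp(a, b, N) != 0) {
        fprintf(stderr, "FAIL: plain and SSE2 outputs differ\n");
        return 1;
    }
    puts("OK: outputs identical");
    return 0;
}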

@Mr-Dave (Member) commented Oct 12, 2014

First, awesome writeup on the code and changes. I will admit that SSE optimization isn't my niche, and I would like to make sure there aren't any cross-platform conflicts before a commit. I can test it on my Pi, but are there any other known conflicts that could occur with, say, BSD, Arch, Cygwin, etc.?

@tosiara (Member) commented Oct 12, 2014

It is also on my list. I will test on an x86 Atom machine and also check some ARM devices. Timeframe: next week (Oct 20).

@Mr-Dave (Member) commented Oct 12, 2014

Also, I might have broken this with the latest change to ffmpeg.c. In that module there was the following code:

#if !defined(__SSE_MATH__) && (defined(__i386__) || defined(__x86_64__))
    asm volatile ("emms");
#endif

This was placed into a non-functional routine, and I've been trying to read up on its use and placement, but haven't yet figured out where the "right" place for it is.

@aklomp (Contributor, Author) commented Oct 12, 2014

The way this patch is written, the SSE2 code path will only be compiled when the __SSE2__ symbol is defined by the compiler. You can see the compile-time dispatcher for alg_noise_tune here. If __SSE2__ is not defined, it uses the old code. I took this idea from the existing MMX optimized code, which uses a similar scheme to selectively inline MMX code only for supported platforms.
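
The pattern is roughly this (a sketch; the actual file names under alg/ may differ):

#ifdef __SSE2__
  #include "alg/alg_noise_tune.sse2.c"   /* vectorized version */
#else
  #include "alg/alg_noise_tune.plain.c"  /* portable fallback */
#endif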

The __SSE2__ symbol is defined only when compiling this code on x86 platforms with the -msse* compiler switch. All other platforms (ARM, SPARC, etc.) won't define the symbol and will use the same old routine as ever. No benefit for them. (I did rewrite most of this code in platform-agnostic GCC vector intrinsics, just to see how it would perform, but that gave a slowdown of 4×, not a speedup. I can post the code if you like, so you can see if there are benefits for ARM.)

There should be no conflicts based on UNIX flavor (as mentioned: BSD, Arch, Cygwin), because all the SSE2 code does is perform the same routine with a different kind of variable: a 16-lane vector type instead of an unsigned char. Platform differences don't enter into it.
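
For comparison, the GCC vector-extension flavour mentioned above looks roughly like this (a sketch, not the actual rewrite):

#include <stdint.h>

/* Sixteen unsigned bytes processed in lockstep. */
typedef uint8_t v16qu __attribute__((vector_size(16)));

static inline v16qu select_v16qu(v16qu mask, v16qu a, v16qu b)
{
    /* Same branchless compositing as the SSE2 version. */
    return (mask & a) | (~mask & b);
}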

Here are the two reasons why I wouldn't be jumping to pull this code just now:

  • Though I've been running it privately for weeks and have seen no regressions, I don't use smartmasks or other fancy config options. I'm a programmer, hence my interest in Motion, but functionally I have the blandest, most uninteresting setup. One V4L webcam, no exotic tuning. More testing in more scenarios by Motion power users with loads of experience with how Motion should behave would be welcome. As Kevin remarked on the mailing list, it's sort of important that this stuff doesn't break, and since this is a large code graft, I'd prefer to be cautious.
  • The code is a bit of a strange fit in the Motion codebase: it adds a few new subdirectories, which is something the rest of the code has seemed to resist. This is something for a maintainer to opine about.

@aklomp (Contributor, Author) commented Oct 12, 2014

Here's a quote from the Intel intrinsics manual about the EMMS instruction:

Using EMMS is like emptying a container to accommodate new content. The EMMS instruction clears the MMX registers and sets the value of the floating-point tag word to empty. Because floating-point convention specifies that the floating-point stack be cleared after use, you should clear the MMX registers before issuing a floating-point instruction. You should insert the EMMS instruction at the end of all MMX code segments to avoid a floating-point overflow exception.

So it's included on x86 platforms which don't use SSE math. That's basically anything older than the Pentium 3, which introduced the SSE instructions. The code won't break this patch, because by definition my code will only be compiled when SSE math is enabled.
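
For context, the canonical placement looks like this (a sketch using the intrinsic form of EMMS):

#include <mmintrin.h>

static void mmx_block(void)
{
    __m64 v = _mm_set1_pi8(0);
    /* ... MMX computation using v ... */
    (void)v;
    _mm_empty();   /* EMMS: mark the x87 register stack empty before
                      any floating-point code runs */
}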

@tosiara (Member) commented Oct 12, 2014

I would appreciate it if someone could optimize "mjpegtoyuv420p" and "decode_jpeg_raw" for ARM :) Those are currently the bottlenecks for ARM.

@aklomp (Contributor, Author) commented Oct 12, 2014

Not to derail the discussion, but this code in mjpegtoyuv420p looks inefficient:

for(loop = 0; loop < width * height; loop++)
    *map++ = yuv[0][loop];

for(loop = 0; loop < width * height / 4; loop++)
    *map++ = yuv[1][loop];

for(loop = 0; loop < width * height / 4; loop++)
    *map++ = yuv[2][loop];

Couldn't this be replaced with:

memcpy(map, yuv[0], width * height);
map += width * height;

memcpy(map, yuv[1], width * height / 4);
map += width * height / 4;

memcpy(map, yuv[2], width * height / 4);
map += width * height / 4;

I mean, I didn't test this or anything, and maybe I overlooked something obvious here, but that looks like an easy win. The memcpy is probably highly optimized.

@aklomp (Contributor, Author) commented Oct 12, 2014

Also, why the memset to zero, if you're going to copy over it anyway? Hell, why allocate all these temp buffers if you can just invoke decode_jpeg_raw like so:

decode_jpeg_raw(cap_map, size, 0, 420, width, height,
    map,
    map + width * height,
    map + width * height + (width * height / 4)
);

Or am I missing something obvious? So many questions...

@aklomp (Contributor, Author) commented Oct 12, 2014

Tosiara, how's this for a mjpegtoyuv420p replacement? I did no testing on this, but if it works, it should be a big improvement.

int mjpegtoyuv420p(unsigned char *map, unsigned char *cap_map, int width, int height, unsigned int size)
{
    int ret = decode_jpeg_raw(cap_map, size, 0, 420, width, height,
                map,
                map + (width * height),
                map + (width * height) + (width * height) / 4);

    if (ret == 1) {
        MOTION_LOG(CRT, TYPE_VIDEO, NO_ERRNO, "%s: Corrupt image ... continue");
        ret = 2;
    }
    return ret;
}

@tosiara (Member) commented Oct 12, 2014

@aklomp thanks a lot, looks promising, I will give it a try! Sorry for the offtopic.

@Mr-Dave (Member) commented Oct 12, 2014

OK. I'm going to leave this for the moment and instead focus on the long list of bug, leak, lost-functionality, and documentation fixes. There are a lot of gold nuggets in here, both in the base pull request and in the offtopic performance discussion.

@tosiara (Member) commented Oct 13, 2014

@aklomp , I have moved the discussion on ARM and MJPEG here - #49

@jogu closed this Nov 3, 2016

@aklomp (Contributor, Author) commented Nov 3, 2016

Care to comment on why this should be closed?

@jogu (Member) commented Nov 3, 2016

Ah. It wasn't deliberately closed. It seems GitHub automatically closed it because I deleted the (unused since 2015) unstable branch which this pull request is trying to merge into.

I think we should update this one to be merging into master.

@aklomp (Contributor, Author) commented Nov 3, 2016

I see, that explains it. In fairness, I also haven't done anything to further this issue in the past two years. The patches themselves probably won't even apply cleanly to the latest codebase.

Looking back, I'd probably take a different approach to what I tried to do here: start by building a test framework, insert tests for the functions in question, then carefully add SIMD versions of those functions. As it stands, I can understand the reluctance to merge it.

I'll think about what to do with this patchset; it's been a while since I was involved in Motion development.

@tosiara (Member) commented Nov 3, 2016

It would be nice to implement SSE2/SIMD/NEON (whatever is possible) optimizations in the new codebase.

ilmich pushed a commit to ilmich/motion that referenced this pull request Jul 19, 2023