Add optimized SSE2 routines for bottleneck functions #42

Closed · wants to merge 11 commits

Conversation

@aklomp (Contributor) commented Oct 6, 2014

This pull request provides SSE2 vectorized implementations of alg_update_reference_frame() and alg_noise_tune(). Profiling with Callgrind on my Atom server showed that the first was the most expensive single function call. Rewriting it in branchless SSE2 code cuts Motion's load average roughly in half for me. Per-function benchmarking shows a speedup of around 2× for alg_update_reference_frame(), and around 4× for alg_noise_tune(). Results differ across hardware and compilers, but always show significant speedup.

The plain functions have been lifted out of alg.c and placed in the new alg/ subdirectory, along with their SSE2 versions. In alg.c, the preprocessor chooses between including plain or SSE2 functions at compile time. This is perhaps not in line with the rest of the codebase, but made it possible to build a test harness around the functions that checks their correctness and does performance benchmarks. This harness can be found in alg/tests.

My website has a writeup of how I converted alg_update_reference_frame() to branchless code. It demonstrates how to derive the logic step by step, and is a kind of giant comment on the code. Hopefully it will help in reviewing the code for correctness.

This code took quite some time to write, but probably deserves more real-world testing than it's had so far. All comments welcome!

Taking the tests out of the branch condition results in a 50% speedup with GCC when compiled with -O3.

This file implements the masking algorithm we will use in the SSE2 code. It demonstrates identical output when compared to the plain function. It's a lot slower than the regular routine when used on individual pixels, but is speedier when vectorized.

Instead of branching, we use masks to composite the output values. The
masks were found by breaking down the branch conditions into boolean
operations, and repeatedly applying De Morgan's laws to simplify them.
Since SSE has the 'andnot' operator, optimize for that form.
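
As a minimal sketch of that compositing pattern (illustrative only; the function name here is mine, not from this branch):

#include <emmintrin.h>

/* Branchless per-lane select: out = mask ? a : b.
   _mm_andnot_si128(mask, b) computes (~mask) & b, so the whole
   expression is (mask & a) | (~mask & b), with no branches. */
static inline __m128i select_si128(__m128i mask, __m128i a, __m128i b)
{
    return _mm_or_si128(_mm_and_si128(mask, a),
                        _mm_andnot_si128(mask, b));
}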

To convert the algorithms to SSE2, we need to know exactly which width and type of int we're dealing with. Make this a uint16_t; the type is large enough for a counter that updates every frame. At 10 Hz, it takes almost 2 hours for the counter to saturate (65535 / 10 ≈ 6554 seconds); enough time to finally accept a static object.

Directly run two functions on the same input and check whether they give the same output; don't just rely on printing a few numbers to the screen and eyeballing the results.
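
A minimal sketch of that idea (the function names and signatures here are assumptions for illustration, not the actual alg/tests interface):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical declarations; the real routines take Motion's own types. */
void alg_noise_tune_plain(uint8_t *buf, size_t len);
void alg_noise_tune_sse2(uint8_t *buf, size_t len);

#define N 4096

int main(void)
{
    uint8_t in[N], a[N], b[N];

    /* Same random input for both implementations. */
    for (size_t i = 0; i < N; i++)
        in[i] = (uint8_t)rand();

    memcpy(a, in, N);
    memcpy(b, in, N);
    alg_noise_tune_plain(a, N);
    alg_noise_tune_sse2(b, N);

    if (memcmp(a, b, N) != 0) {
        fprintf(stderr, "FAIL: plain and SSE2 outputs differ\n");
        return 1;
    }
    puts("OK: outputs identical");
    return 0;
}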

@Mr-Dave (Member) commented Oct 12, 2014

First, awesome writeup on the code and changes. I will admit that SSE optimization isn't my niche, and I would like to make sure there aren't any cross-platform conflicts before a commit. I can test it on my Pi, but are there any other known conflicts that could occur with, say, BSD, Arch, Cygwin, etc.?

@tosiara (Member) commented Oct 12, 2014

It is also on my list. I will test on an x86 Atom machine and also check some ARM devices. Timeframe: next week (Oct 20).

@Mr-Dave (Member) commented Oct 12, 2014

Also, I might have broken this with the latest change to ffmpeg.c. In that module there was the following code:

#if !defined(__SSE_MATH__) && (defined(__i386__) || defined(__x86_64__))
    asm volatile ("emms");
#endif

This was placed into a non-functional routine, and I've been trying to read up on its use and placement, but haven't yet figured out where the "right" place for it is.

@aklomp (Contributor, Author) commented Oct 12, 2014

The way this patch is written, the SSE2 code path will only be compiled when the __SSE2__ symbol is defined by the compiler. You can see the compile-time dispatcher for alg_noise_tune here. If __SSE2__ is not defined, it uses the old code. I took this idea from the existing MMX optimized code, which uses a similar scheme to selectively inline MMX code only for supported platforms.
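
The pattern is roughly this (a sketch; the actual file names under alg/ may differ):

#ifdef __SSE2__
  #include "alg/alg_noise_tune.sse2.c"   /* vectorized version */
#else
  #include "alg/alg_noise_tune.plain.c"  /* portable fallback */
#endif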

The __SSE2__ symbol is defined only when compiling this code on x86 platforms with the -msse* compiler switch. All other platforms (ARM, SPARC, etc.) won't define the symbol and will use the same old routine as ever. No benefit for them. (I did rewrite most of this code in platform-agnostic GCC vector intrinsics, just to see how it would perform, but that gave a slowdown of 4×, not a speedup. I can post the code if you like, so you can see if there are benefits for ARM.)

There should be no conflicts based on UNIX flavor (as mentioned: BSD, Arch, Cygwin), because all the SSE2 code does is perform the same routine with a different kind of variable: a 16-lane vector type instead of an unsigned char. Platform differences don't enter into it.
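
For comparison, the GCC vector-extension flavour mentioned above looks roughly like this (a sketch, not the actual rewrite):

#include <stdint.h>

/* Sixteen unsigned bytes processed in lockstep. */
typedef uint8_t v16qu __attribute__((vector_size(16)));

static inline v16qu select_v16qu(v16qu mask, v16qu a, v16qu b)
{
    /* Same branchless compositing as the SSE2 version. */
    return (mask & a) | (~mask & b);
}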

Here are the two reasons why I wouldn't be jumping to pull this code just now:

  • Though I've been running it privately for weeks and have seen no regressions, I don't use smartmasks or other fancy config options. I'm a programmer, hence my interest in Motion, but functionally I have the blandest, most uninteresting setup. One V4L webcam, no exotic tuning. More testing in more scenarios by Motion power users with loads of experience with how Motion should behave would be welcome. As Kevin remarked on the mailing list, it's sort of important that this stuff doesn't break, and since this is a large code graft, I'd prefer to be cautious.
  • The code is a bit of a strange fit in the Motion codebase: it adds a few new subdirectories, which is something the rest of the code has seemed to resist. This is something for a maintainer to opine about.

@aklomp (Contributor, Author) commented Oct 12, 2014

Here's a quote from the Intel intrinsics manual about the EMMS instruction:

Using EMMS is like emptying a container to accommodate new content. The EMMS instruction clears the MMX registers and sets the value of the floating-point tag word to empty. Because floating-point convention specifies that the floating-point stack be cleared after use, you should clear the MMX registers before issuing a floating-point instruction. You should insert the EMMS instruction at the end of all MMX code segments to avoid a floating-point overflow exception.

So it's included on x86 platforms which don't use SSE math. That's basically anything older than the Pentium 3, which introduced the SSE instructions. The code won't break this patch, because by definition my code will only be compiled when SSE math is enabled.
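
For context, the canonical placement looks like this (a sketch using the intrinsic form of EMMS):

#include <mmintrin.h>

static void mmx_block(void)
{
    __m64 v = _mm_set1_pi8(0);
    /* ... MMX computation using v ... */
    (void)v;
    _mm_empty();   /* EMMS: mark the x87 register stack empty before
                      any floating-point code runs */
}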

@tosiara (Member) commented Oct 12, 2014

I would appreciate it if someone could optimize "mjpegtoyuv420p" and "decode_jpeg_raw" for ARM :) Those are currently the bottlenecks for ARM.

@aklomp (Contributor, Author) commented Oct 12, 2014

Not to derail the discussion, but this code in mjpegtoyuv420p looks inefficient:

for(loop = 0; loop < width * height; loop++)
    *map++ = yuv[0][loop];

for(loop = 0; loop < width * height / 4; loop++)
    *map++ = yuv[1][loop];

for(loop = 0; loop < width * height / 4; loop++)
    *map++ = yuv[2][loop];

Couldn't this be replaced with:

memcpy(map, yuv[0], width * height);
map += width * height;

memcpy(map, yuv[1], width * height / 4);
map += width * height / 4;

memcpy(map, yuv[2], width * height / 4);
map += width * height / 4;

I mean, I didn't test this or anything, and maybe I overlooked something obvious here, but that looks like an easy win. The memcpy is probably highly optimized.

@aklomp (Contributor, Author) commented Oct 12, 2014

Also, why the memset to zero, if you're going to copy over it anyway? Hell, why allocate all these temp buffers if you can just invoke decode_jpeg_raw like so:

decode_jpeg_raw(cap_map, size, 0, 420, width, height,
    map,
    map + width * height,
    map + width * height + (width * height / 4)
);

Or am I missing something obvious? So many questions...

@aklomp (Contributor, Author) commented Oct 12, 2014

Tosiara, how's this for a mjpegtoyuv420p replacement? I did no testing on this, but if it works, it should be a big improvement.

int mjpegtoyuv420p(unsigned char *map, unsigned char *cap_map, int width, int height, unsigned int size)
{
    int ret = decode_jpeg_raw(cap_map, size, 0, 420, width, height,
                map,
                map + (width * height),
                map + (width * height) + (width * height) / 4);

    if (ret == 1) {
        MOTION_LOG(CRT, TYPE_VIDEO, NO_ERRNO, "%s: Corrupt image ... continue");
        ret = 2;
    }
    return ret;
}

@tosiara (Member) commented Oct 12, 2014

@aklomp thanks a lot, looks promising, I will give it a try! Sorry for the offtopic.

@Mr-Dave (Member) commented Oct 12, 2014

OK. I'm going to leave this for the moment and instead focus on the long list of bug, leak, lost-functionality, and documentation fixes. There are a lot of gold nuggets in here, both in the base pull request and in the offtopic performance discussion.

@tosiara (Member) commented Oct 13, 2014

@aklomp , I have moved the discussion on ARM and MJPEG here - #49

@jogu closed this Nov 3, 2016

@aklomp (Contributor, Author) commented Nov 3, 2016

Care to comment on why this should be closed?

@jogu (Member) commented Nov 3, 2016

Ah. It wasn't deliberately closed. It seems GitHub automatically closed it because I deleted the (unused since 2015) unstable branch which this pull request is trying to merge into.

I think we should update this one to be merging into master.

@aklomp (Contributor, Author) commented Nov 3, 2016

I see, that explains it. In fairness, I also haven't done anything to further this issue in the past two years. The patches themselves probably won't even apply cleanly to the latest codebase.

Looking back, I'd probably take a different approach to what I tried to do here: start by building a test framework, insert tests for the functions in question, then carefully add SIMD versions of those functions. As it stands, I can understand the reluctance to merge it.

I'll think about what to do with this patchset; it's been a while since I was involved in Motion development.

@tosiara (Member) commented Nov 3, 2016

It would be nice to implement SSE2/SIMD/NEON (whatever is possible) optimizations in the new codebase.

ilmich pushed a commit to ilmich/motion that referenced this pull request Jul 19, 2023