[TASK] x86 optimizations for motion feature extractor #562

@kylophone

As of 5eacc87, the motion feature extractor has been reimplemented to operate on integer buffers. While this has already resulted in a considerable speedup (see below), there is still room for further optimization. Although motion is already the fastest feature extractor, it is a good first one to optimize: it is relatively simple, and the work will lay the foundation for optimizing the VIF and ADM feature extractors in the coming weeks.

commit 5eacc87ba1eb8c4c7ce543333de5b56e75bc6218
Author: Kyle Swanson <kswanson@netflix.com>
Date:   Thu Apr 30 10:37:40 2020 -0700

    libvmaf: implement fixed point motion feature extractor
    
    Compared to the previous implementation, this is a ~4x speedup
    
    float_motion    motion
    fps="37.07"     fps="153.00"
    
    Co-authored-by: IttiamVijayakumarGR <62744303+IttiamVijayakumarGR@users.noreply.github.com>

The algorithm for this feature extractor is as follows:

  • Blur the input buffers using a 5-tap Gaussian convolution.
  • Calculate the SAD between these blurred buffers.
  • Normalize the SAD score.
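The three steps above can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the code in src/feature/integer_motion.c: the kernel is a placeholder binomial filter rather than the real Gaussian coefficients, borders are handled by clamping, the normalization (divide by the kernel gain and the pixel count) is assumed, and every name (`blur_5tap`, `sad_u16`, `motion_score`) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Placeholder 5-tap kernel (binomial, sums to 16); NOT the libvmaf coefficients. */
static const unsigned kTaps[5] = { 1, 4, 6, 4, 1 };

/* Assumed border policy: clamp out-of-range indices to the nearest valid one. */
static int clamp_index(int i, int n)
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

/* Step 1: separable blur. The vertical pass writes tmp, the horizontal pass
 * writes dst. Output is scaled by 16 * 16 = 256 (no rounding or shifting). */
static void blur_5tap(const uint8_t *src, uint16_t *tmp, uint16_t *dst,
                      int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            unsigned acc = 0;
            for (int k = 0; k < 5; k++)
                acc += kTaps[k] * src[clamp_index(y + k - 2, h) * w + x];
            tmp[y * w + x] = (uint16_t)acc;   /* at most 255 * 16 = 4080 */
        }
    }
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            unsigned acc = 0;
            for (int k = 0; k < 5; k++)
                acc += kTaps[k] * tmp[y * w + clamp_index(x + k - 2, w)];
            dst[y * w + x] = (uint16_t)acc;   /* at most 4080 * 16 = 65280 */
        }
    }
}

/* Step 2: sum of absolute differences between two blurred buffers. */
static uint64_t sad_u16(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint64_t)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Step 3: normalize by the kernel gain (256) and the number of pixels.
 * Returns -1.0 if the scratch buffers cannot be allocated. */
static double motion_score(const uint8_t *prev, const uint8_t *cur,
                           int w, int h)
{
    assert(prev && cur && w > 0 && h > 0);
    size_t n = (size_t)w * (size_t)h;
    uint16_t *buf = malloc(3 * n * sizeof(*buf));
    if (!buf)
        return -1.0;
    uint16_t *tmp = buf, *blur_prev = buf + n, *blur_cur = buf + 2 * n;
    blur_5tap(prev, tmp, blur_prev, w, h);
    blur_5tap(cur, tmp, blur_cur, w, h);
    double score = (double)sad_u16(blur_prev, blur_cur, n) / 256.0 / (double)n;
    free(buf);
    return score;
}
```

With this sketch, two flat 8-bit frames whose pixel values differ by 10 everywhere score 10, and a frame compared against itself scores 0. The inner loops of `blur_5tap` and `sad_u16` are the parts a SIMD version would replace.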

In terms of time spent in this code, it is mostly the convolution (64.5%), with the SAD calculation in second place (12.7%). Note that the convolution has separate paths for 8-bit and 10-bit inputs. Please download the SVG flame graph and open it in a web browser; you can click through the call stacks to see the time spent in each individual function.

Both the SAD function and the convolution function are set via function pointers during feature extractor initialization, here. These functions should be reimplemented using AVX2 and/or AVX-512, and the function pointers should be set to the new versions when those instruction sets are available and not masked. For historical reasons, there is a global variable called cpu from which you can read the CPU flags (this global will be moved into the feature extractor context soon, but for now use the global variable).
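As a rough sketch of that dispatch pattern: start from the C function and override the pointer only when the matching flag is set. Everything named here is hypothetical (`MotionState`, `motion_init`, the `CPU_FLAG_*` bits, and the `sad_*` functions); the real flag names and the layout of the extractor state are defined in libvmaf and may differ. The SIMD functions are stand-ins that call the C version so the example stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag bits; the real names and values come from libvmaf. */
enum { CPU_FLAG_AVX2 = 1 << 0, CPU_FLAG_AVX512 = 1 << 1 };

typedef uint64_t (*sad_fn)(const uint16_t *a, const uint16_t *b, size_t n);

/* C reference implementation: always available. */
static uint64_t sad_c(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint64_t)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Stand-ins for SIMD versions. In a real patch these would live in their
 * own files under src/feature/x86/ and use intrinsics. */
static uint64_t sad_avx2(const uint16_t *a, const uint16_t *b, size_t n)
{
    return sad_c(a, b, n);
}

static uint64_t sad_avx512(const uint16_t *a, const uint16_t *b, size_t n)
{
    return sad_c(a, b, n);
}

/* Hypothetical extractor state holding the selected implementation. */
typedef struct {
    sad_fn sad;
} MotionState;

/* Pick the widest implementation whose flag is set; a masked flag set
 * (cpu_flags == 0) leaves the C version in place. */
static void motion_init(MotionState *s, unsigned cpu_flags)
{
    assert(s != NULL);
    s->sad = sad_c;
    if (cpu_flags & CPU_FLAG_AVX2)
        s->sad = sad_avx2;
    if (cpu_flags & CPU_FLAG_AVX512)
        s->sad = sad_avx512;
}
```

Because callers only ever go through `s->sad`, masking the CPU flags is enough to force the C path, which makes it easy to compare a SIMD version against the reference.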

As for implementation, please leave src/feature/integer_motion.c pure C. This file should serve as a C reference implementation and must remain cross-platform, free of #ifdefs, assembly, and intrinsics. If you write an AVX-512 version of one of these functions, please do so in a new file, src/feature/x86/motion_avx512.c, and provide a header that exposes your functions.

To run vmaf_rc with only the motion feature extractor enabled, use the following command. To manually mask the CPU instruction sets and force the C versions of the functions, additionally set the --cpumask flag to -1.

./build/tools/vmaf_rc \
    --no_prediction \
    --reference ./y4ms/ducks.y4m \
    --distorted ./y4ms/ducks_dist.y4m \
    --output output.xml \
    --feature motion

To measure speedups, I've been relying on /usr/bin/time, but you can also read the fps value from the output XML.
