x64 support in core filters, part 1 #12

tp7 · 2013-11-16T11:34:02Z

This is the first and the largest part of x64 support in core filters.

Important changes

Almost all filters now have plain C, MMX/ISSE and SSE2 versions, implemented using intrinsic functions (the old code used a mix of inline asm, Softwire and external asm). Some filters are also optimized for SSSE3 where it provided a noticeable speedup and was easy enough to add.
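As a rough illustration of the "plain C plus SSE2 intrinsics" pattern, here is a minimal sketch of a Limiter-style clamp. The function names `limit_c` and `limit_sse2` are made up for this example and are not taken from the PR; the real implementations may be structured differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Plain C reference path: clamp every byte to [lo, hi].
// Hypothetical name, used only for this sketch.
static void limit_c(uint8_t* p, size_t n, uint8_t lo, uint8_t hi) {
    for (size_t i = 0; i < n; ++i) {
        uint8_t v = p[i];
        if (v < lo) v = lo;
        if (v > hi) v = hi;
        p[i] = v;
    }
}

// SSE2 path: process 16 bytes per iteration with unsigned byte min/max,
// then hand the remaining tail (fewer than 16 bytes) to the C path.
static void limit_sse2(uint8_t* p, size_t n, uint8_t lo, uint8_t hi) {
    const __m128i vlo = _mm_set1_epi8(static_cast<char>(lo));
    const __m128i vhi = _mm_set1_epi8(static_cast<char>(hi));
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
        v = _mm_max_epu8(v, vlo);  // raise values below lo
        v = _mm_min_epu8(v, vhi);  // lower values above hi
        _mm_storeu_si128(reinterpret_cast<__m128i*>(p + i), v);
    }
    limit_c(p + i, n - i, lo, hi);
}
```

Because intrinsics compile for both 32-bit and 64-bit targets, a routine written this way does not need the separate inline-asm handling that blocked x64 builds.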

Exceptions: the YUY2 Tweak path doesn't have an SSE2 version. It is triggered only under certain conditions and is disabled by default, so I believe it should be removed.

RGB24<->RGB32 conversions are optimized only for MMX and SSSE3 because a fast SSE2 implementation is not trivial and most likely would not be faster than the optimized C path.
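For context, a straightforward scalar version of these conversions could look like the sketch below. This is not the optimized C path from the PR, the function names are hypothetical, and the alpha value of 255 is an assumption. It mainly shows why plain SSE2 is awkward here: 3-byte pixels do not line up with 16-byte registers, so a byte-level shuffle (available from SSSE3 on) is the natural SIMD tool.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar RGB24 -> RGB32: copy three colour bytes per pixel and append alpha.
// The alpha value of 255 is an assumption made for this sketch.
static void rgb24_to_rgb32_c(const uint8_t* src, uint8_t* dst, size_t width) {
    for (size_t x = 0; x < width; ++x) {
        dst[x * 4 + 0] = src[x * 3 + 0];
        dst[x * 4 + 1] = src[x * 3 + 1];
        dst[x * 4 + 2] = src[x * 3 + 2];
        dst[x * 4 + 3] = 255;
    }
}

// Scalar RGB32 -> RGB24: copy three colour bytes and drop the fourth.
static void rgb32_to_rgb24_c(const uint8_t* src, uint8_t* dst, size_t width) {
    for (size_t x = 0; x < width; ++x) {
        dst[x * 3 + 0] = src[x * 4 + 0];
        dst[x * 3 + 1] = src[x * 4 + 1];
        dst[x * 3 + 2] = src[x * 4 + 2];
    }
}
```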

MMX support at the core level is dropped wherever there's a faster ISSE version. This mostly affects Pentium II and possibly some late Pentium I models.
Affected filters: YUY2 <-> YV12 conversion, Overlay, all Layer modes on RGB32 and some modes on YUY2.
Mode 1 is removed from TemporalSoften. The mode parameter is simply ignored.
The MMX parameter in Blur is ignored.
The ISSE version of the HorizontalReduceBy2 YUY2 filter was removed because it was slower than the C code.
Due to the SSSE3 optimizations, the minimum required version of Visual Studio is 2008.
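The practical effect of dropping MMX-only paths can be sketched as a fallback order. The flag names and values below are hypothetical and only illustrate the idea for a filter that has SSE2 and ISSE versions but no MMX-only one; they are not the actual AviSynth CPU flags.

```cpp
#include <cstdint>

// Hypothetical CPU feature bits; names and values are made up for this sketch.
enum CpuFlag : unsigned {
    CPU_MMX  = 1u << 0,
    CPU_ISSE = 1u << 1,
    CPU_SSE2 = 1u << 2
};

enum class Path { C, ISSE, SSE2 };

// Pick the fastest available code path. There is deliberately no MMX branch:
// a CPU that reports MMX but not ISSE falls through to the plain C version.
static Path select_path(unsigned cpu_flags) {
    if (cpu_flags & CPU_SSE2) return Path::SSE2;
    if (cpu_flags & CPU_ISSE) return Path::ISSE;
    return Path::C;
}
```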

Notes on performance

All P4-specific optimizations were removed. The general rule was "not slower than the original on Nehalem+ CPUs", which holds true for all modified filters. The MMX code is not always faster than the original, but the SSE2 code is. The new SSE2 implementations are 10-150% faster than the best old version, depending on the particular filter. The Turn functions got additional optimizations, so you can expect 4-6 times better performance for those on planar colorspaces and RGB32.

I did not test it on CPUs older than Nehalem. I expect performance to get worse on P4, and I'm not sure about some filters on Core 2. Additional testing is needed, but I can't do that myself, and I don't think we should spend time optimizing for marginal gains on CPUs more than five years old.

To do

Slight refactoring once we agree on some kind of coding guidelines. Right now the codebase is not consistent at all.
The MMX code of the audio filters is not ported. The C path should still work.
The Softwire dependency cannot be removed yet since the resizers depend on it.
Some Turn functions could be optimized a bit more if that is needed for the resizers.

tp7 added 30 commits October 18, 2013 22:06

Rewrite Limiter asm, remove SoftWire dependency.

024aa4b

Rewrite SwapUV YUY2 asm

38e5727

Rewrite VerticalReduceBy2 asm

b282d67

Disable SwapUV and HorizontalReduceBy2 asm for small clips

9dea99d

Remove TemporalSoften mode 1.

b880b7a

Keep the parameter for compatibility with older scripts. Also remove isse version of accumulate_line for mode 2. The only difference was quite useless prefetch instruction.

Rewrite TemporalSoften asm

48d22b1

Enable TemporalSoften on x64

9eebd6d

Simplify Limiter SSE2 path.

8182daa

Also fix compilation in 64bit mode.

Replace uc define with BYTE for consistency in Blur

07fc19b

Clean up Blur code a bit

3c3661d

Remove mmx parameter

Rewrite vertical blur/sharpen asm

b66f132

Rewrite horizontal blur YV12 asm

f76904a

Reduce some code duplication in Blur, cleanup.

f49dbe9

Also fix a bug with C version of blur not processing the leftmost pixel

Rewrite RGB32 Blur MMX code to intrinsics

a4945cc

Add horizontal blur SSE2 implementation

e03bb94

Rewrite MMX version to not be in-place. This simplifies the code a lot and speeds things up a bit on newer processors, but might make it slower on old ones where unaligned reads are slow.

Rewrite Blur YUY2 asm. Both SSE2 and MMX versions.

c2a62ef

Rewrite convert YUY2 -> Y8 asm. Both SSE2 and MMX.

39d625a

Rewrite RGB32 -> Y8 conversion asm. MMX only.

4f45ccb

Rewrite RGB24 -> Y8 conversion asm. MMX only.

71afc37

SSE2 versions of RGB32 -> Y8 and RGB24 -> Y8 conversions

0cdeac7

Rewrite YUY2 Grayscale asm. Both MMX and SSE2.

5e06908

Rewrite Grayscale RGB32 asm. Both MMX and SSE2.

8d92c8b

Rewrite AveragePlane asm. Both MMX and SSE2.

2394e4e

Rewrite ComparePlane asm. Both MMX and SSE2.

66e8884

Rewrite RGB32 -> RGB24 conversion asm. MMX and SSSE3 versions.

77105af

Rewrite RGB24 -> RGB32 conversion asm. MMX and SSSE3 versions.

e13944a

Change planar Grayscale to use memset instead of looping manually

a4f75ee

Rewrite YUY2 -> YV16 conversion asm. MMX and SSE2 versions.

8c612e8

Rewrite YV16 -> YUY2 conversion asm. MMX and SSE2 versions.

0ed02d9

Add C version of YV12 -> YUY2 conversion

4955fa0

tp7 added 28 commits November 9, 2013 18:40

Fix a typo in a comment in convert_planar.

1c8f407

Remove inline_rgbtoyuy2 from ConvertToYUY2 class. Also rename.

d59f793

Rewrite ConvertToYUY2 from RGB. MMX only.

a0ace3b

Rewrite ConvertToYUY2 from RGB. SSE2 version.

a574e33

Remove Softwire code from RGB->YUY2 conversions.

fd437fb

Make C version of YUY2 -> RGB conversion colormatrix-aware

794ca09

Add ISSE YUY2->RGB conversion

4d6d88b

Remove unused convert_a.asm

c484243

Add SSE2 YUY2->RGB conversion

e3f1d87

Fix compilation on x64

ed4c91b

Refactor Turn filter

3abd944

Add SSE2/SSSE3 optimizations from FTurn

4948395

Add SSE2 versions of RGB32 Turn functions

69cc03e

Optimize YUY2 -> YV16 conversion a bit.

97ec159

Remove two useless instructions in SSE2 and MMX versions, add SSSE3 version.

Optimize/simplify planar TurnRight by transposing to the right direct…

982f743

…ion directly instead of shuffling the register afterwards. Also remove SSSE3 version since SSSE3 is not needed anywhere.

Add C versions of Layer YUY2 functions

b7c5fa4

Add C versions of Layer RGB32 functions

e8d1299

Rewrite Layer YUY2 MMX

6bc6368

Template Layer YUY2 C functions

a5e1dd1

Add SSE2 versions of Layer YUY2 functions

f563149

Rewrite Layer RGB32 MMX

9213ac7

Add SSE2 versions of Layer RGB32 functions

432fb70

Enable layer in x64 mode

3aaaae8

Remove some leftovers and alignment check on most RGB32 Layer routines

7390833

Rewrite tweak ISSE asm to intrinsics.

c95100f

It's pretty much useless, just for completeness.

Rewrite Compare ISSE code

51d0745

Add Compare SSE2 routine

25e23d6

Optimize RGB24<->32 conversion C code

fef5ab7

pylorak merged commit fef5ab7 into AviSynth:master Nov 17, 2013

tp7 deleted the asm-rewrite branch November 19, 2013 03:23
