x64 support in core filters, part 1 #12

Merged
pylorak merged 88 commits into AviSynth:master from the asm-rewrite branch on Nov 17, 2013

tp7 commented Nov 16, 2013

This is the first and the largest part of x64 support in core filters.

Important changes

  • Almost all filters now have plain C, MMX/ISSE and SSE2 versions, implemented using intrinsic functions (the old code used a mix of inline asm, Softwire and external asm). Some filters are also optimized for SSSE3 where it provided a noticeable speedup and was easy enough to add.

Exceptions: the YUY2 Tweak path doesn't have an SSE2 version. It is triggered only under certain conditions and is disabled by default, and I believe it should be removed.

RGB24<->RGB32 conversions are optimized only for MMX and SSSE3 because a fast SSE2 implementation is not trivial and most likely would not be faster than the optimized C path.

  • MMX support at the core level is dropped wherever there is a faster ISSE version. This mostly affects Pentium II and possibly some late Pentium I models.
    Affected filters: YUY2 <-> YV12 conversion, Overlay, all Layer modes on RGB32 and some modes on YUY2.
  • Mode 1 is removed from TemporalSoften. The mode parameter is simply ignored.
  • MMX parameter in Blur is ignored.
  • ISSE version of HorizontalReduceBy2 YUY2 filter was removed because it was slower than the C code.
  • Due to the SSSE3 optimizations, the minimum required version of Visual Studio is 2008.

Notes on performance

All P4-specific optimizations were removed. The general rule was "not slower than original on Nehalem+ CPUs", which holds true for all modified filters. The MMX code is not always faster than the original, but the SSE2 code is. The new SSE2 implementations are 10-150% faster than the best old versions, depending on the particular filter. The Turn functions got additional optimizations, so you can expect 4-6 times better performance for those on planar colorspaces and RGB32.

I did not test it on CPUs older than Nehalem. I expect performance to get worse on P4, and I'm not sure about some filters on Core 2. Additional testing is needed, but I can't do that myself and I don't think we should spend time optimizing for marginal gains on CPUs that are more than five years old.

To do

  • Slight refactoring once we agree on some kind of coding guidelines. Right now the codebase is not consistent at all.
  • The MMX code of the audio filters is not ported. The C path should still work.
  • Softwire dependency cannot be removed yet since resizers depend on it.
  • Some Turn functions could be optimized a bit more if it is needed for resizers.

tp7 added 30 commits October 18, 2013 22:06
Keep the parameter for compatibility with older scripts.
Also remove the ISSE version of accumulate_line for mode 2. The only difference was a rather useless prefetch instruction.
Also fix compilation in 64bit mode.
Remove mmx parameter
Also fix a bug where the C version of blur did not process the leftmost pixel
Rewrite the MMX version to not be in-place. This simplifies the code a lot and speeds things up a bit on newer processors, but might make it slower on old ones where unaligned reads are slow.
tp7 added 28 commits November 9, 2013 18:40
Remove two useless instructions in SSE2 and MMX versions, add SSSE3 version.
…ion directly instead of shuffling the register afterwards.

Also remove SSSE3 version since SSSE3 is not needed anywhere.
It's pretty much useless, just for completeness.
pylorak merged commit fef5ab7 into AviSynth:master Nov 17, 2013
tp7 deleted the asm-rewrite branch November 19, 2013 03:23