forked from jeeb/avisynth
x64 support in core filters, part 1 #12
Merged
Keep the parameter for compatibility with older scripts. Also remove the isse version of accumulate_line for mode 2. The only difference was a rather useless prefetch instruction.
Also fix compilation in 64-bit mode.
Remove mmx parameter
Also fix a bug where the C version of blur did not process the leftmost pixel
Rewrite MMX version to not be in-place. This simplifies the code a lot and speeds things up a bit on newer processors, but might make it slower on old ones where unaligned reads are slow.
Remove two useless instructions in SSE2 and MMX versions, add SSSE3 version.
…ion directly instead of shuffling the register afterwards. Also remove SSSE3 version since SSSE3 is not needed anywhere.
It's pretty much useless, just for completeness.
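To illustrate the leftmost-pixel fix mentioned in the commits above, here is a minimal sketch of a horizontal blur row that handles both edges by clamping the neighbor index. The function name `blur_row_c` and the fixed (1, 2, 1) / 4 weights are assumptions made for the example; they are not the actual kernel or code from this pull request. The sketch writes to a separate destination buffer, so every tap reads unmodified source data.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 3-tap horizontal blur with fixed weights (1, 2, 1) / 4.
// Not the project's real kernel; it only shows edge handling.
void blur_row_c(const std::uint8_t* src, std::uint8_t* dst, int width)
{
    assert(width >= 1);
    for (int x = 0; x < width; ++x) {
        // Clamp neighbor indices so x == 0 and x == width - 1 are processed too.
        const int left  = src[x > 0 ? x - 1 : 0];
        const int right = src[x < width - 1 ? x + 1 : width - 1];
        // +2 rounds to nearest before dividing by 4.
        dst[x] = static_cast<std::uint8_t>((left + 2 * src[x] + right + 2) >> 2);
    }
}
```

A loop that starts at `x = 1` to avoid the `x - 1` read would leave `dst[0]` untouched, which is the kind of bug described above.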
This is the first and the largest part of x64 support in core filters.
Important changes
Exceptions: the YUY2 Tweak path doesn't have an SSE2 version. It is triggered only under certain conditions and is disabled by default, and I believe it should be removed.
RGB24<->RGB32 conversions are optimized only for MMX and SSSE3, because a fast SSE2 implementation is not trivial and most likely would not be faster than the optimized C path.
Affected filters: YUY2 <-> YV12 conversion, Overlay, all Layer modes on RGB32 and some modes on YUY2.
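For the RGB24<->RGB32 case mentioned above, a plain C-style fallback can be sketched roughly as follows. The function name, signature, and the choice of 255 for the alpha byte are assumptions for illustration, not the code from this pull request; the point is only that the scalar loop is simple enough for a compiler to optimize well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar RGB24 -> RGB32 conversion (illustrative names only).
// Pitches are in bytes; width and height are in pixels.
void convert_rgb24_to_rgb32_c(const std::uint8_t* src, std::uint8_t* dst,
                              std::ptrdiff_t src_pitch, std::ptrdiff_t dst_pitch,
                              int width, int height)
{
    assert(width >= 0 && height >= 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Copy the three color bytes unchanged, in the same order.
            dst[x * 4 + 0] = src[x * 3 + 0];
            dst[x * 4 + 1] = src[x * 3 + 1];
            dst[x * 4 + 2] = src[x * 3 + 2];
            // Assumed: the new alpha byte is set to fully opaque.
            dst[x * 4 + 3] = 255;
        }
        src += src_pitch;
        dst += dst_pitch;
    }
}
```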
Notes on performance
All P4-specific optimizations were removed. The general rule was "not slower than the original on Nehalem+ CPUs", which holds true for all modified filters. MMX code is not always faster than the original, but SSE2 is. The new SSE2 implementations are 10-150% faster than the best old version, depending on the particular filter. Turn functions got additional optimizations, so you can expect 4-6 times better performance for those on planar colorspaces and RGB32.
I did not test it on CPUs older than Nehalem. I expect performance to get worse on P4, and I'm not sure about some filters on Core 2. Additional testing is needed, but I can't do that myself, and I don't think we should spend time optimizing for marginal gains on CPUs that are more than 5 years old.
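As a rough illustration of how a filter might choose between the C, MMX, SSE2 and SSSE3 paths discussed here, the sketch below picks the widest available implementation. All names (`CpuFlags`, `FilterImpls`, `select_impl`) are hypothetical, and the rule that MMX paths are used only in 32-bit builds is an assumption of this example, not something taken from the pull request.

```cpp
#include <cassert>

// Hypothetical CPU feature bits; the real project defines its own flags.
enum CpuFlags : unsigned {
    CPU_MMX   = 1u << 0,
    CPU_SSE2  = 1u << 1,
    CPU_SSSE3 = 1u << 2,
};

enum class Impl { C, MMX, SSE2, SSSE3 };

// Which optimized versions a given filter actually provides.
struct FilterImpls {
    bool has_mmx;
    bool has_sse2;
    bool has_ssse3;
};

// Prefer the newest instruction set that the CPU supports and the filter implements.
// Assumption: MMX code is only usable in 32-bit builds.
constexpr Impl select_impl(unsigned cpu_flags, bool is_64bit_build, FilterImpls f)
{
    if (f.has_ssse3 && (cpu_flags & CPU_SSSE3)) return Impl::SSSE3;
    if (f.has_sse2 && (cpu_flags & CPU_SSE2)) return Impl::SSE2;
    if (f.has_mmx && !is_64bit_build && (cpu_flags & CPU_MMX)) return Impl::MMX;
    return Impl::C;
}
```

Under these assumptions, a filter that ships only MMX and SSSE3 versions (as described for the RGB24<->RGB32 conversions) would fall back to the C path in a 64-bit build on a CPU without SSSE3.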
To do