Greg/gsdx 64b #1664

Merged
merged 20 commits into from Nov 19, 2016

Conversation

gregory38
Contributor

AVX/x64 implementation of the GSdx JIT compiler. So far, perf sucks: 10% to 20% slower than 32 bits.

With hard work, it might be possible to reach the 32-bit performance level (or close enough), but it is really a low priority. At least we can run, test, and develop on a 64-bit GSdx.

@willkuer
Contributor

Do you run the core in interpreter mode? How do you benchmark? In the GS replayer?

@gregory38
Contributor Author

gregory38 commented Nov 19, 2016

Yes, I benchmarked without the core, through the replayer. The interpreter would be too slow, and I have additional timing info in the replayer.

Here is an example on the HW renderer (but it is the same).

Performance Profile for 53 frames:
Min  1.40 ms    (711.80 fps)
Mean 7.50 ms    (133.31 fps)
Max  13.54 ms   (73.88 fps)
SD   3.72 ms

Frame Repartition
  0 ms =>   2 ms       1
  2 ms =>   4 ms      16
  4 ms =>   6 ms       9
  6 ms =>   8 ms       0
  8 ms =>  10 ms       7
 10 ms =>  12 ms      15
 12 ms =>  14 ms       5
 14 ms =>  16 ms       0

Edit: I benchmarked with the turbo off to avoid variation.
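
For anyone who wants to produce a similar summary from their own frame timings, here is a minimal C++ sketch. It is not the replayer's actual code: `FrameProfile`, `summarize`, and `fps_from_ms` are hypothetical names. It assumes the population standard deviation, non-negative frame times, and fixed-width bins with outliers clamped into the last bin.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary type; the names are illustrative, not taken from GSdx.
struct FrameProfile {
    double min_ms = 0, mean_ms = 0, max_ms = 0, sd_ms = 0;
    std::vector<int> bins; // bins[i] counts frames in [i * bin_ms, (i + 1) * bin_ms)
};

// Convert a frame time in milliseconds to frames per second.
inline double fps_from_ms(double ms) { return 1000.0 / ms; }

// Summarize per-frame render times given in milliseconds.
// Assumption: population standard deviation (divide by N, not N - 1).
FrameProfile summarize(const std::vector<double>& frame_ms,
                       double bin_ms, std::size_t bin_count) {
    assert(!frame_ms.empty() && bin_ms > 0.0 && bin_count > 0);

    FrameProfile p;
    p.min_ms = *std::min_element(frame_ms.begin(), frame_ms.end());
    p.max_ms = *std::max_element(frame_ms.begin(), frame_ms.end());

    double sum = 0.0;
    for (double t : frame_ms) sum += t;
    p.mean_ms = sum / static_cast<double>(frame_ms.size());

    double squared = 0.0;
    for (double t : frame_ms) squared += (t - p.mean_ms) * (t - p.mean_ms);
    p.sd_ms = std::sqrt(squared / static_cast<double>(frame_ms.size()));

    p.bins.assign(bin_count, 0);
    for (double t : frame_ms) {
        std::size_t i = static_cast<std::size_t>(t / bin_ms);
        if (i >= bin_count) i = bin_count - 1; // clamp outliers into the last bin
        ++p.bins[i];
    }
    return p;
}
```

Calling `summarize(times, 2.0, 8)` gives eight 2 ms bins, matching the layout of the frame repartition shown above.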

@willkuer
Contributor

Interesting that the histogram shows two maxima. Maybe one can improve the bottleneck of the slower one.

Have you tried a different compiler? I guess the compiler can be important here.

Also, do you know that the bottleneck is in GSdx and not in the GPU driver? Maybe calling the driver from 64-bit code makes a difference?

@gregory38
Contributor Author

The above was an example on the HW renderer (32 bits). I didn't test the HW renderer, but it could be interesting. It is less critical on the HW renderer (often GPU limited).

Some frames are nearly empty (internal 30 fps). So they are quick to render.

@gregory38
Contributor Author

By the way, it would help if someone could test the SW renderer on 32 bits with all ISAs. I want to avoid regressions due to code factorization.

@gregory38
Contributor Author

I pushed an additional change to select the ISA (SSE2/SSSE3/SSE41/AVX1) at runtime.

It only impacts the self-generated code (AKA the SW renderer). Nevertheless (if I manage to compile it on VS), it makes

  • SSSE3 build useless
  • AVX1 build mostly useless

So in the future, we could limit it to SSE2/SSE4.1/AVX2.
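
The runtime selection described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `CpuFeatures`, `Isa`, `select_isa`, and `isa_name` are hypothetical names, and the feature flags are assumed to be filled in elsewhere (for example from CPUID).

```cpp
#include <cassert>

// Hypothetical feature flags; a real build would fill them from CPUID.
struct CpuFeatures {
    bool sse2 = false;
    bool ssse3 = false;
    bool sse41 = false;
    bool avx = false;
};

enum class Isa { None, SSE2, SSSE3, SSE41, AVX1 };

// Pick the most capable ISA the CPU reports, checking from newest to oldest.
Isa select_isa(const CpuFeatures& f) {
    if (f.avx)   return Isa::AVX1;
    if (f.sse41) return Isa::SSE41;
    if (f.ssse3) return Isa::SSSE3;
    if (f.sse2)  return Isa::SSE2;
    return Isa::None;
}

// Human-readable name, e.g. for a log line at JIT start-up.
const char* isa_name(Isa isa) {
    switch (isa) {
        case Isa::AVX1:  return "AVX1";
        case Isa::SSE41: return "SSE41";
        case Isa::SSSE3: return "SSSE3";
        case Isa::SSE2:  return "SSE2";
        case Isa::None:  return "none";
    }
    assert(false && "unhandled Isa value");
    return "none";
}
```

With a scheme like this, one binary built for the lowest ISA can still generate the best code for the host CPU, which is why separate SSSE3 and AVX1 builds would add little for the SW renderer.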

@gregory38
Contributor Author

gregory38 commented Nov 19, 2016

@turtleli
Could you help me fix the compilation issue on Windows? Are AVX files (aka *.x86.avx.cpp) compiled in the SSE Windows builds?

Edit: it seems there is some black magic in ./GSdx.vcxproj

@turtleli
Member

Yeah, it's the exclude build stuff.

Fix for 32-bit Windows build (64 bits doesn't compile because __x86_64__ in stdafx.h isn't defined on Windows) - https://gist.github.com/turtleli/a05906730b9b885d5dde2e862440238f
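
For context on the `__x86_64__` remark: that macro is predefined by GCC and Clang, while MSVC predefines `_M_X64` (and `_M_AMD64`) for x64 targets. A portable check could look like the sketch below. `SKETCH_ARCH_X86_64` and `is_x86_64_build` are hypothetical names used only for illustration; they are not what stdafx.h or the linked gist use.

```cpp
// SKETCH_ARCH_X86_64 is a hypothetical macro for this example only.
// GCC/Clang predefine __x86_64__; MSVC predefines _M_X64 and _M_AMD64.
#if defined(__x86_64__) || defined(_M_X64) || defined(_M_AMD64)
    #define SKETCH_ARCH_X86_64 1
#else
    #define SKETCH_ARCH_X86_64 0
#endif

// Compile-time query usable from ordinary code instead of raw #if blocks.
constexpr bool is_x86_64_build() { return SKETCH_ARCH_X86_64 != 0; }
```

Testing one project-level macro everywhere keeps the compiler-specific spelling in a single place.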

@gregory38
Contributor Author

Thanks for the patch.

gregory38 and others added 20 commits November 19, 2016 17:00
1/ Check all "levels"
2/ requires AVX for 64 bits
Very useful to stop the JIT
Allow to compare 32/64 bits (and all ISAs too)
Allow to breakpoint (int3)
Print selector info
Print size of buffer and start (disabled by default)
Based on Gabest's work.

* Miss mipmap

Note: dithering info
It is a bit tricky as a2 on linux was rdx register which overlap with fzm (dh/dl)
It might require dedicated windows code
mov with the stack pointer requires less bytecode
It will require a generic (register naming) linear interpolation to use it properly
The gather instruction requires an extra mask register, therefore all register names will be shuffled

Perf wise, initial haswell implementation seems to be microcode emulated.
…scanline)

It won't give the full SSE41 speed boost but it is better than nothing
The JIT will automatically select the best ISA (only AVX1 so far)
gregory38 merged commit 58c3794 into master Nov 19, 2016
gregory38 deleted the greg/gsdx-64b branch November 19, 2016 17:12
@gregory38
Contributor Author

Let's go. This is the best way to get some test coverage 👍

@willkuer
Contributor

willkuer commented Nov 19, 2016

Did I understand correctly that this PR solved #796?

Ah I see. It only solves it partially.

@gregory38
Contributor Author

No. It is only a step forward. You have the C code, and you have the code that is generated by the C code itself and then executed. The latter is used by the SW renderer.
In the case of the SSE2 up to AVX1 builds, the generated code will be based on your CPU capabilities. However, all C code will still depend on the compilation flags.

However, I think the SSSE3 and AVX1 builds are useless now (I hope I won't have an AVX penalty).
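
To illustrate the compile-time half of that distinction: the ISA available to the C code is fixed by the compiler flags, and can be observed through predefined macros. The sketch below is only an illustration, and `compiled_isa` is a hypothetical name. GCC and Clang define all of the macros tested here; MSVC defines only `__AVX__` and `__AVX2__` (with /arch:AVX or /arch:AVX2).

```cpp
// Report which vector ISA the compiler was allowed to use for this
// translation unit. The answer is fixed at build time, unlike the
// JIT-generated code, which can adapt to the CPU it runs on.
constexpr const char* compiled_isa() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_1__)
    return "SSE4.1";
#elif defined(__SSSE3__)
    return "SSSE3";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "baseline"; // no known vector macro (or MSVC without /arch:AVX)
#endif
}
```

A build made with `-msse2` reports "SSE2" here even on an AVX-capable machine, while the generated SW renderer code could still use AVX.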

@mirh

mirh commented Nov 19, 2016

Mhh.. is this affecting #357 then?

@gregory38
Contributor Author

Well, it was only a stub (on GSdx SW) to allow compilation (it used to crash). The new code is working.
The limitations:

  • mipmap isn't implemented
  • no AVX2
  • no SSE, but any CPU that supports AVX will pick up the AVX version (~70% of the users)

Tbh, AVX2 will be done. But I don't think pure SSE will be done. It is too slow to be worth it.


4 participants