
experimental opencl renderer #302

Closed

wants to merge 13 commits into from

Conversation

@gabest11 (Contributor)

No description provided.

…e cache, mipmap, aa1, device selection). Needs any OpenCL SDK for the common headers and stub lib to compile, tested with AMD and Intel. Too bad it is not part of the Windows SDK yet.

- Renumbered renderer ids, compatible with old numbering, but it does not follow the mod3 logic anymore.
@ramapcsx2 (Member)

Well, this PR is problematic in several ways :p
First, it requires everyone to install a new SDK to compile PCSX2:
fatal error C1083: Cannot open include file: 'CL/cl.hpp': No such file or directory
Second, there are some unrelated changes, such as the refactoring. Those could be okay, dunno.

@gabest11 (Contributor, Author)

Yep, install your favorite OpenCL SDK, it's awesome! Anyway, it's full of bugs, but I'm weeding them out; it will be usable in a few days. The other changes are there because no one told me about GitHub in time :)

@neobrain

Heh, I always wanted to do something similar for Dolphin too - so nice to see someone try it out :)

Just wondering, how's performance with this code compared to the GL/software renderers? In particular I was thinking alpha blending and depth testing (or any other output-merger tasks) would be a pain to get done efficiently.

@degasus

degasus commented Sep 17, 2014

As you iterate over all primitives for each pixel, how many triangles end up rendered per pixel? I'm wondering how much time most of the GPU's execution units spend just iterating over skipped primitives.

@gabest11 (Contributor, Author)

Performance on a Radeon 270X is about 80% of my i7-4770 on 4 threads; no idea about NVIDIA, I heard it was worse for GPU computing, but maybe not. I still have ideas on how to optimize the rendering, there are always shortcuts to find for special drawing cases. The main slowdown happens when the game switches between too many kernel types; it is more sensitive to that than D3D or OpenGL. What was really surprising to me was the host memory speed. I don't even have to ping-pong anything, it just fetches texture data and draws right in the puss.. I mean, system memory, through PCIe.

The output merger is the main reason we need our own rasterizer, since it's fixed-function in the accelerated APIs. There is virtually no memory access in the kernel: read the target into a register once, iterate over the triangles, write it out. Merging happens in the register. The PS2 may use the same memory for frame and Z buffers, exactly or overlapped; that may be bogus currently. I have to find an example game and add a workaround, but that's not a huge problem.

The rendering is split into 16x16 tiles and batches of 4096 primitives; each tile gets as many prims as the bbox and other tests find, so it varies. One compute unit does one tile; they can run ahead of each other within a batch and grab another tile when ready, but at the end of the batch they are synchronized by the kernel call boundaries. My software renderer does not have this limitation, it only syncs when the render target changes to another that is still being rendered; I suspect that's what gives it an advantage over OpenCL currently. I could do something similar by launching the drawing kernels in parallel, each queue having its own part of the screen in a checkerboard style, but then I would need to use events to chain the rendering queues, and events are dog slow. Even waking up a queue once it goes to sleep sometimes costs a full millisecond, when there are only 20ms per frame in a 50fps game.
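The binning step described above can be sketched on the CPU side. This is a hypothetical illustration (names and structure are mine, not the PR's actual code): primitives are taken in batches of 4096 and each 16x16 tile collects the indices of the primitives whose bounding box overlaps it.

```cpp
#include <cstdint>
#include <vector>

// pixel-space bounds, [x0, x1) x [y0, y1)
struct BBox { int x0, y0, x1, y1; };

constexpr int TILE  = 16;   // tile edge in pixels
constexpr int BATCH = 4096; // primitives per kernel launch

// Returns, for each tile of a width x height target, the list of primitive
// indices (within one batch) whose bbox touches that tile.
std::vector<std::vector<int>> bin_batch(const std::vector<BBox>& prims,
                                        int width, int height)
{
    int tiles_x = (width + TILE - 1) / TILE;
    int tiles_y = (height + TILE - 1) / TILE;
    std::vector<std::vector<int>> bins(tiles_x * tiles_y);

    int n = (int)prims.size() < BATCH ? (int)prims.size() : BATCH;
    for (int i = 0; i < n; i++) {
        const BBox& b = prims[i];
        // clamp the bbox to the target and convert to tile coordinates
        int tx0 = b.x0 < 0 ? 0 : b.x0 / TILE;
        int ty0 = b.y0 < 0 ? 0 : b.y0 / TILE;
        int tx1 = ((b.x1 > width  ? width  : b.x1) + TILE - 1) / TILE;
        int ty1 = ((b.y1 > height ? height : b.y1) + TILE - 1) / TILE;
        for (int ty = ty0; ty < ty1; ty++)
            for (int tx = tx0; tx < tx1; tx++)
                bins[ty * tiles_x + tx].push_back(i);
    }
    return bins;
}
```

Each compute unit then processes one tile's list independently, which is why the per-tile primitive count "varies" as described.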

@yxmline

yxmline commented Sep 18, 2014

Not compatible with NVIDIA OpenCL ⊙▂⊙

@gabest11 (Contributor, Author)

Do you see any compiling errors in the console window?

@yxmline

yxmline commented Sep 18, 2014

The compiler has no problem.
Runtime error:
Runtime Error!

Program:

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

This is a log:
PCSX2 1.3.0-20140913142241 - compiled on Sep 14 2014
Savestate version: 0x9a0a0000

Host Machine Init:
Operating System = Microsoft Windows 7 Ultimate Edition Service Pack 1 (build 7601), 64-bit
Physical RAM = 20437 MB
CPU name = Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Vendor/Model = GenuineIntel (stepping 09)
CPU speed = 3.391 ghz (8 logical threads)
x86PType = Standard OEM
x86Flags = bfebfbff 7fbae3ff
x86EFlags = 28100000

x86 Features Detected:
SSE2.. SSE3.. SSSE3.. SSE4.1.. SSE4.2.. AVX

Reserving memory for recompilers...

Loading plugins...
Binding GS: D:\Winlinux\Msys64\home\Administrator\pcsx2\tmp\pcsx2-vs12\plugins\GSdx32-AVX.dll
Windows 6.1.7601 (Service Pack 1 1.0)
Binding PAD: D:\Winlinux\Msys64\home\Administrator\pcsx2\tmp\pcsx2-vs12\plugins\LilyPad.dll
Binding SPU2: D:\Winlinux\Msys64\home\Administrator\pcsx2\tmp\pcsx2-vs12\plugins\SPU2-X.dll
Binding CDVD: D:\Winlinux\Msys64\home\Administrator\pcsx2\tmp\pcsx2-vs12\plugins\CDVDiso.dll
Binding USB: D:\Winlinux\Msys64\home\Administrator\pcsx2\tmp\pcsx2-vs12\plugins\USBnull.dll
Binding FW: D:\Winlinux\Msys64\home\Administrator\pcsx2\tmp\pcsx2-vs12\plugins\FWnull.dll
Binding DEV9: D:\Winlinux\Msys64\home\Administrator\pcsx2\tmp\pcsx2-vs12\plugins\DEV9null.dll
Plugins loaded successfully.

(GameDB) 9660 games on record (loaded in 147ms)
HLE Notice: ELF does not have a path.

Initializing plugins...
Init GS
Windows 6.1.7601 (Service Pack 1 1.0)
Init PAD
Init SPU2
Init CDVD
Init USB
Init FW
Init DEV9
Plugins initialized successfully.

Opening plugins...
Opening GS
Opening PAD
Opening SPU2
NVIDIA Corporation GeForce GTX 760 OpenCL C 1.1 GPU
Intel(R) Corporation Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz OpenCL C 1.2 CPU
Intel(R) Corporation Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz OpenCL C 2.0 CPU
Opening CDVD
Opening USB
Opening FW
Opening DEV9
Closing plugins...
Closing DEV9
Closing FW
Closing USB
Closing CDVD
Closing SPU2
(pxActionEvent) The MTGS thread has become unresponsive while waiting for the GS plugin to open.(thread:EE Core)
Closing PAD
Closing GS

windbg log:
WARNING: Continuing a non-continuable exception
(2448.25b4): C++ EH exception - code e06d7363 (first chance)
(2448.25b4): C++ EH exception - code e06d7363 (first chance)
(2448.24b0): C++ EH exception - code e06d7363 (first chance)

@gabest11 (Contributor, Author)

I can't see any "OpenCL C 1.2 GPU" device, that's the problem. I don't know why it crashed; it should have just thrown an exception to GSOpen and returned an error code from there. The real question is why NVIDIA has no 1.2 support. I'm trying to find information on which cards have which version. I might have to lower the version requirement if this turns out to be the case.
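The device filtering at issue here boils down to parsing the version string each device reports (the "OpenCL C 1.1 GPU" lines in the log above). A minimal sketch of such a check — helper names are mine, not GSdx's:

```cpp
#include <cstdio>
#include <string>

// CL_DEVICE_OPENCL_C_VERSION strings have the form "OpenCL C <major>.<minor> ..."
bool parse_cl_c_version(const std::string& s, int& major, int& minor)
{
    return std::sscanf(s.c_str(), "OpenCL C %d.%d", &major, &minor) == 2;
}

// True if the reported version meets the required minimum.
bool device_supported(const std::string& version, int req_major, int req_minor)
{
    int major = 0, minor = 0;
    if (!parse_cl_c_version(version, major, minor))
        return false;
    return major > req_major || (major == req_major && minor >= req_minor);
}
```

With a 1.2 requirement, the GTX 760's "OpenCL C 1.1" report would be rejected, while the Intel CPU devices would pass, matching the behavior discussed above.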

@Silanda

Silanda commented Sep 18, 2014

Gabest, if you're building using the AMD SDK for OpenCL 1.2, it's not going to work on Nvidia hardware at all. To say Nvidia have been dragging their heels when it comes to OpenCL is an understatement; they only support 1.1. Building against Nvidia's OpenCL.lib, using the Khronos 1.2 headers, leads to four unresolved externals:

GSRendererCL.obj : error LNK2001: unresolved external symbol _clEnqueueMarkerWithWaitList@16
GSRendererCL.obj : error LNK2001: unresolved external symbol _clEnqueueBarrierWithWaitList@16
GSRendererCL.obj : error LNK2001: unresolved external symbol _clReleaseDevice@4
GSRendererCL.obj : error LNK2001: unresolved external symbol _clRetainDevice@4

Of course, I'm a neophyte at this so I could be screwing up somehow.

@gabest11 (Contributor, Author)

I'm currently using the Intel SDK; there is no difference, they all link to the common opencl.dll, and even the headers are the same.

Those functions are 1.2 and not in the NVIDIA SDK. Wikipedia only lists 1.1 for all NVIDIA cards :(

@Silanda

Silanda commented Sep 18, 2014

Yup, Nvidia are being real dicks over this. On a positive note, I tested to see if it works on the Intel HD Graphics 4000. It does! Horrifyingly slow (as expected) but it works!

@ZironZ

ZironZ commented Sep 18, 2014

Nvidia essentially wants OpenCL to die because they are pushing CUDA instead. Their OpenCL 1.1 implementation gets decent performance, but they haven't really done anything outside of bug fixes to it in years.

They did say the Tegra K1 is "OpenCL 1.2 capable" a few months back, but it doesn't sound like they ever actually plan to add support.
http://www.nvidia.com/content/PDF/tegra_white_papers/tegra-K1-whitepaper.pdf

… use its sdk to compile gsdx, cl.hpp is missing there. Intel or amd is ok.
@gabest11 (Contributor, Author)

Had to put back my old 460 GTX to verify 1.1. It only has 7 compute units, slower than the CPU emulation, what a beast.

@yxmline

yxmline commented Sep 19, 2014

The NVIDIA card can work now (¬_¬)
No runtime error reported anymore

@ramapcsx2 (Member)

This new renderer still needs some kind of define if we want to include it.
We want to decrease dependency on extra sdks, not add new ones.
It's fine if people want to play with OpenCL stuff but the default build has to work without it.

@Silanda

Silanda commented Sep 19, 2014

I can't get this to work on a GTX 780Ti. It seems to compile correctly but will just sit there with a black screen until it eventually craps out with a runtime error. It still works on the Intel 4000 though.

I have the two graphics devices running together, so I wondered if there could be a clash between them. If there is, I haven't been able to resolve it.

OK, I have now been able to run OpenCL examples on my system, so I'm not sure what's up.

@gabest11 (Contributor, Author)

The headers are publicly available here: https://www.khronos.org/registry/cl/. I could remove the SDK dependency if we can add them under 3rdparty. The copyright message at the top seems to allow that.

@gregory38 (Contributor)

For what it's worth, there are already (outdated) Khronos OpenGL files in 3rdparty/GL, so I see no objection to adding OpenCL too (the OpenGL headers are open-source compatible, so it must be the same for OpenCL).

Note: it would be nice to split the first commit in two (see the last chapter of https://github.com/PCSX2/pcsx2/wiki/Git-survival-guide).
Note 2: before any merge, I would need to check the compilation on Linux (potentially OpenCL too, but I have NVIDIA here...)

@gabest11 (Contributor, Author)

How do you link to OpenGL on Linux? Do you need an import lib, too? On Windows I can just compile this lib from a .def file that lists all the exports of opencl.dll, but I have no idea about Linux.

@gregory38 (Contributor)

For OpenGL, there is a lib.so that contains the base of OpenGL (OpenGL 1.x); then you need to fetch all the function pointers manually (there is a kind of "high-level" dlopen/dlsym). For OpenCL I don't know yet; I don't think they use the same mechanism, because, you know, people always complain about OpenGL function pointers.

@gabest11 (Contributor, Author)

I could use GetProcAddress instead of the import lib, but that means no C++ wrapper, which uses the extern function declarations of cl.h.
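The runtime-loading approach mentioned here looks roughly like this on POSIX (dlopen/dlsym; on Windows the equivalents are LoadLibrary/GetProcAddress). This is an illustrative sketch, not the PR's code; the simplified signature for clReleaseDevice is an assumption for brevity:

```cpp
#include <dlfcn.h> // POSIX dynamic loading

// Resolve an OpenCL entry point at runtime instead of linking an import lib.
// The real signature is cl_int clReleaseDevice(cl_device_id); a plain
// pointer/int pair stands in for the CL typedefs here.
typedef int (*clReleaseDevice_t)(void*);

struct CLLoader {
    void* handle = nullptr;
    clReleaseDevice_t pclReleaseDevice = nullptr;

    // Returns false if the runtime (or a 1.2-only symbol) is absent, so
    // GSOpen could fail cleanly instead of crashing at load time.
    bool open(const char* lib) {
        handle = dlopen(lib, RTLD_NOW);
        if (!handle)
            return false;
        pclReleaseDevice = (clReleaseDevice_t)dlsym(handle, "clReleaseDevice");
        return pclReleaseDevice != nullptr; // missing on OpenCL 1.1 runtimes
    }
    ~CLLoader() { if (handle) dlclose(handle); }
};
```

The trade-off gabest11 notes still applies: the cl.hpp C++ wrapper expects the extern declarations from cl.h, so it can't be used on top of manually resolved pointers without extra glue.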

@gregory38 (Contributor)

Don't worry, I will manage the Linux details. I'm nearly sure I can use the .so function names directly (equivalent to the .def on Windows).

@degasus

degasus commented Sep 23, 2014

For OpenGL, these RW buffers are described here: http://www.opengl.org/wiki/Image_Load_Store
Also look at the "Memory qualifiers" chapter.

@gregory38 (Contributor)

Image load/store allows RW access on the "input" resources of any shader. The new extension allows limited RW access to the output of the fragment shader; it enables a kind of programmable blending. Only a couple of PS2 blending modes aren't supported, so even if it isn't fast, it might be enough for those corner cases.

It doesn't help for the depth. By the way, I managed to support DATE with image_load_store (UAV) (only enabled on NVIDIA OpenGL); at least it seems to work on a couple of test cases. Do we have other limitations on the output merger?

If it helps, there are also some papers (theses?) on a CUDA rasterizer.

@gabest11 (Contributor, Author)

DATE needs to be done per triangle, else you are testing against the last batch, not the last triangle, and drawing one-by-one is not possible, of course.

The output bitmask and its fake 16-bit abuse are the last ones, I think.

@gregory38 (Contributor)

For DATE, I "render" once to search for the first primitive that will change the destination alpha value test. Then I redraw the first n primitives (for each fragment).

First pass

#if PS_DATE == 1 && !defined(DISABLE_GL42_image)
    // DATM == 0
    // Pixels with alpha equal to 1 will fail
    if (c.a > 127.5f / 255.0f) {
        imageAtomicMin(img_prim_min, ivec2(gl_FragCoord.xy), gl_PrimitiveID);
    }
    //memoryBarrier();
#elif PS_DATE == 2 && !defined(DISABLE_GL42_image)
    // DATM == 1
    // Pixels with alpha equal to 0 will fail
    if (c.a < 127.5f / 255.0f) {
        imageAtomicMin(img_prim_min, ivec2(gl_FragCoord.xy), gl_PrimitiveID);
    }
#endif

2nd pass

#if PS_DATE == 3 && !defined(DISABLE_GL42_image)
    int stencil_ceil = imageLoad(img_prim_min, ivec2(gl_FragCoord.xy)).r;
    // Note gl_PrimitiveID == stencil_ceil will be the primitive that will update
    // the bad alpha value so we must keep it.

    if (gl_PrimitiveID > stencil_ceil) {
        discard;
    }
#endif
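The two-pass trick above can be re-enacted as scalar CPU code, which makes the logic easier to see. This is my own sketch, not the shader: pass 1 is the per-pixel imageAtomicMin (DATM == 0 case), pass 2 the discard test that keeps the primitive equal to the ceiling.

```cpp
#include <algorithm>
#include <climits>
#include <vector>

// One fragment: pixel x, the primitive that produced it, its alpha.
struct Frag { int x; int prim_id; float alpha; };

// Pass 1: per pixel, record the smallest primitive ID whose alpha would
// flip the destination alpha test (alpha > 0.5 for DATM == 0).
std::vector<int> pass1(const std::vector<Frag>& frags, int width)
{
    std::vector<int> prim_min(width, INT_MAX);
    for (const Frag& f : frags)
        if (f.alpha > 0.5f)
            prim_min[f.x] = std::min(prim_min[f.x], f.prim_id);
    return prim_min;
}

// Pass 2: discard any fragment beyond the recorded ceiling; the primitive
// equal to the ceiling is kept, as the shader comment notes.
std::vector<Frag> pass2(const std::vector<Frag>& frags,
                        const std::vector<int>& prim_min)
{
    std::vector<Frag> kept;
    for (const Frag& f : frags)
        if (f.prim_id <= prim_min[f.x])
            kept.push_back(f);
    return kept;
}
```

On the GPU the ordering guarantee between the two passes is exactly the memory-barrier question discussed below.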

@gabest11 (Contributor, Author)

That's a good one. Do you know if a barrier is necessary with atomics? I was wondering myself.

@gabest11 gabest11 closed this Sep 23, 2014
@gabest11 gabest11 reopened this Sep 23, 2014
@gregory38 (Contributor)

I really don't know. The spec is very confusing. For sure, something must be done to ensure that the atomics are done before the 2nd draw. I declare the memory as coherent, but I'm not sure that is enough. Reading the wiki again, they seem to imply that a barrier is mandatory.

I don't know if I need to call a barrier between the 2 draw calls. I just did a quick benchmark and didn't see any difference with the barrier. Strangely, I remember the barrier being very costly (maybe it was a driver bug, or my brain is just too old).

@gregory38 (Contributor)

By the way, to reduce the number of fragments, I still use the stencil method. I didn't do any benchmark, so potentially it can be dropped. It would also be interesting to check the performance impact of always having a stencil buffer (even if it isn't enabled). Maybe it would be globally faster to keep the atomic method even for the basic case (no write of alpha).

@danilaml

http://support.amd.com/en-us/kb-articles/Pages/OpenCL2-Driver.aspx

So AMD, and soon the latest Intel iGPUs, support OpenCL 2.0, while NVIDIA is still stuck at 1.1.

@gabest11 (Contributor, Author)

Support OpenCL 2.0 Core Features:

  • Shared Virtual Memory – Coarse Grain
  • C11 atomics and memory ordering
  • Pipes
  • Precision for Math built-in native functions
  • Program Scope Variables
  • Subgroups
  • New built-in functions
  • Generic Address Space
  • Images
  • Flexible ND-Range
  • Dynamic Parallelism

@gabest11 (Contributor, Author)

Time for a rewrite! :( :) :( :)

@Dokman (Contributor)

Dokman commented Sep 26, 2014

I am waiting for these changes so I can test with an R9 290X :3

PS: @gabest11, did you ever spend a day trying to fix mipmapping in hardware mode? A lot of us users are waiting for it.
Thanks if you do it one day.

@gregory38 (Contributor)

I have a question too. I know that textures are converted from the GS tiled format to a linear format inside GSdx. I don't know if you need to access the linear format elsewhere in GSdx, but if it is only needed for GPU rendering, maybe we could defer it to a compute shader.

The idea would be to create a texture array of 2 slices: one slice with the linear format, the other with the tiled format.

By the way, what are the blocking points for texture mipmapping? The previous method would make it easy to generate all the mipmapped textures. The id of the texture array can be computed in the vertex shader, so it might be possible to compute the LOD.

@Dokman (Contributor)

Dokman commented Sep 26, 2014

@gregory38 and now what can you do to fix it? I don't know much C++ TT, sorry, I am a newbie.

@gabest11 (Contributor, Author)

In the past, shaders had to be simple, but it may be possible now. We would just have to keep updating all 7 possible levels like the single one now. I think a pixel shader can handle seven textures; I'm not sure how much complexity dynamically selecting and sampling the two needed levels adds.

@SerialHacker

Can you please explain the improvements of a fully implemented OpenCL renderer?

@gregory38 (Contributor)

What do you mean by simple? Few instructions, to keep it fast? I think we have a bigger margin nowadays (better frequencies help).

You can't really dynamically switch the sampler this way (it would use a very slow if..else if...). My initial idea of a texture array won't work for mipmaps because it requires all layers to have the same size. That's a shame, because it was possible to dynamically select the layer (it uses a pointer indirection). Besides, it would have been slow to filter between the mipmap layers (I don't know if games really use it), i.e. it must be done manually.

Why not use a mipmapped texture directly? It is possible to update every layer manually. From a performance point of view, it will require double the bandwidth (argh!) and double the unswizzled pixels (argh!).

From GL wiki:

 gvec4 textureLod(gsampler sampler, vec texCoord, float lod);

This will sample the texture at the mipmap LOD lod. 0 means the base level (as set by the appropriate texture parameter).
Mipmap filtering can still be used by selecting fractional levels. A level of 0.5 would be 50% of mipmap 0 and 50% of mipmap 1.

@gabest11 (Contributor, Author)

I started it on a 6600 GT, shader model 2.0, but mostly 1.x in asm; every instruction cost like 10% of the fps. Even on my last NVIDIA card, dependent lookups based on variables were awful.

If there are no restrictions on updating each level with different data, then one texture seems good. The LOD and the sampling ratio between levels is a simple calculation.
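A sketch of that "simple calculation", based on my reading of the GS LOD formula for LCM == 0 (LOD = log2(1/|Q|) * 2^L + K); treat the exact scaling and rounding as assumptions, not GSdx's implementation:

```cpp
#include <algorithm>
#include <cmath>

// LOD from the per-vertex Q value and the TEX1 L/K parameters
// (one plausible reading of the GS manual, for illustration only).
float lod_from_q(float q, int L, float K)
{
    float lod = std::log2(1.0f / std::fabs(q)) * float(1 << L) + K;
    return std::max(lod, 0.0f); // clamp below the base level
}

// Trilinear selection: a fractional LOD picks the two adjacent mip levels
// and a blend factor between them.
void level_mix(float lod, int max_level, int& lo, int& hi, float& t)
{
    lo = std::min((int)lod, max_level);
    hi = std::min(lo + 1, max_level);
    t  = lod - (float)lo; // 0.5 means 50% of each level, as the GL wiki quote says
}
```

When LCM == 1 the LOD is a constant (K) instead, which matches the per-pixel/LCM distinction gabest11 raises further down.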

@gregory38 (Contributor)

I imagine. I guess caches and the scalar architecture improve the situation. Besides, many more shaders run in parallel, which compensates for slower shaders.

Anyway, if the LOD can be computed at the vertex stage, it won't cost anything during rendering. The LOD is likely hardware-accelerated in the texture unit.

@Dokman (Contributor)

Dokman commented Sep 27, 2014

I am waiting for news :3. For example, you could add an option in hacks to activate mipmapping, with a note that it can be extremely slow. I have an R9 290X with a 512-bit bus; I think that bus may handle it fine.

@gabest11 (Contributor, Author)

The LOD may be per pixel, derived from Q. It depends on the LCM flag.

@SerialHacker

gabest ignored me ;_;

y u do dis :/

@gregory38 (Contributor)

@gabest11 I potentially found an issue with the texture cache. I would like your opinion. I opened the issue #332 to avoid polluting this PR.

@gabest11 (Contributor, Author)

Looks like a bug.

@gregory38 (Contributor)

@gabest11

Time for a rewrite! :( :) :( :)

What is your status?
Do you want that we integrate your current PR, or do you prefer finishing your rewrite?

@gabest11 (Contributor, Author)

AMD's beta driver is still not ready for 2.0 (only its compiler is). The generic opencl.dll on Windows does not export any of the new functions, so there is no way to use them. You can merge the changes if you want; I'm not working on it currently, until this problem is resolved.

@danilaml

Quoting some AMD dev: "I can comment immediately on the OpenCL programming manual. There will be an update soon – formal release of support for OpenCL 2.0 is planned for the near future, and the manual will update then, or soon thereafter. I know it’s being worked on."

@gregory38 (Contributor)

I created PR #367, which contains your changes + a couple of Linux fixes (to compile without OpenCL). The changes are fine by me.

@gregory38 gregory38 closed this Dec 1, 2014