experimental opencl renderer #302
Conversation
…e cache, mipmap, aa1, device selection). Needs any OpenCL SDK for the common headers and stub lib to compile, tested with AMD and Intel. Too bad it is not part of the Windows SDK yet. - Renumbered renderer ids, compatible with old numbering, but it does not follow the mod3 logic anymore.
…ll texture errors.
|
Well, this PR is a problem in several ways :p |
|
Yep, install your favorite OpenCL SDK, it's awesome! Anyway, it's full of bugs, but I'm weeding them out; it will be usable in a few days. The other changes are there because no one told me about GitHub in time :) |
|
Heh, I always wanted to do something similar for Dolphin too - so nice to see someone try it out :) Just wondering, how's performance with this code compared to the GL/software renderers? In particular I was thinking alpha blending and depth testing (or any other output-merger tasks) would be a pain to get done efficiently. |
|
As you iterate over all primitives for all pixels, how many triangles were rendered over how many pixels? I'm wondering how much time most of the GPU's execution engines spend just iterating over skipped primitives. |
|
Performance on a Radeon 270X is about 80% of my i7-4770 on 4 threads. No idea about nvidia; I heard it was worse for GPU computing, but maybe not. I still have ideas for optimizing the rendering, there are always shortcuts to find for special drawing cases. The main slowdown happens when the game switches between too many kernel types; it is more sensitive to that than d3d or opengl. What was really surprising to me was the host memory speed. I don't even have to ping-pong anything, it just fetches texture data and draws right in the puss.. I mean system memory, through PCIe.

The output merger is the main reason we need our own rasterizer, since it's fixed function in the accelerated APIs. There is virtually no memory access in the kernel: read the target into a register once, iterate over the triangles, write it out. Merging happens in the register. PS2 may use the same memory for frame and z buffers, exactly or overlapped; that may be bogus currently. I have to find an example game and add a work-around, but that's not a huge problem.

The rendering is split into 16x16 tiles and batches of 4096 primitives; each tile gets as many prims as the bbox and other tests find, so it varies. One computing unit does one tile, they can get ahead of each other within a batch and grab another tile when ready, but at the end of the batch they are synchronized by the kernel call boundaries. My software renderer does not have this limitation, it only syncs when the render target changes to another one which is still being rendered, and I suspect that's what gives it an advantage over OpenCL currently. I could do something similar by launching the drawing kernels in parallel, each queue having its own part of the screen in a checkerboard style, but then I need to use events to chain the rendering queues, and events are dog slow. Even waking up a queue once it goes to sleep sometimes costs a full millisecond, when there are only 20ms per frame in a 50fps game. |
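To make the tile/batch scheme above concrete, here is a minimal sketch of such a per-tile kernel, assuming made-up prim_t fields and a flat-color merge; the real GSRendererCL kernels handle the full set of formats, tests and blending modes. The only point is the memory pattern: one read of frame/Z per pixel at the start of the batch, all merging in registers, one write at the end.

    /* hypothetical sketch, not the actual GSRendererCL kernel */
    typedef struct
    {
        int x0, y0, x1, y1;   /* placeholder: primitive bounding box */
        uint color;           /* placeholder: flat color */
        uint z;               /* placeholder: flat depth */
    } prim_t;

    __kernel void draw_tile(__global uint* frame,
                            __global uint* zbuf,
                            __global const prim_t* prims,
                            uint prim_count,
                            uint pitch)
    {
        int x = (int)get_global_id(0);       /* one work-item per pixel of a 16x16 tile */
        int y = (int)get_global_id(1);
        uint addr = (uint)y * pitch + (uint)x;

        uint c = frame[addr];                /* read the target once into registers */
        uint z = zbuf[addr];

        for (uint i = 0; i < prim_count; i++)
        {
            prim_t p = prims[i];

            /* stand-in for the real coverage / scanline tests */
            if (x < p.x0 || x >= p.x1 || y < p.y0 || y >= p.y1)
                continue;

            /* stand-in for ZTST/ATST/blending: the merge happens in registers */
            if (p.z >= z)
            {
                z = p.z;
                c = p.color;
            }
        }

        frame[addr] = c;                     /* single write-back at the end of the batch */
        zbuf[addr] = z;
    }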
|
Not compatible with NVIDIA OpenCL ⊙▂⊙ |
|
Do you see any compiling errors in the console window? |
|
The compiler is no problem. Program: "This application has requested the Runtime to terminate it in an unusual way."

This is the log:
Host Machine Init:
x86 Features Detected:
Reserving memory for recompilers...
Loading plugins...
(GameDB) 9660 games on record (loaded in 147ms)
Initializing plugins...
Opening plugins...

windbg log: |
|
I can't see any "OpenCL C 1.2 GPU" device, that's the problem. I don't know why it crashed; it should have just thrown an exception to GSOpen and returned an error code from there. The real question is why nvidia has no 1.2 support. I'm trying to find information on which cards have which version. I might have to lower the version requirement if this turns out to be the case. |
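For anyone hitting the same problem, a quick standalone diagnostic (plain C against the Khronos headers, not part of the plugin) is to print the OpenCL C version each device reports, which is close to the string the device list is built from:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint np = 0;
        clGetPlatformIDs(8, platforms, &np);

        for (cl_uint i = 0; i < np; i++)
        {
            cl_device_id devices[8];
            cl_uint nd = 0;

            if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 8, devices, &nd) != CL_SUCCESS)
                continue;

            for (cl_uint j = 0; j < nd; j++)
            {
                char name[256] = {0};
                char clc[64] = {0};

                clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(name), name, NULL);
                clGetDeviceInfo(devices[j], CL_DEVICE_OPENCL_C_VERSION, sizeof(clc), clc, NULL);

                printf("%s -> %s\n", name, clc);   /* e.g. "OpenCL C 1.1" on older nvidia drivers */
            }
        }

        return 0;
    }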
|
Gabest, if you're building using the AMD SDK for OpenCL 1.2, it's not going to work on Nvidia hardware at all. To say Nvidia have been dragging their heels when it comes to OpenCL is an understatement; they only support 1.1. Building against Nvidia's OpenCL.lib, using the Khronos 1.2 headers, leads to four unresolved externals:
GSRendererCL.obj : error LNK2001: unresolved external symbol _clEnqueueMarkerWithWaitList@16
Of course, I'm a neophyte at this, so I could be screwing up somehow. |
|
I'm using the Intel SDK currently; there is no difference, they all link to the common opencl.dll, and even the headers are the same. Those functions are 1.2 and not in the nvidia SDK. Wikipedia only lists 1.1 for all nvidia cards :( |
|
Yup, Nvidia are being real dicks over this. On a positive note, I tested to see if it works on the Intel HD Graphics 4000. It does! Horrifyingly slow (as expected) but it works! |
|
Nvidia essentially wants OpenCL to die because they are pushing CUDA instead. Their OpenCL 1.1 implementation gets decent performance, but they haven't really done anything outside of bug fixes to it in years. They did say the Tegra K1 is "OpenCL 1.2 capable" a few months back, but it doesn't sound like they ever actually plan to add support. |
… use its sdk to compile gsdx, cl.hpp is missing there. Intel or amd is ok.
|
Had to put back my old 460 GTX to verify 1.1. It only has 7 computing units, slower than CPU emulation, what a beast. |
|
The NVIDIA card can work (¬_¬) |
|
This new renderer still needs some kind of define if we want to include it. |
|
I can't get this to work on a GTX 780 Ti. It seems to compile correctly but just sits there with a black screen until it eventually craps out with a runtime error. It still works on the Intel 4000 though. I have the two graphics devices running together, so I wondered if there could be a clash between them; if there is, I haven't been able to resolve it. OK, I have now been able to run OpenCL examples on my system, so I'm not sure what's up. |
|
The headers are publicly available here: https://www.khronos.org/registry/cl/; I could remove the SDK dependency if we can add them under 3rdparty. The copyright message at the top seems to allow that. |
|
For what it's worth, there are already (outdated) Khronos OpenGL files in 3rdparty/GL, so I see no objection to adding OpenCL too (the OpenGL headers are open-source compatible, so it must be the same for OpenCL). Note: it would be nice to split the first commit in two (see the last chapter of https://github.com/PCSX2/pcsx2/wiki/Git-survival-guide). |
|
How do you link to OpenGL on Linux? Do you need an import lib, too? On Windows I can just compile this lib from a .def file that lists all the exports of opencl.dll, but I have no idea about Linux. |
|
For OpenGL, there is a lib .so that contains the base of OpenGL (OpenGL 1.x). Then you need to fetch all the function pointers manually (there is a kind of "high level" dlopen/dlsym). For OpenCL I don't know yet; I don't think they use the same mechanism, because, you know, people always complain about OpenGL function pointers. |
|
I could use GetProcAddress instead of the import lib, but that means no C++ wrapper, since it uses the extern function declarations of cl.h. |
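A rough sketch of that route, with a made-up loader and only one entry point shown; on Windows it resolves the symbol from OpenCL.dll with LoadLibrary/GetProcAddress, on Linux from libOpenCL.so.1 with dlopen/dlsym (add -ldl when linking there):

    #include <stdio.h>

    #ifdef _WIN32
    #include <windows.h>
    #else
    #include <dlfcn.h>
    #endif

    #include <CL/cl.h>

    /* hypothetical loader: resolves one OpenCL entry point at runtime instead of
     * linking against an import library generated from a .def file */
    typedef cl_int (CL_API_CALL *pfn_clGetPlatformIDs)(cl_uint, cl_platform_id*, cl_uint*);

    static pfn_clGetPlatformIDs load_clGetPlatformIDs(void)
    {
    #ifdef _WIN32
        HMODULE h = LoadLibraryA("OpenCL.dll");
        if (!h) return NULL;
        return (pfn_clGetPlatformIDs)GetProcAddress(h, "clGetPlatformIDs");
    #else
        void* h = dlopen("libOpenCL.so.1", RTLD_NOW);
        if (!h) return NULL;
        return (pfn_clGetPlatformIDs)dlsym(h, "clGetPlatformIDs");
    #endif
    }

    int main(void)
    {
        pfn_clGetPlatformIDs my_clGetPlatformIDs = load_clGetPlatformIDs();
        cl_uint n = 0;

        if (my_clGetPlatformIDs && my_clGetPlatformIDs(0, NULL, &n) == CL_SUCCESS)
            printf("%u OpenCL platform(s) found\n", n);
        else
            printf("OpenCL runtime not available\n");

        return 0;
    }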
|
Don't worry, I will manage the Linux details. I'm nearly sure that I can use the .so function names directly (equivalent to the .def on Windows). |
|
For OpenGL, these RW buffers are described here: http://www.opengl.org/wiki/Image_Load_Store |
|
Image load and store allows RW access on the "input" resources of any shader. The new extension allows limited RW access to the output of the fragment shader, which gives a kind of programmable blending. Only a couple of PS2 blending modes aren't supported, so even if it isn't fast, it might be enough for those corner cases. It doesn't help for the depth. By the way, I managed to support DATE with image_load_store (UAV) (only enabled on nvidia OpenGL); at least it seems to work on a couple of test cases. Do we have other limitations on the output merger? If it can help, there are also some papers (theses?) on a CUDA rasterizer too. |
|
DATE needs to be done per triangle, else you are testing against the last batch, not the last triangle, and drawing one-by-one is not possible of course. The output bitmask and its fake 16-bit abuse is the last one, I think. |
|
For DATE, I "render" once to search for the first primitive that will change the destination alpha test result. Then I redraw the first n primitives (for each fragment). First pass / 2nd pass |
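A schematic version of the two-pass idea, written in OpenCL C for consistency with the rest of this thread (the real implementation is a GLSL fragment shader using image load/store); the buffer names, the fragment-to-pixel mapping and the recorded condition are placeholders:

    /* Pass 1: per pixel, record the smallest id of a primitive that would change
     * the destination alpha test; first_prim must be initialised to INT_MAX. */
    __kernel void date_first_pass(__global volatile int* first_prim, /* per pixel */
                                  __global const uint* dst_alpha,    /* per pixel */
                                  __global const uint* frag_pixel,   /* pixel of each fragment */
                                  __global const int*  frag_prim,    /* primitive id of each fragment */
                                  uint frag_count)
    {
        uint i = get_global_id(0);
        if (i >= frag_count) return;

        uint p = frag_pixel[i];

        /* placeholder condition for "this primitive would flip the DATE result" */
        if ((dst_alpha[p] & 0x80u) != 0u)
            atomic_min(&first_prim[p], frag_prim[i]);
    }

    /* Pass 2: shade only the fragments drawn before the recorded primitive. */
    __kernel void date_second_pass(__global const int* first_prim,
                                   __global const uint* frag_pixel,
                                   __global const int*  frag_prim,
                                   __global uint* keep,              /* 1 = shade it */
                                   uint frag_count)
    {
        uint i = get_global_id(0);
        if (i >= frag_count) return;

        keep[i] = (frag_prim[i] < first_prim[frag_pixel[i]]) ? 1u : 0u;
    }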
|
That's a good one. Do you know if a barrier is necessary with atomics? I was wondering that myself. |
|
I really don't know; the spec is very confusing. For sure something must be done to ensure that the atomics are done before the 2nd draw. I declare the memory as coherent, but I'm not sure that is enough; reading the wiki again, they seem to imply that a barrier is mandatory. I don't know if I need to call a barrier between the 2 draw calls. I just did a quick benchmark and I didn't see any difference with the barrier. Strangely, I remember that the barrier was very costly (maybe it was a driver bug, or my brain is just too old). |
|
By the way, to reduce the number of fragments, I still use the stencil method. I didn't do any benchmark, so potentially it can be dropped. It would also be interesting to check the performance impact of always having a stencil buffer (even if it's enabled). Maybe it would be globally faster to keep the atomic method even for the basic case (no write of alpha). |
|
http://support.amd.com/en-us/kb-articles/Pages/OpenCL2-Driver.aspx So AMD (and soon the latest Intel iGPUs) support OpenCL 2.0, while nvidia is still stuck at 1.1. |
|
Support OpenCL 2.0 Core Features:
|
|
Time for a rewrite! :( :) :( :) |
|
I am waiting for these changes to test them with an R9 290X :3 PS: @gabest11 did you ever try to fix mipmapping in hardware mode? A lot of us users are waiting for it |
|
Me too, I've got a question. I know that textures are converted from the GS-tiled format to linear format inside GSdx. I don't know if you need to access the linear format elsewhere in GSdx, but if it is only needed for GPU rendering, maybe we could defer it to a compute shader. The idea would be to create a texture array of 2 slices: 1 slice with the linear format, the other one with the tiled format. By the way, what are the blocking points for texture mipmapping? The previous method would make it easy to generate all the mipmapped textures. The id of the texture array can be computed in the vertex shader, so it might be possible to compute the LOD. |
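To illustrate the "defer the untiling to the GPU" idea, a tiny OpenCL C sketch that converts a block-tiled layout to linear, one work-item per texel; the 8x8 block layout used here is purely made up, the real GS swizzle patterns (PSMCT32, PSMT8, ...) are considerably more involved:

    /* illustrative only: linearize a hypothetical 8x8-block-tiled 32-bit texture */
    __kernel void deswizzle_linear(__global uint* dst,        /* linear output */
                                   __global const uint* src,  /* tiled input */
                                   uint width,                /* in texels, multiple of 8 here */
                                   uint height)
    {
        uint x = get_global_id(0);
        uint y = get_global_id(1);
        if (x >= width || y >= height) return;

        uint bw = width / 8;                         /* blocks per row */
        uint block = (y / 8) * bw + (x / 8);         /* which 8x8 block */
        uint inside = (y % 8) * 8 + (x % 8);         /* position inside the block */

        dst[y * width + x] = src[block * 64 + inside];
    }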
|
@gregory38 and now what can you do to fix it? I don't know much C++ TT, sorry, I am a newbie |
|
In the past shaders had to be simple, but it may be possible now. We just have to keep updating all 7 possible levels like the single one now. I think a pixel shader can handle seven textures; I'm not sure how much complexity dynamically selecting and sampling the two needed levels adds. |
|
Can you please explain the improvements of a fully implemented OpenCL renderer? |
|
What do you mean by simple? Few instructions, to keep it fast? I think we have a bigger margin nowadays (better frequencies help). You can't really dynamically switch the sampler this way (it would need a very slow if..elseif...). My initial idea of a texture array won't work for mipmaps because it requires all layers to have the same size. That's a shame, because it was possible to dynamically select the layer (it uses a pointer indirection). Besides, it would have been slow to filter between the mipmap layers (I don't know if games really use it), i.e. it must be done manually. Why not use a mipmapped texture directly? It is possible to update every layer manually. From a performance point of view, it will require double the bandwidth (arg!) and double the unswizzled pixels (arg!). From the GL wiki: |
|
I started it on a 6600 GT, shader model 2.0, but mostly 1.x in asm; every instruction cost like 10% of the fps. Even on my last nvidia, dependent lookups based on variables were awful. If there are no restrictions on updating each level with different data, then one texture seems good. The LOD and the sampling ratio between levels is a simple calculation. |
|
I imagine. I guess caches and scalar architectures improve the situation. Besides, many more shaders run in parallel, which compensates for slower shaders. Anyway, if the LOD can be computed in the vertex stage, it won't cost anything at rendering time. The LOD is likely hardware accelerated in the texture unit. |
|
I am waiting for news :3 But, for example, you could make an option to activate mipmapping under hacks and note that it can be extremely slow. For example, I have an R9 290X; it has a 512-bit bus, so I think that bus may work fine. |
|
LOD may be per pixel, derived from Q. Depends on the LCM flag. |
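For illustration only, a hedged C sketch of that selection; the exact formula and clamping must be taken from the GS documentation and the software renderer (the LOD = K + (log2(1/|Q|) << L) form below is an assumption):

    #include <math.h>

    /* illustrative sketch: per-pixel mipmap level from Q, K, L and the LCM flag */
    static int gs_lod(float q, int K, int L, int lcm, int max_level)
    {
        int lod;

        if (lcm)                       /* LCM = 1: fixed LOD, taken from K */
            lod = K;
        else                           /* LCM = 0: LOD derived from Q per pixel (assumed form) */
            lod = K + (int)(log2f(1.0f / fabsf(q)) * (float)(1 << L));

        if (lod < 0) lod = 0;          /* clamp to the available mipmap levels */
        if (lod > max_level) lod = max_level;

        return lod;
    }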
|
gabest ignored me ;_; y u do dis :/ |
|
Looks like a bug. |
What is your status? |
|
AMD's beta driver is still not ready for 2.0 (only its compiler is). The generic opencl.dll of Windows does not export any of the new functions, so there is no way to use them. You can merge the changes if you want; I'm not working on it currently, until this problem is resolved. |
|
Quoting some AMD dev: "I can comment immediately on the OpenCL programming manual. There will be an update soon – formal release of support for OpenCL 2.0 is planned for the near future, and the manual will update then, or soon thereafter. I know it’s being worked on." |
|
I created PR #367, which contains your changes + a couple of Linux fixes (to compile without OpenCL). The changes are fine for me. |