OpenGL Renderer: Partially fix rendering in the Customize screen of Sands of Destruction.

- This fix properly emulates the less-than-or-equal depth test rendering for front-facing polygons drawn on top of opaque back-facing fragments, but only if the front-facing polygon is opaque. Translucent front-facing polygons are not supported at this time due to requiring extensive changes to the rendering logic and shaders in order to emulate this extremely rare and niche NDS feature. (If you require the proper rendering of translucent front-facing polygons on top of back-facing fragments, then you must use SoftRasterizer.)
rogerman committed Oct 31, 2018
1 parent 44ac04d commit 8944328
Showing 1 changed file with 59 additions and 1 deletion.
60 changes: 59 additions & 1 deletion desmume/src/OGLRender.cpp
@@ -1874,7 +1874,35 @@ Render3DError OpenGLRenderer::DrawAlphaTexturePolygon(const GLenum polyPrimitive
else // Draw the polygon as completely opaque.
{
glUniform1i(OGLRef.uniformTexDrawOpaque, GL_TRUE);
glDrawElements(polyPrimitive, vertIndexCount, GL_UNSIGNED_SHORT, indexBufferPtr);

if (isPolyFrontFacing)
{
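// Three passes follow: (1) with the depth test set to GL_EQUAL, draw where the
// stencil 0x40 flag marks an opaque back-facing fragment at the same depth,
// emulating the NDS less-than-or-equal test; (2) clear that 0x40 flag without
// touching color or depth; (3) draw normally with GL_LESS, writing the poly ID
// to the stencil buffer.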
glDepthFunc(GL_EQUAL);
glStencilFunc(GL_EQUAL, 0x40 | opaquePolyID, 0x40);
glDrawElements(polyPrimitive, vertIndexCount, GL_UNSIGNED_SHORT, indexBufferPtr);

glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);
glStencilOp(GL_KEEP, GL_KEEP, GL_ZERO);
glStencilMask(0x40);
glDrawElements(polyPrimitive, vertIndexCount, GL_UNSIGNED_SHORT, indexBufferPtr);

glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
glStencilFunc(GL_ALWAYS, opaquePolyID, 0x3F);
glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);
glStencilMask(0xFF);
glDrawElements(polyPrimitive, vertIndexCount, GL_UNSIGNED_SHORT, indexBufferPtr);
}
else
{
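// Back-facing: include the 0x40 flag in the stencil reference so these
// fragments get marked for later front-facing polygons to detect.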
glStencilFunc(GL_ALWAYS, 0x40 | opaquePolyID, 0x40);
glDrawElements(polyPrimitive, vertIndexCount, GL_UNSIGNED_SHORT, indexBufferPtr);

glStencilFunc(GL_ALWAYS, opaquePolyID, 0x3F);
}

glUniform1i(OGLRef.uniformTexDrawOpaque, GL_FALSE);
}
}
@@ -1965,6 +1993,36 @@ Render3DError OpenGLRenderer::DrawOtherPolygon(const GLenum polyPrimitive,
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(((DRAWMODE == OGLPolyDrawMode_DrawOpaquePolys) || enableAlphaDepthWrite) ? GL_TRUE : GL_FALSE);
}
else if (DRAWMODE == OGLPolyDrawMode_DrawOpaquePolys)
{
if (isPolyFrontFacing)
{
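// Same three-pass technique as in DrawAlphaTexturePolygon() above: match
// opaque back-facing fragments at equal depth, clear the 0x40 flag, then
// draw normally.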
glDepthFunc(GL_EQUAL);
glStencilFunc(GL_EQUAL, 0x40 | opaquePolyID, 0x40);
glDrawElements(polyPrimitive, vertIndexCount, GL_UNSIGNED_SHORT, indexBufferPtr);

glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);
glStencilOp(GL_KEEP, GL_KEEP, GL_ZERO);
glStencilMask(0x40);
glDrawElements(polyPrimitive, vertIndexCount, GL_UNSIGNED_SHORT, indexBufferPtr);

glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
glStencilFunc(GL_ALWAYS, opaquePolyID, 0x3F);
glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);
glStencilMask(0xFF);
glDrawElements(polyPrimitive, vertIndexCount, GL_UNSIGNED_SHORT, indexBufferPtr);
}
else
{
glStencilFunc(GL_ALWAYS, 0x40 | opaquePolyID, 0x40);
glDrawElements(polyPrimitive, vertIndexCount, GL_UNSIGNED_SHORT, indexBufferPtr);

glStencilFunc(GL_ALWAYS, opaquePolyID, 0x3F);
}
}
else
{
glDrawElements(polyPrimitive, vertIndexCount, GL_UNSIGNED_SHORT, indexBufferPtr);

13 comments on commit 8944328

@Jules-A (Contributor) commented on Nov 18, 2018:

@rogerman this patch causes a ~8% performance drop (up to 15% in some areas) in Pokemon HG. Is that meant to be happening?

@zeromus (Contributor) commented:

It replaces one draw call with three draw calls and a dozen state changes. What do you think?

@Jules-A (Contributor) commented:

Well, I was expecting a hit, just not one that large, since HG didn't seem to be affected by the incorrect rendering before. I don't really understand OGL, so I was just wondering if the hit was meant to be this big.

@zeromus (Contributor) commented:

Doing three times the work (and maybe more) is meant to be slower than doing one times the work. The bug involves the depth of fragments. All the polygons which could possibly have fragments with that bug are drawn multiple times so that carefully chosen parameters can catch the buggy cases and fix them. A game may have no fragments with this bug but still have polygons of the type which sometimes result in fragments with that bug, and thus have its polygons rendered multiple times to fix it. This is a general technique in graphics: if you can't do certain logic per-fragment, break it into multiple passes with logic you can do.
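A minimal sketch of that general pattern (this is not DeSmuME's actual code; indexCount and indices are placeholders, and a current GL context is assumed):

// Pass 1: mark every fragment with the "problem" property in the stencil
// buffer, producing no visible output yet.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glStencilFunc(GL_ALWAYS, 0x40, 0x40);
glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);

// Pass 2: redraw, but only where the mark was set, with state chosen to
// handle the problem case (e.g. a depth rule GL cannot express directly).
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glStencilFunc(GL_EQUAL, 0x40, 0x40);
glDepthFunc(GL_EQUAL);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);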

@Jules-A (Contributor) commented:

Ah, I see, thanks for the explanation. I made some UI and implemented a hack to disable it for myself to claw back some performance; at least it should come in handy for testing.

@zeromus (Contributor) commented:

Don't expect us to be interested in adding a checkbox for every commit that fixes something and costs speed. Half the code in this thing is inherently a balance between speed and quality; if we had a checkbox-for-each-policy, the code would be even more convoluted and there would be 10,000 checkboxes. It actually isn't a sustainable policy, so we shouldn't even start.

@Jules-A (Contributor) commented on Nov 18, 2018:

Yeah, that's understandable. It's just that there are quite a few people who still use the X432R fork because they claim it's faster. For those people I was thinking of one toggle which only reverts commits that cause significant performance hits to fix issues affecting only a few games, with an obviously large warning (and also disabling the "report bugs" button).

@zeromus (Contributor) commented:

Well, I wouldn't want you mucking up my code with that junk. But the OpenGL code is rogerman's, so if you can negotiate it with him, I can't say no to a checkbox that says "Create more speed and also more bugs" defaulting to off, and the "report bugs" button logic is unneeded.

@rogerman (Collaborator, Author) commented on Nov 19, 2018:

A 15% perf drop? Oh really? I use a GeForce GTX 680, so I didn't even notice it. Maybe other GPUs are more affected by this change than mine. Twoo baddness. And people still use X432R because they "claim" that it's faster? O rly? Methinks that their testing methods are all outta whack. Siiiiiiiiiigh.......

But in all seriousness, here is what is going on between X432R and Mainline in terms of performance. (I am also writing all of this as notes to myself about this issue.)

1. X432R's OpenGL Renderer is less capable and therefore has an easier workload, while Mainline's OpenGL Renderer is more capable and therefore has a harder workload. People then perform improper tests by comparing X432R's lesser feature set against Mainline's richer feature set, and then proclaim that X432R is "faster."

Most notably, Edge Mark is an NDS feature that must be emulated and can be a somewhat costly pass, depending on the complexity of the scene that needs to be edge marked. Since X432R doesn't support this feature, this means that X432R can seem "faster" than Mainline if you are comparing X432R without Edge Mark against Mainline with Edge Mark. This becomes very apparent when benchmarking the Edge Mark heavy scenes in the overworld of the Pokemon games. In fact, the overworld scenes in the Pokemon games can become a worst-case scenario when comparing Edge Mark performance vs. non-Edge Mark performance due to the relative complexity of some of these scenes. So by all means, keep Edge Mark OFF when doing a performance comparison!

As a tip for future testers -- Mainline has many advancements over the old X432R that may reduce performance. In order to match settings based on X432R's lesser capabilities, you must do the following at minimum:

  • When setting the 3D renderer to an OpenGL renderer, you must set Mainline's 3D renderer to "OpenGL Old". (X432R's OpenGL 3D renderer is based on OpenGL Old.)
  • When testing OpenGL 3D rendering, you must set Mainline's Edge Mark setting to OFF. (X432R's OpenGL renderer does not support Edge Mark.)
  • You must set Mainline's GPU Scaling Factor to exactly 1, 2, 3, or 4, matching whatever scaling factor is being tested on X432R. (X432R does not support scaling beyond 4x.)
  • You must set Mainline's Color Depth to 24-bit. (X432R internally converts all colors to 24-bit and does not support 18-bit or 15-bit color depths.)
  • You must set Mainline's Texture Scaling Factor to 1 and Texture Deposterize to OFF. (X432R has no support for texture upscaling and deposterization.)
  • When testing the most recent Mainline builds, you must set the Wi-Fi Emulation Level to OFF. (X432R has zero support for Wi-Fi, and running Wi-Fi can be CPU expensive.)
  • Since performance testing against X432R is predominantly graphics-related, the remaining settings should isolate graphics processing as much as possible. This means turning Advanced Bus-Level Timing OFF, turning the Dynamic Recompiler ON with a block size of 100, and turning the Sound emulation engine completely OFF.

2. For 2D graphics upscaling, X432R's framebuffer-based approach is faster but less accurate, while Mainline's scanline-based approach is slower but more accurate.

However, a hardware NDS renders 2D graphics on a per-scanline basis. It is possible for a game to change various states between scanlines that could affect the overall framebuffer rendering, such as power states, graphics states, or rendering states. It is also possible for a game to modify VRAM in between scanlines in order to do some real-time animations or special effects.

Mainline's approach to 2D graphics upscaling fully respects the NDS scanline-based rendering, and therefore does all of the upscaling work at the scanline level. Because of this approach, there are virtually no graphical bugs or glitches that occur with our 2D upscaling.

X432R's approach involves doing only the most minimal work at the scanline level and doing most of the upscaling work at the framebuffer level, which is faster. However, this same approach makes assumptions about how the various NDS states are set, which results in various graphical bugs and glitches when those assumptions are broken.

In all fairness, there are a few areas where we can improve performance without breaking compatibility, and X432R certainly does have us beat here. One area of improvement would be to convert the code doing custom VRAM reads through the OBJ layer into SSE2-enabled code, which would easily yield a significant performance improvement in many games.
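As a purely hypothetical illustration of that kind of conversion (the helper below is invented for this example and is not DeSmuME code), processing eight 16-bit pixels per iteration instead of one:

#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>
#include <cstddef>

static void CopyVRAMPixels_SSE2(uint16_t *dst, const uint16_t *src, size_t pixCount)
{
    size_t i = 0;
    // Eight 16-bit pixels per 128-bit load/store.
    for (; i + 8 <= pixCount; i += 8)
    {
        const __m128i px = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i), px);
    }
    // Scalar tail for the remaining pixels.
    for (; i < pixCount; i++)
        dst[i] = src[i];
}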

3. X432R's OpenGL Renderer ignores the special NDS 3D rendering quirks in order to run faster but less accurately, while Mainline's OpenGL Renderer properly renders these special quirks in order to run more accurately, albeit slower.

The specific quirks that I'm talking about are the following:

  • Multipass shadow polygon rendering, but using the properties contained within a special NDS "attributes" buffer instead of simply using a standard 8-bit stencil buffer.
  • Per-fragment handling of translucent polygons.
  • The Depth-Equals test, but with a small tolerance.
  • The Depth-Less-Than-Or-Equals test overriding the standard Depth-Less-Than test whenever front-facing polygons are drawn on top of opaque back-facing polygons. (This special quirk is exactly what this particular commit tries to address, but only partially.)
  • The special blending condition: If the destination fragment has an alpha of 0, then the source fragment colors are written without blending. (A small sketch of this rule follows this list.)
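A hypothetical software illustration of that destination-alpha rule (the type and the exact blend math below are invented for the example, not taken from DeSmuME):

#include <cstdint>

struct RGBA { uint8_t r, g, b, a; };

static RGBA BlendTranslucentFragment(const RGBA &src, const RGBA &dst)
{
    if (dst.a == 0)
        return src; // destination alpha is 0: write source colors unblended

    // Otherwise, an ordinary src-over-dst alpha blend.
    const unsigned sa = src.a, da = 255u - sa;
    RGBA out;
    out.r = (uint8_t)((src.r * sa + dst.r * da) / 255u);
    out.g = (uint8_t)((src.g * sa + dst.g * da) / 255u);
    out.b = (uint8_t)((src.b * sa + dst.b * da) / 255u);
    out.a = (src.a > dst.a) ? src.a : dst.a;
    return out;
}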

Obviously, handling every single one of these quirks on an individual basis is a no-go, just as @zeromus said. If I were to make some sort of UI to do an accuracy-performance tradeoff, then I would do something like a "compatibility level" gauge, ranging from 0 to 3.

Each level would mean the following:

  • 0 (All quirks unhandled): None of the special NDS quirks will be handled. This can cause significant glitches in many games, but would run the fastest.
  • 1 (Common quirks only): Only the most common quirks that occur in the majority of games would be handled, fixing the most visible glitches in most games. 'Multipass shadow polygon rendering' and 'per-fragment handling of translucent polygons' would easily fall under this category.
  • 2 (Common and Uncommon quirks): This level would include uncommon quirks that still show up in a significant number of games, but not in the majority of games. 'Depth-Equals test with tolerance' and 'the special destination alpha blending condition' would both fall under this category.
  • 3 (Common, Uncommon, and Rare quirks): This level would handle all of the special NDS quirks, including the rarest ones that affect only the tiniest minority of games. The quirk addressed by this commit would fall under this category.

Note that the compatibility level is only intended to handle the special quirks. Any 3D rendering feature that is deemed "essential" would always be rendered, regardless of the compatibility level. I'm even considering that the 'per-fragment handling of translucent polygons' should be considered an essential feature, as this can make or break certain games, but further testing would be needed on this.
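To picture the gauge in code, a rough sketch (all names here are invented for illustration, not actual DeSmuME code):

enum OGLCompatibilityLevel
{
    OGLCompatLevel_AllQuirksUnhandled = 0,
    OGLCompatLevel_CommonQuirks       = 1,
    OGLCompatLevel_UncommonQuirks     = 2,
    OGLCompatLevel_RareQuirks         = 3
};

// A quirk is emulated whenever the user's chosen level meets or exceeds
// the level at which that quirk is classified.
static bool ShouldEmulateQuirk(OGLCompatibilityLevel userLevel, OGLCompatibilityLevel quirkLevel)
{
    return (userLevel >= quirkLevel);
}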

4. In the Windows port only, X432R's video blitter is simply better optimized than Mainline's.

One of X432R's main advantages is that it uses a video blitter based on DirectX 9 rather than OpenGL. X432R's DX9 blitter is better at DMA'ing texture uploads directly to the GPU, making for significantly improved video performance. Mainline's OpenGL blitter does not DMA the texture to the GPU -- rather, it does a standard glTexImage2D() call (not even a glTexSubImage2D() call!!!), which causes the video driver to block on the upload and lose performance. Using some form of PBO-based uploading would help alleviate this, but it would entail some very extensive changes to how video is handled in the Windows port, which is probably something that no one is interested in reworking any time soon!
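A rough sketch of what PBO-based uploading generally looks like (illustrative only; assumes a current GL context with pixel-buffer-object support and an already-bound destination texture, and framebuffer, frameBytes, width, and height are placeholders):

GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, frameBytes, NULL, GL_STREAM_DRAW);

// Write the new frame into driver-owned memory instead of client memory.
void *p = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(p, framebuffer, frameBytes);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

// With a PBO bound, the last argument is an offset into the buffer, and the
// driver can DMA the upload asynchronously instead of blocking.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, (const GLvoid *)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);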

In addition, X432R handles video buffers a little more efficiently than we do, doing less copying of video data across the myriad of video buffers in the Windows port. Obviously, the more efficient handling of video buffers is a performance win for X432R.

5. Mainline makes use of several optimizations at the core level in order to make up for X432R's inherent performance advantages.

Mainline has a lot of performance tricks going for it that X432R doesn't, such as working with compiler optimizations, hand-optimized SIMD routines, and better thread handling. And so despite X432R being inherently faster due to being a simpler program with fewer features, Mainline can keep up with most of X432R's performance advantages anyway due to its own performance enhancements. There are even certain cases where Mainline is actually faster than X432R.

And so... analyzing the performance differences between Mainline and X432R is complicated.

X432R has a lot of inherent performance advantages over Mainline throughout the entire graphics stack by making certain tradeoffs, but Mainline mitigates a lot of these advantages through raw code performance without the tradeoffs. Better yet, much of the performance you'll see in X432R is tailored specifically to the Windows port, so on non-Windows machines Mainline simply comes out faster.

@Jules-A (Contributor) commented:

Hey, sorry for getting you dragged into a comparison between the two. Personally, X432R has never been any use for me, as the official build's xBRZ filtering on just textures is a massive image quality boost (for me anyway). I've only tested it for a short time but have been meaning to go through the source, so your reply is really helpful. That said, I have no idea how those people tested, and most likely they didn't compare fairly. It's just the same thing every time NDS emulation comes up: many people still say to use X432R, or even DraStic in an Android emulator, because people complain about DeSmuME's speed. I suspect the last release version being so old doesn't help the situation either. Anyway, all this really frustrates me, as I've literally witnessed the OGL renderer almost double in FPS in less than a year. Not only that, but it's also fixed the dots and line glitches commonly seen in the Pokemon games.

Back to the original topic: the 15% is an outlier, and it's odd in the sense that without the fixes the area isn't really that complex compared to others; I'm seeing around an 8% drop on average. I do, however, suspect it's due to the way AMD's drivers handle OGL in Windows (I've read of developers complaining about AMD choking on draw calls), as well as being CPU bottlenecked.
Your suggestion of a compatibility tradeoff sounds good, but having so many different levels might be a bit complicated. It may, however, be an opportunity to make things even more accurate, since you wouldn't have to worry about balancing speed; that could be offloaded to a different compatibility setting.
Honestly, there have been hardly any performance regressions in the past 7 or so months I've been testing, and most got fixed shortly after. Unless you go hunting down things that could be made hacky, I don't see there being many.

@rogerman (Collaborator, Author) commented:

I've been testing some internal code that sets a compatibility level on the OpenGL renderer, and I've found that it just isn't worth having. In reality, it is probably better to simply have individual settings for each of the NDS rendering quirks that I've noted above.

Of the five quirks I've listed, I've found that emulating the 'per-fragment handling of translucent polygons' is truly an essential feature. There are way too many games that are dependent on this, and some games truly do become unplayable without it. Therefore, this quirk MUST be emulated at all times.

This leaves us with just the four remaining quirks. AFAIK, there are no other NDS rendering quirks that anyone is aware of, so I suspect that at most one or two more quirks might be discovered in the future that could significantly affect OpenGL's performance, which would limit us to six possible quirks at most that could be relevant. We can certainly design for six individual settings in the UI for OpenGL performance purposes. I would say that... hmmm... eight individual settings would be the absolute upper limit -- anything past eight would probably bog down the UI too much.

But of course, all of this is just my best guess, and so it's time for me to test a few more things before making a commit to address all of this...
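For what it's worth, a rough sketch of how individual per-quirk settings might look (all names below are invented for illustration, not actual DeSmuME code):

// One boolean per remaining quirk; 'per-fragment handling of translucent
// polygons' is treated as essential and therefore has no toggle.
struct OGLRenderQuirkSettings
{
    bool emulateShadowPolygonQuirk;        // common
    bool emulateDepthEqualsTolerance;      // uncommon
    bool emulateDstAlphaBlendCondition;    // uncommon
    bool emulateDepthLEqualPolygonFacing;  // rare; the quirk from this commit
};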

@Jules-A (Contributor) commented on Nov 20, 2018:

Welp... I just decided to test Pokemon Platinum for fun and I'm seeing a ~35% performance hit at the Valor Lakefront beach :/
Fixes on: [screenshot]
Fixes off: [screenshot]

On closer inspection there is a difference: with your fixes, things are ever so slightly darker on the rocks. I can't compare with SoftRast, as it's broken in that area and not rendering things.

@rogerman (Collaborator, Author) commented:

See commit 6f8c060.

To note: It is true that emulating this particular rendering quirk causes a guaranteed performance drop in all games. However, I also believe that this particular rendering quirk is so extremely rare that it is more beneficial for most users to skip emulating it by default. If users want to emulate it, then they can always set CommonSettings.OpenGL_Emulation_DepthLEqualPolygonFacing to true in order to turn it back on.
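For example, in frontend code that can reach CommonSettings (a minimal usage sketch; only the flag name is taken from the comment above):

// Opt back in to emulating the rare depth-LEqual polygon-facing quirk:
CommonSettings.OpenGL_Emulation_DepthLEqualPolygonFacing = true;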
