Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GSdx-TC: Palette management rework #2344

Merged
merged 32 commits into from Nov 4, 2018

Conversation

@AlessandroVetere
Copy link
Contributor

commented Mar 18, 2018

This follows the 4th point of #2310.
The idea was to rework the current palette texture management to improve performances with games that uses multiple palettes for the same data texture.

Currently the situation goes more or less like that for textures with palette:

  • 32 bits texture enabled (i.e. palettes are applied by the CPU):

    • On Source creation:
      • Clut array of 256 uint32 elements is allocated and cleared (also for textures without palette),
      • Current clut content from GSRenderer is copied to the new buffer
    • On Source cache lookup:
      • Current GSRenderer clut content is compared against the copy in the Source.
    • If palette associated to the texture changes:
      • A new Source object with a different clut buffer is computed.
    • On Source deletion:
      • Free clut buffer.
  • 8 bit texture enabled (i.e. palettes are managed by the GPU, so palette texture upload is required):

    • On Source creation:
      • Clut array of 256 uint32 elements is allocated and cleared,
      • Current clut content from GSRenderer is copied to the new buffer,
      • A new Palette GSTexture object is created and current clut content is uploaded to the GPU.
    • On Source cache lookup:
      • If first lookup then:
        • Current clut content is uploaded to the GPU,
      • else:
        • Update the Source clut buffer from GSRenderer clut, and if the content mismatches, upload the new content to the same GSTexture.
    • If palette associated to the texture changes:
      • As said above, the new content is uploaded to the GPU
    • On Source deletion:
      • Free clut buffer,
      • Recycle palette GSTexture object.

The new proposed management works like that instead:

  • 32 bits texture enabled:

    • On Source creation:
      • Clut array of 16 or 256 uint32 elements is allocated (and not cleared) only if needed (improved).
    • Rest is unchanged.
  • 8 bits texture enabled: (improved)

    • On Source creation:

      • Object is marked for needing a palette texture and clut content copy (this avoids buffer allocation and clearing on Source creation).
    • On Source cache lookup:

      • If first lookup then:
        • The PaletteMap is queried (i.e. clut hash check plus compare64) for a valid GSTexture, clut buffer pair and the references are set in the Source object using GSRenderer clut as reference,
      • else:
        • The copy of the clut referenced by the Source object is used to check against the current GSRenderer clut, and if mismatch then again the PaletteMap is queried for a valid GSTexture, clut buffer pair and references are updated.

      Please note that clut buffer copies in CPU, that before were always performed, are done only on PaletteMap miss.
      Also, CPU to GPU palette upload is done only on miss.

    • If palette associated to the texture changes:

      • As said above, the PaletteMap is in charge of either finding or creating a new GSTexture, clut buffer pair.
    • On Source deletion:

      • Do nothing.

The new management shows small to none performances improvement in almost every game in terms of FPS, and it lowers the GPU BUS usage by some percentage points in games like Baldur's Gate: Dark Alliance (9% to 7%) which uses many palettes.

The hot topic is that the performances in Zone Of The Enders 2 skyrocketed, because of the fact that the games uses many palettes and a small number of textures to render it's effects.
I measured the PaletteMap size ~450 palettes in the most concitated moments, which has almost no impact on CPU and GPU memory usage.
I'd like to leave two screenshots for performances comparison, of the game running on my machine (i5-2500k@4.3GHz, GTX770 OC, 16GB DDR3-1600, Win7 64bit), with GSdx-SSE4 32 bit D3D11, default config and 8 bit textures enabled.

Here is the before:

zoe2

Here is the after:

zoe2_repo

More details are coming as edit of this PR in the next days.

TODO: Check interactions with ICO, as the GSRendererOGL allocates a palette for that game outside the GSTextureCache scope, thus clashes with the new PaletteMap.

@lightningterror

This comment has been minimized.

Copy link
Member

commented Mar 18, 2018

That's almost double! You should have removed the framelimiter to see the max fps it can reach 😛
edit: nvm now I noticed it was removed.

@gregory38

This comment has been minimized.

Copy link
Contributor

commented Mar 18, 2018

Cool. I'm not surprised of the perf impact. The idea was for Zone of Enders. But maybe some games use a similar mega-texture pattern.
The really sad news is that we have yet another game that doesn't need Vulkan to be fast !

@lightningterror

This comment has been minimized.

Copy link
Member

commented Mar 18, 2018

Tell that to the people that keep whining with Vulkan when ? xD

std::array<Palette, MAX_SIZE> m_palettes_stack; // Stack of Palette to be used to avoid memory fragmentation

// Array of 2 multimaps, the first for 64B palettes and the second for 1024B palettes.
// Each multimap has the clut hash value as key and the relevant palettes as values (multiple values possible for each key)

This comment has been minimized.

Copy link
@lightningterror

lightningterror Mar 18, 2018

Member

Small nitpick, extra spaces here.

This comment has been minimized.

Copy link
@AlessandroVetere

AlessandroVetere Mar 19, 2018

Author Contributor

Oh I put those by purpose, but yes maybe it's better without them!

// No Palette with matching clut content hash, MISS

// Create new Palette
m_palettes_stack[m_size] = Palette(renderer, pal);

This comment has been minimized.

Copy link
@gregory38

gregory38 Mar 19, 2018

Contributor

Is the palette stack really useful ?

This comment has been minimized.

Copy link
@AlessandroVetere

AlessandroVetere Mar 19, 2018

Author Contributor

Well, I was experimenting with different things, I don't really know the performance impact of the stack, but for small objects it may be good (also increase probability of false HIT in the hash map being a cache hit).

This comment has been minimized.

Copy link
@gregory38

gregory38 Mar 20, 2018

Contributor

I understand. I think the cache is already complex enough ;) Could you test the performance without this intermediate structure.

@@ -1344,12 +1344,16 @@ void GSRendererOGL::DrawPrims(GSTexture* rt, GSTexture* ds, GSTextureCache::Sour

// We need the palette to convert the depth to the correct alpha value.
if (!tex->m_palette) {

This comment has been minimized.

Copy link
@gregory38

gregory38 Mar 19, 2018

Contributor

IMHO, new code should be kinds of a palette lookup.

The purpose of this code is to create a palette from m_mem.m_clut. Now that you separate palette from source, maybe it could be improved.

This comment has been minimized.

Copy link
@AlessandroVetere

AlessandroVetere Mar 19, 2018

Author Contributor

Well, is the code is executed also with 32 bit textures? I didn't really inspected well the code, but to me seemed so. It may be needed a bit of refactoring in any case.

This comment has been minimized.

Copy link
@gregory38

gregory38 Mar 20, 2018

Contributor

On GPU it is a depth texture (so float number) but on GS side it is a PSM_PSMT8H (i.e. 8 bits palettes + 24 bits padding).

I don't remember why I did this ! Maybe we could just always load the clut from the texture cache (even when format is special). It should be mostly free now. I really should post my gsdump somewhere.

@willkuer

This comment has been minimized.

Copy link
Contributor

commented Mar 19, 2018

Is GS still the bottleneck for ZoE or is it now the EE? What do the indicators show?

@atomic83GitHub

This comment has been minimized.

Copy link
Contributor

commented Mar 19, 2018

This PR cause a regression in Jak 2, I have random crash with this error:

capture

@lightningterror

This comment has been minimized.

Copy link
Member

commented Mar 19, 2018

Maybe also include a block dump(+savestate) so Alessandro can track down the source faster.

@gregory38

This comment has been minimized.

Copy link
Contributor

commented Mar 19, 2018

Code is limited to 2048 palettes which aren't enough. Tbh, I think we should remove the intermediate stack containers and multiply palette by 10 or 100. We might need to purge all palette from time to time.

@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented Mar 19, 2018

@willkuer I did not put attention to the indicators in fact I was only looking at CPU and GPU usage vs FPS.
In fact OpenGL performances were not increased at all (25 fps before and after) so there are still bottlenecks (GS vs EE I will inspect) for this specific game.

@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented Mar 19, 2018

@atomic83GitHub Well, if this can be solved by increasing the max number of palettes stored in the map, as @gregory38 suggested, it's no problem, but if strange effects (like shadows or flare) are emulated with clut maybe the number of palettes may be too big.
Also purging palettes requires a bit of refactoring due to the different scopes when destructing Source objects and such to keep track of the number of referencing Source(s) for each palette, but it is doable indeed.

EDIT: I'm totally positive with increasing the size of the map, the 2048 was just a first attempt to find an upper bound.
Also I'm impressed that Jak 2 uses so many palettes, this PR may increase also it's performances.

@gregory38

This comment has been minimized.

Copy link
Contributor

commented Mar 19, 2018

Opengl speed is strange. Did you try to change blending accuracy setting ?

@gregory38

This comment has been minimized.

Copy link
Contributor

commented Mar 19, 2018

Some palette are used as coefficient factor to modulate intensity. So yes you could have little variation on palette.
It depends on how many palette you could handle in the multimap. (If hash is good enough, btw sse?). Palette is 1kB. 100K will give you 100MB of memory, it should be doable even for old PC.

@@ -2043,3 +2033,119 @@ void GSTextureCache::SourceMap::RemoveAt(Source* s)

delete s;
}

// Query the PaletteMap for a valid Palette, then assign both palette texture pointer and clut copy pointer to the Source object
void GSTextureCache::SetPalette(Source* s, const GSRenderer* renderer, uint16 pal)

This comment has been minimized.

Copy link
@gregory38

gregory38 Mar 20, 2018

Contributor

Maybe name it AttachPaletteToSource or PaletteLookup.

m_renderer->m_dev->Recycle(m_palette);

_aligned_free(m_clut);
if (!m_should_have_tex_palette) {

This comment has been minimized.

Copy link
@gregory38

gregory38 Mar 20, 2018

Contributor

I don't like this solution. Desctruction should be done in Palette object. I don't know what will be the speed impact but maybe you could use shared_pointer<T> instead. That being said, we should delete palette only when there are unused from a long time (potentially palette will be reused next frame)

This comment has been minimized.

Copy link
@AlessandroVetere

AlessandroVetere Mar 21, 2018

Author Contributor

The code path which recycles palette on Source destruction is potentially hit only by ICO (palette allocated outside the GSTextureCache scope when 32 bit textures, so m_should_have_tex_palette is false), but if this is not the case when the PaletteMap is not used, then m_palette in Source is always null and no recycling outside Palette destruction happens.
Desctruction is already done in PaletteMap in fact for all the normal cases, and only on Clear, which is called when RemoveAll is called in the GSTextureCache.

This comment has been minimized.

Copy link
@gregory38

gregory38 Mar 22, 2018

Contributor

It feels way too complex for me. ICO should be updated to a GetPallette (or SetPalette). It should work as the texture cache.

Potentially it could be interesting to use some kind of shared_ptr to handle the reference counting of allocation.

This comment has been minimized.

Copy link
@lightningterror

lightningterror Apr 27, 2018

Member

Can a better detection for ICO be added, one that doesn't rely on crc lookup ?

}

// Retrieves the palette with the desired clut
GSTextureCache::Palette* GSTextureCache::PaletteMap::GetPalette(const GSRenderer* renderer, uint16 pal) {

This comment has been minimized.

Copy link
@gregory38

gregory38 Mar 20, 2018

Contributor

I will call it LookupPalette, just to highlight that it is more complex than just get a variable.

@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented Mar 21, 2018

@atomic83GitHub If you could please share a dump I can better profile the game behavior with the new code, thanks.
If so many palette are used it is maybe to emulate intensity variation so it is a good benchmark for the PaletteMap scaling capabilities and potential performance improvement.


// Hashes the content of the clut
size_t GSTextureCache::PaletteMap::HashClut(uint16 pal, const uint32* clut) {
size_t clut_hash = 3831179159;

This comment has been minimized.

Copy link
@gregory38

gregory38 Mar 21, 2018

Contributor

Add some comments to explain where you did get those hash value.

This comment has been minimized.

Copy link
@AlessandroVetere

AlessandroVetere Mar 21, 2018

Author Contributor

Big prime number out of the magic hat 🎩
Jokes aside, I tried to build a simple hash function and this seemed to work well.
It can be improved by hashing 16 indices at a time, using different prime numbers (they have magic properties when hashing) for each index.
The hash function is critical and open to improvements.
I didn't find any SSE method which is "easy" to implement for hashing.
The use case is very specialized though, so maybe hashing 16 values (1 iteration for pal == 16, 16 iterations for pal == 256) at a time with different combinations of operations/prime numbers may give good results.

This comment has been minimized.

Copy link
@gregory38

gregory38 Mar 22, 2018

Contributor

Comment will be nice. This way, people can test various hash function to see the impact.

This comment has been minimized.

Copy link
@lightningterror

lightningterror Apr 26, 2018

Member

Also I prefer if void and similar functions being in a new line like this

void something
{
}

instead of void something {

@atomic83GitHub

This comment has been minimized.

Copy link
Contributor

commented Mar 21, 2018

@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented Mar 21, 2018

I'm sorry I can't replay the dump, rundll32 crashes when reading the file.
I extracted the .7z and tried to run the .dump file in the folder, should I do something else?

@MrCK1

This comment has been minimized.

Copy link
Member

commented Mar 21, 2018

Oh he gave you a blockdump for diagnosing core issues. I assume you were requesting a GS dump instead?

@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented Mar 21, 2018

Yes I was requesting a GS Dump, my bad for being unclear, sorry.

@atomic83GitHub

This comment has been minimized.

Copy link
Contributor

commented Mar 21, 2018

@lightningterror

This comment has been minimized.

Copy link
Member

commented Mar 22, 2018

Hint: in the future for single frame block dumps you can just upload them on github. Just rename the file from .xz to .zip :)
They shouldn't be that large and can be done I guess most of the time.

@gregory38

This comment has been minimized.

Copy link
Contributor

commented Mar 31, 2018

Before I forget it again. OpenGL is quite slow to update the uniform buffer. ZoE rendering will ping-pong between multiple clamping region. Potentially it is the main reason why it is slow. (The other possibility is the multiple blending state change)

@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented Nov 4, 2018

I'm on my way to deliver also the unordered_map container for the palette cache...it's an easier task than expected!

@lightningterror

This comment has been minimized.

Copy link
Member

commented Nov 4, 2018

Nice, I'll merge this once it's added.

GSTextureCache: Convert PaletteMap base container from unordered_mult…
…imap to unordered_map to reuse standard library logic, using custom key, hasher and comparator for such container.
@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented Nov 4, 2018

The change has been pushed!
Thanks @turtleli for the unordered_map hint, the caching solution is much clearer this way.

@lightningterror

This comment has been minimized.

Copy link
Member

commented Nov 4, 2018

Travis build seems to fail with the error

/home/travis/build/PCSX2/pcsx2/plugins/GSdx/GSTextureCache.cpp:2190:75: error: ‘palette_size’ was not declared in this scope
    GL_INS("INFO, %u-bit PaletteMap (Size %u): Cleared %u palettes.", palette_size, map.size(), cleared_palette_count);
@lightningterror

This comment has been minimized.

Copy link
Member

commented Nov 4, 2018

There's also some warnings as well
https://travis-ci.org/PCSX2/pcsx2/jobs/450450991

@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented Nov 4, 2018

My bad, VS was not saying a thing about the missing variable declaration! And as for the warnings I'll fix those as well, in a couple of hours I can get to PC and finish the work.

@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented Nov 4, 2018

Errors and warnings fixed! 🎉

@lightningterror lightningterror merged commit 9fa1b29 into PCSX2:master Nov 4, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@lightningterror

This comment has been minimized.

Copy link
Member

commented Nov 4, 2018

Great work, and merged it. Looking forward to your next project :)

@lightningterror lightningterror changed the title GSTextureCache: Palette management rework. Huge performance boost in ZoE2 with 8-bit textures enabled. GSdx-TC: Palette management rework Nov 5, 2018

lightningterror added a commit that referenced this pull request Nov 10, 2018

GSdx-TC: Remove virtual specifier from PaletteKeyHash operator, Palet…
…teKeyEqual operator and Palette destructor. (#2680)

Small follow up corrections from #2344 highlighted by @turtleli

arcum42 added a commit that referenced this pull request Nov 11, 2018

GSdx-TC: Palette management rework. (#2344)
This follows the 4th point of #2310.
The idea was to rework the current palette texture management to improve performances with games that uses multiple palettes for the same data texture.

The new management shows small to none performances improvement in almost every game in terms of FPS, and it lowers the GPU BUS usage by some percentage points in games like Baldur's Gate: Dark Alliance (9% to 7%) which uses many palettes.

The hot topic is that the performances in Zone Of The Enders 2 skyrocketed (2x), because of the fact that the game uses many palettes and a small number of textures to render it's effects.

For more detailed information check the PR  #2344

arcum42 added a commit that referenced this pull request Nov 11, 2018

GSdx-TC: Remove virtual specifier from PaletteKeyHash operator, Palet…
…teKeyEqual operator and Palette destructor. (#2680)

Small follow up corrections from #2344 highlighted by @turtleli

lightningterror added a commit that referenced this pull request Nov 13, 2018

GSdx-TC: Use PaletteMap also when 8-bit texture is disabled by cachin…
…g only clut copies. (#2681)

Enabled caching of clut copies with PaletteMap also in the case 8-bit texture is disabled, which is the default (on #2344 the caching of clut copies and palette textures was done only when 8-bit texture was enabled).
Brings moderate speedups ~10% in the most concitated parts of the ZoE2 Anubis benchmark, but may improve performance in all the cases when there are many Source objects created with clut copies to be stored.
The quality of the comments has been improved to better highlight the mechanics of the caching system.
@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented on plugins/GSdx/GSTextureCache.cpp in 44726af Nov 23, 2018

Maybe causes regressions, see #2706, where the CLUT might have changed in the 8-bit enabled case before assigning it in the LookupSource.

@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented on plugins/GSdx/GSTextureCache.cpp in 44726af Nov 23, 2018

The codepath for 8-bit disabled case should have remained the same, in this commit and also in the following ones / PR #2681.

@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented on plugins/GSdx/GSTextureCache.cpp in 44726af Nov 23, 2018

Another changed with respect to legacy code is that the m_clut variable is now assigned with the exact places required and it is not of the maximum size (256 uint32). The accesses to the array are always done taking the exact size into account, so this should not be a problem, but the comment stays.

@AlessandroVetere

This comment has been minimized.

Copy link
Contributor Author

commented on plugins/GSdx/GSTextureCache.h in 44726af Nov 23, 2018

This might need to return

samkatakouzinos added a commit to samkatakouzinos/pcsx2 that referenced this pull request Jan 6, 2019

GSdx-TC: Palette management rework. (PCSX2#2344)
This follows the 4th point of PCSX2#2310.
The idea was to rework the current palette texture management to improve performances with games that uses multiple palettes for the same data texture.

The new management shows small to none performances improvement in almost every game in terms of FPS, and it lowers the GPU BUS usage by some percentage points in games like Baldur's Gate: Dark Alliance (9% to 7%) which uses many palettes.

The hot topic is that the performances in Zone Of The Enders 2 skyrocketed (2x), because of the fact that the game uses many palettes and a small number of textures to render it's effects.

For more detailed information check the PR  PCSX2#2344

samkatakouzinos added a commit to samkatakouzinos/pcsx2 that referenced this pull request Jan 6, 2019

GSdx-TC: Remove virtual specifier from PaletteKeyHash operator, Palet…
…teKeyEqual operator and Palette destructor. (PCSX2#2680)

Small follow up corrections from PCSX2#2344 highlighted by @turtleli

samkatakouzinos added a commit to samkatakouzinos/pcsx2 that referenced this pull request Jan 6, 2019

GSdx-TC: Use PaletteMap also when 8-bit texture is disabled by cachin…
…g only clut copies. (PCSX2#2681)

Enabled caching of clut copies with PaletteMap also in the case 8-bit texture is disabled, which is the default (on PCSX2#2344 the caching of clut copies and palette textures was done only when 8-bit texture was enabled).
Brings moderate speedups ~10% in the most concitated parts of the ZoE2 Anubis benchmark, but may improve performance in all the cases when there are many Source objects created with clut copies to be stored.
The quality of the comments has been improved to better highlight the mechanics of the caching system.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.