
Greg/vif hash #1706

Merged
merged 16 commits into master from greg/vif-hash on Dec 21, 2016

@gregory38
Contributor

gregory38 commented Dec 16, 2016

I reimplemented the block management of the VIF recompiler. The goal was better speed.

Based on the vtune report, it seems to be a success: overhead goes from 14.5% to 3.8% of effective VU thread time. Hopefully it will translate to better fps. Benchmarks are welcome.

@gregory38 gregory38 requested review from refractionpcsx2 and sudonim1 Dec 16, 2016

@gregory38

Contributor

gregory38 commented Dec 16, 2016

@refractionpcsx2 @sudonim1 please review my change. I tried to do small commits to ease review.

@FlatOutPS2 please test and benchmark it.

@refractionpcsx2

Member

refractionpcsx2 commented Dec 16, 2016

I'll give it a test later, I know a good few awkward unpack games :)

@gregory38

Contributor

gregory38 commented Dec 16, 2016

By the way, I did my measurements with all the recommended speedhacks + MTVU.

@refractionpcsx2

Member

refractionpcsx2 commented Dec 16, 2016

By the way, I did my measurements with all the recommended speedhacks + MTVU.

We shall see :) I'll put together some numbers vs. the latest Git

@refractionpcsx2

Member

refractionpcsx2 commented Dec 16, 2016

Here are some benchmarks; looks like a mild improvement :)

tekken tag
vifhash: 82-85fps
master: 79-83fps

Spyro - A Hero's Tail
vifhash: 166-170fps
master: 163-168fps

MGS3
vifhash: 198-207fps
master: 194-204fps

Enthusia Professional Racing (maybe not the best test :P)
vifhash: 217-224fps
master: 217-223fps

Crash N Burn - Menu
vifhash: 51-53fps
master: 49-52fps

Gran Turismo 4 - Stopping after race start
vifhash: 173-177fps
master: 173-176fps

Grandia III
vifhash: 173-174fps
master: 168-169fps

SotC:
vifhash: 91-95fps
master: 89-92fps

@FlatOutPS2

Member

FlatOutPS2 commented Dec 16, 2016

I noticed roughly the same improvement, 1-3 FPS at most, in the games I tried, but I didn't have MTVU enabled. I'll give it another shot with it enabled.

@refractionpcsx2

Member

refractionpcsx2 commented Dec 16, 2016

I had MTVU enabled, GSDX set to "native" with basic blending. Usual recommended speedhacks, but everything else default

@FlatOutPS2

Member

FlatOutPS2 commented Dec 16, 2016

I had MTVU enabled, GSDX set to "native" with basic blending. Usual recommended speedhacks, but everything else default

I used basically the same default settings, the only thing I changed was setting Renderer to OGL HW and CRC hack to Partial.

Tourist Trophy on master averaged 61 FPS without MTVU and 59 FPS with it (so it was slower with the hack), but using the PR it's 61 FPS with or without the hack; so even though it's not faster, the slowdown seems to be gone.

@gregory38

Contributor

gregory38 commented Dec 16, 2016

I'm quite happy with the fps bonus. The code was already optimized by both Cotton & Jake; I only did some micro optimizations.

Here is the MTVU thread vtune report:
[screenshot: vtune profile of the MTVU thread]

Potentially some atomic/locking can be improved. The real meat is the recompiler, but the ring buffer overhead isn't small.

@refractionpcsx2

Member

refractionpcsx2 commented Dec 16, 2016

I think we can expect the utilisation from the recompiler to be high, that is a given (we can maybe optimise things within it, but it's a big task: it means sifting through recompiled code and comparing it to the original, not a fun job). But yeah, the ring buffers shouldn't be consuming lots of time; the less time small tasks like that take up, the better.

@gregory38

Contributor

gregory38 commented Dec 16, 2016

Yeah, we're both on the same page. Improving the recompiler is a daunting task and it might not even be possible. That's why I spent some time on the vif unpack, which is still high (it used to be slower than the ring buffer).

As you can see we have a huge lock time. But potentially it is only a sync between VU/GS and EE (there are warnings on debug builds). However we spend quite some time in the std::deque (push_back). And potentially we do too many (redundant) locks/atomics. IMHO, the deque could be replaced with a ring buffer (maybe even one without locks/atomics).

@refractionpcsx2

Member

refractionpcsx2 commented Dec 16, 2016

if you think it might provide better performance without impact on compatibility, then it's certainly worth a go :)

@gregory38

Contributor

gregory38 commented Dec 16, 2016

Honestly, for the perf I don't know ;) There is another thing that bugs me: there is a boolean in the GS_Packet structure. It increases the size from 16B to 20B, and we do various copies/memsets of the structure. This boolean is basically only used as the return value of path.ExecuteGSPacket(); maybe we can just use a reference to a boolean. And potentially, if the struct is aligned properly, the compiler might use some SSE ops. Anyway, a lock-free ring buffer would be nice, but I will first take some vacation ;)
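
For illustration, a minimal sketch of the out-parameter idea above (the type and field names are hypothetical, not the actual GS_Packet layout):

```cpp
#include <cstdint>

// Illustrative 16-byte packet: without the trailing bool the struct
// stays at 16B, so copies/memsets touch a single 16B block (and an
// aligned struct could be moved with one SSE op).
struct GS_Packet16 {
    uint32_t offset;
    uint32_t size;
    int32_t  cycles;
    uint32_t path;
};
static_assert(sizeof(GS_Packet16) == 16, "packet should stay at 16B");

// The 'done' flag travels as a return value (or an out reference)
// instead of living inside the struct.
static bool ExecuteGSPacket(const GS_Packet16& pkt)
{
    return pkt.size == 0; // stub: the real code would process the packet
}
```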

@refractionpcsx2

Member

refractionpcsx2 commented Dec 16, 2016

Good plan :)
BTW I was looking at the vu check code you committed, I'm not sure how much that will help as it only checks compilation time, not the time vu blocks take to run.

Site compilation time is slow, but it happens rarely, so it isn't really the side we need to focus on

@refractionpcsx2

Member

refractionpcsx2 commented Dec 16, 2016

"Sure", not "site". Damn, I wish GitHub mobile let you edit posts!

@gregory38

Contributor

gregory38 commented Dec 16, 2016

You mean my recent commit? It doesn't measure the compilation time. It tells the profiler that data between 0x... and 0x... is code, named VU (I could also have a per-block view, so you can see whether the recompiler speed is mostly dependent on a single block).
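
For context, on Linux profilers commonly pick up JIT-generated code through the perf map convention. Here is a minimal sketch of that convention (this is the generic format, not necessarily what Perf::vif.map does internally):

```cpp
#include <cstdio>
#include <cstdint>
#include <unistd.h>

// Each line of /tmp/perf-<pid>.map is "<start> <size> <symbol>" in hex;
// the profiler uses it to attribute samples that fall inside JIT-emitted
// code to a human-readable name (e.g. "VU" or "VIF").
static void perf_map_register(uintptr_t start, size_t size, const char* name)
{
    char path[64];
    snprintf(path, sizeof(path), "/tmp/perf-%d.map", (int)getpid());
    if (FILE* f = fopen(path, "a")) {
        fprintf(f, "%zx %zx %s\n", (size_t)start, size, name);
        fclose(f);
    }
}
```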

pcsx2/x86/newVif_Dynarec.cpp Outdated
Perf::vif.map((uptr)v.recWritePtr, xGetPtr() - v.recWritePtr, block.upkType /* FIXME ideally a key*/);
nVif[idx].recWritePtr = xGetPtr();
return v.vifBlocks.find(block);

@turtleli

turtleli Dec 16, 2016

Member

I think returning &block is pretty much the same thing? (EDIT: Well, I suppose this is less prone to external code changes, and compilation doesn't happen too often.)

The nVif[idx].recWritePtr slightly above is a bit odd since you've used v instead of nVif[idx] everywhere else in the function.

@gregory38

gregory38 Dec 18, 2016

Contributor

Yes, the interface is a bit awkward. Another possibility would be a void dVifCompile(const nVifBlock& block, ...) interface, moving the startPtr/length update outside of the compilation.

pcsx2/x86/newVif_HashBucket.h Outdated
}
virtual ~HashBucket() throw() { clear(); }

@turtleli

turtleli Dec 16, 2016

Member

This could probably be made non-virtual.

pcsx2/x86/newVif_HashBucket.h Outdated
};
#include <array>
// nVifBlock - Ordered for Hashing; the 'num' field and the lower 6 bits of upkType are

@turtleli

turtleli Dec 16, 2016

Member

The comment needs fixing since it's no longer just the lower 6 bits of upkType.

pcsx2/x86/newVif_HashBucket.h Outdated
u32 size = bucket_size( dataPtr );
// Warning there is an extra +1 due to the empty cell
if( (m_bucket[b] = (nVifBlock*)pcsx2_aligned_realloc( m_bucket[b], sizeof(nVifBlock)*(size+2), 16, sizeof(nVifBlock)*(size+1) )) == NULL ) {

@turtleli

turtleli Dec 16, 2016

Member

Since SSE isn't used anymore, maybe aligned realloc/malloc/free don't need to be used? Though it won't hurt to leave it in.

@gregory38

gregory38 Dec 18, 2016

Contributor

Hum, the annoying part is the realloc (I'm a very lazy guy). Potentially some std::vector could be used too. However I'm afraid for the perf; I need to check the generated code.

Also, the code in the find is wrong: I do the ++ too soon!

@gregory38

gregory38 Dec 18, 2016

Contributor

Hum, std::vector is maybe not a good solution. The idea of storing only a pointer (inside m_bucket) was to reduce cache miss probability. A std::vector contains 3 pointers on the Linux STL.
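
A minimal sketch of the pointer-only bucket with the null-sentinel chain described in the "new implementation of the hash bucket" commit message further down (field names are illustrative, not the actual nVifBlock layout):

```cpp
#include <cstdint>

// Each bucket holds a single pointer to a flat array of cells; the
// array ends with a cell whose startPtr is null (the "empty cell").
struct Cell {
    uint32_t  key0;     // first half of the 8B lookup key
    uint32_t  key1;     // second half of the 8B lookup key
    uintptr_t startPtr; // recompiled-code pointer; 0 terminates the chain
};

static Cell* chain_find(Cell* chain, uint32_t k0, uint32_t k1)
{
    // Hot path: the cache rarely misses and chains are short (< 4),
    // so a linear scan over contiguous cells is cache friendly.
    for (Cell* c = chain; c->startPtr != 0; ++c)
        if (c->key0 == k0 && c->key1 == k1)
            return c;
    return nullptr; // cold path: caller compiles and appends a new cell
}
```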

pcsx2/x86/newVif_HashBucket.h Outdated
for (int i = 0; i < hSize; i++) {
safe_aligned_free(mBucket[i].Chain);
mBucket[i].Size = 0;
for (size_t i = 0; i < m_bucket.size(); i++)

@turtleli

turtleli Dec 16, 2016

Member

Range-based for? Same below.

@refractionpcsx2

Member

refractionpcsx2 commented Dec 16, 2016

Oh right, I understand now

@gregory38

Contributor

gregory38 commented Dec 18, 2016

@refractionpcsx2 by the way, if you look at my previous screenshot you can see both VU and VIF entries. The time spent in the VIF recompiler is much smaller; the overhead to find/jump to the block is now 2 times bigger (it used to be 4x-5x).

On the ring buffer topic, there are various possibilities (see the sketch after this list):

  • use a lock-free mutex instead of a std mutex
  • use a ring buffer instead of the deque (to avoid the memory allocation). Need to find the best size.
  • combine the 2 previous ideas into a lock-free ring buffer (i.e. use atomics for the read/write pointers)

Hum, that being said, there is also a semaphore semaXGkick that (at least partially) protects the structure too. Potentially a ring buffer without locks or atomics would be enough.
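
A minimal sketch of the third option: a single-producer/single-consumer ring buffer whose read/write indices are atomics. This is illustrative only, not PCSX2's MTVU code; the capacity must be a power of two so wrap-around is a cheap mask.

```cpp
#include <atomic>
#include <cstddef>

template <typename T, size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf[N];
    std::atomic<size_t> read{0}, write{0}; // monotonically increasing

public:
    bool push(const T& v) // called only by the producer thread
    {
        size_t w = write.load(std::memory_order_relaxed);
        if (w - read.load(std::memory_order_acquire) == N)
            return false; // full
        buf[w & (N - 1)] = v;
        write.store(w + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& v) // called only by the consumer thread
    {
        size_t r = read.load(std::memory_order_relaxed);
        if (write.load(std::memory_order_acquire) == r)
            return false; // empty
        v = buf[r & (N - 1)];
        read.store(r + 1, std::memory_order_release);
        return true;
    }
};
```
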
@refractionpcsx2

Member

refractionpcsx2 commented Dec 18, 2016

Hmm, now you mention it, I believe the GS thread was a ring buffer (not sure if it still is?). I have a sneaking suspicion that the ring buffer size varied depending on the game and how heavy it was on that unit.

I could be wrong, just a niggle in the back of my head.

gregory38 added some commits Dec 10, 2016

vif: improve block compilation management
Safety:
* check remaining space before compilation
* clear hash if recompiler is reset

Perf:
* don't search the hash again after a miss
* reduce branching in Unpack/ExecuteUnpack

Note: a potential speed optimization for dVifsetVUptr
Precompute the length and store it in the cache. However it needs 2B in the
nVifBlock struct. Maybe we can compact cl/wl, or merge aligned with upkType
(if some bits are unused)
vif: don't allocate vifblock hash on the heap
Avoid an extra indirection to access the hash bucket (Find function)

gregory38 added some commits Dec 13, 2016

vif: remove the type template of HashBucket
The class is designed and optimized for the layout of nVifBlock.
Besides, it will ease future improvements.
vif: new implementation of the hash bucket
Previous implementation saved both the chain pointer and the chain size.
Rationale: the size is useful to add a new element and to detect the end of
the chain. The Vif cache rarely misses, so 'add' is rarely called and the
end of a chain is rarely reached.

The new implementation adds a null cell at the end of the chain. As a
cell contains an x86 pointer, if it is null you can conclude that you
have reached the end of the chain.

The 'add' function will traverse the chain to get the current size. It is
a cold path; besides, the chain is often short (< 4).

The 'find' function only needs to check the startPtr bytes to detect the
end of the loop.

Note: SizeChain was replaced with a std::array
vif: reorganize dVifUnpack
Inline the execution part
Add a num parameter to dVifsetVUptr
Use a local variable for the nVifBlock instead of a global struct state

The goal is to ease future updates of the nVifBlock struct
vif: repack nVifBlock struct
cl/wl can fit in a single byte. Add a 2B length field instead.
It will contain the precomputed length to reduce dVifsetVUptr overhead
vif: increase the number of buckets to 64K
It allows comparing only 8B in the lookup, so SSE could be replaced with
general instructions

As a bonus, it allows computing the hash key with a mov rather than a modulo
(which was an 'and')
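
A hypothetical illustration of the last point (the key packing here is made up, not the actual nVifBlock layout): with a 64K-entry table, a 16-bit slice of the packed key can index the table directly with a single mov, whereas a smaller power-of-two table needs an explicit mask.

```cpp
#include <cstdint>

// With 64K buckets, the low 16 bits of the key are the index: the
// compiler emits a single 16-bit mov/movzx, no masking required.
inline uint32_t bucket_index_64k(uint64_t packedKey)
{
    return (uint16_t)packedKey;
}

// With a smaller power-of-two table, the "modulo" is an explicit 'and'.
inline uint32_t bucket_index_masked(uint64_t packedKey, uint32_t tableSize)
{
    return (uint32_t)packedKey & (tableSize - 1);
}
```
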
@gregory38

Contributor

gregory38 commented Dec 18, 2016

GS.h

// Size of the ringbuffer as a power of 2 -- size is a multiple of simd128s.
// (actual size is 1<<m_RingBufferSizeFactor simd vectors [128-bit values])
// A value of 19 is a 8meg ring buffer.  18 would be 4 megs, and 20 would be 16 megs.
// Default was 2mb, but some games with lots of MTGS activity want 8mb to run fast (rama)
static const uint RingBufferSizeFactor = 19;

// size of the ringbuffer in simd128's.
static const uint RingBufferSize = 1<<RingBufferSizeFactor;

// Mask to apply to ring buffer indices to wrap the pointer from end to
// start (the wrapping is what makes it a ringbuffer, yo!)
static const uint RingBufferMask = RingBufferSize - 1;

struct MTGS_BufferedData
{
	u128		m_Ring[RingBufferSize];
	u8			Regs[Ps2MemSize::GSregs];

	MTGS_BufferedData() {}

	u128& operator[]( uint idx )
	{
		pxAssert( idx < RingBufferSize );
		return m_Ring[idx];
	}
};

extern __aligned(32) MTGS_BufferedData RingBuffer;
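
For reference, a sketch of how a power-of-two mask like the RingBufferMask quoted above is typically used (it relies on the constants defined in the snippet):

```cpp
// Because RingBufferSize is a power of two, advancing an index wraps
// from end back to start with a single 'and' instead of a compare/modulo.
static inline uint ring_advance(uint pos, uint count)
{
    return (pos + count) & RingBufferMask;
}
```
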
@refractionpcsx2

Member

refractionpcsx2 commented Dec 18, 2016

Okay so 8meg was taken as the sweet spot in that scenario, fair enough :)

@gregory38

Contributor

gregory38 commented Dec 18, 2016

However I'm not 100% sure of the connection between all the threads. My feeling is that:

  • the EE sends the command to MTVU, and sends a WAIT_MTVU command to MTGS
  • MTVU receives the command, and posts the result in the deque
  • MTGS waits for the result in the deque

The MTVU/MTGS interconnection is quite complex.

@refractionpcsx2

Member

refractionpcsx2 commented Dec 18, 2016

I can imagine it isn't fun lol. Passing data from one thread back to another thread, to pass it on to yet another thread, can't be easy.

gregory38 added some commits Dec 15, 2016

vif: replace sse cmp code with standard cmp
Standard instructions are faster to execute; besides, the CPU can optimize the cmp/jne pair

SSE

  e0:	add    ecx,0x10
  e3:	cmp    eax,0x7
  e6:	jg     1b0 <void dVifUnpack<0>(unsigned char const*, bool)+0x1b0>
enter_loop:
  ec:	vpcmpeqd xmm0,xmm1,XMMWORD PTR [ecx]
  f0:	vmovmskps eax,xmm0
  f4:	cmp    eax,0x7
  f7:	jne    e0 <void dVifUnpack<0>(unsigned char const*, bool)+0xe0>

Standard cmp

  d8:	add    eax,0x10
  db:	mov    esi,DWORD PTR [eax+0xc]
  de:	test   esi,esi
  e0:	je     190 <void dVifUnpack<0>(unsigned char const*, bool)+0x190>
enter_loop:
  e6:	cmp    ecx,DWORD PTR [eax+0x4]
  e9:	jne    d8 <void dVifUnpack<0>(unsigned char const*, bool)+0xd8>
  eb:	cmp    DWORD PTR [eax+0x8],ebx
  ee:	jne    d8 <void dVifUnpack<0>(unsigned char const*, bool)+0xd8>

v2: use a reference instead of a pointer for the find parameter
vif: inline dVifsetVUptr function
It avoids a double cmp/jmp in dynarec/interpreter mode.
vif: move back the cache search into the unpack function
Avoids the various moves to return the value (actually due to the pointer)
vif: use u32 code instead of u8/u16
It avoids memory stalls and greatly reduces the overhead of the dVifUnpack function

Here is a vtune summary of this branch (done on SotC init)

dVifUnpack<1> was 14.5% of effective VU thread time
dVifUnpack<1> is now 3.8% of effective VU thread time

I hope it will translate to better fps
vif: update alignment constraint
16B alignment is now useless for nVifBlock (no more SSE).
However, update the alignment of the bucket to 64B; it will reduce cache miss
probability in the find loop
@gregory38

Contributor

gregory38 commented Dec 18, 2016

@turtleli I updated the code based on your feedback, and I fixed my silly mistake. I will take 2 weeks of vacation. You're in charge of the merge, merry Christmas 😝

@turtleli

Member

turtleli commented Dec 18, 2016

I will take 2 weeks of vacations.

No problem, have fu

You're in charge of the merge

....N noooooooooooooooooooooooooooooooooooo! XD

Yeah sure, I'll check over the code. Have fun :)

@turtleli

I think it's fine. It should behave the same, the cache reset fixes make sense to me, and I think the code is more readable too.

I'll leave the PR open a day or two longer for anyone that wants to do extra testing (and post results).

@turtleli turtleli merged commit 10eb88f into master Dec 21, 2016

4 checks passed:

  continuous-integration/appveyor/branch: AppVeyor build succeeded
  continuous-integration/appveyor/pr: AppVeyor build succeeded
  continuous-integration/travis-ci/pr: The Travis CI build passed
  continuous-integration/travis-ci/push: The Travis CI build passed
@refractionpcsx2

Member

refractionpcsx2 commented Dec 21, 2016

Oh yeh, forgot to test this more lol. Oh well, I'm sure we'll see if any issues pop up. Great job @gregory38 :)

@gregory38

Contributor

gregory38 commented Dec 22, 2016

Thank you. I will look at the ring buffer when I come back. I think the overhead is related to the huge number of fake MTVU packets; hopefully fixing that will improve MTVU efficiency.

@refractionpcsx2

Member

refractionpcsx2 commented Dec 22, 2016

That would be excellent

@gregory38 gregory38 deleted the greg/vif-hash branch Dec 30, 2016
