Lightweight putllc() for non-TSX if no data changed #6728

Closed · wants to merge 6 commits into from

Conversation

plappermaul (Contributor)

Another one for discussion ... Again developed on Linux.

I noticed heavy contention in SPU threads because of excessive use
of vm::writer_lock(). It boils down to the fact that some games make
heavy use of getllar() & putllc(), and my machine has no TSX. This function
pair reads/writes 128 bytes between main memory and SPU local
memory, and the writeback is wrapped in defensive locking logic.

This patch tries to mitigate the situation with a workaround.
To recap the workflow:

  1. getllar() links to a full cache line (128 bytes) of shared data
  2. The data is copied into local SPU memory, where it can be modified
  3. Some data inside the block is modified
  4. putllc() writes the data back if main memory has not changed

When executing step 4 we can distinguish 3 cases:

a. No data has been changed: why would we want to write it back?

b. Only one of the 8 16-byte data blocks has been changed. In this case
we can use cmpxchg16b to realize a conditional atomic writeback without
a lock. This is the lucky punch.

c. Multiple 16-byte data blocks have been changed. For that we fall
back to the current heavyweight implementation.
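The three-way classification above can be sketched in plain C++ (a hypothetical illustration, not the actual patch code; `classify_changes` is a made-up name and `rdata` stands for the getllar() snapshot). It scans the 128-byte line in 16-byte blocks and reports whether zero, one, or multiple blocks differ:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// A 128-byte cache line, the granularity of getllar()/putllc().
using line_t = std::array<std::uint8_t, 128>;

// Returns -1 if no 16-byte block changed, the block index (0..7) if
// exactly one block changed, or 8 if more than one block changed.
int classify_changes(const line_t& rdata, const line_t& to_write)
{
    int changed = -1;
    for (int i = 0; i < 8; i++)
    {
        if (std::memcmp(rdata.data() + i * 16, to_write.data() + i * 16, 16) != 0)
        {
            if (changed != -1)
                return 8; // multiple blocks changed: heavyweight fallback (case c)
            changed = i;  // remember the single changed block (case b candidate)
        }
    }
    return changed; // -1: nothing to write back (case a)
}
```

On the single-block result (0..7) the patch can then attempt a lock-free cmpxchg16b on just that block; the other results take the no-op path or the locked fallback.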

For a successful patch it is essential that paths a/b are taken much more
often than path c. Some numbers:

GoW 3 menu screen:

  • No changes: 94.7%
  • Single 16-byte block changes: 4.9%
  • Multi-block changes: 0.3%

Bomberman Ultra menu screen:

  • No changes: 27.6%
  • Single 16-byte block changes: 60.8%
  • Multi-block changes: 11.6%

REMARK! Of course there might be several caveats, but I want to outline a
tricky one about the implementation. To enter path b the patch does not
check for changes to main memory, only to the linked backup copy of the
cache line. That leaves a special case not covered so far: what if we
discover that local block 1 has changed and write the data back while
another thread simultaneously changes block 8 of the same shared data?
From a noob & performance perspective I would say that this does not
matter.

Attached you will find a debugging-enabled patch that writes some numbers
into the error log. These are:

cnt:     number of putllc() calls
success: number of single 16-bytes block writebacks
zero:    number of suppressed writebacks
multi:   number of full 128 bytes writebacks
diff:    number of changes to main memory between getllar() and putllc()
???:     number of writebacks where another block in the cache line changed

With the patch I have no problems getting in-game in GoW 3. Performance
on my 2-core machine is too awful (1-2 fps) to give a good before/after
comparison. Maybe someone has time to check the PR and report numbers back.

elad335 commented Oct 8, 2019

It's not atomic, and probably has regressions.

RipleyTom commented Oct 8, 2019

> What if we discover that local block 1 has changed and write the data back
> while another thread simultaneously changes block 8 of the same shared data?
> From a noob & performance perspective I would say that this does not matter.

It actually does matter, as the whole cache line is supposed to be atomic.
There is something to the idea of skipping the lock if the data is the same, but apparently elad already investigated the idea and it wasn't worth it.

The first commit was overeager in reducing the putllc() writebacks. It might
work, but it breaks atomicity. The attached fixes compare all three data
structures (main memory, local copy, new data). If they are equal, we have
nothing to do and just report success to the calling SPU thread.

Waiting in the GoW 3 main menu we can avoid 95% of the locking inside putllc();
in Bomberman Ultra we save 30%. Some perf numbers from one SPU thread in GoW 3:

before:

```
40,50%   0,00%  SPU[0x2000004]   [unknown]    [.] 0000000000000000
23,46%   0,60%  SPU[0x2000004]   rpcs3        [.] spu_thread::set_ch_value
16,83%   0,00%  SPU[0x2000004]   [unknown]    [.] 0x8045013080450130
13,63%  13,63%  SPU[0x2000004]   rpcs3        [.] vm::writer_lock::writer_lock
 7,52%   7,46%  SPU[0x2000004]   rpcs3        [.] spu_thread::do_dma_transfer
 4,80%   4,72%  SPU[0x2000004]   rpcs3        [.] spu_thread::process_mfc_cmd
```

after:

```
49,92%   0,00%  SPU[0x2000002]   [unknown]    [.] 0x8045013080450130
22,45%   0,00%  SPU[0x2000002]   [unknown]    [.] 0000000000000000
15,98%   1,13%  SPU[0x2000002]   rpcs3        [.] spu_thread::set_ch_value
 9,95%   9,95%  SPU[0x2000002]   rpcs3        [.] vm::writer_lock::writer_lock
 5,98%   5,94%  SPU[0x2000002]   rpcs3        [.] spu_thread::do_dma_transfer
 3,70%   3,61%  SPU[0x2000002]   rpcs3        [.] spu_thread::set_interrupt_status
 3,11%   3,11%  SPU[0x2000002]   [JIT] 6266   [.] 0x00007fa71d03b075
 2,97%   2,91%  SPU[0x2000002]   rpcs3        [.] spu_thread::process_mfc_cmd
```
@plappermaul (Contributor, Author)

Looking at the massive unnecessary putllc() overhead in GoW 3, I gave a second version of the patch a try. Awaiting your feedback.

@elad335 (Contributor) left a comment

It's better now, but you still need to advance rtime on success.

@plappermaul (Contributor, Author)

I will check it tomorrow. Btw., what about the following simplification:

```cpp
...
else if (auto& data = vm::_ref<decltype(rdata)>(addr); rtime == (vm::reservation_acquire(raddr, 128) & -128) && !cmp_rdata(rdata, to_write))
...
```

This would be close to the old implementation, but we only compare stable local data. The reasoning behind it would be:

  1. We did not change anything inside the local block copy.
  2. Doing a writeback can only go wrong.
  3. Assume that we finished in zero time between getllar() and putllc().
  4. Then the writeback happened virtually while we read the data.
  5. So do nothing and report back OK.

Thinking about it while writing, it feels more and more simple and plausible ...

In case the SPU did not change any cache line data between getllar() and putllc(),
just do nothing and report success. If we started a writer-locked data transfer
in this situation, the only meaningful outcome would be that the PPU has already
overwritten main memory. This would waste time and resources.
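The shortcut this commit message describes can be sketched as follows (a hypothetical illustration, not the patch itself; `putllc_is_noop` is a made-up name, and `rdata` stands for the getllar() snapshot):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// A 128-byte cache line, the granularity of getllar()/putllc().
using line_t = std::array<std::uint8_t, 128>;

// If the data the SPU wants to write back equals the snapshot taken at
// getllar(), the putllc() can be treated as a successful no-op: skipping
// the locked 128-byte writeback is equivalent to having committed it in
// zero time right after the read.
bool putllc_is_noop(const line_t& rdata, const line_t& to_write)
{
    return std::memcmp(rdata.data(), to_write.data(), 128) == 0;
}
```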
@plappermaul plappermaul changed the title Lucky punch putllc() for non-TSX hosts Lightweight putllc() for non-TSX if no data changed Oct 10, 2019
@plappermaul (Contributor, Author)

This one feels better from all perspectives.

```cpp
res.release(old_time);
}
// No changes, just unlink data
result = 1;
```
Suggested change:

```diff
- result = 1;
+ result = cmp_rdata(rdata, data);
```

@elad335 (Contributor) commented Oct 10, 2019:

You must advance vm::reservation_acquire(raddr, 128) in case of success!

@plappermaul (Contributor, Author) commented:

Could you explain this shortly and shall I use reservation_update() for that?

@elad335 (Contributor) commented Oct 10, 2019:

On success (rtime == vm::reservation_acquire(raddr, 128) && cmp_rdata(rdata, data) == true), the reservation value must be incremented by 128.
result must be set according to failure or success; even if to_write didn't change, you still need to check rdata == data.
You don't need to use reservation_update(); use result = cmp_rdata(rdata, data) && vm::reservation_acquire(raddr, 128).compare_and_swap_test(rtime, rtime + 128);
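The reservation bump elad335 asks for can be illustrated with a plain std::atomic standing in for the reservation-time word that vm::reservation_acquire() returns in rpcs3 (a sketch under that assumption, not the emulator's actual code; `try_commit` is a made-up name):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-line reservation-time word.
std::atomic<std::uint64_t> reservation_time{0};

// The writeback only succeeds if no one touched the reservation since
// getllar() (the time still equals rtime); on success the time advances
// by 128, so other threads observe that the line was committed.
bool try_commit(std::uint64_t rtime)
{
    std::uint64_t expected = rtime;
    return reservation_time.compare_exchange_strong(expected, rtime + 128);
}
```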

@plappermaul (Contributor, Author) commented:

I see.

Regarding cmp_rdata(rdata, data): do we really need this? I'm getting headaches over it.

rdata is volatile and not read under a lock. My assumption is: no other SPU thread can have changed the data, because it would have used the reservation primitives and increased rtime. So only the PPU could have changed it.

In case it did not change: everything is fine.

In case the data changed: if the SPU had finished earlier, no one would have noticed the "no-op". But now we are a little late. Why not just act as if we finished the "no-op" right before the PPU?

@elad335 (Contributor) commented Oct 10, 2019:

It's more than that: think of it also as a "feature" to detect any data changes that occur between GETLLAR and PUTLLC, which is what happens in some games.
This is also called reservation polling: you constantly read with getllar and "write back" until there's a data change.
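Reservation polling as described can be sketched like this (hypothetical, using a plain atomic in place of the real reservation machinery; `poll_until_changed` is a made-up name):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical shared word standing in for reserved memory.
std::atomic<std::uint64_t> shared{0};

// A polling SPU repeatedly snapshots the data (getllar), "writes back"
// the unchanged value (putllc of identical data), and only stops once
// it observes that another thread modified the data.
std::uint64_t poll_until_changed(std::uint64_t initial)
{
    for (;;)
    {
        std::uint64_t seen = shared.load();              // getllar: snapshot
        if (seen != initial)
            return seen;                                 // data changed: stop polling
        std::uint64_t expected = seen;
        shared.compare_exchange_strong(expected, seen);  // putllc of unchanged data
    }
}
```

This is why silently swallowing a putllc() of unchanged data is not always harmless: games rely on the failure/success signal to notice concurrent writes.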

@elad335 (Contributor) commented Oct 10, 2019:

There are also accuracy reasons: the reservation could be lost by a local PUT write to raddr from the same SPU, for example.

@plappermaul (Contributor, Author) commented:

Thanks for the clarification.

Changes requested by elad335
@plappermaul (Contributor, Author)

Patch applied.

@plappermaul (Contributor, Author)

Is anything left to do to get this into master?

@NoUserNameForYou

This person's build (AppVeyor) made my i5 4460 go in-game in GoW: Ascension USA for the first time.

With every other build, including the ZeroX builds, I'd at best get the main menu.

Senior devs, please listen to this person.

```cpp
}
else
{
res.release(old_time);
auto& res = vm::reservation_lock(raddr, 128);
```
A Member commented:

I think it's missing the additional cmp_rdata(rdata, data) check, which was removed.

```cpp
result = 1;
}
}
res.release(old_time + (long long)(result << 7));
```
A Member commented:

Don't use a C-style cast. It could be u64{result << 7}.

A Contributor commented:

There's no need for a cast here; parentheses are enough.

@plappermaul plappermaul closed this Nov 1, 2019
Nekotekina pushed a commit that referenced this pull request Nov 19, 2019
This replaces the totally messed up PR #6728

Some games make heavy use of getllar() & putllc() without even changing the data.
In this case, avoid unnecessary heavy locking of the PPU threads on non-TSX
hosts.