-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lightweight putllc() for non-TSX if no data changed #6728
Conversation
Another one for discussion ... I noticed heavy contention in SPU threads because of excessive usage of vm::writer_lock(). It boils down to the fact that some games make heavy use of getllar() & putllc() and my machine has no TSX. The function pair is reading/writing 128 bytes between main memory and SPU local memory and a writeback is encapsulated into defensive locking logic. This patch tries to mitigate the situation by establishing a workaround. To recover the workflow: 1. getllar() links to a full cache line (128 bytes) of shared data 2. Data is copied into local SPU memory and can be modified 3. Some data inside block is modified 4. putllc() writes the data back if main memory has not changed When executing step 4 we can distinguish 3 cases a. No data has been changed: Why would we want to write it back? b. Only one of the 8 16-bytes data blocks has been changed. In this case we can use cmpxchg16b to realize an conditional atomic writeback without a lock. This is the lucky punch. c. Multiple 16-bytes data blocks have been changed. For that we can fall back to the current heavyweight implementation For a successful patch it is essential that path a/b is in use much more often than path c. Some numbers: GoW 3 menu screen: - No changes: 94,7% - Single 16-bytes block changes: 4,9% - Multi block changes: 0,3% Bomberman Ultra menu screen: - No changes: 27,6% - Single 16-bytes block changes: 60,8% - Multi block changes: 11,6% REMARK! Of course there might be several caveats but I want to outline a tricky one about the implementation. To enter path b the patch does not check changes to main memory but only to the linked backup data of the cache line. That gives a special case not covered by now. What if we discover that local block 1 has changed and write the data back while another thread simultanously changes block 8 of the same shared data? From a noob & performance perspective I would say that this does not matter. Attached you will find a debugging enabled patch that writes some numbers into the error log. These are: cnt: number of putllc() calls success: number of single 16-bytes block writebacks zero: number of suppressed writebacks multi: number of full 128 bytes writebacks diff: number of changes to main memory between getllar() and putllc() ???: number of writebacks where another block in the cache line changed With the patch I have no problems going ingame into GoW 3. Performance on my 2 core machine is too awful to give a good before/after comparison (1-2fps). Maybe someone has time to check the PR and report numbers back.
It's not atomic, and probably has regressions. |
It actually does matter as the whole cache line is supposed to be atomic. |
The first commit is overeager to reduce the putllc() writebacks. It might work but breaks atomicity. The attached fixes will compare all three data structures (main memory, local copy, new data). If they are equal we have nothing to do and will just report success to the calling SPU thread. Waiting in GoW 3 main menu we can avoid 95% of locking inside putllc(), in Bomberman Ultra we save 30%. Some perf numbers from on SPU thread in GoW 3: before: 40,50% 0,00% SPU[0x2000004] [unknown] [.] 0000000000000000 23,46% 0,60% SPU[0x2000004] rpcs3 [.] spu_thread::set_ch_value 16,83% 0,00% SPU[0x2000004] [unknown] [.] 0x8045013080450130 13,63% 13,63% SPU[0x2000004] rpcs3 [.] vm::writer_lock::writer_lock 7,52% 7,46% SPU[0x2000004] rpcs3 [.] spu_thread::do_dma_transfer 4,80% 4,72% SPU[0x2000004] rpcs3 [.] spu_thread::process_mfc_cmd after: 49,92% 0,00% SPU[0x2000002] [unknown] [.] 0x8045013080450130 22,45% 0,00% SPU[0x2000002] [unknown] [.] 0000000000000000 15,98% 1,13% SPU[0x2000002] rpcs3 [.] spu_thread::set_ch_value 9,95% 9,95% SPU[0x2000002] rpcs3 [.] vm::writer_lock::writer_lock 5,98% 5,94% SPU[0x2000002] rpcs3 [.] spu_thread::do_dma_transfer 3,70% 3,61% SPU[0x2000002] rpcs3 [.] spu_thread::set_interrupt_status 3,11% 3,11% SPU[0x2000002] [JIT] 6266 [.] 0x00007fa71d03b075 2,97% 2,91% SPU[0x2000002] rpcs3 [.] spu_thread::process_mfc_cmd
Looking at the massive unnecessary putllc() overhead in GoW 3 I gave a second version of the patch a try. Expecting your feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better now, but you still need to advance rtime on success.
I will check it tomorrow. Btw. what about the following simplification: ... This would be close to the old implementation but we only compare stable local data. The argumentation behind that would be:
Thinking about it while writing feels more and more simple & plausible ... |
This reverts commit 1fdc29a.
This reverts commit 3eca8f9.
In case SPU did not change any cacheline data between getllar() and putllc() just do nothing and report success. If we start a writer locked data transfer in this situation the only meaningful outcome would be that PPU has already overwritten main memory. This would be wasted time and resources.
This one feels better from all perspectives. |
rpcs3/Emu/Cell/SPUThread.cpp
Outdated
res.release(old_time); | ||
} | ||
// No changes, just unlink data | ||
result = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result = 1; | |
result = cmp_rdata(rdata, data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You must advance vm::reservation_acquire(raddr, 128)
in case of success!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain this shortly and shall I use reservation_update() for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On success (rtime == vm::reservation_acquire(raddr, 128) && cm_rdata(rdata, data) == true), the reservation value must be incremented by 128.
result
must be set accordingly to failure or success, even if to_write didn't change you still need to check rdata == data.
You don't need to use reservation_update().
use result = cmp_rdata(rdata, data) && vm::reservation_acquire(raddr, 128).compare_and_swap_test(rtime, rtime + 128);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
Regarding cmp_rdata(rdata,data): Do we really need this? I'm getting headaches over this.
rdata is volatile and not read under a lock. My assumption is: No other SPU thread can have changed the data. They would have used reservation primitives and increased rtime. So only PPU could have changed it.
In case it did not change: Everything is fine.
In case data changed: If SPU had finished earlier no one had noticed the "no-op". But now we are a little late. Why not just do as we finished the "no-op" right before the PPU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more than that, think of it as also a "feature" to detect any data changes that occur between GETLLAR and PUTLLC, which is what happens in some games.
It's also called reservation polling, when you constantly read from getllar and "write back" until there's a data change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also has accuracy reasons, reservation could be lost by a local PUT write to raddr from the same SPU for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification.
Changes requested by elad335
Patch applied. |
Anything left, to get this into master? |
This person build (appveyor) made my i5 4460 to go ingame in GoW: Ascension USA for the first time. With every other build, inculding the ZeroX builds I'd at best get the main menu. Senior devs, please listen to this person. |
} | ||
else | ||
{ | ||
res.release(old_time); | ||
auto& res = vm::reservation_lock(raddr, 128); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's missing additional cmp_rdata(rdata, data)
check which is removed.
result = 1; | ||
} | ||
} | ||
res.release(old_time + (long long)(result << 7)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use C style cast. It could be u64{result << 7}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need for a cast here.. just parenthesses are enough.
This replaces the totally messed up PR #6728 Some games make heavy use of getllar() & putllc() without even changing data. In this case avoid unneccesary heavy locking of the PPU threads on non-TSX hosts.
Another one for discussion ... Again developed on Linux.
I noticed heavy contention in SPU threads because of excessive usage
of vm::writer_lock(). It boils down to the fact that some games make
heavy use of getllar() & putllc() and my machine has no TSX. The function
pair is reading/writing 128 bytes between main memory and SPU local
memory and a writeback is encapsulated into defensive locking logic.
This patch tries to mitigate the situation by establishing a workaround.
To recover the workflow:
When executing step 4 we can distinguish 3 cases
a. No data has been changed: Why would we want to write it back?
b. Only one of the 8 16-bytes data blocks has been changed. In this case
we can use cmpxchg16b to realize an conditional atomic writeback without
a lock. This is the lucky punch.
c. Multiple 16-bytes data blocks have been changed. For that we can
fall back to the current heavyweight implementation
For a successful patch it is essential that path a/b is in use much more
often than path c. Some numbers:
GoW 3 menu screen:
Bomberman Ultra menu screen:
REMARK! Of course there might be several caveats but I want to outline a
tricky one about the implementation. To enter path b the patch does not
check changes to main memory but only to the linked backup data of the
cache line. That gives a special case not covered by now. What if we
discover that local block 1 has changed and write the data back while
another thread simultanously changes block 8 of the same shared data?
From a noob & performance perspective I would say that this does not
matter.
Attached you will find a debugging enabled patch that writes some numbers
into the error log. These are:
With the patch I have no problems going ingame into GoW 3. Performance
on my 2 core machine is too awful to give a good before/after comparison
(1-2fps). Maybe someone has time to check the PR and report numbers back.