Lightweight putllc() for non-TSX if no data changed #6728

Closed · wants to merge 6 commits into from

Conversation

plappermaul (Contributor)

Another one for discussion ... Again developed on Linux.

I noticed heavy contention in SPU threads because of excessive use
of vm::writer_lock(). It boils down to the fact that some games make
heavy use of getllar() & putllc(), and my machine has no TSX. This function
pair reads/writes 128 bytes between main memory and SPU local
memory, and the writeback is wrapped in defensive locking logic.

This patch tries to mitigate the situation with a workaround.
To recap the workflow:

  1. getllar() links to a full cache line (128 bytes) of shared data
  2. The data is copied into local SPU memory, where it can be modified
  3. Some data inside the block is modified
  4. putllc() writes the data back if main memory has not changed

When executing step 4 we can distinguish 3 cases:

a. No data has been changed: why would we want to write it back?

b. Only one of the 8 16-byte data blocks has been changed. In this case
we can use cmpxchg16b to realize a conditional atomic writeback without
a lock. This is the lucky punch.

c. Multiple 16-byte data blocks have been changed. For that we fall
back to the current heavyweight implementation.
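The three-way classification above can be sketched in plain C++ (a hypothetical illustration, not the actual patch code; `classify_changes` is a made-up name and `rdata` stands for the getllar() snapshot). It scans the 128-byte line in 16-byte blocks and reports whether zero, one, or multiple blocks differ:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// A 128-byte cache line, the granularity of getllar()/putllc().
using line_t = std::array<std::uint8_t, 128>;

// Returns -1 if no 16-byte block changed, the block index (0..7) if
// exactly one block changed, or 8 if more than one block changed.
int classify_changes(const line_t& rdata, const line_t& to_write)
{
    int changed = -1;
    for (int i = 0; i < 8; i++)
    {
        if (std::memcmp(rdata.data() + i * 16, to_write.data() + i * 16, 16) != 0)
        {
            if (changed != -1)
                return 8; // multiple blocks changed: heavyweight fallback (case c)
            changed = i;  // remember the single changed block (case b candidate)
        }
    }
    return changed; // -1: nothing to write back (case a)
}
```

On the single-block result (0..7) the patch can then attempt a lock-free cmpxchg16b on just that block; the other results take the no-op path or the locked fallback.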

For a successful patch it is essential that paths a/b are taken much more
often than path c. Some numbers:

GoW 3 menu screen:

  • No changes: 94.7%
  • Single 16-byte block changes: 4.9%
  • Multi-block changes: 0.3%

Bomberman Ultra menu screen:

  • No changes: 27.6%
  • Single 16-byte block changes: 60.8%
  • Multi-block changes: 11.6%

REMARK! Of course there might be several caveats, but I want to outline a
tricky one about the implementation. To enter path b the patch does not
check for changes to main memory, only to the linked backup copy of the
cache line. That leaves a special case not covered so far: what if we
discover that local block 1 has changed and write the data back while
another thread simultaneously changes block 8 of the same shared data?
From a noob & performance perspective I would say that this does not
matter.

Attached you will find a debugging-enabled patch that writes some numbers
into the error log. These are:

cnt:     number of putllc() calls
success: number of single 16-bytes block writebacks
zero:    number of suppressed writebacks
multi:   number of full 128 bytes writebacks
diff:    number of changes to main memory between getllar() and putllc()
???:     number of writebacks where another block in the cache line changed

With the patch I have no problems getting in-game in GoW 3. Performance
on my 2-core machine is too awful (1-2 fps) to give a good before/after
comparison. Maybe someone has time to check the PR and report numbers back.

elad335 commented Oct 8, 2019

It's not atomic, and probably has regressions.

RipleyTom commented Oct 8, 2019

> What if we discover that local block 1 has changed and write the data back
> while another thread simultaneously changes block 8 of the same shared data?
> From a noob & performance perspective I would say that this does not matter.

It actually does matter, as the whole cache line is supposed to be atomic.
There is something to the idea of skipping the lock if the data is the same, but apparently elad already investigated the idea and it wasn't worth it.

The first commit was overeager in reducing the putllc() writebacks. It might
work, but it breaks atomicity. The attached fixes compare all three data
structures (main memory, local copy, new data). If they are equal, we have
nothing to do and just report success to the calling SPU thread.

Waiting in the GoW 3 main menu we can avoid 95% of the locking inside putllc();
in Bomberman Ultra we save 30%. Some perf numbers from one SPU thread in GoW 3:

before:

```
40,50%   0,00%  SPU[0x2000004]   [unknown]    [.] 0000000000000000
23,46%   0,60%  SPU[0x2000004]   rpcs3        [.] spu_thread::set_ch_value
16,83%   0,00%  SPU[0x2000004]   [unknown]    [.] 0x8045013080450130
13,63%  13,63%  SPU[0x2000004]   rpcs3        [.] vm::writer_lock::writer_lock
 7,52%   7,46%  SPU[0x2000004]   rpcs3        [.] spu_thread::do_dma_transfer
 4,80%   4,72%  SPU[0x2000004]   rpcs3        [.] spu_thread::process_mfc_cmd
```

after:

```
49,92%   0,00%  SPU[0x2000002]   [unknown]    [.] 0x8045013080450130
22,45%   0,00%  SPU[0x2000002]   [unknown]    [.] 0000000000000000
15,98%   1,13%  SPU[0x2000002]   rpcs3        [.] spu_thread::set_ch_value
 9,95%   9,95%  SPU[0x2000002]   rpcs3        [.] vm::writer_lock::writer_lock
 5,98%   5,94%  SPU[0x2000002]   rpcs3        [.] spu_thread::do_dma_transfer
 3,70%   3,61%  SPU[0x2000002]   rpcs3        [.] spu_thread::set_interrupt_status
 3,11%   3,11%  SPU[0x2000002]   [JIT] 6266   [.] 0x00007fa71d03b075
 2,97%   2,91%  SPU[0x2000002]   rpcs3        [.] spu_thread::process_mfc_cmd
```
@plappermaul (Contributor, Author)

Looking at the massive unnecessary putllc() overhead in GoW 3, I gave a second version of the patch a try. Awaiting your feedback.

@elad335 (Contributor) left a comment

It's better now, but you still need to advance rtime on success.

@plappermaul (Contributor, Author)

I will check it tomorrow. Btw., what about the following simplification:

```cpp
...
else if (auto& data = vm::_ref<decltype(rdata)>(addr); rtime == (vm::reservation_acquire(raddr, 128) & -128) && !cmp_rdata(rdata, to_write))
...
```

This would be close to the old implementation, but we only compare stable local data. The reasoning behind it would be:

  1. We did not change anything inside the local block copy.
  2. Doing a writeback can only go wrong.
  3. Assume that we finished in zero time between getllar() and putllc().
  4. Then the writeback happened virtually while we read the data.
  5. So do nothing and report back OK.

Thinking about it while writing, it feels more and more simple and plausible ...

In case the SPU did not change any cache line data between getllar() and putllc(),
just do nothing and report success. If we started a writer-locked data transfer
in this situation, the only meaningful outcome would be that the PPU has already
overwritten main memory. This would waste time and resources.
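The shortcut this commit message describes can be sketched as follows (a hypothetical illustration, not the patch itself; `putllc_is_noop` is a made-up name, and `rdata` stands for the getllar() snapshot):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// A 128-byte cache line, the granularity of getllar()/putllc().
using line_t = std::array<std::uint8_t, 128>;

// If the data the SPU wants to write back equals the snapshot taken at
// getllar(), the putllc() can be treated as a successful no-op: skipping
// the locked 128-byte writeback is equivalent to having committed it in
// zero time right after the read.
bool putllc_is_noop(const line_t& rdata, const line_t& to_write)
{
    return std::memcmp(rdata.data(), to_write.data(), 128) == 0;
}
```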
@plappermaul plappermaul changed the title Lucky punch putllc() for non-TSX hosts Lightweight putllc() for non-TSX if no data changed Oct 10, 2019
@plappermaul (Contributor, Author)

This one feels better from all perspectives.

```cpp
res.release(old_time);
}
// No changes, just unlink data
result = 1;
```
Suggested change:

```diff
- result = 1;
+ result = cmp_rdata(rdata, data);
```

@elad335 (Contributor) commented Oct 10, 2019:

You must advance vm::reservation_acquire(raddr, 128) in case of success!

@plappermaul (Contributor, Author) commented:

Could you explain this shortly and shall I use reservation_update() for that?

@elad335 (Contributor) commented Oct 10, 2019:

On success (rtime == vm::reservation_acquire(raddr, 128) && cmp_rdata(rdata, data) == true), the reservation value must be incremented by 128.
result must be set according to failure or success; even if to_write didn't change, you still need to check rdata == data.
You don't need to use reservation_update(); use result = cmp_rdata(rdata, data) && vm::reservation_acquire(raddr, 128).compare_and_swap_test(rtime, rtime + 128);
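The reservation bump elad335 asks for can be illustrated with a plain std::atomic standing in for the reservation-time word that vm::reservation_acquire() returns in rpcs3 (a sketch under that assumption, not the emulator's actual code; `try_commit` is a made-up name):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-line reservation-time word.
std::atomic<std::uint64_t> reservation_time{0};

// The writeback only succeeds if no one touched the reservation since
// getllar() (the time still equals rtime); on success the time advances
// by 128, so other threads observe that the line was committed.
bool try_commit(std::uint64_t rtime)
{
    std::uint64_t expected = rtime;
    return reservation_time.compare_exchange_strong(expected, rtime + 128);
}
```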

@plappermaul (Contributor, Author) commented:

I see.

Regarding cmp_rdata(rdata, data): do we really need this? I'm getting headaches over it.

rdata is volatile and not read under a lock. My assumption is: no other SPU thread can have changed the data, because it would have used the reservation primitives and increased rtime. So only the PPU could have changed it.

In case it did not change: everything is fine.

In case the data changed: if the SPU had finished earlier, no one would have noticed the "no-op". But now we are a little late. Why not just act as if we finished the "no-op" right before the PPU?

@elad335 (Contributor) commented Oct 10, 2019:

It's more than that: think of it also as a "feature" to detect any data changes that occur between GETLLAR and PUTLLC, which is what happens in some games.
This is also called reservation polling: you constantly read with getllar and "write back" until there's a data change.
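Reservation polling as described can be sketched like this (hypothetical, using a plain atomic in place of the real reservation machinery; `poll_until_changed` is a made-up name):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical shared word standing in for reserved memory.
std::atomic<std::uint64_t> shared{0};

// A polling SPU repeatedly snapshots the data (getllar), "writes back"
// the unchanged value (putllc of identical data), and only stops once
// it observes that another thread modified the data.
std::uint64_t poll_until_changed(std::uint64_t initial)
{
    for (;;)
    {
        std::uint64_t seen = shared.load();              // getllar: snapshot
        if (seen != initial)
            return seen;                                 // data changed: stop polling
        std::uint64_t expected = seen;
        shared.compare_exchange_strong(expected, seen);  // putllc of unchanged data
    }
}
```

This is why silently swallowing a putllc() of unchanged data is not always harmless: games rely on the failure/success signal to notice concurrent writes.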

@elad335 (Contributor) commented Oct 10, 2019:

There are also accuracy reasons: the reservation could be lost by a local PUT write to raddr from the same SPU, for example.

@plappermaul (Contributor, Author) commented:

Thanks for the clarification.

Changes requested by elad335
@plappermaul (Contributor, Author)

Patch applied.

@plappermaul (Contributor, Author)

Is anything left to do to get this into master?

@NoUserNameForYou

This person's build (AppVeyor) made my i5 4460 go in-game in GoW: Ascension USA for the first time.

With every other build, including the ZeroX builds, I'd at best get the main menu.

Senior devs, please listen to this person.

```cpp
}
else
{
res.release(old_time);
auto& res = vm::reservation_lock(raddr, 128);
```
A Member commented:

I think it's missing the additional cmp_rdata(rdata, data) check, which was removed.

```cpp
result = 1;
}
}
res.release(old_time + (long long)(result << 7));
```
A Member commented:

Don't use a C-style cast. It could be u64{result << 7}.

A Contributor commented:

There's no need for a cast here; parentheses are enough.

@plappermaul plappermaul closed this Nov 1, 2019
Nekotekina pushed a commit that referenced this pull request Nov 19, 2019
This replaces the totally messed up PR #6728

Some games make heavy use of getllar() & putllc() without even changing the data.
In this case, avoid unnecessary heavy locking of the PPU threads on non-TSX
hosts.