Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block gpu thread #2172

Merged
merged 6 commits into from Apr 6, 2015
Merged

Block gpu thread #2172

merged 6 commits into from Apr 6, 2015

Conversation

degasus
Copy link
Member

@degasus degasus commented Mar 5, 2015

This PR tries to use Common::Event for CPU <-> GPU synchronization. It tries to get the same speed without using busy loops all over the code. So it should run much faster on thermal bottlenecked devices and maybe dual core CPUs.

@degasus degasus force-pushed the block_gpu_thread branch 2 times, most recently from 3ef061e to 4adadef Compare March 5, 2015 20:18
@magumagu
Copy link
Contributor

magumagu commented Mar 5, 2015

I'm not really liking the proliferation of mutexes and condition variables... Fifo.cpp has four mutexes (m_csHWVidOccupied, s_video_buffer_lock, s_fifo_mutex, s_gpu_flush_mutex), and no comments describing what state they actually protect.

Is isGpuReadingData supposed to be atomic?

@@ -58,6 +60,15 @@ static u8* s_video_buffer_pp_read_ptr;
// polls, it's just atomic.
// - The pp_read_ptr is the CPU preprocessing version of the read_ptr.

// events between cpu and gpu
static std::atomic<int> s_gpu_is_running;

This comment was marked as off-topic.

@phire
Copy link
Member

phire commented Mar 5, 2015

I don't think s_gpu_is_running is needed.

Should be able to wait for s_fifo_cond_var directly.

@degasus degasus force-pushed the block_gpu_thread branch 5 times, most recently from e8ba439 to 8b02629 Compare March 7, 2015 16:00
@degasus
Copy link
Member Author

degasus commented Mar 8, 2015

@magumagu I've tried to separate the usage of those cond vars and mutex. If you want I'll just use the same for everything, but imo it will be harder to understand the communication if those variables are used on more than two places.

And yes, isGpuReadingData should be atomic, but on x86, volatile is also working. As it isn't touched in this PR, I'll leave the cleanup for another one. Maybe a merge of Fifo.cpp and CommandProcessor.cpp?

@phire This atomic is used to keep the condition variable out of the hot RunGpu code. As long as this atomic is set, the GPU thread will recheck everything. Only if it's unset, we might require to wakeup the GPU thread.

@degasus
Copy link
Member Author

degasus commented Mar 8, 2015

https://dl.dropboxusercontent.com/u/484730/GPUBlock.jpg from @JMC47

  • The latency of the condition variables (signaling to wakeup) is on average about 4us which is fine.
  • The CPU thread did spend 3% of the real time waiting for the GPU, likely because of the idle skipping syncing hack. This includes all latency issues and so it's almost nothing. But this game only syncs once per frame, others may sync more often.
  • The GPU did get idle about 80 times per frame. The average signaling time is also about 4us, but this sums up to 8% of the real time on the CPU thread. This explains the slowdown from 270 fps in master to 246 fps here.

So in my opinion, all latencies are fine, but we wake up the GPU too often. Do you think we should make a heuristic to not sleep or to not signal as early? The former may be faster, the later may safe more power.

@Sonicadvance1
Copy link
Contributor

Has it been determined how the GPU is affected when you lock it to maximum performance(Max clock speeds)?

@degasus
Copy link
Member Author

degasus commented Mar 8, 2015

@Sonicadvance1 This PR shouldn't affect the GPU itself just the GPU thread. And configuring other CPU priorities didn't affect the latency here :/

@@ -58,6 +60,26 @@ static u8* s_video_buffer_pp_read_ptr;
// polls, it's just atomic.
// - The pp_read_ptr is the CPU preprocessing version of the read_ptr.

// The next three variables are used to wakeup the GPU thread
// - There is a fast-path if the GPU is aready running:
// If the atomic is set, the polling loop assure to execute the fifo.

This comment was marked as off-topic.

@JMC47
Copy link
Contributor

JMC47 commented Mar 11, 2015

This is still up to 25% slower in Super Smash Bros. Melee.

static std::mutex s_fifo_mutex;
static std::condition_variable s_fifo_cond_var;

// The next two variables and "isGpuReadingData" are used to block the CPU thread until the GPU is idlying

This comment was marked as off-topic.

@degasus degasus force-pushed the block_gpu_thread branch 4 times, most recently from a24761d to 44597bc Compare March 13, 2015 21:53
@degasus
Copy link
Member Author

degasus commented Mar 13, 2015

Rewritten with events. The logic should be much simpler now.

@JMC47
Copy link
Contributor

JMC47 commented Mar 13, 2015

Latest build is actually faster for me than master.

On Rogue Squadron 2's opening, 76 - 78 fps on Master, 78 - 81 FPS on this build. Also faster in Melee, 400 fps in master to 420 fps on Dreamland in this Pull Request.

CPU usage is down about 25% in almost all games.

Tested on two Intel Quadcore + NVIDIA machines.

@degasus
Copy link
Member Author

degasus commented Mar 13, 2015

I've choosen to limit the amond of wait() calls. Now the GPU thread is only allowed to sleep once per milli second of emulated time. So we may spend some time in the busy loop, but we'll fall to sleep quite soon on idle skipping.

Ready for reviewing and merging in my option

@degasus
Copy link
Member Author

degasus commented Mar 14, 2015

Updated with a fix for syncgpu and deterministic dual core

@degasus degasus force-pushed the block_gpu_thread branch 5 times, most recently from f3dd175 to ffcbaaa Compare March 15, 2015 17:26
@Ofunniku
Copy link

My Xenoblade save that runs at 17-18 fps on master is 20-21 FPS on this build.

Tested on an Athlon II X2 250

@degasus degasus force-pushed the block_gpu_thread branch 2 times, most recently from b41e791 to bc38f7b Compare March 29, 2015 13:23
@Ofunniku
Copy link

Same save and settings profile as above for Xenoblade runs at 29-31 on this latest build PR build.

Tested on an Athlon II X2 250

EDIT: was this jump from the merge of PR-2192?

@degasus
Copy link
Member Author

degasus commented Mar 30, 2015

@Ofunniku yes, so we also need master to compare the performance of this PR

@Ofunniku
Copy link

For Xenoblade:
4.0-5762: 17-18
This PR (Old): 20-21
4.0-5952: 29-31
This PR (2015-03-29): 29-31

Although I do have some interesting results from SSBM (FoD, 4CPU, Link, Ness, Ice Climbers, Capt. Falcon):
4.0-5762: 43-45
This PR (Old): 53-54
4.0-5952: 55-59
This PR (2015-03-29): 57-63

@Ofunniku
Copy link

I did this to compare CPU usage between builds, this was run with [Framelimit: Auto]

Xenoblade, Bionis Leg (CPU usage)
4.0-5952: CPU1 ~68% CPU2 98%
PR-2172 (2015-03-29): CPU1 59% CPU2 48%

SSBM (FoD, 4CPU, Link, Ness, Ice Climbers, Capt. Falcon) (CPU usage)
4.0-5952: CPU1 100% CPU2 100%
PR-2172 (2015-03-29): CPU1 ~83% CPU2 ~83%

@JMC47
Copy link
Contributor

JMC47 commented Apr 3, 2015

I can't find any new slowdown in New Super Mario Bros. Wii. with EFB2RAM.

6x IR Master - 43 fps
6x IR PR2172 - 43 fps

@Ofunniku
Copy link

Ofunniku commented Apr 3, 2015

Benchmarks I've gathered so far: http://pastebin.com/vvbPX2tx
Win7, Athlon II x2 250, Radeon HD 6670

@@ -51,6 +52,7 @@ void AsyncRequests::PushEvent(const AsyncRequests::Event& event, bool blocking)

if (blocking)
{
RunGpu();

This comment was marked as off-topic.

@phire
Copy link
Member

phire commented Apr 6, 2015

LGTM

degasus added a commit that referenced this pull request Apr 6, 2015
@degasus degasus merged commit 4669b50 into dolphin-emu:master Apr 6, 2015
@dolphin-emu-bot
Copy link
Contributor

FifoCI detected that this change impacts graphical rendering. Here are the behavior differences detected by the system:

  • rs2-glass on ogl-lin-nv: diff

automated-fifoci-reporter

@Sonicadvance1
Copy link
Contributor

Breaks AArch64.

Sonicadvance1 added a commit to Sonicadvance1/dolphin that referenced this pull request May 11, 2015
Causes flickering in games ever since PR dolphin-emu#2172.
No idea why
@degasus degasus deleted the block_gpu_thread branch August 9, 2015 08:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
8 participants