Use Linux timers for sleeps up to 1ms #6697

Merged
merged 2 commits into RPCS3:master on Oct 9, 2019
Conversation

@plappermaul (Contributor)

This patch is for review/discussion. I do not know if I got everything in the
right place.

The current sleep timer implementation basically offers two variants: either
wait the specified time with a condition variable alone (the host setting), or
combine it with a thread-yielding busy loop afterwards (the usleep setting).

While the second one is very precise, it burns CPU cycles for each wait call
below 50us. Games like Bomberman Ultra spam 30us waits, and the emulator hogs
low-power CPUs. Switching to host mode reduces CPU consumption but adds a
~50us penalty to each wait call, extending all these sleeps by a factor of
more than two.

The following patch tries to improve the system timer for Linux by using
native Linux timers (timerfd) for small wait calls below 1ms. This has two effects:

  • Host wait setting has much less wait overhead
  • usleep wait setting produces lower CPU overhead
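
The idea, as a minimal standalone sketch (illustration only, not the actual rpcs3 code; the helper name and the 1ms cut-off are my own for this example): arm a one-shot timerfd for the requested interval and block in read() until it expires.

// Sketch of a timerfd-based short sleep (not the rpcs3 implementation).
#include <sys/timerfd.h>
#include <unistd.h>
#include <cstdint>

bool short_sleep_us(std::uint64_t usec)
{
    // Only worth it for sub-millisecond waits; longer waits keep the old path.
    if (usec == 0 || usec > 1000)
        return false;

    const int fd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (fd == -1)
        return false;

    // One-shot timer: it_interval stays zero, it_value holds the delay.
    itimerspec spec{};
    spec.it_value.tv_nsec = static_cast<long>(usec * 1000);

    bool ok = false;
    if (timerfd_settime(fd, 0, &spec, nullptr) == 0)
    {
        // read() blocks until the timer expires and returns the expiration count.
        std::uint64_t expirations = 0;
        ok = read(fd, &expirations, sizeof(expirations)) > 0;
    }

    close(fd);
    return ok;
}

In the actual patch the timer fd is created once per thread (m_timer in Utilities/Thread.cpp) rather than per call; the sketch only shows the syscall sequence.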

Some numbers for the host timer setting, from my tests on a Pentium G5600 with
UHD 630 graphics, waiting in the Bomberman welcome screen. I shortened/lengthened
the game timer inside the emulator to get a better picture for different wait
lengths. As you can see, the current implementation always produces a ~50us
overhead while the new implementation mostly stays below 10us. The us(er),
sy(stem), and id(le) columns were taken from vmstat during the tests.

sleeps of 70usec
              Calls   >=120us  <120us   <95us   <80us   <73us  us  sy  id
Master run 1: 1000000  708599  144933  114607   27954    3906  44  12  15
Master run 2: 1000000  707853  145802  114613   27757    3975  45  12  43
Patch  run 1: 1000000   24478   37779  122771  679292  135679  46  13  41
Patch  run 2: 1000000   27544   38647  120150  676306  137353  45  13  42

sleeps of 60usec
              Calls   >=110us  <110us   <85us   <70us   <63us  us  sy  id
Master run 1: 1000000  695187  167665  107111   26767    3269  42  11  47
Master run 2: 1000000  698397  166151  106322   25889    3241  42  11  46
Patch  run 1: 1000000   23266   36454  131397  651232  157650  44  12  44
Patch  run 2: 1000000   27780   41361  141313  636585  152961  45  12  42

sleeps of 50usec
              Calls   >=100us  <100us   <75us   <60us   <53us  us  sy  id
Master run 1: 1000000  690729  183766   97207   25160    3137  43  12  46
Master run 2: 1000000  689518  184570   97716   25131    3065  42  11  47
Patch  run 1: 1000000   21068   34504  124814  646399  173214  45  13  42
Patch  run 2: 1000000   22531   36852  130585  638397  171635  44  12  44

sleeps of 40usec
              Calls    >=90us   <90us   <65us   <50us   <43us  us  sy  id
Master run 1: 1000000  688084  176572  111680   20357    3306  45  12  44
Master run 2: 1000000  687553  177216  111599   20409    3223  46  12  42
Patch  run 1: 1000000   18164   31248  113778  643851  192958  44  12  44
Patch  run 2: 1000000   20985   34841  120508  633031  190635  45  12  43

sleeps of 30usec
              Calls    >=80us   <80us   <55us   <40us   <33us  us  sy  id
Master run 1: 1000000  721705  205084   60793   12060     357  44  12  45
Master run 2: 1000000  720323  205960   61524   11884     309  43  11  46
Patch  run 1: 1000000   15139  16863   101604  629094  227299  44  12  44
Patch  run 2: 1000000   18560  30207   110159  617093  223981  45  12  43

sleeps of 20usec
              Calls    >=70us   <70us   <45us   <30us   <23us  us  sy  id
Master run 1: 1000000  813648  144746   36458    5111      36  43  12  45
Master run 2: 1000000  813322  144917   36618    5097      46  45  12  43
Patch  run 1: 1000000   14073   23076   83921  635412  243517  45  13  42
Patch  run 2: 1000000   13769   23460   86245  632826  243700  44  13  43

sleeps of 10usec
              Calls    >=60us   <60us   <35us   <20us   <13us  us  sy  id
Master run 1: 1000000  864216  101101   29002    5651      29  43  12  45
Master run 2: 1000000  864896  100595   28941    5550      18  42  11  47
Patch  run 1: 1000000    7613   13301   52335  640861  285889  46  13  41
Patch  run 2: 1000000    7223   13280   52123  644643  282731  47  13  40

Comparison between host and usleep setting for game defaults of 30us waits

                   fps  us  sy  id
Master run host  :  53  43  11  46
Patch  run host  :  52  44  12  44
Master run usleep:  49  51  18  31
Patch  run usleep:  51  48  15  37

@kd-11 (Contributor) commented Oct 3, 2019

As with any good timer patch for rpcs3, please include comparison results with this reference testcase.

@hardBSDk commented Oct 3, 2019

What about the BSDs? Could this break builds on other Unices that don't support Linux timers?

@Nekotekina (Member)

@hardBSDk No, it can't break anything.
The PR is incomplete since there is no way to signal the thread to wake up immediately while it sleeps on the timer. I'll take a look at its API.

@hardBSDk commented Oct 4, 2019

@Nekotekina Thanks! I will run RPCS3 for the first time since I got a computer.

Love your work @kd-11 @Nekotekina

@plappermaul (Contributor, Author)

@kd-11 Thanks for the testcase. I will give it a try and report the results back.

@Nekotekina You are right, timerfd does not support signalling; the thread will block for the duration of the sleep. For that reason I only implemented waits for small intervals. I'm totally fine if we reduce the maximum sleep time from 1000us to, let's say, 250us or even down to 100us. This is essentially the time interval the optimization aims for.

Utilities/Thread.cpp: 3 outdated review comments (resolved)
@plappermaul (Contributor, Author)

As with any good timer patch for rpcs3, please include comparison results with this reference testcase.

Here are the test results:

Host

Master                   | Patch
-------------------------+ -------------------------
Application started      | Application started
Calc baseline ...(0us)   | Calc baseline ...(0us)
Testing usleep(600)...   | Testing usleep(600)...
    Latency = 691us      |     Latency = 632us
Testing usleep(610)...   | Testing usleep(610)...
    Latency = 699us      |     Latency = 645us
Testing usleep(650)...   | Testing usleep(650)...
    Latency = 737us      |     Latency = 680us
Testing usleep(700)...   | Testing usleep(700)...
    Latency = 788us      |     Latency = 732us
Testing usleep(800)...   | Testing usleep(800)...
    Latency = 887us      |     Latency = 835us
Testing usleep(1000)...  | Testing usleep(1000)...
    Latency = 1092us     |     Latency = 1032us
Testing usleep(1355)...  | Testing usleep(1355)...
    Latency = 1448us     |     Latency = 1440us
Application finished     |

usleep

Master                   | Patch
-------------------------+ -------------------------
Application started      | Application started
Calc baseline ...(0us)   | Calc baseline ...(0us)
Testing usleep(600)...   | Testing usleep(600)...
    Latency = 639us      |     Latency = 613us
Testing usleep(610)...   | Testing usleep(610)...
    Latency = 649us      |     Latency = 630us
Testing usleep(650)...   | Testing usleep(650)...
    Latency = 686us      |     Latency = 667us
Testing usleep(700)...   | Testing usleep(700)...
    Latency = 738us      |     Latency = 716us
Testing usleep(800)...   | Testing usleep(800)...
    Latency = 838us      |     Latency = 823us
Testing usleep(1000)...  | Testing usleep(1000)...
    Latency = 1032us     |     Latency = 1013us
Testing usleep(1355)...  | Testing usleep(1355)...
    Latency = 1389us     |     Latency = 1413us
Application finished     | Application finished

Btw, I got a mail notification but did not find anything here ... Someone mentioned that the patch must increase the PPU cache version. But where?

@Leopard1907

Elad made that comment but deleted it afterwards.

@kd-11 (Contributor) commented Oct 4, 2019

Sorry, looks like I forgot to attach the base test (0-600)
Here you go
timer.zip
This test complements the other one: this one covers 0-600us and the other measures 600-1000us.
It looks like it should be ok judging from mine and another tester's results, but it's still good to document the data.

@Whatcookie (Member) commented Oct 4, 2019

Seems that at idle the accuracy is approx. 16us, but if you switch to the performance governor the accuracy is 10us. Since the load of RPCS3 will push any system to its maximum power state, it makes sense to test the accuracy at maximum power state as well.

So I suggest simply setting the min quantum to 10us rather than 16. This will also have the benefit of "fixing" the behavior when the accuracy is set to usleep for values of 30us, since
thread_ctrl::wait_for(remaining - ((remaining % host_min_quantum) + host_min_quantum));
would evaluate to 0 for 30us values, causing us to simply spin and not take advantage of timerfd.
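
To spell that arithmetic out (a standalone illustration of the quoted expression; host_min_quantum here is just a local value, not the real constant):

#include <cstdint>
#include <cstdio>

int main()
{
    const std::uint64_t remaining = 30; // the 30us waits Bomberman spams

    for (std::uint64_t host_min_quantum : {16, 10})
    {
        // Same rounding as the expression quoted above.
        const std::uint64_t timer_part =
            remaining - ((remaining % host_min_quantum) + host_min_quantum);

        // quantum 16: 30 - (14 + 16) = 0  -> wait_for(0), pure spinning
        // quantum 10: 30 - (0  + 10) = 20 -> timerfd covers 20us, ~10us spun
        std::printf("quantum %llu -> wait_for(%llu)\n",
                    static_cast<unsigned long long>(host_min_quantum),
                    static_cast<unsigned long long>(timer_part));
    }
    return 0;
}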

@plappermaul (Contributor, Author)

I initially started with a 15us quantum but felt that 16us is a little more accurate. As requested, here are all the numbers. I ran the test with the 16us/50us quantums from the last commit.

Host idle

           host   host usleep usleep
 usleep   patch master  patch master
      0       0      0      0      0   
      1       3     52      2      2
     10      14     63     11     11
     50      55    104     51     51
    100     109    155    101    105
    200     220    276    205    221
    300     329    377    308    328
    400     428    486    418    435
    500     533    583    519    539
special
    250     271    328    258    279
    280     305    360    290    287
    290     319    367    303    294
    299     326    379    307    302
    301     330    380    328    332    
    310     337    393    323    332
    320     346    403    336    335
    600     643    686    616    637
long
    600     632    686    612   634
    610     644    697    629   640
    650     683    737    667   687
    700     732    787    717   738
    800     835    887    820   838
   1000    1032   1091   1013  1031
   1355    1439   1445   1386  1389

Single thread load of dd if=/dev/zero of=/dev/null

           host   host usleep usleep
 usleep   patch master  patch master
      0      0       0      0      0 
      1      3      53      2      2
     10     14      63     11     11
     50     55     100     52     51
    100    104     151    101    104
    200    209     258    202    212
    300    313     356    302    312
    400    414     461    406    413
    500    512     563    504    515
special
    250    260     315    251    262
    280    292     340    282    281
    290    302     351    293    291
    299    312     359    300    300
    301    314     358    302    313
    310    321     369    313    319
    320    331     382    324    324
    600    613     666    604    615
long
    600    615     664    604    613
    610    622     673    615    619
    650    670     710    653    667
    700    719     763    702    715
    800    814     863    805    813
   1000   1010    1063   1003   1012
   1355   1416    1418   1365   1366

@Whatcookie (Member)

Looks like the timer drifts a lot more on your computer than mine.

host, this PR
Application started
Calculating baseline delay...(0us)
Testing usleep(0)...
    Latency = 0us
Testing usleep(1)...
    Latency = 11us
Testing usleep(10)...
    Latency = 12us
Testing usleep(50)...
    Latency = 62us
Testing usleep(100)...
    Latency = 110us
Testing usleep(200)...
    Latency = 209us
Testing usleep(300)...
    Latency = 309us
Testing usleep(400)...
    Latency = 408us
Testing usleep(500)...
    Latency = 508us
Doing the special tests...
Testing usleep(250)...
    Latency = 260us
Testing usleep(280)...
    Latency = 289us
Testing usleep(290)...
    Latency = 299us
Testing usleep(299)...
    Latency = 303us
Testing usleep(301)...
    Latency = 305us
Testing usleep(310)...
    Latency = 319us
Testing usleep(320)...
    Latency = 329us
Testing usleep(600)...
    Latency = 613us
Application finished
Application started
Calculating baseline delay...(0us)
Testing usleep(600)...
    Latency = 608us
Testing usleep(610)...
    Latency = 618us
Testing usleep(650)...
    Latency = 658us
Testing usleep(700)...
    Latency = 708us
Testing usleep(800)...
    Latency = 809us
Testing usleep(1000)...
    Latency = 1007us
Testing usleep(1355)...
    Latency = 1411us
Application finished 

@Nekotekina (Member)

I added an "alert" parameter to the wait_for function. The timer should only be used if "alert" is false; otherwise it'll significantly increase synchronization latency.
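
In other words, the timerfd path is only safe for non-alertable waits, since a thread blocked in read() on the timer cannot be woken early. A rough sketch of the guard (the function shape is an assumption for illustration, not the actual wait_for signature):

#include <cstdint>

// Illustration only: decide whether a wait may take the Linux timer path.
bool may_use_linux_timer(std::uint64_t usec, bool alert)
{
    // Alertable waits must stay on the condition variable so they can be
    // signalled immediately; the timer is reserved for short, non-alertable sleeps.
    return !alert && usec > 0 && usec <= 1000;
}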

@plappermaul (Contributor, Author)

Hopefully I rebased the pull request correctly to fit Nekotekina's additions.

@elad335 (Contributor) left a comment:

You rebased the wrong way around; you should rebase rpcs3/master onto master.

@plappermaul (Contributor, Author)

Urgh.

@plappermaul (Contributor, Author)

Better now?

@MSuih (Member) commented Oct 8, 2019

Nope, the commit list still has 20 commits instead of 2

@kd-11 (Contributor) commented Oct 8, 2019

You may want a clean start here.

  1. Save current work with git checkout -b newbranch
  2. Reset master to upstream git reset --hard upstream/master
  3. Pick your 2 commits git cherry-pick newbranch~1 && git cherry-pick newbranch
  4. Delete newbranch git branch -D newbranch

@Megamouse (Contributor)

I think it's three commits though?

@kd-11 (Contributor) commented Oct 8, 2019

Forgot to mention: you have to check out master between steps 1 and 2 to avoid losing your work.

@Megamouse (Contributor)

I properly rebased it on my plappermaul branch if you need help

@plappermaul (Contributor, Author)

@kd-11 Thanks, I will try your instructions.

v1: Initial version
v2: implement review comments
v3: adapt to new API

@plappermaul (Contributor, Author)

got it somehow ...

@@ -118,6 +118,11 @@ class thread_base
using native_entry = void*(*)(void* arg);
#endif

#ifdef __linux__
// Linux thread timer
int m_timer;
Review comment (Member):

Initialize as -1

@plappermaul (Contributor, Author)

Applied.

m_timer = timerfd_create(CLOCK_MONOTONIC, 0);
if (m_timer != -1)
{
LOG_SUCCESS(GENERAL, "allocated high precision Linux timer");
Review comment (Member):

Remove log on success

@plappermaul (Contributor, Author)

Applied

Implement Nekotekina's requests.
@Nekotekina merged commit 925f2ce into RPCS3:master on Oct 9, 2019
@Nekotekina (Member) left a comment:

Thanks

kd-11 pushed a commit to kd-11/rpcs3 that referenced this pull request Nov 2, 2019
* Use Linux timers for sleeps up to 1ms (v3)