Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log not buffering multiple lines due to AutoFlush #19029

Closed
anvilvapre opened this issue Jan 10, 2021 · 13 comments · Fixed by #19264
Closed

Log not buffering multiple lines due to AutoFlush #19029

anvilvapre opened this issue Jan 10, 2021 · 13 comments · Fixed by #19264

Comments

@anvilvapre
Copy link
Contributor

Version: bleed
Impact: high

Background:
The game main loop assumes the maximum fps configured is the minimum fps possible. On hosts that cannot keep up with the configured renderInterval or forcedRenders the game becomes slower and slower untill 120% CPU is used. See #18491.
One of the reason why the games becomes slower and slower is due to the increase of logging of slow events to debug and perf log channels.

Problem:
Log uses TextWriter to write text to a file. The AutoFlush causes data to be written per logged line. PerfTimer.LogLongTick logs events that take longer than 1ms. On slow systems - that can't keep up - the number of events grows. On OS level this results in continuous expensive calls to lseek and write each logic tick for each line log statement.

Example debug.log, same for perf.log. Last column in time in microseconds spent in system call.

write(6, "     1 ms [103612] Trait: Shroud"..., 33) = 33 <0.000087>
lseek(6, 16287402, SEEK_SET)            = 16287402 <0.000029>
write(6, "     8 ms [103612] Effect: Bulle"..., 34) = 34 <0.000044>
lseek(7, 2541312, SEEK_SET)             = 2541312 <0.000011>
write(7, "--Tick: 103613 (local)\n", 23) = 23 <0.000044>
lseek(6, 16287436, SEEK_SET)            = 16287436 <0.000043>
write(6, "     1 ms [103613] Activity: Fly"..., 33) = 33 <0.000081>
lseek(7, 2541335, SEEK_SET)             = 2541335 <0.000032>
write(7, "--Tick: 103614 (local)\n", 23) = 23 <0.000068>
lseek(7, 2541358, SEEK_SET)             = 2541358 <0.000037>
write(7, "--Tick: 103615 (net)\n", 21)  = 21 <0.000080>
lseek(7, 2541379, SEEK_SET)             = 2541379 <0.000027>
write(7, "ttnk #12039 killed by e3 #11941\n", 32) = 32 <0.000079>
lseek(7, 2541411, SEEK_SET)             = 2541411 <0.000040>
write(7, "--Tick: 103616 (local)\n", 23) = 23 <0.000069>
lseek(6, 16287469, SEEK_SET)            = 16287469 <0.000040>
write(6, "     1 ms [103616] Trait: LuaScr"..., 36) = 36 <0.000044>
lseek(7, 2541434, SEEK_SET)             = 2541434 <0.000066>
write(7, "--Tick: 103617 (local)\n", 23) = 23 <0.000039>
lseek(7, 2541457, SEEK_SET)             = 2541457 <0.000067>
write(7, "--Tick: 103618 (net)\n", 21)  = 21 <0.000097>

Bug:

  • Log.Write does not buffer multiple lines. Resulting in a 'storm' of system write and lseek calls on slow systems.

Possible solution:

  • Disable AutoFlush
  • Set log buffer size during construction of log text writer
  • Flush - only modified - logs each x logic ticks from the game main loop. Add Log.tick.
  • Flush only if at least X bytes of characters are logged. To avoid possible forced flush system call.
  • Investigate if TextWriter should not be some other stream. Documentation mentions TextWriter is specifically used to log to the console.
  • Add LogChannel.Enabled to alllow if(Log.Enabled("perf)) Log.Write("perf", "{0}", expensiveCalculationSlowTime);
  • Disable certain log channels in release build.
  • Change PerfTimer.LogLongTick in release builds to a much higher value i.e. 10ms.
  • Add the default log channels also as Log class field members. Allowing Log.Write(Log.Perf, "line"); Avoiding a call to Channel each time.
  • Modify Log to (optionally) prefix each line always with a logic, render and net frame.
  • Change debug.log log format. Prefix log lines with logic tick number. Remove logging lines with --Tick: each tick. Causing the tick only to be logged when relevant.

Example perf.log output:

     3 ms [21604] Trait: Shroud
     3 ms [21604] Trait: Shroud
     1 ms [21604] Trait: Mobile
     2 ms [21604] Trait: WithDecoration
     4 ms [21604] Effect: Bullet
     1 ms [21605] Activity: MoveFirstHalf
     2 ms [21605] Activity: Move
     3 ms [21605] Activity: AttackMoveActivity
     3 ms [21605] Activity: Hunt
     1 ms [21605] Trait: ActorMap
     2 ms [21605] Trait: LuaScript
     3 ms [21605] Trait: Shroud
     3 ms [21605] Trait: Shroud
     3 ms [21605] Trait: Shroud
     1 ms [21606] Activity: MoveFirstHalf
     2 ms [21606] Activity: Move
     2 ms [21606] Activity: AttackMoveActivity
     2 ms [21606] Activity: Hunt
     1 ms [21606] Activity: MoveFirstHalf
     1 ms [21606] Activity: Move
     2 ms [21606] Activity: AttackMoveActivity
     1 ms [21606] Activity: Fly
@anvilvapre anvilvapre added the Bug label Jan 10, 2021
@teinarss
Copy link
Contributor

@pchote
Copy link
Member

pchote commented Jan 10, 2021

IIRC the current setup was motivated by making sure that we didn't lose important log messages when the runtime crashes without first flushing the buffers. This was important for exceptions, and everything else just came along for the ride.

@teinarss
Copy link
Contributor

Something like this would be best for performance teinarss@29d6d58

@anvilvapre
Copy link
Contributor Author

Illustration how 28 writes per second are done to debug.log. 22 for perf. 50 OS calls times 2, one for seek, one for write.

18:20:30 loadavg: 1.22 1.07 1.16 1/1282 41596

TID    COMM             READS  WRITES R_Kb    W_Kb    T FILE
33757  mono             0      22     0       0       R perf.log
33784  threaded-ml      55     38     0       0       O [eventfd]
33757  mono             0      28     0       0       R debug.log

I think the simple solution of buffering each 50 logic ticks - or up to 4096k - would suffice together with catching top level exceptions that flush logs.

Don't use too much generics in critical code. Also i think having a log flush thread is overkill. It only results in using locks and therefore locking.

Logging debug.log info per frame on a single line also helps.

Logging all logs to a single stream may also help. You could use syslog but likely difficult to support on all platforms.

@Leonardo-Ferreira
Copy link

I believe you have the wrong interpretation for the facts at hand. The logging issue is in fact na issue because it blocks a thread that was supposed to be doing some more important work. Dispatching the log writing to another thread (an WriteLineAsync approach for example) would solve the issue at hand in a more elegant and performatic way in my opinion.

@anvilvapre
Copy link
Contributor Author

Thanks. I think I do understand it frees up the main thread. And i think a logging thread would be useful if the game would be logging many megabytes of data each second. The solution may be overkill for the problem at hand.

Using a thread and locking would also introduce similar system calls, use resources, introduces code complexity.

We are talking here about a few and often short text lines. Using default buffering - as you would use when writing to any text file - rather than interactive console - would perhaps suffice. The TextWriter that is used already does this buffering by default unless you set Autoflush which makes it flush after each newline.

It is also not about the amount of data, just that it generates quite a lot of "expensive" system calls for such a small amount of text.

@dnqbob
Copy link
Contributor

dnqbob commented Mar 20, 2021

I think something in log must change.

I know OpenRA default mods won't have too big problem on logs because AI/player won't build big army, but for user who use this engine on developing, this problem cannot be ignore.

On 400 actors, this bug on performance is critical: it absorbs 10-15% of the FPS, which is painful for light weight game/mod made by OpenRA for universal computer. I think we should solve this by either muti-thread (log thread doesn't block other threads)or player options in Debug Options Menu(default on) for one of the last method for better performance.

@teinarss
Copy link
Contributor

@dnqbob i will fill a PR with my previous idea, shortly.

@dnqbob
Copy link
Contributor

dnqbob commented Mar 20, 2021

@dnqbob i will fill a PR with my previous idea, shortly.

And I will take this PR to rv-engine and ta-engine, if upstream doesn't merge it -- it is too painful on my computer to playtest those mods on a 2v2 game, which I think is a bottom line of the performance requirement.

@anvilvapre
Copy link
Contributor Author

please consider to have the option to always log the different tick times as first log columns. makes logs more unified and easier to process with external tooling.

invite modders to not log each tick only the tick time.

@reaperrr
Copy link
Contributor

I locally increased the LongTickThresholdMs to 5 and watched a 4-AI bot skirmish.

perf.log looked like this:

    16 ms [1] Trait: ModularBot
    10 ms [2] Trait: Shroud
     6 ms [328] Trait: ModularBot
    48 ms [5537] Trait: ModularBot
     8 ms [5908] Trait: ModularBot
    11 ms [6958] Trait: ModularBot
     5 ms [10558] Trait: ModularBot
     7 ms [12761] Trait: ModularBot
     5 ms [14108] Trait: ModularBot
     7 ms [14616] Trait: ModularBot
     5 ms [14666] Trait: ModularBot
     6 ms [15816] Trait: ModularBot
     7 ms [15958] Trait: ModularBot
     6 ms [16008] Trait: ModularBot
     6 ms [17766] Trait: ModularBot
     6 ms [18466] Trait: ModularBot
    12 ms [18516] Trait: ModularBot
     5 ms [18558] Trait: ModularBot
     6 ms [18808] Trait: ModularBot
     6 ms [21158] Trait: ModularBot
     7 ms [21208] Trait: ModularBot
     5 ms [22308] Trait: ModularBot
    12 ms [22458] Trait: ModularBot
     8 ms [22966] Trait: ModularBot
    16 ms [23158] Trait: ModularBot
    12 ms [23208] Trait: ModularBot
     6 ms [24808] Trait: ModularBot
     5 ms [25108] Trait: ModularBot
     5 ms [25313] Trait: ModularBot
     6 ms [25408] Trait: ModularBot
     9 ms [25558] Trait: ModularBot
    13 ms [25858] Trait: ModularBot
    12 ms [26008] Trait: ModularBot
     6 ms [26058] Trait: ModularBot
     5 ms [26208] Trait: ModularBot
     6 ms [26308] Trait: ModularBot
     9 ms [26758] Trait: ModularBot
     7 ms [27208] Trait: ModularBot
     5 ms [27258] Trait: ModularBot
     6 ms [27358] Trait: ModularBot
     6 ms [28108] Trait: ModularBot
     6 ms [28808] Trait: ModularBot
    10 ms [29008] Trait: ModularBot
     7 ms [29058] Trait: ModularBot
     6 ms [29313] Trait: ModularBot
    11 ms [29466] Trait: ModularBot
     8 ms [29763] Trait: ModularBot
     8 ms [30366] Trait: ModularBot
     6 ms [31113] Trait: ModularBot
    11 ms [31408] Trait: ModularBot
     5 ms [31463] Trait: ModularBot
     9 ms [32008] Trait: ModularBot
     6 ms [32213] Trait: ModularBot
     7 ms [32313] Trait: ModularBot
     7 ms [32513] Trait: ModularBot
     6 ms [32608] Trait: ModularBot
     5 ms [33358] Trait: ModularBot
     7 ms [34313] Trait: ModularBot
     7 ms [34413] Trait: ModularBot
    14 ms [34463] Trait: ModularBot
     5 ms [34558] Trait: ModularBot
     7 ms [34913] Trait: ModularBot
     5 ms [35008] Trait: ModularBot
     8 ms [35308] Trait: ModularBot
    13 ms [35463] Trait: ModularBot
     5 ms [35758] Trait: ModularBot
     5 ms [35913] Trait: ModularBot
     7 ms [36163] Trait: ModularBot
     7 ms [36213] Trait: ModularBot
    14 ms [36463] Trait: ModularBot
     5 ms [36608] Trait: ModularBot
    14 ms [36658] Trait: ModularBot
     9 ms [36666] Trait: ModularBot
     7 ms [36813] Trait: ModularBot
     5 ms [36908] Trait: ModularBot
     6 ms [37058] Trait: ModularBot
    11 ms [37108] Trait: ModularBot
     7 ms [37258] Trait: ModularBot
     6 ms [37408] Trait: ModularBot
     5 ms [37558] Trait: ModularBot
     6 ms [37608] Trait: ModularBot
    13 ms [38058] Trait: ModularBot
     6 ms [38158] Trait: ModularBot
     6 ms [38166] Trait: ModularBot
     5 ms [38608] Trait: ModularBot
     6 ms [38908] Trait: ModularBot
    11 ms [38913] Trait: ModularBot
     7 ms [38958] Trait: ModularBot
     5 ms [39258] Trait: ModularBot
     5 ms [40616] Trait: ModularBot
     5 ms [40708] Trait: ModularBot
     8 ms [41213] Trait: ModularBot
    13 ms [41563] Trait: ModularBot
     6 ms [41866] Trait: ModularBot
     6 ms [43266] Trait: ModularBot
     7 ms [43313] Trait: ModularBot
     7 ms [44463] Trait: ModularBot
    10 ms [44513] Trait: ModularBot
    11 ms [46163] Trait: ModularBot
     5 ms [46913] Trait: ModularBot
     7 ms [47413] Trait: ModularBot
     7 ms [48263] Trait: ModularBot
     5 ms [49013] Trait: ModularBot
     6 ms [49113] Trait: ModularBot
     9 ms [49463] Trait: ModularBot
     9 ms [50013] Trait: ModularBot
     9 ms [50163] Trait: ModularBot
     9 ms [50813] Trait: ModularBot

Of course I know it will look quite different on slower CPUs, that's not my point.
My point is more that - except for easy spotting of infinite loops if the looped action exceeds 1 ms per loop - perf.log rarely tells us devs anything we don't already know, and dragging down performance on slow CPUs just to maybe get lucky and spot an infinite loop this way seems quite excessive.

Setting the default threshold to 10 ms to help performance on slow systems (and setting it back to 1 ms in our local settings.yaml for us devs) seems like a good compromise to me, we should still be able to spot any new simulation perf regressions and infinite loops can be debugged in other ways, too.

Same goes for the --Tick logging in debug.log. The most common sources of crashes are exceptions (which usually tell us much more than debug.log), desyncs (which also show up as exceptions and are better debugged via replays) and infinite loops (which won't show up in debug.log at all).
In other words, all the --Tick logging seems excessive and unnecessary to me. We can still make it opt-in and enable it on our systems, but again, forcing that on for players when a 1-hour game already results in a 3.6 MB or larger debug.log (even Notepad++ can't open that anymore) seems quite overkill, too.

@anvilvapre
Copy link
Contributor Author

Using LoggerMessage or similar pattern may be beneficial in future.

https://docs.microsoft.com/en-us/dotnet/api/microsoft.extensions.logging.loggermessage?view=dotnet-plat-ext-5.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants