Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make game deterministic, so loaded saves always behave the same #1248

Open
jmealo opened this issue Sep 17, 2017 · 11 comments
Open

Make game deterministic, so loaded saves always behave the same #1248

jmealo opened this issue Sep 17, 2017 · 11 comments

Comments

@jmealo
Copy link

jmealo commented Sep 17, 2017

Hello!

Long time fan looking to see if I can help squash some of the remaining bugs.

Atlassian support uses an internal tool called Hercules to scan log files attached to issues to provide end users with solutions for solved issues and/or help track duplicate issues.

I thought maybe we could get some code quality metrics and track the most problematic code structures by checking the commit history and bug traces by creating a harness that can run save files at a given revision in a given environment (this doesn't seem like it'd be difficult given the availability of binary releases). If we can serialize a "logical input stream" (stream of game input, state mutations normalized between platforms) and use it with a git bisect type tool for isolating the bad state changes and/or logical input that triggers a bug. If you can step through a replay frame by frame and mark when the bug occurs you should be able to do a diff of the in game state or step through with a debugger to find the error.

Goals:

  1. Eliminate time spent writing/reading duplicate issues by gamifying/expediting the process
  2. Allow developers to load up the exact location of a bug with an automatic break point loaded from stack traces.
  3. Allow people to report game breaking bugs by playing with a debug build that allows time traveling debugging
  4. Gamify the last of the stubborn bugs by providing a leader board/achievement system awarding players and developers for fixing log standing bugs (with cascading points for fixing bugs that close/address multiple issues). (Existing contributions to date should be on a separate "All time contributors" board to encourage/incentivize lurkers who have not contributed yet cough cough)
  5. (Optional) Create a campaign/contact the community to create a prize pot/bug bounty program to award developers for fixing the bugs.

Inspiration:
There's a conference talk (can't find link) on YouTube outlining how they cleaned up Witcher 3's undrawable geometry using some cobbled together internal tooling to ensure correctness. It reminds us that sometimes it's worth it to spend time on tooling when a task is large enough, especially if that time/effort is expertly applied.

Can I help?
I've used Lua/LuaJIT professionally and tinkered on the ESP8266 but I'm no expert, I simply memorized the differences between Lua and JavaScript and learned about FFI and LuaJIT's performance characteristics versus V8. I haven't written over 1,000 lines of Lua in a single project, however, I'm pretty sure I can help setup/spec an end to end solution for all of the goals above MINUS interfacing with the game/tooling, which will be best done (or specced) by someone familiar with the project.

Is this a "bad idea", unnecessary or otherwise impossible?
Please feel free to "nope" this without explanation. I have no idea what tooling you guys have used to get this far, but, kudos to you for all of your hard work!

Please no dying in the corridors.

@TheCycoONE
Copy link
Member

That is a sequence of good ideas; and a number of the prerequisites are also needed to get networking going.

Would you be interested in refactoring input into a command pattern so that it could be recorded and replayed as well as outlining the worst of the determinism issues. At present save games do not lead to deterministic replays even with no further input - I haven't spent much time looking into why.

@jmealo
Copy link
Author

jmealo commented Sep 18, 2017

@TheCycoONE I'm getting over a head cold and up against some deadlines for work. Maybe we can Skype/Hangout/Zoom to go over the tasks for this and find a good fit.

I'm fine trying to serialize the input and/or help make it deterministic. I have OSX/Windows/Linux environments available but have not bootstrapped a CorsixTH development environment yet (docs seem good, just didn't have the time). My preferred method would be if we could get machine images I can boot on a spare hyper visor I have (gobs of ram and I think 32C/64T) so that I can run the tests in parallel and reset the environments easily. In an ideal world anyone could grab those and run them in VirtualBox or VMWare Player at home.

If we setup a Pritunl VPN we could have the VMs on the same "LAN" which will come in handy for working on the network code.

I have used GDB on Solaris in 2012 but I'm not particularly married to a C/Lua debugger. If it's easier to debug on Windows with Visual Studio I'm down with that. I'd like to use the environment "most developers" are using.

If it's worth the time, I can help setup the VPN and provide remote access to pristine VMs and we can snapshot them (on actual hardware). If I can get it to PXE boot I can give some core members VPN access but the firewall will be clamped down for outgoing connections. As long as it's under $20/mo in power I don't really mind the noise of running a server under my bed.

Random number generator seed?
We might need to seed the random number generator or perhaps make it return an incrementing number.

Where should I start reading?
While I'm thinking about it, could you point me to the relevant bits for input code and state management. Do we move the in-memory structure directly to the save game? (I'm guessing we have binary compatibility for reading Theme Hospital saves, not sure if we're trying to keep it for new saves, that could be an issue).

Append-only log
NATS is a text-based protocol and I've used it from Lua with LuaSocks. I could set up a NATS streaming server to record the event log to a spinning hard disk. (I have an 8TB one laying around). It reads at about 150MB/s+, so we should be able to replay very long gameplay sessions into RAM and run queries on it.

Data/Query ergonomics?
That being said, would you rather be writing Map/Reduce type jobs in Lua or JavaScript over a stream of events or would you rather write SQL and query them from PostgreSQL (a log table).

Unless we have a ton of custom lua code we'll need to interact with the data, I'd go the PostgreSQL route (especially with PL/Lua, that'll let you move some CorsixTH code into queries and triggers to normalize the stream).

Networking
NATS = gnatsd from Apcera
NAT = Network Address Translation

FWIW, NATS would actually be a good foundation for the networking code (simple text based protocol, written in Go, tiny, single binary). It could work on a LAN or over the internet. You could use a STUN server to poke through NAT). I'd argue it makes more sense to bundle a server binary and provide public servers and allow people to run a server on their LAN if they choose. Troubleshooting NAT issues can be a painIf there's something purpose built that might be better. If you wanted to scale a production server, I'd probably just use NGINX (it can do UDP and TCP streams now) and the Lua Streams module. You can easily use C-structs or Lua Tables for state and it'll be really easy to deploy (not as easy as a single go binary, but, it doesn't introduce a new language and easily lets you leverage lua). If not there's a bunch of nice Lua bindings over an event loop suitable for writing a network server.

Sorry, again, head cold, so there's a lot to process here, hopefully it makes sense.

@MarkL1961
Copy link
Contributor

@TheCycoONE

I have always assumed that was down to amount of randomness there is in the code that would not always be repeated when loading a save and could therefore send staff or patients into a different direction.

I could be wrong, this is just what I have assumed when trying to replicate some of the issues over the years

@jmealo
Copy link
Author

jmealo commented Sep 18, 2017

If we provide the same seed to a non cryptographic random number generator: If we call it in the same sequence we must get the same numbers. Assuming it's not timing dependent and we're managing input (repeating a sequence detached from actual user input) shouldn't we be able to achieve determism?

From an outsider's perspective being able to achieve (determinism) could possibly resolve a number of bugs in itself by eliminating side effects.

@TheCycoONE
Copy link
Member

The idea was that the game would be deterministic. The random number generator is MT, and the current seed is suppose to be saved in the save game and reloaded (I haven't personally tested that). That said, the game currently makes heavy use of floating point; some actions are based on asynchronous events (music track / sound effect ending), and some things are triggered by frames instead of in game ticks. I believe both of those things currently impact the same random number generator, if not other aspects of the game state. There is also the possibility that lua's pair ordering might impact the game, since the order of objects in a lua table is not guaranteed.

It's certainly something that needs to be addressed with more rigor than has been done until this point.

@mugmuggy
Copy link
Contributor

@TheCycoONE I just pulled out an example I could find of where the lua pair ordering affects random state.

local areas = self.research_policy
for _, info in pairs(areas) do
-- Don't touch the value "global".
if type(info) == "table" then
-- Some categories may be finished
if info.current then
-- Add new points to this category's current focus.
local research_info = self.research_progress[info.current]
local stored = research_info.points
-- Add just a little randomness
research_info.points = stored + math.n_random(1, 0.2) * points * info.frac / 100

Between a save and reload the research_policy can result in reordering when processing over a pairs loop and math.n_random calls math.random twice.

You'd want something like this to help make it deterministic.

local policy = {"cure", "diagnosis", "drugs", "improvements", "specialisation"}
for _, research_category in ipairs(policy) do
sum = sum + self.research_policy[research_category].frac

There is probably some effort in ensuring the deterministic nature

@mugmuggy
Copy link
Contributor

mugmuggy commented Nov 9, 2018

Found probably one of the main culprits of ruining the deterministic nature also.

local dx = self.screen_offset_x +
math.floor((0.5 - math.random()) * self.shake_screen_intensity * shake_screen_max_movement * 2)
local dy = self.screen_offset_y +
math.floor((0.5 - math.random()) * self.shake_screen_intensity * shake_screen_max_movement * 2)

Every GameUI:draw call, two calls to math.random - even whilst paused and no active quake.

@TheCycoONE
Copy link
Member

That one is my fault, I think we need two random number generators - one tied to the game, where we save the seed, and one for visuals or other non-behavior cases. Also those calls should be gated :)

@mugmuggy
Copy link
Contributor

I put some logic in to check shake_screen_intensity, with a hospital it did have a research dept actually researching so that surpised me a little - but only specialisation/improvements were running - so a bit of luck there.

Ran for 3 months logging every math.random call. The level was completed so the offers were coming every 3 months, when the current 3 month period ended, and game was paused with offer letter I took a save. Then reloaded that twice in 2 different sessions and just resumed it until the next 3 months. all 40k plus calls matched - ie returned value, source file called from, line of the calling function and function name if possible. Obviously I didn't interact with the game (music was disabled), but announcements/sounds were set to play and obviously no quake.

So yes isolating those math randoms for ui/visuals/music randomisation etc, should assist in tracking issues between 2 autosaves as long as the game has run with no user input for the period between 2 autosaves and an issue arose.

@mugmuggy
Copy link
Contributor

I made a simple change to test the separation out

int luaopen_random(lua_State *L)
{
    lua_getglobal(L, "math");
    lua_pushliteral(L, "lua_random");
    lua_pushliteral(L, "random");
    lua_gettable(L, -3);
    lua_settable(L, -3);
    lua_pushliteral(L, "random");
    lua_pushcfunction(L, l_random);
    lua_settable(L, -3);
    lua_pushliteral(L, "lua_randomseed");
    lua_pushliteral(L, "randomseed");
    lua_gettable(L, -3);
    lua_settable(L, -3);
    lua_pushliteral(L, "randomseed");
    lua_pushcfunction(L, l_randomseed);
    lua_settable(L, -3);
    lua_pushliteral(L, "randomdump");
    lua_pushcfunction(L, l_randomdump);
    lua_settable(L, -3);
    return 0;
}

Then just changed app.lua to seed it with math.lua_randomseed and changed GameUI:draw to use math.lua_random instead and I got some repeatability occuring. I also picked up on some more diversion.

There is this in World:onTick

local new_game_date = self.game_date:plusHours(self.hours_per_tick)
-- End of day/month/year
if self.game_date:dayOfMonth() ~= new_game_date:dayOfMonth() then
for _, hospital in ipairs(self.hospitals) do
hospital:onEndDay()
end
self:onEndDay()

Which calculates the after 'ticks' date, and if the world we be in a future day (month and year processing also) does related processing. Hospital:onEndDay at least has some calls to random, one not really deterministic, (ui message) the other might be a little more (boiler breakdown) and subsequent function calls refer to random.

This means the speed of the game changes the outcome of the random processing. My earlier comment, the game was let run with no inteference at "And then some more", but a period of processing at "And then some more" and "Speed up" at the end of a day will result in changes to the order of the random calculations.

Likewise the other one is onEndMonth calculating the future spawn dates. And individual spawn hours are affected as the date is advanced with each onEndDay call.

self.game_date = new_game_date
for i = 1, self.hours_per_tick do

At the faster speeds, the 'hours' on occasion can skip a spawn as the date will be advanced 3 or 4 hours, if onEndDay is say run at the 49th hour (actual end of day), at speed "And then some more", the current hour leading into the above loop will be 2, and if onEndDay calcuates one at 1st (currently there is a bug around the spawn hour too, 0 to 49 are valid spawn hours, but Corsix currently uses a 1 to 50 range) the spawn will not occur. So running at the lower speeds is really the only way to ensure repeatability with the current build.

I'm guessing it was done to simplify some things, but means all onEndDay/Month/Year processsing occurs before all the ticks for those entities that day have completed.

@mugmuggy
Copy link
Contributor

And there is some other code dependent on the hours_per_tick in World:tickEarthquake along with random calls, and since a high game speed calls random in a different order, it alters the outcome. Outcomes like machine damage, might be added before the ticks of other entities would process when run at the lower speeds.

@TheCycoONE TheCycoONE changed the title Incentivize developers and players to squash remaining bugs for fun and profit Make game deterministic, so loaded saves always behave the same Mar 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants