Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Another one MP desyncs debugging #9768

Open
seroperson opened this issue Apr 14, 2023 · 19 comments
Open

Another one MP desyncs debugging #9768

seroperson opened this issue Apr 14, 2023 · 19 comments

Comments

@seroperson
Copy link
Collaborator

Hello! I'm doing some step-by-step debugging right now and after fixing desyncs caused by #9767 there were desyncs caused by CvPlayer::m_iLastSliceMoved on CS AI players. I did some code digging and I don't see whether this variable used for something except of some local "stuck checking" (CvGame.cpp line 9854). It is getting set for each player with getTurnSlice() value and it differs for host and non-host players. I'm not sure if it is causing loading screens and so on, but at least state is desynced because CvPlayer::m_iLastSliceMoved on CS AI players are different for host and non-host players. Maybe we should remove this value from synchronization check at all?

Another interesting point is that this variable (according to code prior the very first commit) was synced in original dll too. I'm not tested it but I think it was so. If it is so, then such desync is caused by our new (?) m_iTurnSlice incrementation logic for host and non-host players (I tested and it really updates differently) and it can affect some other variables too. In this case it is better to find the change which caused m_iTurnSlice desynced.

Anyway I got attempt on just removing SYNC_ARCHIVE_VAR and visitor on CvPlayer::m_iLastSliceMoved and looks like there were no "Variable Out Of Sync" messages for the first 30 turns but I'm not sure if it is corrent way of removing value from synchronization check. I will do bigger test later in days.

@seroperson seroperson changed the title MP desync debugging MP CvPlayer::m_iLastSliceMoved desync debugging Apr 14, 2023
@ilteroi
Copy link
Collaborator

ilteroi commented Apr 14, 2023

there is some helpful info here: #8345

@seroperson
Copy link
Collaborator Author

seroperson commented Apr 23, 2023

So, after days of debugging, I did some progress on desyncs investigation.
Still not filled as PRs:

  • Deleted sync checks on m_iLastSliceMoved and m_aistrInstantYield values in CvPlayer. I didn't found any necessary to check if they are synced because they are used just only locally. Nothing changed in general but log is a little bit cleaner now.
  • Changed a SYNC_ARCHIVE_VAR a little bit to restore variable name printing in log again.
  • Added a lot of logs in CvDllNetMessageHandler.cpp to make it easy to track which changes are syncing between players.
  • Found but still not tested the possible desync: if you are moving great works with "optimize" button, looks like they are not moving for other players. I'm still not sure about it but looks like it so. It causes a lot of culture desyncs, culture desyncs causes some yields desyncs and so on and so on.

Filled as PRs:

The way of how I'm testing it right now (in case if you are testing solo) (maybe it will be useful for someone in future):

  • Setup some virtual machine.
  • Install stream, buy another one civilization 5 on your second steam account.
  • Setup modpack and your custom dll.
  • Enable logs and so on.
  • Start both civs and join to each other. You can start MP session from scratch or use savegame you played with your friends before.
  • Do something to trigger possible desync.
  • If there is some desync, you will notice "Variable Out Of Sync" message in net_message_debug.log, so you can check it right during the game.
  • After you found some desync in net_message_debug.log, it's time to compare logs. I'm just copying both Logs/ directories and comparing files with each other. Sometimes it helps, sometimes it not. If not, add more logs and try again.
  • We need to find initial changes which happened for some player and were not reflected on others.
  • Like example, here I'm comparing CultureAI_Log.csv from my friend (he is playing Tehuelche civ and his logs are on right side) with my version. Here we see he moved some great works to achieve theme boost but these changes are not reflected on my side (left). That's why yields got desynced for the next few turns until (looks like) loading screen.
    image

In case if you are not going to buy second civ5 and so on, you can just play with your friends until some crash or heavy desync and then collect logs from all of them and compare everything in the same way. After #9767 it plays ok at least until midgame.

So, I'm continuing on desync debugging, looking into some potential possible causes right now. Just posting here for the history and possible feedback.

@seroperson seroperson changed the title MP CvPlayer::m_iLastSliceMoved desync debugging Another one MP desyncs debugging Apr 23, 2023
@seroperson
Copy link
Collaborator Author

seroperson commented Apr 25, 2023

After some debugging I noticed another one rootcause for desyncs: std::sort usage. Looks like it can work differently time for time (link):

Sorts the elements in the range [first, last) in non-descending order. The order of equal elements is not guaranteed to be preserved.

i.e. it's confirmed while debugging desync with automated workers (elements with 'Evaluating' are sorted with std::sort by score (last value)) (on the picture you can notice different results for both players; the first one is going to handle 42:25 plot while another is going to handle 42:23 which has the same score):
image

I see there are a lot of std::sort usages across the whole codebase. It will be hard to inspect every single usage individually and I think many of them could cause desyncs. Can we replace everything which is not primitives with std::stable_sort?

@ilteroi @RecursiveVision

@ilteroi
Copy link
Collaborator

ilteroi commented Apr 25, 2023

yes that seems like a very good idea

@RecursiveVision
Copy link
Collaborator

Agreed.

@seroperson
Copy link
Collaborator Author

I'm not sure what can cause such behavior but if we just replace all the sort with stable_sort then it starting crashing randomly at turns 0-1-2 and sometimes when exiting to main menu. Any thoughts on possible reasons? Dump says it crashed somewhere in .exe with 'illegal operation', logs does not contain anything.
Maybe it could be some specific cpp behavior? I read that std::sort/std::stable_sort can crash in some conditions but I'm not a cpp expert so I don't know such things.

@ilteroi
Copy link
Collaborator

ilteroi commented Apr 27, 2023

can you create a PR for your changes then i'll take a look

@seroperson
Copy link
Collaborator Author

@ilteroi I could but there were just std::sort -> std::stable_sort replacements (something like this (link) commit but there were really just raw replacements, so they were the only reasons to crash). I'm still not sure what causing such behavior but right now I'm testing such commit (link) where I changed comparators so we still can use std::sort (but without unstable behavior because there are no equal elements) and seems like there are no such crashes but it looks so ugly ...

@ilteroi
Copy link
Collaborator

ilteroi commented Apr 29, 2023

ok so i did a search and replace, now it's stable_sort everywhere and it's running fine. no idea what the problem is on your end.

@LoneGazebo
Copy link
Owner

Strong work! For the great work desync, you might take a look at the network code for great work swapping, as we had to hijack it partially to get events to work in MP. may have unintentionally broken something else, as is tradition.

@ilteroi
Copy link
Collaborator

ilteroi commented Apr 29, 2023

that part should already be fixed: f0832f0

@LoneGazebo
Copy link
Owner

Oh nice

@LoneGazebo
Copy link
Owner

I’ll buy everyone drinks if MP desyncs actually stop.

@seroperson
Copy link
Collaborator Author

seroperson commented Apr 29, 2023

ok so i did a search and replace, now it's stable_sort everywhere and it's running fine. no idea what the problem is on your end.

@ilteroi Can you post a dll? Just tested another one dll where I again replaced all the sorts with stable_sort and it crashes again in 1x1 MP session. Caught crash when exiting to the main menu and after loading the save it is immideately crashing at next turn. Are you checking MP or SP?

edit: It defenitely caused by sort -> stable_sort replacement because we loaded the save with dll without replacement and crash gone away

edit2: Actually we loaded into 4 players MP session with dll with sort everywhere and special comparatos (which I posted above, link) and still we are crashing randomly at early turns. Idk maybe I'm doing some trivial thing wrong and actually sort changes are not the reason. We will look into possible solutions.

@JohnsterID
Copy link
Contributor

ilteroi code is on the oleole branch: 73a7cbb

@seroperson
Copy link
Collaborator Author

Well, it is definitely crashing and it is definitely stable_sort causes such crashes. We tested maybe 5-6 dlls in 4player MP session and it is stable on master, stable on master + some my fixes, but not stable if there are some new stable_sort usages. I think some memory issues or something like that can cause such behavior because in 3player session everything seems ok and right just when we are loading into 4players -> boom 2/4 players crashed at turn 1. i.e. such behavior is reproducable when adding to dll just single stable_sort in CvBuilderTaskingAI.cpp (actually maybe it's the only reason because all dlls which we tested, has this line replaced).

@seroperson
Copy link
Collaborator Author

seroperson commented May 9, 2023

@ilteroi I see you merged stable_sort everywhere but we still see crashes with such changes in 3-4 players MP session. I'm not sure but I think this commit can introduce some new instability. We will test the new version, maybe these crashes were caused by some my own build trouble or something like that.

@ilteroi
Copy link
Collaborator

ilteroi commented May 10, 2023

i think this stable_sort business is a red herring; while regular sort does not guarantee the order of equal elements the result is deterministic. we run the same code on both ends, so it should not matter. (side note: unless we are sorting raw pointers, but we don't do that and they would typically be unequal)

but there is a comment here which sounded quite positive: https://forums.civfanatics.com/threads/new-version-3-4-1-may-9-2023.683656/page-2#post-16454134 -- so something seems to have improved with your other changes!

@Cyclopia31
Copy link

Cyclopia31 commented May 21, 2023

Hi, I did a 200 turns game with 2 players and 4 AI. Also I found a Desync at Turn 0 using the Inca, If it can help here the link of the issue:
#9864

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants